SAN Failure

Even though a high level of resilience is built into SANs, they are still quite prone to failure. When a SAN failure hit a Plan B customer, causing multiple disks to fail on the morning of 26 February, CRM support was the first to notice that the CRM system had gone down. Although the system had been running healthily at the start of the day, the company lost access just after 10am.

The customer in question delivers workforce management software, helping over 700 public and private sector customers worldwide to manage large, multi-skilled workforces. Without their CRM system, the sales and customer service teams cannot access the customer details they need to maintain these business-critical functions. In addition, the customer portal requires access to the CRM system in order for customers to log support queries. Soon after the failure, calls started coming in from customers requesting support over the phone and asking why the normal portal access wasn’t working. An outage like this quickly puts pressure on the resources of the business if it is not managed and rectified promptly.

Dealing with the disaster

Our customer had carried out a full disaster recovery exercise the previous August and therefore felt comfortable with their disaster recovery process. This helped the senior management team handle the situation without stress, which in turn allowed them to recover their IT systems efficiently, in a calm and controlled manner.

At 10:18 the Director of Business Systems and IT called Plan B Disaster Recovery to discuss recovery of their CRM system. Plan B offers a fully managed DR service which guarantees to recover the business’s IT systems to a point where users are working again, business as usual, within twice the time it takes to boot their IT systems. We do this by fully recovering their IT systems every 24 hours and then booting them up to run a series of tests, to application level, to certify that the recovery systems are working. Because this is done every 24 hours in advance of a failure, the recovery can be guaranteed, and the process is simple and fast, requiring just the click of a button to recover a customer’s IT systems.

During the incident, our client passed the security process, which authorises Plan B to start recovery, instantly. It was agreed that the best course of action was to invoke the Plan B service. Our client would work from a standby copy of their CRM system, which had been built in anticipation on Plan B’s virtual platform the previous night and was instantly available. They would work from this platform until their SAN issue was resolved. At 10:26 the recovery was invoked, and at 10:38 their CRM system was fully booted and working. Our client’s CRM system was therefore fully recovered 20 minutes after they first called in, and 12 minutes after the invocation started. Running of this system was then handed over to their IT team.
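The timings in the account above can be checked with a few lines of Python. This is only an illustrative sketch: the three timestamps come from the case study, while the function names and the `meets_guarantee` helper are hypothetical. The helper expresses the "within twice the boot time" guarantee as a simple comparison, and since the actual boot time of the CRM system is not stated here, it is left as a parameter.

```python
from datetime import datetime

TIME_FORMAT = "%H:%M"

def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two same-day HH:MM timestamps."""
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return int(delta.total_seconds() // 60)

def meets_guarantee(boot_minutes: float, recovery_minutes: float) -> bool:
    """Hypothetical check: was recovery within twice the boot time?"""
    return recovery_minutes <= 2 * boot_minutes

# Timestamps taken from the incident timeline.
CALL_RECEIVED = "10:18"
RECOVERY_INVOKED = "10:26"
CRM_BOOTED = "10:38"

minutes_from_call = minutes_between(CALL_RECEIVED, CRM_BOOTED)
minutes_from_invocation = minutes_between(RECOVERY_INVOKED, CRM_BOOTED)

print(f"Recovered {minutes_from_call} minutes after the first call")
print(f"Recovered {minutes_from_invocation} minutes after invocation")
```

Given a measured boot time, `meets_guarantee(boot_minutes, minutes_from_invocation)` would indicate whether this recovery fell inside the guaranteed window.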

At 11:38, when we spoke to our main contact, connectivity to the CRM server was confirmed and everything was reported to be working well.

Our client ran their CRM system from the recovery platform until 9 March (11 days), by which time they had resolved the issues with their SAN and were ready to migrate back to their live system.

The business impact

Having been through IT disasters previously, our contacts were aware of what could go wrong during recoveries, and of the length of time it can take to return to ‘business as usual’. They appreciate that, when a DR service is not tested regularly, there are often unknown services and applications on servers that the DR process has not planned for, resulting in a lot of extra time, resource and stress at the point of recovery.

Even though their DR exercise in August went smoothly, and recoveries are carried out and tested to application level every 24 hours at Plan B, our client was still surprised by just how quickly and easily their CRM system was recovered.

If the Plan B DR service had not been in place, recovery of the CRM system would have taken a lot longer, and it is highly probable that the business would have felt the impact. Customers would have noticed the interruption to customer portal access and CRM systems, potentially causing damage to brand reputation. Loss of customers, and therefore revenue, is always a concern when the processes and functions of a business are constrained.

As it was, only the senior managers in the sales, customer service and IT departments were involved, and due to the fast recovery times there was:

  1. No financial impact
  2. No customer communication or involvement required
  3. No board level reporting or involvement required

At a review following the incident, it was decided that no changes to the DR process could improve the outcome. In fact, the customer was so impressed that they are considering whether to protect non-critical servers with Plan B’s managed disaster recovery service.

Discover how badly a SAN failure can affect your business

Read about SSP’s SAN failure to discover what can happen if you are not protected by Plan B.

Our CRM system is business critical and fast recovery was critical to operations. Even though we have practised our DR plan recently, recovery was faster and much less stressful than expected. It not only worked, but it was so simple that only a few senior managers were even aware of the outage. It demonstrates how robust our critical servers are and how safe our business is in the hands of Plan B.

The secret to their success is in the fact they have already carried out a full recovery within the last 24 hours, so everyone is 100% confident it will recover – FAST