Buck past trends and improve IT recoverability in 2016

Forrester and Disaster Recovery Journal have been tracking recovery time trends. Every 3 years since 2007 they have asked companies whether they were able to recover from their most recent disruption in less than 1 hour. The results are striking, and on the current trend we can only predict that barely 1% of companies in 2016 will be able to recover their IT systems in less than 1 hour.

Q: Could you recover from IT disruption in under 1 hour? Those that replied ‘Yes’:

[Chart: RTO under 1 hour]

So should your New Year’s Resolution for your business be to assess your risk from IT downtime? What can you do to better protect your IT systems from such incidents and ensure your recoveries are faster and less damaging to your business?

  1. Review your Disaster Recovery solution. If you’re looking to achieve recovery times of less than 1 hour, then recovering from a backup product is out of the question. Cloud Disaster Recovery service providers who recover part or all of your IT systems following an incident are also out of the question: most Disaster Recovery providers offer recovery times of 4 hours or more, as they have a lot of work to do following the failure.

    The only solutions that can give you recovery times of under 1 hour are a physical standby solution (where you have a replica physical IT system ready and working for you to switch to should your production system go down), or a virtual standby solution (where you have a virtual replica IT system ready and working for you to switch to should your production system go down).

    This is where clever marketing can make things confusing. A standby solution is one where your entire IT system, including operating systems, applications and data, is in a secondary location, fully configured and ready to be used should your primary system fail. The word ‘replication’ is often mistaken for standby, but replication can mean anything from virtual machine replication to data replication to entire system replication, so it’s important to clarify what you’re looking at. Only entire system replication will offer recovery times of under 1 hour if you have a full system failure.

    Having said this, both of the standby options (physical and virtual) still carry an element of risk:

    – Recovery/standby systems are often not 100% up to date. Every change made to the production systems also needs to be made to the standby system for seamless failover to work. If, for example, a database has been moved from one server to another but this change has not been made on the standby system, then when you fail over, your standby system won’t match your live system and you’ll encounter errors which will delay recovery. Standby systems require a lot of maintenance to keep them up to date.

    – Infrequent testing can mean standby systems aren’t working, because errors are not picked up in time. In addition to budgeting for the standby system (hardware/data centre costs, software licensing, replication tools, connectivity etc.), companies need to budget for maintenance and testing of their secondary sites (unless this service is offered by your DRaaS provider). Often, however, the testing budget is overlooked because it can make the solution non-viable, so testing happens once a year, if that, whereas best practice is to test daily. According to our disaster recovery research, 24% of tests fail. This means that despite big investments in standby systems, a lack of maintenance and testing leaves a risk that the investment will not pay dividends.

  2. Test whether your recovery time meets your RTO (recovery time objective) in a Disaster Recovery test scenario. Although it may be seen as disruptive to business, a controlled environment is a much better place to test your DR solution than an unplanned situation, where a lack of systems and communication methods can have seriously detrimental consequences. It’s ultimately the executive board who are responsible for risk, so they should be supporting efforts to reduce risk and encouraging more frequent and thorough testing. Our guide to disaster recovery testing tells you what you should be testing as best practice. Many companies will uncover flaws in their current Disaster Recovery solution, which may just need some attention or may be the justification for an upgrade to reduce recovery times.
  3. Outsource your Disaster Recovery. And if you’re thinking “you would say that”, just take a moment to consider a scenario where you are without business critical systems: email, website and operational systems are all down. What happens next? Your IT team get bombarded with problems to fix, while you get pressure from the executive board to get the business back to full service, and the board ultimately get pressure from investors and public relations. The best people to recover your IT systems and get you back to full service as quickly as possible are therefore those who are well practised and removed from the business pressures, so they can focus solely on the recovery. Let your IT team deal with the internal pressures and individual critical needs, and then the final configuration requirements if needed. Let a specialist DR company deal with the recovery, as it will be much quicker.
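
As a rough illustration of step 2, the comparison between measured recovery time and your RTO can be recorded and checked in a few lines of Python. This is only a sketch under stated assumptions: the record layout (DrTest), the helper names (meets_rto, summarise) and the two sample test runs are made up for illustration, and the 1 hour RTO is simply the target discussed in this article.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrTest:
    """One disaster recovery test run (hypothetical record layout)."""
    name: str
    failure_declared: datetime   # when the simulated outage began
    service_restored: datetime   # when users could work again

    @property
    def recovery_time(self) -> timedelta:
        # Measured recovery time for this test run
        return self.service_restored - self.failure_declared


def meets_rto(test: DrTest, rto: timedelta) -> bool:
    """True if the measured recovery time is within the RTO."""
    return test.recovery_time <= rto


def summarise(tests, rto):
    """Return (tests within RTO, tests over RTO, worst recovery time)."""
    passed = sum(1 for t in tests if meets_rto(t, rto))
    worst = max(t.recovery_time for t in tests)
    return passed, len(tests) - passed, worst


# Assumed target and made-up sample data, for illustration only
RTO = timedelta(hours=1)
tests = [
    DrTest("standby failover",
           datetime(2016, 1, 12, 9, 0), datetime(2016, 1, 12, 9, 40)),
    DrTest("restore from backup",
           datetime(2016, 2, 9, 9, 0), datetime(2016, 2, 9, 14, 30)),
]

for t in tests:
    status = "within RTO" if meets_rto(t, RTO) else "MISSED RTO"
    print(f"{t.name}: {t.recovery_time} ({status})")
```

Keeping a record like this for every test gives the board a simple view of how often the RTO is actually met, and whether the worst case is getting better or worse over time.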

 

And if you’re struggling to cut through the marketing and technical lingo whilst reviewing new providers, ask them exactly what will need to be done following an IT disaster (the recovery process and configuration requirements) and what sort of guarantees they will give on your recovery times.