In a world where more and more companies rely on magnetic data storage rather than paper copies, there is an ongoing drive not only to produce more robust and reliable data storage systems, but also to develop the expertise to recover data when those systems fail. RAID systems are a fairly robust data storage technology, but even they are susceptible to critical failures that can result in the loss of all the data on the drives involved.
This risk should be considered as part of business risk management, but this type of critical failure is often overlooked. There are three main areas where failure can occur:
• Hardware – if there is a single disk failure then recovery is often relatively straightforward, but if multiple disks or the controller card fail, data recovery can be a lot more complex
• Software – if the software used to write data to the hard disks, or the server registry, becomes corrupt then new data may not be written correctly and access to existing data may be lost
• Operator error – in many cases it is operator error that introduces the serious faults that make the data on a RAID system inaccessible. This is particularly true if the RAID configuration and the data it holds are poorly managed
In cases where only a single drive fails there are relatively simple recovery processes available. In the case of RAID 5, the damaged disk is replaced with a new one and the array recreates the contents of the damaged drive from the data and parity information held on the remaining drives. However, when the entire system crashes there are no easy fixes.
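RAID 5 can rebuild a single failed disk because each stripe stores a parity block, which is the bytewise XOR of the data blocks in that stripe. The sketch below illustrates the principle only; it is not how any particular controller is implemented. The function names, the three-disk layout and the sample block contents are all assumptions made for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise XOR of a list of equally sized blocks."""
    if not blocks:
        raise ValueError("at least one block is required")
    size = len(blocks[0])
    if any(len(block) != size for block in blocks):
        raise ValueError("all blocks must be the same size")
    # zip(*blocks) groups the bytes found at the same offset in every block.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing_block(surviving_blocks):
    """Recreate the one missing block of a stripe from all surviving blocks.

    Because parity is the XOR of the data blocks, XOR-ing every surviving
    block (data and parity alike) yields the block that was lost.
    """
    return xor_blocks(surviving_blocks)


# Hypothetical stripe on a three-disk array: two data blocks plus parity.
d0 = b"RAID"
d1 = b"DATA"
parity = xor_blocks([d0, d1])

# Simulate losing the disk that held d1 and rebuild it from the other two.
rebuilt = rebuild_missing_block([d0, parity])
assert rebuilt == d1
```

The same calculation fails as soon as two blocks of a stripe are missing, which is why multiple disk failures in a RAID 5 array cannot be repaired by a simple rebuild.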
If the entire system crashes there are two possible causes: a logical cause or a physical cause. Logical crashes are most likely to be caused by operator error, for example where inappropriate parameters are introduced into the software configuration and leave the RAID system unable to function. These types of critical failure can often be rectified by utilising specialist software that allows the system to be reconfigured.
Physical crashes are due to the physical malfunctioning of either the drives or the hardware controller. In this case the hardware components will need to be replaced or repaired. This process can be a lot more time intensive, especially if a number of the drives are physically damaged or corrupted.
To deal effectively with a critical RAID crash, a management plan needs to be in place that covers at least the following elements:
• Knowledge – An appropriate expert should be identified to carry out an initial assessment of the crash
• Planning – The costs and benefits of recovering the drives in question need to be fully understood
• Cost – The cost of carrying out the RAID recovery needs to be factored into budgets in case it is needed
Any risk management plan should also include preventative measures, such as ensuring that operators are trained and that RAID systems are regularly serviced and maintained, both physically and with software updates.