
Surviving Two Disk Failures

Introducing Various "RAID 6" Implementations
By Robert Maddock, Nigel Hart & Tom Kean

Introduction
The following white paper discusses why RAID 6 has become more interesting to implement in storage controllers in recent years. This is in part due to the adoption of high-capacity (SATA) disk drives, which has made a failure during a rebuild operation significantly more likely.

The paper discusses the implications of RAID 6 algorithms generically, comparing them against other techniques for protecting against double drive failures. There are multiple RAID 6 algorithms, and these are discussed at a high level. Each differs somewhat in its capabilities, but in general some applications are more easily deployed onto RAID 6 arrays than others.

Rationale
The capacity of disk drives is continuously increasing, and the cost of storage is falling. So users are storing more and more data. But the reliability of disk drives does not seem to be improving (and moving to ATA drives may make it worse). So the number of data loss incidents in an installation is probably increasing. Customers of course want the number to fall, since dealing with such incidents is a time-consuming and expensive operation. The ideal would be a storage product that "never loses data". Is this feasible? Many storage controllers today offer RAID arrays that can survive the failure of one disk drive without losing data. RAID-5 and RAID-1 (mirroring) are the most common examples. When a disk fails, it will be replaced and the contents rebuilt from the information on the remaining disks. There are two ways in which a further failure can lead to data loss.


1. A second disk drive failure
If a second disk fails before the replace and rebuild is complete, the whole RAID array is lost. All the LUNs on this array will be lost, and if virtualisation is being used, sections of many LUNs may be lost. To reduce the chance of this happening, it is important to do the replace and rebuild as quickly as possible. Having a hot spare available allows the replacement to happen immediately, and rebuilding quickly reduces the exposure time, at the cost of a more severe performance reduction for normal operations. But as disk capacity increases, the disk data rate usually increases by a smaller amount, so the rebuild time increases, and with it the chance of data loss. Server-class disk drives have a specified MTBF of 1 million hours or more. So an 8+P RAID-5 array (nine drives) will suffer a disk drive failure every 111,111 hours. With a replace-and-rebuild time of 10 hours, the chance of a second failure during that window is 80 in 1,000,000, so the array is lost, on average, once every 1.4E9 hours.
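As a rough sanity check, the following short Python sketch reproduces this arithmetic, assuming independent failures and the figures quoted above (1,000,000-hour MTBF, nine drives in an 8+P array, a 10-hour rebuild window):

    # Back-of-the-envelope check of the figures above, assuming
    # independent failures, a 1,000,000-hour drive MTBF, a 9-drive
    # (8+P) RAID-5 array and a 10-hour replace-and-rebuild window.
    mtbf_hours  = 1_000_000
    drives      = 9          # 8 data + 1 parity
    rebuild_hrs = 10

    first_failure_interval = mtbf_hours / drives                 # ~111,111 hours
    p_second_failure = (drives - 1) * rebuild_hrs / mtbf_hours   # ~80 in 1,000,000
    array_loss_interval = first_failure_interval / p_second_failure

    print(f"first drive failure every {first_failure_interval:,.0f} hours")
    print(f"chance of a second failure during rebuild: {p_second_failure:.0e}")
    print(f"array loss roughly every {array_loss_interval:.1e} hours")  # ~1.4e9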

This sounds good: the chances of having an array failure are incredibly small. But it is doubtful whether the 1,000,000 hour MTBF figure is real. It isn't clear that any disk drives really reach this. Certainly some models of disk drive are worse than others, and if the factory builds a bad batch, they may well end up together, built into the same RAID array. If the figure is ten times worse, the incidence of array loss is 100 times worse. If the customer has 100 RAID arrays, the figure is 100 times worse again. Now we are at 1.4E5 hours, about 15 years. So over a ten-year life it is likely to happen.

All this assumes the drive failures are independent events. Even worse, although difficult to quantify, are situations where some influence common to the drives of an array leads to failures. Stiction after a power failure was a notorious example in the past, but various environmental problems could cause linked failures.

2. A read error during rebuild
To rebuild the contents of a failed disk drive in an 8+P RAID-5 array, we must read all the data on each of the 8 remaining disk drives. Server-class drives specify an unrecoverable read error rate of about 1 in 1E14 bits. It isn't clear exactly what this means, and disk drive manufacturers provide little detail, but it suggests that reading eight drives of 300GB each, i.e. 8*300E9*10 = 2.4E13 bits, will have about one chance in 4 of a read error. If such a read error occurs, at least one block of the data on the array is lost. With or without virtualisation, this will be one block in one LUN. This will happen every 4 disk drive failures, i.e. every 4.4E5 hours. If the customer has 100 arrays it will happen every 4.4E3 hours, i.e. every 6 months. This is with disk drives that meet their specification. If they are ten times worse, it happens 100 times more often, i.e. every other day.
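The estimate can be reproduced with a short sketch, assuming the quoted 1-in-1E14-bit error rate and the paper's own rough figure of 10 bits per byte of user data:

    # Sketch of the unrecoverable-read-error estimate above: one
    # unrecoverable error per 1e14 bits read, eight surviving 300 GB
    # drives to be read in full during the rebuild.
    drives          = 8
    bytes_per_drive = 300e9
    bits_per_byte   = 10           # the paper's rough figure, allowing for overhead
    error_rate      = 1 / 1e14     # unrecoverable errors per bit read

    bits_read = drives * bytes_per_drive * bits_per_byte   # 2.4e13 bits
    p_read_error = bits_read * error_rate                  # ~0.24, about 1 in 4
    print(f"chance of at least one read error during rebuild: ~{p_read_error:.2f}")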

Repair error
Apart from data loss, there is one more argument that has been advanced for multiple redundancy. Apparently, it is quite common that when the repairman arrives to repair a RAID-5 array in which a disk has failed, he pulls out the wrong disk. Although this should not lead to long-term data loss, it certainly causes loss of access to data until the mistake is rectified. In an array with multiple redundancy, access would not be interrupted.

What can be done?
It seems unlikely that disk drive manufacturers will come up with orders-of-magnitude improvements in reliability, as long as we are using spinning platters and magnetics. So the only way to improve things substantially is to use arrays which can survive two (or more) failures without losing data. There are many different designs for arrays with multiple redundancy, with different advantages and disadvantages. The main trade-off is, as always, between cost and performance. The more disks that are used for a given amount of data, the higher the cost. With disks capable of a given number of operations per second, the number of disk operations required to implement each read or write to the controller determines performance. (There are cost and performance trade-offs in the controller itself as well. If the algorithms used are complex, the controller is likely to be more expensive, both to develop and to manufacture, for a given level of performance.)

Various designs for multiple redundancy

1. Triple mirroring
Three copies of the data are stored on three separate disks. It is expensive, since it uses three times as many disks as JBOD. But it is simple, and it performs quite well. Each read requires one disk read, from any of the three copies, and each write requires three disk writes: one to each copy.

2. RAID-51
Copies of the data are stored on two separate RAID-5 arrays, on separate disks of course. This is moderately expensive, since it uses more than twice as many disks as JBOD. The performance is quite good for reads, but modest for writes, since two RAID-5 writes are required. Each read requires one disk read, from either copy. Each write requires a read-modify-write to two disks on both RAID-5 arrays, although the reads only need to be done from one, making a total of six operations. As with RAID-5, long sequential writes can be optimised into full-stride writes.
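A small sketch of that operation count for a single sub-stripe write, under the behaviour described above (old data and old parity are read from one copy only; new data and new parity are written to both copies):

    # Disk operations for one small (sub-stripe) write on RAID-51.
    reads  = 2              # old data strip + old parity strip, from one copy only
    writes = 2 * 2          # new data + new parity, written on each of the two copies
    print(reads + writes)   # -> 6 disk operations per small write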

RAID-51 is particularly suitable when the two RAID-5 copies can be geographically separated, since complete failure of either site leaves one complete RAID-5, providing all the data with continuing redundancy.

3. RAID 5&0
A variant of RAID-51 is to store one RAID-5 copy of the data and one RAID-0 copy. This reduces the cost slightly (only one copy of parity) and improves the performance as well, since writing to the RAID-0 copy is more efficient. Writes now require read-modify-writes to two disks on the RAID-5, and one disk write on the RAID-0.

4. RAID-6
The term RAID-6 is used here to describe any technique that stores strips of user data plus two derived redundancy strips, similar to the one parity strip in RAID-5. These redundancy strips are such that the data can be recovered after the loss of any two data or redundancy strips. The name RAID-6 should be used with caution, since it has been used for various different things (any salesman would see it as one better than RAID-5!). Another term is RAID-DP (double parity), but this is confusing too, since Network Appliance use this for their double-parity technique, and HP use the even more confusing term RAID-5DP for the method used in their VA-7000 product. RAID-6 offers low cost, since only one copy of the data is stored, plus the redundancy strips. The performance is poor, although the performance per physical disk is no worse than RAID-51. Each read requires one disk read. Each write requires a read-modify-write on three disks. As with RAID-5, long sequential writes can be optimised.

The following table summarises the characteristics of these various methods and compares them to common methods with single or no redundancy.
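Before that comparison, it may help to make the "two derived redundancy strips" idea concrete. The following minimal Python sketch shows one common Reed-Solomon-style "P + Q" construction over GF(2^8); it is purely illustrative and is not the specific algorithm used by any particular vendor mentioned above:

    # Minimal sketch of a "P + Q" RAID-6 construction: P is the familiar
    # XOR parity, Q is a Reed-Solomon-style checksum over GF(2^8).

    def gf_mul(a: int, b: int) -> int:
        """Multiply two bytes in GF(2^8) using the polynomial x^8+x^4+x^3+x^2+1."""
        result = 0
        for _ in range(8):
            if b & 1:
                result ^= a
            carry = a & 0x80
            a = (a << 1) & 0xFF
            if carry:
                a ^= 0x1D
            b >>= 1
        return result

    def pq_parity(strips: list[bytes]) -> tuple[bytes, bytes]:
        """Compute the P (XOR) and Q (weighted GF(2^8)) redundancy strips."""
        length = len(strips[0])
        p = bytearray(length)
        q = bytearray(length)
        for i, strip in enumerate(strips):
            coeff = 1
            for _ in range(i):            # generator 2 raised to the strip index
                coeff = gf_mul(coeff, 2)
            for j, byte in enumerate(strip):
                p[j] ^= byte
                q[j] ^= gf_mul(coeff, byte)
        return bytes(p), bytes(q)

    # With P and Q, any two lost strips (data or redundancy) can be rebuilt;
    # recovering a single lost data strip needs only P, exactly as in RAID-5.
    data = [b"AAAA", b"BBBB", b"CCCC"]
    p, q = pq_parity(data)
    rebuilt = bytes(a ^ b ^ c for a, b, c in zip(data[1], data[2], p))
    assert rebuilt == data[0]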

It is worth noting that the probability of three independent failures is very small, probably less likely than a geographical failure - fire, flood, earthquake, etc. - and so geographical separation is probably more useful than RAID-51 on one site. Such separation can be implemented with RAID 1/10, RAID-51, RAID 5&0, or triple mirroring.
