Improving Reliability of Software RAID

Registered by Clint Byrum

mdadm currently has a gaggle of open bugs, and every cycle the RAID ISO tests produce new and interesting bugs. It seems like we're doing something a bit wrong with Software RAID. There are some proposed solutions at https://wiki.ubuntu.com/ReliableRaid, which should be discussed and either refuted or implemented.

Blueprint information

Status:
Started
Approver:
Steve Langasek
Priority:
High
Drafter:
Dimitri John Ledkov
Direction:
Approved
Assignee:
Dimitri John Ledkov
Definition:
Approved
Series goal:
Accepted for raring
Implementation:
Started
Milestone target:
ubuntu-13.04
Started by:
Steve Langasek

Related branches

Sprints

Whiteboard

Past Points:
[kees] Collect historical work done on improving raid: TODO
[kees] Write detailed specification of mdadm initramfs requirements: TODO
[kees] Write detailed specification of mdadm post-initramfs requirements: TODO
Test RAID over LVM and LVM over RAID: TODO
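As a rough sketch of the LVM-over-RAID layering that the test above needs to exercise, assuming an md array such as /dev/md0 already exists (the volume group name, LV name, and sizes below are placeholders):

  # Layer LVM on top of an existing md array (device and names are examples)
  sudo pvcreate /dev/md0
  sudo vgcreate vgtest /dev/md0
  sudo lvcreate -L 64M -n lvtest vgtest
  sudo mkfs.ext4 /dev/vgtest/lvtest

  # The interesting part of the test is then failing an md member (see the
  # failure-mode sketch further down) and confirming the LV stays usable
  # and comes back cleanly after a reboot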

Notes from etherpad:

there are a lot of bugs

- put together a tree of failure conditions
- map the intended handling for each failure condition
- check existing code against that intention, fix deltas
- suggest/recommend SMART monitoring in servers
- Look into automated testing of ALL supported RAID modes
- Test case for LVM over RAID
- Investigate and test booting without mdadm.conf
- Investigate not autostarting certain arrays
- Interface with upstream for feedback (invite to UDS-P)
- Fully document (maybe in conjunction with upstream?) and review existing documentation around software RAID debugging and general maintenance.
Multiple sources of disk/array information can be used to diagnose which device is which and what has happened, for example:
** ll /dev/disk/by-id/
** mdadm --detail /dev/md127
** cat /proc/mdstat
** messages from the kernel reference devices as "ata*.*", with no easy way to trace them back to a "real" /dev/sd* device (see the mapping sketch under "Debugging failure cases" below)
** lshw
** how to identify which physical drive is which? often "dd if=/dev/sd* of=/dev/null" (where * is the failed drive) and watching which drive stays solidly active (see the sketch after this list)
- The upstream documentation at https://raid.wiki.kernel.org/index.php/Linux_Raid is very basic when it comes to diagnostics or failure conditions, and is quite outdated in many areas.
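As a rough sketch of the identification steps above, assuming smartmontools is installed (device names here are examples only):

  # Note which member mdadm reports as faulty
  mdadm --detail /dev/md127

  # Map kernel names to stable ids; serial numbers are embedded in the symlink names
  ls -l /dev/disk/by-id/

  # Confirm the serial number and overall health of a suspect member
  sudo smartctl -i /dev/sdb
  sudo smartctl -H /dev/sdb

  # Physically locate it by generating steady read activity and watching its LED
  sudo dd if=/dev/sdb of=/dev/null bs=1M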

Intentions
- Preserve Data Integrity
- Detect Known Failure Modes as early as possible
- Allow the system to run and reboot cleanly even if partial hardware failures occur or have occurred.
- Provide Options for how to handle failure modes:
  - The BOOT_DEGRADED=false option lets admins safeguard against mdadm bugs and perform recovery manually, but it should default to true.
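A minimal sketch of how that option is wired up, assuming the Ubuntu initramfs-tools integration of this era (the exact path should be verified against the current mdadm packaging):

  # /etc/initramfs-tools/conf.d/mdadm -- persistent setting read at early boot
  BOOT_DEGRADED=true

  # Regenerate the initramfs after changing it
  sudo update-initramfs -u

  # A one-off override for a single boot can be passed on the kernel command line:
  #   bootdegraded=true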

Failure Modes
- Degraded array at boot
- Failed drive at runtime (see the simulation sketch after this list)
- Removed drive at runtime (where metadata is intact)
- Adding out-of-sync drive
- Adding failed drive
- Drives producing corrupt reads without failure
- Old RAID configuration resurrection
- LVM starting up on mirror halves
- Hardware is failing, but has not yet failed (SMART details)
  - can we link the SMART data to some kind of user reporting?
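A hedged sketch for exercising the "failed drive at runtime" mode above without real hardware, using a RAID1 array built on loop devices (file paths, sizes, and the md device name are arbitrary); the same scaffolding could feed the automated RAID testing item:

  # Two small backing files attached as loop devices
  dd if=/dev/zero of=/tmp/raid-a.img bs=1M count=128
  dd if=/dev/zero of=/tmp/raid-b.img bs=1M count=128
  LOOP_A=$(sudo losetup --show -f /tmp/raid-a.img)
  LOOP_B=$(sudo losetup --show -f /tmp/raid-b.img)

  # Create a RAID1 array over them
  sudo mdadm --create /dev/md0 --level=1 --raid-devices=2 "$LOOP_A" "$LOOP_B"

  # Fail one member, watch the array go degraded, then remove it
  sudo mdadm /dev/md0 --fail "$LOOP_A"
  cat /proc/mdstat
  sudo mdadm /dev/md0 --remove "$LOOP_A"

  # Re-add the member and watch the resync
  sudo mdadm /dev/md0 --add "$LOOP_A"
  cat /proc/mdstat

  # Tear down
  sudo mdadm --stop /dev/md0
  sudo losetup -d "$LOOP_A" "$LOOP_B"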

Validate the behavior of drivers/arrays/drives
- does the driver notice a yanked drive? (see the hot-removal sketch after this list)
- does the driver notice a failed drive?
- how does the driver react to a new drive getting inserted?
** New drive being inserted with an alternate RAID config
** Some controllers are hot-swap capable and some are not; how do we identify which?
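A hedged sketch for the "yanked drive" question above, using the SCSI sysfs interfaces rather than physically pulling a disk (device and host numbers are examples):

  # Simulate removing /dev/sdb by deleting it from the SCSI layer
  echo 1 | sudo tee /sys/block/sdb/device/delete

  # The kernel log and mdadm should now show the member as missing/faulty
  dmesg | tail
  cat /proc/mdstat

  # Ask the controller to rescan and rediscover the drive
  echo "- - -" | sudo tee /sys/class/scsi_host/host0/scan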

Debugging failure cases (user side?)
- logic to align dm/ata information as expressed in dmesg etc. with the /dev/sdX devices that mdadm knows about
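A minimal sketch of that alignment, assuming libata-backed disks whose sysfs device paths contain an "ataN" component:

  # Print each sd device next to the ata port that backs it (if any)
  for dev in /sys/block/sd*; do
      ata=$(readlink -f "$dev/device" | grep -o 'ata[0-9]*' | head -n 1)
      printf '%s -> %s\n' "${dev##*/}" "${ata:-no ata link}"
  done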

Links:
https://wiki.ubuntu.com/HotplugRaid

drussell 2011-05-17: Added more content to the etherpad session post UDS...
slangasek 2011-10-31: this has been reproposed for a session at UDS-P, but I don't think any of the facts have changed. Why is another session needed for this?
cbyrum 2011-10-31: Agreed Steve, this just needs to get done, nothing has changed.
drussell - 2011-10-31: Absolutely agreed... so how do we focus on getting this done?
dmitrij.ledkov 2012-05-18: Adding foundations-q-degraded-hw-notification as a dependency for sending degraded raid notifications to the user & integrating SMART notifications.

The foundations-q-event-based-initramfs dependency is not fully determined yet; it is pending investigation of current RAID deficiencies. Only then might event-based-initramfs become a hard dependency.


Work Items

Work items:
create (Ubuntu|Upstream) RAID Architecture Specification: INPROGRESS
update existing (i.e. out-of-date) RAID documentation: TODO
identify and document all failure conditions: TODO
investigate if foundations-q-event-based-initramfs is required for completing this spec (UPDATE: it is, due to cryptsetup): DONE
test existing handling of failure modes for RAID: TODO
establish automated RAID testing and RAID failure condition testing: BLOCKED
review blueprint linked bugs for likelihood of fixing this cycle: DONE
get input from kees on the whole topic of Reliable Raid: DONE
backport critical reliability fixes to 12.04 LTS: DONE

Dependency tree
