Saturday, 27 August 2016

Linux RAID-1 [7]: Recovery and fault-finding

/dev/md0:
        Version : 1.2
  Creation Time : Sat Jul  2 04:03:31 2016
     Raid Level : raid1
     Array Size : 976631360 (931.39 GiB 1000.07 GB)
  Used Dev Size : 976631360 (931.39 GiB 1000.07 GB)
   Raid Devices : 2
  Total Devices : 1
    Persistence : Superblock is persistent

    Update Time : Sat Aug 27 15:49:27 2016
          State : clean, degraded 
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           Name : patrick-H97-D3H:0
           UUID : fb59dadf:72ab4a7e:f821e41a:daa5988b
         Events : 524054

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       48        1      active sync   /dev/sdd

As the output above shows, the RAID-1 array on my main computer has a problem: one of the disks has fallen out of the array. In the table at the bottom, where the two devices in the array are listed, one is shown as removed, and the overall state of the array shows as "clean, degraded".
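For anyone who wants to check this from a script rather than by eye, here is a minimal sketch in Python that picks the State line and any removed slots out of text in the `mdadm --detail` layout shown above. The function name and the trimmed sample are my own for illustration, and the parsing only assumes the layout seen in this post; other mdadm versions may print things differently.

```python
import re

# Trimmed copy of the output shown above, used only as sample input.
SAMPLE_DETAIL = """\
/dev/md0:
     Raid Level : raid1
   Raid Devices : 2
  Total Devices : 1
          State : clean, degraded 
 Active Devices : 1

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       48        1      active sync   /dev/sdd
"""

def parse_mdadm_detail(text):
    """Return (state_flags, removed_slots) from mdadm --detail style text."""
    state = []
    removed = []
    for line in text.splitlines():
        stripped = line.strip()
        # Overall array state, e.g. "State : clean, degraded"
        if stripped.startswith("State :"):
            state = [s.strip() for s in stripped.split(":", 1)[1].split(",")]
        # Device table rows: Number Major Minor RaidDevice State
        m = re.match(r"^\d+\s+\d+\s+\d+\s+(\d+)\s+removed\b", stripped)
        if m:
            removed.append(int(m.group(1)))
    return state, removed

state, removed = parse_mdadm_detail(SAMPLE_DETAIL)
print("state:", state, "removed slots:", removed)
```

In real use the text would come from running `mdadm --detail /dev/md0` (which normally needs root) rather than from a pasted string.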

The cause is shown in this extract from the kernel log:

Aug 27 11:05:09 MainPC kernel: [25592.706805] ata3: link is slow to respond, please be patient (ready=0)
Aug 27 11:05:12 MainPC kernel: [25595.794847] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Aug 27 11:05:12 MainPC kernel: [25595.801763] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x100)
Aug 27 11:05:12 MainPC kernel: [25595.801765] ata3.00: revalidation failed (errno=-5)
Aug 27 11:05:21 MainPC kernel: [25604.206942] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Aug 27 11:05:21 MainPC kernel: [25604.213541] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x100)
Aug 27 11:05:21 MainPC kernel: [25604.213543] ata3.00: revalidation failed (errno=-5)
Aug 27 11:05:21 MainPC kernel: [25604.213556] ata3: limiting SATA link speed to 3.0 Gbps
Aug 27 11:05:29 MainPC kernel: [25612.239030] ata3: SATA link down (SStatus 0 SControl 320)
Aug 27 11:05:29 MainPC kernel: [25612.239034] ata3.00: link offline, clearing class 1 to NONE
Aug 27 11:05:29 MainPC kernel: [25612.239037] ata3.00: disabled
Aug 27 11:05:29 MainPC kernel: [25612.239048] sd 2:0:0:0: rejecting I/O to offline device
Aug 27 11:05:29 MainPC kernel: [25612.239051] sd 2:0:0:0: killing request
Aug 27 11:05:29 MainPC kernel: [25612.239056] ata3.00: detaching (SCSI 2:0:0:0)
Aug 27 11:05:29 MainPC kernel: [25612.239183] sd 2:0:0:0: [sdb] Start/Stop Unit failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Aug 27 11:05:29 MainPC kernel: [25612.242814] sd 2:0:0:0: [sdb] Synchronizing SCSI cache
Aug 27 11:05:29 MainPC kernel: [25612.242840] sd 2:0:0:0: [sdb] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Aug 27 11:05:29 MainPC kernel: [25612.242841] sd 2:0:0:0: [sdb] Stopping disk
Aug 27 11:05:29 MainPC kernel: [25612.242847] sd 2:0:0:0: [sdb] Start/Stop Unit failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Aug 27 11:05:29 MainPC kernel: [25612.251453] ata3: exception Emask 0x10 SAct 0x0 SErr 0x4040000 action 0xe frozen

The disk is identified as /dev/sdb, which is the disk missing from the mdadm output. There were plenty of other similar messages in the kernel log from later in the boot cycle. The physical symptom as the computer was coming on, while these messages were scrolling up the screen, was that the disk would keep spinning up, then stopping again and resetting. The messages come up on the screen because I have text mode set in grub: this computer needed text-mode startups after it was first installed, to make it easier to swap from the broken Nouveau drivers to the NVidia ones for the graphics card.
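Picking the device name out of a long kernel log by eye gets tedious, so here is a rough sketch in Python of how the failing device could be pulled out of lines like the ones in the extract above. The regular expression and the function name are assumptions of mine, based only on the `sd H:C:T:L: [sdX] message` shape of the lines in this post, so treat it as an illustration rather than a general log parser.

```python
import re

# Five lines copied from the kernel log extract above.
SAMPLE_LOG = [
    "Aug 27 11:05:29 MainPC kernel: [25612.239183] sd 2:0:0:0: [sdb] Start/Stop Unit failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK",
    "Aug 27 11:05:29 MainPC kernel: [25612.242814] sd 2:0:0:0: [sdb] Synchronizing SCSI cache",
    "Aug 27 11:05:29 MainPC kernel: [25612.242840] sd 2:0:0:0: [sdb] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK",
    "Aug 27 11:05:29 MainPC kernel: [25612.242841] sd 2:0:0:0: [sdb] Stopping disk",
    "Aug 27 11:05:29 MainPC kernel: [25612.242847] sd 2:0:0:0: [sdb] Start/Stop Unit failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK",
]

# Matches "sd <SCSI address>: [<device>] <message>"
SD_LINE = re.compile(r"sd (\d+:\d+:\d+:\d+): \[(sd[a-z]+)\] (.+)")

def failed_devices(log_lines):
    """Map each device name to the list of its messages containing 'failed'."""
    failures = {}
    for line in log_lines:
        m = SD_LINE.search(line)
        if m and "failed" in m.group(3):
            failures.setdefault(m.group(2), []).append(m.group(3))
    return failures

for device, messages in failed_devices(SAMPLE_LOG).items():
    print(device, "had", len(messages), "failed commands")
```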

The sequence of kernel messages finally ends at 11:06:53, when the disk apparently came back online. So it took nearly two minutes to get the disk working, and in the meantime, even though the SMART tests look OK, mdadm has dropped it out of the array. These disks are a pair of 1 TB WD Caviar Blacks that I purchased four years and five days ago. An obvious consideration is whether to replace them with 2 TB units. I am currently using two computers to store data, each in a 1 TB array, because I ran out of disk space on the 1 TB array some time ago and decided that splitting the data across two computers was the cheaper option, as some old spare 1 TB disks were available for the second computer.

I have decided to replace both disks in the array with 2 TB disks, a long-deferred decision as mentioned above. This will bring the array up to 2 TB and allow the other computer to be eliminated as an extra storage location. I am looking ahead to a day when having two separate computers in two rooms of the house takes up too much space; in fact, having four computers in two rooms, as is actually the case, is extravagant. The idea then would be to have just one computer desk with two computers attached to it.

Obviously there are the usual lessons about backups to be learned. In this case I had a recent backup that was made when Xubuntu was put onto the computer. I also have a regular series of backups on some removable disks, and the use of the RAID array is in itself a backup technique. But if there was a fire here I could still lose a month's worth of data: currently none of those backup disks are out of the house, so I have to get back to taking at least one of them offsite regularly. However, at the moment not many photos are being taken, and most other stuff like maps and study material is backed up on Google Drive, so I could probably manage. Full rsync backups are what I mainly do at the moment because I haven't worked out anything else.

mdadm and some other systems exist for monitoring RAID arrays but I haven't used any monitoring tool to date.
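As a very small stand-in for a proper monitoring tool, the sketch below (Python) looks for degraded arrays in text in the /proc/mdstat format, where a status such as [2/1] means two devices are expected and only one is active. The sample text is not real output from my machine; it is made up to match the array described in this post, and the function name is just illustrative.

```python
import re

# Illustrative sample only, written to match the array in this post.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sdd[1]
      976631360 blocks super 1.2 [2/1] [_U]

unused devices: <none>
"""

def degraded_arrays(mdstat_text):
    """Return {array: (expected, active)} for arrays that are missing members."""
    result = {}
    current = None
    for line in mdstat_text.splitlines():
        # Array header line, e.g. "md0 : active raid1 sdd[1]"
        m = re.match(r"^(md\d+)\s*:", line)
        if m:
            current = m.group(1)
            continue
        # Status line carries "[expected/active]", e.g. "[2/1]"
        m = re.search(r"\[(\d+)/(\d+)\]", line)
        if m and current:
            expected, active = int(m.group(1)), int(m.group(2))
            if active < expected:
                result[current] = (expected, active)
            current = None
    return result

print(degraded_arrays(SAMPLE_MDSTAT))
```

On a real system the text could be read with `open("/proc/mdstat").read()` and the check run from cron, with whatever alerting suits.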

Well, I ran two SMART tests on the disk using smartctl: the short one was fine and the long one came back with a read failure. In fact gnome-disk-utility, the GUI we all use to look at disks in distros like Ubuntu, has now updated itself to show the disk as failed. So I am now working on the basis that I need to make some backups and replace this disk, because it really is stuffed. New disks will arrive next week and the new array will get put in soon after that. The plan is to make a backup of /home to a removable drive in the caddy and redirect /home to it, then take out both the old disks, put the new ones in, copy /home back to them and redirect back to them.
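The copy-out and copy-back steps in that plan could be sketched roughly as below in Python, building the two rsync commands. The caddy mount point is a hypothetical path I have made up for the example, and plain `rsync -a` (archive mode) is an assumption about what a full copy would use; extra options may be wanted in practice.

```python
def rsync_command(source, destination):
    """Build an rsync command that copies the contents of source into destination."""
    # A trailing slash on the source makes rsync copy the directory's
    # contents rather than creating the directory inside the destination.
    src = source.rstrip("/") + "/"
    dst = destination.rstrip("/") + "/"
    return ["rsync", "-a", src, dst]

# Hypothetical mount point for the removable drive in the caddy.
BACKUP_DIR = "/media/caddy/home-backup"

backup = rsync_command("/home", BACKUP_DIR)    # before the old disks come out
restore = rsync_command(BACKUP_DIR, "/home")   # after the new array is in place

print(" ".join(backup))
print(" ".join(restore))
```

Each list could be handed to `subprocess.run(command, check=True)` to actually run the copy, once the paths have been checked.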