Mirror replacement
This is a generic example of drive replacement for RAID 1.
The partition layout and the number of RAID devices can vary.
PARTITIONS
$ fdisk -l /dev/sda
Disk /dev/sda: 2.7 TiB, 3000592982016 bytes, 5860533168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 9B5F217F-3F30-4564-A171-E5B4AD5C208F

Device       Start        End    Sectors  Size Type
/dev/sda1     4096    1052671    1048576  512M Linux RAID
/dev/sda2  1052672 5860533134 5859480463  2.7T Linux RAID
/dev/sda3     2048       4095       2048    1M BIOS boot

Partition table entries are not in disk order.

$ fdisk -l /dev/sdb
Disk /dev/sdb: 2.7 TiB, 3000592982016 bytes, 5860533168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: BCCDC02E-470A-45B6-B980-DA44E29FEC14

Device       Start        End    Sectors  Size Type
/dev/sdb1     4096    1052671    1048576  512M Linux RAID
/dev/sdb2  1052672 5860533134 5859480463  2.7T Linux RAID
/dev/sdb3     2048       4095       2048    1M BIOS boot

Partition table entries are not in disk order.
SMART
SMART data lets us detect the imminent failure of a drive; Smartmontools provides the tools to read it.
Smartmontools can also be configured to send notifications.
$ sudo smartctl --all /dev/sdb
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-5-amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital RE4 (SATA 6Gb/s)
Device Model:     WDC WD3000FYYZ-01UL1B2
Serial Number:    WD-WCC134KLD74Z
LU WWN Device Id: 5 0014ee 2b5e68c42
Firmware Version: 01.01K03
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Mon Oct 28 10:13:11 2019 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x85) Offline data collection activity
                                        was aborted by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 241) Self-test routine in progress...
                                        10% of test remaining.
Total time to complete Offline
data collection:                (34320) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 372) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x70bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   152   152   021    Pre-fail  Always       -       11383
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       7
  5 Reallocated_Sector_Ct   0x0033   189   189   140    Pre-fail  Always       -       357
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   068   068   000    Old_age   Always       -       23380
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       7
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       3
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       3
194 Temperature_Celsius     0x0022   099   096   000    Old_age   Always       -       53
196 Reallocated_Event_Count 0x0032   194   194   000    Old_age   Always       -       6
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     23325         -
# 2  Short offline       Completed without error       00%     23300         -
# 3  Short offline       Completed without error       00%     23180         -
# 4  Short offline       Completed without error       00%     23108         -
# 5  Short offline       Completed without error       00%     22989         -
# 6  Short offline       Completed without error       00%     22944         -
# 7  Extended offline    Completed without error       00%     22927         -
# 8  Short offline       Aborted by host               10%     22901         -
# 9  Short offline       Completed without error       00%     22892         -
#10  Short offline       Completed without error       00%     22821         -
#11  Short offline       Completed without error       00%     22797         -
#12  Short offline       Completed without error       00%     22677         -
#13  Short offline       Completed without error       00%     22653         -
#14  Short offline       Completed without error       00%     22629         -
#15  Short offline       Completed without error       00%     22605         -
#16  Short offline       Completed without error       00%     22581         -
#17  Short offline       Completed without error       00%     22533         -
#18  Short offline       Completed without error       00%     22461         -
#19  Short offline       Completed without error       00%     22393         -
#20  Short offline       Completed without error       00%     22293         -
#21  Short offline       Completed without error       00%     22221         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
The drive above has bad sectors, as shown by the reallocation counters (Reallocated_Sector_Ct has a raw value of 357 and Reallocated_Event_Count a raw value of 6).
Although the drive is working for now, it probably will not last, so it needs to be replaced.
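The check described above can be scripted. The following is a minimal sketch, not part of the original procedure: it reads smartctl attribute output on standard input and prints the watched attributes whose raw value is non-zero. The function name smart_warnings and the choice of attribute IDs (5, 196, 197, 198) are assumptions made for illustration.

```shell
# smart_warnings: read `smartctl --all` (or `-A`) output on stdin and print
# every watched attribute whose raw value is greater than zero.
# Watched IDs (an assumption, adjust as needed):
#   5 Reallocated_Sector_Ct, 196 Reallocated_Event_Count,
#   197 Current_Pending_Sector, 198 Offline_Uncorrectable
smart_warnings() {
    awk '
        # Attribute rows have at least 10 fields and a hex flag in field 3.
        NF >= 10 && $3 ~ /^0x/ &&
        ($1 == 5 || $1 == 196 || $1 == 197 || $1 == 198) &&
        $10 + 0 > 0 {
            printf "%s raw=%s\n", $2, $10
        }'
}
```

It could be used as `sudo smartctl --all /dev/sdb | smart_warnings`; no output means none of the watched counters has moved off zero.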
MDADM
Scan
$ mdadm --detail --scan
ARRAY /dev/md/0 metadata=1.2 name=rescue:0 UUID=6e5c4531:3c91df9d:8ab440c4:fc779ac4
ARRAY /dev/md/1 metadata=1.2 name=rescue:1 UUID=4692c051:89d658ca:72ef7264:20f41ab9
Detail before removal
$ mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Wed Jun 3 22:54:50 2015
     Raid Level : raid1
     Array Size : 523968 (511.77 MiB 536.54 MB)
  Used Dev Size : 523968 (511.77 MiB 536.54 MB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Mon Oct 28 03:19:43 2019
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           Name : rescue:0
           UUID : 6e5c4531:3c91df9d:8ab440c4:fc779ac4
         Events : 356

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       2       8       17        1      active sync   /dev/sdb1

$ mdadm --detail /dev/md1
/dev/md1:
        Version : 1.2
  Creation Time : Wed Jun 3 22:54:51 2015
     Raid Level : raid1
     Array Size : 2929608960 (2793.89 GiB 2999.92 GB)
  Used Dev Size : 2929608960 (2793.89 GiB 2999.92 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Mon Oct 28 12:53:01 2019
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           Name : rescue:1
           UUID : 4692c051:89d658ca:72ef7264:20f41ab9
         Events : 248991

    Number   Major   Minor   RaidDevice State
       0       8        2        0      active sync   /dev/sda2
       2       8       18        1      active sync   /dev/sdb2
Fail and remove
$ mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
mdadm: set /dev/sdb1 faulty in /dev/md0
mdadm: hot removed /dev/sdb1 from /dev/md0

$ mdadm /dev/md1 --fail /dev/sdb2 --remove /dev/sdb2
mdadm: set /dev/sdb2 faulty in /dev/md1
mdadm: hot removed /dev/sdb2 from /dev/md1

$ dmesg
[45115503.745119] md/raid1:md0: Disk failure on sdb1, disabling device.
md/raid1:md0: Operation continuing on 1 devices.
[45115503.781346] RAID1 conf printout:
[45115503.781350]  --- wd:1 rd:2
[45115503.781353]  disk 0, wo:0, o:1, dev:sda1
[45115503.781354]  disk 1, wo:1, o:0, dev:sdb1
[45115503.806356] RAID1 conf printout:
[45115503.806359]  --- wd:1 rd:2
[45115503.806361]  disk 0, wo:0, o:1, dev:sda1
[45115503.820337] md: unbind<sdb1>
[45115503.845667] md: export_rdev(sdb1)
[45115514.733880] md/raid1:md1: Disk failure on sdb2, disabling device.
md/raid1:md1: Operation continuing on 1 devices.
[45115514.749460] RAID1 conf printout:
[45115514.749463]  --- wd:1 rd:2
[45115514.749464]  disk 0, wo:0, o:1, dev:sda2
[45115514.749466]  disk 1, wo:1, o:0, dev:sdb2
[45115514.766374] RAID1 conf printout:
[45115514.766378]  --- wd:1 rd:2
[45115514.766380]  disk 0, wo:0, o:1, dev:sda2
[45115514.774490] md: unbind<sdb2>
[45115514.790339] md: export_rdev(sdb2)
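When a drive belongs to several arrays, the fail-and-remove commands can be derived from /proc/mdstat instead of typed by hand. The sketch below only prints the commands so they can be reviewed before running them. The function name plan_removal is hypothetical, and it assumes member names of the form disk name plus partition number (sdb1, sdb2), as in this example.

```shell
# plan_removal DISK: read /proc/mdstat-style text on stdin and print one
# "mdadm --fail --remove" command per array member that lives on DISK.
# Nothing is executed; the output is meant to be reviewed first.
plan_removal() {
    awk -v disk="$1" '
        /^md[0-9]+ :/ {
            for (i = 3; i <= NF; i++) {
                if ($i !~ /\[[0-9]+\]/) continue   # not a member field
                member = $i
                sub(/\[.*/, "", member)            # "sdb2[2]" -> "sdb2"
                if (member ~ ("^" disk "[0-9]+$"))
                    printf "mdadm /dev/%s --fail /dev/%s --remove /dev/%s\n", $1, member, member
            }
        }'
}
```

Running `plan_removal sdb < /proc/mdstat` on the system above would list one command for md0 and one for md1, matching the two commands shown in this section.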
Detail after removal
$ mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Wed Jun 3 22:54:50 2015
     Raid Level : raid1
     Array Size : 523968 (511.77 MiB 536.54 MB)
  Used Dev Size : 523968 (511.77 MiB 536.54 MB)
   Raid Devices : 2
  Total Devices : 1
    Persistence : Superblock is persistent

    Update Time : Mon Oct 28 13:28:28 2019
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           Name : rescue:0
           UUID : 6e5c4531:3c91df9d:8ab440c4:fc779ac4
         Events : 359

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       2       0        0        2      removed

$ mdadm --detail /dev/md1
/dev/md1:
        Version : 1.2
  Creation Time : Wed Jun 3 22:54:51 2015
     Raid Level : raid1
     Array Size : 2929608960 (2793.89 GiB 2999.92 GB)
  Used Dev Size : 2929608960 (2793.89 GiB 2999.92 GB)
   Raid Devices : 2
  Total Devices : 1
    Persistence : Superblock is persistent

    Update Time : Mon Oct 28 13:37:40 2019
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           Name : rescue:1
           UUID : 4692c051:89d658ca:72ef7264:20f41ab9
         Events : 250540

    Number   Major   Minor   RaidDevice State
       0       8        2        0      active sync   /dev/sda2
       2       0        0        2      removed
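The State line is the quickest thing to check in this output. As a small sketch (the function name is_degraded is made up here), the test can be wrapped so that it works in scripts:

```shell
# is_degraded: read `mdadm --detail` output on stdin and succeed (exit 0)
# when the State line mentions "degraded".
is_degraded() {
    grep -Eq '^[[:space:]]*State[[:space:]]*:.*degraded'
}
```

For example, `mdadm --detail /dev/md0 | is_degraded && echo "md0 is degraded"` prints the message after the removal above and stays silent for a clean array.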
Replacement
First erase the data on the damaged drive for privacy, then determine its serial number so that the correct physical drive can be identified.
Replace the drive with an equally sized or larger device.
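The size requirement can be checked before the new drive is partitioned. This is a minimal sketch with a hypothetical helper name; it compares two sizes in bytes, which on Linux can be read with `blockdev --getsize64`.

```shell
# fits_as_replacement OLD_BYTES NEW_BYTES
# Succeeds when the new device is at least as large as the old one.
fits_as_replacement() {
    [ "$2" -ge "$1" ]
}
```

A possible call is `fits_as_replacement "$(blockdev --getsize64 /dev/sda)" "$(blockdev --getsize64 /dev/sdb)" || echo "new drive is too small"`, with /dev/sda as the surviving mirror half and /dev/sdb as the new drive.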
Erase data
Write zeros to all sectors.
$ dd if=/dev/zero of=/dev/sdb bs=4M
dd: error writing ‘/dev/sdb’: No space left on device
715398+0 records in
715397+0 records out
3000592982016 bytes (3.0 TB) copied, 20639 s, 145 MB/s
The above took about 5 hours 44 minutes (20639 s).
SMART
Determine serial number and check status.
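Zeroing the drive has triggered new sector failures: Reallocated_Sector_Ct rose from 357 to 447, and Current_Pending_Sector and Offline_Uncorrectable now both report 16.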
$ smartctl --all /dev/sdb | less
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-5-amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital RE4 (SATA 6Gb/s)
Device Model:     WDC WD3000FYYZ-01UL1B2
Serial Number:    WD-WCC134KLD74Z
LU WWN Device Id: 5 0014ee 2b5e68c42
Firmware Version: 01.01K03
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Tue Oct 29 02:00:42 2019 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (34320) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 372) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x70bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   152   152   021    Pre-fail  Always       -       11383
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       7
  5 Reallocated_Sector_Ct   0x0033   186   186   140    Pre-fail  Always       -       447
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   068   068   000    Old_age   Always       -       23396
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       7
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       3
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       3
194 Temperature_Celsius     0x0022   103   096   000    Old_age   Always       -       49
196 Reallocated_Event_Count 0x0032   194   194   000    Old_age   Always       -       6
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       16
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       16
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       15

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     23383         -
# 2  Short offline       Completed without error       00%     23325         -
# 3  Short offline       Completed without error       00%     23300         -
# 4  Short offline       Completed without error       00%     23180         -
# 5  Short offline       Completed without error       00%     23108         -
# 6  Short offline       Completed without error       00%     22989         -
# 7  Short offline       Completed without error       00%     22944         -
# 8  Extended offline    Completed without error       00%     22927         -
# 9  Short offline       Aborted by host               10%     22901         -
#10  Short offline       Completed without error       00%     22892         -
#11  Short offline       Completed without error       00%     22821         -
#12  Short offline       Completed without error       00%     22797         -
#13  Short offline       Completed without error       00%     22677         -
#14  Short offline       Completed without error       00%     22653         -
#15  Short offline       Completed without error       00%     22629         -
#16  Short offline       Completed without error       00%     22605         -
#17  Short offline       Completed without error       00%     22581         -
#18  Short offline       Completed without error       00%     22533         -
#19  Short offline       Completed without error       00%     22461         -
#20  Short offline       Completed without error       00%     22393         -
#21  Short offline       Completed without error       00%     22293         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Add new drive
Copy partition table
Copy sda to sdb using a backup file
sgdisk -b sda.bak /dev/sda
sgdisk -l sda.bak /dev/sdb
Copy directly
The argument to -R is the destination; the source disk comes last.
sgdisk -R /dev/sdb /dev/sda
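Add unique ID
Randomize the disk and partition GUIDs on the copy so that they do not clash with those of /dev/sda.
sgdisk -G /dev/sdb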
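Before adding the new partitions to the arrays, it is worth confirming that both disks now have the same partition boundaries. The sketch below compares two saved `fdisk -l` listings; the function names layout and same_layout are hypothetical, and the parsing assumes the GPT listing format shown in the PARTITIONS section (device, start, end, sectors).

```shell
# layout: read `fdisk -l DEVICE` output on stdin and print
# "start end sectors" for each partition, sorted by start sector.
layout() {
    awk '$1 ~ /^\/dev\// { print $2, $3, $4 }' | sort -n
}

# same_layout FILE_A FILE_B: succeed when both saved listings describe
# the same partition boundaries, regardless of device name or row order.
same_layout() {
    [ "$(layout < "$1")" = "$(layout < "$2")" ]
}
```

One way to use it is to save the listings with `fdisk -l /dev/sda > sda.txt` and `fdisk -l /dev/sdb > sdb.txt`, then run `same_layout sda.txt sdb.txt && echo "layouts match"`.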
Add to array
mdadm --manage /dev/md0 --add /dev/sdb1
mdadm --manage /dev/md1 --add /dev/sdb2
Status
$ cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb2[2] sda2[0]
      2929608960 blocks super 1.2 [2/1] [U_]
      [>....................]  recovery =  0.2% (7099008/2929608960) finish=742.6min speed=65584K/sec

md0 : active raid1 sdb1[2] sda1[0]
      523968 blocks super 1.2 [2/2] [UU]

unused devices: <none>
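The rebuild of the large array takes hours, so a one-line summary is handy for polling. The following is a sketch under the assumption that /proc/mdstat uses the layout shown above; the function name recovery_progress is made up for illustration.

```shell
# recovery_progress: read /proc/mdstat-style text on stdin and print one
# line per array that is currently rebuilding: "<array> <percent> <finish>".
recovery_progress() {
    awk '
        /^md[0-9]+ :/ { name = $1 }          # remember the current array
        /(recovery|resync) *=/ {
            pct = ""; fin = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /%$/)       pct = $i
                if ($i ~ /^finish=/) fin = $i
            }
            print name, pct, fin
        }'
}
```

Called as `recovery_progress < /proc/mdstat`, it prints nothing once every array is back in sync. To simply block until a rebuild has finished, `mdadm --wait /dev/md1` is an alternative.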
dmesg
[126629.165345] md: bind<sdb1>
[126629.213425] RAID1 conf printout:
[126629.213428]  --- wd:1 rd:2
[126629.213430]  disk 0, wo:0, o:1, dev:sda1
[126629.213432]  disk 1, wo:1, o:1, dev:sdb1
[126629.213580] md: recovery of RAID array md0
[126629.213606] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
[126629.213624] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
[126629.213659] md: using 128k window, over a total of 523968k.
[126636.136690] md: md0: recovery done.
[126636.247571] RAID1 conf printout:
[126636.247574]  --- wd:2 rd:2
[126636.247577]  disk 0, wo:0, o:1, dev:sda1
[126636.247578]  disk 1, wo:0, o:1, dev:sdb1
[126715.311174] md: bind<sdb2>
[126715.407833] RAID1 conf printout:
[126715.407836]  --- wd:1 rd:2
[126715.407838]  disk 0, wo:0, o:1, dev:sda2
[126715.407839]  disk 1, wo:1, o:1, dev:sdb2
[126715.407949] md: recovery of RAID array md1
[126715.407971] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
[126715.407989] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
[126715.408017] md: using 128k window, over a total of 2929608960k.
...
Bootloader
grub-install /dev/sdb
Off-site
mdadm --assemble /dev/mdX /dev/sdX