====Mirror replacement====

This is a generic example of replacing a drive in a RAID 1 mirror. The partition layout and the number of RAID devices can vary.

===PARTITIONS===

  $ fdisk -l /dev/sda
  Disk /dev/sda: 2.7 TiB, 3000592982016 bytes, 5860533168 sectors
  Units: sectors of 1 * 512 = 512 bytes
  Sector size (logical/physical): 512 bytes / 4096 bytes
  I/O size (minimum/optimal): 4096 bytes / 4096 bytes
  Disklabel type: gpt
  Disk identifier: 9B5F217F-3F30-4564-A171-E5B4AD5C208F
  Device       Start        End    Sectors  Size Type
  /dev/sda1     4096    1052671    1048576  512M Linux RAID
  /dev/sda2  1052672 5860533134 5859480463  2.7T Linux RAID
  /dev/sda3     2048       4095       2048    1M BIOS boot
  Partition table entries are not in disk order.

  $ fdisk -l /dev/sdb
  Disk /dev/sdb: 2.7 TiB, 3000592982016 bytes, 5860533168 sectors
  Units: sectors of 1 * 512 = 512 bytes
  Sector size (logical/physical): 512 bytes / 512 bytes
  I/O size (minimum/optimal): 512 bytes / 512 bytes
  Disklabel type: gpt
  Disk identifier: BCCDC02E-470A-45B6-B980-DA44E29FEC14
  Device       Start        End    Sectors  Size Type
  /dev/sdb1     4096    1052671    1048576  512M Linux RAID
  /dev/sdb2  1052672 5860533134 5859480463  2.7T Linux RAID
  /dev/sdb3     2048       4095       2048    1M BIOS boot
  Partition table entries are not in disk order.

===SMART===

SMART data, read with Smartmontools, can reveal the imminent failure of a drive. Smartmontools can also be configured to send notifications.
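A minimal sketch of such a notification setup, assuming the smartd daemon from Smartmontools reads its configuration from a file such as /etc/smartd.conf and that local mail delivery works. The helper name `write_smartd_conf`, the recipient, and the self-test schedule are illustrative assumptions, not settings taken from this system.

```shell
#!/bin/sh
# Hypothetical helper: write a one-line smartd configuration.
#   $1 = path of the config file to write
#   $2 = mail recipient for warnings
write_smartd_conf() {
    conf_path=$1
    mail_to=$2
    cat > "$conf_path" <<EOF
# DEVICESCAN: monitor every drive smartd can find.
# -a: track all SMART attributes plus the error and self-test logs.
# -m: address that receives warning mails.
# -s: short self-test daily at 02:00, long self-test on Saturdays at 03:00.
DEVICESCAN -a -m $mail_to -s (S/../.././02|L/../../6/03)
EOF
}

# Example (as root), followed by a restart of smartd:
#   write_smartd_conf /etc/smartd.conf root
```

Adding `-M test` to the directive line makes smartd send a test mail on startup, which is a quick way to confirm that delivery works.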
  $ sudo smartctl --all /dev/sdb
  smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-5-amd64] (local build)
  Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
  === START OF INFORMATION SECTION ===
  Model Family:     Western Digital RE4 (SATA 6Gb/s)
  Device Model:     WDC WD3000FYYZ-01UL1B2
  Serial Number:    WD-WCC134KLD74Z
  LU WWN Device Id: 5 0014ee 2b5e68c42
  Firmware Version: 01.01K03
  User Capacity:    3,000,592,982,016 bytes [3.00 TB]
  Sector Size:      512 bytes logical/physical
  Rotation Rate:    7200 rpm
  Device is:        In smartctl database [for details use: -P show]
  ATA Version is:   ATA8-ACS (minor revision not indicated)
  SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
  Local Time is:    Mon Oct 28 10:13:11 2019 GMT
  SMART support is: Available - device has SMART capability.
  SMART support is: Enabled
  === START OF READ SMART DATA SECTION ===
  SMART overall-health self-assessment test result: PASSED
  General SMART Values:
  Offline data collection status:  (0x85) Offline data collection activity
                                          was aborted by an interrupting command from host.
                                          Auto Offline Data Collection: Enabled.
  Self-test execution status:      ( 241) Self-test routine in progress...
                                          10% of test remaining.
  Total time to complete Offline
  data collection:                (34320) seconds.
  Offline data collection
  capabilities:                    (0x7b) SMART execute Offline immediate.
                                          Auto Offline data collection on/off support.
                                          Suspend Offline collection upon new
                                          command.
                                          Offline surface scan supported.
                                          Self-test supported.
                                          Conveyance Self-test supported.
                                          Selective Self-test supported.
  SMART capabilities:            (0x0003) Saves SMART data before entering
                                          power-saving mode.
                                          Supports SMART auto save timer.
  Error logging capability:        (0x01) Error logging supported.
                                          General Purpose Logging supported.
  Short self-test routine
  recommended polling time:        (   2) minutes.
  Extended self-test routine
  recommended polling time:        ( 372) minutes.
  Conveyance self-test routine
  recommended polling time:        (   5) minutes.
  SCT capabilities:              (0x70bd) SCT Status supported.
                                          SCT Error Recovery Control supported.
                                          SCT Feature Control supported.
                                          SCT Data Table supported.
  SMART Attributes Data Structure revision number: 16
  Vendor Specific SMART Attributes with Thresholds:
  ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
    1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
    3 Spin_Up_Time            0x0027   152   152   021    Pre-fail  Always       -       11383
    4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       7
    5 Reallocated_Sector_Ct   0x0033   189   189   140    Pre-fail  Always       -       357
    7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
    9 Power_On_Hours          0x0032   068   068   000    Old_age   Always       -       23380
   10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
   11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
   12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       7
  183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
  192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       3
  193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       3
  194 Temperature_Celsius     0x0022   099   096   000    Old_age   Always       -       53
  196 Reallocated_Event_Count 0x0032   194   194   000    Old_age   Always       -       6
  197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
  198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
  199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
  200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
  SMART Error Log Version: 1
  No Errors Logged
  SMART Self-test log structure revision number 1
  Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
  # 1  Short offline       Completed without error       00%     23325         -
  # 2  Short offline       Completed without error       00%     23300         -
  # 3  Short offline       Completed without error       00%     23180         -
  # 4  Short offline       Completed without error       00%     23108         -
  # 5  Short offline       Completed without error       00%     22989         -
  # 6  Short offline       Completed without error       00%     22944         -
  # 7  Extended offline    Completed without error       00%     22927         -
  # 8  Short offline       Aborted by host               10%     22901         -
  # 9  Short offline       Completed without error       00%     22892         -
  #10  Short offline       Completed without error       00%     22821         -
  #11  Short offline       Completed without error       00%     22797         -
  #12  Short offline       Completed without error       00%     22677         -
  #13  Short offline       Completed without error       00%     22653         -
  #14  Short offline       Completed without error       00%     22629         -
  #15  Short offline       Completed without error       00%     22605         -
  #16  Short offline       Completed without error       00%     22581         -
  #17  Short offline       Completed without error       00%     22533         -
  #18  Short offline       Completed without error       00%     22461         -
  #19  Short offline       Completed without error       00%     22393         -
  #20  Short offline       Completed without error       00%     22293         -
  #21  Short offline       Completed without error       00%     22221         -
  SMART Selective self-test log data structure revision number 1
   SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
      1        0        0  Not_testing
      2        0        0  Not_testing
      3        0        0  Not_testing
      4        0        0  Not_testing
      5        0        0  Not_testing
  Selective self-test flags (0x0):
    After scanning selected spans, do NOT read-scan remainder of disk.
  If Selective self-test is pending on power-up, resume after 0 minute delay.

The drive above has bad sectors, as the reallocation counters show (Reallocated_Sector_Ct raw value 357, Reallocated_Event_Count raw value 6). Although the drive still works and the overall health check reports PASSED, it probably will not last, so it needs to be replaced.
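Reading the full report by eye is tedious, so the counters that matter here can be filtered out. The sketch below assumes the attribute table layout shown above (ID in column 1, name in column 2, raw value in column 10). The helper name `smart_bad_attrs` and the choice of attributes 5, 196, 197 and 198 as the ones to watch are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read `smartctl -A` output on stdin and print
# "ID NAME RAW_VALUE" for media-health attributes whose raw value is non-zero.
smart_bad_attrs() {
    # Column 3 must look like a flag (0x....) so that only attribute rows match.
    awk '$1 ~ /^(5|196|197|198)$/ && $3 ~ /^0x/ && $10 + 0 > 0 { print $1, $2, $10 }'
}

# Example (needs root):
#   smartctl -A /dev/sdb | smart_bad_attrs
```

Fed the report above, this would print the Reallocated_Sector_Ct and Reallocated_Event_Count rows. A healthy drive should produce no output.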
===MDADM===

==Scan==

  $ mdadm --detail --scan
  ARRAY /dev/md/0 metadata=1.2 name=rescue:0 UUID=6e5c4531:3c91df9d:8ab440c4:fc779ac4
  ARRAY /dev/md/1 metadata=1.2 name=rescue:1 UUID=4692c051:89d658ca:72ef7264:20f41ab9

==Detail before removal==

  $ mdadm --detail /dev/md0
  /dev/md0:
          Version : 1.2
    Creation Time : Wed Jun 3 22:54:50 2015
       Raid Level : raid1
       Array Size : 523968 (511.77 MiB 536.54 MB)
    Used Dev Size : 523968 (511.77 MiB 536.54 MB)
     Raid Devices : 2
    Total Devices : 2
      Persistence : Superblock is persistent
      Update Time : Mon Oct 28 03:19:43 2019
            State : clean
   Active Devices : 2
  Working Devices : 2
   Failed Devices : 0
    Spare Devices : 0
             Name : rescue:0
             UUID : 6e5c4531:3c91df9d:8ab440c4:fc779ac4
           Events : 356
      Number   Major   Minor   RaidDevice State
         0       8        1        0      active sync   /dev/sda1
         2       8       17        1      active sync   /dev/sdb1

  $ mdadm --detail /dev/md1
  /dev/md1:
          Version : 1.2
    Creation Time : Wed Jun 3 22:54:51 2015
       Raid Level : raid1
       Array Size : 2929608960 (2793.89 GiB 2999.92 GB)
    Used Dev Size : 2929608960 (2793.89 GiB 2999.92 GB)
     Raid Devices : 2
    Total Devices : 2
      Persistence : Superblock is persistent
      Update Time : Mon Oct 28 12:53:01 2019
            State : clean
   Active Devices : 2
  Working Devices : 2
   Failed Devices : 0
    Spare Devices : 0
             Name : rescue:1
             UUID : 4692c051:89d658ca:72ef7264:20f41ab9
           Events : 248991
      Number   Major   Minor   RaidDevice State
         0       8        2        0      active sync   /dev/sda2
         2       8       18        1      active sync   /dev/sdb2

==Fail and remove==

  $ mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
  mdadm: set /dev/sdb1 faulty in /dev/md0
  mdadm: hot removed /dev/sdb1 from /dev/md0

  $ mdadm /dev/md1 --fail /dev/sdb2 --remove /dev/sdb2
  mdadm: set /dev/sdb2 faulty in /dev/md1
  mdadm: hot removed /dev/sdb2 from /dev/md1

  $ dmesg
  [45115503.745119] md/raid1:md0: Disk failure on sdb1, disabling device.
  md/raid1:md0: Operation continuing on 1 devices.
  [45115503.781346] RAID1 conf printout:
  [45115503.781350]  --- wd:1 rd:2
  [45115503.781353]  disk 0, wo:0, o:1, dev:sda1
  [45115503.781354]  disk 1, wo:1, o:0, dev:sdb1
  [45115503.806356] RAID1 conf printout:
  [45115503.806359]  --- wd:1 rd:2
  [45115503.806361]  disk 0, wo:0, o:1, dev:sda1
  [45115503.820337] md: unbind<sdb1>
  [45115503.845667] md: export_rdev(sdb1)
  [45115514.733880] md/raid1:md1: Disk failure on sdb2, disabling device.
  md/raid1:md1: Operation continuing on 1 devices.
  [45115514.749460] RAID1 conf printout:
  [45115514.749463]  --- wd:1 rd:2
  [45115514.749464]  disk 0, wo:0, o:1, dev:sda2
  [45115514.749466]  disk 1, wo:1, o:0, dev:sdb2
  [45115514.766374] RAID1 conf printout:
  [45115514.766378]  --- wd:1 rd:2
  [45115514.766380]  disk 0, wo:0, o:1, dev:sda2
  [45115514.774490] md: unbind<sdb2>
  [45115514.790339] md: export_rdev(sdb2)

==Detail after removal==

  $ mdadm --detail /dev/md0
  /dev/md0:
          Version : 1.2
    Creation Time : Wed Jun 3 22:54:50 2015
       Raid Level : raid1
       Array Size : 523968 (511.77 MiB 536.54 MB)
    Used Dev Size : 523968 (511.77 MiB 536.54 MB)
     Raid Devices : 2
    Total Devices : 1
      Persistence : Superblock is persistent
      Update Time : Mon Oct 28 13:28:28 2019
            State : clean, degraded
   Active Devices : 1
  Working Devices : 1
   Failed Devices : 0
    Spare Devices : 0
             Name : rescue:0
             UUID : 6e5c4531:3c91df9d:8ab440c4:fc779ac4
           Events : 359
      Number   Major   Minor   RaidDevice State
         0       8        1        0      active sync   /dev/sda1
         2       0        0        2      removed

  $ mdadm --detail /dev/md1
  /dev/md1:
          Version : 1.2
    Creation Time : Wed Jun 3 22:54:51 2015
       Raid Level : raid1
       Array Size : 2929608960 (2793.89 GiB 2999.92 GB)
    Used Dev Size : 2929608960 (2793.89 GiB 2999.92 GB)
     Raid Devices : 2
    Total Devices : 1
      Persistence : Superblock is persistent
      Update Time : Mon Oct 28 13:37:40 2019
            State : clean, degraded
   Active Devices : 1
  Working Devices : 1
   Failed Devices : 0
    Spare Devices : 0
             Name : rescue:1
             UUID : 4692c051:89d658ca:72ef7264:20f41ab9
           Events : 250540
      Number   Major   Minor   RaidDevice State
         0       8        2        0      active sync   /dev/sda2
         2       0        0        2      removed
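Before the drive is wiped or pulled, it may be worth confirming that no array still lists one of its partitions. The sketch below assumes the `mdadm --detail` layout shown above, where the member device is the last field of its row. The helper name `md_has_member` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if device $1 is still listed as a
# member in the `mdadm --detail` output read from stdin, fail otherwise.
md_has_member() {
    awk -v dev="$1" '$NF == dev { found = 1 } END { exit !found }'
}

# Example (needs root): warn if sdb1 is still part of md0.
#   mdadm --detail /dev/md0 | md_has_member /dev/sdb1 && echo "sdb1 still in md0"
```

Against the "Detail after removal" output above, the check fails for /dev/sdb1 and /dev/sdb2, which is the desired result.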
===Replacement===

First erase the data on the damaged drive for privacy, then determine the serial number of the drive so the correct physical unit can be identified. Replace the drive with a device of equal or larger size.

==Erase data==

Write zeros to all sectors. The "No space left on device" error is expected: dd stops when it reaches the end of the device.

  $ dd if=/dev/zero of=/dev/sdb bs=4M
  dd: error writing ‘/dev/sdb’: No space left on device
  715398+0 records in
  715397+0 records out
  3000592982016 bytes (3.0 TB) copied, 20639 s, 145 MB/s

The above took about 5 hours 44 minutes (20639 s).

==SMART==

Determine the serial number and check the status. Zeroing the drive has triggered new sector failures: Reallocated_Sector_Ct rose from 357 to 447, and 16 sectors are now pending.

  $ smartctl --all /dev/sdb | less
  smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-5-amd64] (local build)
  Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
  === START OF INFORMATION SECTION ===
  Model Family:     Western Digital RE4 (SATA 6Gb/s)
  Device Model:     WDC WD3000FYYZ-01UL1B2
  Serial Number:    WD-WCC134KLD74Z
  LU WWN Device Id: 5 0014ee 2b5e68c42
  Firmware Version: 01.01K03
  User Capacity:    3,000,592,982,016 bytes [3.00 TB]
  Sector Size:      512 bytes logical/physical
  Rotation Rate:    7200 rpm
  Device is:        In smartctl database [for details use: -P show]
  ATA Version is:   ATA8-ACS (minor revision not indicated)
  SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
  Local Time is:    Tue Oct 29 02:00:42 2019 GMT
  SMART support is: Available - device has SMART capability.
  SMART support is: Enabled
  === START OF READ SMART DATA SECTION ===
  SMART overall-health self-assessment test result: PASSED
  General SMART Values:
  Offline data collection status:  (0x82) Offline data collection activity
                                          was completed without error.
                                          Auto Offline Data Collection: Enabled.
  Self-test execution status:      (   0) The previous self-test routine completed
                                          without error or no self-test has ever
                                          been run.
  Total time to complete Offline
  data collection:                (34320) seconds.
  Offline data collection
  capabilities:                    (0x7b) SMART execute Offline immediate.
                                          Auto Offline data collection on/off support.
                                          Suspend Offline collection upon new
                                          command.
                                          Offline surface scan supported.
                                          Self-test supported.
                                          Conveyance Self-test supported.
                                          Selective Self-test supported.
  SMART capabilities:            (0x0003) Saves SMART data before entering
                                          power-saving mode.
                                          Supports SMART auto save timer.
  Error logging capability:        (0x01) Error logging supported.
                                          General Purpose Logging supported.
  Short self-test routine
  recommended polling time:        (   2) minutes.
  Extended self-test routine
  recommended polling time:        ( 372) minutes.
  Conveyance self-test routine
  recommended polling time:        (   5) minutes.
  SCT capabilities:              (0x70bd) SCT Status supported.
                                          SCT Error Recovery Control supported.
                                          SCT Feature Control supported.
                                          SCT Data Table supported.
  SMART Attributes Data Structure revision number: 16
  Vendor Specific SMART Attributes with Thresholds:
  ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
    1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
    3 Spin_Up_Time            0x0027   152   152   021    Pre-fail  Always       -       11383
    4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       7
    5 Reallocated_Sector_Ct   0x0033   186   186   140    Pre-fail  Always       -       447
    7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
    9 Power_On_Hours          0x0032   068   068   000    Old_age   Always       -       23396
   10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
   11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
   12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       7
  183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
  192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       3
  193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       3
  194 Temperature_Celsius     0x0022   103   096   000    Old_age   Always       -       49
  196 Reallocated_Event_Count 0x0032   194   194   000    Old_age   Always       -       6
  197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       16
  198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       16
  199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
  200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       15
  SMART Error Log Version: 1
  No Errors Logged
  SMART Self-test log structure revision number 1
  Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
  # 1  Short offline       Completed without error       00%     23383         -
  # 2  Short offline       Completed without error       00%     23325         -
  # 3  Short offline       Completed without error       00%     23300         -
  # 4  Short offline       Completed without error       00%     23180         -
  # 5  Short offline       Completed without error       00%     23108         -
  # 6  Short offline       Completed without error       00%     22989         -
  # 7  Short offline       Completed without error       00%     22944         -
  # 8  Extended offline    Completed without error       00%     22927         -
  # 9  Short offline       Aborted by host               10%     22901         -
  #10  Short offline       Completed without error       00%     22892         -
  #11  Short offline       Completed without error       00%     22821         -
  #12  Short offline       Completed without error       00%     22797         -
  #13  Short offline       Completed without error       00%     22677         -
  #14  Short offline       Completed without error       00%     22653         -
  #15  Short offline       Completed without error       00%     22629         -
  #16  Short offline       Completed without error       00%     22605         -
  #17  Short offline       Completed without error       00%     22581         -
  #18  Short offline       Completed without error       00%     22533         -
  #19  Short offline       Completed without error       00%     22461         -
  #20  Short offline       Completed without error       00%     22393         -
  #21  Short offline       Completed without error       00%     22293         -
  SMART Selective self-test log data structure revision number 1
   SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
      1        0        0  Not_testing
      2        0        0  Not_testing
      3        0        0  Not_testing
      4        0        0  Not_testing
      5        0        0  Not_testing
  Selective self-test flags (0x0):
    After scanning selected spans, do NOT read-scan remainder of disk.
  If Selective self-test is pending on power-up, resume after 0 minute delay.
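Spotting what changed between the report taken before the wipe and the one taken after it is easier with a small comparison. The sketch below assumes both reports were saved to files (the file names in the example are hypothetical) and that attribute rows carry the raw value in column 10, as in the tables above. The helper name `smart_raw_diff` is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: compare two saved smartctl reports and print every
# attribute whose raw value differs, as "NAME: before -> after".
#   $1 = report taken before, $2 = report taken after
smart_raw_diff() {
    awk '
        # Attribute rows start with a numeric ID and carry a 0x.... flag in column 3.
        !($1 ~ /^[0-9]+$/ && $3 ~ /^0x/) { next }
        # First file: remember the raw value of each attribute ID.
        FILENAME == ARGV[1] { before[$1] = $10; next }
        # Second file: report attributes whose raw value changed.
        ($1 in before) && before[$1] != $10 { print $2 ": " before[$1] " -> " $10 }
    ' "$1" "$2"
}

# Example (needs root):
#   smartctl -A /dev/sdb > smart-before.txt
#   ... wipe the drive ...
#   smartctl -A /dev/sdb > smart-after.txt
#   smart_raw_diff smart-before.txt smart-after.txt
```

For the two reports on this page the list would include Reallocated_Sector_Ct (357 to 447) and Current_Pending_Sector (0 to 16), along with harmless changes such as Power_On_Hours.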
==Add new drive==

Copy the partition table from sda to sdb, either by way of a backup file:

  $ sgdisk -b sda.bak /dev/sda
  $ sgdisk -l sda.bak /dev/sdb

or directly. Note that the target (/dev/sdb) is the argument to -R and the source (/dev/sda) comes last:

  $ sgdisk -R /dev/sdb /dev/sda

Give the new drive unique GUIDs:

  $ sgdisk -G /dev/sdb

Add the new partitions to the arrays:

  $ mdadm --manage /dev/md0 --add /dev/sdb1
  $ mdadm --manage /dev/md1 --add /dev/sdb2

Check the status:

  $ cat /proc/mdstat
  Personalities : [raid1]
  md1 : active raid1 sdb2[2] sda2[0]
        2929608960 blocks super 1.2 [2/1] [U_]
        [>....................]  recovery =  0.2% (7099008/2929608960) finish=742.6min speed=65584K/sec
  md0 : active raid1 sdb1[2] sda1[0]
        523968 blocks super 1.2 [2/2] [UU]
  unused devices: <none>

  $ dmesg
  [126629.165345] md: bind<sdb1>
  [126629.213425] RAID1 conf printout:
  [126629.213428]  --- wd:1 rd:2
  [126629.213430]  disk 0, wo:0, o:1, dev:sda1
  [126629.213432]  disk 1, wo:1, o:1, dev:sdb1
  [126629.213580] md: recovery of RAID array md0
  [126629.213606] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
  [126629.213624] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
  [126629.213659] md: using 128k window, over a total of 523968k.
  [126636.136690] md: md0: recovery done.
  [126636.247571] RAID1 conf printout:
  [126636.247574]  --- wd:2 rd:2
  [126636.247577]  disk 0, wo:0, o:1, dev:sda1
  [126636.247578]  disk 1, wo:0, o:1, dev:sdb1
  [126715.311174] md: bind<sdb2>
  [126715.407833] RAID1 conf printout:
  [126715.407836]  --- wd:1 rd:2
  [126715.407838]  disk 0, wo:0, o:1, dev:sda2
  [126715.407839]  disk 1, wo:1, o:1, dev:sdb2
  [126715.407949] md: recovery of RAID array md1
  [126715.407971] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
  [126715.407989] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
  [126715.408017] md: using 128k window, over a total of 2929608960k.
  ...

Install the bootloader on the new drive:

  $ grub-install /dev/sdb

=== Off-site ===

  $ mdadm --assemble /dev/mdX /dev/sdX
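The page gives no detail for this step. Assuming it refers to bringing up a single mirror member on another machine, a cautious sequence might look like the sketch below. The helper name `offsite_assemble_cmds` is hypothetical, the device names are placeholders as in the command above, and the helper only prints the commands so they can be reviewed before being run as root.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) the commands for bringing up a
# single RAID 1 member on another machine.
#   $1 = md device to assemble, $2 = member partition
offsite_assemble_cmds() {
    # Inspect the md superblock first to confirm the array name and UUID.
    printf 'mdadm --examine %s\n' "$2"
    # --run starts the array although the other member is missing;
    # --readonly prevents writes and resync while the data is inspected.
    printf 'mdadm --assemble --run --readonly %s %s\n' "$1" "$2"
}

# Example:
#   offsite_assemble_cmds /dev/mdX /dev/sdX
```

Starting the array read-only is a design choice made here to keep the member unmodified, so that it can still be compared with, or re-added to, the original array later.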