Mirror replacement
This is a generic example of replacing a drive in a RAID 1 mirror.
The partition layout and the number of RAID devices can vary.
PARTITIONS
$ fdisk -l /dev/sda
Disk /dev/sda: 2.7 TiB, 3000592982016 bytes, 5860533168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 9B5F217F-3F30-4564-A171-E5B4AD5C208F
Device Start End Sectors Size Type
/dev/sda1 4096 1052671 1048576 512M Linux RAID
/dev/sda2 1052672 5860533134 5859480463 2.7T Linux RAID
/dev/sda3 2048 4095 2048 1M BIOS boot
Partition table entries are not in disk order.
$ fdisk -l /dev/sdb
Disk /dev/sdb: 2.7 TiB, 3000592982016 bytes, 5860533168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: BCCDC02E-470A-45B6-B980-DA44E29FEC14
Device Start End Sectors Size Type
/dev/sdb1 4096 1052671 1048576 512M Linux RAID
/dev/sdb2 1052672 5860533134 5859480463 2.7T Linux RAID
/dev/sdb3 2048 4095 2048 1M BIOS boot
Partition table entries are not in disk order.
SMART
SMART data, read with Smartmontools, can reveal the imminent failure of a drive.
Smartmontools can also be configured to send notifications.
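As a rough sketch of such a notification setup: smartd, the daemon shipped with Smartmontools, reads its directives from smartd.conf (commonly /etc/smartd.conf, though the path varies by distribution). The helper below only prints a candidate line. The function name, the mail address and the self-test schedule are illustrative assumptions, not part of the setup described here.

```shell
#!/bin/sh
# Print a candidate smartd.conf line for one drive.
#   -a  monitor health status, attributes and error logs
#   -m  mail warnings to the given address
#   -s  self-test schedule: short test daily at 02:00, long test Saturdays at 03:00
# The address and the schedule are placeholders chosen for illustration.
smartd_conf_line() {
    printf '%s -a -m %s -s %s\n' "$1" "$2" '(S/../.././02|L/../../6/03)'
}

# Example: review the line, then add it to smartd.conf by hand.
smartd_conf_line /dev/sdb admin@example.com
```

smartd has to be restarted or reloaded before a changed configuration takes effect.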
$ sudo smartctl --all /dev/sdb
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-5-amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Western Digital RE4 (SATA 6Gb/s)
Device Model: WDC WD3000FYYZ-01UL1B2
Serial Number: WD-WCC134KLD74Z
LU WWN Device Id: 5 0014ee 2b5e68c42
Firmware Version: 01.01K03
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 7200 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Mon Oct 28 10:13:11 2019 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x85) Offline data collection activity
was aborted by an interrupting command from host.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 241) Self-test routine in progress...
10% of test remaining.
Total time to complete Offline
data collection: (34320) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 372) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x70bd) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 152 152 021 Pre-fail Always - 11383
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 7
5 Reallocated_Sector_Ct 0x0033 189 189 140 Pre-fail Always - 357
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 068 068 000 Old_age Always - 23380
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 7
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 3
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 3
194 Temperature_Celsius 0x0022 099 096 000 Old_age Always - 53
196 Reallocated_Event_Count 0x0032 194 194 000 Old_age Always - 6
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 23325 -
# 2 Short offline Completed without error 00% 23300 -
# 3 Short offline Completed without error 00% 23180 -
# 4 Short offline Completed without error 00% 23108 -
# 5 Short offline Completed without error 00% 22989 -
# 6 Short offline Completed without error 00% 22944 -
# 7 Extended offline Completed without error 00% 22927 -
# 8 Short offline Aborted by host 10% 22901 -
# 9 Short offline Completed without error 00% 22892 -
#10 Short offline Completed without error 00% 22821 -
#11 Short offline Completed without error 00% 22797 -
#12 Short offline Completed without error 00% 22677 -
#13 Short offline Completed without error 00% 22653 -
#14 Short offline Completed without error 00% 22629 -
#15 Short offline Completed without error 00% 22605 -
#16 Short offline Completed without error 00% 22581 -
#17 Short offline Completed without error 00% 22533 -
#18 Short offline Completed without error 00% 22461 -
#19 Short offline Completed without error 00% 22393 -
#20 Short offline Completed without error 00% 22293 -
#21 Short offline Completed without error 00% 22221 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
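The reallocation counters show that this drive has bad sectors: Reallocated_Sector_Ct has a raw value of 357 and Reallocated_Event_Count a raw value of 6.
Although the drive still works and passes the overall health check, it will probably not last, so it needs to be replaced.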
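Picking the relevant counters out of the long report by eye is error-prone. The following is a minimal sketch that filters the attribute table for the sector-health attributes (IDs 5, 196, 197 and 198) and prints those with a non-zero raw value. The function name is hypothetical, and the choice of IDs is an assumption about which counters matter most.

```shell
#!/bin/sh
# Read `smartctl --attributes` (or --all) output on stdin and print every
# sector-health attribute whose raw value is non-zero, as NAME=RAW.
# IDs watched: 5, 196, 197, 198 (reallocated, pending, uncorrectable).
smart_sector_warnings() {
    awk '
        # Attribute rows have 10 columns and a hex flag in column 3.
        NF >= 10 && $3 ~ /^0x/ &&
        ($1 == 5 || $1 == 196 || $1 == 197 || $1 == 198) {
            if ($10 + 0 > 0) print $2 "=" $10
        }
    '
}

# Usage on a live system (needs root):
#   smartctl --attributes /dev/sdb | smart_sector_warnings
```

An empty result means none of the watched counters has moved. It does not prove the drive is healthy.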
MDADM
Scan
$ mdadm --detail --scan
ARRAY /dev/md/0 metadata=1.2 name=rescue:0 UUID=6e5c4531:3c91df9d:8ab440c4:fc779ac4
ARRAY /dev/md/1 metadata=1.2 name=rescue:1 UUID=4692c051:89d658ca:72ef7264:20f41ab9
Detail before removal
$ mdadm --detail /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Wed Jun 3 22:54:50 2015
Raid Level : raid1
Array Size : 523968 (511.77 MiB 536.54 MB)
Used Dev Size : 523968 (511.77 MiB 536.54 MB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Update Time : Mon Oct 28 03:19:43 2019
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Name : rescue:0
UUID : 6e5c4531:3c91df9d:8ab440c4:fc779ac4
Events : 356
Number Major Minor RaidDevice State
0 8 1 0 active sync /dev/sda1
2 8 17 1 active sync /dev/sdb1
$ mdadm --detail /dev/md1
/dev/md1:
Version : 1.2
Creation Time : Wed Jun 3 22:54:51 2015
Raid Level : raid1
Array Size : 2929608960 (2793.89 GiB 2999.92 GB)
Used Dev Size : 2929608960 (2793.89 GiB 2999.92 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Update Time : Mon Oct 28 12:53:01 2019
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Name : rescue:1
UUID : 4692c051:89d658ca:72ef7264:20f41ab9
Events : 248991
Number Major Minor RaidDevice State
0 8 2 0 active sync /dev/sda2
2 8 18 1 active sync /dev/sdb2
Fail and remove
$ mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
mdadm: set /dev/sdb1 faulty in /dev/md0
mdadm: hot removed /dev/sdb1 from /dev/md0
$ mdadm /dev/md1 --fail /dev/sdb2 --remove /dev/sdb2
mdadm: set /dev/sdb2 faulty in /dev/md1
mdadm: hot removed /dev/sdb2 from /dev/md1
$ dmesg
[45115503.745119] md/raid1:md0: Disk failure on sdb1, disabling device.
md/raid1:md0: Operation continuing on 1 devices.
[45115503.781346] RAID1 conf printout:
[45115503.781350] --- wd:1 rd:2
[45115503.781353] disk 0, wo:0, o:1, dev:sda1
[45115503.781354] disk 1, wo:1, o:0, dev:sdb1
[45115503.806356] RAID1 conf printout:
[45115503.806359] --- wd:1 rd:2
[45115503.806361] disk 0, wo:0, o:1, dev:sda1
[45115503.820337] md: unbind<sdb1>
[45115503.845667] md: export_rdev(sdb1)
[45115514.733880] md/raid1:md1: Disk failure on sdb2, disabling device.
md/raid1:md1: Operation continuing on 1 devices.
[45115514.749460] RAID1 conf printout:
[45115514.749463] --- wd:1 rd:2
[45115514.749464] disk 0, wo:0, o:1, dev:sda2
[45115514.749466] disk 1, wo:1, o:0, dev:sdb2
[45115514.766374] RAID1 conf printout:
[45115514.766378] --- wd:1 rd:2
[45115514.766380] disk 0, wo:0, o:1, dev:sda2
[45115514.774490] md: unbind<sdb2>
[45115514.790339] md: export_rdev(sdb2)
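With more arrays it is easy to miss a member partition. As a hedged sketch, the helper below reads /proc/mdstat-style text and prints the matching fail-and-remove command for every partition of a given disk, without executing anything. The function name is hypothetical, and the matching assumes sdX-style names where a partition is the disk name followed by digits.

```shell
#!/bin/sh
# Read /proc/mdstat-style text on stdin and print, without running them,
# the mdadm commands that would fail and remove every partition of DISK.
# Assumes sdX-style names, where partitions are the disk name plus digits.
md_fail_remove_cmds() {
    awk -v disk="$1" '
        $2 == ":" && $1 ~ /^md/ {
            for (i = 3; i <= NF; i++) {
                member = $i
                sub(/\[.*/, "", member)   # strip "[2]" and any "(F)" flag
                if (member ~ ("^" disk "[0-9]+$"))
                    printf "mdadm /dev/%s --fail /dev/%s --remove /dev/%s\n", $1, member, member
            }
        }
    '
}

# Usage: review the output before executing anything.
#   md_fail_remove_cmds sdb < /proc/mdstat
```

Printing the commands instead of running them keeps the destructive step under manual control.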
Detail after removal
$ mdadm --detail /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Wed Jun 3 22:54:50 2015
Raid Level : raid1
Array Size : 523968 (511.77 MiB 536.54 MB)
Used Dev Size : 523968 (511.77 MiB 536.54 MB)
Raid Devices : 2
Total Devices : 1
Persistence : Superblock is persistent
Update Time : Mon Oct 28 13:28:28 2019
State : clean, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 0
Spare Devices : 0
Name : rescue:0
UUID : 6e5c4531:3c91df9d:8ab440c4:fc779ac4
Events : 359
Number Major Minor RaidDevice State
0 8 1 0 active sync /dev/sda1
2 0 0 2 removed
$ mdadm --detail /dev/md1
/dev/md1:
Version : 1.2
Creation Time : Wed Jun 3 22:54:51 2015
Raid Level : raid1
Array Size : 2929608960 (2793.89 GiB 2999.92 GB)
Used Dev Size : 2929608960 (2793.89 GiB 2999.92 GB)
Raid Devices : 2
Total Devices : 1
Persistence : Superblock is persistent
Update Time : Mon Oct 28 13:37:40 2019
State : clean, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 0
Spare Devices : 0
Name : rescue:1
UUID : 4692c051:89d658ca:72ef7264:20f41ab9
Events : 250540
Number Major Minor RaidDevice State
0 8 2 0 active sync /dev/sda2
2 0 0 2 removed
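The key difference from the earlier output is the State line, which now reads "clean, degraded". A minimal sketch for checking this from a script follows. Both function names are hypothetical, and the parsing assumes the "State :" layout shown above.

```shell
#!/bin/sh
# Read `mdadm --detail` output on stdin and print the value of its State field.
md_state() {
    awk '/^[ \t]*State :/ {
        sub(/^[ \t]*State :[ \t]*/, "")
        sub(/[ \t]+$/, "")
        print
        exit
    }'
}

# Succeed (exit status 0) only when the array reports itself as degraded.
md_is_degraded() {
    md_state | grep -q 'degraded'
}

# Usage on a live system (needs root):
#   mdadm --detail /dev/md0 | md_is_degraded && echo "md0 is degraded"
```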
Replacement
First erase the data on the damaged drive for privacy, then note its serial number so that the correct drive is pulled.
Replace it with a device of equal or larger size.
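The size requirement can be checked before the new drive is partitioned. This is a small sketch under the assumption that sizes are compared in bytes. The function name is hypothetical, and sdX stands for whatever name the new drive receives.

```shell
#!/bin/sh
# Succeed when the replacement drive is at least as large as the old one.
# Both arguments are sizes in bytes: first the required size, then the new one.
replacement_fits() {
    [ "$2" -ge "$1" ]
}

# Usage on a live system (needs root); sda is the surviving mirror half:
#   replacement_fits "$(blockdev --getsize64 /dev/sda)" "$(blockdev --getsize64 /dev/sdX)" \
#       || echo "replacement drive is too small"
```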
Erase data
Write zeros to all sectors.
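$ dd if=/dev/zero of=/dev/sdb bs=4M
dd: error writing ‘/dev/sdb’: No space left on device
715398+0 records in
715397+0 records out
3000592982016 bytes (3.0 TB) copied, 20639 s, 145 MB/s
The "No space left on device" error is expected: dd stops when it reaches the end of the drive.
The above took about 5 hours 44 minutes (20639 s).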
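The elapsed time that dd reports in seconds can be converted, and the duration of a wipe estimated in advance, with a little shell arithmetic. The sketch below uses hypothetical function names. The estimate assumes a constant write speed, which real drives do not sustain.

```shell
#!/bin/sh
# Convert a number of seconds into "Hh Mm Ss".
fmt_duration() {
    printf '%dh %dm %ds\n' $(( $1 / 3600 )) $(( $1 % 3600 / 60 )) $(( $1 % 60 ))
}

# Estimate the whole seconds needed to write BYTES at BYTES_PER_SECOND.
wipe_eta_seconds() {
    echo $(( $1 / $2 ))
}

fmt_duration 20639   # elapsed time reported by dd above
```

For example, `fmt_duration "$(wipe_eta_seconds 3000592982016 145000000)"` gives a rough figure for a 3 TB drive at 145 MB/s.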
SMART
Determine serial number and check status.
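Zeroing the drive triggered new sector failures: Reallocated_Sector_Ct rose from 357 to 447, and Current_Pending_Sector and Offline_Uncorrectable each went from 0 to 16.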
$ smartctl --all /dev/sdb | less
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-5-amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Western Digital RE4 (SATA 6Gb/s)
Device Model: WDC WD3000FYYZ-01UL1B2
Serial Number: WD-WCC134KLD74Z
LU WWN Device Id: 5 0014ee 2b5e68c42
Firmware Version: 01.01K03
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 7200 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Tue Oct 29 02:00:42 2019 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (34320) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 372) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x70bd) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 152 152 021 Pre-fail Always - 11383
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 7
5 Reallocated_Sector_Ct 0x0033 186 186 140 Pre-fail Always - 447
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 068 068 000 Old_age Always - 23396
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 7
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 3
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 3
194 Temperature_Celsius 0x0022 103 096 000 Old_age Always - 49
196 Reallocated_Event_Count 0x0032 194 194 000 Old_age Always - 6
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 16
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 16
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 15
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 23383 -
# 2 Short offline Completed without error 00% 23325 -
# 3 Short offline Completed without error 00% 23300 -
# 4 Short offline Completed without error 00% 23180 -
# 5 Short offline Completed without error 00% 23108 -
# 6 Short offline Completed without error 00% 22989 -
# 7 Short offline Completed without error 00% 22944 -
# 8 Extended offline Completed without error 00% 22927 -
# 9 Short offline Aborted by host 10% 22901 -
#10 Short offline Completed without error 00% 22892 -
#11 Short offline Completed without error 00% 22821 -
#12 Short offline Completed without error 00% 22797 -
#13 Short offline Completed without error 00% 22677 -
#14 Short offline Completed without error 00% 22653 -
#15 Short offline Completed without error 00% 22629 -
#16 Short offline Completed without error 00% 22605 -
#17 Short offline Completed without error 00% 22581 -
#18 Short offline Completed without error 00% 22533 -
#19 Short offline Completed without error 00% 22461 -
#20 Short offline Completed without error 00% 22393 -
#21 Short offline Completed without error 00% 22293 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Add new drive
Copy partition table
Copy sda to sdb using a backup
sgdisk -b sda.bak /dev/sda
sgdisk -l sda.bak /dev/sdb
Copy directly
sgdisk -R /dev/sdb /dev/sda
Assign new random GUIDs
sgdisk -G /dev/sdb
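The argument order of sgdisk -R is easy to get backwards: the target comes first and the source last, so a swapped pair would overwrite the good disk's partition table. As a hedged sketch, the helper below takes source and target in the more natural order and only prints the commands. The function name is hypothetical.

```shell
#!/bin/sh
# Print, without running them, the sgdisk commands that copy the partition
# table of SOURCE onto TARGET and then give TARGET fresh GUIDs.
# Note the argument order of -R: the target comes first, the source last.
gpt_clone_cmds() {
    source_disk=$1
    target_disk=$2
    if [ "$source_disk" = "$target_disk" ]; then
        echo "source and target must differ" >&2
        return 1
    fi
    echo "sgdisk -R $target_disk $source_disk"
    echo "sgdisk -G $target_disk"
}

# Source is the healthy disk, target is the new, empty one.
gpt_clone_cmds /dev/sda /dev/sdb
```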
Add to array
mdadm --manage /dev/md0 --add /dev/sdb1
mdadm --manage /dev/md1 --add /dev/sdb2
Status
cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb2[2] sda2[0]
2929608960 blocks super 1.2 [2/1] [U_]
[>....................] recovery = 0.2% (7099008/2929608960) finish=742.6min speed=65584K/sec
md0 : active raid1 sdb1[2] sda1[0]
523968 blocks super 1.2 [2/2] [UU]
unused devices: <none>
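In the output above, md0 has already finished rebuilding while md1 is at 0.2% with roughly 742 minutes to go. A minimal sketch for extracting just that progress information follows. The function name is hypothetical, and the parsing assumes the mdstat layout shown here.

```shell
#!/bin/sh
# Read /proc/mdstat-style text on stdin and print one line per array that is
# rebuilding: array name, percentage done, estimated time to finish.
md_recovery_status() {
    awk '
        $2 == ":" && $1 ~ /^md[0-9]+$/ { md = $1 }   # remember the current array
        /(recovery|resync) *=/ {
            pct = ""; eta = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /%$/) pct = $i
                if ($i ~ /^finish=/) eta = substr($i, 8)
            }
            print md, pct, eta
        }
    '
}

# Usage:
#   md_recovery_status < /proc/mdstat
```

No output means that no array is currently rebuilding.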
dmesg
[126629.165345] md: bind<sdb1>
[126629.213425] RAID1 conf printout:
[126629.213428] --- wd:1 rd:2
[126629.213430] disk 0, wo:0, o:1, dev:sda1
[126629.213432] disk 1, wo:1, o:1, dev:sdb1
[126629.213580] md: recovery of RAID array md0
[126629.213606] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
[126629.213624] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
[126629.213659] md: using 128k window, over a total of 523968k.
[126636.136690] md: md0: recovery done.
[126636.247571] RAID1 conf printout:
[126636.247574] --- wd:2 rd:2
[126636.247577] disk 0, wo:0, o:1, dev:sda1
[126636.247578] disk 1, wo:0, o:1, dev:sdb1
[126715.311174] md: bind<sdb2>
[126715.407833] RAID1 conf printout:
[126715.407836] --- wd:1 rd:2
[126715.407838] disk 0, wo:0, o:1, dev:sda2
[126715.407839] disk 1, wo:1, o:1, dev:sdb2
[126715.407949] md: recovery of RAID array md1
[126715.407971] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
[126715.407989] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
[126715.408017] md: using 128k window, over a total of 2929608960k.
...
Bootloader
grub-install /dev/sdb
Off-site
mdadm --assemble /dev/mdX /dev/sdX

