====Mirror replacement====
This is a generic example of drive replacement for RAID 1.
The partition layout and the number of RAID devices can vary.
===PARTITIONS===
$ fdisk -l /dev/sda
Disk /dev/sda: 2.7 TiB, 3000592982016 bytes, 5860533168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 9B5F217F-3F30-4564-A171-E5B4AD5C208F
Device Start End Sectors Size Type
/dev/sda1 4096 1052671 1048576 512M Linux RAID
/dev/sda2 1052672 5860533134 5859480463 2.7T Linux RAID
/dev/sda3 2048 4095 2048 1M BIOS boot
Partition table entries are not in disk order.
$ fdisk -l /dev/sdb
Disk /dev/sdb: 2.7 TiB, 3000592982016 bytes, 5860533168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: BCCDC02E-470A-45B6-B980-DA44E29FEC14
Device Start End Sectors Size Type
/dev/sdb1 4096 1052671 1048576 512M Linux RAID
/dev/sdb2 1052672 5860533134 5859480463 2.7T Linux RAID
/dev/sdb3 2048 4095 2048 1M BIOS boot
Partition table entries are not in disk order.
===SMART===
Smartmontools reads a drive's SMART data, which can reveal an imminent failure.
Smartmontools can also be configured to send notifications.
$ sudo smartctl --all /dev/sdb
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-5-amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Western Digital RE4 (SATA 6Gb/s)
Device Model: WDC WD3000FYYZ-01UL1B2
Serial Number: WD-WCC134KLD74Z
LU WWN Device Id: 5 0014ee 2b5e68c42
Firmware Version: 01.01K03
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 7200 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Mon Oct 28 10:13:11 2019 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x85) Offline data collection activity
was aborted by an interrupting command from host.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 241) Self-test routine in progress...
10% of test remaining.
Total time to complete Offline
data collection: (34320) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 372) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x70bd) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 152 152 021 Pre-fail Always - 11383
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 7
5 Reallocated_Sector_Ct 0x0033 189 189 140 Pre-fail Always - 357
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 068 068 000 Old_age Always - 23380
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 7
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 3
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 3
194 Temperature_Celsius 0x0022 099 096 000 Old_age Always - 53
196 Reallocated_Event_Count 0x0032 194 194 000 Old_age Always - 6
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 23325 -
# 2 Short offline Completed without error 00% 23300 -
# 3 Short offline Completed without error 00% 23180 -
# 4 Short offline Completed without error 00% 23108 -
# 5 Short offline Completed without error 00% 22989 -
# 6 Short offline Completed without error 00% 22944 -
# 7 Extended offline Completed without error 00% 22927 -
# 8 Short offline Aborted by host 10% 22901 -
# 9 Short offline Completed without error 00% 22892 -
#10 Short offline Completed without error 00% 22821 -
#11 Short offline Completed without error 00% 22797 -
#12 Short offline Completed without error 00% 22677 -
#13 Short offline Completed without error 00% 22653 -
#14 Short offline Completed without error 00% 22629 -
#15 Short offline Completed without error 00% 22605 -
#16 Short offline Completed without error 00% 22581 -
#17 Short offline Completed without error 00% 22533 -
#18 Short offline Completed without error 00% 22461 -
#19 Short offline Completed without error 00% 22393 -
#20 Short offline Completed without error 00% 22293 -
#21 Short offline Completed without error 00% 22221 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
The reallocation counters above (Reallocated_Sector_Ct raw value 357) show that the drive has bad sectors.
Although the drive still works and the overall health test reports PASSED, it will probably not last,
so it needs to be replaced.
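The check described above can be sketched as a small shell helper that pulls the raw sector counters out of the attribute table. This is only an illustration: the function names `smart_raw` and `sectors_ok` are made up for this page, and the choice of attributes 5, 197 and 198 with a threshold of zero is an assumption, not a rule taken from Smartmontools.

```shell
#!/bin/sh
# Print the RAW_VALUE of one SMART attribute, reading `smartctl -A` style
# output on stdin. $1 is the attribute ID (first column of the table).
# The NF guard skips shorter lines that happen to start with the same number.
smart_raw() {
    awk -v id="$1" '$1 == id && NF >= 10 { print $10; exit }'
}

# Succeed only if the three sector counters are all zero.
# Which counters to watch, and the zero threshold, are assumptions.
sectors_ok() {
    report=$(cat)
    for id in 5 197 198; do
        raw=$(printf '%s\n' "$report" | smart_raw "$id")
        [ "${raw:-0}" -eq 0 ] || return 1
    done
    return 0
}

# Sample rows copied from the listing above; on a live system one would
# pipe in the output of `smartctl -A /dev/sdb` instead.
sample='  5 Reallocated_Sector_Ct   0x0033 189 189 140 Pre-fail Always  - 357
197 Current_Pending_Sector  0x0032 200 200 000 Old_age  Always  - 0
198 Offline_Uncorrectable   0x0030 200 200 000 Old_age  Offline - 0'

if printf '%s\n' "$sample" | sectors_ok; then
    echo "sector counters clean"
else
    echo "sector counters non-zero: plan a replacement"
fi
```

Looking at the raw counters rather than the overall self-assessment matters here, because this drive still reports PASSED.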
===MDADM===
==Scan==
$ mdadm --detail --scan
ARRAY /dev/md/0 metadata=1.2 name=rescue:0 UUID=6e5c4531:3c91df9d:8ab440c4:fc779ac4
ARRAY /dev/md/1 metadata=1.2 name=rescue:1 UUID=4692c051:89d658ca:72ef7264:20f41ab9
==Detail before removal==
$ mdadm --detail /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Wed Jun 3 22:54:50 2015
Raid Level : raid1
Array Size : 523968 (511.77 MiB 536.54 MB)
Used Dev Size : 523968 (511.77 MiB 536.54 MB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Update Time : Mon Oct 28 03:19:43 2019
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Name : rescue:0
UUID : 6e5c4531:3c91df9d:8ab440c4:fc779ac4
Events : 356
Number Major Minor RaidDevice State
0 8 1 0 active sync /dev/sda1
2 8 17 1 active sync /dev/sdb1
$ mdadm --detail /dev/md1
/dev/md1:
Version : 1.2
Creation Time : Wed Jun 3 22:54:51 2015
Raid Level : raid1
Array Size : 2929608960 (2793.89 GiB 2999.92 GB)
Used Dev Size : 2929608960 (2793.89 GiB 2999.92 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Update Time : Mon Oct 28 12:53:01 2019
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Name : rescue:1
UUID : 4692c051:89d658ca:72ef7264:20f41ab9
Events : 248991
Number Major Minor RaidDevice State
0 8 2 0 active sync /dev/sda2
2 8 18 1 active sync /dev/sdb2
==Fail and remove==
$ mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
mdadm: set /dev/sdb1 faulty in /dev/md0
mdadm: hot removed /dev/sdb1 from /dev/md0
$ mdadm /dev/md1 --fail /dev/sdb2 --remove /dev/sdb2
mdadm: set /dev/sdb2 faulty in /dev/md1
mdadm: hot removed /dev/sdb2 from /dev/md1
$ dmesg
[45115503.745119] md/raid1:md0: Disk failure on sdb1, disabling device.
md/raid1:md0: Operation continuing on 1 devices.
[45115503.781346] RAID1 conf printout:
[45115503.781350] --- wd:1 rd:2
[45115503.781353] disk 0, wo:0, o:1, dev:sda1
[45115503.781354] disk 1, wo:1, o:0, dev:sdb1
[45115503.806356] RAID1 conf printout:
[45115503.806359] --- wd:1 rd:2
[45115503.806361] disk 0, wo:0, o:1, dev:sda1
[45115503.820337] md: unbind<sdb1>
[45115503.845667] md: export_rdev(sdb1)
[45115514.733880] md/raid1:md1: Disk failure on sdb2, disabling device.
md/raid1:md1: Operation continuing on 1 devices.
[45115514.749460] RAID1 conf printout:
[45115514.749463] --- wd:1 rd:2
[45115514.749464] disk 0, wo:0, o:1, dev:sda2
[45115514.749466] disk 1, wo:1, o:0, dev:sdb2
[45115514.766374] RAID1 conf printout:
[45115514.766378] --- wd:1 rd:2
[45115514.766380] disk 0, wo:0, o:1, dev:sda2
[45115514.774490] md: unbind<sdb2>
[45115514.790339] md: export_rdev(sdb2)
==Detail after removal==
$ mdadm --detail /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Wed Jun 3 22:54:50 2015
Raid Level : raid1
Array Size : 523968 (511.77 MiB 536.54 MB)
Used Dev Size : 523968 (511.77 MiB 536.54 MB)
Raid Devices : 2
Total Devices : 1
Persistence : Superblock is persistent
Update Time : Mon Oct 28 13:28:28 2019
State : clean, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 0
Spare Devices : 0
Name : rescue:0
UUID : 6e5c4531:3c91df9d:8ab440c4:fc779ac4
Events : 359
Number Major Minor RaidDevice State
0 8 1 0 active sync /dev/sda1
2 0 0 2 removed
$ mdadm --detail /dev/md1
/dev/md1:
Version : 1.2
Creation Time : Wed Jun 3 22:54:51 2015
Raid Level : raid1
Array Size : 2929608960 (2793.89 GiB 2999.92 GB)
Used Dev Size : 2929608960 (2793.89 GiB 2999.92 GB)
Raid Devices : 2
Total Devices : 1
Persistence : Superblock is persistent
Update Time : Mon Oct 28 13:37:40 2019
State : clean, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 0
Spare Devices : 0
Name : rescue:1
UUID : 4692c051:89d658ca:72ef7264:20f41ab9
Events : 250540
Number Major Minor RaidDevice State
0 8 2 0 active sync /dev/sda2
2 0 0 2 removed
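The before and after listings differ in the State line and in the device counts. A script can read the same fields to confirm that an array is degraded (or still clean) before going further. The sketch below parses `mdadm --detail` text from stdin; the function names `md_summary` and `md_is_degraded` are hypothetical helpers, not part of mdadm.

```shell
#!/bin/sh
# Print "<state>|<active>/<raid>" for one `mdadm --detail` report on stdin.
md_summary() {
    awk -F' *: *' '
        function trim(s) { sub(/[ \t]+$/, "", s); return s }
        /^ *State *:/          { state  = trim($2) }
        /^ *Raid Devices *:/   { raid   = trim($2) }
        /^ *Active Devices *:/ { active = trim($2) }
        END { printf "%s|%s/%s\n", state, active, raid }
    '
}

# Succeed if the report describes a degraded array: either the State line
# says so, or fewer devices are active than the array is defined with.
md_is_degraded() {
    summary=$(md_summary)
    state=${summary%%|*}
    counts=${summary#*|}
    active=${counts%/*}
    raid=${counts#*/}
    case $state in *degraded*) return 0 ;; esac
    [ "$active" -lt "$raid" ]
}

# Fields copied from the md0 listing after removal; on a live system one
# would pipe in `mdadm --detail /dev/md0` instead.
sample='Raid Devices : 2
State : clean, degraded
Active Devices : 1'

printf '%s\n' "$sample" | md_summary
if printf '%s\n' "$sample" | md_is_degraded; then
    echo "array is running on fewer devices than configured"
fi
```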
===Replacement===
First erase the data on the damaged drive for privacy, then note
the serial number of the drive.
Replace the drive with one of equal or larger capacity.
==Erase data==
Write zeros to all sectors.
$ dd if=/dev/zero of=/dev/sdb bs=4M
dd: error writing ‘/dev/sdb’: No space left on device
715398+0 records in
715397+0 records out
3000592982016 bytes (3.0 TB) copied, 20639 s, 145 MB/s
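The above took about 5 hours 44 minutes (20639 s). The "No space left on device" error is expected: dd stops when it reaches the end of the drive.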
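As a quick cross-check, the two figures on dd's summary line can be converted to a wall-clock duration and an average rate with shell arithmetic. The variable names are arbitrary; the values are copied from the run above.

```shell
#!/bin/sh
# Figures reported by dd in the run above.
bytes=3000592982016
seconds=20639

hours=$(( seconds / 3600 ))
minutes=$(( seconds % 3600 / 60 ))
rest=$(( seconds % 60 ))
rate_mb=$(( bytes / seconds / 1000000 ))   # decimal megabytes, as dd reports

printf 'elapsed: %dh %02dm %02ds\n' "$hours" "$minutes" "$rest"
printf 'average: %d MB/s\n' "$rate_mb"
```

The same arithmetic gives a rough estimate for other drives: divide the capacity in bytes by the expected write rate.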
==SMART==
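Determine the serial number and check the status.
Zeroing the drive has triggered new sector failures: Reallocated_Sector_Ct has risen from 357 to 447, and Current_Pending_Sector and Offline_Uncorrectable now read 16.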
smartctl --all /dev/sdb|less
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-5-amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Western Digital RE4 (SATA 6Gb/s)
Device Model: WDC WD3000FYYZ-01UL1B2
Serial Number: WD-WCC134KLD74Z
LU WWN Device Id: 5 0014ee 2b5e68c42
Firmware Version: 01.01K03
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 7200 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Tue Oct 29 02:00:42 2019 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (34320) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 372) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x70bd) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 152 152 021 Pre-fail Always - 11383
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 7
5 Reallocated_Sector_Ct 0x0033 186 186 140 Pre-fail Always - 447
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 068 068 000 Old_age Always - 23396
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 7
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 3
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 3
194 Temperature_Celsius 0x0022 103 096 000 Old_age Always - 49
196 Reallocated_Event_Count 0x0032 194 194 000 Old_age Always - 6
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 16
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 16
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 15
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 23383 -
# 2 Short offline Completed without error 00% 23325 -
# 3 Short offline Completed without error 00% 23300 -
# 4 Short offline Completed without error 00% 23180 -
# 5 Short offline Completed without error 00% 23108 -
# 6 Short offline Completed without error 00% 22989 -
# 7 Short offline Completed without error 00% 22944 -
# 8 Extended offline Completed without error 00% 22927 -
# 9 Short offline Aborted by host 10% 22901 -
#10 Short offline Completed without error 00% 22892 -
#11 Short offline Completed without error 00% 22821 -
#12 Short offline Completed without error 00% 22797 -
#13 Short offline Completed without error 00% 22677 -
#14 Short offline Completed without error 00% 22653 -
#15 Short offline Completed without error 00% 22629 -
#16 Short offline Completed without error 00% 22605 -
#17 Short offline Completed without error 00% 22581 -
#18 Short offline Completed without error 00% 22533 -
#19 Short offline Completed without error 00% 22461 -
#20 Short offline Completed without error 00% 22393 -
#21 Short offline Completed without error 00% 22293 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
==Add new drive==
Copy partition table
Copy the table from sda to sdb via a backup file
sgdisk -b sda.bak /dev/sda
sgdisk -l sda.bak /dev/sdb
Or copy directly (the device given to -R is the target, sda is the source)
sgdisk -R /dev/sdb /dev/sda
Randomize the disk and partition GUIDs on sdb so they differ from sda
sgdisk -G /dev/sdb
Add to array
mdadm --manage /dev/md0 --add /dev/sdb1
mdadm --manage /dev/md1 --add /dev/sdb2
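Because `sgdisk -R` takes the target first and the source last, the two are easy to swap, and a swap would overwrite the healthy disk's partition table. One way to guard against that is to print the command sequence for review before running anything. The sketch below does only that; `plan_readd` is a hypothetical helper, and it assumes partition names are formed by appending the number to the disk name (true for /dev/sdX, not for NVMe-style names).

```shell
#!/bin/sh
# Print, without executing, the commands that clone the partition table from
# a healthy disk to its replacement and add the new partitions to the arrays.
# $1 = healthy disk, $2 = replacement disk, rest = array:partition-number pairs.
plan_readd() {
    good=$1
    new=$2
    shift 2
    if [ "$good" = "$new" ]; then
        echo "refusing: source and target are the same disk" >&2
        return 1
    fi
    echo "sgdisk -R $new $good"    # target is the -R argument, source is last
    echo "sgdisk -G $new"          # give the copy its own GUIDs
    for pair in "$@"; do
        echo "mdadm --manage ${pair%%:*} --add ${new}${pair##*:}"
    done
}

# Layout used on this page: md0 on partition 1, md1 on partition 2.
plan_readd /dev/sda /dev/sdb /dev/md0:1 /dev/md1:2
```

For the layout on this page the printed lines match the commands shown above, which makes the output easy to compare before pasting it into a root shell.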
Status
cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb2[2] sda2[0]
2929608960 blocks super 1.2 [2/1] [U_]
[>....................] recovery = 0.2% (7099008/2929608960) finish=742.6min speed=65584K/sec
md0 : active raid1 sdb1[2] sda1[0]
523968 blocks super 1.2 [2/2] [UU]
unused devices: <none>
dmesg
[126629.165345] md: bind<sdb1>
[126629.213425] RAID1 conf printout:
[126629.213428] --- wd:1 rd:2
[126629.213430] disk 0, wo:0, o:1, dev:sda1
[126629.213432] disk 1, wo:1, o:1, dev:sdb1
[126629.213580] md: recovery of RAID array md0
[126629.213606] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
[126629.213624] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
[126629.213659] md: using 128k window, over a total of 523968k.
[126636.136690] md: md0: recovery done.
[126636.247571] RAID1 conf printout:
[126636.247574] --- wd:2 rd:2
[126636.247577] disk 0, wo:0, o:1, dev:sda1
[126636.247578] disk 1, wo:0, o:1, dev:sdb1
[126715.311174] md: bind<sdb2>
[126715.407833] RAID1 conf printout:
[126715.407836] --- wd:1 rd:2
[126715.407838] disk 0, wo:0, o:1, dev:sda2
[126715.407839] disk 1, wo:1, o:1, dev:sdb2
[126715.407949] md: recovery of RAID array md1
[126715.407971] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
[126715.407989] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
[126715.408017] md: using 128k window, over a total of 2929608960k.
...
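While the rebuild runs, the /proc/mdstat listing above can be condensed to one line per array plus a progress line for any array that is still recovering. The following is a minimal sketch that parses mdstat text from stdin; `mdstat_summary` is a hypothetical name, and the parsing assumes the layout shown on this page.

```shell
#!/bin/sh
# Summarise /proc/mdstat text read from stdin: the member map of each array
# ([UU] = both mirrors present, [U_] = one missing or rebuilding) and, when
# a recovery or resync is running, its percentage and estimated finish time.
mdstat_summary() {
    awk '
        /^md[0-9]+ *:/ { name = $1 }
        name != "" && /blocks/ {
            # The bracketed member map is the last field on this line.
            printf "%s %s\n", name, $NF
        }
        name != "" && /recovery *=|resync *=/ {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /%$/) pct = $i
                if ($i ~ /^finish=/) eta = substr($i, 8)
            }
            printf "%s rebuilding %s eta %s\n", name, pct, eta
        }
    '
}

# Sample copied from the listing above; on a live system one would run
# `mdstat_summary < /proc/mdstat` instead.
sample='Personalities : [raid1]
md1 : active raid1 sdb2[2] sda2[0]
      2929608960 blocks super 1.2 [2/1] [U_]
      [>....................]  recovery =  0.2% (7099008/2929608960) finish=742.6min speed=65584K/sec
md0 : active raid1 sdb1[2] sda1[0]
      523968 blocks super 1.2 [2/2] [UU]'

printf '%s\n' "$sample" | mdstat_summary
```

For the sample this reports md1 as `[U_]` and rebuilding, and md0 as `[UU]`, matching the dmesg output in which md0 finished within seconds.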
Install the bootloader on the new drive so that the system can boot from either disk
grub-install /dev/sdb
=== Off-site ===
mdadm --assemble /dev/mdX /dev/sdX