Mirror replacement

This is a generic example of replacing a drive in a RAID 1 mirror.

The partition layout and the number of RAID devices can vary.

PARTITIONS

$ fdisk -l /dev/sda

Disk /dev/sda: 2.7 TiB, 3000592982016 bytes, 5860533168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 9B5F217F-3F30-4564-A171-E5B4AD5C208F

Device       Start        End    Sectors  Size Type
/dev/sda1     4096    1052671    1048576  512M Linux RAID
/dev/sda2  1052672 5860533134 5859480463  2.7T Linux RAID
/dev/sda3     2048       4095       2048    1M BIOS boot

Partition table entries are not in disk order.

$ fdisk -l /dev/sdb

Disk /dev/sdb: 2.7 TiB, 3000592982016 bytes, 5860533168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: BCCDC02E-470A-45B6-B980-DA44E29FEC14

Device       Start        End    Sectors  Size Type
/dev/sdb1     4096    1052671    1048576  512M Linux RAID
/dev/sdb2  1052672 5860533134 5859480463  2.7T Linux RAID
/dev/sdb3     2048       4095       2048    1M BIOS boot

Partition table entries are not in disk order.

SMART

SMART data, read with Smartmontools, can reveal the imminent failure of a drive.

Smartmontools can also be configured to send notifications.

$ sudo smartctl --all /dev/sdb

smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-5-amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital RE4 (SATA 6Gb/s)
Device Model:     WDC WD3000FYYZ-01UL1B2
Serial Number:    WD-WCC134KLD74Z
LU WWN Device Id: 5 0014ee 2b5e68c42
Firmware Version: 01.01K03
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Mon Oct 28 10:13:11 2019 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x85) Offline data collection activity
                                        was aborted by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 241) Self-test routine in progress...
                                        10% of test remaining.
Total time to complete Offline
data collection:                (34320) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 372) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x70bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   152   152   021    Pre-fail  Always       -       11383
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       7
  5 Reallocated_Sector_Ct   0x0033   189   189   140    Pre-fail  Always       -       357
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   068   068   000    Old_age   Always       -       23380
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       7
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       3
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       3
194 Temperature_Celsius     0x0022   099   096   000    Old_age   Always       -       53
196 Reallocated_Event_Count 0x0032   194   194   000    Old_age   Always       -       6
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     23325         -
# 2  Short offline       Completed without error       00%     23300         -
# 3  Short offline       Completed without error       00%     23180         -
# 4  Short offline       Completed without error       00%     23108         -
# 5  Short offline       Completed without error       00%     22989         -
# 6  Short offline       Completed without error       00%     22944         -
# 7  Extended offline    Completed without error       00%     22927         -
# 8  Short offline       Aborted by host               10%     22901         -
# 9  Short offline       Completed without error       00%     22892         -
#10  Short offline       Completed without error       00%     22821         -
#11  Short offline       Completed without error       00%     22797         -
#12  Short offline       Completed without error       00%     22677         -
#13  Short offline       Completed without error       00%     22653         -
#14  Short offline       Completed without error       00%     22629         -
#15  Short offline       Completed without error       00%     22605         -
#16  Short offline       Completed without error       00%     22581         -
#17  Short offline       Completed without error       00%     22533         -
#18  Short offline       Completed without error       00%     22461         -
#19  Short offline       Completed without error       00%     22393         -
#20  Short offline       Completed without error       00%     22293         -
#21  Short offline       Completed without error       00%     22221         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The drive above has bad sectors, as shown by the reallocation counters (Reallocated_Sector_Ct 357, Reallocated_Event_Count 6).

Although the drive still works and the overall health test reports PASSED, it will probably not last, so it needs to be replaced.
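The counters of interest can be pulled out of the long report with a small filter. This is a minimal sketch: the function name smart_watch is hypothetical, the choice of four attributes is an assumption, and the column position of the raw value follows the table layout shown above.

```shell
# Print NAME=RAW_VALUE for the attributes that most often signal media failure.
# Reads `smartctl --all` (or `smartctl -A`) output on stdin.
# Assumption: the raw value is the 10th column, as in the table above.
smart_watch() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Reallocated_Event_Count|Current_Pending_Sector|Offline_Uncorrectable)$/ {
        print $2 "=" $10
    }'
}
```

Usage: sudo smartctl --all /dev/sdb | smart_watch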

MDADM

Scan
$ mdadm --detail --scan

ARRAY /dev/md/0 metadata=1.2 name=rescue:0 UUID=6e5c4531:3c91df9d:8ab440c4:fc779ac4
ARRAY /dev/md/1 metadata=1.2 name=rescue:1 UUID=4692c051:89d658ca:72ef7264:20f41ab9

Detail before removal
$ mdadm --detail /dev/md0

/dev/md0:
        Version : 1.2
  Creation Time : Wed Jun  3 22:54:50 2015
     Raid Level : raid1
     Array Size : 523968 (511.77 MiB 536.54 MB)
  Used Dev Size : 523968 (511.77 MiB 536.54 MB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Mon Oct 28 03:19:43 2019
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           Name : rescue:0
           UUID : 6e5c4531:3c91df9d:8ab440c4:fc779ac4
         Events : 356

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       2       8       17        1      active sync   /dev/sdb1

$ mdadm --detail /dev/md1

/dev/md1:
        Version : 1.2
  Creation Time : Wed Jun  3 22:54:51 2015
     Raid Level : raid1
     Array Size : 2929608960 (2793.89 GiB 2999.92 GB)
  Used Dev Size : 2929608960 (2793.89 GiB 2999.92 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Mon Oct 28 12:53:01 2019
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           Name : rescue:1
           UUID : 4692c051:89d658ca:72ef7264:20f41ab9
         Events : 248991

    Number   Major   Minor   RaidDevice State
       0       8        2        0      active sync   /dev/sda2
       2       8       18        1      active sync   /dev/sdb2

Fail and remove
$ mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1

mdadm: set /dev/sdb1 faulty in /dev/md0
mdadm: hot removed /dev/sdb1 from /dev/md0

$ mdadm /dev/md1 --fail /dev/sdb2 --remove /dev/sdb2

mdadm: set /dev/sdb2 faulty in /dev/md1
mdadm: hot removed /dev/sdb2 from /dev/md1
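Failing the wrong member would break the mirror, so it can help to print each command before running it. The wrapper below is a sketch only: md_drop and the RUN variable are hypothetical names, and the mdadm invocation is the same one used above.

```shell
# Hypothetical helper: fail and remove one member from one array.
# Prints the command by default; set RUN=1 to actually execute it.
md_drop() {
    array=$1
    member=$2
    set -- mdadm "$array" --fail "$member" --remove "$member"
    if [ "${RUN:-0}" = 1 ]; then
        "$@"
    else
        echo "$@"
    fi
}
```

Usage: md_drop /dev/md0 /dev/sdb1 to review the command, then RUN=1 md_drop /dev/md0 /dev/sdb1 to run it.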

$ dmesg

[45115503.745119] md/raid1:md0: Disk failure on sdb1, disabling device.
md/raid1:md0: Operation continuing on 1 devices.
[45115503.781346] RAID1 conf printout:
[45115503.781350]  --- wd:1 rd:2
[45115503.781353]  disk 0, wo:0, o:1, dev:sda1
[45115503.781354]  disk 1, wo:1, o:0, dev:sdb1
[45115503.806356] RAID1 conf printout:
[45115503.806359]  --- wd:1 rd:2
[45115503.806361]  disk 0, wo:0, o:1, dev:sda1
[45115503.820337] md: unbind<sdb1>
[45115503.845667] md: export_rdev(sdb1)

[45115514.733880] md/raid1:md1: Disk failure on sdb2, disabling device.
md/raid1:md1: Operation continuing on 1 devices.
[45115514.749460] RAID1 conf printout:
[45115514.749463]  --- wd:1 rd:2
[45115514.749464]  disk 0, wo:0, o:1, dev:sda2
[45115514.749466]  disk 1, wo:1, o:0, dev:sdb2
[45115514.766374] RAID1 conf printout:
[45115514.766378]  --- wd:1 rd:2
[45115514.766380]  disk 0, wo:0, o:1, dev:sda2
[45115514.774490] md: unbind<sdb2>
[45115514.790339] md: export_rdev(sdb2)

Detail after removal
$ mdadm --detail /dev/md0

/dev/md0:
        Version : 1.2
  Creation Time : Wed Jun  3 22:54:50 2015
     Raid Level : raid1
     Array Size : 523968 (511.77 MiB 536.54 MB)
  Used Dev Size : 523968 (511.77 MiB 536.54 MB)
   Raid Devices : 2
  Total Devices : 1
    Persistence : Superblock is persistent

    Update Time : Mon Oct 28 13:28:28 2019
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           Name : rescue:0
           UUID : 6e5c4531:3c91df9d:8ab440c4:fc779ac4
         Events : 359

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       2       0        0        2      removed

$ mdadm --detail /dev/md1

/dev/md1:
        Version : 1.2
  Creation Time : Wed Jun  3 22:54:51 2015
     Raid Level : raid1
     Array Size : 2929608960 (2793.89 GiB 2999.92 GB)
  Used Dev Size : 2929608960 (2793.89 GiB 2999.92 GB)
   Raid Devices : 2
  Total Devices : 1
    Persistence : Superblock is persistent

    Update Time : Mon Oct 28 13:37:40 2019
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           Name : rescue:1
           UUID : 4692c051:89d658ca:72ef7264:20f41ab9
         Events : 250540

    Number   Major   Minor   RaidDevice State
       0       8        2        0      active sync   /dev/sda2
       2       0        0        2      removed
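The State line is the quickest way to confirm that both arrays are now running degraded. A minimal sketch of extracting it, assuming the "State : ..." layout shown above; md_state and md_is_degraded are hypothetical names.

```shell
# Print the value of the "State :" line from `mdadm --detail` output on stdin.
md_state() {
    awk -F' : ' '/^ *State :/ { sub(/ +$/, "", $2); print $2; exit }'
}

# Succeed (exit status 0) if the reported state contains "degraded".
md_is_degraded() {
    md_state | grep -q degraded
}
```

Usage: mdadm --detail /dev/md0 | md_state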

Replacement

First erase the data on the damaged drive for privacy, then note its serial number so that the right drive is pulled.

Replace it with a drive of equal or larger size.

Erase data

Write zeros to all sectors.

$ dd if=/dev/zero of=/dev/sdb bs=4M

dd: error writing ‘/dev/sdb’: No space left on device
715398+0 records in
715397+0 records out
3000592982016 bytes (3.0 TB) copied, 20639 s, 145 MB/s

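The above took about 5 hours 44 minutes (20639 s). The "No space left on device" error is expected: dd stops when it reaches the end of the drive.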
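The figures in the dd summary can be checked with shell arithmetic. A small sketch; fmt_duration and avg_mb_per_s are hypothetical helper names, and MB here means 1000000 bytes, as in the dd output.

```shell
# Convert a number of seconds to hours, minutes and seconds.
fmt_duration() {
    s=$1
    printf '%dh %dm %ds\n' $((s / 3600)) $((s % 3600 / 60)) $((s % 60))
}

# Average throughput in whole MB/s (1 MB = 1000000 bytes) from bytes and seconds.
avg_mb_per_s() {
    echo $(( $1 / $2 / 1000000 ))
}
```

With the values above, fmt_duration 20639 prints 5h 43m 59s and avg_mb_per_s 3000592982016 20639 prints 145.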

SMART

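Determine the serial number and check the status.

Zeroing the drive has triggered new sector failures: Reallocated_Sector_Ct rose from 357 to 447, and Current_Pending_Sector and Offline_Uncorrectable now both read 16.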

$ smartctl --all /dev/sdb | less

smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-5-amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital RE4 (SATA 6Gb/s)
Device Model:     WDC WD3000FYYZ-01UL1B2
Serial Number:    WD-WCC134KLD74Z
LU WWN Device Id: 5 0014ee 2b5e68c42
Firmware Version: 01.01K03
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Tue Oct 29 02:00:42 2019 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (34320) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 372) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x70bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   152   152   021    Pre-fail  Always       -       11383
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       7
  5 Reallocated_Sector_Ct   0x0033   186   186   140    Pre-fail  Always       -       447
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   068   068   000    Old_age   Always       -       23396
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       7
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       3
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       3
194 Temperature_Celsius     0x0022   103   096   000    Old_age   Always       -       49
196 Reallocated_Event_Count 0x0032   194   194   000    Old_age   Always       -       6
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       16
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       16
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       15

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     23383         -
# 2  Short offline       Completed without error       00%     23325         -
# 3  Short offline       Completed without error       00%     23300         -
# 4  Short offline       Completed without error       00%     23180         -
# 5  Short offline       Completed without error       00%     23108         -
# 6  Short offline       Completed without error       00%     22989         -
# 7  Short offline       Completed without error       00%     22944         -
# 8  Extended offline    Completed without error       00%     22927         -
# 9  Short offline       Aborted by host               10%     22901         -
#10  Short offline       Completed without error       00%     22892         -
#11  Short offline       Completed without error       00%     22821         -
#12  Short offline       Completed without error       00%     22797         -
#13  Short offline       Completed without error       00%     22677         -
#14  Short offline       Completed without error       00%     22653         -
#15  Short offline       Completed without error       00%     22629         -
#16  Short offline       Completed without error       00%     22605         -
#17  Short offline       Completed without error       00%     22581         -
#18  Short offline       Completed without error       00%     22533         -
#19  Short offline       Completed without error       00%     22461         -
#20  Short offline       Completed without error       00%     22393         -
#21  Short offline       Completed without error       00%     22293         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
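
If both reports are saved to files, the change in a counter can be computed instead of read off by eye. A sketch under assumptions: smart_raw and smart_delta are hypothetical names, the file names are up to the reader, and the raw value is taken from the 10th column as in the tables above.

```shell
# Print the raw value of one attribute from a saved smartctl report.
# Usage: smart_raw ATTRIBUTE FILE
smart_raw() {
    awk -v name="$1" '$2 == name { print $10 }' "$2"
}

# Print how much an attribute's raw value changed between two saved reports.
# Usage: smart_delta ATTRIBUTE BEFORE_FILE AFTER_FILE
smart_delta() {
    echo $(( $(smart_raw "$1" "$3") - $(smart_raw "$1" "$2") ))
}
```

For the two reports on this page, smart_delta Reallocated_Sector_Ct before.txt after.txt would print 90 (447 - 357).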

Add new drive

Copy partition table

Copy sda to sdb using a backup file

sgdisk -b sda.bak /dev/sda
sgdisk -l sda.bak /dev/sdb

Copy directly (the argument to -R is the target, so this replicates the table of sda onto sdb)

sgdisk -R /dev/sdb /dev/sda

Give the copy new random disk and partition GUIDs so they do not clash with those of sda

sgdisk -G /dev/sdb
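
Since these sgdisk commands overwrite the partition table of the target, it may be worth confirming first that the target is not a member of any running array. A minimal sketch: disk_in_md is a hypothetical name, and the pattern assumes sdX-style device names as used on this page.

```shell
# Succeed (exit status 0) if the given disk, or one of its partitions,
# appears as an array member in /proc/mdstat text read from stdin.
# Assumption: sdX-style names, where partitions are the disk name plus digits.
disk_in_md() {
    grep -Eq "(^|[[:space:]])$1[0-9]*\\["
}
```

Usage: disk_in_md sdb < /proc/mdstat && echo "sdb is still in use by md"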

Add to array

mdadm --manage /dev/md0 --add /dev/sdb1
mdadm --manage /dev/md1 --add /dev/sdb2

Status

cat /proc/mdstat

Personalities : [raid1]
md1 : active raid1 sdb2[2] sda2[0]
      2929608960 blocks super 1.2 [2/1] [U_]
      [>....................]  recovery =  0.2% (7099008/2929608960) finish=742.6min speed=65584K/sec

md0 : active raid1 sdb1[2] sda1[0]
      523968 blocks super 1.2 [2/2] [UU]

unused devices: <none>
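
The finish estimate on the recovery line is the remaining block count divided by the current speed. The sketch below redoes that arithmetic from the mdstat text; md_eta_minutes is a hypothetical name, and the field layout is assumed to match the output above.

```shell
# Estimate the remaining rebuild time in whole minutes.
# Reads /proc/mdstat text on stdin and prints one number per rebuilding array.
md_eta_minutes() {
    awk '/recovery =|resync =/ {
        speed = 0
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^\([0-9]+\/[0-9]+\)$/) {
                blocks = $i
                gsub(/[()]/, "", blocks)
                split(blocks, part, "/")      # part[1] = done, part[2] = total
            }
            if ($i ~ /^speed=/) {
                speed = $i
                sub(/^speed=/, "", speed)
                sub(/K\/sec$/, "", speed)     # speed in K per second
            }
        }
        if (speed + 0 > 0)
            printf "%d\n", (part[2] - part[1]) / speed / 60
    }'
}
```

For the md1 line above this prints 742, in line with the finish=742.6min reported by the kernel. Usage: md_eta_minutes < /proc/mdstat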

dmesg

[126629.165345] md: bind<sdb1>
[126629.213425] RAID1 conf printout:
[126629.213428]  --- wd:1 rd:2
[126629.213430]  disk 0, wo:0, o:1, dev:sda1
[126629.213432]  disk 1, wo:1, o:1, dev:sdb1
[126629.213580] md: recovery of RAID array md0
[126629.213606] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[126629.213624] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
[126629.213659] md: using 128k window, over a total of 523968k.
[126636.136690] md: md0: recovery done.
[126636.247571] RAID1 conf printout:
[126636.247574]  --- wd:2 rd:2
[126636.247577]  disk 0, wo:0, o:1, dev:sda1
[126636.247578]  disk 1, wo:0, o:1, dev:sdb1
[126715.311174] md: bind<sdb2>
[126715.407833] RAID1 conf printout:
[126715.407836]  --- wd:1 rd:2
[126715.407838]  disk 0, wo:0, o:1, dev:sda2
[126715.407839]  disk 1, wo:1, o:1, dev:sdb2
[126715.407949] md: recovery of RAID array md1
[126715.407971] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[126715.407989] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
[126715.408017] md: using 128k window, over a total of 2929608960k.
...

Bootloader

Install GRUB on the new drive as well, so the system can still boot if sda fails.

grub-install /dev/sdb

Off-site

mdadm --assemble /dev/mdX /dev/sdX