Fakenology DS918 – replacing hard drives

Summary

This post describes how to shorten the time and reduce the wear on the drives of a Synology NAS when replacing all of its disks. My method delays the “reshaping” phase until the last drive has been inserted (at which point you can let the NAS do its thing) by manually removing the SHR expansion partition that is created each time a new disk is inserted.

Disk replacement in DS918

Time to fix this device: it has been a long time since one of the drives in this unit failed (everything is backed up, so I could simply have replaced all disks and started over).

This unit used the surviving 3TB drives from my DS1517 after two of them had failed, and I later replaced all the disks in that unit with 14TB ones to get more storage space. This is documented in Inner secrets of Synology Hybrid RAID (SHR) – Part 1.

I have written a short summary as a reply in this thread on Reddit: Replacing all drives with larger drives, should I expect it to progressively take longer for repairs with each new larger drive that is swapped in?

Another post on this topic in the Synology community forum: Replacing all Disks: Hot Swap & Rebuild or Recreate

Replacing the first drive (the failed one)

To make sure I correctly identified the drive that had to be replaced, I checked the logs, RAID status and disk status before pulling it. I already knew that /dev/sdd was the one that had failed, so I needed to find out which slot it was fitted in (the fourth slot, as expected, but this should always be verified):

dmesg output (filtered)
This confirms the problems with /dev/sdd:

[    5.797428] sd 3:0:0:0: [sdd] 5860533168 512-byte logical blocks: (3.00 TB/2.73 TiB)
[    5.797439] sd 3:0:0:0: [sdd] 4096-byte physical blocks
[    5.797656] sd 3:0:0:0: [sdd] Write Protect is off
[    5.797666] sd 3:0:0:0: [sdd] Mode Sense: 00 3a 00 00
[    5.797767] sd 3:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    5.869271]  sdd: sdd1 sdd2 sdd5
[    5.870466] sd 3:0:0:0: [sdd] Attached SCSI disk
[    7.851964] md: invalid raid superblock magic on sdd5
[    7.857051] md: sdd5 does not have a valid v0.90 superblock, not importing!
[    7.857169] md:  adding sdd1 ...
[    7.857175] md: sdd2 has different UUID to sda1
[    7.857205] md: bind
[    7.857336] md: running: 
[    7.857368] md: kicking non-fresh sdd1 from array!
[    7.857376] md: unbind
[    7.862026] md: export_rdev(sdd1)
[    7.890854] md:  adding sdd2 ...
[    7.893244] md: bind
[    7.893365] md: running: 
[   33.692736] md: bind
[   33.693189] md: kicking non-fresh sdd5 from array!
[   33.693209] md: unbind
[   33.696096] md: export_rdev(sdd5)

/proc/mdstat
The content of /proc/mdstat also confirms that /dev/sdd is missing from the main storage array (md2) and from md0 (the DSM system partition):

Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid5 sda5[0] sdc5[2] sdb5[1]
      8776306368 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/3] [UUU_]

md1 : active raid1 sda2[0] sdb2[1] sdc2[2] sdd2[3]
      2097088 blocks [16/4] [UUUU____________]

md0 : active raid1 sda1[0] sdb1[1] sdc1[3]
      2490176 blocks [16/3] [UU_U____________]

As seen above, the last device of md2 is marked as missing. The bracketed numbers on the line “md2 : active raid5 sda5[0] sdc5[2] sdb5[1]” give each member’s position in “[UUU_]”, so this translates to [sda5 sdb5 sdc5 -].
The same goes for md0, where a different position is missing: “md0 : active raid1 sda1[0] sdb1[1] sdc1[3]” translates to [sda1 sdb1 - sdc1].
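
This bracket translation can also be scripted. Below is a small helper of my own (not from DSM or this post’s original procedure) that orders the members of an mdstat device line by their role number, printing “-” for a missing slot; it assumes plain mdstat formatting with no “(F)” failure markers appended to the member names:

```shell
# order_members: given an mdstat device line and the slot count, print the
# members ordered by their role number (the [n] suffix), with "-" for any
# missing slot. Hypothetical helper, not part of DSM.
order_members() {
  line=$1; slots=$2; res=""; i=0
  while [ "$i" -lt "$slots" ]; do
    # isolate the token whose role number is [i], e.g. "sda5[0]" -> "sda5"
    member=$(printf '%s\n' "$line" | tr ' ' '\n' | grep "\[$i\]\$" | cut -d'[' -f1)
    res="$res${res:+ }${member:--}"
    i=$((i + 1))
  done
  printf '%s\n' "$res"
}

order_members "md2 : active raid5 sda5[0] sdc5[2] sdb5[1]" 4   # prints: sda5 sdb5 sdc5 -
```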

smartctl output
I used smartctl to find out which physical drives are mapped to /dev/sd[a-d]:

root@DS918:~# smartctl --all /dev/sda
smartctl 6.5 (build date May  7 2020) [x86_64-linux-4.4.59+] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HDN724030ALE640
...
Serial number:        PK2238P3G3B8VJ
root@DS918:~# smartctl --all /dev/sdb
smartctl 6.5 (build date May  7 2020) [x86_64-linux-4.4.59+] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HDN724030ALE640
...
Serial number:        PK2234P9JGDEXY
root@DS918:~# smartctl --all /dev/sdc
smartctl 6.5 (build date May  7 2020) [x86_64-linux-4.4.59+] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HDN724030ALE640
...
Serial number:        PK2238P3G343GJ
root@DS918:~# smartctl --all /dev/sdd
smartctl 6.5 (build date May  7 2020) [x86_64-linux-4.4.59+] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.14 (AF)
Device Model:     ST3000DM001-1ER166
Serial Number:    W50090JM

As the fourth drive was the only Seagate, it was easy to shut down the unit and see which physical drive it was; in general, the smartctl output lets you identify each drive by matching the serial number printed on its label.
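
That lookup can be scripted by grepping the serial out of the smartctl output. The helper below is my own sketch (assuming smartctl is on the PATH, as it is on DSM); the match is case-insensitive because ATA drives report “Serial Number:” while SCSI-style output uses “Serial number:”:

```shell
# serial_of: extract the serial number from `smartctl -i` (or `--all`) output
# read on stdin.
serial_of() { grep -i '^serial number' | awk -F': *' '{print $2}'; }

# On the NAS itself (as root), list all four bays:
#   for d in /dev/sd[a-d]; do printf '%s %s\n' "$d" "$(smartctl -i "$d" | serial_of)"; done
```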

The full smartctl output for the failed drive:

root@DS918:~# smartctl --all /dev/sdd
smartctl 6.5 (build date May  7 2020) [x86_64-linux-4.4.59+] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.14 (AF)
Device Model:     ST3000DM001-1ER166
Serial Number:    W50090JM
LU WWN Device Id: 5 000c50 07c46d0aa
Firmware Version: CC43
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Fri May  2 15:45:31 2025 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 113) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                (  122) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 332) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x1085) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME                                                   FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate                                              0x000f   086   084   006    Pre-fail  Always       -       221839714
  3 Spin_Up_Time                                                     0x0003   092   092   000    Pre-fail  Always       -       0
  4 Start_Stop_Count                                                 0x0032   100   100   020    Old_age   Always       -       84
  5 Reallocated_Sector_Ct                                            0x0033   098   098   010    Pre-fail  Always       -       1968
  7 Seek_Error_Rate                                                  0x000f   090   060   030    Pre-fail  Always       -       998592914
  9 Power_On_Hours                                                   0x0032   051   051   000    Old_age   Always       -       43677
 10 Spin_Retry_Count                                                 0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count                                                0x0032   100   100   020    Old_age   Always       -       34
183 Runtime_Bad_Block                                                0x0032   099   099   000    Old_age   Always       -       1
184 End-to-End_Error                                                 0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect                                               0x0032   001   001   000    Old_age   Always       -       2714
188 Command_Timeout                                                  0x0032   100   097   000    Old_age   Always       -       4 7 8
189 High_Fly_Writes                                                  0x003a   099   099   000    Old_age   Always       -       1
190 Airflow_Temperature_Cel                                          0x0022   070   062   045    Old_age   Always       -       30 (Min/Max 27/38)
191 G-Sense_Error_Rate                                               0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count                                          0x0032   100   100   000    Old_age   Always       -       8
193 Load_Cycle_Count                                                 0x0032   065   065   000    Old_age   Always       -       70652
194 Temperature_Celsius                                              0x0022   030   040   000    Old_age   Always       -       30 (0 16 0 0 0)
197 Current_Pending_Sector                                           0x0012   001   001   000    Old_age   Always       -       49760
198 Offline_Uncorrectable                                            0x0010   001   001   000    Old_age   Offline      -       49760
199 UDMA_CRC_Error_Count                                             0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours                                                0x0000   100   253   000    Old_age   Offline      -       38460h+46m+12.675s
241 Total_LBAs_Written                                               0x0000   100   253   000    Old_age   Offline      -       15195564747
242 Total_LBAs_Read                                                  0x0000   100   253   000    Old_age   Offline      -       1092464909408

SMART Error Log Version: 1
ATA Error Count: 2713 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 2713 occurred at disk power-on lifetime: 32907 hours (1371 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 53 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 ff ff ff 4f 00  33d+13:49:54.056  READ FPDMA QUEUED
  ef 10 02 00 00 00 a0 00  33d+13:49:54.048  SET FEATURES [Enable SATA feature]
  27 00 00 00 00 00 e0 00  33d+13:49:54.048  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 a0 00  33d+13:49:54.047  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00  33d+13:49:54.047  SET FEATURES [Set transfer mode]

Error 2712 occurred at disk power-on lifetime: 32907 hours (1371 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 53 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 ff ff ff 4f 00  33d+13:49:49.959  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  33d+13:49:49.958  READ FPDMA QUEUED
  ef 10 02 00 00 00 a0 00  33d+13:49:49.949  SET FEATURES [Enable SATA feature]
  27 00 00 00 00 00 e0 00  33d+13:49:49.949  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 a0 00  33d+13:49:49.949  IDENTIFY DEVICE

Error 2711 occurred at disk power-on lifetime: 32907 hours (1371 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 53 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 ff ff ff 4f 00  33d+13:49:46.267  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  33d+13:49:46.267  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  33d+13:49:46.267  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  33d+13:49:46.266  READ FPDMA QUEUED
  ef 10 02 00 00 00 a0 00  33d+13:49:46.258  SET FEATURES [Enable SATA feature]

Error 2710 occurred at disk power-on lifetime: 32907 hours (1371 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 53 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 ff ff ff 4f 00  33d+13:49:41.370  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  33d+13:49:41.370  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  33d+13:49:41.370  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  33d+13:49:41.370  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  33d+13:49:41.369  READ FPDMA QUEUED

Error 2709 occurred at disk power-on lifetime: 32907 hours (1371 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 53 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 ff ff ff 4f 00  33d+13:49:36.656  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  33d+13:49:36.656  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  33d+13:49:36.656  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  33d+13:49:36.656  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  33d+13:49:36.656  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       10%     43133         4084368632
# 2  Short offline       Completed: read failure       10%     42391         4084368632
# 3  Short offline       Completed: read failure       40%     41719         4084368632
# 4  Short offline       Completed: read failure       10%     40975         4084368632
# 5  Short offline       Completed: read failure       80%     40231         4084368632
# 6  Short offline       Completed: read failure       10%     39511         4084368632
# 7  Short offline       Completed: read failure       10%     38766         4084368632
# 8  Short offline       Completed: read failure       10%     32938         4084368632
# 9  Short offline       Completed without error       00%     32193         -
#10  Short offline       Completed without error       00%     31449         -
#11  Short offline       Completed without error       00%     30743         -
#12  Short offline       Completed without error       00%     29998         -
#13  Short offline       Completed without error       00%     29278         -
#14  Short offline       Completed without error       00%     28534         -
#15  Short offline       Completed without error       00%     27790         -
#16  Short offline       Completed without error       00%     27070         -
#17  Short offline       Completed without error       00%     26328         -
#18  Short offline       Completed without error       00%     25608         -
#19  Short offline       Completed without error       00%     24865         -
#20  Short offline       Completed without error       00%     24196         -
#21  Short offline       Completed without error       00%     23452         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

I powered down the unit (from the DSM UI), identified and removed the broken drive, and started the unit up again without the replacement drive inserted. When I then inserted the replacement drive, DSM didn’t immediately see it, so I rebooted the unit to make it appear as an unused drive, and then selected “Repair” under “Storage Manager/Storage Pool”.

The rebuilding process – first drive

I checked on the rebuilding process a few times, but did not note the exact timings; I just let it finish during the night:

root@DS918:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid5 sdd5[4] sda5[0] sdc5[2] sdb5[1]
      8776306368 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/3] [UUU_]
      [>....................]  recovery =  0.5% (15458048/2925435456) finish=534.6min speed=90708K/sec

md1 : active raid1 sdd2[3] sdc2[2] sdb2[1] sda2[0]
      2097088 blocks [16/4] [UUUU____________]

md0 : active raid1 sdd1[2] sda1[0] sdb1[1] sdc1[3]
      2490176 blocks [16/4] [UUUU____________]

unused devices: <none>
root@DS918:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid5 sdd5[4] sda5[0] sdc5[2] sdb5[1]
      8776306368 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/3] [UUU_]
      [=>...................]  recovery =  9.6% (282353128/2925435456) finish=434.7min speed=101335K/sec

md1 : active raid1 sdd2[3] sdc2[2] sdb2[1] sda2[0]
      2097088 blocks [16/4] [UUUU____________]

md0 : active raid1 sdd1[2] sda1[0] sdb1[1] sdc1[3]
      2490176 blocks [16/4] [UUUU____________]

unused devices: <none>
root@DS918:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid5 sdd5[4] sda5[0] sdc5[2] sdb5[1]
      8776306368 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/3] [UUU_]
      [=======>.............]  recovery = 38.2% (1118697376/2925435456) finish=525.1min speed=57343K/sec

md1 : active raid1 sdd2[3] sdc2[2] sdb2[1] sda2[0]
      2097088 blocks [16/4] [UUUU____________]

md0 : active raid1 sdd1[2] sda1[0] sdb1[1] sdc1[3]
      2490176 blocks [16/4] [UUUU____________]

unused devices: <none>
root@DS918:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid5 sdd5[4] sda5[0] sdc5[2] sdb5[1]
      8776306368 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/3] [UUU_]
      [=======>.............]  recovery = 39.4% (1152686672/2925435456) finish=402.3min speed=73435K/sec

md1 : active raid1 sdd2[3] sdc2[2] sdb2[1] sda2[0]
      2097088 blocks [16/4] [UUUU____________]

md0 : active raid1 sdd1[2] sda1[0] sdb1[1] sdc1[3]
      2490176 blocks [16/4] [UUUU____________]

unused devices: <none>
root@DS918:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid5 sdd5[4] sda5[0] sdc5[2] sdb5[1]
      8776306368 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/3] [UUU_]
      [=========>...........]  recovery = 49.3% (1443636996/2925435456) finish=297.2min speed=83074K/sec

md1 : active raid1 sdd2[3] sdc2[2] sdb2[1] sda2[0]
      2097088 blocks [16/4] [UUUU____________]

md0 : active raid1 sdd1[2] sda1[0] sdb1[1] sdc1[3]
      2490176 blocks [16/4] [UUUU____________]

unused devices: <none>
root@DS918:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid5 sdd5[4] sda5[0] sdc5[2] sdb5[1]
      8776306368 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

md1 : active raid1 sdd2[3] sdc2[2] sdb2[1] sda2[0]
      2097088 blocks [16/4] [UUUU____________]

md0 : active raid1 sdd1[2] sda1[0] sdb1[1] sdc1[3]
      2490176 blocks [16/4] [UUUU____________]

unused devices: <none>
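
Instead of re-running `cat /proc/mdstat` by hand, the progress line can be extracted with a small filter (my own sketch, matching the mdstat format shown above):

```shell
# recovery_status: print the recovery percentage and ETA from mdstat-style
# text on stdin, one line per resyncing array.
recovery_status() { grep -o 'recovery = *[0-9.]*%.*finish=[0-9.]*min'; }

# e.g. poll once a minute:
#   while sleep 60; do recovery_status < /proc/mdstat; done
```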

Partition layout before and after first disk swap

With the broken disk removed, the partition layout of the remaining disks looked like this:

root@DS918:~# sfdisk -l
/dev/sda1                  2048         4982527         4980480  fd
/dev/sda2               4982528         9176831         4194304  fd
/dev/sda5               9453280      5860326239      5850872960  fd

/dev/sdb1                  2048         4982527         4980480  fd
/dev/sdb2               4982528         9176831         4194304  fd
/dev/sdb5               9453280      5860326239      5850872960  fd

/dev/sdc1                  2048         4982527         4980480  fd
/dev/sdc2               4982528         9176831         4194304  fd
/dev/sdc5               9453280      5860326239      5850872960  fd

When the rebuild process had started, the new disk (/dev/sdd) got the same partition layout as the others, plus an extra partition (/dev/sdd6) covering the remaining space, which is unused (and unusable) for now:

root@DS918:~# sfdisk -l
/dev/sda1                  2048         4982527         4980480  fd
/dev/sda2               4982528         9176831         4194304  fd
/dev/sda5               9453280      5860326239      5850872960  fd

/dev/sdb1                  2048         4982527         4980480  fd
/dev/sdb2               4982528         9176831         4194304  fd
/dev/sdb5               9453280      5860326239      5850872960  fd

/dev/sdc1                  2048         4982527         4980480  fd
/dev/sdc2               4982528         9176831         4194304  fd
/dev/sdc5               9453280      5860326239      5850872960  fd

/dev/sdd1                  2048         4982527         4980480  fd
/dev/sdd2               4982528         9176831         4194304  fd
/dev/sdd5               9453280      5860326239      5850872960  fd
/dev/sdd6            5860342336     15627846239      9767503904  fd
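
That extra partition (/dev/sdd6 here) is the one the summary talks about: removing it after each rebuild postpones the SHR reshape until the last drive is in place. The commands below are an illustrative sketch of that step, not DSM functionality; double-check the partition number with `sfdisk -l` first, and never delete a partition that appears in /proc/mdstat:

```shell
# in_use: succeed if the given partition name occurs in mdstat text on stdin.
in_use() { grep -qw "$1"; }

# drop_expansion_partition: guarded delete of an expansion partition.
# Assumes sfdisk is new enough for --delete (util-linux >= 2.26); on older
# systems `parted /dev/sdd rm 6` does the same job.
drop_expansion_partition() {
  part=$1; disk=$2; num=$3
  if in_use "$part" < /proc/mdstat; then
    echo "$part is part of an md array - not deleting" >&2
    return 1
  fi
  sfdisk --delete "$disk" "$num"
}

# Usage on the NAS (as root), once the rebuild onto the new disk is done:
#   drop_expansion_partition sdd6 /dev/sdd 6
```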

Second disk pulled out

Now that the first disk had been replaced and the RAID rebuilt, I simply pulled out the second disk to be replaced.

root@DS918:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid5 sdd5[4] sda5[0] sdb5[1]
      8776306368 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/3] [UU_U]

md1 : active raid1 sdd2[3] sdb2[1] sda2[0]
      2097088 blocks [16/3] [UU_U____________]

md0 : active raid1 sdd1[2] sda1[0] sdb1[1]
      2490176 blocks [16/3] [UUU_____________]

unused devices: <none>
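
Before pulling a healthy drive like this, it is worth verifying that the previous rebuild really finished. Here is a small check of my own (not a DSM tool) that succeeds only when the md2 status bracket shows all members present:

```shell
# fully_synced: read /proc/mdstat-style text on stdin and succeed only if the
# md2 status line ends with a complete bracket such as [UUUU] (no "_" holes).
fully_synced() { grep -A1 '^md2 :' | grep -q '\[UU*\] *$'; }

# e.g.:  fully_synced < /proc/mdstat && echo "safe to pull the next drive"
```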

When I inserted the replacement disk, it was detected right away this time (since the slot had been active when the old drive was pulled).

[53977.141054] ata3: link reset sucessfully clear error flags
[53977.157449] ata3.00: ATA-9: ST8000AS0002-1NA17Z, AR17, max UDMA/133
[53977.157458] ata3.00: 15628053168 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
[53977.157462] ata3.00: SN:            Z841474Z
[53977.158764] ata3.00: configured for UDMA/133
[53977.158779] ata3.00: Write Cache is enabled
[53977.163030] ata3: EH complete
[53977.164533] scsi 2:0:0:0: Direct-Access     ATA      ST8000AS0002-1NA17Z      AR17 PQ: 0 ANSI: 5
[53977.165256] sd 2:0:0:0: [sdc] 15628053168 512-byte logical blocks: (8.00 TB/7.28 TiB)
[53977.165273] sd 2:0:0:0: [sdc] 4096-byte physical blocks
[53977.165298] sd 2:0:0:0: Attached scsi generic sg2 type 0
[53977.165534] sd 2:0:0:0: [sdc] Write Protect is off
[53977.165547] sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
[53977.165662] sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[53977.217123]  sdc: sdc1 sdc2
[53977.218062] sd 2:0:0:0: [sdc] Attached SCSI disk

Below is the full dmesg output since the unit was started. The timestamps show that the rebuild of md2 took about 10 hours (“recovery done” at roughly 36561 seconds of uptime), more or less what the early /proc/mdstat output predicted:

[  323.429093] perf interrupt took too long (5018 > 5000), lowering kernel.perf_event_max_sample_rate to 25000
[36561.042407] md: md2: recovery done.
[36561.200565] md: md2: set sdd5 to auto_remap [0]
[36561.200576] md: md2: set sda5 to auto_remap [0]
[36561.200581] md: md2: set sdc5 to auto_remap [0]
[36561.200585] md: md2: set sdb5 to auto_remap [0]
[36561.405942] RAID conf printout:
[36561.405954]  --- level:5 rd:4 wd:4
[36561.405959]  disk 0, o:1, dev:sda5
[36561.405963]  disk 1, o:1, dev:sdb5
[36561.405967]  disk 2, o:1, dev:sdc5
[36561.405971]  disk 3, o:1, dev:sdd5
[53370.783902] ata3: device unplugged sstatus 0x0
[53370.783962] ata3: exception Emask 0x10 SAct 0x0 SErr 0x4010000 action 0xe frozen
[53370.791503] ata3: irq_stat 0x00400040, connection status changed
[53370.797628] ata3: SError: { PHYRdyChg DevExch }
[53370.802258] ata3: hard resetting link
[53371.525046] ata3: SATA link down (SStatus 0 SControl 300)
[53371.525054] ata3: No present pin info for SATA link down event
[53373.531047] ata3: hard resetting link
[53373.836045] ata3: SATA link down (SStatus 0 SControl 300)
[53373.836054] ata3: No present pin info for SATA link down event
[53373.841917] ata3: limiting SATA link speed to 1.5 Gbps
[53375.841041] ata3: hard resetting link
[53376.146048] ata3: SATA link down (SStatus 0 SControl 310)
[53376.146056] ata3: No present pin info for SATA link down event
[53376.151920] ata3.00: disabled
[53376.151928] ata3.00: already disabled (class=0x2)
[53376.151933] ata3.00: already disabled (class=0x2)
[53376.151958] ata3: EH complete
[53376.151980] ata3.00: detaching (SCSI 2:0:0:0)
[53376.152704] sd 2:0:0:0: [sdc] tag#21 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
[53376.152717] sd 2:0:0:0: [sdc] tag#21 CDB: opcode=0x35 35 00 00 00 00 00 00 00 00 00
[53376.152730] blk_update_request: I/O error, dev sdc, sector in range 4980736 + 0-2(12)
[53376.153061] md: super_written gets error=-5
[53376.153061] syno_md_error: sdc1 has been removed
[53376.153061] raid1: Disk failure on sdc1, disabling device.
                Operation continuing on 3 devices
[53376.177112] sd 2:0:0:0: [sdc] Synchronizing SCSI cache
[53376.177232] sd 2:0:0:0: [sdc] Synchronize Cache(10) failed: Result: hostbyte=0x04 driverbyte=0x00
[53376.177238] sd 2:0:0:0: [sdc] Stopping disk
[53376.177269] sd 2:0:0:0: [sdc] Start/Stop Unit failed: Result: hostbyte=0x04 driverbyte=0x00
[53376.183106] RAID1 conf printout:
[53376.183118]  --- wd:3 rd:16
[53376.183125]  disk 0, wo:0, o:1, dev:sda1
[53376.183130]  disk 1, wo:0, o:1, dev:sdb1
[53376.183135]  disk 2, wo:0, o:1, dev:sdd1
[53376.183140]  disk 3, wo:1, o:0, dev:sdc1
[53376.184338] SynoCheckRdevIsWorking (11054): remove active disk sdc5 from md2 raid_disks 4 mddev->degraded 0 mddev->level 5
[53376.184376] syno_md_error: sdc5 has been removed
[53376.184387] md/raid:md2: Disk failure on sdc5, disabling device.
               md/raid:md2: Operation continuing on 3 devices.
[53376.196472] SynoCheckRdevIsWorking (11054): remove active disk sdc2 from md1 raid_disks 16 mddev->degraded 12 mddev->level 1
[53376.196491] syno_md_error: sdc2 has been removed
[53376.196497] raid1: Disk failure on sdc2, disabling device.
                Operation continuing on 3 devices
[53376.198033] RAID1 conf printout:
[53376.198035]  --- wd:3 rd:16
[53376.198038]  disk 0, wo:0, o:1, dev:sda1
[53376.198040]  disk 1, wo:0, o:1, dev:sdb1
[53376.198042]  disk 2, wo:0, o:1, dev:sdd1
[53376.206669] syno_hot_remove_disk (10954): cannot remove active disk sdc2 from md1 ... rdev->raid_disk 2 pending 0
[53376.330347] md: ioctl lock interrupted, reason -4, cmd -2145908384
[53376.446860] RAID conf printout:
[53376.446869]  --- level:5 rd:4 wd:3
[53376.446874]  disk 0, o:1, dev:sda5
[53376.446879]  disk 1, o:1, dev:sdb5
[53376.446883]  disk 2, o:0, dev:sdc5
[53376.446886]  disk 3, o:1, dev:sdd5
[53376.454062] RAID conf printout:
[53376.454072]  --- level:5 rd:4 wd:3
[53376.454077]  disk 0, o:1, dev:sda5
[53376.454082]  disk 1, o:1, dev:sdb5
[53376.454086]  disk 3, o:1, dev:sdd5
[53376.460958] SynoCheckRdevIsWorking (11054): remove active disk sdc1 from md0 raid_disks 16 mddev->degraded 13 mddev->level 1
[53376.460968] RAID1 conf printout:
[53376.460972]  --- wd:3 rd:16
[53376.460978]  disk 0, wo:0, o:1, dev:sda2
[53376.460984]  disk 1, wo:0, o:1, dev:sdb2
[53376.460987] md: unbind
[53376.460992]  disk 2, wo:1, o:0, dev:sdc2
[53376.460998]  disk 3, wo:0, o:1, dev:sdd2
[53376.467047] RAID1 conf printout:
[53376.467056]  --- wd:3 rd:16
[53376.467062]  disk 0, wo:0, o:1, dev:sda2
[53376.467066]  disk 1, wo:0, o:1, dev:sdb2
[53376.467070]  disk 3, wo:0, o:1, dev:sdd2
[53376.470067] md: export_rdev(sdc1)
[53376.475613] md: unbind
[53376.480044] md: export_rdev(sdc5)
[53377.207047] SynoCheckRdevIsWorking (11054): remove active disk sdc2 from md1 raid_disks 16 mddev->degraded 13 mddev->level 1
[53377.207072] md: unbind
[53377.212034] md: export_rdev(sdc2)
[53958.581765] ata3: device plugged sstatus 0x1
[53958.581811] ata3: exception Emask 0x10 SAct 0x0 SErr 0x4000000 action 0xe frozen
[53958.589278] ata3: irq_stat 0x00000040, connection status changed
[53958.595322] ata3: SError: { DevExch }
[53958.599069] ata3: hard resetting link
[53964.371031] ata3: link is slow to respond, please be patient (ready=0)
[53968.757039] ata3: softreset failed (device not ready)
[53968.762111] ata3: SRST fail, set srst fail flag
[53968.766667] ata3: hard resetting link
[53974.538032] ata3: link is slow to respond, please be patient (ready=0)
[53977.141041] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[53977.141054] ata3: link reset sucessfully clear error flags
[53977.157449] ata3.00: ATA-9: ST8000AS0002-1NA17Z, AR17, max UDMA/133
[53977.157458] ata3.00: 15628053168 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
[53977.157462] ata3.00: SN:            Z841474Z
[53977.158764] ata3.00: configured for UDMA/133
[53977.158779] ata3.00: Write Cache is enabled
[53977.163030] ata3: EH complete
[53977.164533] scsi 2:0:0:0: Direct-Access     ATA      ST8000AS0002-1NA17Z      AR17 PQ: 0 ANSI: 5
[53977.165256] sd 2:0:0:0: [sdc] 15628053168 512-byte logical blocks: (8.00 TB/7.28 TiB)
[53977.165273] sd 2:0:0:0: [sdc] 4096-byte physical blocks
[53977.165298] sd 2:0:0:0: Attached scsi generic sg2 type 0
[53977.165534] sd 2:0:0:0: [sdc] Write Protect is off
[53977.165547] sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
[53977.165662] sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[53977.217123]  sdc: sdc1 sdc2
[53977.218062] sd 2:0:0:0: [sdc] Attached SCSI disk

Even though the new drive had been detected, I rebooted the unit “just to be sure”, then initiated the repair process from DSM. After letting it run for a while, I checked the status:

root@DS918:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid5 sdc5[5] sda5[0] sdd5[4] sdb5[1]
      8776306368 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/3] [UU_U]
      [=>...................]  recovery =  9.1% (266365608/2925435456) finish=564.1min speed=78549K/sec

md1 : active raid1 sdc2[3] sdd2[2] sdb2[1] sda2[0]
      2097088 blocks [16/4] [UUUU____________]

md0 : active raid1 sdc1[3] sda1[0] sdb1[1] sdd1[2]
      2490176 blocks [16/4] [UUUU____________]

unused devices: <none>

Partition layout after the second disk swap

After changing the second drive (/dev/sdc) and starting the rebuild process, the new disk got the same partition layout as the first replaced one (/dev/sdd):

root@DS918:~# sfdisk -l
/dev/sda1                  2048         4982527         4980480  fd
/dev/sda2               4982528         9176831         4194304  fd
/dev/sda5               9453280      5860326239      5850872960  fd

/dev/sdb1                  2048         4982527         4980480  fd
/dev/sdb2               4982528         9176831         4194304  fd
/dev/sdb5               9453280      5860326239      5850872960  fd

/dev/sdc1                  2048         4982527         4980480  fd
/dev/sdc2               4982528         9176831         4194304  fd
/dev/sdc5               9453280      5860326239      5850872960  fd
/dev/sdc6            5860342336     15627846239      9767503904  fd

/dev/sdd1                  2048         4982527         4980480  fd
/dev/sdd2               4982528         9176831         4194304  fd
/dev/sdd5               9453280      5860326239      5850872960  fd
/dev/sdd6            5860342336     15627846239      9767503904  fd

The rebuild will again take about 10 hours to finish.
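
That 10-hour figure can be sanity-checked from the numbers /proc/mdstat reported during the first rebuild: the recovery counter is in 1 KiB blocks and the speed in KiB/s, so the remaining time is simple arithmetic.

```shell
# Remaining rebuild time from /proc/mdstat: (total - done) blocks / speed.
# Figures taken from the first rebuild above (9.1% done at 78549 KiB/s).
total=2925435456        # per-device blocks (1 KiB each) to recover
done_blocks=266365608   # blocks already recovered
speed=78549             # KiB/s
secs=$(( (total - done_blocks) / speed ))
printf 'about %d minutes (roughly %d hours) left\n' $(( secs / 60 )) $(( secs / 3600 ))
```

This prints about 564 minutes, matching the finish=564.1min that mdstat itself estimated.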

What’s expected to happen next

Because there are now two partitions of unused space on the new drives, the storage will be expanded: the RAID5 volume on sd[a-d]5 plus the additional space of a RAID1 mirror on sdc6 and sdd6. There seems to be no way of stopping this stupidity, even though it will all have to be redone after swapping the next disk. Just sit back and wait for the expansion of the mdraid volume.

Unless…
It might be a time saver to delete the unused partition on the first replaced disk, so that the storage cannot be expanded (what happens will depend on whether DSM notices the unpartitioned space and still creates that mirror of sdd6 + sdc6).
There’s only one way to find out:

root@DS918:~# parted /dev/sdd
GNU Parted 3.2
Using /dev/sdd
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) rm 6
rm 6
(parted) print
print
Model: ATA ST8000AS0002-1NA (scsi)
Disk /dev/sdd: 8002GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:

Number  Start   End     Size    File system     Name  Flags
 1      1049kB  2551MB  2550MB  ext4                  raid
 2      2551MB  4699MB  2147MB  linux-swap(v1)        raid
 5      4840MB  3000GB  2996GB                        raid

(parted) quit

10 hours later…
Right at about the 10-hour mark, md2 was almost rebuilt. No problems so far, but what follows will be interesting, since I removed that extra partition (which would otherwise have become part of the LV used for storage). I really hope the NAS will be ready to accept the next disk in the replacement procedure right after the sync is finished.

root@DS918:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid5 sdc5[5] sda5[0] sdd5[4] sdb5[1]
      8776306368 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/3] [UU_U]
      [==================>..]  recovery = 94.4% (2763366352/2925435456) finish=35.3min speed=76346K/sec

md1 : active raid1 sdc2[3] sdd2[2] sdb2[1] sda2[0]
      2097088 blocks [16/4] [UUUU____________]

md0 : active raid1 sdc1[3] sda1[0] sdb1[1] sdd1[2]
      2490176 blocks [16/4] [UUUU____________]

unused devices: <none>

Results after the second disk swap

As hoped, no automatic LV change was initiated, saving me many hours (at least for now) by skipping a reshape operation that would otherwise have to be repeated at least once more while swapping out the remaining disks.

root@DS918:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid5 sdc5[5] sda5[0] sdd5[4] sdb5[1]
      8776306368 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

md1 : active raid1 sdc2[3] sdd2[2] sdb2[1] sda2[0]
      2097088 blocks [16/4] [UUUU____________]

md0 : active raid1 sdc1[3] sda1[0] sdb1[1] sdd1[2]
      2490176 blocks [16/4] [UUUU____________]

unused devices: <none>

Partitions on /dev/sdc at this stage:

root@DS918:~# sfdisk -l
/dev/sda1                  2048         4982527         4980480  fd
/dev/sda2               4982528         9176831         4194304  fd
/dev/sda5               9453280      5860326239      5850872960  fd

/dev/sdb1                  2048         4982527         4980480  fd
/dev/sdb2               4982528         9176831         4194304  fd
/dev/sdb5               9453280      5860326239      5850872960  fd

/dev/sdc1                  2048         4982527         4980480  fd
/dev/sdc2               4982528         9176831         4194304  fd
/dev/sdc5               9453280      5860326239      5850872960  fd
/dev/sdc6            5860342336     15627846239      9767503904  fd

/dev/sdd1                  2048         4982527         4980480  fd
/dev/sdd2               4982528         9176831         4194304  fd
/dev/sdd5               9453280      5860326239      5850872960  fd

Replacing the third disk

I’m doing it exactly the same way as when I replaced the second disk:
- Pull out the drive
- Replace and check
- Reboot just to be sure
- Rebuild
- Remove the extra partition on sdc to prevent reshaping after rebuild

After the 3rd disk change had been accepted (to be resynced), some unexpected things happened. Even with the extra partition removed from sdd, DSM decided that it could make partition changes to get the most use out of the available disks:

root@DS918:~# sfdisk -l
/dev/sda1                  2048         4982527         4980480  fd
/dev/sda2               4982528         9176831         4194304  fd
/dev/sda5               9453280      5860326239      5850872960  fd

/dev/sdb1                  2048         4982527         4980480  fd
/dev/sdb2               4982528         9176831         4194304  fd
/dev/sdb5               9453280      5860326239      5850872960  fd
/dev/sdb6            5860342336     15627846239      9767503904  fd

/dev/sdc1                  2048         4982527         4980480  fd
/dev/sdc2               4982528         9176831         4194304  fd
/dev/sdc5               9453280      5860326239      5850872960  fd
/dev/sdc6            5860342336     15627846239      9767503904  fd

/dev/sdd1                  2048         4982527         4980480  fd
/dev/sdd2               4982528         9176831         4194304  fd
/dev/sdd5               9453280      5860326239      5850872960  fd
/dev/sdd6            5860342336     15627846239      9767503904  fd

The removed partition on sdd was recreated, and now sdb6, sdc6 and sdd6 will become a RAID5 that gets striped onto the storage LV. Not what I hoped for, but there was probably nothing I could have done to prevent it (I think all three extra partitions would have been created even if I had also removed the one from sdc).

Checking the mdraid status, I noticed that there might be some hope (again, by removing the extra partition on each of the disks that have been completely replaced):

root@DS918:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid5 sdb5[6] sda5[0] sdd5[4] sdc5[5]
      8776306368 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/3] [U_UU]
      [=>...................]  recovery =  8.5% (249970824/2925435456) finish=696.8min speed=63988K/sec

md1 : active raid1 sdb2[3] sdd2[2] sdc2[1] sda2[0]
      2097088 blocks [16/4] [UUUU____________]

md0 : active raid1 sdb1[1] sda1[0] sdc1[3] sdd1[2]
      2490176 blocks [16/4] [UUUU____________]

unused devices: <none>

As the new partitions are not in use yet, I just remove them from the disks (sdc and sdd) using parted.
After removing these partitions, the disks look the way I want them for now:
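
The removal can also be scripted instead of using an interactive parted session, via parted's -s (script) flag. A minimal sketch, written as a dry run that only prints the commands (drop the echo to actually run them):

```shell
# Print the parted commands that would delete partition 6 from each
# fully replaced disk (dry run; remove 'echo' to execute for real).
for disk in /dev/sdd /dev/sdc; do
    echo parted -s "$disk" rm 6
done
```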

root@DS918:~# sfdisk -l
/dev/sda1                  2048         4982527         4980480  fd
/dev/sda2               4982528         9176831         4194304  fd
/dev/sda5               9453280      5860326239      5850872960  fd

/dev/sdb1                  2048         4982527         4980480  fd
/dev/sdb2               4982528         9176831         4194304  fd
/dev/sdb5               9453280      5860326239      5850872960  fd
/dev/sdb6            5860342336     15627846239      9767503904  fd

/dev/sdc1                  2048         4982527         4980480  fd
/dev/sdc2               4982528         9176831         4194304  fd
/dev/sdc5               9453280      5860326239      5850872960  fd

/dev/sdd1                  2048         4982527         4980480  fd
/dev/sdd2               4982528         9176831         4194304  fd
/dev/sdd5               9453280      5860326239      5850872960  fd

On the next disk replacement (the last), I will let it expand the storage pool to use the free space from the new disks (as they are 8TB each and the old ones were 3TB, this will add 15TB to the volume).
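
The 15TB figure checks out against the sfdisk output: partition 6 is 9767503904 sectors of 512 bytes (about 5TB), and a 4-disk RAID5 yields three partitions' worth of net capacity.

```shell
# Net capacity gained by a 4-disk RAID5 over the new ~5 TB partitions.
sectors=9767503904                 # size of partition 6 in 512-byte sectors
part_bytes=$(( sectors * 512 ))    # bytes per partition
gain_bytes=$(( part_bytes * 3 ))   # RAID5: n-1 = 3 data partitions
echo "$(( part_bytes / 1000000000000 )) TB per partition, $(( gain_bytes / 1000000000000 )) TB gained"
```

This prints "5 TB per partition, 15 TB gained".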

Snapshots from DSM web UI

The first snapshot of the UI was done after replacing the third disk when something unexpected happened, but I include the story up to that point for those few interested in reading my stuff 🙂

These snapshots (taken while disk 3 was being rebuilt) are still a valid representation of how the unit was configured before the changes (disks 4, 3 and 2, as I began from the bottom with the broken one).

I began with a total volume of 8TB and replaced the failing drive with a new 8TB one. This left the volume size unchanged (because no redundancy can be built from only the 5TB of unused space on a single new drive).

When changing the second drive, DSM told me the new size of the volume would be about 12TB, which is the old 8TB (RAID5 across the four disks) + the 5TB free space from the new drives (partition 6 mirrored). This was not what I wanted, so I deleted partition 6 from one of the drives, and that worked, preventing the storage pool from being expanded.

When replacing the third disk (as detailed just above), DSM assumed that I really wanted to use the free space from the two other drives plus the third of the same kind (even with the extra partition removed). This time I was notified that the storage pool would grow to about 17TB. Still not what I wanted, so after checking that nothing had actually been changed, I went on to remove the 5TB partitions from sdc and sdd.

11.7 hours later…
Storage pool untouched.

root@DS918:~# cat /etc/lvm/backup/vg1
# Generated by LVM2 version 2.02.132(2)-git (2015-09-22): Sat May  3 05:20:41 2025

contents = "Text Format Volume Group"
version = 1

description = "Created *after* executing '/sbin/pvresize /dev/md2'"

creation_host = "DS918" # Linux DS918 4.4.59+ #25426 SMP PREEMPT Mon Dec 14 18:48:50 CST 2020 x86_64
creation_time = 1746242441      # Sat May  3 05:20:41 2025

vg1 {
        id = "jkiRc4-0zwx-ye9v-1eFm-OL0u-7oSS-x51FA8"
        seqno = 4
        format = "lvm2"                 # informational
        status = ["RESIZEABLE", "READ", "WRITE"]
        flags = []
        extent_size = 8192              # 4 Megabytes
        max_lv = 0
        max_pv = 0
        metadata_copies = 0

        physical_volumes {

                pv0 {
                        id = "yu1P7E-7o1a-8CsP-mbaR-mye5-N4pk-1fAk8O"
                        device = "/dev/md2"     # Hint only

                        status = ["ALLOCATABLE"]
                        flags = []
                        dev_size = 17552611584  # 8.17357 Terabytes
                        pe_start = 1152
                        pe_count = 2142652      # 8.17357 Terabytes
                }
        }

        logical_volumes {

                syno_vg_reserved_area {
                        id = "3YdjJW-zkx6-DoKs-jEz0-kTXo-rpke-eYIw8P"
                        status = ["READ", "WRITE", "VISIBLE"]
                        flags = []
                        segment_count = 1

                        segment1 {
                                start_extent = 0
                                extent_count = 3        # 12 Megabytes

                                type = "striped"
                                stripe_count = 1        # linear

                                stripes = [
                                        "pv0", 0
                                ]
                        }
                }

                volume_1 {
                        id = "BFxwgA-3pr2-3BHr-AXo3-rJ6r-F7tP-vC7Te7"
                        status = ["READ", "WRITE", "VISIBLE"]
                        flags = []
                        segment_count = 1

                        segment1 {
                                start_extent = 0
                                extent_count = 2142649  # 8.17356 Terabytes

                                type = "striped"
                                stripe_count = 1        # linear

                                stripes = [
                                        "pv0", 3
                                ]
                        }
                }
        }
}

mdraid volumes untouched:

root@DS918:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid5 sdb5[6] sda5[0] sdd5[4] sdc5[5]
      8776306368 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

md1 : active raid1 sdb2[3] sdd2[2] sdc2[1] sda2[0]
      2097088 blocks [16/4] [UUUU____________]

md0 : active raid1 sdb1[1] sda1[0] sdc1[3] sdd1[2]
      2490176 blocks [16/4] [UUUU____________]

unused devices: <none>

LV also untouched, just as I wanted.

root@DS918:~# lvdisplay
  --- Logical volume ---
  LV Path                /dev/vg1/syno_vg_reserved_area
  LV Name                syno_vg_reserved_area
  VG Name                vg1
  LV UUID                3YdjJW-zkx6-DoKs-jEz0-kTXo-rpke-eYIw8P
  LV Write Access        read/write
  LV Creation host, time ,
  LV Status              available
  # open                 0
  LV Size                12.00 MiB
  Current LE             3
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     768
  Block device           252:0

  --- Logical volume ---
  LV Path                /dev/vg1/volume_1
  LV Name                volume_1
  VG Name                vg1
  LV UUID                BFxwgA-3pr2-3BHr-AXo3-rJ6r-F7tP-vC7Te7
  LV Write Access        read/write
  LV Creation host, time ,
  LV Status              available
  # open                 1
  LV Size                8.17 TiB
  Current LE             2142649
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     4096
  Block device           252:1

Replacing the last drive

I follow the same procedure as with the other drives, with one exception: I let the Synology do its magic and expand the storage pool by leaving the 5TB partitions on the drives.
- Pull out the drive
- Replace and check
- Reboot just to be sure
- Rebuild
- Let the Synology expand the storage pool

After the reboot, I just did a “Repair” on the pool again, and confirmed that the new size will be about 21TB (old size of 8TB + RAID5 on four 5TB partitions giving the 15TB extra space):

Partition layout on the disks after starting the rebuild:

root@DS918:~# sfdisk -l
/dev/sda1                  2048         4982527         4980480  fd
/dev/sda2               4982528         9176831         4194304  fd
/dev/sda5               9453280      5860326239      5850872960  fd
/dev/sda6            5860342336     15627846239      9767503904  fd

/dev/sdb1                  2048         4982527         4980480  fd
/dev/sdb2               4982528         9176831         4194304  fd
/dev/sdb5               9453280      5860326239      5850872960  fd
/dev/sdb6            5860342336     15627846239      9767503904  fd

/dev/sdc1                  2048         4982527         4980480  fd
/dev/sdc2               4982528         9176831         4194304  fd
/dev/sdc5               9453280      5860326239      5850872960  fd
/dev/sdc6            5860342336     15627846239      9767503904  fd

/dev/sdd1                  2048         4982527         4980480  fd
/dev/sdd2               4982528         9176831         4194304  fd
/dev/sdd5               9453280      5860326239      5850872960  fd
/dev/sdd6            5860342336     15627846239      9767503904  fd

Now I just have to wait…

Something unexpected happened
After that reboot (before initiating the rebuild), “md2” for some reason changed to “md4”. The likely reason is that “md2” and “md3” were unavailable: the last disk came from an older Buffalo unit that had run FreeBSD, so mdraid detected its leftover metadata and reassembled “md2” as “md4”.

For reference only, the partition tables just after inserting the disk that now should be the new and last replacement:

root@DS918:~# sfdisk -l
/dev/sda1                  2048         2002943         2000896  83
/dev/sda2               2002944        12003327        10000384  83
/dev/sda3              12003328        12005375            2048  83
/dev/sda4              12005376        12007423            2048  83
/dev/sda5              12007424        14008319         2000896  83
/dev/sda6              14008320      7814008319      7800000000  83
/dev/sda7            7814008832     15614008831      7800000000  83

/dev/sdb1                  2048         4982527         4980480  fd
/dev/sdb2               4982528         9176831         4194304  fd
/dev/sdb5               9453280      5860326239      5850872960  fd
/dev/sdb6            5860342336     15627846239      9767503904  fd

/dev/sdc1                  2048         4982527         4980480  fd
/dev/sdc2               4982528         9176831         4194304  fd
/dev/sdc5               9453280      5860326239      5850872960  fd

/dev/sdd1                  2048         4982527         4980480  fd
/dev/sdd2               4982528         9176831         4194304  fd
/dev/sdd5               9453280      5860326239      5850872960  fd

So at least until the next reboot, the output from /proc/mdstat would look like this:

root@DS918:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md4 : active raid5 sda5[7] sdb5[6] sdd5[4] sdc5[5]
      8776306368 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/3] [_UUU]
      [==============>......]  recovery = 73.4% (2148539264/2925435456) finish=112.3min speed=115278K/sec

md1 : active raid1 sda2[3] sdd2[2] sdc2[1] sdb2[0]
      2097088 blocks [16/4] [UUUU____________]

md0 : active raid1 sda1[0] sdb1[1] sdc1[3] sdd1[2]
      2490176 blocks [16/4] [UUUU____________]

unused devices: <none>

Thinking…
The expansion of the storage should not take long with my method of preventing expansion between every disk swap.
The manual method of doing this expansion would be to create an mdraid RAID5 over the four new partitions, add it to the LVM configuration as a PV, then add that PV to the “volume_1” stripe. Unless the Synology decides to merge md2 and md3 (which I assume will be created using the 4x5TB partitions)…
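
Sketched as commands, that manual route would look roughly like the following. This is a dry run that only prints each step; the /dev/md3 name and the use of all free space are assumptions on my part, and DSM performs the equivalent itself:

```shell
# Dry run of the manual expansion: RAID5 over the four new partitions,
# added to vg1 as a PV and then grown into volume_1.
run() { echo "+ $*"; }   # print instead of execute; change body to "$@" to run

run mdadm --create /dev/md3 --level=5 --raid-devices=4 \
    /dev/sda6 /dev/sdb6 /dev/sdc6 /dev/sdd6
run pvcreate /dev/md3
run vgextend vg1 /dev/md3
run lvextend -l +100%FREE /dev/vg1/volume_1
run resize2fs /dev/vg1/volume_1
```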

Expanding the storage volume

When the resync of md4 (previously named md2) had finished, a new mdraid array using the four 5TB partitions was created, and a resync of it was initiated (since this is not ZFS, that may be necessary even when there is “no data” to sync). As it looks now, this step will take about 52 hours (much slower than the previous resyncs, so it might just be a temporarily low speed).

root@DS918:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid5 sda6[3] sdd6[2] sdc6[1] sdb6[0]
      14651252736 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
      [=>...................]  resync =  5.1% (249940160/4883750912) finish=3159.2min speed=24445K/sec

md4 : active raid5 sda5[7] sdb5[6] sdd5[4] sdc5[5]
      8776306368 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

md1 : active raid1 sda2[3] sdd2[2] sdc2[1] sdb2[0]
      2097088 blocks [16/4] [UUUU____________]

md0 : active raid1 sda1[0] sdb1[1] sdc1[3] sdd1[2]
      2490176 blocks [16/4] [UUUU____________]

unused devices: <none>
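
The ~52-hour figure follows directly from the mdstat numbers above, with the same blocks-over-speed arithmetic as for the earlier rebuilds:

```shell
# Remaining resync time for md2: (total - done) 1 KiB blocks / KiB/s.
total=4883750912        # per-device blocks to resync
done_blocks=249940160   # blocks already synced (5.1%)
speed=24445             # KiB/s
secs=$(( (total - done_blocks) / speed ))
printf 'about %d minutes (%d hours) remaining\n' $(( secs / 60 )) $(( secs / 3600 ))
```

This prints about 3159 minutes (52 hours), matching mdstat's finish=3159.2min.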

mdadm --detail /dev/md2 gives some more information:

root@DS918:~# mdadm --detail /dev/md2
/dev/md2:
        Version : 1.2
  Creation Time : Sun May  4 20:17:34 2025
     Raid Level : raid5
     Array Size : 14651252736 (13972.52 GiB 15002.88 GB)
  Used Dev Size : 4883750912 (4657.51 GiB 5000.96 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Sun May  4 23:45:30 2025
          State : active, resyncing
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

  Resync Status : 5% complete

           Name : DS918:2  (local to host DS918)
           UUID : cc2a3e88:4f844ebd:2fbf5461:f29bbaf0
         Events : 43

    Number   Major   Minor   RaidDevice State
       0       8       22        0      active sync   /dev/sdb6
       1       8       38        1      active sync   /dev/sdc6
       2       8       54        2      active sync   /dev/sdd6
       3       8        6        3      active sync   /dev/sda6

I also found out that the storage pool (but not the volume) has now been expanded to its final size of 21.7TB:

On the “Volume” page, I can go on to create a new volume, which is not what I want. I suppose expanding the current volume will be possible once the resync of the newly added space is done.

I cancelled on the last step where the volume was going to be created, as I want to expand the main storage volume instead.

On the “Linux” side (mdraid and LVM), I found out that the “Physical Volume” had been created and that volume had been added to the “Volume Group” vg1:

When md2 was fully synced

At the end of the resync of md2, which took about 79 hours (the estimate was 52 hours, but the speed dropped during the resync and the estimated time kept increasing over the following two days), I was still not able to extend the storage volume from where I expected the option to be (the “Action” drop-down button under “Volume” in “Storage Manager”). My mistake was not checking “Configure” under that same drop-down button.

I added new drives to my Synology NAS, but the available capacity didn’t increase. What can I do?

So for DSM 6.2 (for the Fakenology), this is where it’s done:

From the “Configuration” page, the volume size can be changed to any size greater than the current size, or to “max” which will add the newly created storage to the volume.

This option to change the size of the volume might have been there all the time (during synchronization), but in any case, it would probably have been better to leave it alone until the first sync had finished anyway.

Now the mdraid volumes look like this:

root@DS918:/volume1# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid5 sda6[3] sdd6[2] sdc6[1] sdb6[0]
      14651252736 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

md4 : active raid5 sda5[7] sdb5[6] sdd5[4] sdc5[5]
      8776306368 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

md1 : active raid1 sda2[3] sdd2[2] sdc2[1] sdb2[0]
      2097088 blocks [16/4] [UUUU____________]

md0 : active raid1 sda1[0] sdb1[1] sdc1[3] sdd1[2]
      2490176 blocks [16/4] [UUUU____________]

unused devices: <none>

At this stage, the storage pool is still untouched, but as shown in the images above, another pv has been added:

root@DS918:/volume1# cat /etc/lvm/backup/vg1
# Generated by LVM2 version 2.02.132(2)-git (2015-09-22): Thu May  8 03:03:03 2025

contents = "Text Format Volume Group"
version = 1

description = "Created *after* executing '/sbin/pvresize /dev/md2'"

creation_host = "DS918" # Linux DS918 4.4.59+ #25426 SMP PREEMPT Mon Dec 14 18:48:50 CST 2020 x86_64
creation_time = 1746666183      # Thu May  8 03:03:03 2025

vg1 {
        id = "jkiRc4-0zwx-ye9v-1eFm-OL0u-7oSS-x51FA8"
        seqno = 7
        format = "lvm2"                 # informational
        status = ["RESIZEABLE", "READ", "WRITE"]
        flags = []
        extent_size = 8192              # 4 Megabytes
        max_lv = 0
        max_pv = 0
        metadata_copies = 0

        physical_volumes {

                pv0 {
                        id = "yu1P7E-7o1a-8CsP-mbaR-mye5-N4pk-1fAk8O"
                        device = "/dev/md4"     # Hint only

                        status = ["ALLOCATABLE"]
                        flags = []
                        dev_size = 17552611584  # 8.17357 Terabytes
                        pe_start = 1152
                        pe_count = 2142652      # 8.17357 Terabytes
                }

                pv1 {
                        id = "YZWW7p-8HaZ-9kDy-7hVv-v2Sk-Vlyu-LkkhXU"
                        device = "/dev/md2"     # Hint only

                        status = ["ALLOCATABLE"]
                        flags = []
                        dev_size = 29302504320  # 13.645 Terabytes
                        pe_start = 1152
                        pe_count = 3576965      # 13.645 Terabytes
                }
        }

        logical_volumes {

                syno_vg_reserved_area {
                        id = "3YdjJW-zkx6-DoKs-jEz0-kTXo-rpke-eYIw8P"
                        status = ["READ", "WRITE", "VISIBLE"]
                        flags = []
                        segment_count = 1

                        segment1 {
                                start_extent = 0
                                extent_count = 3        # 12 Megabytes

                                type = "striped"
                                stripe_count = 1        # linear

                                stripes = [
                                        "pv0", 0
                                ]
                        }
                }

                volume_1 {
                        id = "BFxwgA-3pr2-3BHr-AXo3-rJ6r-F7tP-vC7Te7"
                        status = ["READ", "WRITE", "VISIBLE"]
                        flags = []
                        segment_count = 1

                        segment1 {
                                start_extent = 0
                                extent_count = 2142649  # 8.17356 Terabytes

                                type = "striped"
                                stripe_count = 1        # linear

                                stripes = [
                                        "pv0", 3
                                ]
                        }
                }
        }
}

The next (and last) step is to add the new space to the storage volume (Volume 1). This is done by adding a second segment, with pv1 in its stripe list, to “volume_1”. When the segment has been added, the file system on “volume_1” is resized using the resize2fs command (this took a couple of minutes to finish).
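
The arithmetic can be checked from the extent counts in the LVM backup: each extent is 4 MiB (extent_size = 8192 sectors), so the two segments of “volume_1” add up to the 22878208M size passed to lvextend, about 21.8 TiB, which matches the 22T that df reports.

```shell
# volume_1 size from its two segments; each LVM extent is 4 MiB.
seg1=2142649    # original segment (on pv0 = md4)
seg2=3576903    # new segment (on pv1 = md2)
mib=$(( (seg1 + seg2) * 4 ))
echo "volume_1 = ${mib} MiB = $(( mib / 1024 )) GiB"
```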

root@DS918:/volume1# cat /etc/lvm/backup/vg1
# Generated by LVM2 version 2.02.132(2)-git (2015-09-22): Sun May 11 21:15:43 2025

contents = "Text Format Volume Group"
version = 1

description = "Created *after* executing '/sbin/lvextend --alloc inherit /dev/vg1/volume_1 --size 22878208M'"

creation_host = "DS918" # Linux DS918 4.4.59+ #25426 SMP PREEMPT Mon Dec 14 18:48:50 CST 2020 x86_64
creation_time = 1746990943      # Sun May 11 21:15:43 2025

vg1 {
        id = "jkiRc4-0zwx-ye9v-1eFm-OL0u-7oSS-x51FA8"
        seqno = 8
        format = "lvm2"                 # informational
        status = ["RESIZEABLE", "READ", "WRITE"]
        flags = []
        extent_size = 8192              # 4 Megabytes
        max_lv = 0
        max_pv = 0
        metadata_copies = 0

        physical_volumes {

                pv0 {
                        id = "yu1P7E-7o1a-8CsP-mbaR-mye5-N4pk-1fAk8O"
                        device = "/dev/md4"     # Hint only

                        status = ["ALLOCATABLE"]
                        flags = []
                        dev_size = 17552611584  # 8.17357 Terabytes
                        pe_start = 1152
                        pe_count = 2142652      # 8.17357 Terabytes
                }

                pv1 {
                        id = "YZWW7p-8HaZ-9kDy-7hVv-v2Sk-Vlyu-LkkhXU"
                        device = "/dev/md2"     # Hint only

                        status = ["ALLOCATABLE"]
                        flags = []
                        dev_size = 29302504320  # 13.645 Terabytes
                        pe_start = 1152
                        pe_count = 3576965      # 13.645 Terabytes
                }
        }

        logical_volumes {

                syno_vg_reserved_area {
                        id = "3YdjJW-zkx6-DoKs-jEz0-kTXo-rpke-eYIw8P"
                        status = ["READ", "WRITE", "VISIBLE"]
                        flags = []
                        segment_count = 1

                        segment1 {
                                start_extent = 0
                                extent_count = 3        # 12 Megabytes

                                type = "striped"
                                stripe_count = 1        # linear

                                stripes = [
                                        "pv0", 0
                                ]
                        }
                }

                volume_1 {
                        id = "BFxwgA-3pr2-3BHr-AXo3-rJ6r-F7tP-vC7Te7"
                        status = ["READ", "WRITE", "VISIBLE"]
                        flags = []
                        segment_count = 2

                        segment1 {
                                start_extent = 0
                                extent_count = 2142649  # 8.17356 Terabytes

                                type = "striped"
                                stripe_count = 1        # linear

                                stripes = [
                                        "pv0", 3
                                ]
                        }
                        segment2 {
                                start_extent = 2142649
                                extent_count = 3576903  # 13.6448 Terabytes

                                type = "striped"
                                stripe_count = 1        # linear

                                stripes = [
                                        "pv1", 0
                                ]
                        }
                }
        }
}

root@DS918:/volume1# df -h
Filesystem         Size  Used Avail Use% Mounted on
/dev/md0           2.3G  987M  1.2G  45% /
none               983M     0  983M   0% /dev
/tmp               996M  944K  995M   1% /tmp
/run               996M  8.2M  988M   1% /run
/dev/shm           996M  4.0K  996M   1% /dev/shm
none               4.0K     0  4.0K   0% /sys/fs/cgroup
cgmfs              100K     0  100K   0% /run/cgmanager/fs
/dev/vg1/volume_1   22T  7.2T   15T  33% /volume1
root@DS918:/volume1#

Buffalo LS220D – borked again

Another Seagate drive failed

Yesterday, when I had started the JottaCloud client to back up the content of the shares on one of my LS220s, I noticed a slowdown in reading files from that device.

Checking the dmesg output, I found a repeating pattern of I/O errors:

end_request: I/O error, dev sda, sector 1231207416
ata1: EH complete
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
ata1.00: edma_err_cause=00000084 pp_flags=00000001, dev error, EDMA self-disable
ata1.00: failed command: READ DMA EXT
ata1.00: cmd 25/00:08:f8:bb:62/00:00:49:00:00/e0 tag 0 dma 4096 in
         res 51/40:00:f8:bb:62/40:00:49:00:00/00 Emask 0x9 (media error)
ata1.00: status: { DRDY ERR }
ata1.00: error: { UNC }
ata1: hard resetting link
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl F300)
ata1.00: configured for UDMA/133
ata1: EH complete
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
ata1.00: edma_err_cause=00000084 pp_flags=00000001, dev error, EDMA self-disable
ata1.00: failed command: READ DMA EXT
ata1.00: cmd 25/00:08:f8:bb:62/00:00:49:00:00/e0 tag 0 dma 4096 in
         res 51/40:00:f8:bb:62/40:00:49:00:00/00 Emask 0x9 (media error)
ata1.00: status: { DRDY ERR }
ata1.00: error: { UNC }
ata1: hard resetting link
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl F300)
ata1.00: configured for UDMA/133
ata1: EH complete
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
ata1.00: edma_err_cause=00000084 pp_flags=00000001, dev error, EDMA self-disable
ata1.00: failed command: READ DMA EXT
ata1.00: cmd 25/00:08:f8:bb:62/00:00:49:00:00/e0 tag 0 dma 4096 in
         res 51/40:00:f8:bb:62/40:00:49:00:00/00 Emask 0x9 (media error)
ata1.00: status: { DRDY ERR }
ata1.00: error: { UNC }
ata1: hard resetting link
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl F300)
ata1.00: configured for UDMA/133
ata1: EH complete
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
ata1.00: edma_err_cause=00000084 pp_flags=00000001, dev error, EDMA self-disable
ata1.00: failed command: READ DMA EXT
ata1.00: cmd 25/00:08:f8:bb:62/00:00:49:00:00/e0 tag 0 dma 4096 in
         res 51/40:00:f8:bb:62/40:00:49:00:00/00 Emask 0x9 (media error)
ata1.00: status: { DRDY ERR }
ata1.00: error: { UNC }
ata1: hard resetting link
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl F300)
ata1.00: configured for UDMA/133
ata1: EH complete
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
ata1.00: edma_err_cause=00000084 pp_flags=00000001, dev error, EDMA self-disable
ata1.00: failed command: READ DMA EXT
ata1.00: cmd 25/00:08:f8:bb:62/00:00:49:00:00/e0 tag 0 dma 4096 in
         res 51/40:00:f8:bb:62/40:00:49:00:00/00 Emask 0x9 (media error)
ata1.00: status: { DRDY ERR }
ata1.00: error: { UNC }
ata1: hard resetting link
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl F300)
ata1.00: configured for UDMA/133
ata1: EH complete
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
ata1.00: edma_err_cause=00000084 pp_flags=00000001, dev error, EDMA self-disable
ata1.00: failed command: READ DMA EXT
ata1.00: cmd 25/00:08:f8:bb:62/00:00:49:00:00/e0 tag 0 dma 4096 in
         res 51/40:00:f8:bb:62/40:00:49:00:00/00 Emask 0x9 (media error)
ata1.00: status: { DRDY ERR }
ata1.00: error: { UNC }
ata1: hard resetting link
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl F300)
ata1.00: configured for UDMA/133
sd 0:0:0:0: [sda] Unhandled sense code
sd 0:0:0:0: [sda]  Result: hostbyte=0x00 driverbyte=0x08
sd 0:0:0:0: [sda]  Sense Key : 0x3 [current] [descriptor]
Descriptor sense data with sense descriptors (in hex):
        72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
        49 62 bb f8
sd 0:0:0:0: [sda]  ASC=0x11 ASCQ=0x4
sd 0:0:0:0: [sda] CDB: cdb[0]=0x28: 28 00 49 62 bb f8 00 00 08 00
end_request: I/O error, dev sda, sector 1231207416
ata1: EH complete

So, this time sda (the first of the two drives) was failing (sdb had been replaced earlier). As with my other LS220s, I run this one in RAID0 (stripe) mode, so both drives are needed for proper operation. I probably have most of the content backed up, between the restore after the last failure and the regular backups to JottaCloud.

As usual, I fired up the computer I use for data rescue, and as usual it complained that it had been unused for too long, so the BIOS settings had been forgotten. I had to set it back to booting from CD (Trinity Rescue Kit), but probably forgot to enable the internal SATA connectors this time.

Running both the source and destination drives on the same controller makes the rescuing a bit slow, but I’ll just let it run.

I stumbled upon errors early on (while copying the system partitions), so I stopped there and investigated what was going on:

dmesg output on my rescue system

sd 7:0:0:0: [sdb] CDB: Read(10): 28 00 00 72 b5 e7 00 00 80 00
end_request: I/O error, dev sdb, sector 7517744

This is the layout of the partitions from another LS220:

Number  Start      End          Size         File system  Name     Flags
 1      34s        2000000s     1999967s     ext3         primary
 2      2000896s   11999231s    9998336s                  primary
 3      11999232s  12000000s    769s                      primary  bios_grub
 4      12000001s  12000001s    1s                        primary
 5      12000002s  14000000s    1999999s                  primary
 6      14000128s  7796875263s  7782875136s               primary

As seen in the dmesg output, the problem was within the second partition, so I restarted ddrescue with the -i parameter to set the start position at the start of partition 3. Partition 2 could wait: it is part of the root filesystem (md1, a mirror consisting of sda2 and sdb2), so it should have a copy on the other drive. That said, I have had problems with broken mirrors before, so it might even be that I have no valid copy of this partition.

ddrescue -i 11999232b -d /dev/sdb /dev/sdc /sda1/ddrescue/buff6-1.log
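The offset given to -i is the start sector of partition 3 from the parted listing. Assuming 512-byte sectors, the equivalent byte offset can be sanity-checked with plain shell arithmetic:

```shell
# partition 3 starts at sector 11999232 (from the parted listing);
# with 512-byte sectors, this is the byte offset ddrescue resumes from
start_sector=11999232
echo $(( start_sector * 512 ))   # prints 6143606784
```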

About 12 hours later, I’m almost halfway through the data partition (partition 6) for the mdraid volume. A few errors so far, but ddrescue will get back to those bad parts and try splitting them into smaller pieces later on.

Initial status (read from logfile)
rescued:         0 B,  errsize:       0 B,  errors:       0
Current status
rescued:     1577 GB,  errsize:    394 kB,  current rate:   36765 kB/s
   ipos:     1583 GB,   errors:       3,    average rate:   38855 kB/s
   opos:     1583 GB

I may lose some files here, but the primary goal is to get the drive recognized as a direct replacement for the failed one.

Getting closer…

About 30 hours later, most of the drive had been copied over to the replacement disk. I saved the errors on the root partition for the last step (the final, but most time-consuming, part, which is running now): I first copied as much as possible from around where the early errors occurred, then closed in on the problematic section with each additional run.

Initial status (read from logfile)
rescued:     4000 GB,  errsize:   4008 kB,  errors:      83
Current status
rescued:     4000 GB,  errsize:    518 kB,  current rate:        0 B/s
   ipos:     3828 MB,   errors:      86,    average rate:      254 B/s
   opos:     3828 MB
Splitting error areas...

I gave it another couple of hours and was ready to abort and test if it would boot.. Then it just finished.

Current status
rescued:     4000 GB,  errsize:    493 kB,  current rate:        0 B/s
   ipos:     6035 MB,   errors:      94,    average rate:        2 B/s
   opos:     6035 MB
Finished

Options from here

As I have many backup copies of the content from the NAS this disk belongs to, only a few files (if any at all) will be missing if I restore what I have (I still have to check for and remove duplicates from the backups, but that’s another story). So this rescue has, from the beginning, mainly been for educational purposes.

Ignore the 518kB (finished at 493kB) of errors and test if it boots
Before going on, I will create a partial backup image from the drive I recovered the data to, covering all partitions up to the gap between the system partitions and the beginning of the data partition. The non-data partitions add up to only about 7GB (14 million blocks), as seen in the partition table:

Number  Start      End          Size         File system
 1      34s        2000000s     1999967s     ext3
 2      2000896s   11999231s    9998336s     
 3      11999232s  12000000s    769s         
 4      12000001s  12000001s    1s           
 5      12000002s  14000000s    1999999s     
 6      14000128s  7796875263s  7782875136s  

It’s probably a good idea not to include the start block of the data partition (14000128s); stopping halfway into the gap between the partitions is safe.
dd if=/dev/sdc of=/sda1/ddrescue/buffa6-1-p1-5 bs=512 count=14000064
This way, I can easily restore the system partitions whenever something goes wrong.
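As a sanity check on the image size: 14000064 sectors of 512 bytes each works out to roughly 7.2 GB, matching the "about 7GB" estimate above:

```shell
# 14000064 sectors * 512 bytes/sector = size of the partial image
count=14000064
echo $(( count * 512 ))   # prints 7168032768 (bytes), i.e. ~7.2 GB
```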

Trying to boot the NAS with the recreated disk and the working one
This is the easiest thing I can try. After running ddrescue from the third partition to the end of the disk, only about 40kB remained unrecoverable (whether those sectors held content or not). This means that I can at least connect the disks to a Linux machine and mount the mdraid volume there for recovery.
But booting it up in the NAS is the “better” solution in this case.

If booting the disks in the NAS fails
If the NAS won’t boot, my next step will be to try to recover more of the missing content from the root partition (I actually ended up doing this before trying to boot, and was able to recover 25kB more).

Use root partitions from another Buffalo
This would have been my next idea to try out. I have a few more of these 2-disk devices, so I can shut one down and clone the system partitions of both its disks, then dd them back to the disks for “Buffalo 6”. This would give it the same IP as the cloned one, but that’s easy to change if this makes it boot again.
I didn’t have to try this. Can save this for the next crash 🙂

The boot attempt..

Mounted the “new” disk 1 in the caddy, then started up the NAS.. Responds to ping on its IP..
And I was also able to log in to it as “root” (previous modifications to allow SSH root access).. Looking good..

[root@BUFFALO-6 ~]# df -h
Filesystem                Size      Used Available Use% Mounted on
udev                     10.0M         0     10.0M   0% /dev
/dev/md1                  4.7G    784.9M      3.7G  17% /
tmpfs                   121.1M     84.0K    121.0M   0% /tmp
/dev/ram1                15.0M    108.0K     14.9M   1% /mnt/ram
/dev/md0                968.7M    216.4M    752.2M  22% /boot
/dev/md10                 7.2T    707.2G      6.6T  10% /mnt/array1
[root@BUFFALO-6 ~]#

mdraid looks OK, except that, as I already suspected, the mirrors for the system partitions were broken (I forgot to fix that the last time I replaced a disk, the other one, in it)..

[root@BUFFALO-6 ~]# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md10 : active raid0 sda6[0] sdb6[1]
      7782874112 blocks super 1.2 512k chunks

md0 : active raid1 sda1[0]
      999872 blocks [2/1] [U_]

md1 : active raid1 sda2[0]
      4995008 blocks super 1.2 [2/1] [U_]

md2 : active raid1 sda5[0]
      999424 blocks super 1.2 [2/1] [U_]

Another easy fix..

mdadm --manage /dev/md0 --add /dev/sdb1
mdadm --manage /dev/md1 --add /dev/sdb2
mdadm --manage /dev/md2 --add /dev/sdb5
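A degraded mirror shows up in /proc/mdstat as [2/1] [U_], so the broken arrays can be counted with a quick grep. Shown here against a saved sample of the output above, since the live file is machine-specific:

```shell
# count lines flagging a missing mirror member ([U_]) in mdstat output;
# a non-zero count means at least one array is still degraded
mdstat='md10 : active raid0 sda6[0] sdb6[1]
      7782874112 blocks super 1.2 512k chunks
md0 : active raid1 sda1[0]
      999872 blocks [2/1] [U_]'
printf '%s\n' "$mdstat" | grep -c '\[U_\]'   # prints 1 for this sample
```

Against the real device, replace the sample with `grep -c '\[U_\]' /proc/mdstat`.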

All mirrored partitions OK now:

[root@BUFFALO-6 ~]# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md10 : active raid0 sda6[0] sdb6[1]
      7782874112 blocks super 1.2 512k chunks

md0 : active raid1 sdb1[1] sda1[0]
      999872 blocks [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md1 : active raid1 sdb2[2] sda2[0]
      4995008 blocks super 1.2 [2/2] [UU]
      bitmap: 1/1 pages [4KB], 65536KB chunk

md2 : active raid1 sdb5[2] sda5[0]
      999424 blocks super 1.2 [2/2] [UU]
      bitmap: 1/1 pages [4KB], 65536KB chunk

How to use Jottacloud as S3 storage

After some trial and (mostly) error (using NFS), I got the backups from xcp-ng working against my Jottacloud storage. So for documentation purposes, I’ll start over (and also fix some of the mistakes I made with the settings, which stored the files in the wrong place).

Requirements

This is written for the use of:
* xcp-ng (Xen Orchestra) for my virtual servers that need to be backed up
* Jottacloud account (or another one that uses the Jotta backend) for storage
* local server acting as the client for Jottacloud and as the S3 server which Xen Orchestra/xcp-ng will connect to. This can be a virtual one or a separate machine, and it can also be used for other purposes (I use the virtual server that is also my Xen Orchestra host).
* rclone on the local server as both a client and server, to handle what’s coming in through S3 and push it to Jotta

It should be no problem adapting this to other configurations, but this is what I have to test on.

The local server

As I decided to use the Xen Orchestra host, which was a little underpowered to begin with (even disk space was low, because I went cheap on it initially), I had to increase the disk space and extend the root file system first (I will not describe how, but doing it afterwards involves taking down the swap partition and removing it to be able to expand the file system). I also increased the RAM from 2GB to 4GB and the CPU count from 2 to 16 to give it enough performance whenever needed.

Installing rclone is straightforward on Linux systems, so just follow the single-line instruction in the documentation:

sudo -v ; curl https://rclone.org/install.sh | sudo bash

Configuring a Jottacloud remote for rclone

After the installation it’s time to configure a remote. The ‘remote’ is in this case the Jottacloud service. By configuring rclone, you will set up a ‘device’ and a ‘mountpoint’ on the Jotta storage.
To compare with the Jotta GUI client, the ‘device’ is the computer being backed up, and ‘mountpoint’ is the folder to be backed up (that is one of the entries listed in the Jotta client main window).
The specifics for each service can be found on its own page in the documentation:
Jottacloud configuration for rclone
One thing worth mentioning about the whitelabel variants of Jottacloud is that many of them require you to select “Legacy authentication” (if you do not have the option to use Jotta-CLI and generate a login token).
You may also have to give some non-default replies in the configuration guide when you run the rclone config command. The session below shows where my answers differ from the description in the documentation (my text input follows each prompt):

No remotes found, make a new one?
n) New remote
s) Set configuration password
q) Quit config
n/s/q> n
name> s3
Option Storage.
Type of storage to configure.
Choose a number from below, or type in your own value.
[snip]
XX / Jottacloud
   \ (jottacloud)
[snip]
Storage> 28 (this is currently the number in the list)

Option client_id.
OAuth Client Id.
Leave blank normally.
Enter a value. Press Enter to leave empty.
client_id>

Option client_secret.
OAuth Client Secret.
Leave blank normally.
Enter a value. Press Enter to leave empty.
client_secret>

Edit advanced config?
y) Yes
n) No (default)
y/n>

Option config_type.
Select authentication type.
Choose a number from below, or type in an existing value of type string.
Press Enter for the default (standard).
   / Standard authentication.
 1 | Use this if you're a normal Jottacloud user.
   \ (standard)
   / Legacy authentication.
 2 | This is only required for certain whitelabel versions of Jottacloud and not recommended for normal users.
   \ (legacy)
   / Telia Cloud authentication.
 3 | Use this if you are using Telia Cloud (Sweden).
   \ (telia_se)
   / Telia Sky authentication.
 4 | Use this if you are using Telia Sky (Norway).
   \ (telia_no)
   / Tele2 Cloud authentication.
 5 | Use this if you are using Tele2 Cloud.
   \ (tele2)
   / Onlime Cloud authentication.
 6 | Use this if you are using Onlime Cloud.
   \ (onlime)
config_type> 2

Do you want to create a machine specific API key?

Rclone has it's own Jottacloud API KEY which works fine as long as one only uses rclone on a single machine. When you want to use rclone with this account on more than one machine it's recommended to create a machine specific API key. These keys can NOT be shared between machines.
y) Yes
n) No (default)
y/n> y

Option config_username.
Username (e-mail address)
Enter a value.
config_username> yourjottaemail@fake.com

Option config_password.
Password (only used in setup, will not be stored)
Choose an alternative below. Press Enter for the default (n).
y) Yes, type in my own password
g) Generate random password
n) No, leave this optional password blank (default)
y/g/n> y
Enter the password:
password:
Confirm the password:
password:

Use a non-standard device/mountpoint?
Choosing no, the default, will let you access the storage used for the archive section of the official Jottacloud client. If you instead want to access the sync or the backup section, for example, you must choose yes.
y) Yes
n) No (default)
y/n> y

Option config_device.
The device to use. In standard setup the built-in Jotta device is used, which contains predefined mountpoints for archive, sync etc. All other devices are treated as backup devices by the official Jottacloud client. You may create a new by entering a unique name.
Choose a number from below, or type in your own value of type string.
Press Enter for the default (Jotta).
 1 > Jotta
 2 > your other
 3 > devices configured
 4 > from the client
config_device> deb12-xo

Option config_mountpoint.
The mountpoint to use on the non-standard device deb12-xo.
You may create a new by entering a unique name.
Choose a number from below, or type in your own value of type string.
Press Enter for the default (xcp-ng).
 1 > xcp-ng
 2 > xcpng-s3
config_mountpoint> rclone-s3

Configuration complete.
Options:
- type: jottacloud
- configVersion: 0
- client_id: yourownstringofcharacters
- client_secret: thisissecretsoIwouldnotshowit
- username: yourjottaemail@fake.com
- password:
- auth_code:
- token: {"access_token":"supersecretstuffheredonotshare","expiry":"2025-01-28T16:12:46.077725923+01:00"}
- device: deb12-xo
- mountpoint: rclone-s3
Keep this "s3" remote?
y) Yes this is OK (default)
e) Edit this remote
d) Delete this remote
y/e/d> y

Current remotes:

Name                 Type
====                 ====
s3                   jottacloud

e) Edit existing remote
n) New remote
d) Delete remote
r) Rename remote
c) Copy remote
s) Set configuration password
q) Quit config
e/n/d/r/c/s/q> q

After the configuration step, you have to connect to the remote for the device and mountpoint to be created online. This is also needed because the ‘bucket’ (folder) has to be created, as explained in the rclone serve s3 documentation.

To create the device (deb12-xo above) and mountpoint (rclone-s3 above), mount the remote using the rclone mount command:

mkdir s3
rclone -vv mount s3: /home/xo/s3

(you will see a warning that it is recommended to use --vfs-cache-mode for the remote, but it’s safe to ignore it in this step)

In another shell, go to the directory where you mounted the remote. Verify that the remote is mounted with df -h .
Create a folder with the name that will be your xcp-ng backup location (“bucket name”). This will create the directories (mountpoint and bucket) on the online storage. If this does not seem to happen (or just if you are curious), check the logs in the other shell.

In this step, I created the directory “xcp-ng” inside my S3 folder.

Now everything is prepared for configuring it within xcp-ng.
Jump out of the s3 directory (just “cd”), then break the connection and unmount with ctrl-c in the other shell.

Start the s3 server (temporarily for testing):

rclone -vv serve s3 --auth-key jottatest,sUperSecret --addr=0.0.0.0:8080 s3:

Verify that you can connect using a S3 client

This is an optional step, but you might find it useful now and later.
To prevent possible mistakes when testing directly with xcp-ng, you can download an S3 client to test with first. I found S3 Browser useful enough for testing.
Account setup in this client is simple:
Display name: whatever you want
Account type: S3 Compatible Storage
API endpoint: IP address and port of the machine running the rclone serve command, e.g. 10.0.0.222:8080
Access KeyID: the user name (“jottatest” above)
Secret Access Key: the password (“sUperSecret” above)

Check that uploading files works by verifying that they become visible online.
The path (visible, not URL) to the files at Jottacloud would be (from the Web UI):
Backups > deb12-xo > rclone-s3 > xcp-ng
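If you prefer, rclone itself can act as the S3 test client. A sketch of a second remote in rclone.conf pointing at the local gateway (the endpoint address and the jottatest/sUperSecret credentials are the example values from this walkthrough, not fixed names):

```ini
[localgw]
type = s3
provider = Other
access_key_id = jottatest
secret_access_key = sUperSecret
endpoint = http://10.0.0.222:8080
```

With that in place, `rclone ls localgw:xcp-ng` should list the bucket content through the gateway.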

Setting up S3 storage in Xen Orchestra

To set up the newly created S3 storage in Xen Orchestra, go to “Settings/Remotes”.

Set type to “Amazon Web Services S3”
Disable HTTPS (for now, the rclone server supports it, but I haven’t tested it)
AWS S3 endpoint: IP address and port of machine running rclone serve s3
AWS S3 bucket name: xcp-ng
Directory: / (blank would currently not be accepted by Xen Orchestra)
Access key ID: the user name, “jottatest”
Secret (field below): the password (“sUperSecret”)

If the test above went well, the connection in Orchestra will just go through and will be enabled and speed-tested.

Now the setup is ready for the first tests of backing up VMs through the connection. If all goes well, make the rclone command run at startup of the computer or VM it’s on, in any way you like.
It can be told to run in the background with the --daemon option. All the information you need is in the documentation or in the forums.
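A minimal systemd unit is one way to start the gateway at boot. This is only a sketch with assumed values: adjust User, the rclone path, and the --auth-key credentials to your own setup:

```ini
[Unit]
Description=rclone S3 gateway towards Jottacloud
After=network-online.target

[Service]
User=xo
ExecStart=/usr/bin/rclone serve s3 --auth-key jottatest,sUperSecret --addr=0.0.0.0:8080 s3:
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Saved as e.g. /etc/systemd/system/rclone-s3.service, it can then be enabled with systemctl enable --now rclone-s3.service.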

Troubleshooting

If anything goes wrong or stops working (a rare problem, from what I have heard, but maybe related to using the Jotta backend or rclone for the connection and for serving the S3 storage), create a new bucket and send the backups to that instead of the broken one.
Split into multiple buckets if the problems continue.

You can also use rclone ncdu to browse the content, and by trial and (mostly) error move backup folders from the broken bucket to the new one, one at a time, to find out where it fails. The method I ended up using was to first rename the bucket to free up the name I wanted to keep, then look up the backup folder names in ncdu and move each of them with a separate move command.

rclone move s3:/xcp-ng s3:xcp-ng-borked
rclone mkdir s3:/xcp-ng
rclone mkdir s3:/xcp-ng/xo-vm-backups

To avoid typing mistakes, I use a variable for the backup folder name which I copy from ncdu:

d=34ba1017-66d2-6a15-ed2a-b57f1a912431
rclone move s3:/xcp-ng-borked/xo-vm-backups/${d} s3:/xcp-ng/xo-vm-backups/${d}

Moving the backup folders like this only takes seconds, since it is done on the server side (Jotta). I measured the moving speed at about 10GB/s (294GB was moved in 33 seconds), and my largest backup folder of 500GB took 43 seconds.
That’s time well spent if you want to find out which backup is broken and stops every other backup against that remote from working.
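For reference, the two timings work out to slightly under and slightly over that ~10GB/s figure:

```shell
# server-side move throughput from the two measurements above
awk 'BEGIN { printf "%.1f GB/s\n", 294/33; printf "%.1f GB/s\n", 500/43 }'
# prints:
# 8.9 GB/s
# 11.6 GB/s
```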

If failures continue, just start over

If the xcp-ng backups continue to fail, you can keep the old backups but just start over by configuring a new mountpoint (backup folder inside the “device”).

Before making these changes, disable all the remotes in Xen Orchestra, then shut down the “rclone serve s3” process to avoid any access to the old content.
The easiest method for keeping the old content and starting over with the new mountpoint is to rename the remote in the rclone.conf file:

[s3]
type = jottacloud
configVersion = 0
client_id = *SecretStuffHere*
client_secret = *VerySecretStuffHere*
username = not@myrealemail.se
password =

Change the remote name to anything else to free up the old name (I usually use something with “bork” in the name, such as “s3-bork-202505” to indicate when it failed).

Create a new remote as described above, but use a new mountpoint name (root folder for the buckets), then create the individual buckets (folders) by mounting the remote.

In Xen Orchestra, the configuration does not have to change at all.

Enable one remote at a time, then for each one enabled, run the backups using that remote manually once to see if it succeeds.

Buffalo LS220D – lost drive (hiccup)

Yesterday I noticed that the LEDs were blinking amber on one of my LS220D boxes. My initial thought was that a disk had failed (it’s just a backup of my backup). Checked with the “NAS Navigator” application, and it stated that it was unable to mount the data array (md10) (I have not logged the full error message here, as I continued the attempts to solve the situation).

dmesg output

I logged in as root (see other posts) to check what had gone wrong.
‘dmesg’ revealed that a disk had been lost during a smartctl run (the output below was repeated multiple times):

program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Unable to handle kernel NULL pointer dereference at virtual address 000000a4
pgd = c27d4000
[000000a4] *pgd=0fe93831, *pte=00000000, *ppte=00000000
Internal error: Oops: 817 [#50]
Modules linked in: usblp usb_storage ohci_hcd ehci_hcd xhci_hcd usbcore usb_common
CPU: 0    Tainted: G      D       (3.3.4 #1)
PC is at sg_scsi_ioctl+0xe0/0x374
LR is at sg_scsi_ioctl+0xcc/0x374
pc : []    lr : []    psr: 60000013
sp : cafb5d58  ip : 00000000  fp : 00000024
r10: 00000006  r9 : c41d1860  r8 : 00000012
r7 : 00000000  r6 : 00000024  r5 : beee5550  r4 : beee5548
r3 : cafb4000  r2 : cafb5d58  r1 : 00000000  r0 : 00000000
Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
Control: 10c5387d  Table: 027d4019  DAC: 00000015
Process smartctl (pid: 1027, stack limit = 0xcafb42e8)
Stack: (0xcafb5d58 to 0xcafb6000)
5d40:          c057c2b8 60000013
5d60: c21f27f0 beee5548 c2274800 0000005d cafb5de4 00000000 c998edcc 00000004
5d80: c99800c8 c00a6e64 c9d034e0 00000028 c998edc8 00000029 c27d4000 c00a8fc0
5da0: 00000000 00000000 00000000 c998ed08 c2274800 56e6994b beee5a48 beee5548
5dc0: 0000005d 0000005d c2274800 c21f27f0 cafb4000 56e6994b beee7e34 beee5548
5de0: 0000005d 0000005d c2274800 c21f27f0 cafb4000 ffffffed beee7e34 c0245494
5e00: 00000053 fffffffd 00002006 00000024 beee5af8 beee5ae0 beee5ab8 00004e20
5e20: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
5e40: c27d4000 00000000 c27d4000 cb0023e0 c87f3d30 00000028 beee7e34 c00be67c
5e60: c27d4000 00000028 cafb5fb0 56e6994b 00000001 0000005d c8014040 beee5548
5e80: 0000005d c0245530 beee5548 0000005d 00000001 00000001 beee5548 c222c000
5ea0: c8014040 c02a6284 beee5548 beee5548 c8014040 c2274800 00000001 0000005d
5ec0: 00000000 c02422a0 beee5548 c0242be0 00000000 cafb5f78 00000001 c2949000
5ee0: ffffff9c c8014040 00000000 00000007 c054ff34 00039db8 cafb5fb0 beee5548
5f00: c21e0470 00000003 00000003 c000e3c8 cafb4000 00000000 beee7e34 c00e0060
5f20: 00000000 00000000 cf34be00 2c1b812a 5e6a6136 2c1b812a cf1a2548 00000000
5f40: 00000000 00000000 00000003 00000003 c95a2ec0 c2949000 c95a2ec8 00000020
5f60: 00000003 c95a2ec0 beee5548 00000001 00000003 c000e3c8 cafb4000 00000000
5f80: beee7e34 c00e010c 00000003 00000000 beee5548 beee5548 0006d614 beee5a8c
5fa0: 00000036 c000e200 beee5548 0006d614 00000003 00000001 beee5548 00000000
5fc0: beee5548 0006d614 beee5a8c 00000036 00000000 00000003 00000006 beee7e34
5fe0: beee5ae0 beee5540 00039688 b6da5cec 80000010 00000003 cfcfcfcf 00000014
[] (sg_scsi_ioctl+0xe0/0x374) from [] (scsi_cmd_ioctl+0x39c/0x3fc)
[] (scsi_cmd_ioctl+0x39c/0x3fc) from [] (scsi_cmd_blk_ioctl+0x3c/0x44)
[] (scsi_cmd_blk_ioctl+0x3c/0x44) from [] (sd_ioctl+0x8c/0xb8)
[] (sd_ioctl+0x8c/0xb8) from [] (__blkdev_driver_ioctl+0x20/0x28)
[] (__blkdev_driver_ioctl+0x20/0x28) from [] (blkdev_ioctl+0x670/0x6c0)
[] (blkdev_ioctl+0x670/0x6c0) from [] (do_vfs_ioctl+0x49c/0x514)
[] (do_vfs_ioctl+0x49c/0x514) from [] (sys_ioctl+0x34/0x58)
[] (sys_ioctl+0x34/0x58) from [] (ret_fast_syscall+0x0/0x30)
Code: e1a0200d e7d3a2a8 e3c23d7f e3c3303f (e1c0aab4)
---[ end trace 660c9d3c9b4a9034 ]---

fdisk output

Using ‘fdisk’ (not the right tool for the GPT disks in this NAS), I listed the partitions on /dev/sda and /dev/sdb; note that nothing at all is reported for /dev/sda:

[root@BUFFALO-4 ~]# fdisk -l /dev/sda
[root@BUFFALO-4 ~]# fdisk -l /dev/sdb

WARNING: GPT (GUID Partition Table) detected on '/dev/sdb'! The util fdisk doesn't support GPT. Use GNU Parted.

Disk /dev/sdb: 4000.8 GB, 4000787030016 bytes
255 heads, 63 sectors/track, 486401 cylinders, total 7814037168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1  4294967295  2147483647+  ee  GPT
Partition 1 does not start on physical sector boundary.

smartctl output

[root@BUFFALO-4 ~]# smartctl --scan
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/sdb -d scsi # /dev/sdb, SCSI device

[root@BUFFALO-4 ~]# smartctl --all /dev/sda
smartctl 6.3 2014-07-26 r3976 [armv7l-linux-3.3.4] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

Segmentation fault

[root@BUFFALO-4 ~]# smartctl --all /dev/sdb
smartctl 6.3 2014-07-26 r3976 [armv7l-linux-3.3.4] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (AF, SATA 6Gb/s)
Device Model:     WDC WD40EZRX-22SPEB0
Serial Number:    WD-WCC4E1UUZH74
LU WWN Device Id: 5 0014ee 2b768eeb4
Firmware Version: 80.00A80
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Thu Jul 14 12:10:33 2022 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
          was completed without error.
          Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
          without error or no self-test has ever
          been run.
Total time to complete Offline
data collection: (52320) seconds.
Offline data collection
capabilities:     (0x7b) SMART execute Offline immediate.
          Auto Offline data collection on/off support.
          Suspend Offline collection upon new
          command.
          Offline surface scan supported.
          Self-test supported.
          Conveyance Self-test supported.
          Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
          power-saving mode.
          Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
          General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 523) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x7035) SCT Status supported.
          SCT Feature Control supported.
          SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   196   187   021    Pre-fail  Always       -       7183
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       36
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   055   054   000    Old_age   Always       -       33525
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       36
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       28
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       7866202
194 Temperature_Celsius     0x0022   113   103   000    Old_age   Always       -       39
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%         8         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Nothing more to do than to reboot.

After reboot

The storage array was still not mounted, but smartctl could now find /dev/sda:

[root@BUFFALO-4 ~]# df -h
Filesystem Size      Used Available Use% Mounted on
udev      10.0M         0     10.0M   0% /dev
/dev/md1   4.7G    766.8M      3.7G  17% /
tmpfs    121.1M     84.0K    121.0M   0% /tmp
/dev/ram1 15.0M    100.0K     14.9M   1% /mnt/ram
/dev/md0 968.7M    216.4M    752.2M  22% /boot

[root@BUFFALO-4 ~]# smartctl --all /dev/sda
smartctl 6.3 2014-07-26 r3976 [armv7l-linux-3.3.4] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (AF, SATA 6Gb/s)
Device Model:     WDC WD40EZRX-22SPEB0
Serial Number:    WD-WCC4E1XUDU4T
LU WWN Device Id: 5 0014ee 20cbde2d7
Firmware Version: 80.00A80
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Thu Jul 14 12:13:56 2022 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
          was suspended by an interrupting command from host.
          Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
          without error or no self-test has ever
          been run.
Total time to complete Offline
data collection: (52560) seconds.
Offline data collection
capabilities:     (0x7b) SMART execute Offline immediate.
          Auto Offline data collection on/off support.
          Suspend Offline collection upon new
          command.
          Offline surface scan supported.
          Self-test supported.
          Conveyance Self-test supported.
          Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
          power-saving mode.
          Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
          General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 526) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x7035) SCT Status supported.
          SCT Feature Control supported.
          SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   250   204   021    Pre-fail  Always       -       4500
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       38
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   053   051   000    Old_age   Always       -       34713
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       38
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       30
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       7823449
194 Temperature_Celsius     0x0022   122   106   000    Old_age   Always       -       30
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       13
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       11
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       14

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%         8         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Partition table after reboot

Now that both disks were back in place, I ran the (correct) command to list the partitions on all drives:

[root@BUFFALO-4 ~]# parted -l /dev/sdb
Model: ATA WDC WD40EZRX-22S (scsi)
Disk /dev/sda: 4001GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Disk Flags:

Number  Start   End     Size    File system  Name     Flags
 1      17.4kB  1024MB  1024MB  ext3         primary
 2      1024MB  6144MB  5119MB               primary
 3      6144MB  6144MB  394kB                primary  bios_grub
 4      6144MB  6144MB  512B                 primary
 5      6144MB  7168MB  1024MB               primary
 6      7168MB  3992GB  3985GB               primary


Model: ATA WDC WD40EZRX-22S (scsi)
Disk /dev/sdb: 4001GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Disk Flags:

Number  Start   End     Size    File system  Name     Flags
 1      17.4kB  1024MB  1024MB  ext3         primary
 2      1024MB  6144MB  5119MB               primary
 3      6144MB  6144MB  394kB                primary  bios_grub
 4      6144MB  6144MB  512B                 primary
 5      6144MB  7168MB  1024MB               primary
 6      7168MB  3992GB  3985GB               primary

...

Looks ok, so I tried mounting /dev/md10:

[root@BUFFALO-4 ~]# mount /dev/md10 /mnt/array1/
[root@BUFFALO-4 ~]# df -h
Filesystem Size      Used Available Use% Mounted on
udev      10.0M         0     10.0M   0% /dev
/dev/md1   4.7G    766.8M      3.7G  17% /
tmpfs    121.1M     84.0K    121.0M   0% /tmp
/dev/ram1 15.0M    100.0K     14.9M   1% /mnt/ram
/dev/md0 968.7M    216.4M    752.2M  22% /boot
/dev/md10  7.2T      5.7T      1.6T  79% /mnt/array1
[root@BUFFALO-4 ~]# ls /mnt/array1/
backup/         buffalo_fix.sh* share/          spool/
[root@BUFFALO-4 ~]# ls /mnt/array1/share/
acp_commander/    buff4_public.txt  buff4_share.txt   buff4_web.txt

Checking the file system for errors

As I was able to mount the partition, I did a file system check after unmounting it:

[root@BUFFALO-4 ~]# xfs_repair /dev/md10
Phase 1 - find and verify superblock...
Not enough RAM available for repair to enable prefetching.
This will be _slow_.
You need at least 1227MB RAM to run with prefetching enabled.
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
...
        - agno = 30
        - agno = 31
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
...
        - agno = 30
        - agno = 31
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
doubling cache size to 1024
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
done
[root@BUFFALO-4 ~]# mount /dev/md10 /mnt/array1
[root@BUFFALO-4 ~]# ls /mnt/array1/
backup/         buffalo_fix.sh* share/          spool/

After another reboot, I checked and found that md10 was still not mounted.
The error in NAS Navigator was: “E14:RAID array 1 could not be mounted. (2022/07/14 12:36:18)”

Time to check ‘dmesg’ again:

md/raid1:md2: active with 1 out of 2 mirrors
md2: detected capacity change from 0 to 1023410176
md: md1 stopped.
md: bind<sdb2>
md/raid1:md1: active with 1 out of 2 mirrors
md1: detected capacity change from 0 to 5114888192
md: md0 stopped.
md: bind<sdb1>
md/raid1:md0: active with 1 out of 2 mirrors
md0: detected capacity change from 0 to 1023868928
 md0: unknown partition table
kjournald starting.  Commit interval 5 seconds
EXT3-fs (md0): using internal journal
EXT3-fs (md0): mounted filesystem with writeback data mode
 md1: unknown partition table
kjournald starting.  Commit interval 5 seconds
EXT3-fs (md1): using internal journal
EXT3-fs (md1): mounted filesystem with writeback data mode
kjournald starting.  Commit interval 5 seconds
EXT3-fs (md1): using internal journal
EXT3-fs (md1): mounted filesystem with writeback data mode
 md2: unknown partition table
Adding 999420k swap on /dev/md2.  Priority:-1 extents:1 across:999420k
kjournald starting.  Commit interval 5 seconds
EXT3-fs (md0): using internal journal
EXT3-fs (md0): mounted filesystem with writeback data mode

The above shows that md0, md1 and md2 came up, but each is missing its mirror partition (the one on /dev/sda, the drive that had disappeared).
Further down in the dmesg output:

md: md10 stopped.
md: bind
md: bind
md/raid0:md10: md_size is 15565748224 sectors.
md: RAID0 configuration for md10 - 1 zone
md: zone0=[sda6/sdb6]
      zone-offset=         0KB, device-offset=         0KB, size=7782874112KB

md10: detected capacity change from 0 to 7969663090688
 md10: unknown partition table
XFS (md10): Mounting Filesystem
XFS (md10): Ending clean mount
XFS (md10): Quotacheck needed: Please wait.
XFS (md10): Quotacheck: Done.
udevd[3963]: starting version 174
md: cannot remove active disk sda6 from md10 ...
[root@BUFFALO-4 ~]# mount /dev/md10 /mnt/array1/
[root@BUFFALO-4 ~]# ls -l /mnt/array1/
total 4
drwxrwxrwx    3 root     root            21 Dec 14  2019 backup/
-rwx------    1 root     root           571 Oct 14  2018 buffalo_fix.sh*
drwxrwxrwx    3 root     root            91 Sep 16  2019 share/
drwxr-xr-x    2 root     root             6 Oct 21  2016 spool/

What the h… “cannot remove active disk sda6 from md10”

Checking md raid status

[root@BUFFALO-4 ~]# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md10 : active raid0 sda6[0] sdb6[1]
      7782874112 blocks super 1.2 512k chunks

md0 : active raid1 sdb1[1]
      999872 blocks [2/1] [_U]

md1 : active raid1 sdb2[1]
      4995008 blocks super 1.2 [2/1] [_U]

md2 : active raid1 sdb5[1]
      999424 blocks super 1.2 [2/1] [_U]

unused devices: <none>
[root@BUFFALO-4 ~]# mdadm --detail /dev/md10
/dev/md10:
        Version : 1.2
  Creation Time : Fri Oct 21 15:58:46 2016
     Raid Level : raid0
     Array Size : 7782874112 (7422.33 GiB 7969.66 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Fri Oct 21 15:58:46 2016
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

     Chunk Size : 512K

           Name : LS220D896:10
           UUID : 5ed0c596:60b32df6:9ac4cd3a:59c3ddbc
         Events : 0

    Number   Major   Minor   RaidDevice State
       0       8        6        0      active sync   /dev/sda6
       1       8       22        1      active sync   /dev/sdb6

So here, md10 is fully working, while md0, md1 and md2 are missing their second device. Simple to correct by just adding the partitions back:

[root@BUFFALO-4 ~]# mdadm --manage /dev/md0 --add /dev/sda1
mdadm: added /dev/sda1
[root@BUFFALO-4 ~]# mdadm --manage /dev/md1 --add /dev/sda2
mdadm: added /dev/sda2
[root@BUFFALO-4 ~]# mdadm --manage /dev/md2 --add /dev/sda5
mdadm: added /dev/sda5
[root@BUFFALO-4 ~]# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md10 : active raid0 sda6[0] sdb6[1]
      7782874112 blocks super 1.2 512k chunks

md0 : active raid1 sda1[0] sdb1[1]
      999872 blocks [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md1 : active raid1 sda2[2] sdb2[1]
      4995008 blocks super 1.2 [2/1] [_U]
      [====>................]  recovery = 24.2% (1212672/4995008) finish=1.2min speed=48506K/sec

md2 : active raid1 sda5[2] sdb5[1]
      999424 blocks super 1.2 [2/1] [_U]
        resync=DELAYED

unused devices: <none>

Some time later, the sync was finished and I rebooted again. Finally, after this reboot, /dev/md10 was automatically mounted to /mnt/array1 again.

Problem solved 🙂

smartctl notes

The raw values of attributes 5, 197 and 198 should be zero for a healthy drive, so one disk in the NAS is actually failing. The cause of the hiccup (the disconnect), however, was a core dump during the weekly smartctl scan.

  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       13
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      11
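
Since this kind of degradation is easy to miss, the check can be scripted. The sketch below is my own helper (not part of DSM or smartmontools): it reads `smartctl -A` output on stdin and warns when any of these three attributes has a nonzero raw value. The sample input is the failing drive's values from above; on a healthy drive it prints nothing.

```shell
# smart_warn: warn when SMART attributes 5, 197 or 198 have a nonzero
# raw value (the last field of the attribute line).
smart_warn() {
  awk '$1 == 5 || $1 == 197 || $1 == 198 {
    if ($NF + 0 != 0) print "WARN: " $2 " raw=" $NF
  }'
}

# Feed it the three lines from the failing drive (normally you would run:
#   smartctl -A /dev/sda | smart_warn ):
out=$(smart_warn <<'EOF'
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       13
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       11
EOF
)
echo "$out"
```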

JottaCloud secrets

I dug into the sqlite databases used by the JottaCloud client (and branded ones like Elgiganten) and found something that can be useful for other diggers…

This documentation is for the Windows version of the client. The path to the database files, and the path formats within the databases, will differ for the clients for other OSes.

11-Jan-2023: Updated example queries in the comments. Added ‘Find duplicates’.

Preparing

This method works for finding the location in the Windows version:
Open the client interface, go to settings, and under the “General” tab you will find a button that opens the log file location:

A window with the location ‘C:\Users\{myuser}\AppData\Roaming\Jotta\JottaWorld\log’ will be opened. Go to the parent directory, and there you will find the ‘db’ directory.

Keep this location open and QUIT the Jotta client (from the taskbar, or any other effective method).

Copy the ‘db’ folder (or its parent ‘JottaWorld’) to a work or backup location. NEVER change anything without having a backup copy of the ‘db’ folder, or preferably the whole ‘JottaWorld’ (parent) folder, in case something goes wrong.

Examining the databases

From here, I will examine each of the databases (.db files) and go through what I’ve found out. I will use the sqlite3 client supplied by the Microsoft-invented Ubuntu (i.e. under WSL); the alternative on Windows is to use a native sqlite3 client the same way, or to just copy the ‘JottaWorld’ or ‘db’ directory to a computer with Linux (or any other real operating system) installed.

Basic sqlite3 usage

To open the database in sqlite3, simply use the sqlite3 command followed by the database name:

sqlite3 c.db

To show all tables in a database:

.tables

To show the table layout:

.schema {table name}

Select and update statements work basically as in other SQL clients.
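
A quick way to try these commands without touching the Jotta files is a throwaway database; the table name ‘demo’ below is just an example of mine:

```shell
# Create a scratch database and a table to inspect.
db=$(mktemp /tmp/demo.XXXXXX)
sqlite3 "$db" "CREATE TABLE demo (id INTEGER PRIMARY KEY, name TEXT);"

tables=$(sqlite3 "$db" ".tables")       # lists the table names
schema=$(sqlite3 "$db" ".schema demo")  # prints the CREATE TABLE statement
echo "$tables"
echo "$schema"
rm -f "$db"
```

Note that the dot commands also work when given as the second argument to sqlite3, which is handy for scripting.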

c.db (outside the ‘db’ folder)

An empty database with a single table ‘c’, defined as:

CREATE TABLE c (id INTEGER PRIMARY KEY ASC AUTOINCREMENT,type integer, time integer, size integer, attempts integer, checksum string, path string, known );

The use of it is for me unknown (as the table is empty in my db).
This database was last changed almost two years before I stopped the Jotta client.

dl.db

Contains only one table ‘requests’ defined as

CREATE TABLE requests (id integer primary key autoincrement, callerid integer, localpath, remotepath, created integer, modified integer, revision integer, size integer, checksum varchar(32), queue integer, state integer, attempts integer, flags integer );

The use of it is for me unknown (as the table is empty in my db).
This database was last changed a week before I stopped the Jotta client.

dlsq.db

Database for the Jotta Sync folder. This folder is by default synced in full on all computers set up against the same Jotta account. There is no selective sync or OneDrive-like on-demand sync in Jotta; the only option is to completely disable the sync folder on the “Sync” tab in the settings. The sync folder location can be changed there too.

Tables:

jwt_blockingevents

(empty)

jwt_files

Information about all files

jwt_folders

Information about all folders

jwt_queuedfiles

Files checksummed and queued for transfer

jwt_shares

Shared files and folders within the sync folder

jwt_folders
The table is defined as:

CREATE TABLE jwt_folders (jwc_id INTEGER PRIMARY KEY ASC AUTOINCREMENT, jwc_stateid, jwc_remotepath, jwc_remotehash, jwc_localpath, jwc_localhash, jwc_basepath, jwc_relativepath, jwc_folderhash , jwc_state, jwc_parent, jwc_newpath);
jwc_id

Folder id, used in the jwc_parent column and in jwc_files

jwc_stateid

empty on the data I have

jwc_remotepath

Path to the folder at Jotta, starting with ‘/{Jotta user name}/Jotta/Sync/’

jwc_remotehash

md5sum of the folder (?), even though a folder itself cannot be hashed

jwc_localpath

The full local path to the folder

jwc_localhash

md5sum of the folder (?), even though a folder itself cannot be hashed

jwc_basepath

empty on the data I have

jwc_relativepath

Path relative to the Sync folder location, empty on many of the entries

jwc_folderhash

empty on the data I have

jwc_state

State as cleartext ‘Updated’ if all files are synced

jwc_parent

id (jwc_id) of parent folder

jwc_newpath

empty on the data I have

jwt_files
The table is defined as:

CREATE TABLE jwt_files (jwc_id INTEGER PRIMARY KEY ASC AUTOINCREMENT, jwc_remotepath, jwc_remotesize INTEGER, jwc_remotehash, jwc_localpath, jwc_localsize INTEGER, jwc_localhash, jwc_relativepath, jwc_created INTEGER, jwc_modified INTEGER, jwc_updated INTEGER, jwc_status, jwc_checksum, jwc_state, jwc_uuid, jwc_revision , jwc_folderid, jwc_newpath);
jwc_id

File id

jwc_remotepath

Path to the file at Jotta, starting with ‘/{Jotta user name}/Jotta/Sync/’

jwc_remotesize

File size on the remote end (should match localsize)

jwc_remotehash

md5sum of something at the remote end

jwc_localpath

The full local path to the file

jwc_localsize

File size on the local side (should match remotesize)

jwc_localhash

md5sum of something at the local side

jwc_relativepath

Path relative to the remote location, empty on many of the entries

jwc_created

timestamp of file creation

jwc_modified

timestamp of file modification

jwc_updated

zero on all my files

jwc_status

empty on the data I have

jwc_checksum

file md5 checksum

jwc_state

either ‘UpdatedFileState’ or ‘MovingFileState’ (used on renamed files, see ‘jwc_newpath’)

jwc_uuid

don’t know, ‘{00000000-0000-0000-0000-000000000000}’ on most files

jwc_revision

0, 1 or 11 on all my files

jwc_folderid

id (jwc_id from jwt_folders) of containing folder

jwc_newpath

New local name of a file renamed because of an upload error
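
As an example of what the state columns can be used for, this sketch lists files stuck in ‘MovingFileState’ together with their new names. It runs against a scratch database with a trimmed-down jwt_files schema and made-up paths, so nothing real is touched; on real data, run the SELECT against a *copy* of dlsq.db instead.

```shell
# Scratch database with a reduced jwt_files schema (columns as above).
db=$(mktemp /tmp/dlsq_test.XXXXXX)
sqlite3 "$db" "CREATE TABLE jwt_files (jwc_id INTEGER PRIMARY KEY ASC AUTOINCREMENT, jwc_localpath, jwc_state, jwc_newpath);"
sqlite3 "$db" "INSERT INTO jwt_files (jwc_localpath, jwc_state, jwc_newpath) VALUES
  ('C:/Sync/report.doc', 'UpdatedFileState', NULL),
  ('C:/Sync/photo.jpg',  'MovingFileState',  'C:/Sync/photo (1).jpg');"

# List renamed files: original local path and the new path.
out=$(sqlite3 "$db" "SELECT jwc_localpath, jwc_newpath FROM jwt_files WHERE jwc_state = 'MovingFileState';")
echo "$out"
rm -f "$db"
```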

jwt_queuedfiles
The table is defined as:

CREATE TABLE jwt_queuedfiles (jwc_id INTEGER PRIMARY KEY ASC AUTOINCREMENT, jwc_remotepath, jwc_remotesize INTEGER, jwc_localpath, jwc_localsize INTEGER, jwc_relativepath, jwc_created INTEGER, jwc_modified INTEGER, jwc_status, jwc_checksum, jwc_revision INTEGER, jwc_queueid, jwc_type, jwc_hash , jwc_folderid);

It was empty in my current copy of the database, but it should be more or less like jwt_files (used only temporarily).

jwt_shares
The table is defined as:

CREATE TABLE jwt_shares (jwc_id INTEGER PRIMARY KEY ASC AUTOINCREMENT, jwc_shareid, jwc_localpath, jwc_remotepath, jwc_owner, jwc_members );

Mostly self-explanatory, except for the two fields I’m unable to explain 🙂
jwc_shareid is in the form of the jwc_uuid given above; jwc_owner is probably some secret string about my user (at Jotta) that I’m not supposed to share. It’s a 24-character alphanumeric string.

jobs.db

Contains only one table ‘jobs’ defined as

CREATE TABLE jobs (id integer primary key autoincrement, status integer, uri, name, path, databasepath, files integer, bytes integer );

The use of it is for me unknown (as the table is empty in my db).
This database file was last changed almost a year before I stopped the client.

mm.db

Backup folders. This is the only table I have made manual changes to (I made the folder name listed in the GUI more obvious for some entries). Never change anything without having a backup, and never change anything while the client is running.

Tables:

backup_schedule

The backup schedule (Schedule tab in settings)

backup_schedule_copy

Backup copy of the backup schedule

excludes

Files and folders excluded from backup

excludes_copy

Internal backup copy of the excludes table

mountpoints

All backup folders set in the client

backup_schedule and backup_schedule_copy
The backup schedule in the settings seems to be a very simplified one. Judging by the database layout, it looks like they prepared to allow for different backup time settings every day (I don’t know if it works).
The table is defined as:

CREATE TABLE backup_schedule(id INTEGER PRIMARY KEY, mountpoint INTEGER, start_day TEXT, start_hour INTEGER, start_minute INTEGER, end_day TEXT, end_hour INTEGER, end_minute INTEGER);

All self-explanatory except “mountpoint”, which is set to “-1” when I create a schedule. If the schedule is set to any of the multi-day variants (“weekends”,”weekdays”,”everyday”) there will be multiple entries in the database, one for each day:

sqlite> select * from backup_schedule;
1|-1|Monday|2|0|Monday|7|0
2|-1|Sunday|2|0|Sunday|7|0
3|-1|Saturday|2|0|Saturday|7|0
4|-1|Wednesday|2|0|Wednesday|7|0
5|-1|Tuesday|2|0|Tuesday|7|0
6|-1|Friday|2|0|Friday|7|0
7|-1|Thursday|2|0|Thursday|7|0
sqlite> select * from backup_schedule;
1|-1|Sunday|2|0|Sunday|7|0
2|-1|Saturday|2|0|Saturday|7|0
sqlite>

My guess about the ‘mountpoint’ column (which is set to “-1” by the schedule settings in the client) is that it refers to the ‘mountpoints’ table, so theoretically it should be possible to create separate schedules for every one of the mountpoints by directly entering them into the database…
The ‘backup_schedule_copy’ table contains the schedule before making changes through the client.
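
Here is an untested sketch of that idea: recreate the backup_schedule schema in a scratch database and insert a row with ‘mountpoint’ set to 2 instead of -1. Whether the client honours such a row is exactly the open question; on the real mm.db, only try this on a copy with the client stopped.

```shell
# Scratch copy of the backup_schedule schema from mm.db.
db=$(mktemp /tmp/mm_test.XXXXXX)
sqlite3 "$db" "CREATE TABLE backup_schedule(id INTEGER PRIMARY KEY, mountpoint INTEGER, start_day TEXT, start_hour INTEGER, start_minute INTEGER, end_day TEXT, end_hour INTEGER, end_minute INTEGER);"

# Hypothetical per-mountpoint entry: mountpoint 2, Monday 02:00-07:00.
sqlite3 "$db" "INSERT INTO backup_schedule (mountpoint, start_day, start_hour, start_minute, end_day, end_hour, end_minute) VALUES (2, 'Monday', 2, 0, 'Monday', 7, 0);"

out=$(sqlite3 "$db" "SELECT * FROM backup_schedule;")
echo "$out"
rm -f "$db"
```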

excludes and excludes_copy
All files and folders that are excluded from the backup. This also includes the system and hidden files and folders that are not backed up by default. From the client settings, it is possible to include hidden files and folders.
The table is defined as:

CREATE TABLE excludes(id INTEGER PRIMARY KEY, mountpoint INTEGER, pattern VARCHAR(1024));

Not much to explain here. ‘mountpoint’ is set to ‘-1’, and I find no possible use for it to match an entry in the ‘mountpoints’ table. ‘pattern’ allows for simple pattern matching (*) for the full local path of a file or folder to exclude from backup.

mountpoints
This table contains all the backup folders defined in the client.
The table is defined as:

CREATE TABLE mountpoints(jwc_id INTEGER PRIMARY KEY ASC AUTOINCREMENT,jwc_name,jwc_path,jwc_device,jwc_description,jwc_status,jwc_location,jwc_type,jwc_ip,jwc_suspended );
jwc_name

Name displayed in the client

jwc_path

The path for the folder to backup

jwc_device

Computer name (for the Jotta side?)

jwc_description

Computer name

jwc_status

Status, can be any of the following:
Scanning
ScheduleWaiting
AllGood
Uploading
QueuedForScan

jwc_location

‘Local’ or ‘Remote’

jwc_type

Zero on all my entries

jwc_ip

127.0.0.1 for local paths, empty for remote

jwc_suspended

“Suspended” for paused backups, blank otherwise

I find the content of jwc_status to be incorrect more often than not: while writing this, the client is scanning one of my network drives, but the database says “Uploading”. Many entries are “Up to date” according to the client, but listed as something different in the db.

reque_c and reque_u

Two more sqlite3 database files, these without the .db extension.

reque_c contains a table with queued uploads (scanned files queued for checksumming) and has the same definition as reque_u. As these files are still waiting to be checksummed, the “checksum” field in the blob is an empty string. The content of the extraData fields in the blob is written to sm.db at (or before) this stage.

reque_u contains a table with queued uploads (checksummed, waiting for upload slot):

CREATE TABLE uploads (id INTEGER PRIMARY KEY, tag INTEGER, blob BLOB );

id: just the entry id, duplicated (as the last value) in the blob
tag: the oddly named field for the mountpoint id (in mm.db), repeated in the blob
blob: a JSON structure with the file information:

{
        "checksum": "6cd9bca0e441280fb72ff5cf6f7991b3",
        "cre": 1657809534,
        "extraData": {
            "id": 9730953,
            "parent": 12740
        },
        "localpath": "C:/Users/peo/Downloads/Toro Reelmaster 216 - Operators Manual - MODEL NO. 03410TE—70001 & UP.pdf",
        "mod": 1657809535,
        "remotepath": "/jfs/LAPTOP-3/Downloads/Toro Reelmaster 216 - Operators Manual - MODEL NO. 03410TE—70001 & UP.pdf",
        "size": 1972185,
        "tag": 9,
        "timeout": 0
    },
    "id": 9907
}

Most of the content of the blob is self-explanatory if you have read this far.
checksum: the md5 checksum of the file
cre, mod: timestamp of creation and last modification
extraData:id is the new file id and extraData:parent is the folder containing the file (the folders table in sm.db). This information was written to the database in the scanning phase (reque_c).

sm.db

Contains information on all backed up files
Tables:

files

Information for all backed up files

folders

Information for all backed up folders

mountpoint_status

(empty)

folders
The table is defined as:

CREATE TABLE folders (id integer primary key autoincrement, path text UNIQUE, state integer, parent integer, mountpoint integer, checksum varchar(20));
path

Full local path to the folder

state

Contains a value of 1, 2, 5, 6 or 7 in my database; I have no idea what it represents

parent

Id of parent folder (in this table)

mountpoint

mountpoint id in mm.db

checksum

md5 checksum on something (a folder cannot be checksummed)

files
The table is defined as:

CREATE TABLE files (id integer primary key autoincrement, path text UNIQUE, parent integer, size integer, modified integer, created integer, checksum varchar(16), state integer, mountpoint integer);
path

The full path of the backed up file

parent

the id of the containing folder (in folders table)

size

file size

modified

timestamp of modification

created

timestamp of creation

checksum

md5 checksum of file

state

Contains a value of 6 or 7 in my database; I have no idea what it represents

mountpoint

mountpoint id in mm.db

So why all this trouble analyzing the databases?

I wanted an easy way of finding my files by their md5 checksums; that was one of the reasons. Another thing (not solved yet) is that I want to find out how to recreate the share link for a specific file or folder within a publicly shared folder on my Jotta account (without going through the web interface; I mean, it’s already shared inside an accessible folder).
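
For the checksum part, the queries are straightforward. This sketch runs them against a scratch database with the files schema from sm.db and made-up rows (the paths and checksums are examples of mine); on real data, point sqlite3 at a copy of sm.db instead. The first query finds files by a given md5, the second lists duplicates.

```shell
# Scratch database with the files schema from sm.db.
db=$(mktemp /tmp/sm_test.XXXXXX)
sqlite3 "$db" "CREATE TABLE files (id integer primary key autoincrement, path text UNIQUE, parent integer, size integer, modified integer, created integer, checksum varchar(16), state integer, mountpoint integer);"
sqlite3 "$db" "INSERT INTO files (path, checksum) VALUES
  ('C:/Data/a.pdf', '6cd9bca0e441280fb72ff5cf6f7991b3'),
  ('C:/Old/a.pdf',  '6cd9bca0e441280fb72ff5cf6f7991b3'),
  ('C:/Data/b.txt', 'd41d8cd98f00b204e9800998ecf8427e');"

# Find files by md5 checksum:
byhash=$(sqlite3 "$db" "SELECT path FROM files WHERE checksum = '6cd9bca0e441280fb72ff5cf6f7991b3';")
# Find duplicates (same checksum stored under more than one path):
dups=$(sqlite3 "$db" "SELECT checksum, COUNT(*) AS n FROM files GROUP BY checksum HAVING n > 1;")
echo "$byhash"
echo "$dups"
rm -f "$db"
```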

Odd things I noticed: there are md5 checksums for folders, and three different ones for the entries in the sync folder (the jwt_files and jwt_folders tables in dlsq.db), while for the individual files there is only the file’s real md5 checksum.

Anyway… that investigation will continue some other day…

Comment below if you find the way to calculate the share-id, or find it useful in any other way 🙂

Xpenology – Synology DSM on non-Synology hardware

This bunch of resources needs to be reorganized some day… I just made this post to close off a rotting web browser window…

General

https://xpenology.org/
https://xpenology.org/installation/
https://xpenology.club/category/tutorials/
https://xpenology.com/forum/topic/9394-installation-faq/?tab=comments#comment-81101
https://xpenology.com/forum/topic/9392-general-faq/?tab=comments#comment-82390

Specific hardware

https://xpenology.com/forum/topic/20314-buffalo-terastation-ts5800d/
https://en.wikipedia.org/wiki/Haswell_(microarchitecture)

Misc

https://xpenology.com/forum/topic/24864-transcoding-without-a-valid-serial-number/
https://xpenology.com/forum/topic/38939-serial-number-for-ds918/
https://xpenogen.github.io/serial_generator/index.html

https://xpenology.com/forum/topic/29872-tutorial-mount-boot-stick-partitions-in-windows-edit-grubcfg-add-extralzma/
https://xpenology.com/forum/topic/12422-xpenology-tool-for-windows-x64/page/5/

Unsorted

https://xpenology.com/forum/topic/12952-dsm-62-loader/page/75/
https://xpenology.com/forum/topic/28183-running-623-on-esxi-synoboot-is-broken-fix-available/
https://xpenology.com/forum/topic/13333-tutorialreference-6x-loaders-and-platforms/
https://xpenology.com/forum/topic/7973-tutorial-installmigrate-dsm-52-to-61x-juns-loader/
https://xpenology.com/forum/topic/7294-links-to-dsm-and-critical-updates/

Synology DSM archive

https://archive.synology.com/download/Os/DSM/6.2.3-25426-3

Errors

https://xpenology.com/forum/topic/14114-usb-stick-no-vidpid/
https://xpenology.com/forum/topic/9853-dsm_ds3617xs-installation-error-the-file-is-probably-corrupt-13/
https://xpenology.com/forum/topic/13253-error-21-problem/

Synology DSM 7 and broken FTP support in curl

I recently updated my DS1517 to DSM 7 and noticed that FTP support has been left out of the curl/libcurl build they include. This is how I compiled the latest version of curl with support for all the omitted protocols. It still needs more work, since I was not able to compile it with SSL support (so no https, which the DSM 7 curl does include).

My guide is for the Synology DS1517 (ARM). You have to download the correct files for your NAS and set the correct options (paths and names) for the compile tools if you have another model.

The problem

For some unknown reason, Synology decided to drop support for all protocols except http and https in the curl binary included with DSM 7:

root@DS1517:~# curl --version
curl 7.75.0 (arm-unknown-linux-gnueabi) libcurl/7.75.0 OpenSSL/1.1.1k zlib/1.2.11 c-ares/1.14.0 nghttp2/1.41.0
Release-Date: 2021-02-03
Protocols: http https
Features: alt-svc AsynchDNS Debug HTTP2 HTTPS-proxy IPv6 Largefile libz NTLM NTLM_WB SSL TrackMemory UnixSockets
root@DS1517:~#

The outcome of following this guide:

curl.ftp --version
curl 7.79.1 (arm-unknown-linux-gnueabihf) libcurl/7.79.1
Release-Date: 2021-09-22
Protocols: dict file ftp gopher http imap mqtt pop3 rtsp smtp telnet tftp
Features: alt-svc AsynchDNS IPv6 Largefile UnixSockets

As mentioned above, I was not able to enable SSL in my compiled version, so this will not replace the curl included in DSM 7, but it can be installed in /bin under another name, as libcurl is statically linked into the binary.

What you need to compile for the Synology

The first thing you need is a Linux installation as a development system containing the Synology toolkit for cross-compiling.
A fairly standard installation will do, at least mine did (but that also includes PHP, MySQL, Apache and other useful stuff). This is preferably done on a virtual machine, but you can of course use a physical computer for it.

You also need the Synology DSM toolchain for the CPU in the NAS you want to compile for. I found the links in the Synology Developer Guide (beta).
There is also supposed to be an online version of the guide, but at least for me, none of the links within it worked.

Get the toolchain
To find out which toolchain you need, run the command ‘uname -a’:

root@DS1517:~# uname -a
Linux DS1517 3.10.108 #41890 SMP Thu Jul 15 03:42:22 CST 2021 armv7l GNU/Linux synology_alpine_ds1517

As seen above, the DS1517 reports “synology_alpine_ds1517”, so you should look for the “alpine” versions of downloads for this NAS.
Get the correct toolchain for your NAS from the Synology toolkit downloads; for the DS1517, I downloaded the file “alpine-gcc472_glibc215_alpine-GPL.txz”.
Download and unpack it on the development system:

wget "https://global.download.synology.com/download/ToolChain/toolchain/7.0-41890/Annapurna%20Alpine%20Linux%203.10.108/alpine-gcc472_glibc215_alpine-GPL.txz"
tar xJf alpine-gcc472_glibc215_alpine-GPL.txz -C /usr/local/

The above will download and unpack the toolchain to the /usr/local/arm-linux-gnueabihf folder. This contains Linux executables for the GNU compilers (gcc, g++ etc).

arm-linux-gnueabihf-gcc: No such file or directory
Now, whenever you try to execute any of the commands extracted to the bin directory, you will probably get a “No such file or directory” error (even though the path and filename are correct and the file is executable).
If you examine the files using the ‘file’ command, you will discover that they are 32-bit executables:

root@ubu-01:~# file /usr/local/arm-linux-gnueabihf/bin/arm-linux-gnueabihf-gcc-4.7.2
/usr/local/arm-linux-gnueabihf/bin/arm-linux-gnueabihf-gcc-4.7.2: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux.so.2, for GNU/Linux 2.6.15, stripped
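If ‘file’ is not available on your development box, the ELF class can also be read straight from the binary’s header: byte 5 (EI_CLASS) is 01 for 32-bit and 02 for 64-bit. A minimal sketch (the commented path is the toolchain location used above):

```shell
# EI_CLASS is the 5th byte of an ELF file: 01 = 32-bit, 02 = 64-bit
elf_class() {
  case "$(od -An -tx1 -j4 -N1 "$1" | tr -d ' ')" in
    01) echo "32-bit" ;;
    02) echo "64-bit" ;;
    *)  echo "not an ELF file?" ;;
  esac
}

# e.g. on the cross-compiler shipped in the toolchain:
# elf_class /usr/local/arm-linux-gnueabihf/bin/arm-linux-gnueabihf-gcc-4.7.2
```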

I found the solution to the problem here:
arm-linux-gnueabihf-gcc: No such file or directory
In short:

dpkg --add-architecture i386
apt-get update
apt-get install git build-essential fakeroot
apt-get install gcc-multilib
apt-get install zlib1g:i386

Now that we have the cross-compiling toolkit working, let’s continue with curl.

Cross-compile curl for Synology NAS

The current version at the time I wrote the guide was 7.79.1, so I downloaded the source and uncompressed it:

wget https://curl.se/download/curl-7.79.1.tar.gz
tar xfz curl-7.79.1.tar.gz
cd curl-7.79.1

Set some variables and GCC options

export TC="arm-linux-gnueabihf"
export PATH=$PATH:/usr/local/${TC}/bin
export CPPFLAGS="-I/usr/local/${TC}/${TC}/include"
export AR=${TC}-ar
export AS=${TC}-as
export LD=${TC}-ld
export RANLIB=${TC}-ranlib
export CC=${TC}-gcc
export NM=${TC}-nm

Build and install into installdir

./configure --disable-shared --enable-static --without-ssl --host=${TC} --prefix=/usr/local/${TC}/${TC}
make
make install

The above builds a statically linked curl binary for the Synology and puts it in the ‘bin’ folder under the path given with --prefix.

The final step is to copy the ‘curl’ binary over to the Synology (not to /bin yet) and test it. Use “--version” to check that the binary supports FTP and the other protocols Synology omitted:

./curl --version
curl 7.79.1 (arm-unknown-linux-gnueabihf) libcurl/7.79.1
Release-Date: 2021-09-22
Protocols: dict file ftp gopher http imap mqtt pop3 rtsp smtp telnet tftp
Features: alt-svc AsynchDNS IPv6 Largefile UnixSockets

If everything seems ok, copy the file to /bin and give it another name:

cp -p curl /bin/curl.ftp

If it complains about curl and libcurl having different versions, the static linking of the correct libcurl failed somewhere along the way.

Most useful sources for this article:

https://global.download.synology.com/download/Document/Software/DeveloperGuide/Firmware/DSM/7.0/enu/DSM_Developer_Guide_7_0_Beta.pdf
https://thalib.github.io/2017/02/17/32bit-no-such-a-file-or-directory/
https://curl.se/docs/install.html

Inner secrets of Synology Hybrid RAID (SHR) – Part 2b – My Synology case

About 30% into the reshaping phase (after the first disk swap), my NAS became unresponsive (both shell and GUI disconnected), and I had to wait all day until I got home, did a hard reset, and hoped for the best.

In the meantime, I logged a case with Synology support. They were not of any direct help, but the hard reset did bring the NAS back, and it continued the reshaping process.

My case with Synology support

==
2020-12-01 13:51:37
==
Replaced one of the smallest drives in my NAS yesterday (SHR) as a first step for later expansion (I will replace all drives with larger ones before expanding – if possible to delay any automatic expansion until then).

About 80% finished with rebuilding yesterday, but for some reason it started over after the first round.

Today about 30% finished when I lost the connection to the NAS (over ssh and the web interface). It does not auto-reboot and does not respond to ping.

To lessen the risk of data loss, what should my first step be ? Can I just pull the plug and hard-reboot the NAS with the current disks mounted (14TB, 3TB, 3TB, 8TB, 8TB in a SHR config), or is it better to replace or remove the disk that I recently replaced (in slot 1: 14TB in place of the previous still untouched 3TB) ?

What are the steps to getting the volume back online if it does not mount automatically ?

As the NAS is down, I am not able to upload any logs, but attached is the rebuild status before the crash.

==
2020-12-01 15:28:58
Synology response (besides the auto response “send us logs”)
Not useful at all: this is exactly what I had already done, and “Mark”, who replied, clearly did not read my description.
==
Hello,

Thank you for contacting Synology.

If you wish to replace a drive in your unit, please perform these steps one by one allowing for the repair to complete before replacing any further drives.
1. Pull out the drive in question.
2. Insert a replacement drive.
3. Proceed to the Storage Manager > Storage Pool > select the volume in question and click “Manage/Action”
4. Run through the wizard to repair the volume in question with the replacement drive.
5. Once complete, proceed to the Storage Manager > Volume and Configure/Edit the volume to configure the volume to have additional size.
Please see the link below for more help.
https://www.synology.com/en-uk/knowledgebase/DSM/help/DSM/StorageManager/storage_pool_expand_replace_disk

Please bare in mind that you benefit from the additional space from the drives you will need to replace at least 2 drives for larger ones in RAID 5/SHR or 3 drives in RAID6/SHR2.
You can see the type of RAID used via – DSM > Storage Manager > Storage Pool.

If you have any further questions please do not hesitate to get in touch.

Best Regards,
Mark

==
2020-12-01 16:02:14
My reply
==
Ok, so I restart the problem description then:

I did (yesterday):
0. Power down Synology
1. Pull out the drive in question.
2. Insert a replacement drive.
3. Proceed to the Storage Manager > Storage Pool > select the volume in question and click “Manage/Action”
4. Run through the wizard to repair the volume in question with the replacement drive.

THEN, today:
4b. Today about 30% finished when I lost the connection to the NAS (over ssh and the web interface). It does not auto-reboot and does not respond to ping.

SO what now ?
As the NAS is unresponsive I will never reach step 5:

To lessen the risk of data loss, what should my first step be ? Can I just pull the plug and hard-reboot the NAS with the current disks mounted (14TB, 3TB, 3TB, 8TB, 8TB in a SHR config), or is it better to replace or remove the disk that I recently replaced (in slot 1: 14TB in place of the previous still untouched 3TB) ?

What are the steps to getting the volume back online if it does not mount automatically ?

Also, is there an option to DELAY the expansion until all drives have been replaced, as you replied changing the first drive will not expand the volume, but I’m not there yet since I’m stuck in a crash (unresponsive system)

==
2020-12-02 23:25:46
My reply to Synology’s suggestion to collect logs using the Support Center
==
How do I launch “Support Center” on the device when it is unresponsive (which was my initial question – what to do when it hangs in the middle of repairing/reshaping) ?

I forced it off and restarted and hoped for the best – reshaping continued and the second disk is now in reshaping mode.

My other question has not yet been answered:

Is it possible to delay the time consuming step of reshaping until all disks have been replaced ?

Initial configuration: 3TB 3TB 3TB 8TB 8TB

After replacement of the first disk: 14TB 3TB 3TB 8TB 8TB, after reshaping the first disk got a partition to match the 8TB disks.

After replacement of the second disk: 14TB 14TB 3TB 8TB 8TB, while reshaping again, now disk 1 and 2 looks similar with one partition matching the largest of the remaining 3TB disk, one matching the largest on the 8TB disks and the remainder (roughly about 6TB) the same on both 14TB disks.

When replacing the third 3TB disk, I assume the following would happen:
(14TB 14TB 14TB 8TB 8TB)

On the first and second disk, the (about) 3TB partition will be replaced with a partition to match the 8TB disks. Then the remainder (3 disks with 6TB unallocated space) will be used for another raid5 (after yet another reshape)

So my question again; is it possible to delay reshaping until I have had all the disks replaced. I understand that the “rebuild” is needed in between every replacement, but “reshape” should be needed only once.

==
2020-12-03 12:19:07
Synology response
==
Hello,

Thank you for the reply.

I’m afraid you cannot delay or prevent this process, once it starts it needs to run until fruition.

I would suggest to leave this running for now, if the volume does crash fully in the mean time I can take a look at what we can do to recover the volume, but there is not much I can do currently I’m afraid.

If you have any further question please do not hesitate to get in touch.

Best Regards,
Mark
==

The crash

https://unix.stackexchange.com/questions/299981/recover-from-raid-5-to-raid-6-reshape-and-crash-mdadm-reports-0k-sec-rebuild
https://www.google.com/search?q=restart+synology+while+rebuilding
https://community.synology.com/enu/forum/17/post/20414

General SHR and mdraid links

https://www.youtube.com/results?search_query=synology+shr
https://bobcares.com/blog/raid-resync/
https://www.google.com/search?q=mdraid+reshape

Buffalo LS-QVL root access

https://forums.buffalotech.com/index.php?topic=37677.0

Get the updated acp_commander.jar from Github
https://github.com/1000001101000/acp-commander

All the needed files are in place; it just needs some tweaking. As with everything Buffalo, I don’t know if it survives a reboot.

java -jar acp_commander.jar -t 192.168.0.10 -pw AdminPassword -c "(echo newrootpass;echo newrootpass)|passwd"
java -jar acp_commander.jar -t 192.168.0.10 -pw AdminPassword -c "sed -i 's/PermitRootLogin/#PermitRootLogin/g' /etc/sshd_config"
java -jar acp_commander.jar -t 192.168.0.10 -pw AdminPassword -c "echo 'PermitRootLogin yes' >>/etc/sshd_config"
java -jar acp_commander.jar -t 192.168.0.10 -pw AdminPassword -c "sed -i 's/root/rooot/g' /etc/ftpusers"

or

java -jar acp_commander.jar -t 192.168.0.10 -pw AdminPassword -s

then execute the same commands in the shell:

(echo newrootpass;echo newrootpass)|passwd
sed -i 's/PermitRootLogin/#PermitRootLogin/g' /etc/sshd_config
echo "PermitRootLogin yes" >>/etc/sshd_config
sed -i 's/root/rooot/g' /etc/ftpusers

I got the error message “pam_listfile(sshd:auth): Refused user root for service sshd” in /var/log/messages during my first login attempt.
The last command above is needed because root login was denied via this file (/etc/ftpusers), which is used as a list of users to deny access in /etc/pam.d/login (or /etc/pam.d/sshd).
I found the hint to check the pam.d configuration here (after searching the logs for the cause of the login error):
https://docs.jdcloud.com/en/virtual-machines/ssh-login-error-service-sshd
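For reference, the pam_listfile line causing the denial presumably looks something like the following in /etc/pam.d/sshd or /etc/pam.d/login (an assumption reconstructed from the log message; check your own firmware’s files):

```
auth    required    pam_listfile.so item=user sense=deny file=/etc/ftpusers onerr=succeed
```

Renaming “root” to “rooot” in /etc/ftpusers (the last sed above) makes this deny rule no longer match the root user.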

Since writing this note, I have moved on to installing Debian on my LS-QVL, so I can no longer verify each of the steps taken to gain root access.

Inner secrets of Synology Hybrid RAID (SHR) – Part 2

Changing the first disk and my case to Synology support

Now it was time to replace the first disk. As I assumed this would never go wrong (!) and did not plan to document the upgrade, I did not record any information about the partitions, mdraids and volumes during this first disk swap.

The instructions from Synology are quite good for this (until something breaks down):
Replace Drives to Expand Storage Capacity

Basically it says: replace the disks one by one, starting with the smallest, and wait for completion before replacing the next.

For the first disk swap, I actually shut down my DS1517 before replacing the disk (many models, including the DS1517, support hot-swapping the disks). When I powered the DS1517 back up, I got the expected “RAID degraded” beep.
I checked that the new drive was recognized, and then started the repair of the storage pool. As this usually takes many hours, and it was started in the evening, I have no idea of the actual time spent repairing (rebuilding) the pool; it was about 90% finished when I stopped watching the status around midnight.

The next day, I saw that it had seemingly “restarted” (a lower percentage than the day before), but this is actually the next step, initiated directly after the pool repair. It is called “reshaping”, and during that process the other mdraids are changed and adjusted (where possible) to make use of the new disk.
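Both phases can be followed from a shell on the NAS; /proc/mdstat labels them differently, which makes the apparent “restart” easy to identify (illustrative output only; the details vary per system):

```
cat /proc/mdstat
#   during the repair:   [=>...................]  recovery = 30.1% ...
#   during the reshape:  [=>...................]  reshape = 12.4% ...
```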

Changes during the first disk swap

These are partly assumptions, because I did not collect enough information between swapping the disk and getting about a third of the way into reshaping.

At the point of changing the first disk (refer to the previous part of my article), my storage pool/volume consisted of two mdraids joined together:
md2: RAID 5 of sda5, sdb5, sdc5, sdd5, sde5: total size about 11.7TB
md3: RAID 1 of sdd6, sde6: total size of about 4.8TB

When I pulled the first drive (3TB) and replaced it with a 14TB drive, I assume the partition table on that disk was created like this (status pulled mid-reshape after the first disk swap, so I am fairly sure this is correct):

/dev/sda1                  2048         4982527         4980480  fd
/dev/sda2               4982528         9176831         4194304  fd
/dev/sda5               9453280      5860326239      5850872960  fd
/dev/sda6            5860342336     15627846239      9767503904  fd

sda5 was matched up with the size of the old sda5 (and the ‘5’ partitions on the other disks).
sda6 was created either in the step before the rebuild or right before reshaping (this partition matches the size of the ‘6’ partitions on sdd and sde).
Because the new (14TB) disk is larger than the previously largest (8TB) one, there is some unpartitioned, unused space (about 5.8TB, which will come into use after the next disk swap).
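That figure can be sanity-checked from the sector counts above (assuming a nominal 14 TB, i.e. 14 × 10^12 byte, drive; the exact number depends on the drive’s real sector count):

```shell
# sda6 ends at sector 15627846239 (see the table above), so the first free
# sector is 15627846240; a nominal 14 TB drive has 14e12 / 512 sectors
awk 'BEGIN {
  total = 14e12 / 512
  used  = 15627846240
  printf "unallocated: %.2f TB\n", (total - used) * 512 / 1e12
}'
```

With nominal sizes this lands near 6 TB; the roughly 5.8 TB quoted above is in the same ballpark once the drive’s real sector count and TB/TiB rounding are accounted for.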

Reshaping

Again, I did not take any full status dumps that could confirm my information, but this is what I saw afterwards, with my guesses added based on the better logging of the later disk swaps.

After the storage pool was repaired, reshaping started automatically. During this step, the RAID1 consisting of sdd6 and sde6 (md3) was changed into a RAID5 consisting of sda6, sdd6 and sde6.
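For reference, the equivalent manual mdadm operations would look roughly like the following (a sketch only; DSM performs this internally, and these commands should not be run by hand on a live pool):

```
mdadm /dev/md3 --add /dev/sda6                      # add the new partition to md3
mdadm --grow /dev/md3 --level=5 --raid-devices=3    # convert the RAID1 to a 3-device RAID5
```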

About 30% into the reshaping phase, my NAS became unresponsive (both shell and GUI disconnected), and I had to wait all day until I got home, did a hard reset, and hoped for the best.

In the meantime, I logged a case with Synology support (see “Part 2b” of this article). They were not of any direct help, but the hard reset did bring the NAS back, and it continued the reshaping process.