Running Proxmox in a root-on-ZFS configuration with a RAID10 pool results in an interesting wrinkle: we need a boot volume from which to start the system and load what is required to recognize a ZFS pool. In effect, each disk in the first mirror pair will carry (at least) two partitions: a small boot partition first, and a second, large partition that participates in the ZFS pool.
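To see that layout on a live system, listing the block devices is enough; the device names and column selection below are just an example:
# lsblk -o NAME,SIZE,TYPE,FSTYPE /dev/sda /dev/sdb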
To see how it all works together, I tried failing a drive and replacing it with a different one.
Happy-case
If the drives had had identical sector sizes, the operation would have been simple. In this case, sdb is the good mirror volume and sda is the new, empty drive. We want to copy the working partition table from the good drive to the new one, and then randomize the new drive's GUIDs to avoid catastrophic confusion on the part of ZFS:
# sgdisk /dev/sdb -R /dev/sda
# sgdisk -G /dev/sda
After that, we should be able to use gdisk to view the partition table, identify which partition does what, and simply copy the contents of the relevant partitions from the good mirror to the new drive:
# gdisk /dev/sda
GPT fdisk (gdisk) version 1.0.1
Partition table scan:
MBR: protective
BSD: not present
APM: not present
GPT: present
Found valid GPT with protective MBR; using GPT.
Command (? for help): p
Disk /dev/sda: 5860533168 sectors, 2.7 TiB
Logical sector size: 512 bytes
Disk identifier (GUID): xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 5860533134
Partitions will be aligned on 8-sector boundaries
Total free space is 0 sectors (0 bytes)
Number  Start (sector)  End (sector)  Size        Code  Name
   1                34          2047  1007.0 KiB  EF02
   2              2048    5860516749  2.7 TiB     BF01  zfs
   9        5860516750    5860533134  8.0 MiB     BF07
Command (? for help): q
# dd if=/dev/sdb1 of=/dev/sda1
# dd if=/dev/sdb9 of=/dev/sda9
Then we would add the new disk to our ZFS pool and have it resilvered:
# zpool replace rpool /dev/sda2
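In my case the replacement occupied the same slot, so naming just the new partition was enough; if the pool lists the failed device under another name, the two-argument form of zpool replace takes the old device first and the new one second (the placeholder below stands for whatever name zpool status shows):
# zpool replace rpool <old-device> /dev/sda2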
To view the resilvering process:
# zpool status -v
pool: rpool
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Sat Sep 1 18:48:13 2018
2.43T scanned out of 2.55T at 170M/s, 0h13m to go
1.22T resilvered, 94.99% done
config:
NAME             STATE     READ WRITE CKSUM
rpool            DEGRADED     0     0     0
  mirror-0       DEGRADED     0     0     0
    replacing-0  DEGRADED     0     0     0
      old        UNAVAIL      0    63     0  corrupted data
      sda2       ONLINE       0     0     0  (resilvering)
    sdb2         ONLINE       0     0     0
  mirror-1       ONLINE       0     0     0
    sdc          ONLINE       0     0     0
    sdd          ONLINE       0     0     0
logs
  sde1           ONLINE       0     0     0
cache
  sde2           ONLINE       0     0     0
errors: No known data errors
The process is time-consuming on large drives, but since ZFS understands both the underlying disk layout and the filesystem on top of it, it only resilvers blocks that are actually in use, which can save a lot of time depending on how full the pool is.
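Just how much of a saving that is can be estimated up front, since the allocated space reported for the pool is what actually has to be resilvered:
# zpool list rpool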
When resilvering is done, we’ll just make sure there’s something to boot from on the new drive:
# grub-install /dev/sda
Installing for i386-pc platform.
Installation finished. No error reported.
Real life intervenes
Unfortunately for me, the new drive I tried had the modern 4 KiB sector size (“Advanced Format” / 4Kn), while my old drives were stuck with the older 512-byte standard. This led to the interesting side effect that the new drive was, counted in its own larger sectors, too small to fit the partitions described by the healthy mirror drive's partition table (the table is expressed in sector numbers, and a 4Kn drive has only an eighth as many sectors per byte of capacity):
# sgdisk /dev/sdb -R /dev/sda
Caution! Secondary header was placed beyond the disk's limits! Moving the header, but other problems may occur!
In the end, I used gdisk to create a new partition table by hand, with sizes for partitions 1 and 9 as close as possible to those on the healthy mirror (but not smaller!), skipping the steps involving the sgdisk utility entirely. The rest of the steps were identical.
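For reference, the layout corresponds roughly to the following non-interactive sgdisk commands, sized relative to the disk instead of using absolute sector numbers; this is only a sketch reconstructed from the gdisk listing above, not the exact invocation I used:
# sgdisk -n 1:0:+1M -t 1:EF02 /dev/sda
# sgdisk -n 2:0:-8M -t 2:BF01 -c 2:zfs /dev/sda
# sgdisk -n 9:0:0 -t 9:BF07 /dev/sda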
The next problem I encountered was a bit worse: Even though ZFS in the Proxmox VE installation managed 4Kn drives just fine, there was simply no way to get the HP MicroServer Gen7 host to boot from one, so back to the old 3 TB WD RED I went.
Conclusion
Running root-on-ZFS in a striped-mirrors (“RAID10”) configuration slightly complicates the replacement of any of the drives in the first mirror pair, compared to a setup where the ZFS pool is used for storage only.
Fortunately, the difference is minimal, and apart from the truly dangerous syntax and unclear documentation of the sgdisk command, replacing a boot disk really boils down to four steps:
- Make sure the relevant partitions exist.
- Copy non-ZFS-data from the healthy drive to the new one.
- Resilver the ZFS volume.
- Install GRUB.
With a pure data disk, the only step we have to think about is step 3.
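Condensed into commands, using the device names from the example above, the happy path looks like this:
# sgdisk /dev/sdb -R /dev/sda
# sgdisk -G /dev/sda
# dd if=/dev/sdb1 of=/dev/sda1
# dd if=/dev/sdb9 of=/dev/sda9
# zpool replace rpool /dev/sda2
# grub-install /dev/sda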
On the other hand, running hardware that is too new in old servers doesn't always work as intended. Note to future me: any meaningful expansion of disk space will require newer server hardware than the N54L-based MicroServer.