ZFS Backup Strategy with Sanoid and Syncoid
In my previous post, I discussed how I migrated my VMs to new storage. That gave me cause to also review my backup configuration, to make sure I can still recover from catastrophic events.
I use Sanoid and Syncoid for this purpose. Let’s see how that may look.
Primary backup
My backup strategy is multi-pronged. I take frequent snapshots of the file system where my VMs live. This is convenient for small oopsies, but of course doesn’t protect me should a catastrophic storage failure occur. Both the snapshots and their eventual cleanup are managed by Sanoid.
The relevant parts of my /etc/sanoid/sanoid.conf:
[ssdpool]
use_template = ssdpool
recursive = yes
process_children_only = no
[template_ssdpool]
frequently = 0
hourly = 36
daily = 30
monthly = 0
yearly = 0
autosnap = yes
autoprune = yes
We can see that the ssdpool file system gets recursive hourly and daily snapshots. Thirty-six of the hourly snapshots are kept, along with 30 of the daily snapshots, on the primary storage.
At every step, I can verify that this actually works by listing the ZFS snapshots:
sudo zfs list -t snapshot
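For completeness: Sanoid itself has to run periodically in order to take and prune these snapshots. Depending on your distribution this is handled by a packaged systemd timer or a cron entry roughly like the following (an illustration of the general idea rather than my exact setup; the path to sanoid may differ on your system):
*/15 * * * * root /usr/sbin/sanoid --cron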
Secondary backup
To protect against pool failures, I have a secondary pool in my hypervisor host, set up for backup storage. I call Syncoid from cron for this purpose.
This is handled by this part of my /etc/cron.d/syncoid:
02 * * * * root syncoid -r --no-sync-snap --compress=lzo --quiet ssdpool backuppool/ssdpoolbackup
At two minutes past every hour, Syncoid recursively sends the ZFS snapshots from ssdpool to backuppool/ssdpoolbackup.
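If I want to check that the replication is keeping up, the same kind of snapshot listing works on the target side, for example (dataset name as per my layout above):
sudo zfs list -t snapshot -r backuppool/ssdpoolbackup | tail -n 5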
Of course, I also have to ensure the backup volume doesn’t fill up with snapshots. Sanoid to the rescue again:
[backuppool/ssdpoolbackup]
use_template = backuppoolssd
recursive = yes
process_children_only = no
[template_backuppoolssd]
frequently = 0
hourly = 36
daily = 30
monthly = 0
yearly = 0
autosnap = no
autoprune = yes
For backuppool/ssdpoolbackup I don’t create new snapshots, but I keep the same pruning configuration that I have for the primary snapshots.
Tertiary backup
Finally, I have a completely separate physical server where I store an additional copy of my backups. This one is configured to connect to the primary server and fetch snapshots, again using Syncoid. Pulling from the backup server rather than pushing from the primary makes it slightly harder to mess things up on the backup server from the primary server.
The cron configuration line for syncoid on this machine looks like this:
20 * * * * root syncoid --sshkey=/root/.ssh/syncoid -r --no-sync-snap --compress=lzo --quiet syncoid@prodsrv1:backuppool/ssdpoolbackup backuppool/prodsrv1/ssdpool
It’s very similar to the one running on the main server, but it fetches the snapshots from the main server’s backup pool over SSH and stores them in a local backup pool. As you can see, it runs a bit later in the hour than the job on the main server (minute 20 rather than minute 02), so in most circumstances the tertiary copy follows shortly after the secondary one, keeping potential data loss small should something go wrong.
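For a pull-based setup like this, the backup server’s SSH key (the one referenced by --sshkey) has to be authorized for the syncoid user on the primary server, and that user needs enough ZFS permissions to send snapshots. I won’t reproduce my exact setup here, but the rough shape is something like this sketch; the exact permission set depends on which syncoid options you use:
# On prodsrv1 (the source): allow the syncoid user to send snapshots
# from the backup dataset. Adjust the permission list to your needs.
sudo zfs allow syncoid send,hold backuppool/ssdpoolbackup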
Of course here too, I make sure to clean up snapshots to avoid filling the disks over time:
[backuppool/prodsrv1/ssdpool]
use_template = backupssd
recursive = yes
process_children_only = no
[template_backupssd]
frequently = 0
hourly = 36
daily = 90
monthly = 0
yearly = 0
autosnap = no
autoprune = yes
The only big difference from the primary and secondary copies is that I keep the daily snapshots for longer on the backup server: 90 of them instead of 30.
Testing backups
To verify that the backup works, we can perform a small experiment: I have an otherwise empty VM, where I’ve simply created a text file in my home directory:
$ echo Test > is_this_file_gone.txt
$ cat is_this_file_gone.txt
Test
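Before deleting the file, it’s worth confirming that at least one snapshot has been taken since the file was created. Listing the snapshots of the VM’s dataset on the main server (path as per my layout above) shows the most recent ones:
sudo zfs list -t snapshot -r ssdpool/vdisks/testsrv1 | tail -n 3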
After waiting for a snapshot to be made, I remove the test file:
$ rm is_this_file_gone.txt
$ cat is_this_file_gone.txt
cat: is_this_file_gone.txt: No such file or directory
Now let’s restore the VM from my tertiary copy. If that works, we know that the intermediate backup routines work too.
From my backup server, I will send the snapshot back to the production machine, then see if I can start the VM from the snapshot.
First we find the relevant snapshots:
$ zfs list -t snapshot
NAME                                                                               USED  AVAIL  REFER  MOUNTPOINT
backuppool/prodsrv1/ssdpool/vdisks/testsrv1@autosnap_2025-11-16_11:00:25_daily       0B      -    96K  -
backuppool/prodsrv1/ssdpool/vdisks/testsrv1@autosnap_2025-11-16_11:00:25_hourly      0B      -    96K  -
backuppool/prodsrv1/ssdpool/vdisks/testsrv1@autosnap_2025-11-16_12:00:25_hourly      0B      -  2.84G  -
Now let’s take a specific snapshot and send it back:
sudo syncoid --sshkey=/root/.ssh/syncoid \
-r \
--no-sync-snap \
--compress=lzo \
--include-snaps=autosnap_2025-11-16_12:00:25_hourly \
backuppool/prodsrv1/ssdpool/vdisks/testsrv1 \
syncoid@prodsrv1:ssdpool/restoretest
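Back on the production server, a quick listing should confirm that the dataset and its snapshot arrived (names as used in the command above):
sudo zfs list -t all -r ssdpool/restoretest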
To test the contents of the snapshot non-destructively, we’ll stop the VM on the main server, move its current disk image out of the way, and copy in the disk image we pulled from the backup.
cd /ssdpool/vdisks/testsrv1
sudo virsh shutdown testsrv1
# After waiting for the VM to stop:
sudo mv testsrv1.qcow2 testsrv1.qcow2.actual
sudo cp --sparse=always /ssdpool/restoretest/testsrv1.qcow2 ./
sudo virsh start testsrv1
Once the machine starts up, we can test our backup:
$ cat is_this_file_gone.txt
Test
Happy that everything works, we clean up the VM directory and restart the machine with its original disk image:
sudo virsh shutdown testsrv1
sudo rm testsrv1.qcow2
sudo mv testsrv1.qcow2.actual testsrv1.qcow2
sudo virsh start testsrv1
I always like to perform a dry-run before running destructive ZFS changes:
$ sudo zfs destroy -nvpr ssdpool/restoretest
destroy ssdpool/restoretest@autosnap_2025-11-16_12:00:25_hourly
destroy ssdpool/restoretest
This looks like exactly what we want to do. Drop the n from the flags to actually execute the command:
sudo zfs destroy -vpr ssdpool/restoretest
And that’s it, really. I’ve confirmed that snapshots work and that I can restore them.