Sunday, 12 January 2020

Rsync Backup System [5]: Ongoing Backup Journey

Since writing the first four articles on this topic, I still have not worked out how to implement an incremental backup strategy, so I am only doing a full backup every few months. Right now I am doing my first full backup of mainpc in three months.

The first issue that came up was zfs reporting that no pools were available on the server that mounts the backup disks and runs the backup. This is apparently because zpool is not designed with removable disks in mind, even though the disks had been unmounted before removal. After a lot of head scratching over how to get zfs to recognise the previously created pool on the removable disk I had inserted for the backup, I eventually stumbled across the zpool import command. Once that was issued the pool was reported as online and the zfs set mountpoint command could be issued:

zfs set mountpoint=/mnt/backup/fullbackup mpcbkup2

At that point I could go to the mount path and work with the contents of the disk.

One possible cause of this issue is that the disk is physically attached at a different device path from the one it had when the pool was created. A key factor is that zfs appears to be designed primarily to work with /dev/sdx device paths. We already know that for regular disk mounting we can use UUIDs to get around the problem of the /dev/sdx path changing at the whim of the operating system at boot time, which really does happen. For whatever reason, the backup disk in this instance was at /dev/sdh, whereas according to the earlier articles I wrote the pool may originally have been created on /dev/sdb. At any rate, the import command brings the zpool back to life. Perhaps what is needed is a command that suspends a zpool before the disk is taken offline, since it appears that more than unmounting the disk is required.

I looked into this further and, after considering the options, the best command to use is:

zpool export mpcbkup2

This command closes down the pool so that the disk can be removed from the system. The system will then report that no pools are available. The next time the pool is needed, it can be imported back into the system and then mounted as above.
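
Putting the two halves together, the attach and detach cycle for the removable disk could be wrapped up roughly as follows. This is only a sketch: the pool name and mount path are the ones used above, but the function names are made up for illustration, and the -d /dev/disk/by-id argument is my assumption about a reasonable way to make the import independent of whichever /dev/sdx letter the disk is given.

```shell
# Sketch only: attach/detach cycle for a removable ZFS backup disk.
# POOL and MOUNTPOINT come from the commands above; the function names are hypothetical.
POOL=mpcbkup2
MOUNTPOINT=/mnt/backup/fullbackup

attach_backup_pool() {
    # Search the persistent by-id links rather than relying on a /dev/sdX letter
    # (assumption: the disk appears under /dev/disk/by-id; plain "zpool import" also works).
    zpool import -d /dev/disk/by-id "$POOL" || return 1
    # Same mountpoint command as used above.
    zfs set mountpoint="$MOUNTPOINT" "$POOL"
}

detach_backup_pool() {
    # Export unmounts the pool's datasets and releases the pool so the disk can be pulled.
    zpool export "$POOL"
}
```

Both functions need to run as root. The idea is to call attach_backup_pool after inserting the disk and detach_backup_pool before removing it.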

I also need to buy another backup disk so that each major server has two full backup volumes. At the moment I don't have enough disks to ensure this. It will only cost about $100 to get another 2 TB disk.

The lingering question of course is how to detect which files were backed up on which date, and therefore which files need to be backed up incrementally. To make any progress with this, the very first step is to find out whether rsync can log the files that have been successfully backed up, then find some way of automatically scanning the log for the file names and storing them somewhere (for example, an SQLite database). Another option is to find some way of having rsync only back up files that have been modified after the date of the last full backup. So far I have spent almost no time thinking about these options, because the simplest solution by far would be some sort of command or script that handles this completely automatically.

Linux lacks the file modification flag that is implemented in Windows (the archive bit). People explain this away by saying the filesystem has superior capabilities and that a file modification bit is a very crude mechanism, but there is no getting around the fact that resetting that flag after each backup, and then scanning for files where it has been set again, is very easy to implement. In Linux there is no easy way of recording that information unless you store a date somewhere for each file and then scan against those dates. I explored the possibility of writing an extended attribute for each file. The question there is where this data is stored, as we do not want to modify the file itself. I hope to spend more time over the next month or so exploring these issues further.
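
As a rough illustration of the "modified after the last backup" idea, one approach that needs no per-file flag is to keep a single marker file whose modification time records when the last backup started, and let find compare against it. This is a sketch under assumptions: the function name and all the paths in the example are hypothetical, and it only catches files whose modification time has changed, not deletions or renames.

```shell
# Sketch only: list files changed since the last backup, using a marker file's mtime.
# list_changed_since MARKER DIR
#   MARKER  absolute path of a file whose mtime is the start time of the last backup
#   DIR     source directory to scan
# Prints paths relative to DIR, one per line (the form rsync --files-from expects).
list_changed_since() {
    marker=$1
    dir=$2
    # Run in a subshell so the caller's working directory is not changed.
    ( cd "$dir" && find . -type f -newer "$marker" -print )
}

# Example (all paths are hypothetical):
#   touch /var/tmp/backup.stamp.new        # record the start time first
#   list_changed_since /var/tmp/backup.stamp /home/patrick > /tmp/changed.txt
#   rsync -aX --files-from=/tmp/changed.txt /home/patrick/ /mnt/backup/incremental/
#   mv /var/tmp/backup.stamp.new /var/tmp/backup.stamp
```

Recording the new timestamp before the scan, rather than after the copy finishes, means a file modified while the backup is running gets picked up again next time instead of being missed.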

I am currently trialling having a log file produced and using this form of the rsync command:

rsync -arXvz --progress --delete backupuser@192.168.x.y:/home/patrick/ /mnt/backup/fullbackup/patrick --log-file=/home/patrick/rsync.log --log-file-format "mainpc|%f|%M|%l|%b|%o|%U"

The only issue so far is that the %a escape, which is supposed to log the remote IP address, is not being recognised (the rsync documentation describes it as available only for a daemon, which may be why). I have therefore customised the format to output the remote computer name as a literal instead. The rest of the information is logged as specified, and the pipe characters can be used as delimiters to separate the fields. So I have made some progress on this issue, and the question now is how to use the information to work out what is needed for future backups.
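
As a first step towards using the log, the pipe-delimited lines can be pulled apart with awk. The sketch below assumes that rsync prefixes each log line with a timestamp and process ID ahead of the custom format, and that files received in this kind of pull are logged with the operation "recv"; both should be checked against a real log before relying on it. The function name is made up, and file names that themselves contain a pipe character would break the field count.

```shell
# Sketch only: extract the files rsync reports as received from a log written with
#   --log-file-format "mainpc|%f|%M|%l|%b|%o|%U"
# extract_backed_up LOGFILE
# Prints: path <TAB> modification time <TAB> size in bytes
extract_backed_up() {
    awk -F'|' '
        # Lines from the custom format split into exactly 7 fields:
        # (prefix + host tag) | name | mtime | length | bytes sent | operation | uid
        # Other rsync log lines have no pipes and are skipped.
        NF == 7 && $6 == "recv" {
            printf "%s\t%s\t%s\n", $2, $3, $4
        }
    ' "$1"
}
```

The tab-separated output is meant as an intermediate form that could then be loaded into an SQLite table, or simply compared between runs.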

For the backup of mainpc I wiped the disk first, but for serverpc I chose to send the data to the existing backup, which runs a lot faster because rsync only transfers what has changed rather than every single piece of data. The --delete option then removes files from the backup disk that are no longer present in the source directory.

I still have to set up mediapc to back itself up. This will be implemented as a "backupuser" account that is logged on at a virtual console and runs the backup for mediapc locally, with read permission on the source files and folders.

One option for progressing the backups is simply to use the full backup disks for the incrementals as well, but this is possibly going to need bigger disks. The advantage is that rsync can handle this by itself, with the incrementals pushed into a separate directory. I would possibly want three backup disks for each computer, which brings back the need to buy more disks.
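
A minimal sketch of how this might look, assuming the mechanism is rsync's --backup and --backup-dir options, which move any file that would be overwritten or deleted into a separate directory instead of discarding it. The function name and the current/ and incremental-DATE directory layout are made up for illustration.

```shell
# Sketch only: keep an up-to-date copy in DEST_ROOT/current and push the previous
# versions of changed or deleted files into a dated directory alongside it.
# run_incremental SRC DEST_ROOT
run_incremental() {
    src=$1
    dest_root=$2
    stamp=$(date +%Y-%m-%d)
    # --backup with --backup-dir moves replaced and deleted files into the dated
    # directory rather than removing them outright.
    rsync -aX --delete \
        --backup --backup-dir="$dest_root/incremental-$stamp" \
        "$src" "$dest_root/current/"
}

# Example (hypothetical paths):
#   run_incremental backupuser@192.168.x.y:/home/patrick/ /mnt/backup/fullbackup/patrick
```

Note that with this scheme the dated directories hold the old versions of files, so the disk has to have room for the full copy plus everything that has changed since.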

At any rate this will take some time to devise. Another alternative is separate incremental disks, with a second removable disk caddy installed in this computer so that it can back up to a completely separate disk. This has the advantage of not requiring new disks, and it keeps the full and incremental backups separate.