Simple automated backup system for Linux

I hate re-doing things. One of my biggest fears is that the masterpiece of code I just churned out will be lost tomorrow and I'll have to do it again. It's perfect the way it is! I could never achieve that the same way again! Ok, that was a bit dramatic, but seriously, data has some meaning to all of us, so if it's worth having, it's worth being protected from accidental loss. In this post, I'm going to describe a simple system for backing up just the data you care about on your Linux system.

What this system is, and what it is not

Backing up data can be done in a multitude of ways depending on what you need and what level of protection that you want. Backup can be as simple as copying data to another drive periodically, or as complex as a redundant RAID system augmented with physically off-site backups. I chose the system presented in this post because it has the following characteristics:

  • Simple - I like simple because it means I'll actually maintain it!
  • Instant access - Data backed up with this system is immediately available should I need it. I can access it directly without performing any restore or extraction steps.
  • Configurable - I don't back up everything, I like to pick and choose.
  • Automatable - This system can be automated (in case I forget!) using tools already built into Linux.

Before you implement this system, let me explain what it does not provide. If you need the following characteristics, you may want to look into other solutions:

  • Versioning - This system is a direct backup. It is not versioned in any manner. If you need a file from Monday that you overwrote with bad data on Tuesday and your backup ran since then, you are out of luck.
  • System Imaging - This system is not one that takes an image of your system and lets you "rollback" to a previous image.

If this system still meets your needs, read on...

How does this system work?

In this system, a host machine will define what data needs to be backed up. Then it will periodically send data that has changed to the backup server. To implement this, you will need the following:

Backup server

This post will simply call this the "server". It is where your data will be backed up to. It can be any Linux machine, but in this post we will be using a Raspberry Pi with an external USB hard drive for storage. For this solution, the server needs to be always on. Using a Pi is a good choice because of its low energy usage.

Machine in need of backup

There can be any number of machines you choose to back up. This post will describe one that I will call the "host" machine. Again, you can set up as many "hosts" as you wish.


This system uses a Linux tool called rsync to perform backups. rsync is a file-copying tool with two abilities that matter here: it can copy only the differences between source and destination, and it can copy over a network connection. In this system, rsync will connect to the server using another standard tool, SSH. Fortunately, both tools are readily available on Linux.

Server OS installation

We are going to use a Raspberry Pi with an external USB hard drive as our backup system. If you already have a server machine available, you can skip to host setup. Otherwise, we will install Raspbian Lite on an SD card for our Pi, then give it a static IP on our network (written as <server-ip> throughout this post; substitute your server's actual address). Instructions on how to do this can be found in my article Setting up Raspbian with a static IP. Do that and then proceed to the next section...

Adding an external hard drive to the server

Next, we are going to add an external hard drive to the server. This isn't strictly necessary if the Pi's SD card has enough space for what you want to back up. But I'll assume you want to back up your entire Dr. Who video collection, so you'll need an external drive.

The first step is to create an ext4 partition on the drive. For this example, I intend to use the whole drive as a single partition. Please note that this will destroy everything on the drive! Go ahead and plug the drive into one of the Pi's USB ports. You can tell if the Pi detected the drive by typing dmesg and examining the output. It should be something similar to:


usb 1-1.2: new high-speed USB device number 6 using dwc_otg
usb 1-1.2: New USB device found, idVendor=174c, idProduct=55aa
usb 1-1.2: New USB device strings: Mfr=2, Product=3, SerialNumber=1
usb 1-1.2: Product: USB3-SATA-UASP1
usb 1-1.2: Manufacturer: plugable
usb 1-1.2: SerialNumber: 12345678903A
usb-storage 1-1.2:1.0: USB Mass Storage device detected
usb-storage 1-1.2:1.0: Quirks match for vid 174c pid 55aa: 400000
scsi host0: usb-storage 1-1.2:1.0
scsi 0:0:0:0: Direct-Access     HGST HTS 545050A7E380     0    PQ: 0 ANSI: 6
sd 0:0:0:0: [sda] 976773168 512-byte logical blocks: (500 GB/466 GiB)
sd 0:0:0:0: [sda] 4096-byte physical blocks
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 43 00 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 0:0:0:0: Attached scsi generic sg0 type 0
sda: sda1 sda2 sda3 sda4 sda5
sd 0:0:0:0: [sda] Attached SCSI disk

There is a lot of information here, but the important pieces are the [sda] lines. These mean the Pi is calling the new drive sda, so that's what we are going to format. Also notice that this drive already has 5 partitions on it, but we are going to wipe those clean. If you proceed with the next step, all drive data will be lost! So please, make sure you are ok with that!
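
If the dmesg output is hard to pick through, another way (not mentioned above, but handy) to see which device name the drive received is lsblk, which lists block devices in a tree:

```shell
# List block devices with their sizes and mount points.
# The external drive should appear as sda with any existing
# partitions (sda1, sda2, ...) nested beneath it.
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT
```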

Let's make an ext4 partition out of the whole disk and format it:

sudo parted -a optimal -s /dev/sda mklabel gpt mkpart primary ext4 0% 100%
sudo mkfs.ext4 -F /dev/sda1

The first command above uses parted to create a new GPT partition table and a single partition from 0% to 100%, i.e., the entire drive. The second command makes an ext4 file system on the newly created partition. The drive is now ready to go. If you want to check to be sure:

sudo parted /dev/sda print

Model: HGST HTS 545050A7E380 (scsi)
Disk /dev/sda: 500GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Disk Flags: 

Number  Start   End    Size   File system  Name     Flags
 1      1049kB  500GB  500GB  ext4         primary

Now that the drive is formatted the way we want, we need to make sure it gets mounted on boot. We also need a path to "mount" the drive at (i.e., the path in the file system where the drive's data will appear). For the latter, let's use /opt/backup. Go ahead and make that directory:

sudo mkdir /opt/backup

Now we need to tell Linux to always mount our drive on startup at the path /opt/backup. We do this by putting an entry in /etc/fstab. However, we need a consistent way to identify the drive. The drive showed up for us as /dev/sda1 and will continue to do so as long as we have one drive; if we were to add another, it might not. So to be sure we always have the correct drive mounted at that location, we will get its UUID with blkid and use that in /etc/fstab. To do that:

sudo blkid

/dev/sda1: UUID="90cf3896-b327-40a1-aaf9-b524c80c6a3a" TYPE="ext4" PARTLABEL="primary" PARTUUID="fce70f97-7356-4c80-83b3-9b3a2eba4472"

You will get a UUID different from the one above. Using your UUID, add an entry like this to /etc/fstab:

UUID=90cf3896-b327-40a1-aaf9-b524c80c6a3a /opt/backup ext4	defaults,noatime,nodiratime	0	2
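
A typo in /etc/fstab can be painful to discover at boot time, so before rebooting it's worth asking mount to process the file while you can still easily fix it. A small check (commands require root; findmnt is part of util-linux, already on Raspbian):

```shell
# Mount everything listed in /etc/fstab that isn't already mounted.
# Any error here means the new entry needs fixing before you reboot.
sudo mount -a

# Confirm the partition landed where we expect
findmnt /opt/backup
```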

That should do it for drive setup. Let's test by rebooting and seeing that it is mounted correctly:

sudo reboot

After reboot, you should be able to see that the drive is mounted:

mount | grep sda1

/dev/sda1 on /opt/backup type ext4 (rw,noatime,nodiratime,data=ordered)

Later on, we are going to want to access this drive as the default pi user (instead of root), so let's change the owner of the backup directory to the pi user:

sudo chown -R pi:pi /opt/backup

Host setup

As mentioned earlier, we are using rsync via SSH to access the server. We want to be able to automate the backup, so we'll be using a password-less SSH key (which just means we can make an SSH connection to the server without having to enter a password). Obviously we don't want our normal SSH key to be password-less, so let's make a key specifically for backups. Later, we'll see how to increase the security of this key by only allowing access to rsync and to a specific directory.

On your host machine (e.g. the one needing to be backed up), if you do not have a .ssh directory in your home directory, make it and protect it appropriately:

mkdir ~/.ssh
chmod 700 ~/.ssh

Now let's create our key:

ssh-keygen -C "host backup key" -N "" -f ~/.ssh/backup_key

This will create two new files in ~/.ssh/: backup_key and backup_key.pub. The first is the private key (keep it safe!) and the second is the public key.

We'll see how to use the public key to set up the server in a later step, but since we're already on the host machine, let's set up our scripts to do the backup.

You can back up whatever you wish. For the purposes of this post, I'm going to show you how to back up your home directory. We will put all of our backup related files in a directory for organizational purposes:

mkdir ~/backup
cd ~/backup

Now create ~/backup/backup.sh with the following content:

#!/bin/sh
# Script to perform an automated backup of my home directory to an
# rsync server
BACKUP_SERVER=pi@<server-ip>:

/usr/bin/rsync -avz --delete-excluded --exclude-from /home/<user>/backup/exclude /home/<user>/ $BACKUP_SERVER

Go ahead and make it executable:

chmod a+x ~/backup/backup.sh

Before moving on, let's break this down so we understand what it's doing. First, we define our backup server as a variable. It doesn't matter much here, but later you might decide to back up some other directory (say, /etc), which would mean adding another rsync command, and it will be convenient to only have to specify the server once. The options -avz tell rsync to copy files in "archive mode" (i.e., recurse while preserving permissions, symlinks, times, etc.), be verbose, and compress during transfer, respectively. The --delete-excluded option tells rsync to delete files on the server that no longer exist on the host, as well as any server files that match your exclude file. So if you delete a file (or add it to the excludes), rsync will delete it from the server on the next backup. The --exclude-from option tells rsync to read our exclude file (which we will create next) and skip the files and directories listed there. The next option is the local directory to back up. (Be sure to replace <user> with your username!) Finally, the last option is the server address and path to back up to. You may notice the server address ends with an empty path (just the trailing :). This is because we are going to specify on the server which path this backup goes to, so we do not need to do that here.
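
One thing the script does not guard against is overlapping runs: if a backup ever takes longer than the gap before the next scheduled one, two rsync processes could step on each other. A minimal sketch of a guard using flock (part of util-linux, so already on Raspbian; the lock file path is my own choice, not part of the original script):

```shell
#!/bin/sh
# Sketch: skip this run if a previous backup is still going.
LOCKFILE=/tmp/backup.lock

(
  # -n: give up immediately if another process holds the lock
  flock -n 9 || { echo "Backup already running, skipping."; exit 1; }

  # ...the rsync command from the backup script would go here...
  echo "backup running"
) 9>"$LOCKFILE"
```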

Now let's create that exclude file. Create ~/backup/exclude with the following content (as a starter):

# Don't copy my ssh keys to external backup
.ssh
# No need to backup cache files
.cache
# No need to backup my backup logs
backup/backup.log*

The above file keeps your SSH keys from going out to your external storage (where they might be stolen). It also avoids backing up cache files and the backup logs we are going to create a little later. Really though, you'll modify this file as you watch your backup output and decide what's not really important to back up. Alternatively, you can just let it all be backed up; it's up to you. The wildcard pattern language for excludes is quite extensive, so I'm not going to cover it here. For more information, run man rsync and search for INCLUDE/EXCLUDE PATTERN RULES.

Before we can try this out, we need to do some server setup. The server is going to need our public key, so let's go ahead and send that over:

scp ~/.ssh/backup_key.pub pi@<server-ip>:

If this is your first connection, scp will ask you to verify the identity of the server. Just type yes and it will continue.

Server setup

Now let's set up the server to receive backups via rsync. We're going to need our own .ssh directory, so go ahead and create that:

mkdir ~/.ssh
chmod 700 ~/.ssh

Now, let's accept the public key for our host (if an authorized_keys file already exists, append the key to it instead of overwriting):

mv ~/backup_key.pub ~/.ssh/authorized_keys

We just gave a password-less key access to our server! Yikes! Let's quickly remedy that. Edit ~/.ssh/authorized_keys and insert the following before your key data:

command="/usr/local/bin/rrsync /opt/backup/<user>/",no-agent-forwarding,no-port-forwarding,no-pty,no-user-rc,no-X11-forwarding

Be sure to replace <user> with your username. The result will look something like:

command="/usr/local/bin/rrsync /opt/backup/<user>/",no-agent-forwarding,no-port-forwarding,no-pty,no-user-rc,no-X11-forwarding ssh-rsa AAAAB3NzaC1yc2EAAAA...

This addition will restrict the password-less key to running /usr/local/bin/rrsync with access only to /opt/backup/<user> and its descendants. We need to create that directory for rsync to use. Again, replace <user> with your username:

mkdir /opt/backup/<user>

As of now, rrsync doesn't yet exist on our server, so let's get that installed and set up:

sudo apt-get install rsync
sudo sh -c 'zcat /usr/share/doc/rsync/scripts/rrsync.gz > /usr/local/bin/rrsync'
sudo chmod a+x /usr/local/bin/rrsync

That arcane-looking second line extracts the rrsync tool from the rsync package to somewhere we can run it. For some reason, it's provided as an archive in the rsync package.
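
Back on the host, you can verify the restriction actually took effect: opening an interactive session with the backup key should now be refused, because the forced command only permits rrsync invocations (the exact error text varies by rrsync version, so none is shown here):

```shell
# This should NOT give you a shell; rrsync rejects anything that
# isn't an rsync server invocation.
ssh -i ~/.ssh/backup_key pi@<server-ip>
```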

That's it for the server. It's good to go. Let's switch back to the host machine and get this thing going...

Testing and automating the backup

Back on the host, let's try out our script:

~/backup/backup.sh
This is your first backup, so it may take a long time depending on how much data you are backing up. You should get a lot of output starting and ending with something like:

sending incremental file list


sent 59,801,940,060 bytes  received 14,443,146 bytes  6,052,145.82 bytes/sec
total size is 67,179,956,865  speedup is 1.12

Congratulations! Your data is safer than it was a few minutes ago! It is worth noting that, even though that first backup took a while, subsequent backups will only send changes, so they will likely be much quicker. That's the purpose of rsync.

But we're not done. Let's automate this. First run crontab -e. If you've never made a crontab entry before, crontab will have you choose an editor. Once you've done that, enter the following line and save the file:

5   2  *   *   *     /home/<user>/backup/backup.sh >> /home/<user>/backup/backup.log 2>&1

As always, replace <user> with your username. This line tells cron to run your backup script at 5 minutes after 2 AM every morning and append the output to backup.log in your backup directory. You can modify the starting time and/or the period if you want.

Alternatively, you do not have to automate at all. You can simply run ~/backup/backup.sh at any time to perform a backup on demand.

Now that you are backing up every day, that log file could potentially get very large. To solve this, we'll add an entry to the Linux log rotate utility. Create /etc/logrotate.d/mybackup with the following content:

/home/<user>/backup/backup.log {
  weekly
  rotate 7
  compress
  create 644 <user> <user>
}

And yes, the <user> thing applies here too. This will cause logrotate to compress and rotate your log every week. Older logs get deleted, so you don't have to worry about log output filling up your hard drive.
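
Rather than waiting a week to find out whether the rule has a typo, you can check it right away; logrotate's -d (debug) flag parses the configuration and reports what it would do without touching anything:

```shell
# -d = debug: read the config and describe the actions that would be
# taken, without actually rotating or compressing anything
sudo logrotate -d /etc/logrotate.d/mybackup
```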

Uh oh. I need access to that backup

Ok. What did you do? Ugh, you deleted your Dr. Who collection because you were mad that they changed doctors again? Sigh, it's ok. Here's how to get it back. From your host machine (now doctor-less), let's restore your video collection:

scp -o PreferredAuthentications=password -r pi@<server-ip>:/opt/backup/<user>/Videos/DrWho ~/Videos

That command recursively copies, over the network, the DrWho directory from the backup server into a directory named DrWho in your local Videos directory.

Notice the -o PreferredAuthentications=password in the command above. This is needed because, by default, SSH prefers keys over passwords for authenticating to the remote machine, as they generally provide stronger security. Since we set up a key for the backup, SSH will try to use it, but it will fail because we limited that key to rsync usage only. You can eliminate the need for the -o PreferredAuthentications=password option by setting up a regular SSH key to access the remote machine, as explained in Setting up passwordless SSH logins.

There are alternative ways to get to your data. One easy way is to use sshfs to "mount" your backup server to a local directory. This lets you use a visual file manager to copy stuff around. But for most cases I find it quicker to use a simple one-liner scp call to get me out of trouble. So I leave finding other ways as an exercise to the reader...
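
For the curious, here is a sketch of the sshfs approach just mentioned (it assumes sshfs is installed, e.g. via sudo apt-get install sshfs; the mount point name is my own choice):

```shell
# Mount the backup directory over SSH onto a local directory
mkdir -p ~/backup-view
sshfs -o PreferredAuthentications=password pi@<server-ip>:/opt/backup/<user> ~/backup-view

# Browse or copy files with any tool you like, then unmount:
fusermount -u ~/backup-view
```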


We've come a long way, so let's recap what we've done. We've set up a Raspberry Pi with an external drive as our backup server. It can live in some lonely corner of our house, reliably and quietly performing its duty of protecting our data from ourselves. We've set up a single host machine to back up our important data every morning while we sleep. And since we've learned how to set up a host machine, we can perform the same actions on any number of hosts to have them back up their data as well. We're only limited by our desire and backup server hard drive space!