At the European Customer Conference a couple of weeks back, one of the topics was the use of DRBD. DRBD is a kernel-based block device that replicates the data blocks of a device from one machine to another. The documentation I developed for that and MySQL is available here.

Fundamentally, with DRBD, you set up a physical device, configure DRBD on top of that, and write to the DRBD device. In the background, on the primary, the DRBD device writes the data to the physical disk and replicates those changed blocks to the secondary, which in turn writes the data to its physical device. The result is a block-level copy of the source data. In an HA solution, this means that you can switch over from your primary host to your secondary host in the event of system failure and be pretty certain that the data on the primary and secondary are the same. In short, DRBD simplifies one of the more complex aspects of the typical HA solution by copying the data needed during the switch.

Because DRBD is a Linux kernel module you can’t use it on other platforms, like Mac OS X or Solaris. But there is another solution: ZFS. ZFS supports filesystem snapshots. You can create a snapshot at any time, and you can create as many snapshots as you like.

Let’s take a look at a typical example. Below I have a simple OpenSolaris system running with two pools, the root pool and another pool I’ve mounted at /opt:
Filesystem                       size  used  avail  capacity  Mounted on
rpool/ROOT/opensolaris-1         7.3G  3.6G  508M   88%       /
/devices                         0K    0K    0K     0%        /devices
/dev                             0K    0K    0K     0%        /dev
ctfs                             0K    0K    0K     0%        /system/contract
proc                             0K    0K    0K     0%        /proc
mnttab                           0K    0K    0K     0%        /etc/mnttab
swap                             465M  312K  465M   1%        /etc/svc/volatile
objfs                            0K    0K    0K     0%        /system/object
sharefs                          0K    0K    0K     0%        /etc/dfs/sharetab
/usr/lib/libc/libc_hwcap1.so.1   4.1G  3.6G  508M   88%       /lib/libc.so.1
fd                               0K    0K    0K     0%        /dev/fd
swap                             466M  744K  465M   1%        /tmp
swap                             465M  40K   465M   1%        /var/run
rpool/export                     7.3G  19K   508M   1%        /export
rpool/export/home                7.3G  1.5G  508M   75%       /export/home
rpool                            7.3G  60K   508M   1%        /rpool
rpool/ROOT                       7.3G  18K   508M   1%        /rpool/ROOT
opt                              7.8G  1.0G  6.8G   14%       /opt
I’ll store my data in a directory on /opt. To help demonstrate some of the basic replication stuff, I have other things stored in /opt as well:
total 17
drwxr-xr-x  31 root bin   50 Jul 21 07:32 DTT/
drwxr-xr-x   4 root bin    5 Jul 21 07:32 SUNWmlib/
drwxr-xr-x  14 root sys   16 Nov  5 09:56 SUNWspro/
drwxrwxrwx  19 1000 1000  40 Nov  6 19:16 emacs-22.1/
lrwxrwxrwx   1 root root  48 Nov  5 09:56 uninstall_Sun_Studio_12.class -> SUNWspro/installer/uninstall_Sun_Studio_12.class
To create a snapshot of the filesystem, you use zfs snapshot, and then specify the pool and the snapshot name:
# zfs snapshot opt@snap1
To get a list of snapshots you’ve already taken:
# zfs list -t snapshot
NAME                                        USED   AVAIL  REFER  MOUNTPOINT
opt@snap1                                   0      -      1.03G  -
rpool@install                               19.5K  -      55K    -
rpool/ROOT@install                          15K    -      18K    -
rpool/ROOT/opensolaris-1@install            59.8M  -      2.22G  -
rpool/ROOT/opensolaris-1@opensolaris-1      100M   -      2.29G  -
rpool/ROOT/opensolaris-1/opt@install        0      -      3.61M  -
rpool/ROOT/opensolaris-1/opt@opensolaris-1  0      -      3.61M  -
rpool/export@install                        15K    -      19K    -
rpool/export/home@install                   20K    -      21K    -
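If you want to use the snapshot list from a script, it can help to get just the names, without the header. The following is a minimal sketch, not a tested tool: `latest_snapshot` is a hypothetical helper name, and it assumes your release of `zfs list` supports the `-H` (no header), `-o name` and `-s creation` (sort by creation time, oldest first) options.

```shell
# latest_snapshot FS: print the most recently created snapshot of FS.
# Hypothetical helper; assumes `zfs list -s creation` sorts oldest-first.
latest_snapshot() {
    zfs list -H -t snapshot -o name -s creation -r "$1" |
        # Keep only snapshots of FS itself (names starting with "FS@"),
        # not those of child filesystems, and remember the last one seen.
        awk -v p="$1@" 'index($0, p) == 1 { last = $0 }
                        END { if (last != "") print last }'
}
```

Calling `latest_snapshot opt` should print the newest snapshot of the opt pool, or nothing at all if there are no snapshots yet.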
The snapshots themselves are stored within the filesystem metadata, and the space required to keep them will vary as time goes on because of the way the snapshots are created. The initial creation of a snapshot is really quick, because instead of taking an entire copy of the data and metadata required to hold the entire snapshot, ZFS merely records the point in time and metadata of when the snapshot was created.

As you make more changes to the original filesystem, the size of the snapshot increases because more space is required to keep the record of the old blocks. Furthermore, if you create lots of snapshots, say one per day, and then delete the snapshots from earlier in the week, the size of the newer snapshots may also increase, as the changes that make up the newer state have to be included in the more recent snapshots, rather than being spread over the seven snapshots that make up the week.

The result is that creating snapshots is generally very fast, and storing snapshots is very efficient. As an example, creating a snapshot of a 40GB filesystem takes less than 20ms on my machine.

The only issue, from a backup perspective, is that snapshots exist within the confines of the original filesystem. To get the snapshot out into a format that you can copy to another filesystem, tape, etc., you use the zfs send command to create a stream version of the snapshot. For example, to write out the snapshot to a file:
# zfs send opt@snap1 >/backup/opt-snap1
Or tape, if you are still using it:
# zfs send opt@snap1 >/dev/rmt/0
You can also write out the incremental changes between two snapshots using zfs send with the -i option:
# zfs send -i opt@snap1 opt@snap2 >/backup/opt-changes
To recover a snapshot, you use zfs recv, which applies the snapshot information either to a new filesystem, or to an existing one. I’ll skip the demo of this for the moment, because it will make more sense in the context of what we’ll do next.

Both zfs send and zfs recv work on streams of the snapshot information, in the same way as cat or sed do. We’ve already seen some examples of that when we used standard redirection to write the information out to a file. Because they are stream based, you can use them to replicate information from one system to another by combining zfs send, ssh, and zfs recv.

For example, let’s say I’ve created a snapshot of my opt filesystem and want to copy that data to a new system into a pool called slavepool:
# zfs send opt@snap1 |ssh mc@slave pfexec zfs recv -F slavepool
The first part, zfs send opt@snap1, streams the snapshot; the second, ssh mc@slave, opens a connection to the slave; and the third, pfexec zfs recv -F slavepool, receives the streamed snapshot data and writes it to slavepool. In this instance, I’ve specified the -F option, which forces the snapshot data to be applied and is therefore destructive. This is fine, as I’m creating the first version of my replicated filesystem.

On the slave machine, if I look at the replicated filesystem:
# ls -al /slavepool/
total 23
drwxr-xr-x   6 root root   7 Nov  8 09:13 ./
drwxr-xr-x  29 root root  34 Nov  9 07:06 ../
drwxr-xr-x  31 root bin   50 Jul 21 07:32 DTT/
drwxr-xr-x   4 root bin    5 Jul 21 07:32 SUNWmlib/
drwxr-xr-x  14 root sys   16 Nov  5 09:56 SUNWspro/
drwxrwxrwx  19 1000 1000  40 Nov  6 19:16 emacs-22.1/
lrwxrwxrwx   1 root root  48 Nov  5 09:56 uninstall_Sun_Studio_12.class -> SUNWspro/installer/uninstall_Sun_Studio_12.class
Wow, that looks familiar!

Once you’ve sent the first snapshot, to synchronize the filesystem again I just need to create a new snapshot, and then use the incremental snapshot feature of zfs send to send the changes over to the slave machine again:
# zfs send -i opt@snap1 opt@snap2 |ssh mc@192.168.0.93 pfexec zfs recv slavepool
Actually, this operation will fail. The reason is that the filesystem on the slave machine can currently be modified, and you can’t apply the incremental changes to a destination filesystem that has changed. What’s changed? The metadata about the filesystem, like the last time it was accessed. In this case, it will have been our ls that caused the problem. To fix that, set the filesystem on the slave to be read-only:
# zfs set readonly=on slavepool
Setting readonly means that we can’t change the filesystem on the slave by normal means; that is, I can’t change the files or metadata (modification times and so on). It also means that operations that would normally update metadata (like our ls) will silently perform their function without attempting to update the filesystem state. In essence, our slave filesystem is nothing but a static copy of our original filesystem. However, even when set to readonly, a filesystem can have snapshots applied to it. Now that it’s read-only, re-run the initial copy:
# zfs send opt@snap1 |ssh mc@slave pfexec zfs recv -F slavepool
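Since an incremental receive depends on the slave filesystem not having changed, a replication script might check the readonly property before each transfer. This is only a sketch under some assumptions: `ensure_readonly` is a hypothetical helper name, it relies on `zfs get -H -o value` printing just the property value, and it assumes the remote user can run `zfs get` directly and `zfs set` through pfexec, as in the commands above.

```shell
# ensure_readonly HOST FS: make sure FS on HOST is read-only.
# Hypothetical helper; HOST is an ssh destination such as mc@slave.
ensure_readonly() {
    # Ask the slave for the current value of the property ("on" or "off").
    state=$(ssh "$1" zfs get -H -o value readonly "$2") || return 1
    if [ "$state" != "on" ]; then
        ssh "$1" pfexec zfs set readonly=on "$2"
    fi
}
```

For the setup in this post the call would be `ensure_readonly mc@slave slavepool`.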
Now we can make changes to the original and replicate them over. Since we’re dealing with MySQL, let’s initialize a database on the original pool. I’ve updated the configuration file to use /opt/mysql-data as the data directory, and now I can initialize the tables:
# mysql_install_db --defaults-file=/etc/mysql/5.0/my.cnf --user=mysql
Now we can synchronize the information to our slave machine and filesystem by creating another snapshot and then doing an incremental zfs send:
# zfs snapshot opt@snap2
Just to demonstrate the efficiency of the snapshots, the size of the data created during initialization is 39K:
# du -sh /opt/mysql-data/
39K /opt/mysql-data
If I check the size used by the snapshots:
# zfs list -t snapshot
NAME       USED  AVAIL  REFER  MOUNTPOINT
opt@snap1  47K   -      1.03G  -
opt@snap2  0     -      1.05G  -
The size of the snapshot is 47K. Note, by the way, that the 47K is charged to snap1, which now has to hold the blocks that have changed since it was taken, while snap2 should currently be more or less equal to our current filesystem state.

Now, let’s synchronize this over:
# zfs send -i opt@snap1 opt@snap2|ssh mc@192.168.0.93 pfexec zfs recv slavepool
Note we don’t have to force the operation this time – we’re synchronizing the incremental changes from what are identical filesystems, just on different systems. And double check that the slave has it:
# ls -al /slavepool/mysql-data/
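The snapshot-then-incremental-send cycle used above can be wrapped up in a small shell function. This is a rough sketch rather than a finished tool: `replicate` is a hypothetical name, the caller has to keep track of the previous and new snapshot names, and it assumes pfexec is set up for the remote user as in the earlier commands.

```shell
# replicate FS PREV NEW HOST DEST
# Snapshot FS as FS@NEW, then send the changes made since FS@PREV
# to the filesystem DEST on HOST. Hypothetical helper.
replicate() {
    fs=$1 prev=$2 new=$3 host=$4 dest=$5
    # Take the new snapshot first; give up if that fails.
    zfs snapshot "$fs@$new" || return 1
    # Stream only the difference between the two snapshots to the slave.
    # The exit status is that of the ssh/zfs recv side of the pipe.
    zfs send -i "$fs@$prev" "$fs@$new" |
        ssh "$host" pfexec zfs recv "$dest"
}
```

With the names used in this post, the next synchronization would be `replicate opt snap2 snap3 mc@192.168.0.93 slavepool`.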
Now we can start up MySQL, create some data, and then synchronize the information over again, replicating the changes. To do that, you have to create a new snapshot, then do the send/recv to the slave to synchronize the changes. The rate at which you do it is entirely up to you, but keep in mind that if you have a lot of changes, then doing it as frequently as once a minute may lead to your slave falling behind because of the time taken to transfer the filesystem changes over the network. Taking the snapshot itself, even with MySQL running in the background, still takes comparatively little time. To demonstrate that, here’s the time taken to create a snapshot mid-way through a 4 million row insert into an InnoDB table:
# time zfs snapshot opt@snap3
real 0m0.142s
user 0m0.006s
sys 0m0.027s
I told you it was quick :)

However, the send/recv operation took a few minutes to complete, with about 212MB of data transferred over a very slow network connection, while the machine was busy writing those additional records.

Ideally you want to set up a simple script that will handle that sort of snapshot/replication for you and run it from cron to do the work for you. You might also want to try ready-made tools like Tim Foster’s zfs replication tool, which you can find out about here. Tim’s system works through SMF to handle the replication and is very configurable. It even handles automatic deletion of old, synchronized snapshots.

Of course, all of this is useless unless, once replicated from one machine to another, we can actually use the databases. Let’s assume that there was a failure and we needed to fail over to the slave machine. To do so:
- Stop the script on the master, if it’s still up and running.
- Set the slave filesystem to be read/write:
# zfs set readonly=off slavepool
- Start up mysqld on the slave. If you are using InnoDB, Falcon or Maria you should get auto-recovery, if it’s needed, to make sure the table data is correct, as shown here when I started up from our mid-INSERT snapshot:
InnoDB: The log sequence number in ibdata files does not match
InnoDB: the log sequence number in the ib_logfiles!
081109 15:59:59 InnoDB: Database was not shut down normally!
InnoDB: Starting crash recovery.
InnoDB: Reading tablespace information from the .ibd files...
InnoDB: Restoring possible half-written data pages from the doublewrite
InnoDB: buffer...
081109 16:00:03 InnoDB: Started; log sequence number 0 1142807951
081109 16:00:03 [Note] /slavepool/mysql-5.0.67-solaris10-i386/bin/mysqld: ready for connections.
Version: '5.0.67' socket: '/tmp/mysql.sock' port: 3306 MySQL Community Server (GPL)
Yay, we’re back up and running. On MyISAM or other tables, you need to run REPAIR TABLE, and you might even have lost some information, but it should be minor.

The point is, a mid-INSERT ZFS snapshot, combined with replication, could be a good way of supporting a hot backup of your system on Mac OS X or Solaris/OpenSolaris. Probably the most critical part is finding the sweet spot between the snapshot replication time and how up to date you want to be in a failure situation. It’s also worth pointing out that you can replicate to as many different hosts as you like, so if you wanted to replicate your ZFS data to two or three hosts, you could.
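For completeness, the failover steps described earlier could be collected into a small function to run as root on the slave. Treat this as a sketch under assumptions: `failover` is a hypothetical name, the replication job on the master is assumed to have been stopped already, and because the command that starts MySQL depends on your installation, it is passed in by the caller rather than hard-coded.

```shell
# failover FS CMD [ARGS...]
# Make FS writable again, then run CMD (whatever starts mysqld on
# this machine). Hypothetical helper; run as root on the slave.
failover() {
    fs=$1
    shift
    # Step 2 of the list: allow writes on the replicated filesystem.
    zfs set readonly=off "$fs" || return 1
    # Step 3: start the server using the caller-supplied command.
    "$@"
}
```

You would call it as `failover slavepool` followed by your usual MySQL start command and its arguments.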