I had a server with 5 SCSI disks from which I needed to get the maximum filesystem space, along with cheap replication to Auckland, while still having disk redundancy. The server had an onboard PERC 3 RAID controller which I could have used to create a 5-disk RAID container, but then I would have had to use rsync for the replication, and tests showed it wasn't going to scale well to 400G of smallish files (it chokes).

Along comes ZFS. I can create a snapshot of the filesystem and then get a binary stream of the differences between the new snapshot and the previous one. This is very efficient: the data I have to send over the internet is roughly equivalent to the amount of data that has changed on disk. It's also very fast. The files do not need to be checked for timestamp/checksum changes since that is tracked at the zpool level; zfs just pumps out the changes as fast as I can send them.

I decided to use RAID1 for the boot+root filesystem; it's a mirror of 5 disks, which makes writing slow, but that doesn't matter as it's essentially read-only. I also used RAID3 for swap, and raidz (an improved RAID5) for ZFS.

Installing FreeBSD

Boot from the FreeBSD 7.0 CD and start the installation. The only real difference from a regular installation is that we create only a 512M / filesystem, with no /var or /usr. Because of this we need to choose the minimal install so that it will fit; the ports and manpages can be installed later.

To create the d partition (which later holds the ZFS pool) you’ll need to enter any mount point you want and then use the M option to clear it. This ensures that it will not be mounted or created as a file system.

Go through the post-install configuration and then exit and reboot.

Adding the other disks

Now create a disk pool using the d partition we prepared during the install.

Create the partitions on the remaining disks by copying the layout of da0.

# fdisk -BI da1
# fdisk -BI da2
# fdisk -BI da3
# fdisk -BI da4
# bsdlabel da0s1 > /tmp/label
# bsdlabel -RB da1s1 /tmp/label
# bsdlabel -RB da2s1 /tmp/label
# bsdlabel -RB da3s1 /tmp/label
# bsdlabel -RB da4s1 /tmp/label
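
The commands above repeat the same two steps for each disk, so they can be folded into a loop. What follows is only a minimal sketch and not part of the original procedure: the `run` helper, the `clone_layout` function and the `DRY_RUN` variable are names made up for illustration. By default it just prints the commands; it only executes them when run with DRY_RUN=0.

```sh
#!/bin/sh
# Sketch: copy da0's slice and label layout onto the other disks.
# DRY_RUN, run() and clone_layout() are illustrative names, not FreeBSD tools.
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1 = file holding the saved label, remaining arguments = target disks
clone_layout() {
    label=$1
    shift
    for disk in "$@"; do
        run fdisk -BI "$disk"
        run bsdlabel -RB "${disk}s1" "$label"
    done
}

# Save the source label first (skipped in dry-run mode).
[ "$DRY_RUN" = "1" ] || bsdlabel da0s1 > /tmp/label
clone_layout /tmp/label da1 da2 da3 da4
```

Check the printed commands against the disk names on your own machine before running it for real, since fdisk -BI rewrites the slice table of every disk it is given.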

Mirror the boot filesystem (excluding da0s1a which is currently our root filesystem)

# gmirror label boot da{1,2,3,4}s1a
# kldload geom_mirror
# newfs /dev/mirror/boot
# mount /dev/mirror/boot /mnt

Create the raid3 swap

# swapoff /dev/da0s1b
# graid3 label swap da{0,1,2,3,4}s1b

Create the ZFS pool

# zpool create tank raidz da{0,1,2,3,4}s1d

Create some extra/common mountpoints

# zfs create tank/usr
# zfs create tank/var
# zfs create tank/tmp

Edit /boot/loader.conf to contain the following

zfs_load="YES"
geom_mirror_load="YES"
geom_raid3_load="YES"
vm.kmem_size=629145600
vm.kmem_size_max=629145600 
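
The two vm.kmem_size values are 600 MB written out in bytes. If you want to check the figure, or work out a different size for a machine with more or less RAM, shell arithmetic will do it. The variable names here are only for illustration, and 600 is simply the value used above, not a recommendation.

```sh
# 600 MB expressed in bytes, matching the loader.conf values above
kmem_mb=600
kmem_bytes=$((kmem_mb * 1024 * 1024))
echo "vm.kmem_size=$kmem_bytes"
echo "vm.kmem_size_max=$kmem_bytes"
```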

Edit /etc/rc.conf and enable ZFS

# echo 'zfs_enable="YES"' >> /etc/rc.conf

Edit /etc/fstab to contain the following

/dev/mirror/boot        /               ufs     rw              1       1
/dev/raid3/swap         none            swap    sw              0       0
tank/var                /var            zfs     rw              0       0
tank/usr                /usr            zfs     rw              0       0
tank/tmp                /tmp            zfs     rw              0       0
/dev/acd0               /cdrom          cd9660  ro,noauto       0       0

Duplicate the root filesystem to the new boot mirror

# find -x / | cpio -pmd /mnt
# rm -rf /mnt/var/*
# rm -rf /mnt/usr/*
# rm -rf /mnt/tmp/*
# find -x /var /usr | cpio -pmd /tank

Set the mount points and reboot into the new raided system

# zfs set mountpoint=legacy tank/usr
# zfs set mountpoint=legacy tank/var
# zfs set mountpoint=legacy tank/tmp
# zfs set mountpoint=none tank
# shutdown -r now

That's it, now we have a mirrored root with /var and /usr on our shiny new ZFS pool.

Post config

Activate the remaining boot mirror partition.

# gmirror insert boot da0s1a

Create filesystem quotas so that the system directories cannot chew up too much space; reserving space is also an option.

# zfs set quota=512m tank/tmp
# zfs set quota=4G tank/usr
# zfs set quota=1G tank/var

Set root password, create user accounts, etc.

The replicated filesystem

Create the media filesystem

# zfs create -o mountpoint=/media tank/media

The /media filesystem can use the entire ZFS pool, which leaves about 500G to play with.

# zfs list
NAME         USED  AVAIL  REFER  MOUNTPOINT
tank        2.26G   523G  28.8K  none
tank/media  28.8K   523G  28.8K  /media
tank/tmp    41.5K   512M  41.5K  legacy
tank/usr    2.22G  1.78G  2.22G  legacy
tank/var    34.1M   990M  34.1M  legacy

The filesystem needs an initial sync so that it is created on the remote server; just take a snapshot of the current state and send it over the wire.

# zfs snapshot tank/media@base
# zfs send tank/media@base | ssh slave 'zfs receive slave/media'

The media filesystem is now on the remote slave box; zfs list will show it.

slave# zfs list
NAME                     USED  AVAIL  REFER  MOUNTPOINT
slave                    661M  2.28G   173M  none
slave/media               18K  2.28G    18K  none
slave/media@base            0      -    18K  -

Create a file and sync it to the remote end.

# dd if=/dev/random of=/media/dummy bs=1m count=100
100+0 records in
100+0 records out
104857600 bytes transferred in 2.963048 secs (35388425 bytes/sec)

# zfs snapshot tank/media@20071202
# zfs send -i tank/media@base tank/media@20071202 | ssh slave 'zfs receive -F slave/media'

Now check the slave

slave# zfs list
NAME                     USED  AVAIL  REFER  MOUNTPOINT
slave                     761M  2.18G   173M  none
slave/media               100M  2.18G   100M  /media
slave/media@base           16K      -    18K  -
slave/media@20071202         0      -   100M  -

slave# ls -lh /media
total 102479
-rw-r--r--  1 root  wheel   100M Dec  3 09:33 dummy

Done. From now on we just create a snapshot and send the differences from the previous one. I need to roll it into a nice script and run it from cron. It doesn't matter how often things are synced; it can be as often as every few seconds or as rarely as once a day. Only the changes since the last snapshot are sent.
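
A rough sketch of what such a script might look like is below. It is not the script actually used here, and it makes several assumptions. The dataset and host names are taken from the examples above. The newest local snapshot is assumed to have already been received by the slave. zfs list is assumed to accept -s creation for sorting on your ZFS version. The function names and the "run" argument are invented for illustration. With no argument it only prints an example pipeline and does not touch the pool.

```sh
#!/bin/sh
# Sketch of an incremental sync suitable for cron. Names and defaults are
# assumptions based on the examples above, not a tested production script.
FS=${FS:-tank/media}
REMOTE_HOST=${REMOTE_HOST:-slave}
REMOTE_FS=${REMOTE_FS:-slave/media}

# Print the newest existing snapshot of $FS (empty output if there is none).
latest_snapshot() {
    zfs list -H -r -t snapshot -o name -s creation "$FS" | grep "^$FS@" | tail -1
}

# Print the send/receive pipeline for two snapshots without running it.
incremental_cmd() {
    echo "zfs send -i $1 $2 | ssh $REMOTE_HOST 'zfs receive -F $REMOTE_FS'"
}

# Take a new timestamped snapshot and send the changes since the last one.
sync_now() {
    prev=$(latest_snapshot)
    if [ -z "$prev" ]; then
        echo "no snapshot of $FS found; do the initial full send first" >&2
        return 1
    fi
    new="$FS@$(date +%Y%m%d%H%M%S)"
    zfs snapshot "$new" || return 1
    zfs send -i "$prev" "$new" | ssh "$REMOTE_HOST" "zfs receive -F $REMOTE_FS"
}

# Only touch the pool when asked to; otherwise just show an example pipeline.
if [ "${1:-}" = "run" ]; then
    sync_now
else
    incremental_cmd "$FS@base" "$FS@20071202"
fi
```

Old snapshots are never destroyed by this sketch, so they would need to be pruned on both ends from time to time. If a send fails part way, the next run would use a snapshot the slave does not have, so a real script should confirm the receive succeeded before relying on the new snapshot.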

Disaster recovery

To get to single user mode do the following:

Once in, the ZFS filesystems need to be mounted

# /etc/rc.d/hostid start
# /etc/rc.d/zfs start
# mount -a

If the system is really hosed then you may need to use the statically linked binaries in /rescue

Check some stuff out...

# df -h
# zpool list

AndrewThompson/ZfsForReplication (last edited 2007-12-02 23:49:37 by AndrewThompson)