I had a server with 5 SCSI disks from which I needed to get the maximum filesystem space, along with cheap replication to Auckland, while still having disk redundancy. The server had an onboard PERC 3 RAID controller which I could have used to create a 5-disk RAID container, but then I would have had to use rsync for the replication, and tests showed it wasn't going to scale well to 400G of smallish files (it chokes).
Along comes ZFS. I can create a snapshot of the filesystem and then get a binary stream of the differences between the new snapshot and the previous one. This is very efficient: the data I have to send over the internet is roughly equivalent to the amount of data that has changed on disk. It's also very fast. The files do not need to be checked for timestamp/checksum changes since that is done at the zpool level, so zfs just pumps out the changes as fast as I can send them.
I decided to use RAID1 for the boot+root filesystem. It's a mirror of 5 disks, which makes writing slow, but that doesn't matter as it's essentially read-only. I also used RAID3 for swap, and raidz, which is an improved RAID5, for ZFS.
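To put rough numbers on the three layouts: a mirror holds one component's worth of data, while RAID3 and single-parity raidz each hold roughly n-1 components' worth. The sketch below works that out for five disks. The 512M root and 2G swap sizes are the ones used in the install steps that follow. The size of the data partition is not stated in this article, so `131072` MB is only a placeholder, and the raidz figure ignores metadata overhead. The function name `usable_mb` is made up for illustration.

```shell
#!/bin/sh
# Rough usable capacity, in MB, for n equal components of size_mb each.
# mirror: one copy's worth; raid3 and raidz (single parity): n - 1 components.
usable_mb() {
    layout=$1
    n=$2
    size_mb=$3
    case $layout in
        mirror)      echo "$size_mb" ;;
        raid3|raidz) echo $(( (n - 1) * size_mb )) ;;
        *)           echo "unknown layout: $layout" >&2; return 1 ;;
    esac
}

usable_mb mirror 5 512      # root mirror
usable_mb raid3  5 2048     # swap
usable_mb raidz  5 131072   # data pool; 131072 MB is a placeholder size
```

This shows the trade-off: the mirror gives up four disks' worth of space on a tiny partition in exchange for being bootable from any disk, while swap and the data pool each give up only one.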
Installing FreeBSD
Boot up on the FreeBSD 7.0 CD and start the installation. The only real difference from a regular installation is that we only create a 512M / filesystem, with no separate /var or /usr. Because of this we need to choose the minimal install so that it will fit; the ports and manpages can be installed later.
- Choose the Standard install
- In fdisk, partition da0 with the Auto option (one slice, whole disk) and install the bootloader
- Disklabel da0 with options
    name  size          type  mountpoint
    A:    512M          UFS2  /
    B:    2G            swap
    D:    rest of disk  -     -
To create D you’ll need to enter any mount point you want and then use the M option to clear it. This ensures that it will not mount or be created as a file system.
- Distribution: choose Minimal install set, say No to installing ports.
Go through the post-install configuration and then exit and reboot.
Adding the other disks
Now create a disk pool using the D label we prepared during the install.
Create the partitions on the remaining disks. Their layout is a copy of da0's.
# fdisk -BI da1
# fdisk -BI da2
# fdisk -BI da3
# fdisk -BI da4
# bsdlabel da0s1 > /tmp/label
# bsdlabel -RB da1s1 /tmp/label
# bsdlabel -RB da2s1 /tmp/label
# bsdlabel -RB da3s1 /tmp/label
# bsdlabel -RB da4s1 /tmp/label
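Since the same two commands are repeated for every extra disk, they can be generated with a loop. The following is a minimal sketch, not part of the original procedure: it only prints the commands so they can be reviewed first, and it assumes the label file has already been saved with `bsdlabel da0s1 > /tmp/label`. The function name `clone_layout_commands` is made up for illustration.

```shell
#!/bin/sh
# Print the partitioning commands for each extra disk instead of running them.
# $1 = label file saved from da0s1, remaining arguments = target disks.
clone_layout_commands() {
    label=$1
    shift
    for disk in "$@"; do
        echo "fdisk -BI $disk"                  # one slice covering the whole disk
        echo "bsdlabel -RB ${disk}s1 $label"    # restore da0's label, install boot code
    done
}

clone_layout_commands /tmp/label da1 da2 da3 da4
```

Once the printed list looks right, piping it to `sh` runs it. The commands come out grouped per disk rather than all the fdisk calls first, which should make no difference since each disk is handled independently.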
Mirror the boot filesystem (excluding da0s1a which is currently our root filesystem)
# gmirror label boot da{1,2,3,4}s1a
# kldload geom_mirror
# newfs /dev/mirror/boot
# mount /dev/mirror/boot /mnt
Create the raid3 swap
# swapoff /dev/da0s1b
# graid3 label swap da{0,1,2,3,4}s1b
Create the ZFS pool
# zpool create tank raidz da{0,1,2,3,4}s1d
Create some extra/common mountpoints
# zfs create tank/usr
# zfs create tank/var
# zfs create tank/tmp
Edit /boot/loader.conf to contain the following
zfs_load="YES"
geom_mirror_load="YES"
geom_raid3_load="YES"
vm.kmem_size=629145600
vm.kmem_size_max=629145600
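The two `vm.kmem_size` values are byte counts: 629145600 is 600 MB (600 x 1024 x 1024). A one-line helper can confirm that, or recompute the value if a different limit suits the machine's RAM. The helper name `mib_to_bytes` is made up for illustration.

```shell
#!/bin/sh
# Convert a size in megabytes (MiB) to the byte count loader.conf expects.
mib_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

mib_to_bytes 600
```

Running it with 600 prints 629145600, matching the value used in loader.conf.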
Edit /etc/rc.conf and enable ZFS
# echo 'zfs_enable="YES"' >> /etc/rc.conf
Edit /etc/fstab to contain the following
/dev/mirror/boot  /       ufs     rw         1 1
/dev/raid3/swap   none    swap    sw         0 0
tank/var          /var    zfs     rw         0 0
tank/usr          /usr    zfs     rw         0 0
tank/tmp          /tmp    zfs     rw         0 0
/dev/acd0         /cdrom  cd9660  ro,noauto  0 0
Duplicate the root filesystem to the new boot mirror
# find -x / | cpio -pmd /mnt
# rm -rf /mnt/var/*
# rm -rf /mnt/usr/*
# rm -rf /mnt/tmp/*
# find -x /var /usr | cpio -pmd /tank
Set the mount points and reboot into the new raided system
# zfs set mountpoint=legacy tank/usr
# zfs set mountpoint=legacy tank/var
# zfs set mountpoint=legacy tank/tmp
# zfs set mountpoint=none tank
# shutdown -r now
That's it. Now we have a mirrored root with /var and /usr on our shiny new ZFS pool.
Post config
Activate the remaining boot mirror partition.
# gmirror insert boot da0s1a
Create filesystem quotas so that the system directories cannot chew up too much space. Reserved space is also an option.
# zfs set quota=512m tank/tmp
# zfs set quota=4G tank/usr
# zfs set quota=1G tank/var
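The reserved-space option is the ZFS `reservation` property: a quota caps how much a filesystem can use, while a reservation guarantees it a minimum that other filesystems in the pool cannot take. As a sketch, the helper below prints the `zfs set` commands for a small table of limits. The quotas are the ones above. The 256m reservation on tank/var is an invented example value, not something from this setup, and the function name `limit_commands` is also made up.

```shell
#!/bin/sh
# Read "filesystem quota reservation" triples on stdin and print the matching
# zfs set commands. A reservation of "-" means no reservation for that filesystem.
limit_commands() {
    while read -r fs quota reservation; do
        echo "zfs set quota=$quota $fs"
        [ "$reservation" = "-" ] || echo "zfs set reservation=$reservation $fs"
    done
}

limit_commands <<'EOF'
tank/tmp 512m -
tank/usr 4G -
tank/var 1G 256m
EOF
```

Keeping the limits in one table makes it easier to see at a glance how the pool is carved up. The output can be piped to `sh` once it has been checked.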
Set root password, create user accounts, etc.
The replicated filesystem
Create the media filesystem
# zfs create -o mountpoint=/media tank/media
The /media filesystem can use the entire ZFS pool, about 500G to play with.
# zfs list
NAME        USED   AVAIL  REFER  MOUNTPOINT
tank        2.26G  523G   28.8K  none
tank/media  28.8K  523G   28.8K  /media
tank/tmp    41.5K  512M   41.5K  legacy
tank/usr    2.22G  1.78G  2.22G  legacy
tank/var    34.1M  990M   34.1M  legacy
The filesystem needs an initial sync so that it is created on the remote server. Just take a snapshot of the current state and send it over the wire.
# zfs snapshot tank/media@base
# zfs send tank/media@base | ssh slave 'zfs receive slave/media'
The media filesystem is now on the remote slave box, and zfs list will show it.
slave# zfs list
NAME              USED  AVAIL  REFER  MOUNTPOINT
slave             661M  2.28G  173M   none
slave/media       18K   2.28G  18K    none
slave/media@base  0     -      18K    -
Create a file and sync it to the remote end.
# dd if=/dev/random of=/media/dummy bs=1m count=100
100+0 records in
100+0 records out
104857600 bytes transferred in 2.963048 secs (35388425 bytes/sec)
# zfs snapshot tank/media@20071202
# zfs send -i tank/media@base tank/media@20071202 | ssh slave 'zfs receive -F slave/media'
Now check the slave
# zfs list
NAME                  USED  AVAIL  REFER  MOUNTPOINT
slave                 761M  2.18G  173M   none
slave/media           100M  2.18G  100M   /media
slave/media@base      16K   -      18K    -
slave/media@20071202  0     -      100M   -
# ls -lh /media
total 102479
-rw-r--r--  1 root  wheel  100M Dec  3 09:33 dummy
Done. From now on we just create a snapshot and send the differences from the previous one. I need to roll it into a nice script and run it from cron. It doesn't matter how often things are synced; it can be as often as every few seconds or as rarely as once a day. Only the changes since the last snapshot are sent.
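A minimal sketch of what that cron script might look like follows. It is untested against a real pool and makes a few assumptions: the dataset and host names are the ones used above, snapshots are named with a timestamp so the job can run more than once a day, and `zfs list -s creation` is available to sort snapshots oldest first. The function names are made up for illustration. Nothing touches the pool unless the script is called with the argument `run`.

```shell
#!/bin/sh
# Sketch of an incremental replication job (names and layout are assumptions).
DATASET="tank/media"          # local filesystem to replicate
REMOTE_HOST="slave"           # ssh host that receives the stream
REMOTE_DATASET="slave/media"  # filesystem on the remote pool

# Read snapshot names on stdin (one per line, oldest first) and print the
# newest one that belongs to the dataset given as $1.
latest_snapshot() {
    grep "^$1@" | tail -n 1
}

# Print the pipeline that sends the changes between two snapshots.
# $1 = previous snapshot, $2 = new snapshot
send_command() {
    printf "zfs send -i %s %s | ssh %s 'zfs receive -F %s'\n" \
        "$1" "$2" "$REMOTE_HOST" "$REMOTE_DATASET"
}

# Take a new snapshot and send the difference from the previous one.
replicate() {
    prev=$(zfs list -H -t snapshot -o name -s creation | latest_snapshot "$DATASET")
    [ -n "$prev" ] || { echo "no previous snapshot of $DATASET" >&2; return 1; }
    new="$DATASET@$(date +%Y%m%d%H%M%S)"
    zfs snapshot "$new" || return 1
    sh -c "$(send_command "$prev" "$new")"
}

# Only act when asked explicitly, e.g. from cron.
if [ "${1:-}" = "run" ]; then
    replicate
fi
```

Saved under a hypothetical path such as /usr/local/sbin/replicate.sh, it could be run from root's crontab with a line like `*/15 * * * * /usr/local/sbin/replicate.sh run`. Old snapshots are never removed here, so a real script would also want to prune them on both ends while always keeping the most recent common snapshot.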
Disaster recovery
To get to single user mode do the following:
- Boot up
- Choose #4 (Single user mode) at the boot menu.
Once in, the ZFS filesystems need to be mounted
# /etc/rc.d/hostid start
# /etc/rc.d/zfs start
# mount -a
If the system is really hosed then you may need to use the statically linked binaries in /rescue.
Check some stuff out...
# df -h
# zpool list
