BtrFS as an ongoing project

BtrFS is still an ongoing project for me, but whether it will become a production platform for me anytime soon is the question. Playing with mirroring at the BtrFS level made me wonder even more, as it calculates storage usage a little differently. Normally with mirroring you see the storage you can allocate and what has been allocated. With BtrFS you see the total amount of space available on all disks combined, as shown in the example below.

$ sudo btrfs filesystem df /mnt
Data, RAID1: total=5.98GB, used=5.36GB
System, RAID1: total=8.00MB, used=4.00KB
System: total=4.00MB, used=0.00
Metadata, RAID1: total=256.00MB, used=6.01MB
$ df -h /mnt
Filesystem                Size  Used Avail Use% Mounted on
/dev/mapper/vg01-btrfsm1   16G   11G  4.8G  70% /mnt

I really like ZFS, but I really wonder if BtrFS could replace it. For now I see too many drawbacks in how BtrFS has been implemented and how distributions may use it. Maybe when Debian 8 is in testing it will be a better time to give BtrFS another chance, but swap space and encrypted file systems are still problems that need to be tackled.

BtrFS and readonly snapshots

In a previous posting I started with BtrFS, and as mentioned BtrFS supports snapshotting. With this you can create a point-in-time copy of a subvolume and even create a clone that can be used as a new working subvolume. To start we first need the BtrFS volume itself, which can and must always be identified as subvolid 0. This is needed because the default volume to be mounted can be altered to point at a subvolume instead of the real root of the BtrFS volume, as shown after the fstab example below. We start with updating /etc/fstab so we can mount the BtrFS volume.

LABEL=datavol	/home	btrfs	defaults,subvol=home	0	0
LABEL=datavol	/media/btrfs-datavol	btrfs	defaults,noauto,subvolid=0	0	0
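
The default subvolume that is mounted when no subvol option is given can also be changed with the btrfs tool itself. A minimal sketch, assuming the volume is mounted on /media/btrfs-datavol and that the subvolume ID is looked up first with the list subcommand:

$ sudo btrfs subvolume list /media/btrfs-datavol
$ sudo btrfs subvolume set-default <subvolume-id> /media/btrfs-datavol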

As /media is a temporary file system, meaning it is recreated with every reboot, we need to create a mountpoint for the BtrFS volume before mounting. After that we create two read-only snapshots with a small delay in between. As there is currently no naming guideline for snapshots, I adopted the ZFS naming schema with the @-sign as separator between the subvolume name and a timestamp.

$ sudo mkdir -m 0755 /media/btrfs-datavol
$ sudo mount /media/btrfs-datavol
$ cd /media/btrfs-datavol
$ sudo btrfs subvolume snapshot -r home home@`date "+%Y%m%d-%H%M%S-%Z"`
Create a readonly snapshot of 'home' in './home@20120121-084709-CET'
...
$ sudo btrfs subvolume snapshot -r home home@`date "+%Y%m%d-%H%M%S-%Z"`
Create a readonly snapshot of 'home' in './home@20120121-084731-CET'
$ ls -l
total 0
drwxr-xr-x 1 root root 52 nov 21  2010 home
drwxr-xr-x 1 root root 52 nov 21  2010 home@20120121-084709-CET
drwxr-xr-x 1 root root 52 nov 21  2010 home@20120121-084731-CET

We now have two read-only snapshots, so let's test whether they really are read-only subvolumes. Creating a new file shouldn't be possible.

$ sudo touch home@20120121-084709-CET/test.txt
touch: cannot touch `home@20120121-084709-CET/test.txt': Read-only file system

Creating snapshots is fun and handy for migrations or as an on-disk backup solution, but they do consume space, as the deltas between snapshots are kept on disk. This means that changes between snapshots stay on disk even when you remove files. Freeing disk space is therefore not only a matter of removing files from the current subvolume, but also of removing the older snapshots that still reference the removed data.

$ sudo btrfs subvolume delete home@20120121-084709-CET
Delete subvolume '/media/btrfs-datavol/home@20120121-084709-CET'
$ ls -l 
total 0
drwxr-xr-x 1 root root 52 nov 21  2010 home
drwxr-xr-x 1 root root 52 nov 21  2010 home@20120121-084731-CET

As a last step we unmount the BtrFS volume again. This is where ZFS and BtrFS differ too much for my taste. To create and access snapshots on ZFS the zpool doesn't need to be mounted, but then again with the first few releases of ZFS the zpool needed to be mounted as well. So there is still hope, as BtrFS is still under development.

$ sudo umount /media/btrfs-datavol

Seeing what is possible with BtrFS, something like Sun's TimeSlider becomes an option. The same goes for live upgrades with rollbacks as is possible with Solaris 11, but for that BtrFS with read-write snapshots needs to be tested in the near future.
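
Those read-write snapshots are simply snapshots created without the -r flag; a minimal sketch, with home-rw as a hypothetical name for the writable clone:

$ sudo btrfs subvolume snapshot home home-rw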

First steps with BtrFS

After using ZFS on Solaris I missed the ZFS features on Linux, and with no chance of ZFS coming to Linux I had to make do with MD and LVM. At least until BtrFS became mature enough, and since Linux 3.0 that time has slowly come. With Linux 3.0 BtrFS supports autodefragmentation and scrubbing of volumes. The second is maybe the most important feature of both ZFS and BtrFS, as it can be used to actively scan data on disk for errors.

The first tests with BtrFS were in a virtual machine a long time ago, when the userland tools were still in development. Now the command btrfs follows the path set by Sun Microsystems and basically combines what the commands zfs and zpool do for ZFS. But nothing compares to a test in the real world, so I broke a mirror and created a BtrFS volume with the name datavol:

$ sudo mkfs.btrfs -L 'datavol' /dev/sdb2

Now we can mount the volume and create a subvolume on it, which we are going to use as our new home volume for users' home directories.

$ sudo mount /dev/sdb2 /mnt
$ sudo btrfs subvolume create /mnt/home
$ sudo umount /dev/sdb2

When updating /etc/fstab we can tell mount to use the volume label instead of a physical path to a device or some obscure UUID. We can also tell it which subvolume to mount.

LABEL=datavol	/home	btrfs	defaults,subvol=home	0	0

After unmounting and disabling the original volume for /home we can mount everything and copy over all the data, with rsync for example, to see how BtrFS works in the real world; a sketch of that copy follows after the mount command below.

$ sudo mount -a
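
The copy itself could be as simple as the sketch below, assuming the old home volume is still mounted read-only on a hypothetical /mnt/home-old:

$ sudo rsync -aHAX /mnt/home-old/ /home/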

As hinted before, scrubbing is important as it verifies that all your data and metadata on disk is still correct. By default it does a read-write pass, but a read-only test to see if all data can still be accessed is possible too. There is even an option to read parts of the volume that are still unused. In the example below the volume holding /home is being scrubbed, and with success.
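
The scrub itself is started with the start subcommand; a minimal sketch, where -r, if I read the help output correctly, keeps it to a read-only pass:

$ sudo btrfs scrub start /mnt
$ sudo btrfs scrub start -r /mnt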

$ sudo btrfs scrub status /home
scrub status for afed6685-315d-4c4d-bac2-865388b28fd2
	scrub started at Sat Jan 17 15:11:58 2012, running for 106 seconds
	total bytes scrubbed: 5.77GB with 0 errors
...
$ sudo btrfs scrub status /mnt
scrub status for afed6685-315d-4c4d-bac2-865388b28fd2
	scrub started at Sat Jan 17 15:11:58 2012 and finished after 11125 seconds
	total bytes scrubbed: 792.82GB with 0 errors

The first glances of BtrFS in the real world are a lot better with kernel 3.1 than they were somewhere around kernel 2.6.30, and I'm slowly starting to say it is becoming ready to be included in RHEL 7 or Debian 8, for example, as the default storage solution, the same way ZFS became the default in Solaris 11. But it is not all glory, as a lot of work still needs to be done.

The first issue is encryption, as the LUKS era ends with BtrFS: it is not smart to put LUKS between your disks and BtrFS, because you lose the advantage of balancing data between disks when you do mirroring, for example. Then again, LVM has the same issue, where you also first need to set up software RAID with MD, with LUKS on top of it and LVM on top of that. For home directories EncFS may be an option, but it still leaves a lot of areas uncovered that would be covered by LUKS out of the box.
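
For the home directory case an EncFS setup would look something like this sketch, with ~/.Private as a hypothetical encrypted store mounted on ~/Private:

$ mkdir -p ~/.Private ~/Private
$ encfs ~/.Private ~/Private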

The second issue is the integration of BtrFS in distributions and the handling of snapshots. For now you first need to mount the volume before you can make a snapshot of a subvolume. The same goes for accessing a snapshot, and there I think ZFS still has an advantage with the .zfs directory being accessible to everyone who has access to the filesystem. But time will tell, and for now the first tests look great.

ZFS vs Btrfs

In 2006 Sun Microsystems integrated a new filesystem by the name of ZFS into Solaris 10 and many were skeptical. There was also plenty of criticism about the chosen license, and that criticism is still there now that ZFS is also missing from Apple's Snow Leopard. With the takeover by Oracle it may not get any better, and yet a ZFS book, Solaris 10 ZFS Essentials, is on the roadmap.

In the same period Oracle started working on Btrfs for Linux, which has been included in the kernel since Linux 2.6.29. ZFS is now starting to become reasonably mature and that took some years, but how is Btrfs doing? A virtual test machine with a few empty disks and the latest version of Debian Testing should be enough to find out. Unfortunately this is not yet a successful combination, which suggests Btrfs may not be entirely mature yet.

Looking at the technology and the implementation, there are more differences between ZFS and Btrfs. ZFS offers RAID 0, 1 and 10, but also several RAIDZ generations and combinations of them, which implement RAID 5 and 6. Btrfs for now only offers RAID0, RAID1 and RAID10, without any extensions on top of that. Btrfs also keeps relying on other RAID and LVM solutions, and on hardware that is always correct. A nice feature of ZFS, among others, is that it takes the hardware into account when deciding, for example, where reading is fastest.
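
On the Btrfs side the RAID profile is chosen at creation time, for data and metadata separately; a minimal sketch with two hypothetical partitions:

$ sudo mkfs.btrfs -L datavol -d raid1 -m raid1 /dev/sdb2 /dev/sdc2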

On almost all points ZFS appears to be mature, and Btrfs still needs many years of development, while ZFS was already production-ready in 2006. And although ZFS is tightly woven into Solaris and carries a non-GPL license, it is definitely worth using. With the arrival of Solaris 11 and the outcome of the Sun-Oracle merger, the question is whether it is still interesting to look at Btrfs at all, certainly when you see how mature OpenSolaris is now compared to what Linux does not offer. A license alone is often not enough, or it has to be worth the wait, but live upgrade with ZFS is nowadays really worthwhile and a relief.

Inodes, inodes, inodes, and then they ran out

Sometimes you have one of those days where you suddenly get a message that a service is no longer processing data. On inspection there seems to be little to find, since among other things there is still plenty of free space on the filesystem. Yet the application keeps complaining that it cannot write.

On further inspection an old classic surfaces: a filesystem turns out to have no free inodes left, so the application can no longer expire data either. That leaves only two options: freeing up some inodes or making extra inodes available. Unfortunately the first is not really an option when the data is important or index dependent, which leaves the second option of growing the logical volume and after that the filesystem.
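
Checking and fixing this looks roughly like the sketch below, with /data and vg01/data as hypothetical names: df -h shows plenty of free blocks while df -i shows no free inodes left, and growing the logical volume plus the filesystem adds new block groups and with them new inodes:

$ df -h /data
$ df -i /data
$ sudo lvextend -L +10G /dev/vg01/data
$ sudo resize2fs /dev/vg01/data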

Reading the manpage for ext2/3/4 it turns out that you cannot specify that inodes should be reserved for the root user, as you can with the number of free blocks. It also makes me wonder how this is solved in ZFS, but it does become clear that filesystems based on a fixed number of inodes have had their best time.
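
What you can do is pick the number of inodes, or the bytes-per-inode ratio, at creation time, next to the reserved blocks percentage; a minimal sketch with hypothetical values:

$ sudo mkfs.ext4 -i 8192 -m 5 -L data /dev/vg01/data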