One of the important tunables in ZFS is the recordsize (for normal filesystem datasets) or volblocksize (for zvols). These default to 128KB and 8KB respectively.
As I understand it, this is the unit of work in ZFS. If you modify one byte in a large file with the default 128KB record size, it causes the whole 128KB to be read in, one byte to be changed, and a new 128KB block to be written out.
As a result, the official recommendation is to use a block size which aligns with the underlying workload: for example, if you are using a database which reads and writes 16KB chunks then you should use a 16KB block size, and if you are running VMs containing an ext4 filesystem, which uses a 4KB block size, you should set a 4KB block size.
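Creating datasets tuned this way might look like the following (the pool and dataset names here are illustrative, not from my system):

```shell
# Hypothetical examples - "tank/db" and "tank/vm-disk0" are made-up names.
# Filesystem dataset for a database doing 16KB reads/writes:
zfs create -o recordsize=16k tank/db

# 32GB zvol backing a VM whose guest ext4 filesystem uses 4KB blocks:
zfs create -b 4k -V 32G tank/vm-disk0
```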
But there are some side effects to this that are not immediately obvious, as I discovered when I tried converting a VM image from a file to a zvol.
Copying a sparse file to a zvol
This is the VM image file I wanted to migrate to a zvol:
```
root@nuc2:~# cd /zfs/vm/docker0-old
root@nuc2:/zfs/vm/docker0-old# ls -lhs
total 8.5G
8.5G -rwx------ 1 root root 16G Apr 30 15:09 d624ef03-df8a-40f1-a55e-e13a7356df7d.file.disk0
root@nuc2:/zfs/vm/docker0-old# du -sch
8.5G    .
8.5G    total
```
You can see it has a total file size of 16GB, of which only 8.5G has been touched and consumes space - that is, it’s a “sparse” file. The used space is also visible by looking at the ZFS filesystem in which this file resides:
```
root@nuc2:~# zfs list -r zfs/vm
NAME                 USED  AVAIL  REFER  MOUNTPOINT
zfs/vm              8.42G  43.2G   104K  /zfs/vm
zfs/vm/docker0-old  8.42G  43.2G  8.42G  /zfs/vm/docker0-old
```
All good so far. So I created a zvol to hold it, using the recommended 4KB block size:
```
# zfs create -b 4k -o compression=lz4 -V 16384M zfs/vm/71d6d53b-4240-457b-b4c7-f8316222c701.ext.disk0
```
(ignore the UUIDs; these come from ganeti, the VM manager I’m using)
Then I tried to copy the image file whilst maintaining its “sparseness”,
that is, only touching the blocks of the zvol which needed to be touched.
qemu-img is my normal tool of choice:
```
# qemu-img convert -S 4096 -f raw -O raw \
    /zfs/vm/docker0-old/d624ef03-df8a-40f1-a55e-e13a7356df7d.file.disk0 \
    /dev/zvol/zfs/vm/71d6d53b-4240-457b-b4c7-f8316222c701.ext.disk0
```
So how much disk space does the copy use?
```
root@nuc2:~# zfs list -r zfs/vm
NAME                                                   USED  AVAIL  REFER  MOUNTPOINT
zfs/vm                                                25.4G  26.2G   104K  /zfs/vm
zfs/vm/71d6d53b-4240-457b-b4c7-f8316222c701.ext.disk0 17.0G  28.7G  14.6G  -
zfs/vm/docker0-old                                    8.42G  26.2G  8.42G  /zfs/vm/docker0-old
```
That’s bad! The original referenced only 8.42G, but the copy references 14.6G - almost the entire 16GB has been touched! What’s gone wrong?
qemu-img isn’t working how I expect. What if I destroy and recreate
the zvol, and then use a different command to copy it?
```
# dd conv=sparse bs=4096 \
    if=/zfs/vm/docker0-old/d624ef03-df8a-40f1-a55e-e13a7356df7d.file.disk0 \
    of=/dev/zvol/zfs/vm/71d6d53b-4240-457b-b4c7-f8316222c701.ext.disk0
```
I got exactly the same poor result. Clearly it’s not
qemu-img at fault here.
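One way to see what is going on is to ask ZFS directly for the zvol’s block size and space accounting (all of these are standard zvol properties):

```shell
# Inspect the zvol's block size and where the space is going:
zfs get volblocksize,volsize,used,referenced,compressratio \
    zfs/vm/71d6d53b-4240-457b-b4c7-f8316222c701.ext.disk0
```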
Matching block size
I finally realised that the difference between the zfs filesystem and the zvol is the block size. I recreated the zvol with a 128K block size:
```
# zfs destroy zfs/vm/71d6d53b-4240-457b-b4c7-f8316222c701.ext.disk0
# zfs create -b 128k -o compression=lz4 -V 16384M zfs/vm/71d6d53b-4240-457b-b4c7-f8316222c701.ext.disk0
```
I repeated the qemu-img command, and then looked at the sizes:
```
root@nuc2:~# zfs list -r zfs/vm
NAME                                                   USED  AVAIL  REFER  MOUNTPOINT
zfs/vm                                                24.4G  27.2G   104K  /zfs/vm
zfs/vm/71d6d53b-4240-457b-b4c7-f8316222c701.ext.disk0 16.0G  34.8G  8.42G  -
zfs/vm/docker0-old                                    8.42G  27.2G  8.42G  /zfs/vm/docker0-old
```
That’s better. The disk usage of the zvol is now exactly the same as for the sparse file in the filesystem dataset.
Does it affect speed? Here’s the timing with 4K blocks:
```
root@nuc2:~# zfs create -b 4k -o compression=lz4 -V 16384M zfs/vm/71d6d53b-4240-457b-b4c7-f8316222c701.ext.disk0
root@nuc2:~# time qemu-img convert -S 4096 -f raw -O raw /zfs/vm/docker0-old/d624ef03-df8a-40f1-a55e-e13a7356df7d.file.disk0 /dev/zvol/zfs/vm/71d6d53b-4240-457b-b4c7-f8316222c701.ext.disk0

real    5m52.217s
user    0m4.920s
sys     0m40.892s
```
And with 128K blocks:
```
root@nuc2:~# zfs create -b 128k -o compression=lz4 -V 16384M zfs/vm/71d6d53b-4240-457b-b4c7-f8316222c701.ext.disk0
root@nuc2:~# time qemu-img convert -S 4096 -f raw -O raw /zfs/vm/docker0-old/d624ef03-df8a-40f1-a55e-e13a7356df7d.file.disk0 /dev/zvol/zfs/vm/71d6d53b-4240-457b-b4c7-f8316222c701.ext.disk0

real    3m20.018s
user    0m5.124s
sys     0m40.748s
```
It seems that large block sizes are much faster for large sequential writes like this. The amount of CPU time is more or less the same, but the wall-clock time is shorter. This may be because the underlying device is an SSD, which naturally uses 128KB flash erase pages.
(Aside: the hardware is an Intel NUC DN2820 with 8GB DDR3L RAM and a Crucial M550 120GB SSD)
I think there are at least two things at play here.
Firstly, the smaller block size means proportionately higher ZFS overhead, such as checksum blocks and indirect block pointers. Notice how the “USED” (i.e. allocated+reserved) value was 17.0G on the zvol with 4K blocks, but 16.0G with 128K blocks.
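The scale of that per-block overhead is easy to see from block counts alone: a 16GiB volume needs 32 times as many blocks (and therefore that many more checksums and block pointers) at 4K than at 128K. A quick back-of-the-envelope sketch:

```shell
# Number of blocks needed to address a 16GiB volume at each block size;
# every block carries its own checksum and block-pointer overhead.
vol_bytes=$((16 * 1024 * 1024 * 1024))
echo "4K blocks:   $((vol_bytes / 4096))"      # 4194304 blocks
echo "128K blocks: $((vol_bytes / 131072))"    # 131072 blocks
```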
Secondly, I think that compression works more efficiently when given larger blocks to compress.
I don’t know for sure that these are the actual mechanisms at play - they are just my best guesses.
But empirically, space efficiency is strongly influenced by the block size. Choosing a small block size can result in significantly larger disk space usage.
What should you choose?
Firstly, if you are benchmarking ZFS datasets versus zvols, you need to use identical block sizes. Under those conditions you might find that they both perform more or less the same, or you might not.
If you are used to using VM images on ZFS datasets with a 128K recordsize, and the performance is sufficient for your needs, you may as well use zvols with a 128K volblocksize. This will maximise your disk compression.
The extra I/O overhead of a large block size might not matter too much if you have enough RAM in your system, so that the “hot” areas remain in your ARC (cache).
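On Linux you can get a rough idea of how big your ARC currently is from the SPL kstats (this file only exists on systems with the ZFS module loaded):

```shell
# Print the current ARC size in bytes:
awk '$1 == "size" { print $3 }' /proc/spl/kstat/zfs/arcstats
```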
Otherwise, you may want to tune the block size for your workload, and this is true whether you are using a zfs filesystem or a zvol. Note that:
- The volblocksize for a zvol can only be set at volume creation time, so you’ll have to copy the data to a new zvol to change the block size.
- The recordsize for a filesystem dataset can be changed, but the change doesn’t affect existing files - so you either need to rewrite those files, or copy them to a new dataset, to get the full effect.
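As a sketch of that second point, changing the recordsize and then rewriting each file in place might look like this (the dataset name and file pattern are illustrative, and you’d want any VMs using the images stopped first):

```shell
# Hypothetical sketch: recordsize only applies to newly written blocks,
# so existing files must be rewritten to pick up the new value.
zfs set recordsize=16k tank/vm-images
for f in /tank/vm-images/*.disk0; do
    # copy-and-rename, preserving sparseness (GNU cp)
    cp --sparse=always "$f" "$f.tmp" && mv "$f.tmp" "$f"
done
```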
Finally, there are some relevant performance issues here and here.