The Importance of ZFS Block Size
One of the important tunables in ZFS is the block size: the recordsize property for normal datasets, and volblocksize for zvols. These default to 128KB and 8KB respectively.
As I understand it, this is the unit of work in ZFS. If you modify one byte in a large file with the default 128KB record size, it causes the whole 128KB to be read in, one byte to be changed, and a new 128KB block to be written out.
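This read-modify-write amplification is easy to put numbers on. A back-of-envelope sketch (the figures simply restate the defaults above; nothing here is measured):

```python
# Rough write-amplification estimate for ZFS's read-modify-write cycle.
# Modifying any fraction of a block forces the whole block to be
# rewritten (and, if it isn't cached, read in first).

RECORDSIZE = 128 * 1024   # default recordsize for datasets (bytes)
VOLBLOCKSIZE = 8 * 1024   # default volblocksize for zvols (bytes)

def write_amplification(write_size: int, block_size: int) -> float:
    """Bytes physically written per byte logically written, assuming
    the write falls within a single block."""
    return block_size / write_size

# A 1-byte update inside a 128KB record rewrites all 131072 bytes:
print(write_amplification(1, RECORDSIZE))       # 131072.0

# A 4KB database page update on an 8KB zvol rewrites twice the data:
print(write_amplification(4096, VOLBLOCKSIZE))  # 2.0
```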
As a result, the official recommendation is to use a block size which aligns with the underlying workload: for example, if you are running a database which reads and writes 16KB chunks, you should use a 16KB block size; and if you are running VMs containing an ext4 filesystem, which uses a 4KB block size, you should set a 4KB block size.[1]
But there are some side effects to this that are not immediately obvious, as I discovered when I tried converting a VM image from a file to a zvol.
Copying a sparse file to a zvol
This is the VM image file I wanted to migrate to a zvol:
root@nuc2:~# cd /zfs/vm/docker0-old
root@nuc2:/zfs/vm/docker0-old# ls -lhs
total 8.5G
8.5G -rwx------ 1 root root 16G Apr 30 15:09 d624ef03-df8a-40f1-a55e-e13a7356df7d.file.disk0
root@nuc2:/zfs/vm/docker0-old# du -sch
8.5G .
8.5G total
You can see it has an apparent size of 16GB, of which only 8.5G has actually been touched and consumes space - that is, it's a "sparse" file. The used space is also visible in the ZFS filesystem in which this file resides:
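Sparse files aren't ZFS-specific; the gap between apparent and allocated size can be demonstrated on most filesystems. A small throwaway Python illustration (not part of the migration itself):

```python
import os
import tempfile

# Create a 16MB sparse file with only 1MB of real data at the start.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * (1 * 1024 * 1024))   # 1MB actually written
    f.truncate(16 * 1024 * 1024)        # extend to 16MB without writing
    path = f.name

st = os.stat(path)
apparent = st.st_size             # what `ls -l` reports
allocated = st.st_blocks * 512    # what `du` reports (512-byte units)
print(apparent, allocated)        # allocated is far below apparent
os.remove(path)
```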
root@nuc2:~# zfs list -r zfs/vm
NAME                 USED  AVAIL  REFER  MOUNTPOINT
zfs/vm              8.42G  43.2G   104K  /zfs/vm
zfs/vm/docker0-old  8.42G  43.2G  8.42G  /zfs/vm/docker0-old
All good so far. So I created a zvol to hold it, using the recommended 4KB block size:
# zfs create -b 4k -o compression=lz4 -V 16384M zfs/vm/71d6d53b-4240-457b-b4c7-f8316222c701.ext.disk0
(ignore the UUIDs; these come from ganeti, the VM manager I’m using)
Then I tried to copy the image file whilst maintaining its “sparseness”,
that is, only touching the blocks of the zvol which needed to be touched.
qemu-img is my normal tool of choice:
# qemu-img convert -S 4096 -f raw -O raw \
/zfs/vm/docker0-old/d624ef03-df8a-40f1-a55e-e13a7356df7d.file.disk0 \
/dev/zvol/zfs/vm/71d6d53b-4240-457b-b4c7-f8316222c701.ext.disk0
So how much disk space does the copy use?
root@nuc2:~# zfs list -r zfs/vm
NAME                                                    USED  AVAIL  REFER  MOUNTPOINT
zfs/vm                                                 25.4G  26.2G   104K  /zfs/vm
zfs/vm/71d6d53b-4240-457b-b4c7-f8316222c701.ext.disk0  17.0G  28.7G  14.6G  -
zfs/vm/docker0-old                                     8.42G  26.2G  8.42G  /zfs/vm/docker0-old
That’s bad! The original used only 8.42G, but the copy uses 14.6GB - almost the entire 16GB has been touched! What’s gone wrong?
Maybe qemu-img isn’t working how I expect. What if I destroy and recreate
the zvol, and then use a different command to copy it?
# dd conv=sparse bs=4096 \
if=/zfs/vm/docker0-old/d624ef03-df8a-40f1-a55e-e13a7356df7d.file.disk0 \
of=/dev/zvol/zfs/vm/71d6d53b-4240-457b-b4c7-f8316222c701.ext.disk0
I got exactly the same poor result. Clearly it’s not qemu-img at fault here.
Matching block size
I finally realised that the difference between the zfs filesystem and the zvol is the block size. I recreated the zvol with a 128K block size:
# zfs destroy zfs/vm/71d6d53b-4240-457b-b4c7-f8316222c701.ext.disk0
# zfs create -b 128k -o compression=lz4 -V 16384M zfs/vm/71d6d53b-4240-457b-b4c7-f8316222c701.ext.disk0
Repeated the qemu-img command, and then looked at the sizes:
root@nuc2:~# zfs list -r zfs/vm
NAME                                                    USED  AVAIL  REFER  MOUNTPOINT
zfs/vm                                                 24.4G  27.2G   104K  /zfs/vm
zfs/vm/71d6d53b-4240-457b-b4c7-f8316222c701.ext.disk0  16.0G  34.8G  8.42G  -
zfs/vm/docker0-old                                     8.42G  27.2G  8.42G  /zfs/vm/docker0-old
That’s better. The disk usage of the zvol is now exactly the same as for the sparse file in the filesystem dataset.
Does it affect speed? Here’s the timing with 4K blocks:
root@nuc2:~# zfs create -b 4k -o compression=lz4 -V 16384M zfs/vm/71d6d53b-4240-457b-b4c7-f8316222c701.ext.disk0
root@nuc2:~# time qemu-img convert -S 4096 -f raw -O raw /zfs/vm/docker0-old/d624ef03-df8a-40f1-a55e-e13a7356df7d.file.disk0 /dev/zvol/zfs/vm/71d6d53b-4240-457b-b4c7-f8316222c701.ext.disk0
real 5m52.217s
user 0m4.920s
sys 0m40.892s
And with 128K blocks:
root@nuc2:~# zfs create -b 128k -o compression=lz4 -V 16384M zfs/vm/71d6d53b-4240-457b-b4c7-f8316222c701.ext.disk0
root@nuc2:~# time qemu-img convert -S 4096 -f raw -O raw /zfs/vm/docker0-old/d624ef03-df8a-40f1-a55e-e13a7356df7d.file.disk0 /dev/zvol/zfs/vm/71d6d53b-4240-457b-b4c7-f8316222c701.ext.disk0
real 3m20.018s
user 0m5.124s
sys 0m40.748s
It seems that large block sizes are much faster for large sequential writes like this. The amount of CPU time is more or less the same, but the wall-clock time is shorter. This may be because the underlying device is an SSD, which naturally uses 128KB flash erase pages.
(Aside: the hardware is an Intel NUC DN2820 with 8GB DDR3L RAM and a Crucial M550 120GB SSD)
What’s happening?
I think there are at least two things at play here.
Firstly, a smaller block size means proportionately higher ZFS metadata overhead: each block needs its own block pointer (which embeds its checksum), plus indirect blocks to hold those pointers. Notice how the "USED" (i.e. allocated+reserved) value was 17.0G on the zvol with 4K blocks, but 16.0G with 128K blocks.
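A rough back-of-envelope supports that first guess. A ZFS block pointer is 128 bytes, so just mapping a 16GiB volume costs (a sketch which ignores the indirect-block levels and any other metadata):

```python
# Estimate block-pointer overhead for a 16GiB zvol at two block sizes.
# Each ZFS block pointer (which embeds the checksum) is 128 bytes;
# the indirect blocks that contain these pointers add further overhead
# on top, which this sketch does not count.

VOLSIZE = 16 * 1024**3   # 16GiB
BLKPTR = 128             # bytes per ZFS block pointer

def pointer_overhead(block_size: int) -> int:
    """Bytes of leaf-level block pointers needed to map the volume."""
    return (VOLSIZE // block_size) * BLKPTR

print(pointer_overhead(4 * 1024) / 1024**2)    # 512.0 (MiB of pointers)
print(pointer_overhead(128 * 1024) / 1024**2)  # 16.0  (MiB of pointers)
```

Half a gigabyte of pointers at 4K, before even counting the indirect blocks that contain them, goes a long way towards explaining the 17.0G vs 16.0G difference in "USED".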
Secondly, I think that compression works more efficiently when given larger blocks to compress.
I don’t know for sure that these are the actual mechanisms at play - they are just my best guesses.
But empirically, space efficiency is strongly influenced by the block size. Choosing a small block size can result in significantly larger disk space usage.
What should you choose?
Firstly, if you are benchmarking ZFS datasets versus zvols, you need to use identical block sizes. Under those conditions you might find that they both perform more or less the same, or you might not.
If you are used to using VM images on ZFS datasets with a 128K recordsize, and the performance is sufficient for your needs, you may as well use zvols with a 128K volblocksize. This will maximise your disk compression.
The extra I/O overhead of a large block size might not matter too much if you have enough RAM in your system, so that the “hot” areas remain in your ARC (cache).
Otherwise, you may want to tune the block size for your workload, and this is true whether you are using a zfs filesystem or a zvol. Note that:
- The volblocksize for a zvol can only be set at volume creation time, so you’ll have to copy it to a new zvol to change the blocksize.
- The recordsize for a filesystem dataset can be changed, but doesn't affect existing files - so you either need to rewrite those files, or copy them to a new dataset, to get the full effect.
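For the dataset case, rewriting a file can be as simple as copying it and renaming the copy over the original. A minimal sketch (rewrite_file is my own hypothetical helper, not a standard tool; note that a plain copy like this may also lose sparseness):

```python
import os
import shutil

def rewrite_file(path: str) -> None:
    """Copy a file and rename the copy over the original, so the data
    is rewritten using the dataset's current recordsize."""
    tmp = path + ".rewrite"
    shutil.copy2(path, tmp)   # copy data and metadata to a new file
    os.replace(tmp, path)     # atomically rename over the original

# Usage: after e.g. `zfs set recordsize=16k pool/dataset`, run
# rewrite_file() over each existing file in that dataset.
```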
Finally, there are some relevant performance issues here and here.
[1] Other advice says not to use anything smaller than 16KB.