The Importance of ZFS Block Size

Two of the most important tunables in ZFS are recordsize (for filesystem datasets) and volblocksize (for zvols). These default to 128KB and 8KB respectively.

As I understand it, this is the unit of work in ZFS. If you modify one byte in a large file with the default 128KB record size, it causes the whole 128KB to be read in, one byte to be changed, and a new 128KB block to be written out.
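To put rough numbers on that, here is a small sketch of the arithmetic. The function names are mine, not anything from ZFS, and it assumes every record touched by a write is rewritten in full, ignoring compression and metadata:

```shell
#!/bin/sh
# records_touched OFFSET LENGTH RECORDSIZE
# Number of records overlapped by a write of LENGTH bytes at byte OFFSET.
records_touched() {
    first=$(( $1 / $3 ))
    last=$(( ($1 + $2 - 1) / $3 ))
    echo $(( last - first + 1 ))
}

# rewritten_bytes OFFSET LENGTH RECORDSIZE
# Bytes written back out, assuming each touched record is rewritten whole.
rewritten_bytes() {
    n=$(records_touched "$1" "$2" "$3")
    echo $(( n * $3 ))
}

rewritten_bytes 0 1 131072            # one byte changed, 128KB records
rewritten_bytes 0 1 4096              # one byte changed, 4KB records
rewritten_bytes 114688 16384 131072   # 16KB database page, 128KB records
rewritten_bytes 114688 16384 16384    # 16KB database page, 16KB records
```

This prints 131072, 4096, 131072 and 16384: the 16KB page costs eight times as much write traffic when the record size is 128KB as when it matches the page size.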

As a result, the official recommendation is to use a block size which aligns with the underlying workload. For example, if you are using a database which reads and writes 16KB chunks then you should use a 16KB block size, and if you are running VMs containing an ext4 filesystem, which uses a 4KB block size, you should set a 4KB block size (see note 1 at the end).

But there are some side effects to this that are not immediately obvious, as I discovered when I tried converting a VM image from a file to a zvol.

Copying a sparse file to a zvol

This is the VM image file I wanted to migrate to a zvol:

root@nuc2:~# cd /zfs/vm/docker0-old

root@nuc2:/zfs/vm/docker0-old# ls -lhs
total 8.5G
8.5G -rwx------ 1 root root 16G Apr 30 15:09 d624ef03-df8a-40f1-a55e-e13a7356df7d.file.disk0

root@nuc2:/zfs/vm/docker0-old# du -sch
8.5G	.
8.5G	total

You can see it has a 16GB total file size, of which 8.5G has been touched and consumes space - that is, it’s a “sparse” file. The used space is also visible by looking at the zfs filesystem which this file resides in:

root@nuc2:~# zfs list -r zfs/vm
zfs/vm              8.42G  43.2G   104K  /zfs/vm
zfs/vm/docker0-old  8.42G  43.2G  8.42G  /zfs/vm/docker0-old
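If you want to see the same apparent-versus-allocated difference on a scratch file, something like the following should do it. This is only a sketch: it assumes GNU coreutils (truncate, stat -c, dd status=none) and a filesystem that supports sparse files, and the file name is made up:

```shell
#!/bin/sh
tmp=$(mktemp -d)
img="$tmp/sparse.img"

# 16 MiB apparent size, but no data blocks written yet
truncate -s 16M "$img"

# Write 4 KiB of non-zero data at the 8 MiB mark, leaving the rest as holes
head -c 4096 /dev/urandom | dd of="$img" bs=4096 seek=2048 conv=notrunc status=none

apparent=$(stat -c %s "$img")                                   # size in bytes
allocated=$(( $(stat -c %b "$img") * $(stat -c %B "$img") ))    # blocks * block unit
echo "apparent=$apparent allocated=$allocated"
ls -lhs "$img"

rm -rf "$tmp"
```

The first column of ls -lhs is the allocated size, which is what du reports too; the apparent size is the 16M shown later on the line.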

All good so far. So I created a zvol to hold it, using the recommended 4KB block size:

# zfs create -b 4k -o compression=lz4 -V 16384M zfs/vm/71d6d53b-4240-457b-b4c7-f8316222c701.ext.disk0

(ignore the UUIDs; these come from ganeti, the VM manager I’m using)

Then I tried to copy the image file whilst maintaining its “sparseness”, that is, only touching the blocks of the zvol which needed to be touched. qemu-img is my normal tool of choice:

# qemu-img convert -S 4096 -f raw -O raw \
  /zfs/vm/docker0-old/d624ef03-df8a-40f1-a55e-e13a7356df7d.file.disk0 \

So how much disk space does the copy use?

root@nuc2:~# zfs list -r zfs/vm
NAME                                                    USED  AVAIL  REFER  MOUNTPOINT
zfs/vm                                                 25.4G  26.2G   104K  /zfs/vm
zfs/vm/71d6d53b-4240-457b-b4c7-f8316222c701.ext.disk0  17.0G  28.7G  14.6G  -
zfs/vm/docker0-old                                     8.42G  26.2G  8.42G  /zfs/vm/docker0-old

That’s bad! The original used only 8.42G, but the copy uses 14.6GB - almost the entire 16GB has been touched! What’s gone wrong?

Maybe qemu-img isn’t working how I expect. What if I destroy and recreate the zvol, and then use a different command to copy it?

# dd conv=sparse bs=4096 \
  if=/zfs/vm/docker0-old/d624ef03-df8a-40f1-a55e-e13a7356df7d.file.disk0 \

I got exactly the same poor result. Clearly it’s not qemu-img at fault here.

Matching block size

I finally realised that the difference between the zfs filesystem and the zvol is the block size. I recreated the zvol with a 128K block size:

# zfs destroy zfs/vm/71d6d53b-4240-457b-b4c7-f8316222c701.ext.disk0
# zfs create -b 128k -o compression=lz4 -V 16384M zfs/vm/71d6d53b-4240-457b-b4c7-f8316222c701.ext.disk0

Repeated the qemu-img command, and then looked at the sizes:

root@nuc2:~# zfs list -r zfs/vm
NAME                                                    USED  AVAIL  REFER  MOUNTPOINT
zfs/vm                                                 24.4G  27.2G   104K  /zfs/vm
zfs/vm/71d6d53b-4240-457b-b4c7-f8316222c701.ext.disk0  16.0G  34.8G  8.42G  -
zfs/vm/docker0-old                                     8.42G  27.2G  8.42G  /zfs/vm/docker0-old

That’s better. The disk usage of the zvol is now exactly the same as for the sparse file in the filesystem dataset.

Does it affect speed? Here’s the timing with 4K blocks:

root@nuc2:~# zfs create -b 4k -o compression=lz4 -V 16384M zfs/vm/71d6d53b-4240-457b-b4c7-f8316222c701.ext.disk0
root@nuc2:~# time qemu-img convert -S 4096 -f raw -O raw /zfs/vm/docker0-old/d624ef03-df8a-40f1-a55e-e13a7356df7d.file.disk0 /dev/zvol/zfs/vm/71d6d53b-4240-457b-b4c7-f8316222c701.ext.disk0

real	5m52.217s
user	0m4.920s
sys	0m40.892s

And with 128K blocks:

root@nuc2:~# zfs create -b 128k -o compression=lz4 -V 16384M zfs/vm/71d6d53b-4240-457b-b4c7-f8316222c701.ext.disk0
root@nuc2:~# time qemu-img convert -S 4096 -f raw -O raw /zfs/vm/docker0-old/d624ef03-df8a-40f1-a55e-e13a7356df7d.file.disk0 /dev/zvol/zfs/vm/71d6d53b-4240-457b-b4c7-f8316222c701.ext.disk0

real	3m20.018s
user	0m5.124s
sys	0m40.748s

It seems that large block sizes are much faster when doing large sequential writes like this. The amount of CPU time is more or less the same, but the wall-clock time dropped from 5m52s to 3m20s. This may be because the underlying device is an SSD, which naturally uses 128KB flash erase pages.
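For the record, here is how the two "real" times compare as a ratio. The helper names are my own, and the only inputs are the timings printed above:

```shell
#!/bin/sh
# to_seconds 5m52.217s  ->  seconds as a decimal number
to_seconds() {
    echo "$1" | LC_ALL=C awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# speedup SLOW FAST  ->  how many times faster FAST is than SLOW
speedup() {
    LC_ALL=C awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" \
        'BEGIN { printf "%.2f\n", a / b }'
}

to_seconds 5m52.217s            # 4K volblocksize
to_seconds 3m20.018s            # 128K volblocksize
speedup 5m52.217s 3m20.018s
```

That works out to 352.217 and 200.018 seconds, so the 128K copy was about 1.76 times faster on this hardware. It is a single run of each, so treat the figure as indicative only.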

(Aside: the hardware is an Intel NUC DN2820 with 8GB DDR3L RAM and a Crucial M550 120GB SSD)

What’s happening?

I think there are at least two things at play here.

Firstly, the smaller block size means proportionately higher ZFS overhead, such as checksum blocks and indirect block pointers. Notice how the “USED” (i.e. allocated+reserved) value was 17.0G on the zvol with 4K blocks, but 16.0G with 128K blocks.
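A back-of-envelope estimate of the block pointer part of that overhead is sketched below. It assumes a 128-byte block pointer per data block (the size of ZFS's blkptr_t), counts only the lowest level of indirect blocks, and ignores compression of those blocks and any extra copies of metadata that ZFS keeps, so it is a rough guide rather than what ZFS actually reserves:

```shell
#!/bin/sh
# bp_overhead_mib VOLSIZE_BYTES BLOCKSIZE_BYTES
# MiB of first-level block pointers needed if every block of the volume is allocated.
bp_overhead_mib() {
    blocks=$(( $1 / $2 ))
    echo $(( blocks * 128 / 1048576 ))
}

volsize=$(( 16384 * 1024 * 1024 ))   # the 16384M zvol from this post

bp_overhead_mib "$volsize" 4096      # 4K volblocksize
bp_overhead_mib "$volsize" 131072    # 128K volblocksize
```

This gives 512 MiB of pointers for 4K blocks against 16 MiB for 128K blocks, a factor of 32. That is at least the right order of magnitude for the difference in USED between the two zvols.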

Secondly, I think that compression works more efficiently when given larger blocks to compress.
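One way this could happen is through sector rounding: ZFS allocates space in whole sectors, so a compressed block still occupies a multiple of the pool's sector size. The sketch below assumes 4KB sectors (ashift=12), which I have not checked on this pool, and the compressed sizes are invented purely for illustration:

```shell
#!/bin/sh
# allocated_bytes COMPRESSED_BYTES SECTOR_BYTES
# Round a compressed block size up to a whole number of sectors.
allocated_bytes() {
    echo $(( ($1 + $2 - 1) / $2 * $2 ))
}

# Hypothetical data that compresses to about 59% of its original size:
allocated_bytes 2400 4096     # a 4KB block compressed to 2400 bytes
allocated_bytes 76800 4096    # a 128KB block compressed to 76800 bytes
```

Under those assumptions the 4KB block still occupies 4096 bytes, so compression saves nothing at all, while the 128KB block occupies 77824 bytes instead of 131072.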

I don’t know for sure that these are the actual mechanisms at play - they are just my best guesses.

But empirically, space efficiency is strongly influenced by the block size. Choosing a small block size can result in significantly larger disk space usage.

What should you choose?

Firstly, if you are benchmarking ZFS datasets versus zvols, you need to use identical block sizes. Under those conditions you might find that they both perform more or less the same, or you might not.

If you are used to using VM images on ZFS datasets with a 128K recordsize, and the performance is sufficient for your needs, you may as well use zvols with a 128K volblocksize. This will maximise your disk compression.

The extra I/O overhead of a large block size might not matter too much if you have enough RAM in your system, so that the “hot” areas remain in your ARC (cache).

Otherwise, you may want to tune the block size for your workload, and this is true whether you are using a zfs filesystem or a zvol. Note that:

  • The volblocksize for a zvol can only be set at volume creation time, so you’ll have to copy it to a new zvol to change the blocksize.
  • The recordsize for a filesystem dataset can be changed at any time, but the change doesn't affect existing files - so you either need to rewrite those files, or copy them to a new dataset, to get the full effect.
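The two notes above translate into commands roughly like these. To keep it safe to paste, the sketch only prints the commands rather than running them; the dataset and zvol names are placeholders, the /dev/zvol paths assume Linux, and conv=sparse assumes GNU dd:

```shell
#!/bin/sh
# plan_zvol_copy OLD NEW SIZE BLOCKSIZE
# Print (not run) the commands to copy zvol OLD into a new zvol NEW
# created with a different volblocksize.
plan_zvol_copy() {
    old=$1 new=$2 size=$3 bs=$4
    echo "zfs get -H -o value volblocksize $old"
    echo "zfs create -b $bs -o compression=lz4 -V $size $new"
    echo "dd if=/dev/zvol/$old of=/dev/zvol/$new bs=$bs conv=sparse"
}

# plan_recordsize_change DATASET RECORDSIZE
# Print the command to change recordsize; only files written afterwards use it.
plan_recordsize_change() {
    echo "zfs set recordsize=$2 $1"
}

plan_zvol_copy zfs/vm/old-disk zfs/vm/new-disk 16384M 128k
plan_recordsize_change zfs/vm/some-dataset 16K
```

Using the same value for dd's bs as for the volblocksize means each all-zero chunk that dd skips corresponds to a whole block of the new zvol.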

Finally, there are some relevant performance issues here and here.

  1. Other advice says not to use anything smaller than 16KB.
