In Jim Salter’s excellent series of articles on ZFS, he points out a potential pitfall of ZVOLs: snapshots fail unless the zpool has free space equal to the size of the ZVOL.
“If you have a dataset that’s occupying 85% of your pool, you can snapshot that dataset any time you like. If you have a ZVOL that’s occupying 85% of your pool, you cannot snapshot it, period.”
He ends by saying:
“Think long and hard before you implement ZVOLs. Then, you know… don’t.”
Whilst it sounds like ZVOLs have some inherent design flaw, it turns out that this behaviour is intentional, documented, and designed for safety. But if you want the unsafe behaviour, you can easily get it too.
ZVOLs and reservations
The reason for the observed behaviour is described in the zfs(8) manpage, in the description of the volsize property:
“For volumes, specifies the logical size of the volume. By default, creating a volume establishes a reservation of equal size. For storage pools with a version number of 9 or higher, a refreservation is set instead. Any changes to volsize are reflected in an equivalent change to the reservation (or refreservation). ... The reservation is kept equal to the volume's logical size to prevent unexpected behavior for consumers. Without the reservation, the volume could run out of space, resulting in undefined behavior or data corruption, depending on how the volume is used.”
What does this mean?
Well, if you create a ZVOL of size 15GiB, what you are actually creating is a virtual block device of size 15GiB. The application which uses this block device (e.g. a virtual machine) is free to write to any of those blocks, whenever it feels like it.
If you were using a disk partition or an LVM logical volume, the actual blocks would be allocated up-front. In the case of LVM it would allocate “extents”, normally 4MB each, to make up the requested disk space.[1]
ZFS does the allocation “lazily”: that is, it doesn’t actually allocate space for a block until that block is written to. So to ensure that it never runs out of space, it creates a reservation of equal size to the ZVOL to guarantee that writes to the ZVOL never fail due to no space being available in the zpool.
ZVOLs and snapshots
Let’s go back to our example of a 15GiB ZVOL. As the user writes data to disk, blocks are allocated. In the end, once all 15GiB has been written to, there is 15GiB allocated on disk, equal to the reservation. ZFS never overwrites data in place; but if the ZVOL user overwrites an existing block then the old version of the block can be freed up (garbage collected), so the total usage remains at 15GiB.
Now let’s suppose you take a snapshot of that ZVOL. The current 15GiB of data blocks are now cast in stone, immutable. But the user of the ZVOL could continue to write data, until in the worst case they write all 15GiB with new data, which will require allocating 15GiB of new space.
To allow this to happen, and guarantee that the ZVOL cannot run out of space, suddenly a new 15GiB of space must be reserved. And if there isn’t 15GiB of free space available, ZFS refuses to make the snapshot.
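The snapshot-time check can be sketched as simple arithmetic. This is a toy illustration, not ZFS’s actual code (which works in bytes and accounts for metadata); the function name is hypothetical:

```shell
# Toy illustration of the snapshot-time check (NOT ZFS's actual code):
# after a snapshot, the worst case is that every block of the zvol is
# rewritten, so ZFS wants a fresh reservation of the full volsize.
can_snapshot() {
    volsize_gib=$1
    pool_free_gib=$2
    if [ "$pool_free_gib" -ge "$volsize_gib" ]; then
        echo "snapshot ok"
    else
        echo "out of space: need ${volsize_gib}GiB, have ${pool_free_gib}GiB"
    fi
}

can_snapshot 15 4    # a full 15GiB zvol in a mostly-full 20GiB pool
# -> out of space: need 15GiB, have 4GiB
```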
This is conservative, safe behaviour. If ZFS didn’t do this, and allowed a snapshot to be taken with less than 15GiB of free space in the zpool, there would be the real risk that a virtual machine writing to the ZVOL could end up getting a write error due to lack of space in the zpool.
Overcommitting storage with ZVOLs
Another way of thinking about this issue is: suppose you have created a 20GiB zpool (as in JRS’s article). What would you expect to happen if:
- You tried to create three separate 10GiB ZVOLs?
- You tried to create a single 30GiB ZVOL?
If you think those examples should fail, then you understand why ZFS is making reservations.
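The same arithmetic governs creation time: each non-sparse ZVOL reserves its full logical size up-front. A toy sketch of that check (hypothetical function, sizes in GiB; ZFS does the real accounting internally, in bytes):

```shell
# Toy illustration of why the creations above must fail: non-sparse
# ZVOLs reserve their full logical size, and the reservations must
# all fit within the pool's free space.
fits_in_pool() {
    pool_free_gib=$1; shift
    total=0
    for size in "$@"; do
        total=$((total + size))
    done
    if [ "$total" -le "$pool_free_gib" ]; then
        echo "ok: ${total}GiB reserved of ${pool_free_gib}GiB free"
    else
        echo "fail: want ${total}GiB, only ${pool_free_gib}GiB free"
    fi
}

fits_in_pool 20 10 10 10   # three 10GiB ZVOLs in a 20GiB pool
# -> fail: want 30GiB, only 20GiB free
fits_in_pool 20 30         # one 30GiB ZVOL in a 20GiB pool
# -> fail: want 30GiB, only 20GiB free
```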
However it is possible to make these allocations succeed. Also from the zfs(8) manpage:
“Though not recommended, a "sparse volume" (also known as "thin provisioning") can be created by specifying the -s option to the zfs create -V command, or by changing the reservation after the volume has been created. A "sparse volume" is a volume where the reservation is less than the volume size. Consequently, writes to a sparse volume can fail with ENOSPC when the pool is low on space. For a sparse volume, changes to volsize are not reflected in the reservation.”
So, given a 20GiB zpool, you can create a 30GiB zvol. You just have to accept that once the user has written to 20GiB of the blocks in the zvol, subsequent writes will fail.
It would be just like those fake, knock-off USB sticks you might have been unfortunate enough to come across in some backwater markets: they may tell your OS that they have a capacity of (say) 64GB, but once you’ve written more than (say) 1GB they strangely fail. It’s because they only had 1GB of blocks in them in the first place.
Snapshots and overcommitting
So now, let’s go back to JRS’ exact example. First, create a 20G zpool:
```
# truncate -s20G /tmp/blk.img
# zpool create -oashift=12 target /tmp/blk.img
```
Create a 15G zvol, fill all the blocks with data, and try to snapshot it:
```
# zfs create target/zvol -V 15G -o compress=off
# dd if=/dev/zero of=/dev/zvol/target/zvol bs=4096k
dd: error writing '/dev/zvol/target/zvol': No space left on device
3841+0 records in
3840+0 records out
16106127360 bytes (16 GB, 15 GiB) copied, 44.5358 s, 362 MB/s
# zfs snapshot target/zvol@1
cannot create snapshot 'target/zvol@1': out of space
```
Darn, it fails. But now we know it’s because ZFS wants to reserve 15G of space for writes to the ZVOL after the snapshot has been taken, but only 4.13G is available. Let’s look at the reservation:
```
# zfs list -r target
NAME          USED  AVAIL  REFER  MOUNTPOINT
target       15.5G  3.78G    96K  /target
target/zvol  15.5G  4.13G  15.1G  -
# zfs list -o name,reservation,refreservation,used,usedbydataset,usedbyrefreservation -r target
NAME         RESERV  REFRESERV   USED  USEDDS  USEDREFRESERV
target         none       none  15.5G     96K              0
target/zvol    none      15.5G  15.5G   15.1G           362M
```
Since this is a modern pool, you can see that ZFS sets the refreservation[2] property and not the older reservation property (in agreement with the manpage).
Now, if the volume had been created with the -s flag then it would have been given a refreservation of zero and the snapshot would have succeeded.
No matter: we can change it whenever we like. In fact, rather than put it all the way down to zero, let’s put it to 1G: this guarantees that we’ll be able to do 1G of new block writes/overwrites to this zvol - after which, we are at the mercy of free space in the zpool.
```
# zfs set refreservation=1G target/zvol
# zfs snapshot target/zvol@1
# zfs list target/zvol -r -t snap
NAME           USED  AVAIL  REFER  MOUNTPOINT
target/zvol@1     0      -  15.1G  -
```
Yay, our snapshot worked! If this snapshot is only temporary - say because we wanted to make a consistent point-in-time backup of the zvol - we can delete it once we’ve finished, and put the refreservation back up.
What’s the danger? There is only 4.13G of disk space available to accept writes of new data - of which 1G is reserved for sole use of our zvol, and 3.13G is shared with other users of the zpool.
```
# zfs list -o name,avail,reservation,refreservation,used,usedbydataset,usedbyrefreservation -r target
NAME         AVAIL  RESERV  REFRESERV   USED  USEDDS  USEDREFRESERV
target       3.13G    none       none  16.1G     96K              0
target/zvol  4.13G    none         1G  16.1G   15.1G             1G
```
If we are confident that the VM being backed up and the other pool users together won’t write more than 4.13G of blocks before we delete the snapshot, then we won’t run out of space, and everything will be fine. But of course, this is a call that only we can make: ZFS itself doesn’t know anything about how busy our VM is, so by default it takes the safe option, and reserves the full 15G for post-snapshot writes.
Finally, don’t forget to clean up:[3]
```
# zpool destroy target
# rm /tmp/blk.img
```
Overcommitting disk space is fairly common practice when you have lots of VMs. It can happen when using non-preallocated qcow2/vmdk/vdi files, or sparse raw image files. You may not even realise that you are overcommitting; you would have to add up the logical size of each of those files, and compare it to the total amount of disk space you have.
ZVOLs do this work for you, because you can easily see the size of each ZVOL. Furthermore, ZFS does the safe thing by default, which is to reserve space for each ZVOL to guarantee no overcommitment. But if you want to overcommit with ZVOLs - either temporarily while you run a snapshot backup, or long-term because you have lots of VMs - that’s possible too.
I think it would be helpful to write a simple tool which periodically scans your ZVOLs and sets:

```
refreservation = min(usedbydataset + 10% of volsize, volsize)
```
That is: every VM is guaranteed to be able to write to another 10% of its disk space. Even if some user suddenly allocates all remaining space in your zpool (perhaps a runaway VM?), all the other VMs will each still have at least 10% headroom to grow into while you sort out the problem.
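A minimal sketch of such a tool’s core calculation, with a hypothetical helper name and sizes in MiB (a real version would read the live byte values with `zfs get -Hp -o value usedbydataset,volsize` and apply the result with `zfs set refreservation=...`):

```shell
# Hypothetical sketch: compute the recommended refreservation as
# min(usedbydataset + volsize/10, volsize). Sizes here are in MiB;
# a real tool would work from the byte values reported by zfs get -Hp.
recommend_refreserv() {
    usedds=$1
    volsize=$2
    want=$((usedds + volsize / 10))
    if [ "$want" -gt "$volsize" ]; then
        want=$volsize
    fi
    echo "$want"
}

recommend_refreserv 4096 15360    # lightly used 15GiB zvol -> 5632
recommend_refreserv 15000 15360   # nearly full zvol -> capped at 15360
```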
Finally, always remember to turn on compression on all your ZFS datasets and zvols. In practice, the disk space used will almost always be lower than the disk space reserved.
[1] Unless you use the LVM thin provisioning feature.

[2] Why is the refreservation 15.5G for a 15G ZVOL? I think it’s to allow extra space for the metadata (inodes / checksums / indirect blocks).

[3] If you’re worried that zpool destroy doesn’t ask for confirmation, it’s because it can be undone:

```
# zpool import -d /tmp -D target
```