
Why ZVOLs Are Good After All

In his excellent series of articles on ZFS, Jim Salter points out a potential pitfall of ZVOLs: snapshots fail unless there is free space in your zpool equal to the size of the ZVOL.

“If you have a dataset that’s occupying 85% of your pool, you can snapshot that dataset any time you like. If you have a ZVOL that’s occupying 85% of your pool, you cannot snapshot it, period.”

He ends by saying:

“Think long and hard before you implement ZVOLs. Then, you know… don’t.”

Whilst it sounds like ZVOLs have some inherent design flaw, it turns out that this behaviour is intentional, documented, and designed for safety. But if you want the unsafe behaviour, you can easily get it too.

ZVOLs and reservations

The reason for the observed behaviour is described in the zfs(8) manpage, in the section about the volsize property:

For volumes, specifies the logical size of the volume. By default,
creating a volume establishes a reservation of equal size. For
storage pools with a version number of 9 or higher, a refreservation
is set instead. Any changes to volsize are reflected in an
equivalent change to the reservation (or refreservation).
...
The reservation is kept equal to the volume's logical size to
prevent unexpected behavior for consumers. Without the reservation,
the volume could run out of space, resulting in undefined behavior
or data corruption, depending on how the volume is used.

What does this mean?

Well, if you create a ZVOL of size 15GiB, what you are actually creating is a virtual block device of size 15GiB. The application that uses this block device (e.g. a virtual machine) is free to write to any of those blocks, whenever it feels like it.

If you were using a disk partition or an LVM logical volume, the actual blocks would be allocated up-front. In the case of LVM, it would allocate “extents”, normally 4MiB each, to make up the requested disk space [1].
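For example (a sketch; vg0 and vm0 are hypothetical volume group and volume names):

# vgdisplay vg0 | grep "PE Size"
# lvcreate -L 15G -n vm0 vg0

vgdisplay reports the extent size (typically 4.00 MiB), and lvcreate allocates all 3840 of the 15GiB volume’s extents on the spot, whether or not anything is ever written to them.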

ZFS does the allocation “lazily”: it doesn’t actually allocate space for a block until that block is written to. To guarantee that writes to the ZVOL can never fail for lack of space in the zpool, it therefore creates a reservation equal to the size of the ZVOL.
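You can watch both of these behaviours with a couple of commands (a sketch; tank/lazy stands in for any freshly created zvol):

# zfs create -V 15G tank/lazy
# zfs list -o name,volsize,refreservation,usedbydataset tank/lazy

Immediately after creation, usedbydataset is a handful of KiB, because nothing has been written yet, while refreservation is already the full 15G (a little more on recent versions, to cover metadata). The space hasn’t been allocated, but it is spoken for, so later writes cannot fail for lack of room.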

ZVOLs and snapshots

Let’s go back to our example of a 15GiB ZVOL. As the user writes data to disk, blocks are allocated. In the end, once all 15GiB has been written to, there is 15GiB allocated on disk, equal to the reservation. ZFS never overwrites data in place; but if the ZVOL user overwrites an existing block then the old version of the block can be freed up (garbage collected), so the total usage remains at 15GiB.

Now let’s suppose you take a snapshot of that ZVOL. The current 15GiB of data blocks are now cast in stone, immutable. But the user of the ZVOL could continue to write data, until in the worst case they write all 15GiB with new data, which will require allocating 15GiB of new space.

To allow this to happen, and still guarantee that the ZVOL cannot run out of space, a further 15GiB must be reserved at the moment the snapshot is taken. And if there isn’t 15GiB of free space available, ZFS refuses to take the snapshot.

This is conservative, safe behaviour. If ZFS didn’t do this, and allowed a snapshot to be taken with less than 15GiB of free space in the zpool, there would be the real risk that a virtual machine writing to the ZVOL could end up getting a write error due to lack of space in the zpool.
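You can inspect the numbers behind this decision before attempting a snapshot. The properties below are real; the rule of thumb, though, is only an approximation of the accounting ZFS performs internally (pool/somevol is a placeholder; the walkthrough below does this for real):

# zfs get refreservation,usedbydataset,available pool/somevol
# zfs list -o space pool/somevol

A snapshot pins every block the zvol currently references, and ZFS then has to find fresh space to back the refreservation on top of that. So, as a rough guide, if available is much smaller than the amount of data the snapshot would pin, expect the snapshot to be refused.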

Overcommitting storage with ZVOLs

Another way of thinking about this issue is: suppose you have created a 20GiB zpool (as in JRS’s article). What would you expect to happen if:

  1. You tried to create three separate 10GiB ZVOLs?
  2. You tried to create a single 30GiB ZVOL?

If you think those examples should fail, then you understand why ZFS is making reservations.
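You can try this for yourself with a throwaway pool (a sketch; the exact free space depends on pool overheads, but the refusals are the point):

# truncate -s20G /tmp/think.img
# zpool create think /tmp/think.img
# zfs create -V 10G think/vol1
# zfs create -V 10G think/vol2
# zfs create -V 30G think/vol3
# zpool destroy think
# rm /tmp/think.img

The first 10G volume reserves roughly half the pool. The second should be refused because there is no longer 10G free to reserve, and the 30G volume is refused for the same reason.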

However, it is possible to make these allocations succeed. Also from the zfs(8) manpage:

Though not recommended, a "sparse volume" (also known as "thin
provisioning") can be created by specifying the -s option to the
zfs create -V command, or by changing the reservation after the
volume has been created. A "sparse volume" is a volume where the
reservation is less than the volume size. Consequently, writes to a
sparse volume can fail with ENOSPC when the pool is low on space.
For a sparse volume, changes to volsize are not reflected in the
reservation.

So, given a 20GiB zpool, you can create a 30GiB zvol. You just have to accept that once the user has written to 20GiB of the blocks in the zvol, subsequent writes will fail.
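For example, on a 20GiB pool (a sketch; pool20/big is an illustrative name):

# zfs create -s -V 30G pool20/big
# zfs get refreservation pool20/big

The creation succeeds, and refreservation comes back as none: nothing is reserved, and it is now entirely up to you to make sure the pool doesn’t fill up underneath the volume.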

It would be just like those fake, knock-off USB sticks you might have been unfortunate enough to come across in some backwater market: they may tell your OS that they have a capacity of (say) 64GB, but once you’ve written more than (say) 1GB they strangely fail. It’s because they only had 1GB of blocks in them in the first place.

Snapshots and overcommitting

So now, let’s go back to JRS’s exact example. First, create a 20G zpool:

# truncate -s20G /tmp/blk.img
# zpool create -oashift=12 target /tmp/blk.img

Create a 15G zvol, fill all the blocks with data, and try to snapshot it:

# zfs create target/zvol -V 15G -o compress=off
# dd if=/dev/zero of=/dev/zvol/target/zvol bs=4096k
dd: error writing '/dev/zvol/target/zvol': No space left on device
3841+0 records in
3840+0 records out
16106127360 bytes (16 GB, 15 GiB) copied, 44.5358 s, 362 MB/s
# zfs snapshot target/zvol@1
cannot create snapshot 'target/zvol@1': out of space

Darn, it fails. But now we know it’s because ZFS wants to reserve 15G of space for writes to the ZVOL after the snapshot has been taken, but only 4.13G is available. Let’s look at the reservation:

# zfs list -r target
NAME          USED  AVAIL  REFER  MOUNTPOINT
target       15.5G  3.78G    96K  /target
target/zvol  15.5G  4.13G  15.1G  -
# zfs list -o name,reservation,refreservation,used,usedbydataset,usedbyrefreservation -r target
NAME         RESERV  REFRESERV   USED  USEDDS  USEDREFRESERV
target         none       none  15.5G     96K              0
target/zvol    none      15.5G  15.5G   15.1G           362M

Since this is a modern pool, you can see that ZFS sets the refreservation [2] property and not the older reservation property, in agreement with the manpage.

Now, if the volume had been created with the -s flag, it would have been given no refreservation at all, and the snapshot would have succeeded. No matter: we can change it whenever we like. In fact, rather than drop it all the way to zero, let’s set it to 1G: this guarantees that we’ll be able to do 1G of new block writes/overwrites to this zvol - after which, we are at the mercy of free space in the zpool.

# zfs set refreservation=1G target/zvol
# zfs snapshot target/zvol@1
# zfs list target/zvol -r -t snap
NAME            USED  AVAIL  REFER  MOUNTPOINT
target/zvol@1      0      -  15.1G  -

Yay, our snapshot worked! If this snapshot is only temporary - say because we wanted to make a consistent point-in-time backup of the zvol - we can delete it once we’ve finished, and put the refreservation back up.
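In that temporary-backup scenario, the clean-up might look like this (a sketch; 15.5G is the value the volume had before we lowered it, and newer OpenZFS releases also accept refreservation=auto to recompute the exact figure for you):

# zfs destroy target/zvol@1
# zfs set refreservation=15.5G target/zvol

The order matters: while the snapshot still exists there isn’t enough free space to back a full refreservation, which is the whole point of this exercise.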

What’s the danger? There is only 4.13G of disk space available to accept writes of new data - of which 1G is reserved for sole use of our zvol, and 3.13G is shared with other users of the zpool.

# zfs list -o name,avail,reservation,refreservation,used,usedbydataset,usedbyrefreservation -r target
NAME         AVAIL  RESERV  REFRESERV   USED  USEDDS  USEDREFRESERV
target       3.13G    none       none  16.1G     96K              0
target/zvol  4.13G    none         1G  16.1G   15.1G             1G

If we are confident that the VM being backed up and the other pool users together won’t write to more than 4.13G of blocks before we delete the snapshot, then we won’t run out of space, and everything will be fine. But of course, this is a judgement call that only we can make. ZFS itself doesn’t know anything about how busy our VM is, so by default it takes the safe option and reserves the full 15G for post-snapshot writes.

Finally, don’t forget to clean up [3]:

# zpool destroy target
# rm /tmp/blk.img

Bootnotes

Overcommitting disk space is fairly common practice when you have lots of VMs. It can happen when using non-preallocated qcow2/vmdk/vdi files, or sparse raw image files. You may not even realise that you are overcommitting; you would have to add up the logical size of each of those files, and compare it to the total amount of disk space you have.
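For example, with qcow2 images you would end up writing something like this (a sketch; the directory is whatever your hypervisor uses, and it assumes qemu-img and jq are installed):

for f in /var/lib/libvirt/images/*.qcow2; do
    qemu-img info --output=json "$f" | jq '."virtual-size"'
done | awk '{ sum += $1 } END { printf "%.1f GiB of virtual disk committed\n", sum / (1024*1024*1024) }'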

ZVOLs do this work for you, because you can easily see the size of each ZVOL. Furthermore, ZFS does the safe thing by default, which is to reserve space for each ZVOL to guarantee no overcommitment. But if you want to overcommit with ZVOLs - either temporarily while you run a snapshot backup, or long-term because you have lots of VMs - that’s possible too.

I think it would be helpful to write a simple tool which periodically scans your ZVOLs and sets

refreservation = min(usedbydataset + 10% of volsize, volsize)

That is: every VM is guaranteed to be able to write to another 10% of its disk space. Even if some user suddenly allocates all remaining space in your zpool (perhaps a runaway VM?), all the other VMs will each still have at least 10% headroom to grow into while you sort out the problem.
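A minimal sketch of such a tool, assuming it runs as root (e.g. from cron) and that every zvol on the system should be treated the same way:

#!/bin/sh
# For every zvol: refreservation = min(usedbydataset + 10% of volsize, volsize)
zfs list -Hp -t volume -o name,volsize,usedbydataset |
while read -r name volsize usedds; do
    want=$(( usedds + volsize / 10 ))
    if [ "$want" -gt "$volsize" ]; then
        want=$volsize
    fi
    zfs set refreservation="$want" "$name"
done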

Finally, always remember to turn on compression on all your ZFS datasets and zvols. In practice, the disk space used will almost always be lower than the disk space reserved.
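Compression only applies to blocks written after it is enabled, so it’s best set at creation time (a sketch; lz4 is a safe default on any reasonably recent ZFS):

# zfs create -V 15G -o compression=lz4 tank/vm1
# zfs get compressratio tank/vm1

compressratio then tells you how much smaller the data on disk is than the logical data the guest has written.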


  1. Unless you use the LVM thin provisioning feature.

  2. Why is refreservation 15.5G for a 15G ZVOL? I think it’s to allow extra space for the metadata (inodes / checksums / indirect blocks).

  3. If you’re worried that zpool destroy doesn’t ask for confirmation, it’s because it can be undone:

    # zpool import -d /tmp -D target
    