Comment 36 for bug 1906476

Trent Lloyd (lathiat) wrote :

I have created a 100% reliable reproducer test case and determined that the Ubuntu-specific patch 4701-enable-ARC-FILL-LOCKED-flag.patch, added to fix Bug #1900889, is likely the cause.

[Test Case]

The important parts are:
- Use encryption
- rsync the zfs git tree
- Use parallel I/O from silversearcher-ag to access it after a reboot. A simple "find ." or "find . -exec cat {} > /dev/null \;" does not reproduce the issue.

Reproduction was done using a libvirt VM installed from the Ubuntu Impish daily livecd, with a normal ext4 root and a second 4GB disk, /dev/vdb, used for zfs later.

= Preparation
apt install silversearcher-ag git zfs-dkms zfsutils-linux
echo -n testkey2 > /root/testkey
git clone https://github.com/openzfs/zfs /root/zfs

= Test Execution
zpool create test /dev/vdb
zfs create test/test -o encryption=on -o keyformat=passphrase -o keylocation=file:///root/testkey
rsync -va --progress -HAX /root/zfs/ /test/test/zfs/

# If you access the data now it works fine.
reboot

zfs load-key test/test
zfs mount -a
cd /test/test/zfs/
ag DISKS=

= Test Result
ag hangs, "sudo dmesg" shows an exception
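
Because the failure mode is a hang rather than an error exit, a scripted run needs some way to notice that ag never returns. Below is a minimal sketch of one way to do that; the helper name finishes_within and the 60-second deadline are assumptions of mine, not part of the original test. It polls for a marker file instead of waiting on the process, so the check itself does not block if ag is stuck in the kernel.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original test case):
# run a command in the background and report whether it finished
# within a deadline given in seconds.
# Returns 0 if the command finished in time, 1 if it is still running.
finishes_within() {
    deadline=$1; shift
    marker=$(mktemp)
    # The subshell writes the command's exit status to the marker file
    # once the command ends; a hung command never writes it.
    ( "$@" > /dev/null 2>&1; echo "$?" > "$marker" ) &
    elapsed=0
    while [ "$elapsed" -lt "$deadline" ]; do
        if [ -s "$marker" ]; then rm -f "$marker"; return 0; fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    # Final check once the deadline has passed.
    if [ -s "$marker" ]; then rm -f "$marker"; return 0; fi
    return 1
}

# Assumed usage after the reboot (60 seconds is an arbitrary choice):
#   zfs load-key test/test && zfs mount -a
#   cd /test/test/zfs/
#   if finishes_within 60 ag DISKS=; then
#       echo "ag completed"
#   else
#       echo "ag still running after 60s, check 'sudo dmesg'"
#   fi
```

On a timeout the helper deliberately leaves the background process and the marker file alone, since a process blocked in the kernel may not respond to signals anyway.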

[Analysis]
I rebuilt the zfs-linux 2.0.6-1ubuntu1 package from ppa:colin-king/zfs-impish without the Ubuntu-specific patch ubuntu/4701-enable-ARC-FILL-LOCKED-flag.patch, which fixed Bug #1900889. With this patch disabled, the issue does not reproduce. With the patch re-enabled, it reproduces reliably every time again.

It seems this bug was never sent upstream. As far as I can see, no upstream code changes setting the ARC_FILL_IN_PLACE flag have been added since. Interestingly, however, the code for this ARC_FILL_IN_PLACE handling was added to fix a similar-sounding issue, "Raw receive fix and encrypted objset security fix", in https://github.com/openzfs/zfs/commit/69830602de2d836013a91bd42cc8d36bbebb3aae . This first shipped in zfs 0.8.0 and the original bug was filed against 0.8.3.

I have also found the same issue as the original Launchpad bug reported upstream, with a lot of discussion but no fixes (and quite a few duplicates linking back to 11679):
https://github.com/openzfs/zfs/issues/11679
https://github.com/openzfs/zfs/issues/12014

I don't fully understand the ZFS code in relation to this flag, but the code at https://github.com/openzfs/zfs/blob/ce2bdcedf549b2d83ae9df23a3fa0188b33327b7/module/zfs/arc.c#L2026 talks about how this flag relates to decrypting blocks in the ARC and doing so 'inplace'. It therefore makes some sense that I need encryption to reproduce the issue, that it reproduces best after a reboot (which flushes the ARC), and that I can still read the data in the test case before the reboot but not after it.

This patch was added in 0.8.4-1ubuntu15 and I first experienced the issue somewhere between 0.8.4-1ubuntu11 and 0.8.4-1ubuntu16.

So it all adds up and I suggest that this patch should be reverted.