====== PANIC: rpool: blkptr at 00000000a44c5bb3 DVA 0 has invalid OFFSET 18388167655883276288 ======

Three similar errors were caught:
  * 2021-04-18: ''PANIC: rpool: blkptr at 000000003a3d1018 DVA 0 has invalid OFFSET 18388167655883276288''
  * 2021-04-19: ''PANIC: rpool: blkptr at 000000009897c6f0 DVA 0 has invalid OFFSET 18388167655883276288''
  * 2021-05-09: ''PANIC: rpool: blkptr at 00000000a44c5bb3 DVA 0 has invalid OFFSET 18388167655883276288''


A DVA (Data Virtual Address) is made up of:
  * a 32-bit integer representing the VDEV,
  * followed by a 63-bit integer representing the offset.

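The reported offset is easier to judge in hexadecimal. A minimal sketch, assuming the value in the PANIC message is a byte offset and that the shell's ''printf'' accepts unsigned 64-bit numbers (bash and GNU coreutils do); ''dva_offset_hex'' is a hypothetical helper name, not part of ZFS:

<code bash>
# Convert the decimal OFFSET from the PANIC message to hexadecimal.
dva_offset_hex() {
    printf '%x\n' "$1"
}

dva_offset_hex 18388167655883276288
</code>

The result, ''ff2fe50eef400000'', has its highest bit set. Read as a byte offset, that is nearly 16 EiB into the vdev, far beyond any real disk, which would explain why ZFS rejects the block pointer.
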
===== Important steps =====

**Stop the scrub to prevent a panic loop** (if the scrub reads the corrupted data, the kernel will panic again):
<code bash>
zpool scrub -s rpool
</code>

**(Optional) Turn the ZFS PANIC into a WARNING:**
<code bash>
echo 1 > /sys/module/zfs/parameters/zfs_recover
</code>
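
A value written to ''/sys'' only lasts until the module is reloaded or the host reboots. A small sketch for checking the current value follows. ''zfs_recover_state'' is a hypothetical helper, and its optional path argument exists only so it can be tried on a machine without ZFS loaded. The persistent variant in the trailing comment assumes module options are kept in ''/etc/modprobe.d/zfs.conf''; on a root-on-ZFS system the initramfs probably has to be rebuilt as well.

<code bash>
# Print the current zfs_recover value, or report that the module is not loaded.
zfs_recover_state() {
    param="${1:-/sys/module/zfs/parameters/zfs_recover}"
    if [ -r "$param" ]; then
        echo "zfs_recover=$(cat "$param")"
    else
        echo "zfs module not loaded"
        return 1
    fi
}

# Persistent setting (assumed file name, adjust to the distribution):
#   echo "options zfs zfs_recover=1" >> /etc/modprobe.d/zfs.conf
</code>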


Dump the whole history of the pool:
<code bash>
zdb -hh rpool
</code>

Walk all blocks of the pool with ''zdb'' and verify checksums. Options used:

  * ''-bb'': Display statistics regarding the number, size (logical, physical and allocated) and deduplication of blocks. (Verbosity 2)
  * ''-c'': Verify the checksum of all metadata blocks while printing block statistics (see: ''-b''). If specified multiple times, verify the checksums of all blocks. 
  * ''-s'': Report statistics on zdb I/O. Display operation counts, bandwidth, and error counts of I/O
  * ''-L'': Disable leak detection and the loading of space maps. By default, zdb verifies that all non-free blocks are referenced, which can be very expensive.
  * ''-AAA'': Do not abort if asserts fail and also enable panic recovery.


<code bash>
zdb -AAA -bbbcsvL <zfs_pool>
</code>

Sometimes it reports errors while reading blocks (on Linux, error 52 is ''ECKSUM'', a checksum mismatch):

<code>
zdb_blkptr_cb: Got error 52 reading <259, 75932, 0, 17> DVA[0]=<0:158039e9000:6000> [L0 ZFS plain file] fletcher4 lz4 unencrypted LE contiguous unique single size=20000L/6000P birth=62707L/62707P fill=1 cksum=516dd1ace1c:414cbfc202333b:af36411a2766c4f:7bc4d6777673687b -- skipping
</code>
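
The tuple in angle brackets identifies the failing block as ''<objset, object, level, blkid>''. A minimal sketch for pulling these numbers out of saved ''zdb'' output follows. ''parse_zdb_error'' is a hypothetical helper, and the pattern assumes the message format shown above:

<code bash>
# Read zdb output on stdin and print the bookmark of every failed block.
parse_zdb_error() {
    sed -n 's/.*Got error [0-9]* reading <\([0-9]*\), \([0-9]*\), \([0-9]*\), \([0-9]*\)>.*/objset=\1 object=\2 level=\3 blkid=\4/p'
}

# Example with the line from above (shortened):
echo 'zdb_blkptr_cb: Got error 52 reading <259, 75932, 0, 17> DVA[0]=<0:158039e9000:6000> -- skipping' | parse_zdb_error
</code>

The objset number corresponds to the dataset ID listed by ''zdb -d rpool''. The object can then be inspected with ''zdb -ddddd <dataset> <object>'', which for plain files typically also prints the path.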


====== Find problematic file / volume ======

List all ZFS volumes:
<code bash>
zfs list
</code>

Then try to read each volume in full:
<code bash>
zfs send rpool/data/vm-703-disk-1 | pv > /dev/null
</code>
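
With many volumes it is easy to lose track of which one was being read when the PANIC hit. A sketch that writes each name to a log before reading it follows. ''read_all_datasets'' is a hypothetical helper and the log path in the usage comment is an assumption; the log should live on a disk outside ''rpool'':

<code bash>
# Read dataset names from stdin and run the given reader command on each one.
# The name is logged and synced BEFORE the read starts, so after a PANIC the
# last line of the log names the dataset that was being read.
read_all_datasets() {
    log="$1"; shift
    while IFS= read -r ds; do
        echo "reading $ds" >> "$log"
        sync
        "$@" "$ds" > /dev/null || echo "FAILED $ds" >> "$log"
    done
}

# Usage on the affected host:
#   zfs list -H -o name -t filesystem,volume -r rpool |
#       read_all_datasets /mnt/usb/zfs-read.log zfs send
</code>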
In this case the problematic volume was a replicated (received) volume from another ZFS node.
When the PANIC hits, reboot the system via SysRq to avoid hanging on blocked I/O:
<code>
echo s > /proc/sysrq-trigger   # sync all mounted filesystems
echo b > /proc/sysrq-trigger   # reboot immediately
</code>

After power-up, delete the problematic ZFS volume. During the deletion the PANIC happens again.
The deletion is recorded in the ZFS journal, so on mount ZFS tries to replay the pending deletion, which causes the PANIC again.
The system is stuck in a boot loop.

  * Boot from a live USB with ZFS support (Ubuntu ships ZoL 0.8.3).
  * Stop the ''zfs-zed'' service.
  * ''rmmod zfs''
  * ''modprobe zfs zfs_recover=1''

This does not help. ZFS now raises a warning instead of a panic, but then reports an unrecoverable error and suspends the pool.

The last option is to boot a live system from USB, import the pool read-only, copy all data to another zpool and then recreate ''rpool'':
<code bash>
mkdir /rpool
zpool import -f -o readonly=on -R /rpool rpool
</code>

  * ''-e'': Operate on an exported pool.
  * ''-L'': Disable leak detection and the loading of space maps. By default, zdb verifies that all non-free blocks are referenced, which can be very expensive.
<code bash>
zdb -e -bcsvL rpool
</code>


====== Resources ======
  * https://github.com/openzfs/zfs/issues/3990
  * https://github.com/google/rowhammer-test
  * [[https://github.com/openzfs/zfs/issues/5520|ztest zdb hits zdb_blkptr_cb error leading to leak and zdb error exit with no core file]]
  * [[https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=235683|[zfs] Panic during data access or scrub on 12.0-STABLE r343904 (blkptr at <addr> DVA 0 has invalid OFFSET)]]
  * [[https://forums.freebsd.org/threads/zfs-kernel-panic-solaris-panic-blkptr-at-0xfffffe006bf0d4a8-dva-0-has-invalid-offset-70368744177664.76129/|ZFS kernel panic: Solaris(panic): blkptr at 0xfffffe006bf0d4a8 DVA 0 has invalid OFFSET 70368744177664]]

<code>
ZFS scrub repairs only blocks with bad checksums or otherwise unreadable blocks.
ZFS scrub does not analyze and thus does not fix any logical inconsistencies.
If bad data has a correct checksum, then at present ZFS cannot fix it. Sometimes it can recognize that the data is bad and report an error, sometimes it has no option but to panic, but sometimes it cannot even tell if it's bad data.
</code>