[lustre-discuss] File size anomaly on Lustre Filesystem

Nick Skingle Nick.Skingle at pgs.com
Tue Jan 14 02:56:08 PST 2020


Hi All,

We are seeing an anomaly across all of our RaidInc Lustre filesystems.

Problem description:
File size < on-disk size - currently unexplained; the size on disk is 2-3x the apparent file size (e.g. 237G allocated vs. 104G apparent for trace_data.bin below).
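
(For reference, the per-file discrepancy can also be seen with stat, which reports both the apparent size and the number of allocated blocks. This is a sketch using GNU coreutils stat on a client, not output captured during the investigation:

$ stat -c 'apparent=%s bytes  allocated=%b blocks of %B bytes' /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_2copy/trace_data.bin

where %s is the apparent size, %b the number of allocated blocks and %B the size of each such block, typically 512 bytes.)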

Observations:

  1.  Possible ZFS filesystem corruption across the RaidInc storage in London?
  2.  When zdb checks for leaks, it walks the entire block tree, constructs the space maps in memory, and then compares them to the ones stored on disk. If they differ, it reports the leak.
     *   Presuming from the investigation below that the "space leaks" mean the pool is corrupted somehow; zdb (ZFS debug) has detected a large number of them.
  3.  zdb did not report space leaks on the Houston ZFS SIs.
  4.  Does zdb reporting leaked space mean trouble with the pool, and could it explain the file size < disk size discrepancy?
  5.  Is it possible that errors were introduced during failover or by hardware faults? (See the check sketched after this list.)
  6.  At the very least the filesystem appears inconsistent, which is never supposed to happen with ZFS. Is this indicative of a larger problem (numerous lockups, etc.)?
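
One additional check, sketched here rather than taken from the investigation below, is relevant to points 5 and 6: zpool status -v on the affected OSS reports any read/write/checksum errors and recent scrub/resilver activity, i.e. whether ZFS itself has noticed corruption on the pool.

[root at lsi022-oss6 ~]# zpool status -v lsi022-OST21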

Investigation:
For troubleshooting, the following dataset, located in WEY, was selected. There are no snapshots, reservations or quotas involved here.

[lconnect03]</users/jerome.cousin>$ du -h --apparent-size /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_2copy/*
33K    /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_2copy/aux_data
19K    /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_2copy/descriptor.yaml
104G   /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_2copy/trace_data.bin
14G    /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_2copy/trace_header.bin

[lconnect03]</users/jerome.cousin>$ du -h /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_2copy/*
33K    /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_2copy/aux_data
56K    /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_2copy/descriptor.yaml
237G   /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_2copy/trace_data.bin
31G    /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_2copy/trace_header.bin


  1.  Copy of the dataset onto the same storage.
     *   Disk size is different: the copy's on-disk usage (99G / 13G below) is close to its apparent size (104G / 14G), unlike the original (237G / 31G).
     *   Checksums match.

[lconnect03]</users/jerome.cousin>$ cp -rp /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_2copy /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_JC

[lconnect03]</users/jerome.cousin>$ md5sum  /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_2copy/*
md5sum: /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_2copy/aux_data: Is a directory
f861b60d2b1b844e5ae252345aa20497  /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_2copy/descriptor.yaml
e8ac57c241e52b38b60907e4e767b451  /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_2copy/trace_data.bin
0826bc74e525697d769248aabcb195cd  /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_2copy/trace_header.bin

[lconnect03]</users/jerome.cousin>$  md5sum  /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_JC/*
md5sum: /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_JC/aux_data: Is a directory
f861b60d2b1b844e5ae252345aa20497  /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_JC/descriptor.yaml
e8ac57c241e52b38b60907e4e767b451  /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_JC/trace_data.bin
0826bc74e525697d769248aabcb195cd  /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_JC/trace_header.bin

[lconnect03]</users/jerome.cousin>$ du -h  /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_JC/*
33K    /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_JC/aux_data
56K    /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_JC/descriptor.yaml
99G    /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_JC/trace_data.bin
13G    /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_JC/trace_header.bin

[lconnect03]</users/jerome.cousin>$ du -h --apparent-size /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_JC/*
33K    /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_JC/aux_data
19K    /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_JC/descriptor.yaml
104G   /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_JC/trace_data.bin
14G    /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_JC/trace_header.bin



  2.  Print the OST hosting the given file (via a local wrapper script; see the lfs getstripe sketch after the output).
[lconnect01]</users/jerome.cousin>$ ./lustre-find-ost-for-file /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_2copy/trace_data.bin
15
/lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_2copy/trace_data.bin: ['lsi022-OST000f'] (lsi022-oss6.lon.compute.pgs.com)
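
(lustre-find-ost-for-file is a local wrapper script; the same information is available from the standard Lustre client command lfs getstripe, sketched below but not captured during this investigation. Plain lfs getstripe prints the file's layout including the obdidx of each object, and -i prints just the starting OST index, which is 15 = lsi022-OST000f here.)

[lconnect01]</users/jerome.cousin>$ lfs getstripe -i /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_2copy/trace_data.bin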


  3.  Run zdb to check for leaks.
[root at lsi022-oss6 ~]# zfs list
NAME                          USED  AVAIL  REFER  MOUNTPOINT
lsi022-OST17                 48.3T  18.5T   219K  none
lsi022-OST17/lsi022-OST0005  48.3T  18.5T  48.3T  none
lsi022-OST19                 49.5T  17.3T   219K  none
lsi022-OST19/lsi022-OST0009  49.5T  17.3T  49.5T  none
lsi022-OST21                 47.3T  19.5T   219K  none
lsi022-OST21/lsi022-OST000f  47.3T  19.5T  47.3T  none
lsi022-OST23                 51.1T  15.7T   219K  none
lsi022-OST23/lsi022-OST0013  51.1T  15.7T  51.1T  none

[root at lsi022-oss6 ~]# zdb -b lsi022-OST21
Traversing all blocks to verify nothing leaked ...

loading space map for vdev 0 of 1, metaslab 180 of 181 ...
62.0T completed (12801MB/s) estimated time remaining: 0hr 00min 07sec
leaked space: vdev 0, offset 0x1d80003de000, size 1081344
[...]
See attachment.

Would someone be able to advise, please?

Thanks
Nick

Attachment: zdb -b lsi022-OST21.txt
URL: <http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20200114/91f93bf0/attachment-0001.txt>

