[lustre-discuss] Unexpected used guard number

Andreas Dilger adilger at whamcloud.com
Tue Jun 4 10:35:35 PDT 2024


The difference between your Intel and AMD nodes may be the RPC checksum type that is used by default (the clients and servers negotiate the fastest algorithm).

I suspect the checksum error itself has already been fixed, but in the meantime you could try setting a checksum type other than t10ip4k (or whichever one you are using; compare "lctl get_param osc.*.checksum_type" on your Intel vs. AMD clients).
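
A minimal sketch of that comparison, to be run on one Intel and one AMD client. It assumes the active algorithm is the entry shown in [brackets] in the get_param output; the helper name active_checksum is made up for this example, and crc32c in the commented-out workaround is only an assumed alternative, so pick whichever type your clients actually list.

```shell
#!/bin/sh
# Print "<osc device> <active checksum type>" for each OSC device.
# Assumption: the active algorithm is the one shown in [brackets].
active_checksum() {
    sed -n 's/^\(osc\.[^=]*\)\.checksum_type=.*\[\([^]]*\)].*$/\1 \2/p'
}

# Only query Lustre when lctl is actually installed on this node.
if command -v lctl >/dev/null 2>&1; then
    # Run this on an Intel client and on an AMD client, then compare.
    lctl get_param 'osc.*.checksum_type' | active_checksum

    # Temporary workaround (as root): select a different algorithm.
    # crc32c is an assumed example; use a type listed in the output above.
    # lctl set_param 'osc.*.checksum_type=crc32c'
fi
```

If the Intel clients report t10ip4k and the AMD clients report something else, that would be consistent with the difference described above.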

Cheers, Andreas

On Jun 3, 2024, at 08:21, Fokke Dijkstra via lustre-discuss <lustre-discuss at lists.lustre.org> wrote:

Dear all,

We are frequently (about daily) seeing the following type of error in our logfile on some specific client nodes:

Jun  1 11:03:17 a100gpu1 kernel: LustreError: 3834:0:(integrity.c:66:obd_page_dif_generate_buffer()) scratch-OST0042-osc-ff35febc655a9000: unexpected used guard number of DIF 5/5, data length 4096, sector size 512: rc = -7
Jun  1 11:03:17 a100gpu1 kernel: LustreError: 3834:0:(osc_request.c:2750:osc_build_rpc()) prep_req failed: -7
Jun  1 11:03:17 a100gpu1 kernel: LustreError: 3834:0:(osc_cache.c:2186:osc_check_rpcs()) Write request failed with -7

We are running Lustre 2.15.4 over Ethernet on Rocky 8 servers and clients.
The error only appears on the clients; nothing is logged on the servers around that time.

The errors mostly appear on our Intel Ice Lake based GPU nodes and less frequently on Intel Ice Lake based CPU nodes. We do not see the errors on our AMD Zen 3 nodes (which form the majority of our cluster).

The problem was brought to our attention by a few users running PyTorch code on the GPU nodes, who complained that PyTorch reported an error while writing a file and then failed.
When checking the log files, the error appears to occur more often than those reports suggest, and I can't find a clear correlation with specific job types or with job failures (some jobs seem to continue running after the error appears in the system log file).
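
One way to look for such a correlation is to tally the errors per client and per OSC device. The sketch below assumes syslog lines in the format quoted above; the function name count_guard_errors is made up for this example, and /var/log/messages is an assumed log location.

```shell
#!/bin/sh
# Count "unexpected used guard number" errors per client host and OSC device.
# Expected line format (as in the log excerpt quoted above):
#   <Mon> <day> <time> <host> kernel: LustreError: ... <device>: unexpected ...
count_guard_errors() {
    awk '/unexpected used guard number/ {
        host = $4
        dev = "unknown"
        # The OSC device is the first field containing "-osc-".
        for (i = 5; i <= NF; i++)
            if ($i ~ /-osc-/) { dev = $i; sub(/:$/, "", dev); break }
        n[host " " dev]++
    }
    END { for (k in n) print n[k], k }' "$@" | sort -k1,1nr -k2
}

# /var/log/messages is an assumption; adjust for your syslog setup.
if [ -r /var/log/messages ]; then
    count_guard_errors /var/log/messages
fi
```

Each output line is a count followed by the host and the device, most frequent first, which can then be matched against job records for the same nodes.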

Has anyone seen this error before? Does somebody know how to fix this?

Kind regards,

Fokke Dijkstra

--
Fokke Dijkstra <f.dijkstra at rug.nl>
Team High Performance Computing
Center for Information Technology, University of Groningen
Postbus 11044, 9700 CA  Groningen, The Netherlands

--
Andreas Dilger
Lustre Principal Architect
Whamcloud