[lustre-discuss] Unexpected used guard number
Fokke Dijkstra
f.dijkstra at rug.nl
Mon Jun 3 07:21:02 PDT 2024
Dear all,
We are frequently (about daily) seeing the following type of error in our
logfile on some specific client nodes:
Jun 1 11:03:17 a100gpu1 kernel: LustreError: 3834:0:(integrity.c:66:obd_page_dif_generate_buffer()) scratch-OST0042-osc-ff35febc655a9000: unexpected used guard number of DIF 5/5, data length 4096, sector size 512: rc = -7
Jun 1 11:03:17 a100gpu1 kernel: LustreError: 3834:0:(osc_request.c:2750:osc_build_rpc()) prep_req failed: -7
Jun 1 11:03:17 a100gpu1 kernel: LustreError: 3834:0:(osc_cache.c:2186:osc_check_rpcs()) Write request failed with -7
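For anyone who wants to pull the numbers out of these messages, here is a
minimal Python sketch. It assumes each message sits on a single line, as it
does in the syslog itself. The group names (host, target, used, total, ...)
are my own labels, not Lustre's field names, and reading "5/5" as used/total
guard tags is an assumption based only on the wording of the message.

```python
import errno
import re

# Pattern for the integrity.c message quoted above. Group names are
# illustrative labels chosen for this sketch, not Lustre's own names.
DIF_RE = re.compile(
    r"(?P<host>\S+) kernel: LustreError:\s+"
    r"(?P<pid>\d+):\d+:\(integrity\.c:\d+:obd_page_dif_generate_buffer\(\)\)\s+"
    r"(?P<target>\S+): unexpected used guard number of DIF\s+"
    r"(?P<used>\d+)/(?P<total>\d+), data length (?P<data_len>\d+), "
    r"sector size (?P<sector_size>\d+): rc = (?P<rc>-?\d+)"
)


def parse_dif_error(line):
    """Return the fields of one DIF guard error as a dict, or None."""
    m = DIF_RE.search(line)
    if m is None:
        return None
    fields = m.groupdict()
    for key in ("pid", "used", "total", "data_len", "sector_size", "rc"):
        fields[key] = int(fields[key])
    return fields


# The message from the log excerpt above, joined onto one line.
sample = (
    "Jun 1 11:03:17 a100gpu1 kernel: LustreError: "
    "3834:0:(integrity.c:66:obd_page_dif_generate_buffer()) "
    "scratch-OST0042-osc-ff35febc655a9000: unexpected used guard number of DIF "
    "5/5, data length 4096, sector size 512: rc = -7"
)

info = parse_dif_error(sample)
sectors = info["data_len"] // info["sector_size"]  # sectors in the data buffer
err_name = errno.errorcode.get(-info["rc"])        # symbolic name of the errno
print(info["host"], info["target"], info["used"], info["total"], sectors, err_name)
```

For the message above this gives 8 sectors (4096 / 512) against the 5/5 guard
count, and errno 7 is E2BIG on Linux.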
We are running Lustre 2.15.4 over Ethernet on Rocky 8 servers and clients.
The error only appears on the clients; nothing is logged on the servers
around that time.
The errors mostly appear on our Intel Ice Lake based GPU nodes and less
frequently on our Intel Ice Lake based CPU nodes. We do not see the errors
on our AMD Zen 3 nodes, which form the majority of our cluster.
The problem was brought to our attention by a few users running PyTorch
code on the GPU nodes, who complained that PyTorch reported an error while
writing a file and then failed.
When checking the log files, the error appears to occur more often than
these user reports suggest, and I can't find a clear correlation with
specific job types or with job failures (some jobs seem to continue running
after the error appears in the system log file).
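To look for such a correlation, one option is to tally the errors per node
and per day and compare the result with the job records. The sketch below is
only an illustration: the regular expression assumes the traditional syslog
timestamp format shown above, tally_dif_errors is a hypothetical helper, and
the sample input is just the three messages from this mail.

```python
import re
from collections import Counter

# Matches only the DIF guard error, not the follow-up osc_* messages.
LINE_RE = re.compile(
    r"^(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+\d{2}:\d{2}:\d{2}\s+"
    r"(?P<host>\S+)\s+kernel: LustreError:"
    r".*unexpected used guard number of DIF"
)


def tally_dif_errors(lines):
    """Count DIF guard errors per host and per (month, day)."""
    per_host = Counter()
    per_day = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            per_host[m["host"]] += 1
            per_day[(m["month"], int(m["day"]))] += 1
    return per_host, per_day


# The three messages from the log excerpt, each joined onto one line.
log_lines = [
    "Jun 1 11:03:17 a100gpu1 kernel: LustreError: "
    "3834:0:(integrity.c:66:obd_page_dif_generate_buffer()) "
    "scratch-OST0042-osc-ff35febc655a9000: unexpected used guard number of DIF "
    "5/5, data length 4096, sector size 512: rc = -7",
    "Jun 1 11:03:17 a100gpu1 kernel: LustreError: "
    "3834:0:(osc_request.c:2750:osc_build_rpc()) prep_req failed: -7",
    "Jun 1 11:03:17 a100gpu1 kernel: LustreError: "
    "3834:0:(osc_cache.c:2186:osc_check_rpcs()) Write request failed with -7",
]

per_host, per_day = tally_dif_errors(log_lines)
print(dict(per_host))
print(dict(per_day))
```

In practice the lines would come from the collected client logs instead of
the inline list; where those logs live is site-specific.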
Has anyone seen this error before? Does somebody know how to fix this?
Kind regards,
Fokke Dijkstra
--
Fokke Dijkstra <f.dijkstra at rug.nl>
Team High Performance Computing
Center for Information Technology, University of Groningen
Postbus 11044, 9700 CA Groningen, The Netherlands