[lustre-discuss] lustre and pytorch
Oleg Drokin
green at whamcloud.com
Thu Jul 11 14:24:40 PDT 2024
On Thu, 2024-07-11 at 12:23 -0400, Michael DiDomenico via lustre-
discuss wrote:
> i have a strange problem, but honestly i'm not sure it's a lustre
> issue. but i figure i'll try here. we have users running LLM models
> through pytorch. part of the process saves off checkpoints at
> periodic intervals. when the checkpoint files are being written we
> can see in the logs the pytorch writing out the save files from each
> of the processes.
>
> it chugs along for a little bit, but then comes to a grinding halt.
> no error from pytorch is logged and no errors can be found on the
> lustre clients or servers. the problem is also not transient, it
> happens every time the process runs.
Does it ever resume, or does it stop-stop? If there is a hard stop after
which the thing is killed, how long does it sit before that?
Are the writes synchronous? Can you collect lustre debug logs from one
of the clients with the +vfstrace+cache+rpctrace+inode debug mask,
ideally while the hang is happening?
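A rough sketch of that collection, wrapped in Python since that is what the job already runs. It assumes root on a client, that `lctl set_param debug=...` accepts a space-separated list of "+flag" tokens, and that `lctl clear` and `lctl dk <file>` empty and dump the kernel debug buffer. The `wait_for_hang` callback and the dump path are hypothetical placeholders, not anything from your setup.

```python
import subprocess

# Debug flags named in the message above.
DEBUG_FLAGS = ["vfstrace", "cache", "rpctrace", "inode"]


def build_commands(dump_path, flags=DEBUG_FLAGS):
    """Return the three lctl invocations, in order, as argv lists."""
    # Assumption: the debug parameter takes space-separated "+flag" tokens.
    mask = " ".join("+" + flag for flag in flags)
    enable = ["lctl", "set_param", "debug=" + mask]
    clear = ["lctl", "clear"]         # empty the debug buffer before reproducing
    dump = ["lctl", "dk", dump_path]  # write the debug buffer to a file afterwards
    return enable, clear, dump


def collect(dump_path, wait_for_hang):
    """Enable the mask, wait for the stall, then dump the log (needs root)."""
    enable, clear, dump = build_commands(dump_path)
    subprocess.run(enable, check=True)
    subprocess.run(clear, check=True)
    wait_for_hang()  # hypothetical callback: block until the checkpoint write stalls
    subprocess.run(dump, check=True)
```

The debug buffer is a ring, so the dump should be taken soon after the stall starts, otherwise the interesting part may already have been overwritten.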
How many files are there? I assume there's only a limited number of
processes per node?
Were obvious things like "a bunch of nodes writing into the same file
in O_APPEND mode" already eliminated? (or not in O_APPEND, but
doing truncates in between)
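One way to check the O_APPEND and synchronous-write questions on a live process, as a sketch only: on Linux, /proc/<pid>/fdinfo/<fd> has a "flags:" line with the open flags in octal. The helper names below are made up for illustration, and truncates would not show up this way since O_TRUNC is not kept in the file flags after open.

```python
import os


def parse_fdinfo_flags(fdinfo_text):
    """Extract the octal 'flags:' value from the text of /proc/<pid>/fdinfo/<fd>."""
    for line in fdinfo_text.splitlines():
        if line.startswith("flags:"):
            return int(line.split(":", 1)[1].strip(), 8)
    raise ValueError("no 'flags:' line found")


def describe_flags(flags):
    """Name the open flags relevant to the append/sync questions."""
    names = []
    if flags & os.O_APPEND:
        names.append("O_APPEND")
    # On Linux O_SYNC includes the O_DSYNC bit, so test for the full value first.
    if (flags & os.O_SYNC) == os.O_SYNC:
        names.append("O_SYNC")
    elif flags & os.O_DSYNC:
        names.append("O_DSYNC")
    return names


def flags_of(pid, fd):
    """Read and decode the flags of one open descriptor of a running process."""
    with open(f"/proc/{pid}/fdinfo/{fd}") as f:
        return describe_flags(parse_fdinfo_flags(f.read()))
```

Running `flags_of(pid, fd)` over the checkpoint file descriptors of each writer (found via /proc/<pid>/fd) would show whether any of them has the file open in append or sync mode.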
Also what version are you running?