[lustre-discuss] lustre and pytorch

Thu Jul 11 14:24:40 PDT 2024

On Thu, 2024-07-11 at 12:23 -0400, Michael DiDomenico via lustre-
discuss wrote:
> i have a strange problem, but honestly i'm not sure it's a lustre
> issue.  but i figure i'll try here.  we have users running LLM models
> through pytorch.  part of the process saves off checkpoints at
> periodic intervals.  when the checkpoint files are being written we
> can see in the logs the pytorch writing out the save files from each
> of the processes.
> 
> it chugs along for a little bit, but then comes to a grinding halt.
> no error from pytorch is logged and no errors can be found on the
> lustre clients or servers.  the problem is also not transient, it
> happens every time the process runs

Does it ever resume, or does it stop for good? If there is a hard stop
after which the thing is killed, how long does that take?
Are the writes synchronous? Can you collect Lustre debug logs from one
of the clients with the +vfstrace+cache+rpctrace+inode debug mask,
ideally while the hang is happening?
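
For reference, a rough sketch of how that capture could be scripted on one client. The helper names, the dump path and the 1024 MB buffer size are only assumptions for illustration; the lctl subcommands themselves (set_param debug=+flag, set_param debug_mb, clear, dk) are the standard ones, and they need root on the client.

```python
import subprocess

# Debug flags from the mask requested above.
DEBUG_FLAGS = ["vfstrace", "cache", "rpctrace", "inode"]


def prepare_commands(flags=DEBUG_FLAGS, buffer_mb=1024):
    """Commands to run BEFORE reproducing the hang.

    buffer_mb is an assumed size for the in-kernel debug buffer;
    pick whatever the client can spare.
    """
    cmds = [["lctl", "set_param", f"debug_mb={buffer_mb}"]]
    # Add each flag to the current mask ("+flag" adds, it does not replace).
    cmds += [["lctl", "set_param", f"debug=+{flag}"] for flag in flags]
    # Empty the debug buffer so the dump only covers the reproduction.
    cmds.append(["lctl", "clear"])
    return cmds


def dump_command(dump_path):
    """Command to run WHILE the job is hung: dump the kernel debug buffer."""
    return ["lctl", "dk", dump_path]


def run_all(cmds):
    """Run each command as root on a Lustre client, stopping on the first failure."""
    for cmd in cmds:
        subprocess.run(cmd, check=True)
```

Usage would be run_all(prepare_commands()) before starting the job, then run_all([dump_command("/tmp/lustre-hang.log")]) once the checkpoint write has stalled (the path is just a placeholder).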

How many files are there? I assume there's only a limited number of
processes per node?

Were obvious things like "a bunch of nodes writing into the same file
in O_APPEND mode" already eliminated? (or not in O_APPEND, but
doing truncates in between)
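
One way to check the O_APPEND question without reading the PyTorch code is to look at the open flags of a running writer through /proc. The sketch below is Linux-only, the function name is made up for illustration, and it assumes you have permission to read /proc/<pid> for the training process. It will not catch the truncate case.

```python
import os


def append_mode_files(pid):
    """Return targets of descriptors that process `pid` holds open with O_APPEND."""
    hits = []
    fd_dir = f"/proc/{pid}/fd"
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(f"{fd_dir}/{fd}")
            with open(f"/proc/{pid}/fdinfo/{fd}") as info:
                fields = dict(line.split(":", 1) for line in info if ":" in line)
        except OSError:
            continue  # descriptor was closed while we were looking
        # fdinfo reports the open flags as an octal number.
        flags = int(fields.get("flags", "0").strip(), 8)
        if flags & os.O_APPEND:
            hits.append(target)
    return hits
```

Running append_mode_files(pid) against each rank on a few nodes and comparing the returned paths would show whether several processes have the same file open in append mode.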

Also what version are you running?