[lustre-discuss] lustre and pytorch
Michael DiDomenico
mdidomenico4 at gmail.com
Thu Jul 11 09:23:57 PDT 2024
i have a strange problem, but honestly i'm not sure it's a lustre
issue. but i figure i'll try here. we have users running LLMs
through pytorch. part of the process saves off checkpoints at
periodic intervals. when the checkpoint files are being written we
can see in the logs pytorch writing out the save files from each
of the processes.
it chugs along for a little bit, but then comes to a grinding halt.
no error from pytorch is logged and no errors can be found on the
lustre clients or servers. the problem is also not transient, it
happens every time the process runs.
the weird part is, if we switch the output directory from lustre to
nfs (netapp backed), the pytorch run works perfectly fine.
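
one way to compare the two targets outside of pytorch is a plain write-and-fsync probe. this is only a rough sketch: `timed_write`, its parameters and the file name `probe.bin` are made up for illustration, and it says nothing about what pytorch itself does when it writes a checkpoint.

```python
import os
import time


def timed_write(directory, size_mb=64, chunk_mb=4, name="probe.bin"):
    """Write size_mb of zeros into `directory`, fsync, and return elapsed seconds.

    Hypothetical helper for illustration; not part of pytorch or lustre.
    """
    path = os.path.join(directory, name)
    chunk = b"\0" * (chunk_mb * 1024 * 1024)
    start = time.monotonic()
    with open(path, "wb") as f:
        for _ in range(size_mb // chunk_mb):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())  # force the data out to the filesystem
    elapsed = time.monotonic() - start
    os.remove(path)  # clean up the probe file
    return elapsed
```

running it once against the lustre directory and once against the nfs directory (from the same client, ideally from several processes at once) would show whether plain buffered writes stall too, or only the pytorch save path.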
has anyone seen anything like this? any suggestions on
troubleshooting the issue?
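
since nothing gets logged when the run stalls, one possible starting point is python's standard `faulthandler` module, which can dump the stack of every thread if a call takes too long. a minimal sketch, assuming the checkpoint write can be wrapped in a function; `save_with_watchdog`, `save_fn` and the timeout value are hypothetical names and numbers, not anything from pytorch:

```python
import faulthandler
import sys
import time


def save_with_watchdog(save_fn, path, timeout_s=300.0, log_file=None):
    """Run save_fn(path); if it takes longer than timeout_s, dump all thread stacks.

    save_fn is a placeholder for whatever writes the checkpoint,
    e.g. lambda p: torch.save(state, p) (assumption, not from the original post).
    """
    if log_file is None:
        log_file = sys.stderr  # must be a real file with a file descriptor
    # arm a watchdog that writes tracebacks every timeout_s seconds while we are stuck
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)
    start = time.monotonic()
    try:
        save_fn(path)
    finally:
        faulthandler.cancel_dump_traceback_later()  # disarm once the save returns
    return time.monotonic() - start
```

if the dump shows every rank sitting in the same write, flush or close call, that points at the filesystem side; if the ranks are waiting on each other instead, the hang may be elsewhere.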
given that we have a 10x performance difference between netapp and
lustre, i'm pretty keen on getting this fixed.