[lustre-discuss] lustre and pytorch

Michael DiDomenico mdidomenico4 at gmail.com
Thu Jul 18 06:55:58 PDT 2024


On Wed, Jul 17, 2024 at 10:01 PM Oleg Drokin <green at whamcloud.com> wrote:
> Are the nodes synchronizing the job? Aka when one is stuck that impacts
> the other from progressing further?

yes, i believe the way pytorch works is that if one of the processes
fails to write out a checkpoint, they all wait.  but i'm not a pytorch
expert, so...
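
to illustrate the behaviour i mean, here is a rough stdlib-only sketch.
it is not pytorch code; it assumes the job has every rank meet at a
barrier after its checkpoint write, and run_ranks, stuck_rank, stall_s
and timeout_s are made-up names for this example.  one stalled rank is
enough to keep every other rank from getting past the barrier:

```python
import threading
import time


def run_ranks(n_ranks, stuck_rank=None, stall_s=1.0, timeout_s=0.5):
    """Simulate n_ranks workers that checkpoint, then meet at a barrier.

    Hypothetical helper for illustration only. Returns a dict mapping
    rank -> "ok" or "barrier broken".
    """
    barrier = threading.Barrier(n_ranks)
    results = {}

    def worker(rank):
        if rank == stuck_rank:
            # Stands in for a checkpoint write that hangs on the filesystem.
            time.sleep(stall_s)
        try:
            # Every rank blocks here until all ranks have arrived.
            barrier.wait(timeout=timeout_s)
            results[rank] = "ok"
        except threading.BrokenBarrierError:
            # Raised in every rank once any waiter times out.
            results[rank] = "barrier broken"

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_ranks(4))                # no rank stalls
    print(run_ranks(4, stuck_rank=2))  # rank 2 stalls past the timeout
```

the short timeout is only there so the sketch terminates.  a real job
would normally wait far longer, so the healthy ranks just look hung.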

> In general debug logs are battle tested enough they should be robust in
> face of anything and not get stuck even if other parts of the system
> are unhappy, but if there's say a memory corruption that affects one of
> its structures, that might make it get stuck.

turns out when i came in this morning, the stuck node had written out
200MB of data.  unfortunately i'm not entirely sure what i'm looking
for, and i can't export the data even if you wanted to see it :(

