[lustre-discuss] lustre and pytorch
Michael DiDomenico
mdidomenico4 at gmail.com
Thu Jul 18 06:55:58 PDT 2024
On Wed, Jul 17, 2024 at 10:01 PM Oleg Drokin <green at whamcloud.com> wrote:
> Are the nodes synchronizing the job? Aka when one is stuck that impacts
> the other from progressing further?
yes, i believe the way pytorch works is that if one of the processes fails to
write out a checkpoint, they all wait. but i'm not a pytorch expert,
so...
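
to illustrate the "they all wait" behavior, here is a rough stdlib-only sketch. it is not pytorch code: the name run_ranks, the sleep standing in for a slow checkpoint write, and the use of threading.Barrier are all assumptions made for illustration. a real job would more likely be blocked in a collective such as torch.distributed.barrier(), and i haven't confirmed that is what this job does.

```python
import threading
import time


def run_ranks(n_ranks, slow_rank, slow_seconds, timeout):
    """Simulate n_ranks workers that each 'write a checkpoint' and then
    meet at a barrier (hypothetical stand-in for a distributed job).

    Returns, per rank, the seconds spent waiting at the barrier, or None
    for ranks whose wait ended because the barrier timed out or broke.
    """
    barrier = threading.Barrier(n_ranks)
    waited = [None] * n_ranks

    def worker(rank):
        # Stand-in for the checkpoint write; only slow_rank is delayed.
        if rank == slow_rank:
            time.sleep(slow_seconds)
        start = time.monotonic()
        try:
            # Nobody gets past this point until every rank has arrived.
            barrier.wait(timeout=timeout)
            waited[rank] = time.monotonic() - start
        except threading.BrokenBarrierError:
            # A timeout on any rank breaks the barrier for all of them.
            waited[rank] = None

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return waited


if __name__ == "__main__":
    # One slow rank: the other three sit at the barrier until it shows up.
    print(run_ranks(4, slow_rank=2, slow_seconds=0.3, timeout=5.0))
    # With a short timeout the waiting ranks give up instead of hanging.
    print(run_ranks(3, slow_rank=0, slow_seconds=0.5, timeout=0.1))
```

in the first call the fast ranks spend roughly the slow rank's write time waiting, which is the pattern where one stuck node stalls the whole job. the second call shows the alternative where a timeout turns the hang into an error on every rank.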
> In general debug logs are battle tested enough they should be robust in
> face of anything and not get stuck even if other parts of the system
> are unhappy, but if there's say a memory corruption that affects one of
> its structures, that might make it get stuck.
turns out when i came in this morning, the stuck node had written out
200MB of data. unfortunately i'm not entirely sure what i'm looking
for, and i can't export the data even if you wanted to see it :(