[lustre-discuss] lustre and pytorch
Oleg Drokin
green at whamcloud.com
Wed Jul 17 19:01:19 PDT 2024
On Wed, 2024-07-17 at 20:30 -0400, Michael DiDomenico via lustre-discuss wrote:
> On Thu, Jul 11, 2024 at 5:24 PM Oleg Drokin <green at whamcloud.com>
> wrote:
> > does it ever resume or does it stop-stop? If you have a hard stop
> > after which the thing is killed - how long is it?
> > Are the writes synchronous? Can you collect lustre debug logs from
> > one of the clients with the +vfstrace+cache+rpctrace+inode debug
> > mask, maybe when the hang happens?
>
> small update on this. we attempted to take a trace today. we managed
> to whittle the process down to two nodes; here are the steps we took:
>
> launch job (2 nodes allocated through slurm)
> tail the job log, seems to be starting up
> pdsh -w node[1-2] -l root 'lctl set_param debug_mb=512'
> pdsh -w node[1-2] -l root 'lctl set_param debug +vfstrace+cache+rpctrace+inode'
> pdsh -w node[1-2] -l root 'lctl debug clear'
> pdsh -w node[1-2] -l root 'lctl debug mark'
> job runs along for a few minutes, but eventually the log stops while
> the wall clock moves along
>
> at this point we pull the debug
> pdsh -w node[1-2] -l root 'lctl debug_kernel /lustre/temp/lustre_debug.${HOSTNAME}.`date +%s`'
>
> this is where things get a little weird
> node2 seemed to dump out 2.5 million lines of logfile (~400mb) and return
> node1 did not; it dumped out 28k worth of the log and then just hung
>
> node1 is still up and responding normally as far as i can tell. no
> errors in dmesg and the filesystem still responds to normal commands.
> even though the node seems okay, the job is definitely stalled
>
> at this point we cancelled the job. i had to leave for the day, but i
> left the node in the broken state. i'll see if maybe something gets
> put in the logs or the kernel debug completes overnight, but it seems
> unlikely. i know this is pretty far into left field and hard to debug
> at this point, but any suggestions?

If you are experiencing a stuck node, crashdumping it might help you see what it was doing.

Are the nodes synchronizing the job? That is, when one is stuck, does that keep the other from progressing further?
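
Before going that far, a lighter-weight check (plain kernel tooling, nothing Lustre-specific, and assuming sysrq is enabled on the clients) is to dump the stacks of all blocked tasks on both nodes and see which one is actually wedged in the filesystem, e.g.:

pdsh -w node[1-2] -l root 'echo w > /proc/sysrq-trigger'
pdsh -w node[1-2] -l root 'dmesg | tail -n 300'

The first command logs the stacks of every uninterruptible (D-state) task to the kernel log; if node1's tasks are blocked inside Lustre/ptlrpc code paths while node2's are just waiting on the job's collectives, that points at node1 as the one worth crashdumping.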

In general the debug logs are battle-tested enough that they should be robust in the face of anything and not get stuck even if other parts of the system are unhappy, but if there is, say, a memory corruption that affects one of their structures, that might make them get stuck.
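
As an aside, if you end up retrying the log collection, lctl also has a debug_daemon mode (assuming your release has it; options below are a sketch, so double check against your lctl) that streams the debug buffer to a file continuously, which avoids the one big debug_kernel dump at the end. Keep the output file on local disk rather than on Lustre itself:

lctl debug_daemon start /var/tmp/lustre_debug.bin 1024
# ... reproduce the stall ...
lctl debug_daemon stop
lctl debug_file /var/tmp/lustre_debug.bin /var/tmp/lustre_debug.txt

That way you at least keep whatever was captured up to the point where things went sideways, even if a later dump would hang.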

Either way, if you find it still stuck in the morning, crashdumping it is your next best bet, I suspect.
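
If you do go the crashdump route, a minimal sketch (assuming kdump/kexec is already configured on the client and sysrq is enabled; adjust for your distro) would be, on the stuck node:

kdump-config status            # Debian/Ubuntu; on RHEL-family: systemctl status kdump
echo c > /proc/sysrq-trigger   # force a panic; the kdump kernel writes the vmcore, typically under /var/crash

The resulting vmcore can then be opened with the crash utility against the matching kernel debuginfo to see exactly where the hung debug_kernel dump and the job's threads were sitting.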