[lustre-discuss] lustre and pytorch

Oleg Drokin green at whamcloud.com
Thu Jul 18 12:23:57 PDT 2024


On Thu, 2024-07-18 at 13:26 -0400, Michael DiDomenico wrote:
> there's nothing 'grep -i evict' in the lustre_debug logs or from the
> storage console logs

hm, ok.

> > perhaps see continuity of the timestamps and what happened right
> > before
> > and right after the gap if there is one in the times?
> > 
> 
> i pulled a counter from the logs of the functions calls, maybe one of
> these looks off (this is just the ones over 100k),  please excuse
> typos
> 
> $ grep -vh "^$" lustre_debug*.log | cut -f10 -d: | cut -f1 -d\) |
> sort | uniq -c | sort -n
> 105034 lov_io_init
> 105034 vvp_io_init
> 105035 lov_io_iter_init
> 105035 lov_stripe_intersects
> 105035 osc_cache_writeback_range
> 105043 vvp_io_fini
> 105050 lov_conf_freeze
> 105050 lov_conf_thaw
> 294806 osc_attr_update
> 294806 osc_page_touch_at
> 294814 osc_consume_write_grant
> 294815 lov_attr_get_composite
> 294816 osc_enter_cache_try
> 351044 ll_write_end
> 589549 osc_queue_async_io
> 589617 lov_merge_lvb_kms

that's not really going to tell us anything useful. there's a timestamp
in unix time as the fourth field (separated with colons), see if there
are gaps there.
I imagine there's going to be really dense (time-wise) activity, then an
RPC is prepared and sent (Sending RPC ....), then a lot sparser
activity (perhaps with multi-second pauses), and then eventually it'll
pick up again after getting a server response, for example?
Though none of that explains why lctl would hang, I guess, but still.
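
Something along these lines might do for spotting the gaps. It's only a rough sketch: find_gaps is a made-up helper name, the 1 second threshold is arbitrary, and it assumes every non-empty line carries the seconds.microseconds timestamp as its fourth colon-separated field. The sort is there so that lines from several files interleave in time order; drop it if you'd rather look at each file as dumped.

```shell
# find_gaps THRESHOLD_SECONDS FILE...
# Hypothetical helper: report every place where the timestamp (4th
# colon-separated field) jumps by more than THRESHOLD_SECONDS between
# two consecutive lines, and print the lines on either side of the jump.
find_gaps() {
    threshold=$1
    shift
    # Drop empty lines, order everything by timestamp, then compare
    # each line's timestamp with the previous one.
    grep -vh '^$' "$@" |
        LC_ALL=C sort -t: -k4,4n |
        LC_ALL=C awk -F: -v thr="$threshold" '
            NR > 1 && ($4 - prev) > thr {
                printf "gap of %.3f s\n", $4 - prev
                printf "  before: %s\n", prevline
                printf "  after:  %s\n", $0
            }
            { prev = $4; prevline = $0 }
        '
}
```

Run it as "find_gaps 1 lustre_debug*.log" and then look at what is logged right before and right after each reported gap, e.g. whether a "Sending RPC" line precedes it.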


