[lustre-discuss] lustre and pytorch

Oleg Drokin green at whamcloud.com
Wed Jul 24 15:16:42 PDT 2024


So this is all happening in the same second; I take it there was no
gap?

I wonder if something like LU-17510 is in play here, but also given you
are on 2.15, can you update a couple clients to latest master and see
if it helps anything?
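If you want to spot the stall mechanically rather than by eyeballing: the
fourth colon-separated field of each debug record is a unix timestamp with
microseconds, so something along these lines should flag any multi-second
pause (the 1-second threshold and the lustre_debug*.log filenames are just
placeholders, adjust to taste):

```shell
# Rough gap finder for lctl debug logs: field 4 of each record is
# epoch.usec, so print any record preceded by a pause longer than 1s.
awk -F: 'NF >= 4 && $4 ~ /^[0-9]+\.[0-9]+$/ {
    if (prev != "" && $4 - prev > 1)
        printf "gap of %.3fs before: %s\n", $4 - prev, $0
    prev = $4
}' lustre_debug*.log
```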

On Wed, 2024-07-24 at 14:39 -0400, Michael DiDomenico via lustre-
discuss wrote:
> for what it's worth, here's the chunk of debug log that seemed
> suspect.  not sure if it helps anything
> 
> 00000008:00000020:30.0:1721247906.483639:0:285997:0:(osc_cache.c:1374
> :osc_consume_write_grant())
> using 4096 grant credits for brw 000000005265749c page
> 00000000a0744e3b
> 00000008:00000020:30.0:1721247906.483640:0:285997:0:(osc_cache.c:2388
> :osc_queue_async_io())
> extent 0000000064254af2@{[1466368 -> 1470384/1470463],
> [2|1|-|active|wi|00000000005ec049],
> [16478208|4016|+|-|000000000ffe1304|4096|0000000000000000]} expanded
> for 1470384.
> 00000008:00000002:30.0:1721247906.483641:0:285997:0:(osc_io.c:278:osc
> _page_touch_at())
> stripe KMS increasing 6022692864->6022696960 6022692864
> 00000008:00000020:30.0:1721247906.483642:0:285997:0:(osc_object.c:182
> :osc_attr_update())
> set kms from 6022692864to 6022696960
> 00020000:00000002:30.0:1721247906.483643:0:285997:0:(lov_merge.c:60:l
> ov_merge_lvb_kms())
> MDT ID 0x21:1073784952 initial value: s=0 m=9223372036854775808
> a=9223372036854775808 c=9223372036854775808 b=0
> 00020000:00000002:30.0:1721247906.483643:0:285997:0:(lov_merge.c:94:l
> ov_merge_lvb_kms())
> MDT ID 0x21:1073784952 on OST[15]: s=6022696960 m=1721247906 a=0
> c=1721247906 b=11730952
> 00020000:00000002:30.0:1721247906.483644:0:285997:0:(lov_object.c:105
> 0:lov_attr_get_composite())
> COMP ID #-1: s=6022696960 m=1721247906 a=0 c=1721247906 b=11730952
> 00000008:00000020:30.0:1721247906.483645:0:285997:0:(osc_cache.c:2355
> :osc_queue_async_io())
> obj 00000000005ec049 ready 0|-|- wr 0|-|- rd 0|- oap 000000001625d339
> page 000000002b0d2b57 added for cmd 2
> 00000008:00000020:30.0:1721247906.483646:0:285997:0:(osc_cache.c:1510
> :osc_enter_cache_try())
> snx11255-OST000f-osc-ffff891b019c8000: grant { dirty: 4017/512000
> dirty_pages: 12055/66011400 dropped: 0 avail: 1863200767,
> dirty_grant:
> 16478208, reserved: 0, flight: 0 } lru {in
> list: 1679370, left: 4019, waiters: 0 }need:28672
> 00000008:00000020:30.0:1721247906.483647:0:285997:0:(osc_cache.c:1374
> :osc_consume_write_grant())
> using 4096 grant credits for brw 000000004670e8ec page
> 000000002b0d2b57
> 00000008:00000020:30.0:1721247906.483648:0:285997:0:(osc_cache.c:2388
> :osc_queue_async_io())
> extent 0000000064254af2@{[1466368 -> 1470385/1470463],
> [2|1|-|active|wi|00000000005ec049],
> [16482304|4017|+|-|000000000ffe1304|4096|0000000000000000]} expanded
> for 1470385.
> 00000008:00000002:30.0:1721247906.483650:0:285997:0:(osc_io.c:278:osc
> _page_touch_at())
> stripe KMS increasing 6022696960->6022701056 6022696960
> 00000008:00000020:30.0:1721247906.483650:0:285997:0:(osc_object.c:182
> :osc_attr_update())
> set kms from 6022696960to 6022701056
> 00020000:00000002:30.0:1721247906.483651:0:285997:0:(lov_merge.c:60:l
> ov_merge_lvb_kms())
> MDT ID 0x21:1073784952 initial value: s=0 m=9223372036854775808
> a=9223372036854775808 c=9223372036854775808 b=0
> 00020000:00000002:30.0:1721247906.483652:0:285997:0:(lov_merge.c:94:l
> ov_merge_lvb_kms())
> MDT ID 0x21:1073784952 on OST[15]: s=6022701056 m=1721247906 a=0
> c=1721247906 b=11730952
> 00020000:00000002:30.0:1721247906.483653:0:285997:0:(lov_object.c:105
> 0:lov_attr_get_composite())
> COMP ID #-1: s=6022701056 m=1721247906 a=0 c=1721247906 b=11730952
> 00000008:00000020:30.0:1721247906.483654:0:285997:0:(osc_cache.c:2355
> :osc_queue_async_io())
> obj 00000000005ec049 ready 0|-|- wr 0|-|- rd 0|- oap 0000000044245bf2
> page 00000000ff11bc07 added for cmd 2
> 00000008:00000020:30.0:1721247906.483655:0:285997:0:(osc_cache.c:1510
> :osc_enter_cache_try())
> snx11255-OST000f-osc-ffff891b019c8000: grant { dirty: 4018/512000
> dirty_pages: 12056/66011400 dropped: 0 avail: 1863196671,
> dirty_grant:
> 16482304, reserved: 0, flight: 0 } lru {in
> list: 1679370, left: 4019, waiters: 0 }need:28672
> 
> --- this part above repeats for a while
> 
> 00000008:00000020:30.0:1721247906.483656:0:285997:0:(osc_cache.c:1374
> :osc_consume_write_grant())
> using 4096 grant credits for brw 000000006d9a7b11 page
> 00000000ff11bc07
> 00000008:00000020:30.0:1721247906.483657:0:285997:0:(osc_cache.c:2388
> :osc_queue_async_io())
> extent 0000000064254af2@{[1466368 -> 1470386/1470463],
> [2|1|-|active|wi|00000000005ec049],
> [16486400|4018|+|-|000000000ffe1304|4096|0000000000000000]} expanded
> for 1470386.
> 00000008:00000002:30.0:1721247906.483658:0:285997:0:(osc_io.c:278:osc
> _page_touch_at())
> stripe KMS increasing 6022701056->6022701376 6022701056
> 00000008:00000020:30.0:1721247906.483659:0:285997:0:(osc_object.c:182
> :osc_attr_update())
> set kms from 6022701056to 6022701376
> 00000008:00000020:30.0:1721247906.483660:0:285997:0:(osc_request.c:76
> 3:osc_update_next_shrink())
> next time 608550 to shrink grant
> 00000080:00200000:30.0:1721247906.483664:0:285997:0:(vvp_io.c:1200:vv
> p_io_write_commit())
> Committed 14993 pages 61407552 bytes, tot: 396953728
> 00020000:00000002:30.0:1721247906.483665:0:285997:0:(lov_merge.c:60:l
> ov_merge_lvb_kms())
> MDT ID 0x21:1073784952 initial value: s=0 m=9223372036854775808
> a=9223372036854775808 c=9223372036854775808 b=0
> 00020000:00000002:30.0:1721247906.483666:0:285997:0:(lov_merge.c:94:l
> ov_merge_lvb_kms())
> MDT ID 0x21:1073784952 on OST[15]: s=6022701376 m=1721247906 a=0
> c=1721247906 b=11730952
> 00020000:00000002:30.0:1721247906.483667:0:285997:0:(lov_object.c:105
> 0:lov_attr_get_composite())
> COMP ID #-1: s=6022701376 m=1721247906 a=0 c=1721247906 b=11730952
> 00000080:00200000:30.0:1721247906.483668:0:285997:0:(file.c:1490:ll_m
> erge_attr())
> [0x24000a878:0x21:0x0] updating i_size 6022701376
> 00000080:00200000:30.0:1721247906.483671:0:285997:0:(vvp_io.c:1364:vv
> p_io_write_start())
> bf16_zero_pp_rank_7_mp_rank_01_optim_states.pt: write nob 0, result:
> 396953728
> 00000008:00000020:30.0:1721247906.483674:0:285997:0:(osc_cache.c:1709
> :osc_update_pending())
> obj 00000000005ec049 ready 0|-|- wr 4019|-|- rd 0|- update pending
> cmd
> 2 delta 4019.
> 00000080:00200000:30.0:1721247906.483687:0:285997:0:(vvp_io.c:537:vvp
> _io_advance())
> advancing 396953728 bytes
> 00000080:00200000:30.0:1721247906.483690:0:285997:0:(file.c:1768:ll_f
> ile_io_generic())
> Range unlock [1373473, 1470386]
> 00000080:00200000:30.0:1721247906.483693:0:285997:0:(vvp_io.c:307:vvp
> _io_fini())
> [0x24000a878:0x21:0x0] ignore/verify layout 0/0, layout version 0
> need
> write layout 0, restore needed 0
> 00000080:00200000:30.0:1721247906.483694:0:285997:0:(file.c:1809:ll_f
> ile_io_generic())
> bf16_zero_pp_rank_7_mp_rank_01_optim_states.pt: 2 io complete with
> rc:
> 0, result: 2007566464, restart: 0
> 00000080:00200000:30.0:1721247906.483697:0:285997:0:(file.c:1874:ll_f
> ile_io_generic())
> iot: 2, result: 2007566464
> 00000080:00200008:30.0:1721247906.487347:0:285997:0:(file.c:2167:ll_f
> ile_write_iter())
> file
> bf16_zero_pp_rank_7_mp_rank_01_optim_states.pt:[0x24000a878:0x21:0x0]
> ,
> ppos: 6022701376, count: 1200
> 00000080:00200000:30.0:1721247906.487354:0:285997:0:(rw26.c:731:ll_wr
> ite_begin())
> Writing 1470386 of 320 to 1200 bytes
> 00000080:00200000:30.0:1721247906.487358:0:285997:0:(rw26.c:917:ll_wr
> ite_end())
> pos 6022701376, len 1200, copied 1200
> 00000008:00000002:30.0:1721247906.487364:0:285997:0:(osc_io.c:278:osc
> _page_touch_at())
> stripe KMS increasing 6022701376->6022702576 6022701376
> 00000008:00000020:30.0:1721247906.487366:0:285997:0:(osc_object.c:182
> :osc_attr_update())
> set kms from 6022701376to 6022702576
> 00000080:00200000:30.0:1721247906.487370:0:285997:0:(file.c:2146:ll_d
> o_tiny_write())
> result: 1200, original count 1200
> 00020000:00000002:30.0:1721247906.487378:0:285997:0:(lov_object.c:118
> 2:lov_conf_freeze())
> To take share lov(00000000e2618e16) owner
> 0000000000000000/0000000065ef6d4c
> 00020000:00000002:30.0:1721247906.487380:0:285997:0:(lov_object.c:119
> 0:lov_conf_thaw())
> To release share lov(00000000e2618e16) owner
> 0000000000000000/0000000065ef6d4c
> 00000080:00200000:30.0:1721247906.487384:0:285997:0:(file.c:405:ll_fi
> le_release())
> VFS Op:inode=[0x24000a878:0x21:0x0](000000005f8eb1b5)
> 00020000:00000002:30.0:1721247906.487385:0:285997:0:(lov_object.c:118
> 2:lov_conf_freeze())
> To take share lov(00000000e2618e16) owner
> 0000000000000000/0000000065ef6d4c
> 00020000:00000002:30.0:1721247906.487386:0:285997:0:(lov_object.c:119
> 0:lov_conf_thaw())
> To release share lov(00000000e2618e16) owner
> 0000000000000000/0000000065ef6d4c
> 00800000:00000002:30.0:1721247906.487389:0:285997:0:(lmv_obd.c:3572:l
> mv_lock_match())
> Lock match for [0x24000a878:0x21:0x0]
> 00800000:00000002:30.0:1721247906.487393:0:285997:0:(lmv_fld.c:77:lmv
> _fld_lookup())
> FLD lookup got mds #1 for fid=[0x24000a878:0x21:0x0]
> 00800000:00000002:30.0:1721247906.487408:0:285997:0:(lmv_fld.c:77:lmv
> _fld_lookup())
> FLD lookup got mds #1 for fid=[0x24000a878:0x21:0x0]
> 00800000:00000002:30.0:1721247906.487409:0:285997:0:(lmv_obd.c:1457:l
> mv_close())
> CLOSE [0x24000a878:0x21:0x0]
> 00000002:00000002:30.0:1721247906.487411:0:285997:0:(mdc_request.c:90
> 2:mdc_close())
> snx11255-MDT0001-mdc-ffff891b019c8000: [0x24000a878:0x21:0x0] file
> closed with intent: 200
> 00000002:00100000:30.0:1721247906.487418:0:285997:0:(mdc_request.c:94
> 4:mdc_close())
> @@@ matched open  req at 00000000da251270
> x1804223383477312/t99088005029(99088005029)
> o101-
> >snx11255-MDT0001-mdc-ffff891b019c8000 at 172.18.20.21@o2ib100:12/10
> lens 616/656 e 0 to 0 dl 1721247862 ref 1 fl Complete:RPQU/4/ffffffff
> rc 0/-1 job:''
> 00000020:00100000:30.0:1721247906.487431:0:285997:0:(genops.c:2299:ob
> d_get_mod_rpc_slot())
> snx11255-MDT0001-mdc-ffff891b019c8000: modify RPC slot 1 is allocated
> opc 35, max 7
> 00000100:00100000:30.0:1721247906.487434:0:285997:0:(client.c:742:ptl
> rpc_reassign_next_xid())
> @@@ reassign xid  req at 000000007e3dc9a5 x1804223383853824/t0(0)
> o35->snx11255-MDT0001-mdc-ffff891b019c8000 at 172.18.20.21@o2ib100:23/10
> lens 392/1104 e 0 to 0 dl 0 ref 1 fl New:QU/0/ffffffff rc 0/-1 job:''
> 00000100:00100000:30.0:1721247906.487443:0:285997:0:(client.c:1727:pt
> lrpc_send_new_req())
> Sending RPC req at 000000007e3dc9a5 pname:cluuid:pid:xid:nid:opc:job
> pt_main_thread:409f80a0-cdfe-466d-9a49-
> 014ed62a4391:285997:1804223383853824:172.18.20.21 at o2ib100:35:
> 00000100:00100000:30.0:1721247906.487486:0:285997:0:(client.c:2491:pt
> lrpc_set_wait())
> set 000000000180e080 going to sleep for 11 seconds
> 00000100:00100000:30.0:1721247906.487761:0:285997:0:(client.c:2207:pt
> lrpc_check_set())
> Completed RPC req at 000000007e3dc9a5 pname:cluuid:pid:xid:nid:opc:job
> pt_main_thread:409f80a0-cdfe-466d-9a49-
> 014ed62a4391:285997:1804223383853824:172.18.20.21 at o2ib100:35:
> 00800000:00000002:30.0:1721247906.487766:0:285997:0:(lmv_fld.c:77:lmv
> _fld_lookup())
> FLD lookup got mds #1 for fid=[0x24000a878:0x21:0x0]
> 00000002:00100000:30.0:1721247906.487768:0:285997:0:(mdc_request.c:85
> 2:mdc_free_open())
> @@@ free open request, rq_replay=0  req at 00000000da251270
> x1804223383477312/t99088005029(99088005029)
> o101-
> >snx11255-MDT0001-mdc-ffff891b019c8000 at 172.18.20.21@o2ib100:12/10
> lens 616/656 e 0 to 0 dl 1721247862 ref 1 fl Complete:RQU/4
> /ffffffff rc 0/-1 job:''
> 
> --- this part repeats with no other output
> 
> 00000080:00200000:8.0:1721247906.528444:0:72:0:(vvp_io.c:1788:vvp_io_
> init())
> [0x24000a876:0x26:0x0] ignore/verify layout 1/0, layout version 0
> restore needed 0
> 00020000:00000002:8.0:1721247906.528446:0:72:0:(lov_object.c:1486:lov
> _io_init())
> [0x24000a876:0x26:0x0]io 000000001dcefeeb type 6 ignore/verify layout
> 1/0
> 00020000:00000002:8.0:1721247906.528447:0:72:0:(lov_object.c:1182:lov
> _conf_freeze())
> To take share lov(0000000018a6920a) owner
> 0000000000000000/0000000096d0da59
> 00020000:00000002:8.0:1721247906.528448:0:72:0:(lov_object.c:1190:lov
> _conf_thaw())
> To release share lov(0000000018a6920a) owner
> 0000000000000000/0000000096d0da59
> 00020000:00200000:8.0:1721247906.528449:0:72:0:(lov_io.c:824:lov_io_i
> ter_init())
> component[0] flags 0x10
> 00020000:00000002:8.0:1721247906.528450:0:72:0:(lov_offset.c:266:lov_
> stripe_intersects())
> [0->-1] -> [(0) 0->-1 (0)]
> 00000008:00000020:8.0:1721247906.528452:0:72:0:(osc_cache.c:3049:osc_
> cache_writeback_range())
> obj 00000000c9987be6 ready 0|-|- wr 0|-|- rd 0|- pageout [0,
> 18446744073709551615], 0.
> 00000080:00200000:8.0:1721247906.528454:0:72:0:(vvp_io.c:307:vvp_io_f
> ini())
> [0x24000a876:0x26:0x0] ignore/verify layout 1/0, layout version 0
> need
> write layout 0, restore needed 0
> 00000080:00200000:8.0:1721247906.528455:0:72:0:(vvp_io.c:1788:vvp_io_
> init())
> [0x24000a876:0x2b:0x0] ignore/verify layout 1/0, layout version 0
> restore needed 0
> 00020000:00000002:8.0:1721247906.528456:0:72:0:(lov_object.c:1486:lov
> _io_init())
> [0x24000a876:0x2b:0x0]io 000000001dcefeeb type 6 ignore/verify layout
> 1/0
> 00020000:00000002:8.0:1721247906.528457:0:72:0:(lov_object.c:1182:lov
> _conf_freeze())
> To take share lov(0000000020610153) owner
> 0000000000000000/0000000096d0da59
> 00020000:00000002:8.0:1721247906.528458:0:72:0:(lov_object.c:1190:lov
> _conf_thaw())
> To release share lov(0000000020610153) owner
> 0000000000000000/0000000096d0da59
> 00020000:00200000:8.0:1721247906.528458:0:72:0:(lov_io.c:824:lov_io_i
> ter_init())
> component[0] flags 0x10
> 00020000:00000002:8.0:1721247906.528459:0:72:0:(lov_offset.c:266:lov_
> stripe_intersects())
> [0->-1] -> [(0) 0->-1 (0)]
> 00000008:00000020:8.0:1721247906.528460:0:72:0:(osc_cache.c:3049:osc_
> cache_writeback_range())
> obj 000000005da60c8b ready 0|-|- wr 0|-|- rd 0|- pageout [0,
> 18446744073709551615], 0.
> 00000080:00200000:8.0:1721247906.528462:0:72:0:(vvp_io.c:307:vvp_io_f
> ini())
> [0x24000a876:0x2b:0x0] ignore/verify layout 1/0, layout version 0
> need
> write layout 0, restore needed 0
> 00000080:00200000:8.0:1721247906.528463:0:72:0:(vvp_io.c:1788:vvp_io_
> init())
> [0x24000a876:0x29:0x0] ignore/verify layout 1/0, layout version 0
> restore needed 0
> 00020000:00000002:8.0:1721247906.528464:0:72:0:(lov_object.c:1486:lov
> _io_init())
> [0x24000a876:0x29:0x0]io 000000001dcefeeb type 6 ignore/verify layout
> 1/0
> 00020000:00000002:8.0:1721247906.528465:0:72:0:(lov_object.c:1182:lov
> _conf_freeze())
> To take share lov(00000000c3ae5165) owner
> 0000000000000000/0000000096d0da59
> 00020000:00000002:8.0:1721247906.528465:0:72:0:(lov_object.c:1190:lov
> _conf_thaw())
> To release share lov(00000000c3ae5165) owner
> 0000000000000000/0000000096d0da59
> 
> On Fri, Jul 19, 2024 at 8:58 AM Michael DiDomenico
> <mdidomenico4 at gmail.com> wrote:
> > 
> > interestingly enough i ran
> > 
> > cut -f4 -d: lustre_debug* | cut -f1 -d. | sort | uniq -c
> > 
> > and while the timeline doesn't show a big gap, the number of
> > records
> > per time interval does.  i can clearly see the slope of the job and
> > exactly where it seems to have fallen off
> > 
> > i might be able to get the small chunk of the log exported, but
> > i'll
> > paraphrase the log a little below
> > 
> > starting with many entries that repeat
> > 
> > using 4096 grant credits for brw
> > osc grant
> > stripe kms increasing
> > set kms from
> > mdt id initial value
> > mdt id on ost
> > comp id #-1
> > obj ready
> > osc grant
> > 
> > then it stops and this comes out
> > 
> > ll_merge_attr updating i_size
> > bf16_zero_pp_rank_7_mp_rank_01_optim_states.pt
> > obj ready update pending cmd
> > advancing bytes
> > range unlock
> > vvp_io_fini()
> > bf16_zero_pp_rank_7_mp_rank_01_optim_states.pt
> > writing x of x to x bytes
> > pos len copied
> > stripe kms increasing
> > set kms from
> > ll_do_tiny_write()
> > to take share lov owner
> > to release share lov owner
> > vfs op:inode
> > to take share lov
> > to release share lov
> > lock match for
> > FLD lookup got mds 1
> > fld lookup got mds
> > CLOSE
> > mdc file closed with intent 200
> > @@@ matched open mdc
> > mdc modify rpc slot 1
> > @@@ reassign xid
> > sending RPC
> > set going to sleep for 11 seconds
> > completed rpc
> > fld lookup got mds 1
> > @@@ free open request
> > 
> > after the last command, it then spirals out repeating until the end
> > of the log
> > 
> > vvp_io_init
> > lov_io_init
> > lov_conf_freeze to take share
> > lov_conf_thaw to release share
> > 
> > On Thu, Jul 18, 2024 at 3:24 PM Oleg Drokin <green at whamcloud.com>
> > wrote:
> > > 
> > > On Thu, 2024-07-18 at 13:26 -0400, Michael DiDomenico wrote:
> > > > there's nothing 'grep -i evict' in the lustre_debug logs or
> > > > from the
> > > > storage console logs
> > > 
> > > hm, ok.
> > > 
> > > > > perhaps see continuity of the timestamps and what happened
> > > > > right
> > > > > before
> > > > > and right after the gap if there is one in the times?
> > > > > 
> > > > 
> > > > i pulled a counter from the logs of the functions calls, maybe
> > > > one of
> > > > these looks off (this is just the ones over 100k),  please
> > > > excuse
> > > > typos
> > > > 
> > > > $ grep -vh "^$" lustre_debug*.log | cut -f10 -d: | cut -f1 -d\)
> > > > | sort | uniq -c | sort -n
> > > > 105034 lov_io_init
> > > > 105034 vvp_io_init
> > > > 105035 lov_io_iter_init
> > > > 105035 lov_stripe_intersects
> > > > 105035 osc_cache_writeback_range
> > > > 105043 vvp_io_fini
> > > > 105050 lov_conf_freeze
> > > > 105050 lov_conf_thaw
> > > > 294806 osc_attr_update
> > > > 294806 osc_page_touch_at
> > > > 294814 osc_consume_write_grant
> > > > 294815 lov_attr_get_composite
> > > > 294816 osc_enter_cache_try
> > > > 351044 ll_write_end
> > > > 589549 osc_queue_async_io
> > > > 589617 lov_merge_lvb_kms
> > > 
> > > that's not really going to do anything useful; there's a
> > > timestamp in unix time as the fourth field (separated with
> > > colons), so see if there are gaps there.
> > > I imagine there's going to be really dense (time-wise) activity,
> > > then an RPC is prepared and sent (Sending RPC ....), and then a
> > > lot sparser activity (perhaps with multi-second pauses), and then
> > > eventually it'll pick up after getting a server response, for
> > > example?
> > > Though none of that explains why lctl would hang, I guess, but
> > > still
> > > 
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


