[lustre-discuss] import_set_state_nolock() with binary args in lctl debug_file output?

Sternberg, Michael G. sternberg at anl.gov
Fri Aug 14 14:44:18 PDT 2020



In lctl debug_file output, for import_set_state_nolock(), I sometimes see binary arguments (sample snippet at end of post), and figure that's not a good sign. How can I get to the bottom of this?

The only direct reference I could find is LU-7339 from 2015, which got no replies. I also see call traces on my system that are similar to those in LU-12037 "Possible DNE issue leading to hung filesystem". This led me to implement  lru_size  and lru_max_age  workarounds. I suspect hardware issues within my IB network, but "ibqueryerrors -c -s PortXmitWait" shows nothing for Lustre servers.

The MDSs have on many occasions been getting overloaded by apps that do loads of operations on short (kB-size) files and directories. An MDS (usually but not always the one holding DNE remote directories!?) eventually reports health_check=NOT HEALTHY . As a workaround, I used the following incantation post-mount on MDSs:

    ldlm.namespaces.*.lru_size=clear
    ldlm.namespaces.*.lru_size=1000
    ldlm.namespaces.*.lru_max_age=2s

I'd be grateful for pointers on how to go about diagnosing and resolving this issue.


With best regards,
Michael Sternberg

-----------
System context:

2 Lustre fs ("carbofs2", "sandbox2") on 2 MDSs (1 HA pair) and 6 OSS (3 HA pairs), on Lustre-2.12.4 (now .5), with DoM and DNE active. MDTs are ldiskfs on RAID1 SSD LUNs connected over FC.

# lsb_release -sir
CentOS 7.8.2003

# rpm -q lustre
lustre-2.12.5-1.el7.x86_64

# dsh -e -w mds11,mds12 lctl list_nids
mds11: 172.17.101.5 at o2ib
mds11: 172.16.101.5 at tcp
mds12: 172.17.101.6 at o2ib
mds12: 172.16.101.6 at tcp

# dsh -e -w mds11,mds12 lctl dl
mds11:   0 UP osd-ldiskfs MGS-osd MGS-osd_UUID 4
mds11:   1 UP mgs MGS MGS 442
mds11:   2 UP mgc MGC172.17.101.5 at o2ib 3adfc64b-3179-d8ff-ad27-42788cabd023 4
mds11:   3 UP osd-ldiskfs carbofs2-MDT0000-osd carbofs2-MDT0000-osd_UUID 17
mds11:   4 UP osd-ldiskfs sandbox2-MDT0000-osd sandbox2-MDT0000-osd_UUID 9
mds11:   5 UP mds MDS MDS_uuid 2
mds11:   6 UP lod carbofs2-MDT0000-mdtlov carbofs2-MDT0000-mdtlov_UUID 3
mds11:   7 UP mdt carbofs2-MDT0000 carbofs2-MDT0000_UUID 452
mds11:   8 UP mdd carbofs2-MDD0000 carbofs2-MDD0000_UUID 3
mds11:   9 UP qmt carbofs2-QMT0000 carbofs2-QMT0000_UUID 3
mds11:  10 UP osp carbofs2-MDT0001-osp-MDT0000 carbofs2-MDT0000-mdtlov_UUID 4
mds11:  11 UP osp carbofs2-OST0000-osc-MDT0000 carbofs2-MDT0000-mdtlov_UUID 4
mds11:  12 UP osp carbofs2-OST0001-osc-MDT0000 carbofs2-MDT0000-mdtlov_UUID 4
mds11:  13 UP osp carbofs2-OST0002-osc-MDT0000 carbofs2-MDT0000-mdtlov_UUID 4
mds11:  14 UP osp carbofs2-OST0003-osc-MDT0000 carbofs2-MDT0000-mdtlov_UUID 4
mds11:  15 UP osp carbofs2-OST0004-osc-MDT0000 carbofs2-MDT0000-mdtlov_UUID 4
mds11:  16 UP osp carbofs2-OST0005-osc-MDT0000 carbofs2-MDT0000-mdtlov_UUID 4
mds11:  17 UP osp carbofs2-OST0006-osc-MDT0000 carbofs2-MDT0000-mdtlov_UUID 4
mds11:  18 UP osp carbofs2-OST0007-osc-MDT0000 carbofs2-MDT0000-mdtlov_UUID 4
mds11:  19 UP osp carbofs2-OST0008-osc-MDT0000 carbofs2-MDT0000-mdtlov_UUID 4
mds11:  20 UP osp carbofs2-OST0009-osc-MDT0000 carbofs2-MDT0000-mdtlov_UUID 4
mds11:  21 UP lwp carbofs2-MDT0000-lwp-MDT0000 carbofs2-MDT0000-lwp-MDT0000_UUID 4
mds11:  22 UP lod sandbox2-MDT0000-mdtlov sandbox2-MDT0000-mdtlov_UUID 3
mds11:  23 UP mdt sandbox2-MDT0000 sandbox2-MDT0000_UUID 436
mds11:  24 UP mdd sandbox2-MDD0000 sandbox2-MDD0000_UUID 3
mds11:  25 UP qmt sandbox2-QMT0000 sandbox2-QMT0000_UUID 3
mds11:  26 UP osp sandbox2-MDT0001-osp-MDT0000 sandbox2-MDT0000-mdtlov_UUID 4
mds11:  27 UP osp sandbox2-OST0000-osc-MDT0000 sandbox2-MDT0000-mdtlov_UUID 4
mds11:  28 UP osp sandbox2-OST0001-osc-MDT0000 sandbox2-MDT0000-mdtlov_UUID 4
mds11:  29 UP lwp sandbox2-MDT0000-lwp-MDT0000 sandbox2-MDT0000-lwp-MDT0000_UUID 4
mds12:   0 UP osd-ldiskfs carbofs2-MDT0001-osd carbofs2-MDT0001-osd_UUID 16
mds12:   1 UP osd-ldiskfs sandbox2-MDT0001-osd sandbox2-MDT0001-osd_UUID 8
mds12:   2 UP mgc MGC172.17.101.5 at o2ib 8a65014c-0bf7-e566-56ec-1e13f7ad52af 4
mds12:   3 UP mds MDS MDS_uuid 2
mds12:   4 UP lod carbofs2-MDT0001-mdtlov carbofs2-MDT0001-mdtlov_UUID 3
mds12:   5 UP mdt carbofs2-MDT0001 carbofs2-MDT0001_UUID 448
mds12:   6 UP mdd carbofs2-MDD0001 carbofs2-MDD0001_UUID 3
mds12:   7 UP osp carbofs2-MDT0000-osp-MDT0001 carbofs2-MDT0001-mdtlov_UUID 4
mds12:   8 UP osp carbofs2-OST0000-osc-MDT0001 carbofs2-MDT0001-mdtlov_UUID 4
mds12:   9 UP osp carbofs2-OST0001-osc-MDT0001 carbofs2-MDT0001-mdtlov_UUID 4
mds12:  10 UP osp carbofs2-OST0002-osc-MDT0001 carbofs2-MDT0001-mdtlov_UUID 4
mds12:  11 UP osp carbofs2-OST0003-osc-MDT0001 carbofs2-MDT0001-mdtlov_UUID 4
mds12:  12 UP osp carbofs2-OST0004-osc-MDT0001 carbofs2-MDT0001-mdtlov_UUID 4
mds12:  13 UP osp carbofs2-OST0005-osc-MDT0001 carbofs2-MDT0001-mdtlov_UUID 4
mds12:  14 UP osp carbofs2-OST0006-osc-MDT0001 carbofs2-MDT0001-mdtlov_UUID 4
mds12:  15 UP osp carbofs2-OST0007-osc-MDT0001 carbofs2-MDT0001-mdtlov_UUID 4
mds12:  16 UP osp carbofs2-OST0008-osc-MDT0001 carbofs2-MDT0001-mdtlov_UUID 4
mds12:  17 UP osp carbofs2-OST0009-osc-MDT0001 carbofs2-MDT0001-mdtlov_UUID 4
mds12:  18 UP lwp carbofs2-MDT0000-lwp-MDT0001 carbofs2-MDT0000-lwp-MDT0001_UUID 4
mds12:  19 UP lod sandbox2-MDT0001-mdtlov sandbox2-MDT0001-mdtlov_UUID 3
mds12:  20 UP mdt sandbox2-MDT0001 sandbox2-MDT0001_UUID 432
mds12:  21 UP mdd sandbox2-MDD0001 sandbox2-MDD0001_UUID 3
mds12:  22 UP osp sandbox2-MDT0000-osp-MDT0001 sandbox2-MDT0001-mdtlov_UUID 4
mds12:  23 UP osp sandbox2-OST0000-osc-MDT0001 sandbox2-MDT0001-mdtlov_UUID 4
mds12:  24 UP osp sandbox2-OST0001-osc-MDT0001 sandbox2-MDT0001-mdtlov_UUID 4
mds12:  25 UP lwp sandbox2-MDT0000-lwp-MDT0001 sandbox2-MDT0000-lwp-MDT0001_UUID 4

-----------

# less lustre-log.1595648134.10751.dump
## NB: binary bytes rendered in ASCII as "<XX>"
…
00000004:00080000:12.0:1595612538.574964:0:15300:0:(osp_object.c:1592:osp_create()) carbofs2-OST0008-osc-MDT0001: Wrote last used FID: [0x480000400:0x14189e:0x0], index 8: 0
00000004:00080000:12.0:1595612538.645720:0:15300:0:(osp_object.c:1592:osp_create()) carbofs2-OST0002-osc-MDT0001: Wrote last used FID: [0x340000400:0x14161e:0x0], index 2: 0
00000004:00080000:12.0:1595612538.716636:0:15300:0:(osp_object.c:1592:osp_create()) carbofs2-OST0006-osc-MDT0001: Wrote last used FID: [0x440000400:0x1415fe:0x0], index 6: 0
00000004:00080000:12.0:1595612538.796067:0:15300:0:(osp_object.c:1592:osp_create()) carbofs2-OST0003-osc-MDT0001: Wrote last used FID: [0x2c0000400:0x1415fe:0x0], index 3: 0
00000004:00080000:12.0:1595612538.911165:0:15300:0:(osp_object.c:1592:osp_create()) carbofs2-OST0007-osc-MDT0001: Wrote last used FID: [0x3c0000400:0x14165e:0x0], index 7: 0
00000004:00080000:14.0:1595612538.992960:0:15972:0:(osp_object.c:1592:osp_create()) carbofs2-OST0009-osc-MDT0001: Wrote last used FID: [0x4c0000400:0x1416fe:0x0], index 9: 0
00000001:00080000:13.0:1595612539.108491:0:15302:0:(tgt_lastrcvd.c:1873:tgt_txn_stop_cb()) More than one transaction 339365002795
00000004:00080000:14.0:1595612539.109366:0:15972:0:(osp_object.c:1592:osp_create()) carbofs2-OST0000-osc-MDT0001: Wrote last used FID: [0x300000400:0x14159e:0x0], index 0: 0
00000004:00080000:14.0:1595612539.231687:0:15972:0:(osp_object.c:1592:osp_create()) carbofs2-OST0004-osc-MDT0001: Wrote last used FID: [0x400000400:0x1415be:0x0], index 4: 0
00010000:02000400:5.0F:1595612539.313917:0:10917:0:(ldlm_lib.c:790:target_handle_reconnect()) carbofs2-MDT0001: Client 487a7b66-9b5b-9003-26d7-7762d33cd348 (at 172.17.1.196 at o2ib) reconnecting
00010000:00080000:5.0:1595612539.313926:0:10917:0:(ldlm_lib.c:1227:target_handle_connect()) carbofs2-MDT0001: connection from 487a7b66-9b5b-9003-26d7-7762d33cd348 at 172.17.1.196@o2ib t339362463492 exp ffff8afccb79f400 cur 329484 last 1595612539
00000100:00080000:5.0:1595612539.313957:0:10917:0:(import.c:86:import_set_state_nolock()) ffff8ae6833d6000 0<BD><CB><DA><FB><8A><FF><FF>0<ED>0o<FB><8A><FF><FF>: changing import state from FULL to RECOVER
00000100:00080000:5.0:1595612539.313965:0:10917:0:(import.c:86:import_set_state_nolock()) ffff8ae6833d6000 0<BD><CB><DA><FB><8A><FF><FF>0<ED>0o<FB><8A><FF><FF>: changing import state from RECOVER to FULL
00000100:02000000:5.0:1595612539.313970:0:10917:0:(import.c:1597:ptlrpc_import_recovery_state_machine()) carbofs2-MDT0001: Connection restored to  (at 172.17.1.196 at o2ib)
00010000:02000400:5.0:1595612539.314189:0:10917:0:(ldlm_lib.c:790:target_handle_reconnect()) carbofs2-MDT0001: Client ad50161e-8d5b-d3ea-964d-8d7d7079b8b1 (at 172.17.1.191 at o2ib) reconnecting
00010000:00080000:5.0:1595612539.314194:0:10917:0:(ldlm_lib.c:1227:target_handle_connect()) carbofs2-MDT0001: connection from ad50161e-8d5b-d3ea-964d-8d7d7079b8b1 at 172.17.1.191@o2ib t339362463561 exp ffff8afccb798800 cur 329484 last 1595612539
00000100:00080000:5.0:1595612539.314206:0:10917:0:(import.c:86:import_set_state_nolock()) ffff8ae6833d4800 0<BD><CB><DA><FB><8A><FF><FF>0<ED>0o<FB><8A><FF><FF>: changing import state from FULL to RECOVER
00000100:00080000:5.0:1595612539.314210:0:10917:0:(import.c:86:import_set_state_nolock()) ffff8ae6833d4800 0<BD><CB><DA><FB><8A><FF><FF>0<ED>0o<FB><8A><FF><FF>: changing import state from RECOVER to FULL
00000100:02000000:5.0:1595612539.314214:0:10917:0:(import.c:1597:ptlrpc_import_recovery_state_machine()) carbofs2-MDT0001: Connection restored to  (at 172.17.1.191 at o2ib)
00010000:02000400:1.0:1595612539.315452:0:11315:0:(ldlm_lib.c:790:target_handle_reconnect()) carbofs2-MDT0001: Client b9600848-65b0-d97a-342f-cbe061feefb7 (at 172.17.1.194 at o2ib) reconnecting
00010000:00080000:1.0:1595612539.315458:0:11315:0:(ldlm_lib.c:1227:target_handle_connect()) carbofs2-MDT0001: connection from b9600848-65b0-d97a-342f-cbe061feefb7 at 172.17.1.194@o2ib t339362463551 exp ffff8afccb79f800 cur 329484 last 1595612539
00000100:00080000:1.0:1595612539.315478:0:11315:0:(import.c:86:import_set_state_nolock()) ffff8ae6833d5800 0<BD><CB><DA><FB><8A><FF><FF>0<ED>0o<FB><8A><FF><FF>: changing import state from FULL to RECOVER
00000100:00080000:1.0:1595612539.315483:0:11315:0:(import.c:86:import_set_state_nolock()) ffff8ae6833d5800 0<BD><CB><DA><FB><8A><FF><FF>0<ED>0o<FB><8A><FF><FF>: changing import state from RECOVER to FULL
00000100:02000000:1.0:1595612539.315488:0:11315:0:(import.c:1597:ptlrpc_import_recovery_state_machine()) carbofs2-MDT0001: Connection restored to  (at 172.17.1.194 at o2ib)
00000004:00080000:15.0:1595612539.352869:0:15972:0:(osp_object.c:1592:osp_create()) carbofs2-OST0001-osc-MDT0001: Wrote last used FID: [0x280000400:0x1416be:0x0], index 1: 0
00000004:00080000:15.0:1595612539.475291:0:15972:0:(osp_object.c:1592:osp_create()) carbofs2-OST0005-osc-MDT0001: Wrote last used FID: [0x380000400:0x14163f:0x0], index 5: 0
00000004:00080000:12.0:1595612539.592795:0:15300:0:(osp_object.c:1592:osp_create()) carbofs2-OST0008-osc-MDT0001: Wrote last used FID: [0x480000400:0x14189f:0x0], index 8: 0
00000004:00080000:12.0:1595612539.714588:0:15300:0:(osp_object.c:1592:osp_create()) carbofs2-OST0002-osc-MDT0001: Wrote last used FID: [0x340000400:0x14161f:0x0], index 2: 0
00000004:00080000:12.0:1595612539.827626:0:15300:0:(osp_object.c:1592:osp_create()) carbofs2-OST0006-osc-MDT0001: Wrote last used FID: [0x440000400:0x1415ff:0x0], index 6: 0
00000004:00080000:12.0:1595612539.940175:0:15300:0:(osp_object.c:1592:osp_create()) carbofs2-OST0003-osc-MDT0001: Wrote last used FID: [0x2c0000400:0x1415ff:0x0], index 3: 0
00000001:00080000:13.0:1595612540.050522:0:15302:0:(tgt_lastrcvd.c:1873:tgt_txn_stop_cb()) More than one transaction 339365002827
00000004:00080000:15.0:1595612540.051369:0:15972:0:(osp_object.c:1592:osp_create()) carbofs2-OST0007-osc-MDT0001: Wrote last used FID: [0x3c0000400:0x14165f:0x0], index 7: 0
00000004:00080000:15.0:1595612540.126322:0:15972:0:(osp_object.c:1592:osp_create()) carbofs2-OST0009-osc-MDT0001: Wrote last used FID: [0x4c0000400:0x1416ff:0x0], index 9: 0
00000004:00080000:15.0:1595612540.196376:0:15972:0:(osp_object.c:1592:osp_create()) carbofs2-OST0000-osc-MDT0001: Wrote last used FID: [0x300000400:0x14159f:0x0], index 0: 0
00000001:00080000:13.0:1595612540.232407:0:15302:0:(tgt_lastrcvd.c:1873:tgt_txn_stop_cb()) More than one transaction 339365002839
00000004:00080000:15.0:1595612540.267594:0:15972:0:(osp_object.c:1592:osp_create()) carbofs2-OST0004-osc-MDT0001: Wrote last used FID: [0x400000400:0x1415bf:0x0], index 4: 0
00000004:00080000:15.0:1595612540.338998:0:15972:0:(osp_object.c:1592:osp_create()) carbofs2-OST0001-osc-MDT0001: Wrote last used FID: [0x280000400:0x1416bf:0x0], index 1: 0
00000001:00080000:13.0:1595612540.356074:0:15302:0:(tgt_lastrcvd.c:1873:tgt_txn_stop_cb()) More than one transaction 339365002848
00000004:00080000:15.0:1595612540.456252:0:15972:0:(osp_object.c:1592:osp_create()) carbofs2-OST0005-osc-MDT0001: Wrote last used FID: [0x380000400:0x141640:0x0], index 5: 0
00000004:00080000:15.0:1595612540.533314:0:15972:0:(osp_object.c:1592:osp_create()) carbofs2-OST0008-osc-MDT0001: Wrote last used FID: [0x480000400:0x1418a0:0x0], index 8: 0
00000004:00080000:15.0:1595612540.603625:0:15972:0:(osp_object.c:1592:osp_create()) carbofs2-OST0002-osc-MDT0001: Wrote last used FID: [0x340000400:0x141620:0x0], index 2: 0
00000004:00080000:10.0:1595612540.673850:0:15972:0:(osp_object.c:1592:osp_create()) carbofs2-OST0006-osc-MDT0001: Wrote last used FID: [0x440000400:0x141600:0x0], index 6: 0
00000004:00080000:11.0F:1595612540.797357:0:15300:0:(osp_object.c:1592:osp_create()) carbofs2-OST0003-osc-MDT0001: Wrote last used FID: [0x2c0000400:0x141600:0x0], index 3: 0
00000004:00080000:13.0:1595612540.876671:0:15300:0:(osp_object.c:1592:osp_create()) carbofs2-OST0007-osc-MDT0001: Wrote last used FID: [0x3c0000400:0x141660:0x0], index 7: 0
00000004:00080000:13.0:1595612540.991352:0:15300:0:(osp_object.c:1592:osp_create()) carbofs2-OST0009-osc-MDT0001: Wrote last used FID: [0x4c0000400:0x141700:0x0], index 9: 0
00000001:00080000:13.0:1595612541.102669:0:15302:0:(tgt_lastrcvd.c:1873:tgt_txn_stop_cb()) More than one transaction 339365002879
…


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20200814/715ccfe8/attachment-0001.html>


More information about the lustre-discuss mailing list