[lustre-discuss] import_set_state_nolock() with binary args in lctl debug_file output?
Sternberg, Michael G.
sternberg at anl.gov
Fri Aug 14 14:44:18 PDT 2020
In lctl debug_file output, for import_set_state_nolock(), I sometimes see binary arguments (sample snippet at end of post), and figure that's not a good sign. How can I get to the bottom of this?
The only direct reference I could find is LU-7339 from 2015, which got no replies. I also see call traces on my system that are similar to those in LU-12037 "Possible DNE issue leading to hung filesystem". This led me to implement lru_size and lru_max_age workarounds. I suspect hardware issues within my IB network, but "ibqueryerrors -c -s PortXmitWait" shows nothing for Lustre servers.
The MDSs have on many occasions been getting overloaded by apps that do loads of operations on short (kB-size) files and directories. An MDS (usually but not always the one holding DNE remote directories!?) eventually reports health_check=NOT HEALTHY . As a workaround, I used the following incantation post-mount on MDSs:
ldlm.namespaces.*.lru_size=clear
ldlm.namespaces.*.lru_size=1000
ldlm.namespaces.*.lru_max_age=2s
I'd be grateful for pointers on how to go about diagnosing and resolving this issue.
With best regards,
Michael Sternberg
-----------
System context:
2 Lustre fs ("carbofs2", "sandbox2") on 2 MDSs (1 HA pair) and 6 OSS (3 HA pairs), on Lustre-2.12.4 (now .5), with DoM and DNE active. MDTs are ldiskfs on RAID1 SSD LUNs connected over FC.
# lsb_release -sir
CentOS 7.8.2003
# rpm -q lustre
lustre-2.12.5-1.el7.x86_64
# dsh -e -w mds11,mds12 lctl list_nids
mds11: 172.17.101.5 at o2ib
mds11: 172.16.101.5 at tcp
mds12: 172.17.101.6 at o2ib
mds12: 172.16.101.6 at tcp
# dsh -e -w mds11,mds12 lctl dl
mds11: 0 UP osd-ldiskfs MGS-osd MGS-osd_UUID 4
mds11: 1 UP mgs MGS MGS 442
mds11: 2 UP mgc MGC172.17.101.5 at o2ib 3adfc64b-3179-d8ff-ad27-42788cabd023 4
mds11: 3 UP osd-ldiskfs carbofs2-MDT0000-osd carbofs2-MDT0000-osd_UUID 17
mds11: 4 UP osd-ldiskfs sandbox2-MDT0000-osd sandbox2-MDT0000-osd_UUID 9
mds11: 5 UP mds MDS MDS_uuid 2
mds11: 6 UP lod carbofs2-MDT0000-mdtlov carbofs2-MDT0000-mdtlov_UUID 3
mds11: 7 UP mdt carbofs2-MDT0000 carbofs2-MDT0000_UUID 452
mds11: 8 UP mdd carbofs2-MDD0000 carbofs2-MDD0000_UUID 3
mds11: 9 UP qmt carbofs2-QMT0000 carbofs2-QMT0000_UUID 3
mds11: 10 UP osp carbofs2-MDT0001-osp-MDT0000 carbofs2-MDT0000-mdtlov_UUID 4
mds11: 11 UP osp carbofs2-OST0000-osc-MDT0000 carbofs2-MDT0000-mdtlov_UUID 4
mds11: 12 UP osp carbofs2-OST0001-osc-MDT0000 carbofs2-MDT0000-mdtlov_UUID 4
mds11: 13 UP osp carbofs2-OST0002-osc-MDT0000 carbofs2-MDT0000-mdtlov_UUID 4
mds11: 14 UP osp carbofs2-OST0003-osc-MDT0000 carbofs2-MDT0000-mdtlov_UUID 4
mds11: 15 UP osp carbofs2-OST0004-osc-MDT0000 carbofs2-MDT0000-mdtlov_UUID 4
mds11: 16 UP osp carbofs2-OST0005-osc-MDT0000 carbofs2-MDT0000-mdtlov_UUID 4
mds11: 17 UP osp carbofs2-OST0006-osc-MDT0000 carbofs2-MDT0000-mdtlov_UUID 4
mds11: 18 UP osp carbofs2-OST0007-osc-MDT0000 carbofs2-MDT0000-mdtlov_UUID 4
mds11: 19 UP osp carbofs2-OST0008-osc-MDT0000 carbofs2-MDT0000-mdtlov_UUID 4
mds11: 20 UP osp carbofs2-OST0009-osc-MDT0000 carbofs2-MDT0000-mdtlov_UUID 4
mds11: 21 UP lwp carbofs2-MDT0000-lwp-MDT0000 carbofs2-MDT0000-lwp-MDT0000_UUID 4
mds11: 22 UP lod sandbox2-MDT0000-mdtlov sandbox2-MDT0000-mdtlov_UUID 3
mds11: 23 UP mdt sandbox2-MDT0000 sandbox2-MDT0000_UUID 436
mds11: 24 UP mdd sandbox2-MDD0000 sandbox2-MDD0000_UUID 3
mds11: 25 UP qmt sandbox2-QMT0000 sandbox2-QMT0000_UUID 3
mds11: 26 UP osp sandbox2-MDT0001-osp-MDT0000 sandbox2-MDT0000-mdtlov_UUID 4
mds11: 27 UP osp sandbox2-OST0000-osc-MDT0000 sandbox2-MDT0000-mdtlov_UUID 4
mds11: 28 UP osp sandbox2-OST0001-osc-MDT0000 sandbox2-MDT0000-mdtlov_UUID 4
mds11: 29 UP lwp sandbox2-MDT0000-lwp-MDT0000 sandbox2-MDT0000-lwp-MDT0000_UUID 4
mds12: 0 UP osd-ldiskfs carbofs2-MDT0001-osd carbofs2-MDT0001-osd_UUID 16
mds12: 1 UP osd-ldiskfs sandbox2-MDT0001-osd sandbox2-MDT0001-osd_UUID 8
mds12: 2 UP mgc MGC172.17.101.5 at o2ib 8a65014c-0bf7-e566-56ec-1e13f7ad52af 4
mds12: 3 UP mds MDS MDS_uuid 2
mds12: 4 UP lod carbofs2-MDT0001-mdtlov carbofs2-MDT0001-mdtlov_UUID 3
mds12: 5 UP mdt carbofs2-MDT0001 carbofs2-MDT0001_UUID 448
mds12: 6 UP mdd carbofs2-MDD0001 carbofs2-MDD0001_UUID 3
mds12: 7 UP osp carbofs2-MDT0000-osp-MDT0001 carbofs2-MDT0001-mdtlov_UUID 4
mds12: 8 UP osp carbofs2-OST0000-osc-MDT0001 carbofs2-MDT0001-mdtlov_UUID 4
mds12: 9 UP osp carbofs2-OST0001-osc-MDT0001 carbofs2-MDT0001-mdtlov_UUID 4
mds12: 10 UP osp carbofs2-OST0002-osc-MDT0001 carbofs2-MDT0001-mdtlov_UUID 4
mds12: 11 UP osp carbofs2-OST0003-osc-MDT0001 carbofs2-MDT0001-mdtlov_UUID 4
mds12: 12 UP osp carbofs2-OST0004-osc-MDT0001 carbofs2-MDT0001-mdtlov_UUID 4
mds12: 13 UP osp carbofs2-OST0005-osc-MDT0001 carbofs2-MDT0001-mdtlov_UUID 4
mds12: 14 UP osp carbofs2-OST0006-osc-MDT0001 carbofs2-MDT0001-mdtlov_UUID 4
mds12: 15 UP osp carbofs2-OST0007-osc-MDT0001 carbofs2-MDT0001-mdtlov_UUID 4
mds12: 16 UP osp carbofs2-OST0008-osc-MDT0001 carbofs2-MDT0001-mdtlov_UUID 4
mds12: 17 UP osp carbofs2-OST0009-osc-MDT0001 carbofs2-MDT0001-mdtlov_UUID 4
mds12: 18 UP lwp carbofs2-MDT0000-lwp-MDT0001 carbofs2-MDT0000-lwp-MDT0001_UUID 4
mds12: 19 UP lod sandbox2-MDT0001-mdtlov sandbox2-MDT0001-mdtlov_UUID 3
mds12: 20 UP mdt sandbox2-MDT0001 sandbox2-MDT0001_UUID 432
mds12: 21 UP mdd sandbox2-MDD0001 sandbox2-MDD0001_UUID 3
mds12: 22 UP osp sandbox2-MDT0000-osp-MDT0001 sandbox2-MDT0001-mdtlov_UUID 4
mds12: 23 UP osp sandbox2-OST0000-osc-MDT0001 sandbox2-MDT0001-mdtlov_UUID 4
mds12: 24 UP osp sandbox2-OST0001-osc-MDT0001 sandbox2-MDT0001-mdtlov_UUID 4
mds12: 25 UP lwp sandbox2-MDT0000-lwp-MDT0001 sandbox2-MDT0000-lwp-MDT0001_UUID 4
-----------
# less lustre-log.1595648134.10751.dump
## NB: binary bytes rendered in ASCII as "<XX>"
…
00000004:00080000:12.0:1595612538.574964:0:15300:0:(osp_object.c:1592:osp_create()) carbofs2-OST0008-osc-MDT0001: Wrote last used FID: [0x480000400:0x14189e:0x0], index 8: 0
00000004:00080000:12.0:1595612538.645720:0:15300:0:(osp_object.c:1592:osp_create()) carbofs2-OST0002-osc-MDT0001: Wrote last used FID: [0x340000400:0x14161e:0x0], index 2: 0
00000004:00080000:12.0:1595612538.716636:0:15300:0:(osp_object.c:1592:osp_create()) carbofs2-OST0006-osc-MDT0001: Wrote last used FID: [0x440000400:0x1415fe:0x0], index 6: 0
00000004:00080000:12.0:1595612538.796067:0:15300:0:(osp_object.c:1592:osp_create()) carbofs2-OST0003-osc-MDT0001: Wrote last used FID: [0x2c0000400:0x1415fe:0x0], index 3: 0
00000004:00080000:12.0:1595612538.911165:0:15300:0:(osp_object.c:1592:osp_create()) carbofs2-OST0007-osc-MDT0001: Wrote last used FID: [0x3c0000400:0x14165e:0x0], index 7: 0
00000004:00080000:14.0:1595612538.992960:0:15972:0:(osp_object.c:1592:osp_create()) carbofs2-OST0009-osc-MDT0001: Wrote last used FID: [0x4c0000400:0x1416fe:0x0], index 9: 0
00000001:00080000:13.0:1595612539.108491:0:15302:0:(tgt_lastrcvd.c:1873:tgt_txn_stop_cb()) More than one transaction 339365002795
00000004:00080000:14.0:1595612539.109366:0:15972:0:(osp_object.c:1592:osp_create()) carbofs2-OST0000-osc-MDT0001: Wrote last used FID: [0x300000400:0x14159e:0x0], index 0: 0
00000004:00080000:14.0:1595612539.231687:0:15972:0:(osp_object.c:1592:osp_create()) carbofs2-OST0004-osc-MDT0001: Wrote last used FID: [0x400000400:0x1415be:0x0], index 4: 0
00010000:02000400:5.0F:1595612539.313917:0:10917:0:(ldlm_lib.c:790:target_handle_reconnect()) carbofs2-MDT0001: Client 487a7b66-9b5b-9003-26d7-7762d33cd348 (at 172.17.1.196 at o2ib) reconnecting
00010000:00080000:5.0:1595612539.313926:0:10917:0:(ldlm_lib.c:1227:target_handle_connect()) carbofs2-MDT0001: connection from 487a7b66-9b5b-9003-26d7-7762d33cd348 at 172.17.1.196@o2ib t339362463492 exp ffff8afccb79f400 cur 329484 last 1595612539
00000100:00080000:5.0:1595612539.313957:0:10917:0:(import.c:86:import_set_state_nolock()) ffff8ae6833d6000 0<BD><CB><DA><FB><8A><FF><FF>0<ED>0o<FB><8A><FF><FF>: changing import state from FULL to RECOVER
00000100:00080000:5.0:1595612539.313965:0:10917:0:(import.c:86:import_set_state_nolock()) ffff8ae6833d6000 0<BD><CB><DA><FB><8A><FF><FF>0<ED>0o<FB><8A><FF><FF>: changing import state from RECOVER to FULL
00000100:02000000:5.0:1595612539.313970:0:10917:0:(import.c:1597:ptlrpc_import_recovery_state_machine()) carbofs2-MDT0001: Connection restored to (at 172.17.1.196 at o2ib)
00010000:02000400:5.0:1595612539.314189:0:10917:0:(ldlm_lib.c:790:target_handle_reconnect()) carbofs2-MDT0001: Client ad50161e-8d5b-d3ea-964d-8d7d7079b8b1 (at 172.17.1.191 at o2ib) reconnecting
00010000:00080000:5.0:1595612539.314194:0:10917:0:(ldlm_lib.c:1227:target_handle_connect()) carbofs2-MDT0001: connection from ad50161e-8d5b-d3ea-964d-8d7d7079b8b1 at 172.17.1.191@o2ib t339362463561 exp ffff8afccb798800 cur 329484 last 1595612539
00000100:00080000:5.0:1595612539.314206:0:10917:0:(import.c:86:import_set_state_nolock()) ffff8ae6833d4800 0<BD><CB><DA><FB><8A><FF><FF>0<ED>0o<FB><8A><FF><FF>: changing import state from FULL to RECOVER
00000100:00080000:5.0:1595612539.314210:0:10917:0:(import.c:86:import_set_state_nolock()) ffff8ae6833d4800 0<BD><CB><DA><FB><8A><FF><FF>0<ED>0o<FB><8A><FF><FF>: changing import state from RECOVER to FULL
00000100:02000000:5.0:1595612539.314214:0:10917:0:(import.c:1597:ptlrpc_import_recovery_state_machine()) carbofs2-MDT0001: Connection restored to (at 172.17.1.191 at o2ib)
00010000:02000400:1.0:1595612539.315452:0:11315:0:(ldlm_lib.c:790:target_handle_reconnect()) carbofs2-MDT0001: Client b9600848-65b0-d97a-342f-cbe061feefb7 (at 172.17.1.194 at o2ib) reconnecting
00010000:00080000:1.0:1595612539.315458:0:11315:0:(ldlm_lib.c:1227:target_handle_connect()) carbofs2-MDT0001: connection from b9600848-65b0-d97a-342f-cbe061feefb7 at 172.17.1.194@o2ib t339362463551 exp ffff8afccb79f800 cur 329484 last 1595612539
00000100:00080000:1.0:1595612539.315478:0:11315:0:(import.c:86:import_set_state_nolock()) ffff8ae6833d5800 0<BD><CB><DA><FB><8A><FF><FF>0<ED>0o<FB><8A><FF><FF>: changing import state from FULL to RECOVER
00000100:00080000:1.0:1595612539.315483:0:11315:0:(import.c:86:import_set_state_nolock()) ffff8ae6833d5800 0<BD><CB><DA><FB><8A><FF><FF>0<ED>0o<FB><8A><FF><FF>: changing import state from RECOVER to FULL
00000100:02000000:1.0:1595612539.315488:0:11315:0:(import.c:1597:ptlrpc_import_recovery_state_machine()) carbofs2-MDT0001: Connection restored to (at 172.17.1.194 at o2ib)
00000004:00080000:15.0:1595612539.352869:0:15972:0:(osp_object.c:1592:osp_create()) carbofs2-OST0001-osc-MDT0001: Wrote last used FID: [0x280000400:0x1416be:0x0], index 1: 0
00000004:00080000:15.0:1595612539.475291:0:15972:0:(osp_object.c:1592:osp_create()) carbofs2-OST0005-osc-MDT0001: Wrote last used FID: [0x380000400:0x14163f:0x0], index 5: 0
00000004:00080000:12.0:1595612539.592795:0:15300:0:(osp_object.c:1592:osp_create()) carbofs2-OST0008-osc-MDT0001: Wrote last used FID: [0x480000400:0x14189f:0x0], index 8: 0
00000004:00080000:12.0:1595612539.714588:0:15300:0:(osp_object.c:1592:osp_create()) carbofs2-OST0002-osc-MDT0001: Wrote last used FID: [0x340000400:0x14161f:0x0], index 2: 0
00000004:00080000:12.0:1595612539.827626:0:15300:0:(osp_object.c:1592:osp_create()) carbofs2-OST0006-osc-MDT0001: Wrote last used FID: [0x440000400:0x1415ff:0x0], index 6: 0
00000004:00080000:12.0:1595612539.940175:0:15300:0:(osp_object.c:1592:osp_create()) carbofs2-OST0003-osc-MDT0001: Wrote last used FID: [0x2c0000400:0x1415ff:0x0], index 3: 0
00000001:00080000:13.0:1595612540.050522:0:15302:0:(tgt_lastrcvd.c:1873:tgt_txn_stop_cb()) More than one transaction 339365002827
00000004:00080000:15.0:1595612540.051369:0:15972:0:(osp_object.c:1592:osp_create()) carbofs2-OST0007-osc-MDT0001: Wrote last used FID: [0x3c0000400:0x14165f:0x0], index 7: 0
00000004:00080000:15.0:1595612540.126322:0:15972:0:(osp_object.c:1592:osp_create()) carbofs2-OST0009-osc-MDT0001: Wrote last used FID: [0x4c0000400:0x1416ff:0x0], index 9: 0
00000004:00080000:15.0:1595612540.196376:0:15972:0:(osp_object.c:1592:osp_create()) carbofs2-OST0000-osc-MDT0001: Wrote last used FID: [0x300000400:0x14159f:0x0], index 0: 0
00000001:00080000:13.0:1595612540.232407:0:15302:0:(tgt_lastrcvd.c:1873:tgt_txn_stop_cb()) More than one transaction 339365002839
00000004:00080000:15.0:1595612540.267594:0:15972:0:(osp_object.c:1592:osp_create()) carbofs2-OST0004-osc-MDT0001: Wrote last used FID: [0x400000400:0x1415bf:0x0], index 4: 0
00000004:00080000:15.0:1595612540.338998:0:15972:0:(osp_object.c:1592:osp_create()) carbofs2-OST0001-osc-MDT0001: Wrote last used FID: [0x280000400:0x1416bf:0x0], index 1: 0
00000001:00080000:13.0:1595612540.356074:0:15302:0:(tgt_lastrcvd.c:1873:tgt_txn_stop_cb()) More than one transaction 339365002848
00000004:00080000:15.0:1595612540.456252:0:15972:0:(osp_object.c:1592:osp_create()) carbofs2-OST0005-osc-MDT0001: Wrote last used FID: [0x380000400:0x141640:0x0], index 5: 0
00000004:00080000:15.0:1595612540.533314:0:15972:0:(osp_object.c:1592:osp_create()) carbofs2-OST0008-osc-MDT0001: Wrote last used FID: [0x480000400:0x1418a0:0x0], index 8: 0
00000004:00080000:15.0:1595612540.603625:0:15972:0:(osp_object.c:1592:osp_create()) carbofs2-OST0002-osc-MDT0001: Wrote last used FID: [0x340000400:0x141620:0x0], index 2: 0
00000004:00080000:10.0:1595612540.673850:0:15972:0:(osp_object.c:1592:osp_create()) carbofs2-OST0006-osc-MDT0001: Wrote last used FID: [0x440000400:0x141600:0x0], index 6: 0
00000004:00080000:11.0F:1595612540.797357:0:15300:0:(osp_object.c:1592:osp_create()) carbofs2-OST0003-osc-MDT0001: Wrote last used FID: [0x2c0000400:0x141600:0x0], index 3: 0
00000004:00080000:13.0:1595612540.876671:0:15300:0:(osp_object.c:1592:osp_create()) carbofs2-OST0007-osc-MDT0001: Wrote last used FID: [0x3c0000400:0x141660:0x0], index 7: 0
00000004:00080000:13.0:1595612540.991352:0:15300:0:(osp_object.c:1592:osp_create()) carbofs2-OST0009-osc-MDT0001: Wrote last used FID: [0x4c0000400:0x141700:0x0], index 9: 0
00000001:00080000:13.0:1595612541.102669:0:15302:0:(tgt_lastrcvd.c:1873:tgt_txn_stop_cb()) More than one transaction 339365002879
…
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20200814/715ccfe8/attachment-0001.html>
More information about the lustre-discuss
mailing list