[lustre-discuss] Random drop off OST from clients

Lixin Liu liu at sfu.ca
Thu Oct 5 22:09:15 PDT 2023


Hi,

Recently, we have frequently seen OSTs being randomly dropped by some client nodes.

We have 4 Lustre filesystems with a total of 126 OSTs. All clients run the Lustre 2.15.3 client on CentOS 7.
Servers are CentOS 7 with Lustre 2.12.8 (3 filesystems) and Alma 8.8 with Lustre 2.15.3. Failures occur
with both server versions. LNet runs over an OPA interface.

One example of the failure looks like this:

# lctl dl | grep ' IN '
126 IN osc cedar_sc-OST000a-osc-ffff980c76944800 52e66575-6443-4be9-a7ce-348b526a0836 4
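To list only the affected device names, the same check can be wrapped in a small filter. This is a minimal sketch; the function name inactive_devices is made up for illustration, and it assumes the column layout seen in the line above (index, status, type, device name, UUID, refcount).

```shell
# Print the device name (column 4) of every entry whose status
# column (column 2) is "IN". Reads `lctl dl`-style text on stdin,
# so it can be tried on saved output without a Lustre client.
inactive_devices() {
    awk '$2 == "IN" { print $4 }'
}
```

On a client this would be used as `lctl dl | inactive_devices`.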

In syslog, we see:

Oct  4 23:24:30 cedar5 kernel: LustreError: 11-0: cedar_sc-OST000a-osc-ffff980c76944800: operation ldlm_enqueue to node 172.19.128.33@o2ib failed: rc = -107
Oct  4 23:24:30 cedar5 kernel: Lustre: cedar_sc-OST000a-osc-ffff980c76944800: Connection to cedar_sc-OST000a (at 172.19.128.33@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Oct  4 23:24:30 cedar5 kernel: LustreError: 5195:0:(osc_request.c:1037:osc_init_grant()) cedar_sc-OST000a-osc-ffff980c76944800: granted 3407872 but already consumed 519700480
Oct  4 23:24:30 cedar5 kernel: LustreError: 167-0: cedar_sc-OST000a-osc-ffff980c76944800: This client was evicted by cedar_sc-OST000a; in progress operations using this service will fail.
Oct  4 23:24:31 cedar5 kernel: LustreError: 62880:0:(ldlm_resource.c:1126:ldlm_resource_complain()) cedar_sc-OST000a-osc-ffff980c76944800: namespace resource [0x73fbbe2:0x0:0x0].0x0 (ffff97fe127e3080) refcount nonzero (1) after lock cleanup; forcing cleanup.
Oct  4 23:24:31 cedar5 kernel: LustreError: 5218:0:(osc_request.c:711:osc_announce_cached()) cedar_sc-OST000a-osc-ffff980c76944800: dirty 131074 > system dirty_max 131072
Oct  4 23:24:36 cedar5 kernel: LustreError: 5209:0:(osc_request.c:711:osc_announce_cached()) cedar_sc-OST000a-osc-ffff980c76944800: dirty 131074 > system dirty_max 131072
Oct  4 23:24:47 cedar5 kernel: LustreError: 5220:0:(osc_request.c:711:osc_announce_cached()) cedar_sc-OST000a-osc-ffff980c76944800: dirty 131072 > system dirty_max 131072
Oct  4 23:25:36 cedar5 kernel: LustreError: 5242:0:(osc_request.c:711:osc_announce_cached()) cedar_sc-OST000a-osc-ffff980c76944800: dirty 131074 > system dirty_max 131072
....

This one in particular is on a 2.15.3 server. Once this happens, rebooting the client appears to be
the only way to clear it; after the reboot the issue goes away.
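To see whether the evictions cluster on particular OSTs, the eviction messages can be counted per target. This is a rough sketch that relies only on the "This client was evicted by" wording shown in the syslog excerpt above; the function name evictions_by_target is made up for illustration, and the log file location depends on the local syslog setup.

```shell
# Count eviction messages per target from syslog-style text on stdin.
# Prints one "<count> <target>" line for each target that evicted
# this client, based on the "This client was evicted by <target>;" text.
evictions_by_target() {
    sed -n 's/.*This client was evicted by \([^;]*\);.*/\1/p' \
        | sort | uniq -c | awk '{ print $1, $2 }'
}
```

For example, `evictions_by_target < /var/log/messages` on a client, assuming kernel messages are written to that file.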

Any ideas on where we should look?

Thank you very much.

Lixin.




