[Lustre-discuss] ll_ost_creat_* goes berserk (100% cpu used - OST disabled)

Adrian Ulrich adrian at blinkenlights.ch
Fri Aug 13 10:49:01 PDT 2010


Hi,

For the past few hours we have had a problem with one of our OSTs:

One (and only one) ll_ost_create_ process on one of the OSTs
seems to have gone crazy and is using 100% CPU.

Rebooting the OST + MDS didn't help, and there isn't much
happening on the filesystem itself:

 - /proc/fs/lustre/ost/OSS/ost_create/stats is almost 'static'
 - iostat shows almost no usage
 - ib traffic is < 100 kb/s


The MDS logs this about every 3 minutes:
 Aug 13 19:11:14 mds1 kernel: LustreError: 11-0: an error occurred while communicating with 10.201.62.23 at o2ib. The ost_connect operation failed with -16
...and later:
 Aug 13 19:17:16 mds1 kernel: LustreError: 10253:0:(osc_create.c:390:osc_create()) lustre1-OST0005-osc: oscc recovery failed: -110
 Aug 13 19:17:16 mds1 kernel: LustreError: 10253:0:(lov_obd.c:1129:lov_clear_orphans()) error in orphan recovery on OST idx 5/32: rc = -110
 Aug 13 19:17:16 mds1 kernel: LustreError: 10253:0:(mds_lov.c:1022:__mds_lov_synchronize()) lustre1-OST0005_UUID failed at mds_lov_clear_orphans: -110
 Aug 13 19:17:16 mds1 kernel: LustreError: 10253:0:(mds_lov.c:1031:__mds_lov_synchronize()) lustre1-OST0005_UUID sync failed -110, deactivating
 Aug 13 19:17:54 mds1 kernel: Lustre: 6544:0:(import.c:508:import_select_connection()) lustre1-OST0005-osc: tried all connections, increasing latency to 51s

oops! (lustre1-OST0005 is hosted on the OSS with the crazy ll_ost_create process)
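For reference, the negative return codes in the MDS log above are negated Linux errno values. A minimal sketch for decoding them follows; the table is hardcoded with the two Linux values that appear in the log (so it does not depend on the machine it runs on), and the helper name decode_rc is only illustrative:

```python
# Linux errno values seen in the MDS log above.
# Hardcoded on purpose: errno numbering differs between operating systems.
LINUX_ERRNO = {
    16: ("EBUSY", "Device or resource busy"),
    110: ("ETIMEDOUT", "Connection timed out"),
}


def decode_rc(rc):
    """Turn a negative kernel return code into a readable string."""
    name, text = LINUX_ERRNO.get(abs(rc), ("UNKNOWN", "not in table"))
    return f"{rc}: {name} ({text})"


if __name__ == "__main__":
    for rc in (-16, -110):
        print(decode_rc(rc))
```

So the ost_connect attempt is refused as busy (-16), and the orphan recovery then times out (-110).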

On the affected OSS we get:
 Lustre: 11764:0:(ldlm_lib.c:835:target_handle_connect()) lustre1-OST0005: refuse reconnection from lustre1-mdtlov_UUID at 10.201.62.11@o2ib to 0xffff8102164d0200; still busy with 2 active RPCs
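The Lustre messages quoted in this mail share a "pid:N:(file:line:function())" prefix, which makes it easy to pull out the source location when sifting through many of them. A rough sketch, assuming only the layout visible in the lines above (the regex and the name parse_lustre_line are mine, not part of any Lustre tool):

```python
import re

# Matches the "Lustre: pid:N:(file.c:line:func()) message" layout seen above.
# Lines without a source location (e.g. "LustreError: 11-0: ...") do not match.
LOG_RE = re.compile(
    r"Lustre(?:Error)?: (?P<pid>\d+):\d+:"
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\) (?P<msg>.*)"
)


def parse_lustre_line(text):
    """Return pid, file, line, func and msg as a dict, or None if no match."""
    m = LOG_RE.search(text)
    if m is None:
        return None
    fields = m.groupdict()
    fields["pid"] = int(fields["pid"])
    fields["line"] = int(fields["line"])
    return fields


if __name__ == "__main__":
    sample = (
        "Lustre: 11764:0:(ldlm_lib.c:835:target_handle_connect()) "
        "lustre1-OST0005: refuse reconnection from lustre1-mdtlov_UUID at "
        "10.201.62.11@o2ib to 0xffff8102164d0200; still busy with 2 active RPCs"
    )
    print(parse_lustre_line(sample))
```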


$ llog_reader lustre-log.1281718692.11833 shows:
Bit 0 of 284875 not set
Bit -32510 of 284875 not set
Bit -32510 of 284875 not set
Bit -32511 of 284875 not set
Bit 0 of 284875 not set
Bit -1 of 284875 not set
Bit 0 of 284875 not set
Bit -32510 of 284875 not set
Bit -32510 of 284875 not set
Bit -32510 of 284875 not set
Bit -1 of 284875 not set
Bit 0 of 284875 not set
Segmentation fault <-- *ouch*


We also get tons of soft CPU lockups :-/

Any ideas?


Regards,
 Adrian




