[Lustre-discuss] ll_ost_creat_* goes berserk (100% cpu used - OST disabled)
Adrian Ulrich
adrian at blinkenlights.ch
Fri Aug 13 10:49:01 PDT 2010
Hi,
For the past few hours we have had a problem with one of our OSTs:
One (and only one) ll_ost_create_ process on one of the OSTs
seems to go crazy and uses 100% CPU.
Rebooting the OST + MDS didn't help, and there isn't much
happening on the filesystem itself:
- /proc/fs/lustre/ost/OSS/ost_create/stats is almost 'static'
- iostat shows almost no usage
- ib traffic is < 100 kb/s
The MDS logs this about every 3 minutes:
Aug 13 19:11:14 mds1 kernel: LustreError: 11-0: an error occurred while communicating with 10.201.62.23@o2ib. The ost_connect operation failed with -16
..and later:
Aug 13 19:17:16 mds1 kernel: LustreError: 10253:0:(osc_create.c:390:osc_create()) lustre1-OST0005-osc: oscc recovery failed: -110
Aug 13 19:17:16 mds1 kernel: LustreError: 10253:0:(lov_obd.c:1129:lov_clear_orphans()) error in orphan recovery on OST idx 5/32: rc = -110
Aug 13 19:17:16 mds1 kernel: LustreError: 10253:0:(mds_lov.c:1022:__mds_lov_synchronize()) lustre1-OST0005_UUID failed at mds_lov_clear_orphans: -110
Aug 13 19:17:16 mds1 kernel: LustreError: 10253:0:(mds_lov.c:1031:__mds_lov_synchronize()) lustre1-OST0005_UUID sync failed -110, deactivating
Aug 13 19:17:54 mds1 kernel: Lustre: 6544:0:(import.c:508:import_select_connection()) lustre1-OST0005-osc: tried all connections, increasing latency to 51s
oops! (lustre1-OST0005 is hosted on the OSS with the crazy ll_ost_create process)
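The return codes in the MDS messages above (-16 and -110) look like negated Linux errno values, which is the usual kernel convention. A minimal Python sketch for decoding them, assuming that convention holds here; the table `LINUX_ERRNO` and the helper `decode_rc` are hypothetical names used only for illustration:

```python
# Linux errno values, hard-coded so the sketch gives the same answer on any
# platform. Only the two codes seen in the logs above are listed.
LINUX_ERRNO = {
    16: ("EBUSY", "Device or resource busy"),
    110: ("ETIMEDOUT", "Connection timed out"),
}


def decode_rc(rc: int) -> str:
    """Map a negative kernel-style return code to its errno name."""
    name, text = LINUX_ERRNO.get(-rc, ("UNKNOWN", "not in this table"))
    return f"{rc}: {name} ({text})"


for rc in (-16, -110):
    print(decode_rc(rc))
```

Read this way, -16 (EBUSY) on ost_connect would fit an OST that refuses the connection because it is still busy, and -110 (ETIMEDOUT) would fit orphan recovery giving up after waiting.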
On the affected OSS we get:
Lustre: 11764:0:(ldlm_lib.c:835:target_handle_connect()) lustre1-OST0005: refuse reconnection from lustre1-mdtlov_UUID@10.201.62.11@o2ib to 0xffff8102164d0200; still busy with 2 active RPCs
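For sifting through many such messages, the `pid:N:(file:line:function())` prefix can be pulled apart with a regular expression. This is only a sketch based on the lines quoted in this mail, not on any documented log format; `LUSTRE_PREFIX` and `parse_lustre_line` are hypothetical names:

```python
import re

# Pattern for the "pid:N:(file:line:function())" prefix seen in the messages
# quoted in this mail. The meaning of the second numeric field is not
# assumed here, so it is matched but not captured.
LUSTRE_PREFIX = re.compile(
    r"(?P<level>Lustre(?:Error)?): (?P<pid>\d+):\d+:"
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\) (?P<msg>.*)"
)


def parse_lustre_line(text: str):
    """Return a dict of prefix fields, or None if the line has no such prefix."""
    m = LUSTRE_PREFIX.search(text)
    if m is None:
        return None
    fields = m.groupdict()
    fields["pid"] = int(fields["pid"])
    fields["line"] = int(fields["line"])
    return fields


sample = ("Aug 13 19:17:16 mds1 kernel: LustreError: 10253:0:"
          "(osc_create.c:390:osc_create()) lustre1-OST0005-osc: "
          "oscc recovery failed: -110")
print(parse_lustre_line(sample))
```

Lines without that prefix, such as the "LustreError: 11-0: ..." message, simply yield None, so they can be counted or handled separately.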
Running llog_reader on lustre-log.1281718692.11833 shows:
Bit 0 of 284875 not set
Bit -32510 of 284875 not set
Bit -32510 of 284875 not set
Bit -32511 of 284875 not set
Bit 0 of 284875 not set
Bit -1 of 284875 not set
Bit 0 of 284875 not set
Bit -32510 of 284875 not set
Bit -32510 of 284875 not set
Bit -32510 of 284875 not set
Bit -1 of 284875 not set
Bit 0 of 284875 not set
Segmentation fault <-- *ouch*
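To line the dump up with the syslog messages, the numbers in the file name can be decoded. The sketch below assumes the name has the form lustre-log.<epoch seconds>.<pid>, which is an assumption based only on how the name looks; `parse_dump_name` is a hypothetical helper:

```python
from datetime import datetime, timezone


def parse_dump_name(name: str):
    """Split 'lustre-log.<epoch>.<pid>' into a UTC datetime and a PID.

    Assumes the middle field is seconds since the Unix epoch.
    """
    prefix, epoch, pid = name.rsplit(".", 2)
    if prefix != "lustre-log":
        raise ValueError(f"unexpected file name: {name}")
    return datetime.fromtimestamp(int(epoch), tz=timezone.utc), int(pid)


when, pid = parse_dump_name("lustre-log.1281718692.11833")
print(when.isoformat(), pid)
```

The result is in UTC, so it has to be shifted to the local timezone of the servers before comparing it with the timestamps in the kernel log.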
We also get tons of soft CPU lockups :-/
Any ideas?
Regards,
Adrian