[Lustre-discuss] Timeouts and Dumps

Denise Hummel denise_hummel at nrel.gov
Mon Dec 22 12:22:31 PST 2008


Hi;

A continuation of the timeouts and dumps.  The login node has hung three
times today - no one can login and anyone logged in cannot work.  The
hangs are lasting for 5 or more minutes.  
The messages on the OSS are:

Dec 22 13:00:44 oss1 kernel: LustreError: 138-a: lustre-OST0000: A
client on nid 172.16.100.1 at tcp was evicted due to a lock blocking
callback to 172.16.100.1 at tcp timed out: rc -107
Dec 22 13:00:44 oss1 kernel: LustreError:
27250:0:(ost_handler.c:1065:ost_brw_write()) @@@ Eviction on bulk GET
req at 00000100bff5c800 x91545/t0
o4->ab915dd6-4f50-ba8a-5cd4-05edb28fcc82 at NET_0x20000ac106401_UUID:-1
lens 384/352 ref 0 fl Interpret:/0/0 rc 0/0
Dec 22 13:00:44 oss1 kernel: LustreError:
27166:0:(ldlm_lockd.c:962:ldlm_handle_enqueue()) ### lock on destroyed
export 0000010087dc1000 ns: filter-lustre-OST0000_UUID lock:
0000010191bd54c0/0x0 lrc: 1/0,0 mode: --/PW res: 14313567/0 rrc: 1 type:
EXT [0->8191] (req 0->8191) flags: 20000080 remote: 0x4c7a2f2234322fa9
expref: 29 pid: 27166
Dec 22 13:00:44 oss1 kernel: LustreError:
27166:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error
(-107)  req at 00000101397be600 x91686/t0
o101->ab915dd6-4f50-ba8a-5cd4-05edb28fcc82 at NET_0x20000ac106401_UUID:-1
lens 232/288 ref 0 fl Interpret:/0/0 rc -107/0
Dec 22 13:00:44 oss1 kernel: Lustre:
27250:0:(ost_handler.c:1205:ost_brw_write()) lustre-OST0000: ignoring
bulk IO comm error with
ab915dd6-4f50-ba8a-5cd4-05edb28fcc82 at NET_0x20000ac106401_UUID id
12345-172.16.100.1 at tcp - client will retry
Dec 22 13:00:44 oss1 kernel: LustreError:
27314:0:(ost_handler.c:1065:ost_brw_write()) @@@ Eviction on bulk GET
req at 0000010065c0ca00 x91546/t0
o4->ab915dd6-4f50-ba8a-5cd4-05edb28fcc82 at NET_0x20000ac106401_UUID:-1
lens 384/352 ref 0 fl Interpret:/0/0 rc 0/0
Dec 22 13:00:45 oss1 kernel: LustreError:
27256:0:(ost_handler.c:1065:ost_brw_write()) @@@ Eviction on bulk GET
req at 0000010008df4a00 x91586/t0
o4->ab915dd6-4f50-ba8a-5cd4-05edb28fcc82 at NET_0x20000ac106401_UUID:-1
lens 384/352 ref 0 fl Interpret:/0/0 rc 0/0
Dec 22 13:00:45 oss1 kernel: LustreError:
27256:0:(ost_handler.c:1065:ost_brw_write()) Skipped 1 previous similar
message
Dec 22 13:00:45 oss1 kernel: Lustre:
27285:0:(ost_handler.c:1205:ost_brw_write()) lustre-OST0000: ignoring
bulk IO comm error with
ab915dd6-4f50-ba8a-5cd4-05edb28fcc82 at NET_0x20000ac106401_UUID id
12345-172.16.100.1 at tcp - client will retry
Dec 22 13:00:45 oss1 kernel: Lustre:
27285:0:(ost_handler.c:1205:ost_brw_write()) Skipped 1 previous similar
message
Dec 22 13:00:48 oss1 kernel: Lustre:
27245:0:(ost_handler.c:1205:ost_brw_write()) lustre-OST0000: ignoring
bulk IO comm error with
ab915dd6-4f50-ba8a-5cd4-05edb28fcc82 at NET_0x20000ac106401_UUID id
12345-172.16.100.1 at tcp - client will retry
Dec 22 13:00:48 oss1 kernel: Lustre:
27245:0:(ost_handler.c:1205:ost_brw_write()) Skipped 3 previous similar
messages
Dec 22 13:00:48 oss1 kernel: LustreError:
29210:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error
(-107)  req at 00000100bff84600 x91689/t0 o13-><?>@<?>:-1 lens 128/0 ref 0
fl Interpret:/0/0 rc -107/0
Dec 22 13:00:48 oss1 kernel: LustreError:
27116:0:(ldlm_lockd.c:1556:ldlm_cancel_handler()) operation 103 from
12345-172.16.100.1 at tcp with bad export cookie 13571729019316464878
Dec 22 13:00:48 oss1 kernel: LustreError:
27114:0:(ldlm_lockd.c:1556:ldlm_cancel_handler()) operation 103 from
12345-172.16.100.1 at tcp with bad export cookie 13571729019316464878

The messages in the syslog on the login node are:

Dec 22 13:00:45 head kernel: LustreError: 11-0: an error occurred while
communicating with 172.16.100.41 at tcp. The ldlm_enqueue operation failed
with -107
Dec 22 13:00:45 head kernel: Lustre:
lustre-OST0000-osc-000001018197f800: Connection to service
lustre-OST0000 via nid 172.16.100.41 at tcp was lost; in progress
operations using this service will wait for recovery to complete.
Dec 22 13:00:58 head kernel: LustreError: 11-0: an error occurred while
communicating with 172.16.100.41 at tcp. The ost_statfs operation failed
with -107
Dec 22 13:00:58 head kernel: LustreError:
3270:0:(ldlm_request.c:909:ldlm_cli_cancel_req()) Got rc -107 from
cancel RPC: canceling anyway
Dec 22 13:00:58 head kernel: LustreError:
3270:0:(file.c:803:ll_extent_lock_callback()) ldlm_cli_cancel failed:
-107
Dec 22 13:00:58 head kernel: LustreError:
3271:0:(ldlm_request.c:909:ldlm_cli_cancel_req()) Got rc -107 from
cancel RPC: canceling anyway
Dec 22 13:00:58 head kernel: LustreError:
3271:0:(file.c:803:ll_extent_lock_callback()) ldlm_cli_cancel failed:
-107
Dec 22 13:00:58 head kernel: LustreError: 167-0: This client was evicted
by lustre-OST0000; in progress operations using this service will fail.
Dec 22 13:00:58 head kernel: LustreError:
3248:0:(osc_request.c:1301:osc_brw_redo_request()) @@@ redo for
recoverable error  req at 00000101819c4a00 x91544/t0
o4->lustre-OST0000_UUID at 172.17.103.18@o2ib:28 lens 384/352 ref 2 fl
Complete:E/0/0 rc -5/-22
Dec 22 13:00:58 head kernel: LustreError:
3248:0:(osc_request.c:1301:osc_brw_redo_request()) Skipped 8 previous
similar messages
Dec 22 13:00:58 head kernel: LustreError:
19930:0:(llite_mmap.c:206:ll_tree_unlock()) couldn't unlock -5
Dec 22 13:00:58 head kernel: LustreError:
19930:0:(file.c:1052:ll_glimpse_size()) obd_enqueue returned rc -5,
returning -EIO
Dec 22 13:00:58 head kernel: LustreError:
19930:0:(file.c:1052:ll_glimpse_size()) Skipped 543 previous similar
messages
Dec 22 13:00:58 head kernel: LustreError:
19998:0:(file.c:1052:ll_glimpse_size()) obd_enqueue returned rc -5,
returning -EIO
Dec 22 13:00:58 head kernel: LustreError:
19998:0:(file.c:1052:ll_glimpse_size()) Skipped 195 previous similar
messages
Dec 22 13:01:00 head kernel: LustreError:
3248:0:(osc_request.c:1301:osc_brw_redo_request()) @@@ redo for
recoverable error  req at 000001017aa67c00 x91546/t0
o4->lustre-OST0000_UUID at 172.17.103.18@o2ib:28 lens 384/352 ref 2 fl
Complete:E/0/0 rc -5/-22
Dec 22 13:01:00 head kernel: LustreError:
3248:0:(osc_request.c:1301:osc_brw_redo_request()) Skipped 1 previous
similar message
Dec 22 13:01:00 head kernel: LustreError:
3248:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID
req at 00000100080d9400 x92325/t0
o4->lustre-OST0000_UUID at 172.17.103.18@o2ib:28 lens 384/352 ref 2 fl
Rpc:/0/0 rc 0/0
Dec 22 13:01:00 head kernel: LustreError:
3248:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID
req at 00000100f9fda200 x93266/t0
o4->lustre-OST0000_UUID at 172.17.103.18@o2ib:28 lens 384/352 ref 2 fl
Rpc:/0/0 rc 0/0
Dec 22 13:01:00 head kernel: LustreError:
3248:0:(client.c:519:ptlrpc_import_delay_req()) Skipped 35 previous
similar messages
Dec 22 13:01:00 head kernel: LustreError: Skipped 5 previous similar
messages
Dec 22 13:01:00 head kernel: LustreError:
19947:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID
req at 00000102faba0200 x91689/t0
o13->lustre-OST0000_UUID at 172.17.103.18@o2ib:28 lens 128/272 ref 1 fl
Rpc:RS/0/0 rc -107/-107
Dec 22 13:01:00 head kernel: LustreError:
19947:0:(client.c:519:ptlrpc_import_delay_req()) Skipped 5 previous
similar messages
Dec 22 13:01:00 head kernel: LustreError:
19947:0:(llite_lib.c:1520:ll_statfs_internal()) obd_statfs fails: rc =
-5
Dec 22 13:01:00 head kernel: LustreError:
20029:0:(llite_lib.c:1520:ll_statfs_internal()) obd_statfs fails: rc =
-5
Dec 22 13:01:00 head kernel: Lustre:
lustre-OST0000-osc-000001018197f800: Connection restored to service
lustre-OST0000 using nid 172.16.100.41 at tcp.
Dec 22 13:09:10 head kernel: LustreError: 11-0: an error occurred while
communicating with 172.16.100.41 at tcp. The ldlm_enqueue operation failed
with -107
Dec 22 13:09:10 head kernel: Lustre:
lustre-OST0000-osc-000001018197f800: Connection to service
lustre-OST0000 via nid 172.16.100.41 at tcp was lost; in progress
operations using this service will wait for recovery to complete.
Dec 22 13:09:24 head kernel: LustreError:
3268:0:(ldlm_request.c:909:ldlm_cli_cancel_req()) Got rc -107 from
cancel RPC: canceling anyway
Dec 22 13:09:24 head kernel: LustreError: 167-0: This client was evicted
by lustre-OST0000; in progress operations using this service will fail.
Dec 22 13:09:24 head kernel: LustreError:
23487:0:(llite_mmap.c:206:ll_tree_unlock()) couldn't unlock -5
Dec 22 13:09:24 head kernel: LustreError:
23487:0:(file.c:1052:ll_glimpse_size()) obd_enqueue returned rc -5,
returning -EIO
Dec 22 13:09:24 head kernel: LustreError:
23487:0:(file.c:1052:ll_glimpse_size()) Skipped 131 previous similar
messages
Dec 22 13:09:24 head kernel: LustreError:
23487:0:(namei.c:1132:ll_objects_destroy()) obd destroy objid 0xa485b6
error -5
Dec 22 13:09:24 head kernel: Lustre:
lustre-OST0000-osc-000001018197f800: Connection restored to service
lustre-OST0000 using nid 172.16.100.41 at tcp.

I also have 56 lustre dump logs in about 2 minutes.  I have attached
lustre.logs.tar.gz with all of the dumps.

If anyone has any suggestions, I would appreciate it.  
Denise

-------------- next part --------------
A non-text attachment was scrubbed...
Name: lustre.logs.tar.gz
Type: application/x-compressed-tar
Size: 93452 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20081222/b34ffa99/attachment.bin>


More information about the lustre-discuss mailing list