[Lustre-discuss] LBUG ASSERTION(lock->l_resource != NULL) failed

Brock Palen brockp at umich.edu
Wed Jan 14 07:14:59 PST 2009


I am seeing servers LBUG on a regular basis. Clients are running
1.6.6 patchless on RHEL4; servers are running RHEL4 with the 1.6.5.1
RPMs from the download page. All connections are over Ethernet, and
the servers are x4600s.

The OSS that BUG'd has in its log:

Jan 13 16:35:39 oss2 kernel: LustreError: 10243:0:(ldlm_lock.c:430:__ldlm_handle2lock()) ASSERTION(lock->l_resource != NULL) failed
Jan 13 16:35:39 oss2 kernel: LustreError: 10243:0:(tracefile.c:432:libcfs_assertion_failed()) LBUG
Jan 13 16:35:39 oss2 kernel: Lustre: 10243:0:(linux-debug.c:167:libcfs_debug_dumpstack()) showing stack for process 10243
Jan 13 16:35:39 oss2 kernel: ldlm_cn_08    R  running task       0 10243      1         10244  7776 (L-TLB)
Jan 13 16:35:39 oss2 kernel: 0000000000000000 ffffffffa0414629 00000103d83c7e00 0000000000000000
Jan 13 16:35:39 oss2 kernel:        00000101f8c88d40 ffffffffa021445e 00000103e315dd98 0000000000000001
Jan 13 16:35:39 oss2 kernel:        00000101f3993ea0 0000000000000000
Jan 13 16:35:39 oss2 kernel: Call Trace:<ffffffffa0414629>{:ptlrpc:ptlrpc_server_handle_request+2457}
Jan 13 16:35:39 oss2 kernel:        <ffffffffa021445e>{:libcfs:lcw_update_time+30} <ffffffff80133855>{__wake_up_common+67}
Jan 13 16:35:39 oss2 kernel:        <ffffffffa0416d05>{:ptlrpc:ptlrpc_main+3989} <ffffffffa0415270>{:ptlrpc:ptlrpc_retry_rqbds+0}
Jan 13 16:35:39 oss2 kernel:        <ffffffffa0415270>{:ptlrpc:ptlrpc_retry_rqbds+0} <ffffffffa0415270>{:ptlrpc:ptlrpc_retry_rqbds+0}
Jan 13 16:35:39 oss2 kernel:        <ffffffff80110de3>{child_rip+8} <ffffffffa0415d70>{:ptlrpc:ptlrpc_main+0}
Jan 13 16:35:39 oss2 kernel:        <ffffffff80110ddb>{child_rip+0}
Jan 13 16:35:40 oss2 kernel: LustreError: dumping log to /tmp/lustre-log.1231882539.10243
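
For anyone wanting to check other servers for the same failure, here is a minimal sketch that picks ASSERTION and LBUG entries out of syslog lines. It assumes each entry sits on a single unwrapped line, as it would in the log file itself rather than in a wrapped email. The function name find_assertions and the field names are made up for illustration and are not part of Lustre.

```python
import re

# One LustreError syslog entry on a single, unwrapped line, e.g.
# "Jan 13 16:35:39 oss2 kernel: LustreError: 10243:0:(ldlm_lock.c:430:__ldlm_handle2lock()) ..."
LUSTRE_ERR = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s[\d:]{8})\s(?P<host>\S+)\skernel:\s"
    r"LustreError:\s(?P<pid>\d+):\d+:"
    r"\((?P<file>[\w.-]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s(?P<msg>.*)$"
)


def find_assertions(lines):
    """Return one dict per LustreError line whose message is an ASSERTION or LBUG."""
    hits = []
    for raw in lines:
        m = LUSTRE_ERR.match(raw.strip())
        if m is None:
            continue  # not a LustreError line with a (file:line:func()) tag
        msg = m.group("msg")
        if msg.startswith("ASSERTION") or msg == "LBUG":
            hits.append(m.groupdict())
    return hits
```

Feeding it the lines of /var/log/messages gives the host, pid, source file, line, and function for each hit, which makes it easier to see whether every LBUG comes from the same assertion.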


At the same time a client (nyx346) lost contact with that OSS, and it
has not been allowed to reconnect since.
Client /var/log/messages:

Jan 13 16:37:20 nyx346 kernel: Lustre: nobackup-OST000d-osc-000001022c2a7800: Connection to service nobackup-OST000d via nid 10.164.3.245@tcp was lost; in progress operations using this service will wait for recovery to complete.
Jan 13 16:37:20 nyx346 kernel: Lustre: Skipped 6 previous similar messages
Jan 13 16:37:20 nyx346 kernel: LustreError: 3889:0:(ldlm_request.c:996:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway
Jan 13 16:37:20 nyx346 kernel: LustreError: 3889:0:(ldlm_request.c:1605:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
Jan 13 16:37:20 nyx346 kernel: LustreError: 11-0: an error occurred while communicating with 10.164.3.245@tcp. The ost_connect operation failed with -16
Jan 13 16:37:20 nyx346 kernel: LustreError: Skipped 10 previous similar messages
Jan 13 16:37:45 nyx346 kernel: Lustre: 3849:0:(import.c:410:import_select_connection()) nobackup-OST000d-osc-000001022c2a7800: tried all connections, increasing latency to 7s

Even now the server (OSS) is refusing connections to OST000d, with the
message:

Lustre: 9631:0:(ldlm_lib.c:760:target_handle_connect()) nobackup-OST000d: refuse reconnection from 145a1ec5-07ef-f7eb-0ca9-2a2b6503e0cd@10.164.1.90@tcp to 0x00000103d5ce7000; still busy with 2 active RPCs
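
To see which clients are being refused and how many RPCs the server still counts against them, a rough sketch like the following could pull the fields out of that message. It assumes the message sits on one unwrapped line, and it accepts either "@" or the archive's " at " between the client UUID and its NID. The name parse_refusal and the dictionary keys are hypothetical, chosen only for this example.

```python
import re

# Matches the target_handle_connect() refusal text shown above.
REFUSAL = re.compile(
    r"(?P<target>\S+): refuse reconnection from "
    r"(?P<uuid>[0-9a-f-]+)(?:@| at )(?P<nid>\S+) to \S+; "
    r"still busy with (?P<rpcs>\d+) active RPCs"
)


def parse_refusal(line):
    """Return target, client UUID, client NID and active RPC count, or None."""
    m = REFUSAL.search(line)
    if m is None:
        return None
    info = m.groupdict()
    info["rpcs"] = int(info["rpcs"])  # count as a number for easy comparison
    return info
```

Running this over the OSS log over time would show whether the active RPC count for the stuck client ever drops, or stays at 2 until the reboot.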


If I reboot the OSS, the OSTs on it go through recovery as normal,
and then the client is fine.

The network looks clean. I found one machine with lots of dropped
packets between the servers, but that is not the client in question.

Thank you! If it happens again and I find any other data, I will let
you know.


Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985





