[Lustre-discuss] 1.8 : recurrent LBUG's on clients

Guillaume Demillecamps guillaume at multipurpose.be
Fri Jul 31 00:15:52 PDT 2009


Hello,


All servers and clients run Lustre 1.8 on SLES 10 SP2. Clients
use patchless kernels at the same base revision as the patched
kernels on the servers.
We repeatedly encounter this error:

Server log :
------------
Jul 30 06:11:47 BEESPBESXFIL27 kernel: LustreError:
22061:0:(mds_open.c:1665:mds_close()) @@@ no handle for file close ino
5606195: cookie 0x5ed7d8c3d1299f40  req@ffff810065a60400
x1308791892785337/t0
o35->4f104403-eb03-83be-2910-2fd7cc26087c@NET_0x20000c0a84410_UUID:0/0
lens 408/864 e 0 to 0 dl 1248927113 ref 1 fl Interpret:/0/0 rc 0/0
Jul 30 06:11:47 BEESPBESXFIL27 kernel: LustreError:
22061:0:(ldlm_lib.c:1826:target_send_reply_msg()) @@@ processing error
(-116)  req@ffff810065a60400 x1308791892785337/t0
o35->4f104403-eb03-83be-2910-2fd7cc26087c@NET_0x20000c0a84410_UUID:0/0
lens 408/864 e 0 to 0 dl 1248927113 ref 1 fl Interpret:/0/0 rc -116/0
Jul 30 06:11:47 BEESPBESXFIL27 kernel: LustreError:
22061:0:(mds_open.c:1665:mds_close()) @@@ no handle for file close ino
5606200: cookie 0x5ed7d8c3d129a361  req@ffff810071b28400
x1308791892785342/t0
o35->4f104403-eb03-83be-2910-2fd7cc26087c@NET_0x20000c0a84410_UUID:0/0
lens 408/864 e 0 to 0 dl 1248927113 ref 1 fl Interpret:/0/0 rc 0/0
Jul 30 06:11:47 BEESPBESXFIL27 kernel: LustreError:
22061:0:(mds_open.c:1665:mds_close()) Skipped 4 previous similar
messages
Jul 30 06:11:47 BEESPBESXFIL27 kernel: LustreError:
22061:0:(ldlm_lib.c:1826:target_send_reply_msg()) @@@ processing error
(-116)  req@ffff810071b28400 x1308791892785342/t0
o35->4f104403-eb03-83be-2910-2fd7cc26087c@NET_0x20000c0a84410_UUID:0/0
lens 408/864 e 0 to 0 dl 1248927113 ref 1 fl Interpret:/0/0 rc -116/0
Jul 30 06:11:47 BEESPBESXFIL27 kernel: LustreError:
22061:0:(ldlm_lib.c:1826:target_send_reply_msg()) Skipped 4 previous
similar messages


Client log:
-----------
Jul 30 06:11:47 BEESPDESXAPP06 kernel: LustreError: 11-0: an error
occurred while communicating with 172.16.0.55@tcp. The mds_close
operation failed with -116
Jul 30 06:11:47 BEESPDESXAPP06 kernel: LustreError:  
13298:0:(file.c:114:ll_close_inode_openhandle()) inode 5606195 mdc  
close failed: rc = -116
Jul 30 06:11:47 BEESPDESXAPP06 kernel: LustreError:  
13298:0:(file.c:114:ll_close_inode_openhandle()) Skipped 1 previous  
similar message
Jul 30 06:11:47 BEESPDESXAPP06 kernel: LustreError:  
13298:0:(file.c:114:ll_close_inode_openhandle()) inode 5606155 mdc  
close failed: rc = -116
Jul 30 06:11:47 BEESPDESXAPP06 kernel: LustreError:  
13298:0:(file.c:114:ll_close_inode_openhandle()) Skipped 3 previous  
similar messages
Jul 30 06:11:47 BEESPDESXAPP06 kernel: LustreError: 11-0: an error
occurred while communicating with 172.16.0.55@tcp. The mds_close
operation failed with -116
Jul 30 06:11:47 BEESPDESXAPP06 kernel: LustreError: Skipped 7 previous  
similar messages
Jul 30 06:11:47 BEESPDESXAPP06 kernel: LustreError:  
13298:0:(ldlm_lock.c:602:ldlm_lock_decref_internal_nolock())  
ASSERTION(lock->l_writers > 0) failed
Jul 30 06:11:47 BEESPDESXAPP06 kernel: LustreError:  
13298:0:(ldlm_lock.c:602:ldlm_lock_decref_internal_nolock()) LBUG
Jul 30 06:11:47 BEESPDESXAPP06 kernel:
Jul 30 06:11:47 BEESPDESXAPP06 kernel: Call Trace:  
<ffffffff88257aea>{:libcfs:lbug_with_loc+122}
Jul 30 06:11:47 BEESPDESXAPP06 kernel:         
<ffffffff8825fe00>{:libcfs:tracefile_init+0}  
<ffffffff8835d566>{:ptlrpc:ldlm_lock_decref_internal_nolock+182}
Jul 30 06:11:47 BEESPDESXAPP06 kernel:         
<ffffffff8838533b>{:ptlrpc:ldlm_process_flock_lock+4139}
Jul 30 06:11:47 BEESPDESXAPP06 kernel:         
<ffffffff883864ef>{:ptlrpc:ldlm_flock_completion_ast+2111}
Jul 30 06:11:47 BEESPDESXAPP06 kernel:         
<ffffffff8835f4a9>{:ptlrpc:ldlm_lock_enqueue+2169}  
<ffffffff88377ca0>{:ptlrpc:ldlm_cli_enqueue_fini+2624}
Jul 30 06:11:47 BEESPDESXAPP06 kernel:         
<ffffffff88376fd3>{:ptlrpc:ldlm_prep_elc_req+755}  
<ffffffff8835bc0d>{:ptlrpc:ldlm_lock_create+2541}
Jul 30 06:11:47 BEESPDESXAPP06 kernel:         
<ffffffff8012c668>{default_wake_function+0}  
<ffffffff88379ae2>{:ptlrpc:ldlm_cli_enqueue+1666}
Jul 30 06:11:47 BEESPDESXAPP06 kernel:         
<ffffffff88523fcf>{:lustre:ll_file_flock+1407}  
<ffffffff88385cb0>{:ptlrpc:ldlm_flock_completion_ast+0}
Jul 30 06:11:47 BEESPDESXAPP06 kernel:         
<ffffffff8019ae2e>{locks_remove_posix+132}  
<ffffffff80147fdc>{bit_waitqueue+56}
Jul 30 06:11:47 BEESPDESXAPP06 kernel:         
<ffffffff80190241>{flush_old_exec+2729} <ffffffff80186fc1>{__fput+355}
Jul 30 06:11:47 BEESPDESXAPP06 kernel:         
<ffffffff8018455b>{filp_close+84}  
<ffffffff801360b7>{put_files_struct+107}
Jul 30 06:11:47 BEESPDESXAPP06 kernel:         
<ffffffff8010aecb>{sysret_signal+28} <ffffffff8013725c>{do_exit+684}
Jul 30 06:11:47 BEESPDESXAPP06 kernel:         
<ffffffff80137995>{sys_exit_group+0}  
<ffffffff8014083c>{get_signal_to_deliver+1394}
Jul 30 06:11:47 BEESPDESXAPP06 kernel:         
<ffffffff8010aecb>{sysret_signal+28} <ffffffff8010a19c>{do_signal+118}
Jul 30 06:11:47 BEESPDESXAPP06 kernel:         
<ffffffff8012c668>{default_wake_function+0}  
<ffffffff8014b227>{do_futex+104}
Jul 30 06:11:47 BEESPDESXAPP06 kernel:         
<ffffffff801743b2>{sys_mprotect+1742}  
<ffffffff8010aecb>{sysret_signal+28}
Jul 30 06:11:47 BEESPDESXAPP06 kernel:         
<ffffffff8010b14f>{ptregscall_common+103}
Jul 30 06:11:47 BEESPDESXAPP06 kernel: LustreError: dumping log to  
/tmp/lustre-log.1248927107.13298
Jul 30 06:11:47 BEESPDESXAPP06 kernel: Fixing recursive fault but  
reboot is needed!
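
As a reading aid for the logs above, here is a minimal Python sketch that decodes two of the numbers in them. It is not part of Lustre: the helper names decode_rc and decode_dump_name are made up for illustration, the errno table assumes standard Linux numbering (where 116 is ESTALE, "stale file handle"), and it assumes the first number in the lustre-log file name is seconds since the Unix epoch and the second is the process id seen in the "13298:0:" prefix.

```python
import re
from datetime import datetime, timezone

# Assumption: standard Linux errno numbering; only the code seen in the logs is listed.
ERRNO_NAMES = {116: "ESTALE"}  # "stale file handle"

# Matches "rc = -116" and "failed with -116" as they appear in the client log.
RC_PATTERN = re.compile(r"(?:rc = |failed with )(-\d+)")
# Matches the dump file name, e.g. lustre-log.1248927107.13298
DUMP_PATTERN = re.compile(r"lustre-log\.(\d+)\.(\d+)")


def decode_rc(line):
    """Return (code, errno name) for the first negative rc in a log line, or None."""
    match = RC_PATTERN.search(line)
    if match is None:
        return None
    code = int(match.group(1))
    return code, ERRNO_NAMES.get(-code, "unknown")


def decode_dump_name(path):
    """Return (UTC datetime, pid) parsed from a lustre-log dump file name, or None."""
    match = DUMP_PATTERN.search(path)
    if match is None:
        return None
    when = datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)
    return when, int(match.group(2))


if __name__ == "__main__":
    print(decode_rc("close failed: rc = -116"))
    when, pid = decode_dump_name("/tmp/lustre-log.1248927107.13298")
    print(when.isoformat(), pid)
```

With the values from the client log, the dump name decodes to 2009-07-30T04:11:47+00:00, which lines up with the 06:11:47 syslog timestamps if the hosts log in a UTC+2 local time (an assumption about the machines, not something the logs state).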

After that, the client does indeed need a reboot. What does this mean?
Could it be related to sys.timeouts and/or ldlm_timeouts being too short?


Regards,


Guillaume Demillecamps



