[Lustre-discuss] 1.8 : recurrent LBUG's on clients
Guillaume Demillecamps
guillaume at multipurpose.be
Fri Jul 31 00:15:52 PDT 2009
Hello,
All servers and clients run Lustre 1.8 on SLES 10 SP2. The clients
use patchless kernels built from the same base revision as the
patched kernels on the servers.
We repeatedly encounter this error:
Server log:
-----------
Jul 30 06:11:47 BEESPBESXFIL27 kernel: LustreError:
22061:0:(mds_open.c:1665:mds_close()) @@@ no handle for file close ino
5606195: cookie 0x5ed7d8c3d1299f40 req at ffff810065a60400
x1308791892785337/t0
o35->4f104403-eb03-83be-2910-2fd7cc26087c at NET_0x20000c0a84410_UUID:0/0
lens 408/864 e 0 to 0 dl 1248927113 ref 1 fl Interpret:/0/0 rc 0/0
Jul 30 06:11:47 BEESPBESXFIL27 kernel: LustreError:
22061:0:(ldlm_lib.c:1826:target_send_reply_msg()) @@@ processing error
(-116) req at ffff810065a60400 x1308791892785337/t0
o35->4f104403-eb03-83be-2910-2fd7cc26087c at NET_0x20000c0a84410_UUID:0/0
lens 408/864 e 0 to 0 dl 1248927113 ref 1 fl Interpret:/0/0 rc -116/0
Jul 30 06:11:47 BEESPBESXFIL27 kernel: LustreError:
22061:0:(mds_open.c:1665:mds_close()) @@@ no handle for file close ino
5606200: cookie 0x5ed7d8c3d129a361 req at ffff810071b28400
x1308791892785342/t0
o35->4f104403-eb03-83be-2910-2fd7cc26087c at NET_0x20000c0a84410_UUID:0/0
lens 408/864 e 0 to 0 dl 1248927113 ref 1 fl Interpret:/0/0 rc 0/0
Jul 30 06:11:47 BEESPBESXFIL27 kernel: LustreError:
22061:0:(mds_open.c:1665:mds_close()) Skipped 4 previous similar
messages
Jul 30 06:11:47 BEESPBESXFIL27 kernel: LustreError:
22061:0:(ldlm_lib.c:1826:target_send_reply_msg()) @@@ processing error
(-116) req at ffff810071b28400 x1308791892785342/t0
o35->4f104403-eb03-83be-2910-2fd7cc26087c at NET_0x20000c0a84410_UUID:0/0
lens 408/864 e 0 to 0 dl 1248927113 ref 1 fl Interpret:/0/0 rc -116/0
Jul 30 06:11:47 BEESPBESXFIL27 kernel: LustreError:
22061:0:(ldlm_lib.c:1826:target_send_reply_msg()) Skipped 4 previous
similar messages
Client log:
-----------
Jul 30 06:11:47 BEESPDESXAPP06 kernel: LustreError: 11-0: an error
occurred while communicating with 172.16.0.55 at tcp. The mds_close
operation failed with -116
Jul 30 06:11:47 BEESPDESXAPP06 kernel: LustreError:
13298:0:(file.c:114:ll_close_inode_openhandle()) inode 5606195 mdc
close failed: rc = -116
Jul 30 06:11:47 BEESPDESXAPP06 kernel: LustreError:
13298:0:(file.c:114:ll_close_inode_openhandle()) Skipped 1 previous
similar message
Jul 30 06:11:47 BEESPDESXAPP06 kernel: LustreError:
13298:0:(file.c:114:ll_close_inode_openhandle()) inode 5606155 mdc
close failed: rc = -116
Jul 30 06:11:47 BEESPDESXAPP06 kernel: LustreError:
13298:0:(file.c:114:ll_close_inode_openhandle()) Skipped 3 previous
similar messages
Jul 30 06:11:47 BEESPDESXAPP06 kernel: LustreError: 11-0: an error
occurred while communicating with 172.16.0.55 at tcp. The mds_close
operation failed with -116
Jul 30 06:11:47 BEESPDESXAPP06 kernel: LustreError: Skipped 7 previous
similar messages
Jul 30 06:11:47 BEESPDESXAPP06 kernel: LustreError:
13298:0:(ldlm_lock.c:602:ldlm_lock_decref_internal_nolock())
ASSERTION(lock->l_writers > 0) failed
Jul 30 06:11:47 BEESPDESXAPP06 kernel: LustreError:
13298:0:(ldlm_lock.c:602:ldlm_lock_decref_internal_nolock()) LBUG
Jul 30 06:11:47 BEESPDESXAPP06 kernel:
Jul 30 06:11:47 BEESPDESXAPP06 kernel: Call Trace:
<ffffffff88257aea>{:libcfs:lbug_with_loc+122}
Jul 30 06:11:47 BEESPDESXAPP06 kernel:
<ffffffff8825fe00>{:libcfs:tracefile_init+0}
<ffffffff8835d566>{:ptlrpc:ldlm_lock_decref_internal_nolock+182}
Jul 30 06:11:47 BEESPDESXAPP06 kernel:
<ffffffff8838533b>{:ptlrpc:ldlm_process_flock_lock+4139}
Jul 30 06:11:47 BEESPDESXAPP06 kernel:
<ffffffff883864ef>{:ptlrpc:ldlm_flock_completion_ast+2111}
Jul 30 06:11:47 BEESPDESXAPP06 kernel:
<ffffffff8835f4a9>{:ptlrpc:ldlm_lock_enqueue+2169}
<ffffffff88377ca0>{:ptlrpc:ldlm_cli_enqueue_fini+2624}
Jul 30 06:11:47 BEESPDESXAPP06 kernel:
<ffffffff88376fd3>{:ptlrpc:ldlm_prep_elc_req+755}
<ffffffff8835bc0d>{:ptlrpc:ldlm_lock_create+2541}
Jul 30 06:11:47 BEESPDESXAPP06 kernel:
<ffffffff8012c668>{default_wake_function+0}
<ffffffff88379ae2>{:ptlrpc:ldlm_cli_enqueue+1666}
Jul 30 06:11:47 BEESPDESXAPP06 kernel:
<ffffffff88523fcf>{:lustre:ll_file_flock+1407}
<ffffffff88385cb0>{:ptlrpc:ldlm_flock_completion_ast+0}
Jul 30 06:11:47 BEESPDESXAPP06 kernel:
<ffffffff8019ae2e>{locks_remove_posix+132}
<ffffffff80147fdc>{bit_waitqueue+56}
Jul 30 06:11:47 BEESPDESXAPP06 kernel:
<ffffffff80190241>{flush_old_exec+2729} <ffffffff80186fc1>{__fput+355}
Jul 30 06:11:47 BEESPDESXAPP06 kernel:
<ffffffff8018455b>{filp_close+84}
<ffffffff801360b7>{put_files_struct+107}
Jul 30 06:11:47 BEESPDESXAPP06 kernel:
<ffffffff8010aecb>{sysret_signal+28} <ffffffff8013725c>{do_exit+684}
Jul 30 06:11:47 BEESPDESXAPP06 kernel:
<ffffffff80137995>{sys_exit_group+0}
<ffffffff8014083c>{get_signal_to_deliver+1394}
Jul 30 06:11:47 BEESPDESXAPP06 kernel:
<ffffffff8010aecb>{sysret_signal+28} <ffffffff8010a19c>{do_signal+118}
Jul 30 06:11:47 BEESPDESXAPP06 kernel:
<ffffffff8012c668>{default_wake_function+0}
<ffffffff8014b227>{do_futex+104}
Jul 30 06:11:47 BEESPDESXAPP06 kernel:
<ffffffff801743b2>{sys_mprotect+1742}
<ffffffff8010aecb>{sysret_signal+28}
Jul 30 06:11:47 BEESPDESXAPP06 kernel:
<ffffffff8010b14f>{ptregscall_common+103}
Jul 30 06:11:47 BEESPDESXAPP06 kernel: LustreError: dumping log to
/tmp/lustre-log.1248927107.13298
Jul 30 06:11:47 BEESPDESXAPP06 kernel: Fixing recursive fault but
reboot is needed!
After that, the client does indeed need a reboot. What does this mean?
Could it be caused by sys.timeouts and/or ldlm_timeouts being too short?
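As a side note on the -116 seen in both logs: on Linux, errno 116 is
ESTALE (stale file handle). Below is a minimal Python sketch, not part of
our setup, that tallies the negative return codes in a pasted log excerpt.
The function name summarize_rc and the small errno table are assumptions
made for illustration; the table only holds the one Linux value relevant
here.

```python
import re

# Assumption: standard Linux errno numbering; only the code seen in these logs.
LINUX_ERRNO = {116: "ESTALE (stale file handle)"}


def summarize_rc(log_text):
    """Count negative return codes ('rc = -N' or 'failed with -N') in a log excerpt."""
    # Re-join lines wrapped by the mail client so a pattern split across
    # a line break (e.g. "mdc" / "close failed: rc = -116") still matches.
    flat = " ".join(log_text.split())
    counts = {}
    for match in re.finditer(r"(?:rc = |failed with )-(\d+)", flat):
        code = int(match.group(1))
        counts[code] = counts.get(code, 0) + 1
    # Map each code to (symbolic name or "unknown", number of occurrences).
    return {code: (LINUX_ERRNO.get(code, "unknown"), n) for code, n in counts.items()}


sample = """LustreError: 11-0: an error
occurred while communicating with 172.16.0.55 at tcp. The mds_close
operation failed with -116
LustreError: 13298:0:(file.c:114:ll_close_inode_openhandle()) inode 5606195 mdc
close failed: rc = -116"""

print(summarize_rc(sample))
```

This only counts the codes that appear explicitly in the text; it does not
account for the "Skipped N previous similar messages" lines.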
Regards,
Guillaume Demillecamps