[Lustre-discuss] Client Kernel panic - not syncing. Lustre 1.8.5

Aaron Everett aeverett@forteds.com
Tue May 17 17:13:42 PDT 2011


Hi all,

We've been running Lustre 1.6.6 for several years and are deploying 1.8.5 on
some new hardware. Under load we've been seeing random kernel panics on
many of the clients. We are running 2.6.18-194.17.1.el5_lustre.1.8.5 on the
servers (shared MDT/MGS, and 4 OSTs). We have patchless clients
running 2.6.18-238.9.1.el5 (all CentOS).

On the MDT, the following is logged in /var/log/messages:

May 17 16:46:44 lustre-mdt-00 kernel: Lustre:
5878:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
x1368993021040034 sent from fdfs-MDT0000 to NID 172.16.14.219@tcp 7s ago has
timed out (7s prior to deadline).
May 17 16:46:44 lustre-mdt-00 kernel:
req@ffff8105f140b800 x1368993021040034/t0
o104->@NET_0x20000ac100edb_UUID:15/16 lens 296/384 e 0
to 1 dl 1305665204 ref 1 fl Rpc:N/0/0 rc 0/0
May 17 16:46:44 lustre-mdt-00 kernel: Lustre:
5878:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 39 previous
similar messages
May 17 16:46:44 lustre-mdt-00 kernel: LustreError: 138-a: fdfs-MDT0000: A
client on nid 172.16.14.219@tcp was evicted due to a lock blocking callback
to 172.16.14.219@tcp timed out: rc -107
May 17 16:46:52 lustre-mdt-00 kernel: LustreError: 138-a: fdfs-MDT0000: A
client on nid 172.16.14.225@tcp was evicted due to a lock blocking callback
to 172.16.14.225@tcp timed out: rc -107
May 17 16:46:52 lustre-mdt-00 kernel: LustreError:
6227:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED
req@ffff81181ccb6800 x1368993021041016/t0
o104->@NET_0x20000ac100ee1_UUID:15/16 lens 296/384 e 0 to 1 dl 0 ref 1 fl
Rpc:N/0/0 rc 0/0
May 17 16:46:52 lustre-mdt-00 kernel: LustreError:
6227:0:(ldlm_lockd.c:607:ldlm_handle_ast_error()) ### client (nid
172.16.14.225@tcp) returned 0 from blocking AST ns: mds-fdfs-MDT0000_UUID
lock: ffff81169f590a00/0x767f56e4ad136f72 lrc: 4/0,0 mode: CR/CR res:
35202584/110090815 bits 0x3 rrc: 25 type: IBT flags: 0x4000020 remote:
0x364122c82e3aca01 expref: 229900 pid: 6310 timeout: 4386580591
May 17 16:46:59 lustre-mdt-00 kernel: LustreError: 138-a: fdfs-MDT0000: A
client on nid 172.16.14.230@tcp was evicted due to a lock blocking callback
to 172.16.14.230@tcp timed out: rc -107
May 17 16:46:59 lustre-mdt-00 kernel: LustreError: Skipped 6 previous
similar messages
May 17 16:47:07 lustre-mdt-00 kernel: Lustre:
6688:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
x1368993021041492 sent from fdfs-MDT0000 to NID 172.16.14.229@tcp 7s ago has
timed out (7s prior to deadline).
May 17 16:47:07 lustre-mdt-00 kernel:
req@ffff81093052b000 x1368993021041492/t0
o104->@NET_0x20000ac100ee5_UUID:15/16 lens 296/384 e 0
to 1 dl 1305665227 ref 1 fl Rpc:N/0/0 rc 0/0
May 17 16:47:07 lustre-mdt-00 kernel: Lustre:
6688:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 8 previous
similar messages
May 17 16:47:07 lustre-mdt-00 kernel: LustreError: 138-a: fdfs-MDT0000: A
client on nid 172.16.14.229@tcp was evicted due to a lock blocking callback
to 172.16.14.229@tcp timed out: rc -107
May 17 16:47:07 lustre-mdt-00 kernel: LustreError: Skipped 8 previous
similar messages
May 17 16:50:16 lustre-mdt-00 kernel: Lustre: MGS: haven't heard from client
c8e311a5-f1d6-7197-1021-c5a02c1c5b14 (at 172.16.14.230@tcp) in 228 seconds.
I think it's dead, and I am evicting it.

On the clients, there is a kernel panic, with the following message on the
screen:

Code: 48 89 08 31 c9 48 89 12 48 89 52 08 ba 01 00 00 00 83 83 10
RIP   [<ffffffff8891ddcd>]  :mdc:mdc_exit_request+0x6d/0xb0
 RSP  <ffff81028c137858>
CR2:  0000000000003877
 <0>Kernel panic - not syncing: Fatal exception
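The console tail above only gives the RIP offset (:mdc:mdc_exit_request+0x6d); a full vmcore would make the crash much easier to decode. If anyone wants to capture one, a kdump setup roughly like the following works on CentOS 5 clients (the 128M reservation size is illustrative; adjust for your memory layout):

```shell
# Install the capture and analysis tools (EL5 package names).
yum install kexec-tools crash kernel-debuginfo

# Reserve memory for the crash kernel: append to the kernel line
# in /boot/grub/grub.conf, then reboot:
#   crashkernel=128M@16M

# Enable and start the kdump service so panics write a vmcore.
chkconfig kdump on
service kdump start

# After the next panic, open the dump against the matching
# debuginfo vmlinux to get a full backtrace:
crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /var/crash/*/vmcore
```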

We're running the same set of jobs on both the 1.6.6 Lustre filesystem and
the 1.8.5 Lustre filesystem. Only the 1.8.5 clients crash; the 1.6.6 clients
that are also using the new servers never exhibit this issue. I'm assuming
there is a setting on the 1.8.5 clients that needs to be adjusted, but I'm
searching for help.
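In case it helps anyone compare configurations: the eviction messages above are timeout-related, and the relevant client-side tunables on 1.8 can be read with lctl. The parameter names below are the stock Lustre 1.8 ones; the set_param value is an example, not a recommendation:

```shell
# Base RPC timeout (obd_timeout) and adaptive-timeout bounds/history.
lctl get_param timeout
lctl get_param at_min at_max at_history

# DLM lock LRU sizing, relevant when blocking callbacks time out.
lctl get_param ldlm.namespaces.*.lru_size

# Example: raise the adaptive-timeout floor if callbacks are expiring
# too aggressively (value is illustrative only).
lctl set_param at_min=40
```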

Best regards,
Aaron

