[Lustre-discuss] Client Kernel panic - not syncing. Lustre 1.8.5

Aaron Everett aeverett at forteds.com
Wed May 18 11:34:36 PDT 2011


More information:

The frequency of these errors was dramatically reduced by
changing /proc/fs/lustre/osc/fdfs-OST000[0-3]-osc/max_rpcs_in_flight from 8
to 32.
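
In case it helps anyone following along, this is roughly how I pushed the new
value to every OSC on a client (a minimal Python sketch against the 1.8 client
/proc layout; the glob below is generic rather than our fdfs-OST000[0-3]-osc
names, so adjust it for your setup -- I believe "lctl set_param
osc.*.max_rpcs_in_flight=32" accomplishes the same thing):

    #!/usr/bin/env python
    # Sketch: bump max_rpcs_in_flight for every OSC device on this client.
    # Assumes the Lustre 1.8 client /proc layout; run as root.
    import glob

    NEW_VALUE = "32"   # we went from the default of 8 to 32

    for path in glob.glob("/proc/fs/lustre/osc/*/max_rpcs_in_flight"):
        old = open(path).read().strip()
        f = open(path, "w")
        f.write(NEW_VALUE + "\n")
        f.close()
        print("%s: %s -> %s" % (path, old, NEW_VALUE))

Note that this only changes the live value on that client and has to be
reapplied after a remount; I think lctl conf_param on the MGS can make it
persistent, but I haven't tried that yet.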

Processor, memory, and disk I/O load on the servers is not high. Is there a
reason not to increase max_rpcs_in_flight further, from 32 to 48 or 64? Is
there a limit on how high I can set this value?

Best regards,
Aaron

On Tue, May 17, 2011 at 8:13 PM, Aaron Everett <aeverett at forteds.com> wrote:

> Hi all,
>
> We've been running Lustre 1.6.6 for several years and are deploying 1.8.5
> on some new hardware. When under load, we've been seeing random kernel panics
> on many of the clients. We are running 2.6.18-194.17.1.el5_lustre.1.8.5 on
> the servers (shared MDT/MGS and 4 OSTs). We have patchless clients
> running 2.6.18-238.9.1.el5 (all CentOS).
>
> On the MDT, the following is logged in /var/log/messages:
>
> May 17 16:46:44 lustre-mdt-00 kernel: Lustre:
> 5878:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
> x1368993021040034 sent from fdfs-MDT0000 to NID 172.16.14.219@tcp 7s ago
> has timed out (7s prior to deadline).
> May 17 16:46:44 lustre-mdt-00 kernel:   req@ffff8105f140b800 x1368993021040034/t0 o104->@NET_0x20000ac100edb_UUID:15/16 lens 296/384 e 0
> to 1 dl 1305665204 ref 1 fl Rpc:N/0/0 rc 0/0
> May 17 16:46:44 lustre-mdt-00 kernel: Lustre:
> 5878:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 39 previous
> similar messages
> May 17 16:46:44 lustre-mdt-00 kernel: LustreError: 138-a: fdfs-MDT0000: A
> client on nid 172.16.14.219@tcp was evicted due to a lock blocking
> callback to 172.16.14.219@tcp timed out: rc -107
> May 17 16:46:52 lustre-mdt-00 kernel: LustreError: 138-a: fdfs-MDT0000: A
> client on nid 172.16.14.225@tcp was evicted due to a lock blocking
> callback to 172.16.14.225@tcp timed out: rc -107
> May 17 16:46:52 lustre-mdt-00 kernel: LustreError:
> 6227:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED
> req@ffff81181ccb6800 x1368993021041016/t0
> o104->@NET_0x20000ac100ee1_UUID:15/16 lens 296/384 e 0 to 1 dl 0 ref 1 fl
> Rpc:N/0/0 rc 0/0
> May 17 16:46:52 lustre-mdt-00 kernel: LustreError:
> 6227:0:(ldlm_lockd.c:607:ldlm_handle_ast_error()) ### client (nid
> 172.16.14.225@tcp) returned 0 from blocking AST ns: mds-fdfs-MDT0000_UUID
> lock: ffff81169f590a00/0x767f56e4ad136f72 lrc: 4/0,0 mode: CR/CR res:
> 35202584/110090815 bits 0x3 rrc: 25 type: IBT flags: 0x4000020 remote:
> 0x364122c82e3aca01 expref: 229900 pid: 6310 timeout: 4386580591
> May 17 16:46:59 lustre-mdt-00 kernel: LustreError: 138-a: fdfs-MDT0000: A
> client on nid 172.16.14.230@tcp was evicted due to a lock blocking
> callback to 172.16.14.230@tcp timed out: rc -107
> May 17 16:46:59 lustre-mdt-00 kernel: LustreError: Skipped 6 previous
> similar messages
> May 17 16:47:07 lustre-mdt-00 kernel: Lustre:
> 6688:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
> x1368993021041492 sent from fdfs-MDT0000 to NID 172.16.14.229@tcp 7s ago
> has timed out (7s prior to deadline).
> May 17 16:47:07 lustre-mdt-00 kernel:   req@ffff81093052b000 x1368993021041492/t0 o104->@NET_0x20000ac100ee5_UUID:15/16 lens 296/384 e 0
> to 1 dl 1305665227 ref 1 fl Rpc:N/0/0 rc 0/0
> May 17 16:47:07 lustre-mdt-00 kernel: Lustre:
> 6688:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 8 previous
> similar messages
> May 17 16:47:07 lustre-mdt-00 kernel: LustreError: 138-a: fdfs-MDT0000: A
> client on nid 172.16.14.229@tcp was evicted due to a lock blocking
> callback to 172.16.14.229@tcp timed out: rc -107
> May 17 16:47:07 lustre-mdt-00 kernel: LustreError: Skipped 8 previous
> similar messages
> May 17 16:50:16 lustre-mdt-00 kernel: Lustre: MGS: haven't heard from
> client c8e311a5-f1d6-7197-1021-c5a02c1c5b14 (at 172.16.14.230@tcp) in 228
> seconds. I think it's dead, and I am evicting it.
>
> On the clients, there is a kernel panic, with the following message on the
> screen:
>
> Code: 48 89 08 31 c9 48 89 12 48 89 52 08 ba 01 00 00 00 83 83 10
> RIP   [<ffffffff8891ddcd>]  :mdc:mdc_exit_request+0x6d/0xb0
>  RSP  <ffff81028c137858>
> CR2:  0000000000003877
>  <0>Kernel panic - not syncing: Fatal exception
>
> We're running the same set of jobs on both the 1.6.6 Lustre filesystem and
> the 1.8.5 Lustre filesystem. Only the 1.8.5 clients crash; the 1.6.6 clients
> that are also using the new servers never exhibit this issue. I'm assuming
> there is a setting on the 1.8.5 clients that needs to be adjusted, but I'm
> still searching for what it is, and any help would be appreciated.
>
> Best regards,
> Aaron
>

