[Lustre-discuss] Client Kernel panic - not syncing. Lustre 1.8.5

Jeremy Filizetti jeremy.filizetti at gmail.com
Wed May 18 11:47:05 PDT 2011


max_rpcs_in_flight has a max of 256 in 1.8.5 IIRC.  The downside is that a
single client can consume more resources from the OSS.  It looks like you're
using ksocklnd as your LND below, so you may also have to increase the
ksocklnd peer_credits module parameter, which also defaults to 8.
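
A minimal sketch of the peer_credits change on the clients, assuming the stock
modprobe config locations (32 is only an illustrative value, not a recommendation):

    # /etc/modprobe.d/lustre.conf (or /etc/modprobe.conf, depending on the distro)
    options ksocklnd peer_credits=32

Note it only takes effect after the clients are unmounted and the Lustre/LNET
modules are reloaded.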

Jeremy

On Wed, May 18, 2011 at 2:34 PM, Aaron Everett <aeverett at forteds.com> wrote:

> More information:
>
> The frequency of these errors was dramatically reduced by
> changing /proc/fs/lustre/osc/fdfs-OST000[0-3]-osc/max_rpcs_in_flight from 8
> to 32.
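>
> (For reference, the change was applied roughly along these lines; lctl
> set_param on the matching osc entries would work just as well:
>
>     for f in /proc/fs/lustre/osc/fdfs-OST000[0-3]-osc/max_rpcs_in_flight; do
>         echo 32 > "$f"
>     done
> )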
>
> Processor, memory, and disk I/O load on the servers is not high. Is there a
> reason not to increase max_rpcs_in_flight from 32 to 48 or 64? Is there a
> limit on how high I can set this value?
>
> Best regards,
> Aaron
>
>
> On Tue, May 17, 2011 at 8:13 PM, Aaron Everett <aeverett at forteds.com> wrote:
>
>> Hi all,
>>
>> We've been running Lustre 1.6.6 for several years and are deploying 1.8.5
>> on some new hardware. When under load we've been seeing random kernel panics
>> on many of the clients. We are running 2.6.18-194.17.1.el5_lustre.1.8.5 on
>> the servers (shared MDT/MGS and 4 OSTs). We have patchless clients
>> running 2.6.18-238.9.1.el5 (all CentOS).
>>
>> On the MDT, the following is logged in /var/log/messages:
>>
>> May 17 16:46:44 lustre-mdt-00 kernel: Lustre:
>> 5878:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
>> x1368993021040034 sent from fdfs-MDT0000 to NID 172.16.14.219 at tcp 7s ago
>> has timed out (7s prior to deadline).
>> May 17 16:46:44 lustre-mdt-00 kernel:   req at ffff8105f140b800 x1368993021040034/t0 o104->@NET_0x20000ac100edb_UUID:15/16 lens 296/384 e 0
>> to 1 dl 1305665204 ref 1 fl Rpc:N/0/0 rc 0/0
>> May 17 16:46:44 lustre-mdt-00 kernel: Lustre:
>> 5878:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 39 previous
>> similar messages
>> May 17 16:46:44 lustre-mdt-00 kernel: LustreError: 138-a: fdfs-MDT0000: A
>> client on nid 172.16.14.219 at tcp was evicted due to a lock blocking
>> callback to 172.16.14.219 at tcp timed out: rc -107
>> May 17 16:46:52 lustre-mdt-00 kernel: LustreError: 138-a: fdfs-MDT0000: A
>> client on nid 172.16.14.225 at tcp was evicted due to a lock blocking
>> callback to 172.16.14.225 at tcp timed out: rc -107
>> May 17 16:46:52 lustre-mdt-00 kernel: LustreError:
>> 6227:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED
>> req at ffff81181ccb6800 x1368993021041016/t0
>> o104->@NET_0x20000ac100ee1_UUID:15/16 lens 296/384 e 0 to 1 dl 0 ref 1 fl
>> Rpc:N/0/0 rc 0/0
>> May 17 16:46:52 lustre-mdt-00 kernel: LustreError:
>> 6227:0:(ldlm_lockd.c:607:ldlm_handle_ast_error()) ### client (nid
>> 172.16.14.225 at tcp) returned 0 from blocking AST ns: mds-fdfs-MDT0000_UUID
>> lock: ffff81169f590a00/0x767f56e4ad136f72 lrc: 4/0,0 mode: CR/CR res:
>> 35202584/110090815 bits 0x3 rrc: 25 type: IBT flags: 0x4000020 remote:
>> 0x364122c82e3aca01 expref: 229900 pid: 6310 timeout: 4386580591
>> May 17 16:46:59 lustre-mdt-00 kernel: LustreError: 138-a: fdfs-MDT0000: A
>> client on nid 172.16.14.230 at tcp was evicted due to a lock blocking
>> callback to 172.16.14.230 at tcp timed out: rc -107
>> May 17 16:46:59 lustre-mdt-00 kernel: LustreError: Skipped 6 previous
>> similar messages
>> May 17 16:47:07 lustre-mdt-00 kernel: Lustre:
>> 6688:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
>> x1368993021041492 sent from fdfs-MDT0000 to NID 172.16.14.229 at tcp 7s ago
>> has timed out (7s prior to deadline).
>> May 17 16:47:07 lustre-mdt-00 kernel:   req at ffff81093052b000 x1368993021041492/t0 o104->@NET_0x20000ac100ee5_UUID:15/16 lens 296/384 e 0
>> to 1 dl 1305665227 ref 1 fl Rpc:N/0/0 rc 0/0
>> May 17 16:47:07 lustre-mdt-00 kernel: Lustre:
>> 6688:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 8 previous
>> similar messages
>> May 17 16:47:07 lustre-mdt-00 kernel: LustreError: 138-a: fdfs-MDT0000: A
>> client on nid 172.16.14.229 at tcp was evicted due to a lock blocking
>> callback to 172.16.14.229 at tcp timed out: rc -107
>> May 17 16:47:07 lustre-mdt-00 kernel: LustreError: Skipped 8 previous
>> similar messages
>> May 17 16:50:16 lustre-mdt-00 kernel: Lustre: MGS: haven't heard from
>> client c8e311a5-f1d6-7197-1021-c5a02c1c5b14 (at 172.16.14.230 at tcp) in 228
>> seconds. I think it's dead, and I am evicting it.
>>
>> On the clients, there is a kernel panic, with the following message on the
>> screen:
>>
>> Code: 48 89 08 31 c9 48 89 12 48 89 52 08 ba 01 00 00 00 83 83 10
>> RIP   [<ffffffff8891ddcd>]  :mdc:mdc_exit_request+0x6d/0xb0
>>  RSP  <ffff81028c137858>
>> CR2:  0000000000003877
>>  <0>Kernel panic - not syncing: Fatal exception
>>
>> We're running the same set of jobs on both the 1.6.6 Lustre filesystem and
>> the 1.8.5 Lustre filesystem. Only the 1.8.5 clients crash; the 1.6.6 clients
>> that are also using the new servers never exhibit this issue. I'm assuming
>> there is a setting on the 1.8.5 clients that needs to be adjusted, but I'm
>> searching for help.
>>
>> Best regards,
>> Aaron
>>
>
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>
>