[Lustre-discuss] Lustre clients getting evicted

Aaron Knister aaron at iges.org
Mon Feb 11 11:16:20 PST 2008


I'm having a similar issue with Lustre 1.6.4.2 and InfiniBand. Under
load, the clients hang about every 10 minutes, which is really bad for
a production machine. The only way to clear the hang is to reboot the
server. My users are getting extremely impatient :-/

I see this on the clients:

LustreError: 2814:0:(client.c:975:ptlrpc_expire_one_request()) @@@ timeout (sent at 1202756629, 301s ago)  req@ffff8100af233600 x1796079/t0 o6->data-OST0000_UUID@192.168.64.71@o2ib:28 lens 336/336 ref 1 fl Rpc:/0/0 rc 0/-22
Lustre: data-OST0000-osc-ffff810139ce4800: Connection to service data-OST0000 via nid 192.168.64.71@o2ib was lost; in progress operations using this service will wait for recovery to complete.
LustreError: 11-0: an error occurred while communicating with 192.168.64.71@o2ib. The ost_connect operation failed with -16
LustreError: 11-0: an error occurred while communicating with 192.168.64.71@o2ib. The ost_connect operation failed with -16

I've increased the timeout to 300 seconds and it has only helped marginally.
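
For the record, here is roughly how I raised it (a sketch only: the
"data" fsname is inferred from the log lines above, and conf_param is
run on the MGS node):

   lctl conf_param data-MDT0000.sys.timeout=300   # persistent, filesystem-wide
   cat /proc/sys/lustre/timeout                   # check what a node is actually
                                                  # using (if I remember the 1.6
                                                  # proc path correctly)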

-Aaron

On Feb 9, 2008, at 12:06 AM, Tom.Wang wrote:

> Hi,
> Aha, this bug has been fixed in bug 14360:
>
> https://bugzilla.lustre.org/show_bug.cgi?id=14360
>
> The patch there should fix your problem; it should be included in the
> 1.6.5 release.
>
> Thanks
>
> Brock Palen wrote:
>> Sure, attached. Note, though, that we rebuilt our Lustre source for
>> another box that uses the largesmp kernel, but it used the same
>> options and compiler.
>>
>>
>> Brock Palen
>> Center for Advanced Computing
>> brockp at umich.edu
>> (734)936-1985
>>
>>
>> On Feb 8, 2008, at 2:47 PM, Tom.Wang wrote:
>>
>>> Hello,
>>>
>>> m45_amp214_om D 0000000000000000     0  2587      1         31389  2586 (NOTLB)
>>>      00000101f6b435f8 0000000000000006 000001022c7fc030 0000000000000001
>>>      00000100080f1a40 0000000000000246 00000101f6b435a8 0000000380136025
>>>      00000102270a1030 00000000000000d0
>>> Call Trace:
>>>      <ffffffffa0216e79>{:lnet:LNetPut+1689}
>>>      <ffffffff8030e45f>{__down+147}
>>>      <ffffffff80134659>{default_wake_function+0}
>>>      <ffffffff8030ff7d>{__down_failed+53}
>>>      <ffffffffa04292e1>{:lustre:.text.lock.file+5}
>>>      <ffffffffa044b12e>{:lustre:ll_mdc_blocking_ast+798}
>>>      <ffffffffa02c8eb8>{:ptlrpc:ldlm_resource_get+456}
>>>      <ffffffffa02c3bbb>{:ptlrpc:ldlm_cancel_callback+107}
>>>      <ffffffffa02da615>{:ptlrpc:ldlm_cli_cancel_local+213}
>>>      <ffffffffa02c3c48>{:ptlrpc:ldlm_lock_addref_internal_nolock+56}
>>>      <ffffffffa02c3dbc>{:ptlrpc:search_queue+284}
>>>      <ffffffffa02dbc03>{:ptlrpc:ldlm_cancel_list+99}
>>>      <ffffffffa02dc113>{:ptlrpc:ldlm_cancel_lru_local+915}
>>>      <ffffffffa02ca293>{:ptlrpc:ldlm_resource_putref+435}
>>>      <ffffffffa02dc2c9>{:ptlrpc:ldlm_prep_enqueue_req+313}
>>>      <ffffffffa0394e6f>{:mdc:mdc_enqueue+1023}
>>>      <ffffffffa02c1035>{:ptlrpc:lock_res_and_lock+53}
>>>      <ffffffffa0268730>{:obdclass:class_handle2object+224}
>>>      <ffffffffa02c5fea>{:ptlrpc:__ldlm_handle2lock+794}
>>>      <ffffffffa02c106f>{:ptlrpc:unlock_res_and_lock+31}
>>>      <ffffffffa02c5c03>{:ptlrpc:ldlm_lock_decref_internal+595}
>>>      <ffffffffa02c156c>{:ptlrpc:ldlm_lock_add_to_lru+140}
>>>      <ffffffffa02c1035>{:ptlrpc:lock_res_and_lock+53}
>>>      <ffffffffa02c6f0a>{:ptlrpc:ldlm_lock_decref+154}
>>>      <ffffffffa039617d>{:mdc:mdc_intent_lock+685}
>>>      <ffffffffa044ae10>{:lustre:ll_mdc_blocking_ast+0}
>>>      <ffffffffa02d85f0>{:ptlrpc:ldlm_completion_ast+0}
>>>      <ffffffffa044ae10>{:lustre:ll_mdc_blocking_ast+0}
>>>      <ffffffffa02d85f0>{:ptlrpc:ldlm_completion_ast+0}
>>>      <ffffffffa044b64b>{:lustre:ll_prepare_mdc_op_data+139}
>>>      <ffffffffa0418a32>{:lustre:ll_intent_file_open+450}
>>>      <ffffffffa044ae10>{:lustre:ll_mdc_blocking_ast+0}
>>>      <ffffffff80192006>{__d_lookup+287}
>>>      <ffffffffa0419724>{:lustre:ll_file_open+2100}
>>>      <ffffffffa0428a18>{:lustre:ll_inode_permission+184}
>>>      <ffffffff80179bdb>{sys_access+349}
>>>      <ffffffff8017a1ee>{__dentry_open+201}
>>>      <ffffffff8017a3a9>{filp_open+95}
>>>      <ffffffff80179bdb>{sys_access+349}
>>>      <ffffffff801f00b5>{strncpy_from_user+74}
>>>      <ffffffff8017a598>{sys_open+57}
>>>      <ffffffff8011026a>{system_call+126}
>>>
>>> It seems the blocking_ast process was blocked here. Could you dump
>>> lustre/llite/namei.o with 'objdump -S lustre/llite/namei.o' and send
>>> the output to me?
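>>>
>>> For example, from the top of the Lustre build tree (the path below is
>>> just a placeholder, and -S only interleaves source if the objects
>>> were built with debug info, i.e. -g):
>>>
>>>    cd /path/to/lustre-source                     # placeholder path
>>>    objdump -S lustre/llite/namei.o > namei.dump
>>>    # then attach namei.dump to the reply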
>>>
>>> Thanks
>>> WangDi
>>>
>>> Brock Palen wrote:
>>>>>> On Feb 7, 2008, at 11:09 PM, Tom.Wang wrote:
>>>>>>>> MDT dmesg:
>>>>>>>>
>>>>>>>> LustreError: 9042:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107)  req@000001002b52b000 x445020/t0 o400-><?>@<?>:-1 lens 128/0 ref 0 fl Interpret:/0/0  rc -107/0
>>>>>>>> LustreError: 0:0:(ldlm_lockd.c:210:waiting_locks_callback()) ### lock callback timer expired: evicting client 2faf3c9e-26fb-64b7-ca6c-7c5b09374e67@NET_0x200000aa4008d_UUID nid 10.164.0.141@tcp  ns: mds-nobackup-MDT0000_UUID lock: 00000100476df240/0xbc269e05c512de3a lrc: 1/0,0  mode: CR/CR res: 11240142/324715850 bits 0x5 rrc: 2 type: IBT flags: 20 remote: 0x4e54bc800174cd08 expref: 372 pid 26925
>>>>>>>>
>>>>>>> The client was evicted because this lock could not be released on
>>>>>>> the client in time. Could you provide a stack trace of the client
>>>>>>> at that time?
>>>>>>>
>>>>>>> I assume increasing obd_timeout could work around your problem.
>>>>>>> Otherwise, maybe you should wait for the 1.6.5 release, which
>>>>>>> includes a new adaptive timeouts feature that adjusts the timeout
>>>>>>> value according to network congestion and server load. That should
>>>>>>> help with your problem.
>>>>>>
>>>>>> Waiting for the next version of Lustre might be the best thing. I
>>>>>> had upped the timeout a few days back, but the next day I had
>>>>>> errors on the MDS box, so I have switched it back:
>>>>>>
>>>>>> lctl conf_param nobackup-MDT0000.sys.timeout=300
>>>>>>
>>>>>> I would love to give you that trace but I don't know how to get
>>>>>> it.  Is there a debug option to turn on in the clients?
>>>>> You can get that with 'echo t > /proc/sysrq-trigger' on the client.
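>>>>>
>>>>> Something like this, assuming sysrq is enabled on that node (the
>>>>> output file name is arbitrary):
>>>>>
>>>>>    echo 1 > /proc/sys/kernel/sysrq   # enable sysrq if it is disabled
>>>>>    echo t > /proc/sysrq-trigger      # dump all task stacks to the kernel log
>>>>>    dmesg > /tmp/client-stacks.txt    # capture the dump to attach
>>>>>
>>>>> If the dump is longer than the kernel log buffer, the full output
>>>>> should also end up in /var/log/messages on a typical syslog setup.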
>>>>>
>>>> Cool command; output from the client is attached. The four
>>>> m45_amp214_om processes are the application that hung when working
>>>> off of Lustre. You can see it's stuck in an I/O wait state.
>>>>
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss

Aaron Knister
Associate Systems Analyst
Center for Ocean-Land-Atmosphere Studies

(301) 595-7000
aaron at iges.org






