[Lustre-discuss] Lustre clients getting evicted

Mon Feb 18 13:11:43 PST 2008

I found that something is getting overloaded somewhere.  If I just
start and stop a job over and over quickly, the client will lose
contact with one of the servers, either the OST or the MDT.

Would more RAM in the servers help? I don't see a high load or I/O
wait, but both servers are older (dual 1.4 GHz AMD) with only 2 GB of
memory.
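
For anyone who wants to reproduce the start/stop pattern described above, here is a minimal sketch. It assumes the job can be launched as an ordinary command and stopped with a plain kill; the helper name cycle_job and the ./my_job placeholder are hypothetical, not anything from my actual setup.

```shell
#!/bin/sh
# cycle_job COUNT DELAY CMD [ARGS...]
# Start CMD in the background, wait DELAY seconds, kill it, and repeat
# COUNT times. Prints the number of completed start/stop cycles.
# (Hypothetical helper, for illustration only.)
cycle_job() {
    count=$1
    delay=$2
    shift 2
    i=0
    while [ "$i" -lt "$count" ]; do
        "$@" &                              # start the job
        pid=$!
        sleep "$delay"                      # let it touch the filesystem
        kill "$pid" 2>/dev/null || true     # stop it again
        wait "$pid" 2>/dev/null || true     # reap it; ignore the kill status
        i=$((i + 1))
    done
    echo "$i"
}

# Example (placeholder job name), run from a directory on the Lustre mount:
#   cycle_job 50 2 ./my_job
# Afterwards, look for eviction messages on the client:
#   dmesg | grep -i evict
```

The delay is kept short on purpose, since the problem only seems to show up when the job is cycled quickly.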

Brock Palen
Center for Advanced Computing
brockp at umich.edu
(734)936-1985

On Feb 8, 2008, at 2:47 PM, Tom.Wang wrote:

> Hello,
>
> m45_amp214_om D 0000000000000000     0  2587      1         31389   
> 2586 (NOTLB)
> 00000101f6b435f8 0000000000000006 000001022c7fc030 0000000000000001
>       00000100080f1a40 0000000000000246 00000101f6b435a8  
> 0000000380136025
>       00000102270a1030 00000000000000d0
> Call Trace:<ffffffffa0216e79>{:lnet:LNetPut+1689} <ffffffff8030e45f> 
> {__down+147}
>       <ffffffff80134659>{default_wake_function+0} <ffffffff8030ff7d> 
> {__down_failed+53}
>       <ffffffffa04292e1>{:lustre:.text.lock.file+5}  
> <ffffffffa044b12e>{:lustre:ll_mdc_blocking_ast+798}
>       <ffffffffa02c8eb8>{:ptlrpc:ldlm_resource_get+456}  
> <ffffffffa02c3bbb>{:ptlrpc:ldlm_cancel_callback+107}
>       <ffffffffa02da615>{:ptlrpc:ldlm_cli_cancel_local+213}
>       <ffffffffa02c3c48>{:ptlrpc:ldlm_lock_addref_internal_nolock+56}
>       <ffffffffa02c3dbc>{:ptlrpc:search_queue+284}  
> <ffffffffa02dbc03>{:ptlrpc:ldlm_cancel_list+99}
>       <ffffffffa02dc113>{:ptlrpc:ldlm_cancel_lru_local+915}
>       <ffffffffa02ca293>{:ptlrpc:ldlm_resource_putref+435}
>       <ffffffffa02dc2c9>{:ptlrpc:ldlm_prep_enqueue_req+313}
>       <ffffffffa0394e6f>{:mdc:mdc_enqueue+1023} <ffffffffa02c1035> 
> {:ptlrpc:lock_res_and_lock+53}
>       <ffffffffa0268730>{:obdclass:class_handle2object+224}
>       <ffffffffa02c5fea>{:ptlrpc:__ldlm_handle2lock+794}
>       <ffffffffa02c106f>{:ptlrpc:unlock_res_and_lock+31}
>       <ffffffffa02c5c03>{:ptlrpc:ldlm_lock_decref_internal+595}
>       <ffffffffa02c156c>{:ptlrpc:ldlm_lock_add_to_lru+140}
>       <ffffffffa02c1035>{:ptlrpc:lock_res_and_lock+53}  
> <ffffffffa02c6f0a>{:ptlrpc:ldlm_lock_decref+154}
>       <ffffffffa039617d>{:mdc:mdc_intent_lock+685}  
> <ffffffffa044ae10>{:lustre:ll_mdc_blocking_ast+0}
>       <ffffffffa02d85f0>{:ptlrpc:ldlm_completion_ast+0}  
> <ffffffffa044ae10>{:lustre:ll_mdc_blocking_ast+0}
>       <ffffffffa02d85f0>{:ptlrpc:ldlm_completion_ast+0}  
> <ffffffffa044b64b>{:lustre:ll_prepare_mdc_op_data+139}
>       <ffffffffa0418a32>{:lustre:ll_intent_file_open+450}
>       <ffffffffa044ae10>{:lustre:ll_mdc_blocking_ast+0}  
> <ffffffff80192006>{__d_lookup+287}
>       <ffffffffa0419724>{:lustre:ll_file_open+2100}  
> <ffffffffa0428a18>{:lustre:ll_inode_permission+184}
>       <ffffffff80179bdb>{sys_access+349} <ffffffff8017a1ee> 
> {__dentry_open+201}
>       <ffffffff8017a3a9>{filp_open+95} <ffffffff80179bdb>{sys_access 
> +349}
>       <ffffffff801f00b5>{strncpy_from_user+74} <ffffffff8017a598> 
> {sys_open+57}
>       <ffffffff8011026a>{system_call+126}
>
> It seems the blocking_ast process was blocked here. Could you dump
> lustre/llite/namei.o with objdump -S lustre/llite/namei.o and send
> it to me?
>
> Thanks
> WangDi
>
> Brock Palen wrote:
>>>> On Feb 7, 2008, at 11:09 PM, Tom.Wang wrote:
>>>>>> MDT dmesg:
>>>>>>
>>>>>> LustreError: 9042:0:(ldlm_lib.c:1442:target_send_reply_msg())  
>>>>>> @@@  processing error (-107)  req at 000001002b
>>>>>> 52b000 x445020/t0 o400-><?>@<?>:-1 lens 128/0 ref 0 fl  
>>>>>> Interpret:/0/0  rc -107/0
>>>>>> LustreError: 0:0:(ldlm_lockd.c:210:waiting_locks_callback())  
>>>>>> ### lock  callback timer expired: evicting cl
>>>>>> ient 2faf3c9e-26fb-64b7- 
>>>>>> ca6c-7c5b09374e67 at NET_0x200000aa4008d_UUID  nid  
>>>>>> 10.164.0.141 at tcp  ns: mds-nobackup
>>>>>> -MDT0000_UUID lock: 00000100476df240/0xbc269e05c512de3a lrc:  
>>>>>> 1/0,0  mode: CR/CR res: 11240142/324715850 bi
>>>>>> ts 0x5 rrc: 2 type: IBT flags: 20 remote: 0x4e54bc800174cd08  
>>>>>> expref:  372 pid 26925
>>>>>>
>>>>> The client was evicted because this lock could not be released
>>>>> on the client in time. Could you provide the stack trace of the
>>>>> client at that time?
>>>>>
>>>>> I assume increasing obd_timeout could fix your problem. Then maybe
>>>>> you should wait for 1.6.5 to be released, which includes a new
>>>>> feature, adaptive_timeout, that will adjust the timeout value
>>>>> according to network congestion and server load. That should help
>>>>> with your problem.
>>>>
>>>> Waiting for the next version of Lustre might be the best thing.
>>>> I had upped the timeout a few days back, but the next day I had
>>>> errors on the MDS box.  I have switched it back:
>>>>
>>>> lctl conf_param nobackup-MDT0000.sys.timeout=300
>>>>
>>>> I would love to give you that trace but I don't know how to get  
>>>> it.  Is there a debug option to turn on in the clients?
>>> You can get that by running echo t > /proc/sysrq-trigger on the client.
>>>
>> Cool command, output of the client is attached.  The four
>> m45_amp214_om processes are the application that hung when
>> working off of Lustre.  You can see it's stuck in IO state.
>>
>
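
For reference, the task dump discussed in this thread can be collected roughly as follows. This is only a sketch: the helper name d_state_tasks and the output path /tmp/sysrq-t.txt are made up for illustration, and it assumes root access on the client with sysrq enabled. The helper just filters ps output for tasks in uninterruptible sleep (state D), which is the state the hung m45_amp214_om processes show in the trace above.

```shell
#!/bin/sh
# d_state_tasks: read "PID STAT COMMAND" lines on stdin and print the
# PID and command name of every task whose state starts with D
# (uninterruptible sleep). Hypothetical helper, for illustration only.
d_state_tasks() {
    awk '$2 ~ /^D/ { print $1, $3 }'
}

# On the client, as root:
#   ps -eo pid,stat,comm | d_state_tasks   # which tasks are stuck?
#   echo t > /proc/sysrq-trigger           # dump all task stacks to the kernel log
#   dmesg > /tmp/sysrq-t.txt               # save the dump to send to the list
```

The kernel log buffer is limited in size, so on a busy client it may be worth saving the dump right after triggering it.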