[Lustre-discuss] Lustre clients getting evicted

Tom.Wang Tom.Wang at Sun.COM
Mon Feb 18 13:43:35 PST 2008


Brock Palen wrote:
> I found that something is getting overloaded someplace.  If I just
> start and stop a job over and over quickly, the client will lose
> contact with one of the servers, either the OST or the MDT.
>
>   
The server might be stuck somewhere.  What it gets stuck on probably
depends on what the job does when you start and stop it over and over.

Does the job create and then unlink a lot of files when you start and stop it?

Whether more memory will help depends on what is causing the servers
to get stuck.  Could you find any console error messages from the time
they are stuck?
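
For example, on the MDS and OSS nodes, something like the following
(a rough sketch; log locations and the exact grep patterns are just
examples and may differ on your distribution) would capture the console
and Lustre debug messages from around the time of the hang:

    dmesg | grep -i -e LustreError -e 'soft lockup' > /tmp/console-errors.txt
    grep Lustre /var/log/messages >> /tmp/console-errors.txt
    lctl dk /tmp/lustre-debug.log    # dump the Lustre kernel debug buffer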

Usually extra memory helps more on the MDS, especially if you have a
large number of clients and big directories in your filesystem.  Memory
on the OSTs also helps, though not directly for reads and writes.  You
should be able to find more on this easily in the list archives; there
have been many discussions about hardware requirements for Lustre before.
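
If you want a quick look at whether the MDS is actually short on memory,
something like this gives a rough picture (the /proc path below is from
1.6.x and may differ on other versions):

    free -m                                            # overall memory and swap usage
    slabtop -o | head -20                              # largest kernel slab caches (ldlm_locks, inode caches, ...)
    cat /proc/fs/lustre/ldlm/namespaces/*/lock_count   # number of ldlm locks per namespace

If the slab caches and lock counts are large compared to your 2 GB, more
RAM on the MDS will probably help.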

Thanks
WangDi
> Would more RAM in the servers help? I don't see a high load or I/O
> wait, but both servers are older (dual 1.4 GHz AMD) with only 2 GB of
> memory.
>
> Brock Palen
> Center for Advanced Computing
> brockp at umich.edu
> (734)936-1985
>
>
> On Feb 8, 2008, at 2:47 PM, Tom.Wang wrote:
>
>   
>> Hello,
>>
>> m45_amp214_om D 0000000000000000     0  2587      1         31389  2586 (NOTLB)
>> 00000101f6b435f8 0000000000000006 000001022c7fc030 0000000000000001
>>       00000100080f1a40 0000000000000246 00000101f6b435a8 0000000380136025
>>       00000102270a1030 00000000000000d0
>> Call Trace:<ffffffffa0216e79>{:lnet:LNetPut+1689} <ffffffff8030e45f>{__down+147}
>>       <ffffffff80134659>{default_wake_function+0} <ffffffff8030ff7d>{__down_failed+53}
>>       <ffffffffa04292e1>{:lustre:.text.lock.file+5} <ffffffffa044b12e>{:lustre:ll_mdc_blocking_ast+798}
>>       <ffffffffa02c8eb8>{:ptlrpc:ldlm_resource_get+456} <ffffffffa02c3bbb>{:ptlrpc:ldlm_cancel_callback+107}
>>       <ffffffffa02da615>{:ptlrpc:ldlm_cli_cancel_local+213}
>>       <ffffffffa02c3c48>{:ptlrpc:ldlm_lock_addref_internal_nolock+56}
>>       <ffffffffa02c3dbc>{:ptlrpc:search_queue+284} <ffffffffa02dbc03>{:ptlrpc:ldlm_cancel_list+99}
>>       <ffffffffa02dc113>{:ptlrpc:ldlm_cancel_lru_local+915}
>>       <ffffffffa02ca293>{:ptlrpc:ldlm_resource_putref+435}
>>       <ffffffffa02dc2c9>{:ptlrpc:ldlm_prep_enqueue_req+313}
>>       <ffffffffa0394e6f>{:mdc:mdc_enqueue+1023} <ffffffffa02c1035>{:ptlrpc:lock_res_and_lock+53}
>>       <ffffffffa0268730>{:obdclass:class_handle2object+224}
>>       <ffffffffa02c5fea>{:ptlrpc:__ldlm_handle2lock+794}
>>       <ffffffffa02c106f>{:ptlrpc:unlock_res_and_lock+31}
>>       <ffffffffa02c5c03>{:ptlrpc:ldlm_lock_decref_internal+595}
>>       <ffffffffa02c156c>{:ptlrpc:ldlm_lock_add_to_lru+140}
>>       <ffffffffa02c1035>{:ptlrpc:lock_res_and_lock+53} <ffffffffa02c6f0a>{:ptlrpc:ldlm_lock_decref+154}
>>       <ffffffffa039617d>{:mdc:mdc_intent_lock+685} <ffffffffa044ae10>{:lustre:ll_mdc_blocking_ast+0}
>>       <ffffffffa02d85f0>{:ptlrpc:ldlm_completion_ast+0} <ffffffffa044ae10>{:lustre:ll_mdc_blocking_ast+0}
>>       <ffffffffa02d85f0>{:ptlrpc:ldlm_completion_ast+0} <ffffffffa044b64b>{:lustre:ll_prepare_mdc_op_data+139}
>>       <ffffffffa0418a32>{:lustre:ll_intent_file_open+450}
>>       <ffffffffa044ae10>{:lustre:ll_mdc_blocking_ast+0} <ffffffff80192006>{__d_lookup+287}
>>       <ffffffffa0419724>{:lustre:ll_file_open+2100} <ffffffffa0428a18>{:lustre:ll_inode_permission+184}
>>       <ffffffff80179bdb>{sys_access+349} <ffffffff8017a1ee>{__dentry_open+201}
>>       <ffffffff8017a3a9>{filp_open+95} <ffffffff80179bdb>{sys_access+349}
>>       <ffffffff801f00b5>{strncpy_from_user+74} <ffffffff8017a598>{sys_open+57}
>>       <ffffffff8011026a>{system_call+126}
>>
>> It seems the blocking_ast process was blocked here.  Could you dump
>> lustre/llite/namei.o with "objdump -S lustre/llite/namei.o" and send
>> the output to me?
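>>
>> Something along these lines should work (assuming you still have the
>> build tree the client modules came from; the path below is only a
>> placeholder):
>>
>>     cd /path/to/lustre-source
>>     objdump -S lustre/llite/namei.o > namei.objdump.txt
>>
>> and then attach namei.objdump.txt to your reply.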
>>
>> Thanks
>> WangDi
>>
>> Brock Palen wrote:
>>     
>>>>> On Feb 7, 2008, at 11:09 PM, Tom.Wang wrote:
>>>>>           
>>>>>>> MDT dmesg:
>>>>>>>
>>>>>>> LustreError: 9042:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req at 000001002b52b000 x445020/t0 o400-><?>@<?>:-1 lens 128/0 ref 0 fl Interpret:/0/0 rc -107/0
>>>>>>> LustreError: 0:0:(ldlm_lockd.c:210:waiting_locks_callback()) ### lock callback timer expired: evicting client 2faf3c9e-26fb-64b7-ca6c-7c5b09374e67 at NET_0x200000aa4008d_UUID nid 10.164.0.141 at tcp ns: mds-nobackup-MDT0000_UUID lock: 00000100476df240/0xbc269e05c512de3a lrc: 1/0,0 mode: CR/CR res: 11240142/324715850 bits 0x5 rrc: 2 type: IBT flags: 20 remote: 0x4e54bc800174cd08 expref: 372 pid 26925
>>>>>>>
>>>>>>>               
>>>>>> The client was evicted because this lock could not be released by
>>>>>> the client in time.  Could you provide a stack trace of the client
>>>>>> at that time?
>>>>>>
>>>>>> I assume increasing obd_timeout could fix your problem.  Otherwise
>>>>>> you may want to wait for the 1.6.5 release, which includes a new
>>>>>> feature, adaptive timeouts, that adjusts the timeout value according
>>>>>> to network congestion and server load.  That should help with your
>>>>>> problem.
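>>>>>>
>>>>>> If you do want to try a larger timeout in the meantime, it would be
>>>>>> set on the MGS with something like (the value here is only an
>>>>>> illustration, not a recommendation):
>>>>>>
>>>>>>     lctl conf_param nobackup-MDT0000.sys.timeout=600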
>>>>>>             
>>>>> Waiting for the next version of Lustre might be the best thing.
>>>>> I had upped the timeout a few days back, but the next day I had
>>>>> errors on the MDS box.  I have switched it back:
>>>>>
>>>>> lctl conf_param nobackup-MDT0000.sys.timeout=300
>>>>>
>>>>> I would love to give you that trace but I don't know how to get  
>>>>> it.  Is there a debug option to turn on in the clients?
>>>>>           
>>>> You can get that by running "echo t > /proc/sysrq-trigger" on the client.
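>>>>
>>>> The trace lands in the kernel ring buffer, so something like
>>>>
>>>>     echo t > /proc/sysrq-trigger
>>>>     dmesg > /tmp/client-stack-trace.txt
>>>>
>>>> run as root on the client should capture it; /var/log/messages will
>>>> usually have a copy as well.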
>>>>
>>>>         
>>> Cool command; the output from the client is attached.  The four
>>> m45_amp214_om processes are the application that hung when working
>>> off of Lustre.  You can see they are stuck in I/O wait (the D state).
>>>
>>>       
>>
>>     
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>   



