[Lustre-discuss] Lustre clients getting evicted

Brock Palen brockp at umich.edu
Wed Feb 6 09:10:32 PST 2008


I was able to catch a client and server in the act:

client dmesg:

eth0: no IPv6 routers present
Lustre: nobackup-MDT0000-mdc-000001012bd39800: Connection to service nobackup-MDT0000 via nid 141.212.30.184@tcp was lost; in progress operations using this service will wait for recovery to complete.
LustreError: 167-0: This client was evicted by nobackup-MDT0000; in progress operations using this service will fail.
LustreError: 2757:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@00000100cfce6800 x3216741/t0 o101->nobackup-MDT0000_UUID@141.212.30.184@tcp:12 lens 448/768 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 2757:0:(mdc_locks.c:423:mdc_finish_enqueue()) ldlm_cli_enqueue: -108
LustreError: 2822:0:(file.c:97:ll_close_inode_openhandle()) inode 11237379 mdc close failed: rc = -108
LustreError: 2822:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@000001002966d000 x3216837/t0 o35->nobackup-MDT0000_UUID@141.212.30.184@tcp:12 lens 296/448 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 2822:0:(client.c:519:ptlrpc_import_delay_req()) Skipped 95 previous similar messages
LustreError: 2822:0:(file.c:97:ll_close_inode_openhandle()) inode 11270746 mdc close failed: rc = -108
LustreError: 2757:0:(mdc_locks.c:423:mdc_finish_enqueue()) ldlm_cli_enqueue: -108
LustreError: 2757:0:(mdc_locks.c:423:mdc_finish_enqueue()) Skipped 30 previous similar messages
LustreError: 2822:0:(file.c:97:ll_close_inode_openhandle()) Skipped 62 previous similar messages
LustreError: 2757:0:(dir.c:258:ll_get_dir_page()) lock enqueue: rc: -108
LustreError: 2757:0:(dir.c:412:ll_readdir()) error reading dir 11239903/324715747 page 2: rc -108
Lustre: nobackup-MDT0000-mdc-000001012bd39800: Connection restored to service nobackup-MDT0000 using nid 141.212.30.184@tcp.
LustreError: 11-0: an error occurred while communicating with 141.212.30.184@tcp. The mds_close operation failed with -116
LustreError: 11-0: an error occurred while communicating with 141.212.30.184@tcp. The mds_close operation failed with -116
LustreError: 11-0: an error occurred while communicating with 141.212.30.184@tcp. The mds_close operation failed with -116
LustreError: 2834:0:(file.c:97:ll_close_inode_openhandle()) inode 11270686 mdc close failed: rc = -116
LustreError: 2834:0:(file.c:97:ll_close_inode_openhandle()) Skipped 40 previous similar messages
LustreError: 11-0: an error occurred while communicating with 141.212.30.184@tcp. The mds_close operation failed with -116
LustreError: 2728:0:(file.c:97:ll_close_inode_openhandle()) inode 11240591 mdc close failed: rc = -116
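
For reference, the rc values in these messages are negated Linux errno
numbers: -108 is ESHUTDOWN, which the client returns for RPCs while the
import is marked IMP_INVALID after the eviction, and -116 is ESTALE. A
minimal sketch to decode them, using only the standard Python errno/os
modules (nothing Lustre-specific):

    import errno
    import os

    # Decode the negative rc values from the client log above.
    # errno.errorcode maps a Linux errno number back to its symbolic name.
    for rc in (-108, -116):
        code = -rc
        name = errno.errorcode.get(code, "unknown")
        print(f"rc = {rc}: {name} ({os.strerror(code)})")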

MDT dmesg:

LustreError: 9042:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107)  req@000001002b52b000 x445020/t0 o400-><?>@<?>:-1 lens 128/0 ref 0 fl Interpret:/0/0 rc -107/0
LustreError: 0:0:(ldlm_lockd.c:210:waiting_locks_callback()) ### lock callback timer expired: evicting client 2faf3c9e-26fb-64b7-ca6c-7c5b09374e67@NET_0x200000aa4008d_UUID nid 10.164.0.141@tcp  ns: mds-nobackup-MDT0000_UUID lock: 00000100476df240/0xbc269e05c512de3a lrc: 1/0,0 mode: CR/CR res: 11240142/324715850 bits 0x5 rrc: 2 type: IBT flags: 20 remote: 0x4e54bc800174cd08 expref: 372 pid 26925
Lustre: 3170:0:(mds_reint.c:127:mds_finish_transno()) commit transaction for disconnected client 2faf3c9e-26fb-64b7-ca6c-7c5b09374e67: rc 0
LustreError: 27505:0:(mds_open.c:1474:mds_close()) @@@ no handle for file close ino 11239903: cookie 0xbc269e05c51912d8  req@000001001e69c600 x3216892/t0 o35->2faf3c9e-26fb-64b7-ca6c-7c5b09374e67@NET_0x200000aa4008d_UUID:-1 lens 296/448 ref 0 fl Interpret:/0/0 rc 0/0
LustreError: 27505:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-116)  req@000001001e69c600 x3216892/t0 o35->2faf3c9e-26fb-64b7-ca6c-7c5b09374e67@NET_0x200000aa4008d_UUID:-1 lens 296/448 ref 0 fl Interpret:/0/0 rc -116/0
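
On the MDS side the same eviction is visible: the lock callback timer
expired because the client never answered the blocking callback, the
client was evicted, and its later mds_close for the now-forgotten open
handle was refused with -116 (ESTALE); the -107 is ENOTCONN. To map
evictions back to nodes, something like the following could be run over
a saved copy of the MDS dmesg. This is a rough sketch only: the file
argument is an assumption, and the regex is keyed to the 'evicting
client ... nid ...' line above:

    import re
    import sys
    from collections import Counter

    # Keyed to the MDS message above:
    #   "... lock callback timer expired: evicting client <uuid>@<net>_UUID nid <nid> ..."
    EVICT_RE = re.compile(r"evicting\s+client\s+(\S+)\s+nid\s+(\S+)")

    def evictions(path):
        """Return (client_uuid, nid) pairs for every eviction in a saved MDS dmesg."""
        with open(path) as log:
            return EVICT_RE.findall(log.read())

    if __name__ == "__main__":
        # Hypothetical usage: python mds_evictions.py mds-dmesg.txt
        counts = Counter(nid for _uuid, nid in evictions(sys.argv[1]))
        for nid, n in counts.most_common():
            print(f"{n:4d}  {nid}")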

I hope this helps, but I see it happen more often with the OSS
evicting clients.
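
To put a number on the "OSS evicts more often" impression, one option is
to tally which target each client blames in its own dmesg. A hypothetical
helper along those lines, assuming the client dmesg output has been
collected into files passed as arguments; the message text is taken from
the client log above, everything else is made up for illustration:

    import re
    import sys
    from collections import Counter

    # The client-side message reads: "This client was evicted by <target>;"
    EVICTED_BY = re.compile(r"This client was evicted by (\S+?);")

    def tally(paths):
        """Count evictions per evicting target (MDT vs. OSTs) across saved client dmesg files."""
        counts = Counter()
        for path in paths:
            with open(path) as log:
                counts.update(EVICTED_BY.findall(log.read()))
        return counts

    if __name__ == "__main__":
        # Hypothetical usage: python tally_evictions.py node*/dmesg.txt
        for target, n in tally(sys.argv[1:]).most_common():
            print(f"{n:4d}  {target}")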

On Feb 6, 2008, at 10:59 AM, Brock Palen wrote:

>>> If a client gets evicted by the server, it might be triggered by:
>>
>> 1) The server did not get the client's pinger message for a long time.
>> 2) The client is too busy to handle the server's lock cancel request.
>
> Clients show a load of 4.2  (4 cores total, 1 process per core).
>
>> 3) The client cancelled the lock, but the network dropped the cancel
>> reply to the server.
> I see a very small number (6339) of dropped packets on the interfaces
> of the OSS.  Links between the switches show no errors.
>
>
>> 4) The server is too busy to handle the lock cancel reply from the
>> client, or is blocked somewhere.
>
> I started paying more attention to the OSS once you said this; sometimes
> I see the CPU use of socknal_sd00 reach 100%.  Now, is this the process
> that keeps all the obd_pings going?
>
> Both the OSS and the MDS/MGS are SMP systems and run single interfaces.
> If I dual-homed the servers, would that create another socknal process
> for LNET?
>
>
>>
>> It seems there are a lot of metadata operations in your job. I guess
>> your eviction might be caused by the latter two reasons. If you could
>> provide the process stack trace on the MDS when the job died, it might
>> help us figure out what is going on there.
>>
>> WangDi
>>> Brock Palen
>>> Center for Advanced Computing
>>> brockp at umich.edu
>>> (734)936-1985
>>>
>>>
>>> On Feb 4, 2008, at 2:47 PM, Brock Palen wrote:
>>>
>>>
>>>>> Which version of Lustre do you use?
>>>>> Are the server and clients the same version and the same OS? Which one?
>>>>>
>>>> lustre-1.6.4.1
>>>>
>>>> The servers (OSS and MDS/MGS) use the RHEL4 RPM from lustre.org:
>>>> 2.6.9-55.0.9.EL_lustre.1.6.4.1smp
>>>>
>>>> The clients run patchless RHEL4
>>>> 2.6.9-67.0.1.ELsmp
>>>>
>>>> One set of clients is on a 10.x network while the servers and the
>>>> other half of the clients are on a 141.x network. Because we are
>>>> using the tcp network type, we have not set up any LNET routes. I
>>>> don't think this should cause a problem, but I include the
>>>> information for clarity. We do route 10.x on campus.
>>>>
>>>>
>>>>> Harald
>>>>>
>>>>> On Monday 04 February 2008 04:11 pm, Brock Palen wrote:
>>>>>
>>>>>> On our cluster, which has been running Lustre for about 1 month, I
>>>>>> have 1 MDT/MGS and 1 OSS with 2 OSTs.
>>>>>>
>>>>>> Our cluster uses all GigE and has about 608 nodes / 1854 cores.
>>>>>>
>>>>>> We have a lot of jobs that die and/or go into high I/O wait; strace
>>>>>> shows processes stuck in fstat().
>>>>>>
>>>>>> The big problem (I think), and I would like some feedback on it, is
>>>>>> that of these 608 nodes, 209 of them have the following string in
>>>>>> dmesg:
>>>>>>
>>>>>> "This client was evicted by"
>>>>>>
>>>>>> Is it normal for clients to be dropped like this?  Is there some
>>>>>> tuning that needs to be done on the server to carry this many nodes
>>>>>> out of the box?  We are using a default Lustre install with GigE.
>>>>>>
>>>>>>
>>>>>> Brock Palen
>>>>>> Center for Advanced Computing
>>>>>> brockp at umich.edu
>>>>>> (734)936-1985
>>>>>>
>>>>>>
>>>>> -- 
>>>>> Harald van Pee
>>>>>
>>>>> Helmholtz-Institut fuer Strahlen- und Kernphysik der
>>>>> Universitaet  Bonn
>>>
>>
>>
>>
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>
>



