[Lustre-discuss] Clients fail every now and again,

Brock Palen brockp at umich.edu
Tue Nov 18 15:27:18 PST 2008


Thanks,

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985



On Nov 18, 2008, at 4:47 PM, Andreas Dilger wrote:

> On Nov 18, 2008  12:14 -0500, Brock Palen wrote:
>> If that is the bug causing this, is the fix, until we upgrade to the
>> newer Lustre, to set statahead_max=0 again?
>
> Yes, this is another statahead bug.
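>
> For example, something like this on each client (a sketch; the llite
> directory name varies with the filesystem name and mount instance):
>
>     echo 0 > /proc/fs/lustre/llite/*/statahead_max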
>
>> I saw this same behavior this morning on a compute node.
>>
>> Brock Palen
>> www.umich.edu/~brockp
>> Center for Advanced Computing
>> brockp at umich.edu
>> (734)936-1985
>>
>>
>>
>> On Nov 16, 2008, at 10:49 PM, Yong Fan wrote:
>>
>>> Brock Palen wrote:
>>>> We consistently see random occurrences of a client being kicked out,
>>>> and while Lustre says it tries to reconnect, it almost never can
>>>> without a reboot:
>>>>
>>>>
>>> Maybe you can check:
>>> https://bugzilla.lustre.org/show_bug.cgi?id=15927
>>>
>>> Regards!
>>> --
>>> Fan Yong
>>>> Nov 14 18:28:18 nyx-login1 kernel: LustreError: 14130:0:(import.c:226:ptlrpc_invalidate_import()) nobackup-MDT0000_UUID: rc = -110 waiting for callback (3 != 0)
>>>> Nov 14 18:28:18 nyx-login1 kernel: LustreError: 14130:0:(import.c:230:ptlrpc_invalidate_import()) @@@ still on sending list  req@000001015dd9ec00 x979024/t0 o101->nobackup-MDT0000_UUID@10.164.3.246@tcp:12/10 lens 448/1184 e 0 to 100 dl 1226700928 ref 1 fl Rpc:RES/0/0 rc -4/0
>>>> Nov 14 18:28:18 nyx-login1 kernel: LustreError: 14130:0:(import.c:230:ptlrpc_invalidate_import()) Skipped 1 previous similar message
>>>> Nov 14 18:28:18 nyx-login1 kernel: Lustre: nobackup-MDT0000-mdc-00000100f7ef0400: Connection restored to service nobackup-MDT0000 using nid 10.164.3.246@tcp.
>>>> Nov 14 18:30:32 nyx-login1 kernel: LustreError: 11-0: an error occurred while communicating with 10.164.3.246@tcp. The mds_statfs operation failed with -107
>>>> Nov 14 18:30:32 nyx-login1 kernel: Lustre: nobackup-MDT0000-mdc-00000100f7ef0400: Connection to service nobackup-MDT0000 via nid 10.164.3.246@tcp was lost; in progress operations using this service will wait for recovery to complete.
>>>> Nov 14 18:30:32 nyx-login1 kernel: LustreError: 167-0: This client was evicted by nobackup-MDT0000; in progress operations using this service will fail.
>>>> Nov 14 18:30:32 nyx-login1 kernel: LustreError: 16523:0:(llite_lib.c:1549:ll_statfs_internal()) mdc_statfs fails: rc = -5
>>>> Nov 14 18:30:35 nyx-login1 kernel: LustreError: 16525:0:(client.c:716:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@000001000990fe00 x983192/t0 o41->nobackup-MDT0000_UUID@10.164.3.246@tcp:12/10 lens 128/400 e 0 to 100 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
>>>> Nov 14 18:30:35 nyx-login1 kernel: LustreError: 16525:0:(llite_lib.c:1549:ll_statfs_internal()) mdc_statfs fails: rc = -108
>>>>
>>>> Is there any way to make Lustre more robust against these types of
>>>> failures?  According to the manual (and many times in practice,
>>>> like rebooting an MDS), the filesystem should just block and come
>>>> back.  This client almost never comes back; after a while it will
>>>> say reconnected, but will fail again right away.
>>>>
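>>>> (Would raising the obd timeout help here?  For illustration only,
>>>> something like
>>>>
>>>>     echo 300 > /proc/sys/lustre/timeout
>>>>
>>>> on the servers and clients.  Not sure that is the issue, though.)
>>>>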
>>>> On the MDS I see:
>>>>
>>>> Nov 14 18:30:20 mds1 kernel: Lustre: nobackup-MDT0000: haven't heard from client 1284bfca-91bd-03f6-649c-f591e5d807d5 (at 141.212.31.43@tcp) in 227 seconds. I think it's dead, and I am evicting it.
>>>> Nov 14 18:30:28 mds1 kernel: LustreError: 11463:0:(handler.c:1515:mds_handle()) operation 41 on unconnected MDS from 12345-141.212.31.43@tcp
>>>> Nov 14 18:30:28 mds1 kernel: LustreError: 11463:0:(ldlm_lib.c:1536:target_send_reply_msg()) @@@ processing error (-107)  req@00000103f84eae00 x983190/t0 o41-><?>@<?>:0/0 lens 128/0 e 0 to 0 dl 1226705528 ref 1 fl Interpret:/0/0 rc -107/0
>>>> Nov 14 18:34:15 mds1 kernel: Lustre: nobackup-MDT0000: haven't heard from client 1284bfca-91bd-03f6-649c-f591e5d807d5 (at 141.212.31.43@tcp) in 227 seconds. I think it's dead, and I am evicting it.
>>>>
>>>> It just keeps kicking the client out; /proc/fs/lustre/health_check
>>>> on the client and on the servers reports healthy.
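>>>> (That is, on every node something like
>>>>
>>>>     cat /proc/fs/lustre/health_check
>>>>
>>>> reports "healthy".)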
>>>>
>>>> Brock Palen
>>>> www.umich.edu/~brockp
>>>> Center for Advanced Computing
>>>> brockp at umich.edu
>>>> (734)936-1985
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Lustre-discuss mailing list
>>>> Lustre-discuss at lists.lustre.org
>>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>>>
>>>
>>>
>>>
>>
>> _______________________________________________
>> Lustre-discuss mailing list
>> Lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
>
>



