[Lustre-discuss] Clients fail every now and again,
Andreas Dilger
adilger at sun.com
Tue Nov 18 13:47:17 PST 2008
On Nov 18, 2008 12:14 -0500, Brock Palen wrote:
> if that is the bug causing this, is the fix, until we upgrade to the
> newer Lustre, to set statahead_max=0 again?
Yes, this is another statahead bug.
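As a sketch of the workaround discussed above: statahead can be disabled per client through the llite tunables. The exact /proc path varies by Lustre version and filesystem name, so treat this as an illustration rather than the canonical procedure for your release:

```shell
#!/bin/sh
# Workaround (not a fix): disable statahead on a client until the
# upgrade. Iterate because there is one entry per mounted filesystem.
for f in /proc/fs/lustre/llite/*/statahead_max; do
    [ -e "$f" ] && echo 0 > "$f"
done

# On releases that support it, the equivalent via lctl:
# lctl set_param llite.*.statahead_max=0
```

Note this setting does not persist across a remount or reboot, so it needs to be reapplied (e.g. from a boot script) on every client until the upgraded Lustre is in place.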
> I see this same behavior this morning on a compute node.
>
> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> brockp at umich.edu
> (734)936-1985
>
>
>
> On Nov 16, 2008, at 10:49 PM, Yong Fan wrote:
>
> > Brock Palen wrote:
> >> We consistently see random occurrences of a client being kicked
> >> out, and while Lustre says it tries to reconnect, it almost never
> >> can without a reboot:
> >>
> >>
> > Maybe you can check:
> > https://bugzilla.lustre.org/show_bug.cgi?id=15927
> >
> > Regards!
> > --
> > Fan Yong
> >> Nov 14 18:28:18 nyx-login1 kernel: LustreError: 14130:0:(import.c:
> >> 226:ptlrpc_invalidate_import()) nobackup-MDT0000_UUID: rc = -110
> >> waiting for callback (3 != 0)
> >> Nov 14 18:28:18 nyx-login1 kernel: LustreError: 14130:0:(import.c:
> >> 230:ptlrpc_invalidate_import()) @@@ still on sending list
> >> req@000001015dd9ec00 x979024/t0 o101->nobackup-
> >> MDT0000_UUID@10.164.3.246@tcp:12/10 lens 448/1184 e 0 to 100 dl
> >> 1226700928 ref 1 fl Rpc:RES/0/0 rc -4/0
> >> Nov 14 18:28:18 nyx-login1 kernel: LustreError: 14130:0:(import.c:
> >> 230:ptlrpc_invalidate_import()) Skipped 1 previous similar message
> >> Nov 14 18:28:18 nyx-login1 kernel: Lustre: nobackup-
> >> MDT0000-mdc-00000100f7ef0400: Connection restored to service
> >> nobackup-MDT0000 using nid 10.164.3.246@tcp.
> >> Nov 14 18:30:32 nyx-login1 kernel: LustreError: 11-0: an error
> >> occurred while communicating with 10.164.3.246@tcp. The
> >> mds_statfs operation failed with -107
> >> Nov 14 18:30:32 nyx-login1 kernel: Lustre: nobackup-MDT0000-
> >> mdc-00000100f7ef0400: Connection to service nobackup-MDT0000 via
> >> nid 10.164.3.246@tcp was lost; in progress operations using this
> >> service will wait for recovery to complete.
> >> Nov 14 18:30:32 nyx-login1 kernel: LustreError: 167-0: This
> >> client was evicted by nobackup-MDT0000; in progress operations
> >> using this service will fail.
> >> Nov 14 18:30:32 nyx-login1 kernel: LustreError: 16523:0:
> >> (llite_lib.c: 1549:ll_statfs_internal()) mdc_statfs fails: rc = -5
> >> Nov 14 18:30:35 nyx-login1 kernel: LustreError: 16525:0:(client.c:
> >> 716:ptlrpc_import_delay_req()) @@@ IMP_INVALID
> >> req@000001000990fe00 x983192/t0 o41->nobackup-
> >> MDT0000_UUID@10.164.3.246@tcp:12/10 lens 128/400 e 0 to 100 dl 0
> >> ref 1 fl Rpc:/0/0 rc 0/0
> >> Nov 14 18:30:35 nyx-login1 kernel: LustreError: 16525:0:
> >> (llite_lib.c: 1549:ll_statfs_internal()) mdc_statfs fails: rc = -108
> >>
> >> Is there any way to make Lustre more robust against these types
> >> of failures? According to the manual (and many times in
> >> practice, such as when rebooting an MDS) the filesystem should
> >> just block and come back. In this case it almost never recovers:
> >> after a while it says it has reconnected, but fails again right away.
> >>
> >> On the MDS I see:
> >>
> >> Nov 14 18:30:20 mds1 kernel: Lustre: nobackup-MDT0000: haven't
> >> heard from client 1284bfca-91bd-03f6-649c-f591e5d807d5 (at
> >> 141.212.31.43@tcp) in 227 seconds. I think it's dead, and I am
> >> evicting it.
> >> Nov 14 18:30:28 mds1 kernel: LustreError: 11463:0:(handler.c:
> >> 1515:mds_handle()) operation 41 on unconnected MDS from
> >> 12345-141.212.31.43@tcp
> >> Nov 14 18:30:28 mds1 kernel: LustreError: 11463:0:(ldlm_lib.c:
> >> 1536:target_send_reply_msg()) @@@ processing error (-107)
> >> req@00000103f84eae00 x983190/t0 o41-><?>@<?>:0/0 lens 128/0 e 0 to
> >> 0 dl 1226705528 ref 1 fl Interpret:/0/0 rc -107/0
> >> Nov 14 18:34:15 mds1 kernel: Lustre: nobackup-MDT0000: haven't
> >> heard from client 1284bfca-91bd-03f6-649c-f591e5d807d5 (at
> >> 141.212.31.43@tcp) in 227 seconds. I think it's dead, and I am
> >> evicting it.
> >>
> >> It just keeps evicting the client, while /proc/fs/lustre/health_check
> >> on both the client and the servers reports healthy.
> >>
> >> Brock Palen
> >> www.umich.edu/~brockp
> >> Center for Advanced Computing
> >> brockp at umich.edu
> >> (734)936-1985
> >>
> >>
> >>
> >> _______________________________________________
> >> Lustre-discuss mailing list
> >> Lustre-discuss at lists.lustre.org
> >> http://lists.lustre.org/mailman/listinfo/lustre-discuss
> >>
> >
> >
> >
>
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.