[Lustre-discuss] Client errors then reboot

Oleg Drokin oleg.drokin at oracle.com
Mon Nov 22 18:00:07 PST 2010


Hello!

On Nov 22, 2010, at 2:04 PM, David Noriega wrote:

> We've got the latest Lustre running (1.8.4) and kernel
> 2.6.18-194.3.1.el5. I call it our primary client as it is what exposes
> the file system for others to use via NFS/Samba. Today the machine
> seemingly rebooted on its own, and checking the logs I see these
> messages:

Do you have automatic reboot on panic set? If so, that means you just ran into
a BUG() or LBUG.
If you have some sort of serial console set up, you should be able to see what it was there.
If you do not, then there is now no way to find out, but please consider setting
one up for the future.
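
A rough way to answer the reboot-on-panic question on a running client is sketched below. It assumes the standard Linux kernel.panic sysctl under /proc/sys/kernel/panic and, as an assumption about this Lustre version, an lnet panic_on_lbug entry under /proc/sys/lnet; the function names are made up for illustration.

```python
from pathlib import Path


def read_sysctl_int(path):
    """Return the integer stored in a /proc/sys file, or None if missing or unreadable."""
    try:
        return int(Path(path).read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return None


def describe_panic_settings(panic_path="/proc/sys/kernel/panic",
                            lbug_path="/proc/sys/lnet/panic_on_lbug"):
    """Summarize whether a panic (and possibly an LBUG) would reboot the node."""
    timeout = read_sysctl_int(panic_path)   # 0 means: stay halted after a panic
    on_lbug = read_sysctl_int(lbug_path)    # assumed location; None if not present
    return {
        "reboot_after_panic": timeout is not None and timeout != 0,
        "panic_timeout_s": timeout,
        "panic_on_lbug": on_lbug,
    }


if __name__ == "__main__":
    print(describe_panic_settings())
```

If reboot_after_panic comes back true, a silent reboot like the one described is consistent with a panic whose message was only ever printed to the console.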

> This is what I see on 192.168.5.101
> Nov 22 12:25:22 data2 kernel: LustreError:
> 4726:0:(socklnd_cb.c:1714:ksocknal_recv_hello()) Error -104 reading
> HELLO from 129.115.117.8
> Nov 22 12:25:22 data2 kernel: LustreError:
> 4726:0:(socklnd_cb.c:1714:ksocknal_recv_hello()) Skipped 8 previous
> similar messages
> Nov 22 12:36:02 data2 kernel: LustreError:
> 4725:0:(socklnd_cb.c:1714:ksocknal_recv_hello()) Error -104 reading
> HELLO from 129.115.117.8
> Nov 22 12:36:02 data2 kernel: LustreError:
> 4725:0:(socklnd_cb.c:1714:ksocknal_recv_hello()) Skipped 9 previous
> similar messages

So you are getting connection resets (-104 is ECONNRESET) from this node for some reason.
Was it the one that rebooted?
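
The negative numbers in these messages are Linux errno values with the sign flipped. A minimal sketch for decoding them follows; the table is hard-coded from the Linux errno.h values so it gives the same answer on any OS, and decode_rc is just an illustrative helper name. It also includes -107, which shows up later in the same logs.

```python
# Linux errno values, hard-coded so the result does not depend on the local OS.
LINUX_ERRNO = {
    104: ("ECONNRESET", "Connection reset by peer"),
    107: ("ENOTCONN", "Transport endpoint is not connected"),
    110: ("ETIMEDOUT", "Connection timed out"),
}


def decode_rc(rc):
    """Turn a kernel-style negative return code into a readable string."""
    name, text = LINUX_ERRNO.get(abs(rc), ("UNKNOWN", "not in this table"))
    return f"{rc}: {name} ({text})"


print(decode_rc(-104))  # prints "-104: ECONNRESET (Connection reset by peer)"
```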

> Nov 22 12:43:39 data2 kernel: Lustre:
> 23762:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
> x1351344462337868 sent from lustre-OST0002 to NID 129.115.117.8@tcp 7s
> ago has timed out (7s prior to deadline).
> Nov 22 12:43:39 data2 kernel:   req@ffff81004c42d800
> x1351344462337868/t0 o104->@NET_0x2000081737508_UUID:15/16 lens
> 296/384 e 0 to 1 dl 1290451419 ref 1 fl Rpc:N/0/0 rc 0/0

The attempt to send a blocking callback to 129.115.117.8 failed.
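
The "blocking callback" reading comes from the o104 field in the request dump: the number after "o" is the RPC opcode, and 104 is LDLM_BL_CALLBACK in the Lustre LDLM opcode list. A small sketch of pulling that out of a dump line is below; the regular expression and the helper name are illustrative, and the opcode table only covers the basic LDLM range.

```python
import re

# LDLM opcodes as listed in Lustre's lustre_idl.h (basic range only).
LDLM_OPCODES = {
    101: "LDLM_ENQUEUE",
    102: "LDLM_CONVERT",
    103: "LDLM_CANCEL",
    104: "LDLM_BL_CALLBACK",
    105: "LDLM_CP_CALLBACK",
    106: "LDLM_GL_CALLBACK",
}


def request_opcode(req_line):
    """Extract the numeric opcode from the 'oNNN->' field of a request dump line."""
    match = re.search(r"\bo(\d+)->", req_line)
    return int(match.group(1)) if match else None


line = ("x1351344462337868/t0 o104->@NET_0x2000081737508_UUID:15/16 lens "
        "296/384 e 0 to 1 dl 1290451419 ref 1 fl Rpc:N/0/0 rc 0/0")
op = request_opcode(line)
print(op, LDLM_OPCODES.get(op, "unknown"))  # prints "104 LDLM_BL_CALLBACK"
```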

> Nov 22 12:43:39 data2 kernel: LustreError: 138-a: lustre-OST0002: A
> client on nid 129.115.117.8@tcp was evicted due to a lock blocking
> callback to 129.115.117.8@tcp timed out: rc -107
> Nov 22 12:44:38 data2 kernel: Lustre:
> 23569:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
> x1351344462337882 sent from lustre-OST0003 to NID 129.115.117.8@tcp 0s
> ago has failed due to network error (7s prior to deadline).
> Nov 22 12:44:38 data2 kernel:   req@ffff810111b38400
> x1351344462337882/t0 o104->@NET_0x2000081737508_UUID:15/16 lens
> 296/384 e 0 to 1 dl 1290451485 ref 1 fl Rpc:N/0/0 rc 0/0
> Nov 22 12:44:38 data2 kernel: LustreError: 138-a: lustre-OST0003: A
> client on nid 129.115.117.8@tcp was evicted due to a lock blocking
> callback to 129.115.117.8@tcp timed out: rc -107

We are evicting this client (129.115.117.8) because we cannot deliver LDLM ASTs to it
and assume it is dead or in some wedged state.
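
To see at a glance which clients a server has evicted and from which targets, the 138-a messages can be tallied. The sketch below is only an illustration: the regular expression is written against the two eviction lines quoted above, accepts a NID written either as "129.115.117.8@tcp" or in the archive's "129.115.117.8 at tcp" form, and count_evictions is a made-up helper name.

```python
import re
from collections import Counter

# Matches e.g. "LustreError: 138-a: lustre-OST0002: A client on nid 1.2.3.4@tcp was evicted"
EVICT_RE = re.compile(
    r"LustreError: 138-a: (?P<target>\S+): A client on nid "
    r"(?P<nid>\S+?)(?: at |@)(?P<net>\w+) was evicted"
)


def count_evictions(log_text):
    """Count eviction messages per (client NID, target) pair."""
    flat = " ".join(log_text.split())  # undo any line wrapping in the pasted log
    return Counter(
        (m.group("nid") + "@" + m.group("net"), m.group("target"))
        for m in EVICT_RE.finditer(flat)
    )


sample = """
Nov 22 12:43:39 data2 kernel: LustreError: 138-a: lustre-OST0002: A
client on nid 129.115.117.8@tcp was evicted due to a lock blocking
callback to 129.115.117.8@tcp timed out: rc -107
Nov 22 12:44:38 data2 kernel: LustreError: 138-a: lustre-OST0003: A
client on nid 129.115.117.8 at tcp was evicted due to a lock blocking
callback to 129.115.117.8 at tcp timed out: rc -107
"""

for (nid, target), n in sorted(count_evictions(sample).items()):
    print(nid, target, n)
```

One client being evicted from several OSTs within a minute or so, as here, points at the client (or the path to it) rather than at any single target.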

Bye,
    Oleg

