[Lustre-discuss] Network problems

Arne Brutschy arne.brutschy at ulb.ac.be
Thu Oct 28 05:19:35 PDT 2010


Hi all,

we're using Lustre 1.8.3 on a gigabit network. We have 4 osts on 2 oss
and a single mgs. We are serving the users' homedirs (mostly small
files) for 64 clients on this network. It now happened for the third
time that the cluster went down: either the oss or the mgs block, and
nobody can access the lustre share anymore.

Looking at the logs, I see lots of connectivity errors:
        
        LustreError: 17792:0:(mgs_handler.c:641:mgs_handle()) MGS handle cmd=250 rc=-16
        LustreError: 17792:0:(mgs_handler.c:641:mgs_handle()) Skipped 3 previous similar messages
        LustreError: 17792:0:(ldlm_lib.c:1892:target_send_reply_msg()) @@@ processing error (-16)  req at f5ae642c x1344336331331741/t0 o250->6e1c6cb5-564f-49b0-a01e-e7e4605423a8 at NET_0x200000affffc1_UUID:0/0 lens 368/264 e 0 to 0 dl 1288258643 ref 1 fl Interpret:/0/0 rc -16/0
        LustreError: 17792:0:(ldlm_lib.c:1892:target_send_reply_msg()) Skipped 3 previous similar messages
        Lustre: 2895:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1348047450260821 sent from lustre-MDT0000-mdc-f3aa2e00 to NID 0 at lo 30s ago has timed out (30s prior to deadline).
          req at e66d1e00 x1348047450260821/t0 o38->lustre-MDT0000_UUID at 10.1.1.1@tcp:12/10 lens 368/584 e 0 to 1 dl 1288258632 ref 1 fl Rpc:N/0/0 rc 0/0
        Lustre: 2895:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 19 previous similar messages
        Lustre: MGS: haven't heard from client 03b9cdae-66f1-552b-8c7c-94a9499c8dcf (at 10.255.255.220 at tcp) in 228 seconds. I think it's dead, and I am evicting it.
        LustreError: 2893:0:(acceptor.c:455:lnet_acceptor()) Error -11 reading connection request from 10.255.255.199
        Lustre: 2936:0:(ldlm_lib.c:575:target_handle_reconnect()) MGS: fa602b20-b24c-bbcd-7003-b3b9bf702db4 reconnecting
        Lustre: 2936:0:(ldlm_lib.c:575:target_handle_reconnect()) Skipped 49 previous similar messages
        LustreError: 2888:0:(socklnd_cb.c:1707:ksocknal_recv_hello()) Error -104 reading HELLO from 10.255.255.199
        LustreError: 2888:0:(socklnd_cb.c:1707:ksocknal_recv_hello()) Skipped 2 previous similar messages
        LustreError: 11b-b: Connection to 10.255.255.199 at tcp at host 10.255.255.199 on port 988 was reset: is it running a compatible version of Lustre and is 10.255.255.199 at tcp one of its NIDs?
        Lustre: 20268:0:(ldlm_lib.c:804:target_handle_connect()) MGS: exp eb3ff200 already connecting
        Lustre: 17792:0:(ldlm_lib.c:875:target_handle_connect()) MGS: refuse reconnection from fa602b20-b24c-bbcd-7003-b3b9bf702db4 at 10.255.255.199@tcp to 0xeb3ff200; still busy with 2 active RPCs
        Lustre: 17792:0:(ldlm_lib.c:875:target_handle_connect()) Skipped 1 previous similar message
        Lustre: 2895:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1348047450261066 sent from lustre-MDT0000-mdc-f3aa2e00 to NID 0 at lo 30s ago has timed out (30s prior to deadline).
          req at ce186400 x1348047450261066/t0 o38->lustre-MDT0000_UUID at 10.1.1.1@tcp:12/10 lens 368/584 e 0 to 1 dl 1288259252 ref 1 fl Rpc:N/0/0 rc 0/0
        Lustre: 2895:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 19 previous similar messages
        Lustre: There was an unexpected network error while writing to 10.255.255.221: -110.
        Lustre: 20268:0:(ldlm_lib.c:804:target_handle_connect()) MGS: exp d8637600 already connecting
        Lustre: 19656:0:(ldlm_lib.c:875:target_handle_connect()) MGS: refuse reconnection from 7ee2fe58-3fab-c39a-8adb-c356d1bdc434 at 10.255.255.209@tcp to 0xd8637600; still busy with 1 active RPCs
        LustreError: 19656:0:(mgs_handler.c:641:mgs_handle()) MGS handle cmd=250 rc=-16
        LustreError: 19656:0:(mgs_handler.c:641:mgs_handle()) Skipped 3 previous similar messages
        LustreError: 19656:0:(ldlm_lib.c:1892:target_send_reply_msg()) @@@ processing error (-16)  req at f5ae6c2c x1344329600678410/t0 o250->7ee2fe58-3fab-c39a-8adb-c356d1bdc434 at NET_0x200000affffd1_UUID:0/0 lens 368/264 e 0 to 0 dl 1288259427 ref 1 fl Interpret:/0/0 rc -16/0
        LustreError: 19656:0:(ldlm_lib.c:1892:target_send_reply_msg()) Skipped 3 previous similar messages
        Lustre: There was an unexpected network error while writing to 10.255.255.209: -110.
        Lustre: 2935:0:(ldlm_lib.c:575:target_handle_reconnect()) MGS: fa602b20-b24c-bbcd-7003-b3b9bf702db4 reconnecting
        Lustre: 2935:0:(ldlm_lib.c:575:target_handle_reconnect()) Skipped 55 previous similar messages
        LustreError: 2890:0:(socklnd_cb.c:1707:ksocknal_recv_hello()) Error -11 reading HELLO from 10.255.255.191
        LustreError: 2890:0:(socklnd_cb.c:1707:ksocknal_recv_hello()) Skipped 1 previous similar message
        LustreError: 2893:0:(acceptor.c:455:lnet_acceptor()) Error -11 reading connection request from 10.255.255.220
        eth0: too many iterations (16) in nv_nic_irq.
        Lustre: MGS: haven't heard from client ed155959-9a71-653f-7d3b-0f4982ed020c (at 10.255.255.212 at tcp) in 222 seconds. I think it's dead, and I am evicting it.
        Lustre: Skipped 3 previous similar messages
        Lustre: There was an unexpected network error while writing to 10.255.255.215: -110.
        Lustre: 2935:0:(ldlm_lib.c:875:target_handle_connect()) MGS: refuse reconnection from 4e3abc28-f42b-ad98-3021-5557328f80a7 at 10.255.255.213@tcp to 0xee70bc00; still busy with 1 active RPCs
        Lustre: 2895:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1348047450261319 sent from lustre-MDT0000-mdc-f3aa2e00 to NID 0 at lo 30s ago has timed out (30s prior to deadline).
          req at ec7ffc00 x1348047450261319/t0 o38->lustre-MDT0000_UUID at 10.1.1.1@tcp:12/10 lens 368/584 e 0 to 1 dl 1288259872 ref 1 fl Rpc:N/0/0 rc 0/0
        Lustre: 2895:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 19 previous similar messages
        Lustre: 20268:0:(ldlm_lib.c:575:target_handle_reconnect()) MGS: ed155959-9a71-653f-7d3b-0f4982ed020c reconnecting
        Lustre: 20268:0:(ldlm_lib.c:575:target_handle_reconnect()) Skipped 53 previous similar messages
        LustreError: 2889:0:(socklnd_cb.c:1707:ksocknal_recv_hello()) Error -11 reading HELLO from 10.255.255.217
        LustreError: 2889:0:(socklnd_cb.c:1707:ksocknal_recv_hello()) Skipped 1 previous similar message
        Lustre: MGS: haven't heard from client 5ccc9289-412d-1e96-a777-f5c3dbcc3899 (at 10.255.255.218 at tcp) in 227 seconds. I think it's dead, and I am evicting it.
        Lustre: Skipped 3 previous similar messages
        Lustre: 2895:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1348047450261564 sent from lustre-MDT0000-mdc-f3aa2e00 to NID 0 at lo 30s ago has timed out (30s prior to deadline).
          req at eb72ba00 x1348047450261564/t0 o38->lustre-MDT0000_UUID at 10.1.1.1@tcp:12/10 lens 368/584 e 0 to 1 dl 1288260492 ref 1 fl Rpc:N/0/0 rc 0/0
        Lustre: 2895:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 19 previous similar messages
        LustreError: 14286:0:(mdc_locks.c:625:mdc_enqueue()) ldlm_cli_enqueue: -4
        LustreError: 14286:0:(file.c:3256:ll_inode_revalidate_fini()) failure -4 inode 38076417
        LustreError: 14286:0:(mdc_locks.c:625:mdc_enqueue()) ldlm_cli_enqueue: -4
        LustreError: 14286:0:(file.c:3256:ll_inode_revalidate_fini()) failure -4 inode 38076417
        LustreError: 21507:0:(mdc_locks.c:625:mdc_enqueue()) ldlm_cli_enqueue: -4
        LustreError: 21507:0:(file.c:3256:ll_inode_revalidate_fini()) failure -4 inode 38076417
        Lustre: 2935:0:(ldlm_lib.c:875:target_handle_connect()) MGS: refuse reconnection from ed155959-9a71-653f-7d3b-0f4982ed020c at 10.255.255.212@tcp to 0xd1681e00; still busy with 1 active RPCs
        Lustre: 2935:0:(ldlm_lib.c:875:target_handle_connect()) Skipped 1 previous similar message
        LustreError: 2935:0:(mgs_handler.c:641:mgs_handle()) MGS handle cmd=250 rc=-16
        LustreError: 2935:0:(mgs_handler.c:641:mgs_handle()) Skipped 2 previous similar messages
        LustreError: 2935:0:(ldlm_lib.c:1892:target_send_reply_msg()) @@@ processing error (-16)  req at f55c402c x1344329607901556/t0 o250->ed155959-9a71-653f-7d3b-0f4982ed020c at NET_0x200000affffd4_UUID:0/0 lens 368/264 e 0 to 0 dl 1288260660 ref 1 fl Interpret:/0/0 rc -16/0
        LustreError: 2935:0:(ldlm_lib.c:1892:target_send_reply_msg()) Skipped 2 previous similar messages
        LustreError: 21584:0:(mdc_locks.c:625:mdc_enqueue()) ldlm_cli_enqueue: -4
        LustreError: 21584:0:(file.c:3256:ll_inode_revalidate_fini()) failure -4 inode 38076417

So (I guess) this seems to be an issue with our network, which appears
under certain conditions (eg. high load). Another hint pointing in this
direction is the high number of RX errors I can see on the ports of the
connecting switch, and that some of the node are resetting their link.
In normal operation, I don't get any of these behaviors.

I suspect that something goes wrong while communicating over the
network, the mgs starts to evict clients and afterwards the whole system
spirals down in a storm of request. At least it didn't get better when I
resetted the switch, so it's not the switch alone. I only managed to fix
it by switching off half of the nodes...

Any thoughts on this? How can I know if your switch (a Netgear GS748T)
is up to the task? What type of backend do you recommend for a cluster
of this size?

regards,
Arne

-- 
Arne Brutschy
Ph.D. Student                    Email    arne.brutschy(AT)ulb.ac.be
IRIDIA CP 194/6                  Web      iridia.ulb.ac.be/~abrutschy
Universite' Libre de Bruxelles   Tel      +32 2 650 2273
Avenue Franklin Roosevelt 50     Fax      +32 2 650 2715
1050 Bruxelles, Belgium          (Fax at IRIDIA secretary)




More information about the lustre-discuss mailing list