[lustre-discuss] MGS failover problem

Vicker, Darby (JSC-EG311) darby.vicker-1 at nasa.gov
Fri Jan 13 22:33:34 PST 2017


Progress.  I did another round of "tunefs.lustre --writeconf" to take the IB out so we are on Ethernet only.  I think the MDS/MGS failover worked properly: note the "Connection restored to MGC192.52.98.30@tcp_1 (at 192.52.98.31@tcp)" message in the OSS logs below, and the ptlrpc_expire_one_request messages stop once this happens.  The info in /proc/fs/lustre/mgc still looks like it's pointed at the original MGS IP.  Ah, but I see that while the path /proc/fs/lustre/mgc/MGC192.52.98.30@tcp/import indicates this is still pointed at the primary, the *contents* indicate it is really connected to the secondary (see the very bottom).  I probably need to put the IB back in the mix and test this again...
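
For reference, the writeconf pass looks roughly like the sketch below; the device paths and OSS NIDs are placeholders rather than our real ones, and I'm showing the --servicenode style of specifying the failover pair for a combined MGS/MDT target:

# With every target unmounted on all servers:
umount /mnt/lustre/mdt0            # likewise each OST mount on the OSSes

# Combined MGS/MDT: re-list only the tcp NIDs, dropping the o2ib ones.
# /dev/mapper/mdt0 is a placeholder device name.
tunefs.lustre --erase-params --writeconf \
    --servicenode=192.52.98.30@tcp --servicenode=192.52.98.31@tcp \
    /dev/mapper/mdt0

# Each OST: point at the tcp MGS NIDs and list its own tcp failover pair.
# <oss00-tcp-nid>, <oss01-tcp-nid> and /dev/mapper/ost0 are placeholders.
tunefs.lustre --erase-params --writeconf \
    --mgsnode=192.52.98.30@tcp --mgsnode=192.52.98.31@tcp \
    --servicenode=<oss00-tcp-nid> --servicenode=<oss01-tcp-nid> \
    /dev/mapper/ost0

# Then mount the MGS/MDT first and the OSTs after it.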

It's still not clear to me whether this is a setup error on my part or a Lustre bug.  I'm inclined to think it's a bug, since the setup is the same and all I've really done is remove the IB network.  But maybe I'm doing something wrong with the multi-fabric setup.  Thoughts appreciated.
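
For the multi-fabric question, the LNET view on each node can be double-checked with something like the following (output not pasted here):

# Show the NIDs/networks this node is currently configured with
lctl list_nids
# With the dynamic LNET configuration tools (2.7+), this gives a fuller picture
lnetctl net show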

Details (all collected after the failover):

OSS00 logs:

Jan 13 23:54:27 hpfs-fsl-oss00 kernel: Lustre: 27683:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484373260/real 1484373260]  req@ffff881e9e8a3c00 x1556477804282864/t0(0) o400->hpfs-fsl-MDT0000-lwp-OST0000@192.52.98.30@tcp:12/10 lens 224/224 e 0 to 1 dl 1484373267 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 13 23:54:27 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-MDT0000-lwp-OST0000: Connection to hpfs-fsl-MDT0000 (at 192.52.98.30@tcp) was lost; in progress operations using this service will wait for recovery to complete
Jan 13 23:54:33 hpfs-fsl-oss00 kernel: Lustre: 27671:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484373267/real 1484373267]  req@ffff881e9e8a3f00 x1556477804282880/t0(0) o38->hpfs-fsl-MDT0000-lwp-OST0000@192.52.98.30@tcp:12/10 lens 520/544 e 0 to 1 dl 1484373273 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 13 23:54:58 hpfs-fsl-oss00 kernel: Lustre: 27671:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484373292/real 1484373292]  req@ffff881e9e8a4500 x1556477804282912/t0(0) o38->hpfs-fsl-MDT0000-lwp-OST0000@192.52.98.31@tcp:12/10 lens 520/544 e 0 to 1 dl 1484373298 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 13 23:55:04 hpfs-fsl-oss00 kernel: Lustre: 27682:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484373260/real 1484373260]  req@ffff881e9e8a3900 x1556477804282848/t0(0) o400->MGC192.52.98.30@tcp@192.52.98.30@tcp:26/25 lens 224/224 e 0 to 1 dl 1484373304 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 13 23:55:04 hpfs-fsl-oss00 kernel: LustreError: 166-1: MGC192.52.98.30@tcp: Connection to MGS (at 192.52.98.30@tcp) was lost; in progress operations using this service will fail
Jan 13 23:55:07 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-OST0000: Connection restored to MGC192.52.98.30@tcp_1 (at 192.52.98.31@tcp)
Jan 13 23:55:10 hpfs-fsl-oss00 kernel: Lustre: 27671:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484373304/real 1484373304]  req@ffff881e9e8a4800 x1556477804282928/t0(0) o250->MGC192.52.98.30@tcp@192.52.98.30@tcp:26/25 lens 520/544 e 0 to 1 dl 1484373310 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 13 23:55:29 hpfs-fsl-oss00 kernel: Lustre: Evicted from MGS (at MGC192.52.98.30@tcp_1) after server handle changed from 0x9027cb7bbd974ef1 to 0x1bd9753462f57d48
Jan 13 23:55:29 hpfs-fsl-oss00 kernel: Lustre: MGC192.52.98.30@tcp: Connection restored to MGC192.52.98.30@tcp_1 (at 192.52.98.31@tcp)
Jan 13 23:55:40 hpfs-fsl-oss00 kernel: Lustre: 27671:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484373329/real 1484373329]  req@ffff881e9e8a4e00 x1556477804282960/t0(0) o38->hpfs-fsl-MDT0000-lwp-OST0000@192.52.98.30@tcp:12/10 lens 520/544 e 0 to 1 dl 1484373340 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 13 23:55:46 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-OST0000: deleting orphan objects from 0x0:16904081 to 0x0:16904417
Jan 13 23:55:54 hpfs-fsl-oss00 kernel: LustreError: 167-0: hpfs-fsl-MDT0000-lwp-OST0000: This client was evicted by hpfs-fsl-MDT0000; in progress operations using this service will fail.
Jan 13 23:55:54 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-MDT0000-lwp-OST0000: Connection restored to 192.52.98.31@tcp (at 192.52.98.31@tcp)
Jan 14 00:30:01 hpfs-fsl-oss00 systemd: Starting Session 644 of user root.




[root@hpfs-fsl-oss00 ~]# cat /proc/fs/lustre/mgc/MGC192.52.98.30@tcp/import
import:
    name: MGC192.52.98.30@tcp
    target: MGS
    state: FULL
    connect_flags: [ version, adaptive_timeouts, mds_mds_connection, full20, imp_recov, bulk_mbits ]
    connect_data:
       flags: 0x2000011005000020
       instance: 0
       target_version: 2.9.51.0
    import_flags: [ pingable, connect_tried ]
    connection:
       failover_nids: [ 192.52.98.31@tcp, 192.52.98.30@tcp ]
       current_connection: 192.52.98.31@tcp
       connection_attempts: 3
       generation: 2
       in-progress_invalidations: 0
[root@hpfs-fsl-oss00 ~]#
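
The same state can also be pulled through lctl instead of reading /proc directly, which is handy for checking all the imports at once; something along these lines shows which NID the MGC is actually using:

# Show the MGC import state on the OSS (same data as the /proc file above)
lctl get_param mgc.*.import | grep -E 'state|failover_nids|current_connection'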

