[lustre-discuss] MGS failover problem

Ben Evans bevans at cray.com
Wed Jan 11 08:58:26 PST 2017


The question I have here is: how long are you waiting, and how are you
determining that lnet has hung?

How are you specifying --failnode for your configuration?  If you could
run tunefs.lustre on the MDT/MGS and an OST and post the output, that
would be very helpful.
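
For example, a dry run should just print the on-disk parameters without
changing anything (the dataset names below are placeholders for whatever
you formatted):

mds0# tunefs.lustre --dryrun <pool>/<mdt-dataset>
oss00# tunefs.lustre --dryrun <pool>/<ost-dataset>

The failover.node/servicenode and mgsnode entries in that output are
what tell the OSTs and clients where the backup MGS lives.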

Finally, how are you specifying the mount string on your various clients?
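
For failover to work, the clients also need both MGS NIDs in the mount
string, separated by a colon, e.g. something like this (filesystem name
guessed from your target names):

client# mount -t lustre 192.52.98.30@tcp:192.52.98.31@tcp:/hpfs-fsl /mnt/lustre

If only the primary NID is listed there, a client mounting while the MGS
is failed over will never find it.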

Getting failover right over multiple separate networks can be a real
hair-pulling experience.
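
On a mixed tcp/o2ib setup like yours, it is also worth confirming that
every node agrees on the LNet configuration, e.g. an
/etc/modprobe.d/lustre.conf along the lines of (interface names here are
placeholders):

options lnet networks="tcp0(eth0),o2ib0(ib0)"

and that the failover NIDs were registered on the network the clients
actually use to reach the MGS.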

-Ben Evans

On 1/11/17, 11:29 AM, "lustre-discuss on behalf of Vicker, Darby
(JSC-EG311)" <lustre-discuss-bounces at lists.lustre.org on behalf of
darby.vicker-1 at nasa.gov> wrote:

>I tried a failover, making sure lustre, including lnet, was completely
>shut down on the primary MDS.  This didn't work either; lnet hung like I
>remembered.  So I powered down the primary MDS to force it offline and
>then mounted lustre on the secondary MDS.  The services and a client
>recover, but the OSTs still appear to be pointing to the primary MGS
>(same lctl output and /proc/fs/lustre/mgc), and the
>ptlrpc_expire_one_request messages start up on the OSSes.  I then tried
>to remount an OST, thinking that it might contact the secondary MGS
>properly when mounting.  That also did not work.
>
>Any ideas why lnet is hanging when I try to stop it on the MDS?  This
>works properly on the OSS.
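>
>Next time, before power-cycling, I can try to see which peer socklnd is
>waiting on, e.g. with something like:
>
>mds0# lctl --net tcp peer_list
>mds0# lctl --net tcp conn_list
>
>(assuming those old-style lctl network commands still work in 2.9) to
>see whether one of the OSS or client connections never tears down.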
>
>It sure seems like we either don't have something configured properly or
>we aren't doing the failover properly (or there is a bug in lustre).
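>
>For reference, a failover-capable MGS/MDT would normally be formatted
>with both NIDs registered as service nodes, something along these lines
>(a sketch, not our exact command; the pool/dataset name is a
>placeholder):
>
>mds0# mkfs.lustre --mdt --mgs --backfstype=zfs --fsname=hpfs-fsl \
>        --index=0 --servicenode=192.52.98.30@tcp \
>        --servicenode=192.52.98.31@tcp <pool>/<mdt-dataset>
>
>Similarly, the OSTs need both MGS NIDs via --mgsnode at format time; if
>they were only given 192.52.98.30@tcp, they would have no way to find
>the backup MGS.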
>
>
>
> 
>
>The details of what was described above follow.  On the primary MDS:
>
>mds0# cd /etc/init.d ; ./lustre stop
>
>This returns quickly:
>
>
>Jan 11 09:15:53 hpfs-fsl-mds0 kernel: Lustre: Failing over
>hpfs-fsl-MDT0000
>Jan 11 09:15:54 hpfs-fsl-mds0 kernel: LustreError: 137-5:
>hpfs-fsl-MDT0000_UUID: not available for connect from 192.52.98.32
>@tcp (no target). If you are running an HA pair check that the target is
>mounted on the other server.
>Jan 11 09:15:54 hpfs-fsl-mds0 kernel: LustreError: Skipped 1 previous
>similar message
>Jan 11 09:15:54 hpfs-fsl-mds0 kernel: LustreError: 137-5:
>hpfs-fsl-MDT0000_UUID: not available for connect from 192.52.98.35
>@tcp (no target). If you are running an HA pair check that the target is
>mounted on the other server.
>Jan 11 09:15:54 hpfs-fsl-mds0 kernel: LustreError: Skipped 1 previous
>similar message
>Jan 11 09:15:56 hpfs-fsl-mds0 kernel: LustreError: 137-5:
>hpfs-fsl-MDT0000_UUID: not available for connect from 192.52.98.40
>@tcp (no target). If you are running an HA pair check that the target is
>mounted on the other server.
>Jan 11 09:15:59 hpfs-fsl-mds0 kernel: Lustre:
>21424:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has
>timed
> out for slow reply: [sent 1484147753/real 1484147753]
>req@ffff881eccfb6900 x1556149769946448/t0(0)
>o251->MGC192.52.98.30@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1484147759
>ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
>Jan 11 09:15:59 hpfs-fsl-mds0 kernel: Lustre: server umount
>hpfs-fsl-MDT0000 complete
>
>Then stop lnet:
>
>mds0# ./lnet stop
>
>This hangs:
>
>
>Jan 11 09:16:35 hpfs-fsl-mds0 kernel: LNetError:
>7065:0:(lib-move.c:1990:lnet_parse()) 192.52.98.39@tcp, src
>192.52.98.39@tcp: Dropping PUT (error -108 looking up sender)
>Jan 11 09:16:36 hpfs-fsl-mds0 kernel: LNet: Removed LNI 10.148.0.30 at o2ib
>Jan 11 09:16:37 hpfs-fsl-mds0 kernel: LNet:
>21555:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to
>disconnect
>Jan 11 09:16:41 hpfs-fsl-mds0 kernel: LNet:
>21555:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to
>disconnect
>Jan 11 09:16:49 hpfs-fsl-mds0 kernel: LNet:
>21555:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to
>disconnect
>Jan 11 09:17:05 hpfs-fsl-mds0 kernel: LNet:
>21555:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to
>disconnect
>Jan 11 09:17:37 hpfs-fsl-mds0 kernel: LNet:
>21555:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to
>disconnect
>Jan 11 09:18:41 hpfs-fsl-mds0 kernel: LNet:
>21555:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to
>disconnect
>Jan 11 09:20:49 hpfs-fsl-mds0 kernel: LNet:
>21555:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to
>disconnect
>Jan 11 09:25:05 hpfs-fsl-mds0 kernel: LNet:
>21555:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to
>disconnect
>
>
>
>Mds0 was powered down at this point.  I looked back through the logs at
>the last time I tried this; eventually lnet dumps a stack trace.  Here's
>that info from the previous attempt:
>
>
>
>Jan  9 16:26:13 hpfs-fsl-mds0 kernel: Lustre: Failing over
>hpfs-fsl-MDT0000
>Jan  9 16:26:19 hpfs-fsl-mds0 kernel: Lustre:
>25690:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has
>timed out for slow reply: [sent 1484000773/real 1484000773]
>req@ffff88069d615400 x1556086544936704/t0(0)
>o251->MGC192.52.98.30@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1484000779
>ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
>Jan  9 16:26:19 hpfs-fsl-mds0 kernel: Lustre:
>25690:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 11 previous
>similar messages
>Jan  9 16:26:20 hpfs-fsl-mds0 kernel: Lustre: server umount
>hpfs-fsl-MDT0000 complete
>Jan  9 16:26:39 hpfs-fsl-mds0 kernel: LNetError:
>25392:0:(lib-move.c:1990:lnet_parse()) 192.52.98.40 at tcp, src
>192.52.98.40 at tcp: Dropping PUT (error -108 looking up sender)
>Jan  9 16:26:40 hpfs-fsl-mds0 kernel: LNet: Removed LNI 10.148.0.30 at o2ib
>Jan  9 16:26:41 hpfs-fsl-mds0 kernel: LNet:
>25820:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to
>disconnect
>Jan  9 16:26:45 hpfs-fsl-mds0 kernel: LNet:
>25820:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to
>disconnect
>Jan  9 16:26:53 hpfs-fsl-mds0 kernel: LNet:
>25820:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to
>disconnect
>Jan  9 16:27:09 hpfs-fsl-mds0 kernel: LNet:
>25820:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to
>disconnect
>Jan  9 16:27:41 hpfs-fsl-mds0 kernel: LNet:
>25820:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to
>disconnect
>Jan  9 16:28:45 hpfs-fsl-mds0 kernel: LNet:
>25820:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to
>disconnect
>Jan  9 16:30:53 hpfs-fsl-mds0 kernel: LNet:
>25820:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to
>disconnect
>Jan  9 16:35:09 hpfs-fsl-mds0 kernel: LNet:
>25820:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to
>disconnect
>Jan  9 16:42:54 hpfs-fsl-mds0 kernel: INFO: task lctl:25908 blocked for
>more than 120 seconds.
>Jan  9 16:42:54 hpfs-fsl-mds0 kernel: "echo 0 >
>/proc/sys/kernel/hung_task_timeout_secs" disables this message.
>Jan  9 16:42:54 hpfs-fsl-mds0 kernel: lctl            D ffffffffa0d0b560
>   0 25908  25900 0x00000084
>Jan  9 16:42:54 hpfs-fsl-mds0 kernel: ffff881e9ffc7d20 0000000000000082
>ffff880f77a7bec0 ffff881e9ffc7fd8
>Jan  9 16:42:54 hpfs-fsl-mds0 kernel: ffff881e9ffc7fd8 ffff881e9ffc7fd8
>ffff880f77a7bec0 ffffffffa0d0b558
>Jan  9 16:42:54 hpfs-fsl-mds0 kernel: ffffffffa0d0b55c ffff880f77a7bec0
>00000000ffffffff ffffffffa0d0b560
>Jan  9 16:42:54 hpfs-fsl-mds0 kernel: Call Trace:
>Jan  9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffff8168c989>]
>schedule_preempt_disabled+0x29/0x70
>Jan  9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffff8168a5e5>]
>__mutex_lock_slowpath+0xc5/0x1c0
>Jan  9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffff81689a4f>]
>mutex_lock+0x1f/0x2f
>Jan  9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffffa0cccf45>]
>LNetNIInit+0x45/0xa10 [lnet]
>Jan  9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffff811806bb>] ?
>unlock_page+0x2b/0x30
>Jan  9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffffa0ce6372>]
>lnet_configure+0x52/0x80 [lnet]
>Jan  9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffffa0ce64eb>]
>lnet_ioctl+0x14b/0x180 [lnet]
>Jan  9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffffa0bf2e5c>]
>libcfs_ioctl+0x2ac/0x4c0 [libcfs]
>Jan  9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffffa0bef427>]
>libcfs_psdev_ioctl+0x67/0xf0 [libcfs]
>Jan  9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffff81212035>]
>do_vfs_ioctl+0x2d5/0x4b0
>Jan  9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffff8121ccd7>] ?
>__fd_install+0x47/0x60
>Jan  9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffff812122b1>]
>SyS_ioctl+0xa1/0xc0
>Jan  9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffff816967c9>]
>system_call_fastpath+0x16/0x1b
>Jan  9 16:43:41 hpfs-fsl-mds0 kernel: LNet:
>25820:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to
>disconnect
>
>
>
>
>
>
>So with the primary MDS shut down I mounted on the secondary MDS:
>
>mds1# cd /etc/init.d/ ; ./lustre start
>
>
>
>
>Jan 11 09:29:48 hpfs-fsl-mds1 kernel: LNet: HW nodes: 2, HW CPU cores:
>16, npartitions: 2
>Jan 11 09:29:48 hpfs-fsl-mds1 kernel: alg: No test for adler32
>(adler32-zlib)
>Jan 11 09:29:48 hpfs-fsl-mds1 kernel: alg: No test for crc32 (crc32-table)
>Jan 11 09:29:56 hpfs-fsl-mds1 kernel: LNet: Added LNI 192.52.98.31 at tcp
>[8/256/0/180]
>Jan 11 09:29:56 hpfs-fsl-mds1 kernel: LNet: Using FMR for registration
>Jan 11 09:29:57 hpfs-fsl-mds1 kernel: LNet: Added LNI 10.148.0.31 at o2ib
>[8/256/0/180]
>Jan 11 09:29:57 hpfs-fsl-mds1 kernel: LNet: Accept secure, port 988
>Jan 11 09:30:22 hpfs-fsl-mds1 kernel: Lustre: Lustre: Build Version:
>2.9.51
>Jan 11 09:30:22 hpfs-fsl-mds1 kernel: Lustre: MGS: Connection restored to
>d08a6361-1b98-2c42-a6c4-ec1317aa9351 (at 0 at lo)
>Jan 11 09:30:23 hpfs-fsl-mds1 kernel: Lustre: hpfs-fsl-MDT0000:
>Imperative Recovery not enabled, recovery window 300-900
>Jan 11 09:30:28 hpfs-fsl-mds1 kernel: Lustre:
>10312:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has
>timed out for slow reply: [sent 1484148623/real 1484148626]
>req at ffff881010219e00 x1556242625462976/t0(0)
>o38->hpfs-fsl-MDT0000-lwp-MDT0000 at 192.52.98.30@tcp:12/10 lens 520/544 e 0
>to 1 dl 1484148628 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
>Jan 11 09:30:48 hpfs-fsl-mds1 kernel: Lustre: hpfs-fsl-MDT0000:
>Connection restored to d08a6361-1b98-2c42-a6c4-ec1317aa9351 (at 0 at lo)
>Jan 11 09:31:01 hpfs-fsl-mds1 kernel: Lustre: hpfs-fsl-MDT0000:
>Connection restored to 192.52.98.32 at tcp (at 192.52.98.32 at tcp)
>Jan 11 09:31:03 hpfs-fsl-mds1 kernel: Lustre: hpfs-fsl-MDT0000:
>Connection restored to 192.52.98.40 at tcp (at 192.52.98.40 at tcp)
>Jan 11 09:31:03 hpfs-fsl-mds1 kernel: Lustre: Skipped 1 previous similar
>message
>Jan 11 09:31:08 hpfs-fsl-mds1 kernel: Lustre: hpfs-fsl-MDT0000:
>Connection restored to 192.52.98.43 at tcp (at 192.52.98.43 at tcp)
>Jan 11 09:31:08 hpfs-fsl-mds1 kernel: Lustre: Skipped 1 previous similar
>message
>Jan 11 09:31:26 hpfs-fsl-mds1 kernel: Lustre: hpfs-fsl-MDT0000: Will be
>in recovery for at least 5:00, or until 1 client reconnects
>Jan 11 09:31:26 hpfs-fsl-mds1 kernel: Lustre: MGS: Connection restored to
>47b7f6ce-5d63-8eb1-59b6-4d26560019e9 (at 192.52.98.55 at tcp)
>Jan 11 09:31:26 hpfs-fsl-mds1 kernel: Lustre: Skipped 1 previous similar
>message
>Jan 11 09:31:26 hpfs-fsl-mds1 kernel: Lustre: hpfs-fsl-MDT0000: Recovery
>over after 0:01, of 1 clients 1 recovered and 0 were evicted.
>Jan 11 09:31:50 hpfs-fsl-mds1 kernel: Lustre: hpfs-fsl-MDT0000:
>Connection restored to 192.52.98.42 at tcp (at 192.52.98.42 at tcp)
>Jan 11 09:31:50 hpfs-fsl-mds1 kernel: Lustre: Skipped 6 previous similar
>messages
>
>
>
>And the OSS log while all this is happening.  As mentioned above, note
>that the ptlrpc_expire_one_request messages to the primary MGS persist
>beyond when the MDT/MGS is mounted on the secondary MDS.
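>
>(To rule out a registration problem on the OST side, I can dump its
>on-disk parameters, e.g.:
>
>oss00# tunefs.lustre --dryrun oss00-0/ost-fsl
>
>and check whether the Parameters line lists 192.52.98.31@tcp as an
>mgsnode/failover NID in addition to 192.52.98.30@tcp.)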
>
>
>
>
>Jan 11 09:15:54 hpfs-fsl-oss00 kernel: LustreError: 11-0:
>hpfs-fsl-MDT0000-lwp-OST0000: operation obd_ping to node 192.52.98.30 at tcp
>failed: rc = -107
>Jan 11 09:15:54 hpfs-fsl-oss00 kernel: Lustre:
>hpfs-fsl-MDT0000-lwp-OST0000: Connection to hpfs-fsl-MDT0000 (at
>192.52.98.30 at tcp) was lost; in progress operations using this service
>will wait for recovery to complete
>Jan 11 09:16:01 hpfs-fsl-oss00 kernel: Lustre:
>17097:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has
>timed out for slow reply: [sent 1484147754/real 1484147754]
>req at ffff88101f2fbc00 x1556149818209744/t0(0)
>o400->MGC192.52.98.30 at tcp@192.52.98.30 at tcp:26/25 lens 224/224 e 0 to 1 dl
>1484147761 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
>Jan 11 09:16:01 hpfs-fsl-oss00 kernel: Lustre:
>17097:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 5 previous
>similar messages
>Jan 11 09:16:01 hpfs-fsl-oss00 kernel: LustreError: 166-1:
>MGC192.52.98.30 at tcp: Connection to MGS (at 192.52.98.30 at tcp) was lost; in
>progress operations using this service will fail
>Jan 11 09:17:27 hpfs-fsl-oss00 kernel: Lustre:
>17090:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has
>timed out for slow reply: [sent 1484147836/real 1484147836]
>req at ffff88101f256000 x1556149818209888/t0(0)
>o38->hpfs-fsl-MDT0000-lwp-OST0000 at 192.52.98.31@tcp:12/10 lens 520/544 e 0
>to 1 dl 1484147847 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
>Jan 11 09:17:27 hpfs-fsl-oss00 kernel: Lustre:
>17090:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 5 previous
>similar messages
>Jan 11 09:20:37 hpfs-fsl-oss00 kernel: Lustre:
>17090:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has
>timed out for slow reply: [sent 1484148011/real 1484148011]
>req at ffff88101f486f00 x1556149818210048/t0(0)
>o38->hpfs-fsl-MDT0000-lwp-OST0000 at 192.52.98.31@tcp:12/10 lens 520/544 e 0
>to 1 dl 1484148037 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
>Jan 11 09:20:37 hpfs-fsl-oss00 kernel: Lustre:
>17090:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 9 previous
>similar messages
>Jan 11 09:25:41 hpfs-fsl-oss00 kernel: Lustre:
>17090:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has
>timed out for slow reply: [sent 1484148286/real 1484148286]
>req at ffff88101f487b00 x1556149818210224/t0(0)
>o250->MGC192.52.98.30 at tcp@192.52.98.30 at tcp:26/25 lens 520/544 e 0 to 1 dl
>1484148341 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
>Jan 11 09:25:41 hpfs-fsl-oss00 kernel: Lustre:
>17090:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 12 previous
>similar messages
>Jan 11 09:30:23 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-OST0000:
>Connection restored to hpfs-fsl-MDT0000-mdtlov_UUID (at 192.52.98.31 at tcp)
>Jan 11 09:30:23 hpfs-fsl-oss00 kernel: Lustre: Skipped 1 previous similar
>message
>Jan 11 09:31:01 hpfs-fsl-oss00 kernel: LustreError: 167-0:
>hpfs-fsl-MDT0000-lwp-OST0000: This client was evicted by
>hpfs-fsl-MDT0000; in progress operations using this service will fail.
>Jan 11 09:31:01 hpfs-fsl-oss00 kernel: Lustre:
>hpfs-fsl-MDT0000-lwp-OST0000: Connection restored to 192.52.98.31 at tcp (at
>192.52.98.31 at tcp)
>Jan 11 09:31:26 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-OST0000: deleting
>orphan objects from 0x0:16904081 to 0x0:16904321
>Jan 11 09:36:06 hpfs-fsl-oss00 kernel: Lustre:
>17090:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has
>timed out for slow reply: [sent 1484148911/real 1484148914]
>req at ffff88101f35ad00 x1556149818210640/t0(0)
>o250->MGC192.52.98.30 at tcp@192.52.98.30 at tcp:26/25 lens 520/544 e 0 to 1 dl
>1484148966 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
>Jan 11 09:36:06 hpfs-fsl-oss00 kernel: Lustre:
>17090:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 14 previous
>similar messages
>Jan 11 09:47:21 hpfs-fsl-oss00 kernel: Lustre:
>17090:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has
>timed out for slow reply: [sent 1484149586/real 1484149586]
>req at ffff88101f35e600 x1556149818211216/t0(0)
>o250->MGC192.52.98.30 at tcp@192.52.98.30 at tcp:26/25 lens 520/544 e 0 to 1 dl
>1484149641 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
>Jan 11 09:47:21 hpfs-fsl-oss00 kernel: Lustre:
>17090:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 8 previous
>similar messages
>Jan 11 09:58:36 hpfs-fsl-oss00 kernel: Lustre:
>17090:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has
>timed out for slow reply: [sent 1484150261/real 1484150261]
>req at ffff88101f2f9b00 x1556149818211792/t0(0)
>o250->MGC192.52.98.30 at tcp@192.52.98.30 at tcp:26/25 lens 520/544 e 0 to 1 dl
>1484150316 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
>Jan 11 09:58:36 hpfs-fsl-oss00 kernel: Lustre:
>17090:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 8 previous
>similar messages
>Jan 11 10:09:51 hpfs-fsl-oss00 kernel: Lustre:
>17090:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has
>timed out for slow reply: [sent 1484150936/real 1484150936]
>req at ffff88101f483f00 x1556149818212368/t0(0)
>o250->MGC192.52.98.30 at tcp@192.52.98.30 at tcp:26/25 lens 520/544 e 0 to 1 dl
>1484150991 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
>Jan 11 10:09:51 hpfs-fsl-oss00 kernel: Lustre:
>17090:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 8 previous
>similar messages
>
>
>And this:
>
>[root at hpfs-fsl-oss00 ~]# date
>Wed Jan 11 10:12:43 CST 2017
>[root at hpfs-fsl-oss00 ~]# lctl dl
>  0 UP osd-zfs hpfs-fsl-OST0000-osd hpfs-fsl-OST0000-osd_UUID 5
>  1 UP mgc MGC192.52.98.30 at tcp 75fa2ba9-749d-e00f-84d3-e4e9b8753be3 5
>  2 UP ost OSS OSS_uuid 3
>  3 UP obdfilter hpfs-fsl-OST0000 hpfs-fsl-OST0000_UUID 7
>  4 UP lwp hpfs-fsl-MDT0000-lwp-OST0000 hpfs-fsl-MDT0000-lwp-OST0000_UUID
>5
>[root at hpfs-fsl-oss00 ~]# ls /proc/fs/lustre/mgc/
>MGC192.52.98.30 at tcp
>[root at hpfs-fsl-oss00 ~]#
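>
>For what it's worth, the MGC import might show whether a failover NID
>was ever registered, e.g. (assuming the MGC exposes an import file like
>other obd devices do):
>
>oss00# lctl get_param mgc.MGC192.52.98.30@tcp.import
>
>If failover_nids there only lists 192.52.98.30@tcp, the OSS never
>learned about the backup MGS.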
>
>
>
>I was wondering if it might work better to remount an OST with the
>filesystem failed over to the secondary MDS.  I tried that; all of the
>below is while the MDT is still mounted on mds1:
>
>
>
>[root at hpfs-fsl-oss00 ~]# date
>Wed Jan 11 10:14:23 CST 2017
>[root at hpfs-fsl-oss00 ~]# cd /etc/init.d/
>[root at hpfs-fsl-oss00 init.d]# ./lustre stop local
>Unmounting /mnt/lustre/local/hpfs-fsl-OST0000
>[root at hpfs-fsl-oss00 init.d]# ./lnet stop
>[root at hpfs-fsl-oss00 init.d]# ./lnet start
>LNET configured
>[root at hpfs-fsl-oss00 init.d]# ./lustre start local
>Mounting oss00-0/ost-fsl on /mnt/lustre/local/hpfs-fsl-OST0000
>[root at hpfs-fsl-oss00 init.d]# lctl dl
>  0 UP osd-zfs hpfs-fsl-OST0000-osd hpfs-fsl-OST0000-osd_UUID 5
>  1 UP mgc MGC192.52.98.30 at tcp 17af2f5d-ebd3-b57d-0c3d-9c7bc7654172 5
>  2 UP ost OSS OSS_uuid 3
>  3 UP obdfilter hpfs-fsl-OST0000 hpfs-fsl-OST0000_UUID 7
>  4 UP lwp hpfs-fsl-MDT0000-lwp-OST0000 hpfs-fsl-MDT0000-lwp-OST0000_UUID
>5
>[root at hpfs-fsl-oss00 init.d]# ls /proc/fs/lustre/mgc/
>MGC192.52.98.30 at tcp
>[root at hpfs-fsl-oss00 init.d]#
>
>
>
>Same result.  MDS1 and OSS00 logs are below.
>
>
>
>Jan 11 10:14:42 hpfs-fsl-mds1 kernel: Lustre:
>10323:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has
>timed out for slow reply: [sent 1484151275/real 1484151275]
>req at ffff882036ef7800 x1556242625566576/t0(0)
>o13->hpfs-fsl-OST0000-osc-MDT0000 at 192.52.98.32@tcp:7/4 lens 224/368 e 0
>to 1 dl 1484151282 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
>Jan 11 10:14:42 hpfs-fsl-mds1 kernel: Lustre:
>hpfs-fsl-OST0000-osc-MDT0000: Connection to hpfs-fsl-OST0000 (at
>192.52.98.32 at tcp) was lost; in progress operations using this service
>will wait for recovery to complete
>Jan 11 10:14:48 hpfs-fsl-mds1 kernel: Lustre:
>10312:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has
>timed out for slow reply: [sent 1484151282/real 1484151282]
>req at ffff88101a9ff800 x1556242625566928/t0(0)
>o8->hpfs-fsl-OST0000-osc-MDT0000 at 192.52.98.32@tcp:28/4 lens 520/544 e 0
>to 1 dl 1484151288 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
>Jan 11 10:15:43 hpfs-fsl-mds1 kernel: Lustre:
>10312:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has
>timed out for slow reply: [sent 1484151332/real 1484151332]
>req at ffff88101a9fce00 x1556242625568768/t0(0)
>o8->hpfs-fsl-OST0000-osc-MDT0000 at 192.52.98.32@tcp:28/4 lens 520/544 e 0
>to 1 dl 1484151343 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
>Jan 11 10:17:12 hpfs-fsl-mds1 kernel: Lustre: hpfs-fsl-MDT0000:
>Connection restored to 192.52.98.32 at tcp (at 192.52.98.32 at tcp)
>Jan 11 10:17:23 hpfs-fsl-mds1 kernel: Lustre:
>hpfs-fsl-OST0000-osc-MDT0000: Connection restored to 192.52.98.32 at tcp (at
>192.52.98.32 at tcp)
>
>
>
>
>
>Jan 11 10:14:30 hpfs-fsl-oss00 kernel: Lustre: Failing over
>hpfs-fsl-OST0000
>Jan 11 10:14:30 hpfs-fsl-oss00 kernel: Lustre: server umount
>hpfs-fsl-OST0000 complete
>Jan 11 10:15:18 hpfs-fsl-oss00 kernel: LNet: Removed LNI 10.148.0.32 at o2ib
>Jan 11 10:15:20 hpfs-fsl-oss00 kernel: LNet: Removed LNI 192.52.98.32 at tcp
>Jan 11 10:15:26 hpfs-fsl-oss00 kernel: LNet: HW nodes: 2, HW CPU cores:
>16, npartitions: 2
>Jan 11 10:15:26 hpfs-fsl-oss00 kernel: alg: No test for adler32
>(adler32-zlib)
>Jan 11 10:15:26 hpfs-fsl-oss00 kernel: alg: No test for crc32
>(crc32-table)
>Jan 11 10:15:34 hpfs-fsl-oss00 kernel: LNet: Added LNI 192.52.98.32 at tcp
>[8/256/0/180]
>Jan 11 10:15:34 hpfs-fsl-oss00 kernel: LNet: Using FMR for registration
>Jan 11 10:15:34 hpfs-fsl-oss00 kernel: LNet: Added LNI 10.148.0.32 at o2ib
>[8/256/0/180]
>Jan 11 10:15:34 hpfs-fsl-oss00 kernel: LNet: Accept secure, port 988
>Jan 11 10:15:41 hpfs-fsl-oss00 kernel: Lustre: Lustre: Build Version:
>2.9.51
>Jan 11 10:15:43 hpfs-fsl-oss00 kernel: LustreError: 137-5:
>hpfs-fsl-OST0000_UUID: not available for connect from 192.52.98.55 at tcp
>(no target). If you are running an HA pair check that the target is
>mounted on the other server.
>Jan 11 10:15:46 hpfs-fsl-oss00 kernel: Lustre:
>20564:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has
>timed out for slow reply: [sent 1484151341/real 1484151341]
>req at ffff882020280000 x1556245476540432/t0(0)
>o250->MGC192.52.98.30 at tcp@192.52.98.30 at tcp:26/25 lens 520/544 e 0 to 1 dl
>1484151346 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
>Jan 11 10:16:16 hpfs-fsl-oss00 kernel: Lustre:
>20564:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has
>timed out for slow reply: [sent 1484151366/real 1484151366]
>req at ffff880fff120000 x1556245476540496/t0(0)
>o250->MGC192.52.98.30 at tcp@192.52.98.30 at tcp:26/25 lens 520/544 e 0 to 1 dl
>1484151376 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
>Jan 11 10:16:22 hpfs-fsl-oss00 kernel: LustreError: 137-5:
>hpfs-fsl-OST0000_UUID: not available for connect from 192.52.98.31 at tcp
>(no target). If you are running an HA pair check that the target is
>mounted on the other server.
>Jan 11 10:16:29 hpfs-fsl-oss00 kernel: LustreError:
>20547:0:(mgc_request.c:249:do_config_log_add()) MGC192.52.98.30 at tcp:
>failed processing log, type 4: rc = -110
>Jan 11 10:16:33 hpfs-fsl-oss00 kernel: LustreError: 137-5:
>hpfs-fsl-OST0000_UUID: not available for connect from 192.52.98.55 at tcp
>(no target). If you are running an HA pair check that the target is
>mounted on the other server.
>Jan 11 10:16:35 hpfs-fsl-oss00 kernel: LustreError:
>20629:0:(sec_config.c:1107:sptlrpc_target_local_read_conf()) missing llog
>context
>Jan 11 10:16:46 hpfs-fsl-oss00 kernel: Lustre:
>20564:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has
>timed out for slow reply: [sent 1484151391/real 1484151391]
>req at ffff880fff120300 x1556245476540528/t0(0)
>o250->MGC192.52.98.30 at tcp@192.52.98.30 at tcp:26/25 lens 520/544 e 0 to 1 dl
>1484151406 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
>Jan 11 10:16:52 hpfs-fsl-oss00 kernel: Lustre:
>20564:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has
>timed out for slow reply: [sent 1484151407/real 1484151407]
>req at ffff880fff120600 x1556245476540576/t0(0)
>o38->hpfs-fsl-MDT0000-lwp-OST0000 at 192.52.98.30@tcp:12/10 lens 520/544 e 0
>to 1 dl 1484151412 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
>Jan 11 10:17:04 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-OST0000:
>Imperative Recovery not enabled, recovery window 300-900
>Jan 11 10:17:12 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-OST0000: Will be
>in recovery for at least 5:00, or until 2 clients reconnect
>Jan 11 10:17:12 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-OST0000:
>Connection restored to  (at 192.52.98.31 at tcp)
>Jan 11 10:17:23 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-OST0000:
>Connection restored to  (at 192.52.98.55 at tcp)
>Jan 11 10:17:23 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-OST0000: Recovery
>over after 0:11, of 2 clients 2 recovered and 0 were evicted.
>Jan 11 10:17:32 hpfs-fsl-oss00 kernel: Lustre:
>20564:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has
>timed out for slow reply: [sent 1484151432/real 1484151432]
>req at ffff880fff090000 x1556245476540624/t0(0)
>o250->MGC192.52.98.30 at tcp@192.52.98.30 at tcp:26/25 lens 520/544 e 0 to 1 dl
>1484151452 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
>Jan 11 10:18:02 hpfs-fsl-oss00 kernel: Lustre:
>20564:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has
>timed out for slow reply: [sent 1484151457/real 1484151457]
>req at ffff880fff090600 x1556245476540656/t0(0)
>o250->MGC192.52.98.30 at tcp@192.52.98.30 at tcp:26/25 lens 520/544 e 0 to 1 dl
>1484151482 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
>Jan 11 10:18:57 hpfs-fsl-oss00 kernel: Lustre:
>20564:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has
>timed out for slow reply: [sent 1484151507/real 1484151507]
>req at ffff880fff090f00 x1556245476540704/t0(0)
>o250->MGC192.52.98.30 at tcp@192.52.98.30 at tcp:26/25 lens 520/544 e 0 to 1 dl
>1484151537 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
>
>
>_______________________________________________
>lustre-discuss mailing list
>lustre-discuss at lists.lustre.org
>http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


