[lustre-discuss] MGS failover problem

Vicker, Darby (JSC-EG311) darby.vicker-1 at nasa.gov
Wed Jan 11 08:29:22 PST 2017


I tried a failover, making sure Lustre, including lnet, was completely shut down on the primary MDS.  This didn't work either.  Lnet hung like I remembered, so I powered down the primary MDS to force it offline and then mounted Lustre on the secondary MDS.  The services and a client recover, but the OSTs still appear to be pointing to the primary MGS (same lctl output and /proc/fs/lustre/mgc), and the ptlrpc_expire_one_request messages start up on the OSSes.  I then tried to remount an OST, thinking that it might contact the secondary MGS properly when mounting.  That also did not work.
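
In case it's useful to anyone looking at this, I believe the MGC import is a deeper check than lctl dl here, since the MGC device name always reflects the first configured MGS NID regardless of which one is actually in use.  Something like this (a sketch; I haven't captured this output yet):

oss00# lctl get_param mgc.*.import
# The connection section should list both MGS NIDs as failover_nids and
# show which one is in use as current_connection; if only 192.52.98.30
# appears, the MGC has no failover NID to try.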

Any ideas why lnet is hanging when I try to stop it on the MDS?  This works properly on the OSS.  
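
For reference, my understanding is that the stock lnet init script stops the stack roughly like this (a sketch; I haven't verified that our local copy matches):

mds0# lustre_rmmod ptlrpc        # unload the Lustre modules down through ptlrpc
mds0# lctl network unconfigure   # tear down the LNet networks (the step that hangs for us)
mds0# lustre_rmmod               # remove the remaining lnet/libcfs modules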

It sure seems like we either don't have something configured properly or we aren't doing the failover properly (or there is a bug in Lustre).
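
On the configuration side: my understanding is that every target has to be told about both MGS NIDs at format time for MGS failover to work, roughly like this (a sketch using our OST's ZFS dataset name; I haven't actually run the tunefs step):

oss00# mkfs.lustre --ost --backfstype=zfs --fsname=hpfs-fsl --index=0 \
         --mgsnode=192.52.98.30@tcp --mgsnode=192.52.98.31@tcp oss00-0/ost-fsl
# Or, since tunefs.lustre parameters are additive, add the second NID to
# an existing target:
oss00# tunefs.lustre --mgsnode=192.52.98.31@tcp oss00-0/ost-fsl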

The details of what was described above follow.  On the primary MDS:

mds0# cd /etc/init.d ; ./lustre stop

This returns quickly:


Jan 11 09:15:53 hpfs-fsl-mds0 kernel: Lustre: Failing over hpfs-fsl-MDT0000
Jan 11 09:15:54 hpfs-fsl-mds0 kernel: LustreError: 137-5: hpfs-fsl-MDT0000_UUID: not available for connect from 192.52.98.32 at tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
Jan 11 09:15:54 hpfs-fsl-mds0 kernel: LustreError: Skipped 1 previous similar message
Jan 11 09:15:54 hpfs-fsl-mds0 kernel: LustreError: 137-5: hpfs-fsl-MDT0000_UUID: not available for connect from 192.52.98.35 at tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
Jan 11 09:15:54 hpfs-fsl-mds0 kernel: LustreError: Skipped 1 previous similar message
Jan 11 09:15:56 hpfs-fsl-mds0 kernel: LustreError: 137-5: hpfs-fsl-MDT0000_UUID: not available for connect from 192.52.98.40 at tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
Jan 11 09:15:59 hpfs-fsl-mds0 kernel: Lustre: 21424:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484147753/real 1484147753]  req at ffff881eccfb6900 x1556149769946448/t0(0) o251->MGC192.52.98.30 at tcp@0 at lo:26/25 lens 224/224 e 0 to 1 dl 1484147759 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 09:15:59 hpfs-fsl-mds0 kernel: Lustre: server umount hpfs-fsl-MDT0000 complete

Then stop lnet:

mds0# ./lnet stop

This hangs:


Jan 11 09:16:35 hpfs-fsl-mds0 kernel: LNetError: 7065:0:(lib-move.c:1990:lnet_parse()) 192.52.98.39 at tcp, src 192.52.98.39 at tcp: Dropping PUT (error -108 looking up sender)
Jan 11 09:16:36 hpfs-fsl-mds0 kernel: LNet: Removed LNI 10.148.0.30 at o2ib
Jan 11 09:16:37 hpfs-fsl-mds0 kernel: LNet: 21555:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect
Jan 11 09:16:41 hpfs-fsl-mds0 kernel: LNet: 21555:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect
Jan 11 09:16:49 hpfs-fsl-mds0 kernel: LNet: 21555:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect
Jan 11 09:17:05 hpfs-fsl-mds0 kernel: LNet: 21555:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect
Jan 11 09:17:37 hpfs-fsl-mds0 kernel: LNet: 21555:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect
Jan 11 09:18:41 hpfs-fsl-mds0 kernel: LNet: 21555:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect
Jan 11 09:20:49 hpfs-fsl-mds0 kernel: LNet: 21555:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect
Jan 11 09:25:05 hpfs-fsl-mds0 kernel: LNet: 21555:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect



mds0 was powered down at this point.  I looked back through the logs at the last time I tried this; lnet eventually dumps a stack trace.  Here's that info from the previous attempt:



Jan  9 16:26:13 hpfs-fsl-mds0 kernel: Lustre: Failing over hpfs-fsl-MDT0000
Jan  9 16:26:19 hpfs-fsl-mds0 kernel: Lustre: 25690:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484000773/real 1484000773]  req at ffff88069d615400 x1556086544936704/t0(0) o251->MGC192.52.98.30 at tcp@0 at lo:26/25 lens 224/224 e 0 to 1 dl 1484000779 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Jan  9 16:26:19 hpfs-fsl-mds0 kernel: Lustre: 25690:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 11 previous similar messages
Jan  9 16:26:20 hpfs-fsl-mds0 kernel: Lustre: server umount hpfs-fsl-MDT0000 complete
Jan  9 16:26:39 hpfs-fsl-mds0 kernel: LNetError: 25392:0:(lib-move.c:1990:lnet_parse()) 192.52.98.40 at tcp, src 192.52.98.40 at tcp: Dropping PUT (error -108 looking up sender)
Jan  9 16:26:40 hpfs-fsl-mds0 kernel: LNet: Removed LNI 10.148.0.30 at o2ib
Jan  9 16:26:41 hpfs-fsl-mds0 kernel: LNet: 25820:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect
Jan  9 16:26:45 hpfs-fsl-mds0 kernel: LNet: 25820:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect
Jan  9 16:26:53 hpfs-fsl-mds0 kernel: LNet: 25820:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect
Jan  9 16:27:09 hpfs-fsl-mds0 kernel: LNet: 25820:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect
Jan  9 16:27:41 hpfs-fsl-mds0 kernel: LNet: 25820:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect
Jan  9 16:28:45 hpfs-fsl-mds0 kernel: LNet: 25820:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect
Jan  9 16:30:53 hpfs-fsl-mds0 kernel: LNet: 25820:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect
Jan  9 16:35:09 hpfs-fsl-mds0 kernel: LNet: 25820:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect
Jan  9 16:42:54 hpfs-fsl-mds0 kernel: INFO: task lctl:25908 blocked for more than 120 seconds.
Jan  9 16:42:54 hpfs-fsl-mds0 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan  9 16:42:54 hpfs-fsl-mds0 kernel: lctl            D ffffffffa0d0b560     0 25908  25900 0x00000084
Jan  9 16:42:54 hpfs-fsl-mds0 kernel: ffff881e9ffc7d20 0000000000000082 ffff880f77a7bec0 ffff881e9ffc7fd8
Jan  9 16:42:54 hpfs-fsl-mds0 kernel: ffff881e9ffc7fd8 ffff881e9ffc7fd8 ffff880f77a7bec0 ffffffffa0d0b558
Jan  9 16:42:54 hpfs-fsl-mds0 kernel: ffffffffa0d0b55c ffff880f77a7bec0 00000000ffffffff ffffffffa0d0b560
Jan  9 16:42:54 hpfs-fsl-mds0 kernel: Call Trace:
Jan  9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffff8168c989>] schedule_preempt_disabled+0x29/0x70
Jan  9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffff8168a5e5>] __mutex_lock_slowpath+0xc5/0x1c0
Jan  9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffff81689a4f>] mutex_lock+0x1f/0x2f
Jan  9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffffa0cccf45>] LNetNIInit+0x45/0xa10 [lnet]
Jan  9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffff811806bb>] ? unlock_page+0x2b/0x30
Jan  9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffffa0ce6372>] lnet_configure+0x52/0x80 [lnet]
Jan  9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffffa0ce64eb>] lnet_ioctl+0x14b/0x180 [lnet]
Jan  9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffffa0bf2e5c>] libcfs_ioctl+0x2ac/0x4c0 [libcfs]
Jan  9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffffa0bef427>] libcfs_psdev_ioctl+0x67/0xf0 [libcfs]
Jan  9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffff81212035>] do_vfs_ioctl+0x2d5/0x4b0
Jan  9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffff8121ccd7>] ? __fd_install+0x47/0x60
Jan  9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffff812122b1>] SyS_ioctl+0xa1/0xc0
Jan  9 16:42:54 hpfs-fsl-mds0 kernel: [<ffffffff816967c9>] system_call_fastpath+0x16/0x1b
Jan  9 16:43:41 hpfs-fsl-mds0 kernel: LNet: 25820:0:(socklnd.c:2577:ksocknal_shutdown()) waiting for 1 peers to disconnect

So with the primary MDS shut down I mounted on the secondary MDS:

mds1# cd /etc/init.d/ ; ./lustre start

Jan 11 09:29:48 hpfs-fsl-mds1 kernel: LNet: HW nodes: 2, HW CPU cores: 16, npartitions: 2
Jan 11 09:29:48 hpfs-fsl-mds1 kernel: alg: No test for adler32 (adler32-zlib)
Jan 11 09:29:48 hpfs-fsl-mds1 kernel: alg: No test for crc32 (crc32-table)
Jan 11 09:29:56 hpfs-fsl-mds1 kernel: LNet: Added LNI 192.52.98.31 at tcp [8/256/0/180]
Jan 11 09:29:56 hpfs-fsl-mds1 kernel: LNet: Using FMR for registration
Jan 11 09:29:57 hpfs-fsl-mds1 kernel: LNet: Added LNI 10.148.0.31 at o2ib [8/256/0/180]
Jan 11 09:29:57 hpfs-fsl-mds1 kernel: LNet: Accept secure, port 988
Jan 11 09:30:22 hpfs-fsl-mds1 kernel: Lustre: Lustre: Build Version: 2.9.51
Jan 11 09:30:22 hpfs-fsl-mds1 kernel: Lustre: MGS: Connection restored to d08a6361-1b98-2c42-a6c4-ec1317aa9351 (at 0 at lo)
Jan 11 09:30:23 hpfs-fsl-mds1 kernel: Lustre: hpfs-fsl-MDT0000: Imperative Recovery not enabled, recovery window 300-900
Jan 11 09:30:28 hpfs-fsl-mds1 kernel: Lustre: 10312:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484148623/real 1484148626]  req at ffff881010219e00 x1556242625462976/t0(0) o38->hpfs-fsl-MDT0000-lwp-MDT0000 at 192.52.98.30@tcp:12/10 lens 520/544 e 0 to 1 dl 1484148628 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 09:30:48 hpfs-fsl-mds1 kernel: Lustre: hpfs-fsl-MDT0000: Connection restored to d08a6361-1b98-2c42-a6c4-ec1317aa9351 (at 0 at lo)
Jan 11 09:31:01 hpfs-fsl-mds1 kernel: Lustre: hpfs-fsl-MDT0000: Connection restored to 192.52.98.32 at tcp (at 192.52.98.32 at tcp)
Jan 11 09:31:03 hpfs-fsl-mds1 kernel: Lustre: hpfs-fsl-MDT0000: Connection restored to 192.52.98.40 at tcp (at 192.52.98.40 at tcp)
Jan 11 09:31:03 hpfs-fsl-mds1 kernel: Lustre: Skipped 1 previous similar message
Jan 11 09:31:08 hpfs-fsl-mds1 kernel: Lustre: hpfs-fsl-MDT0000: Connection restored to 192.52.98.43 at tcp (at 192.52.98.43 at tcp)
Jan 11 09:31:08 hpfs-fsl-mds1 kernel: Lustre: Skipped 1 previous similar message
Jan 11 09:31:26 hpfs-fsl-mds1 kernel: Lustre: hpfs-fsl-MDT0000: Will be in recovery for at least 5:00, or until 1 client reconnects
Jan 11 09:31:26 hpfs-fsl-mds1 kernel: Lustre: MGS: Connection restored to 47b7f6ce-5d63-8eb1-59b6-4d26560019e9 (at 192.52.98.55 at tcp)
Jan 11 09:31:26 hpfs-fsl-mds1 kernel: Lustre: Skipped 1 previous similar message
Jan 11 09:31:26 hpfs-fsl-mds1 kernel: Lustre: hpfs-fsl-MDT0000: Recovery over after 0:01, of 1 clients 1 recovered and 0 were evicted.
Jan 11 09:31:50 hpfs-fsl-mds1 kernel: Lustre: hpfs-fsl-MDT0000: Connection restored to 192.52.98.42 at tcp (at 192.52.98.42 at tcp)
Jan 11 09:31:50 hpfs-fsl-mds1 kernel: Lustre: Skipped 6 previous similar messages
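
(For reference, I believe our start script boils down to a plain Lustre mount of the ZFS target, something like the following; the MDT dataset name and mountpoint are guesses on my part:)

mds1# modprobe lustre
mds1# mount -t lustre mds-0/mdt-fsl /mnt/lustre/local/hpfs-fsl-MDT0000  # dataset/mountpoint hypothetical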



And here is the OSS log while all this is happening.  As mentioned above, note that the ptlrpc_expire_one_request messages to the primary MGS persist even after the MDT/MGS is mounted on the secondary MDS.

Jan 11 09:15:54 hpfs-fsl-oss00 kernel: LustreError: 11-0: hpfs-fsl-MDT0000-lwp-OST0000: operation obd_ping to node 192.52.98.30 at tcp failed: rc = -107
Jan 11 09:15:54 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-MDT0000-lwp-OST0000: Connection to hpfs-fsl-MDT0000 (at 192.52.98.30 at tcp) was lost; in progress operations using this service will wait for recovery to complete
Jan 11 09:16:01 hpfs-fsl-oss00 kernel: Lustre: 17097:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484147754/real 1484147754]  req at ffff88101f2fbc00 x1556149818209744/t0(0) o400->MGC192.52.98.30 at tcp@192.52.98.30 at tcp:26/25 lens 224/224 e 0 to 1 dl 1484147761 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 09:16:01 hpfs-fsl-oss00 kernel: Lustre: 17097:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
Jan 11 09:16:01 hpfs-fsl-oss00 kernel: LustreError: 166-1: MGC192.52.98.30 at tcp: Connection to MGS (at 192.52.98.30 at tcp) was lost; in progress operations using this service will fail
Jan 11 09:17:27 hpfs-fsl-oss00 kernel: Lustre: 17090:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484147836/real 1484147836]  req at ffff88101f256000 x1556149818209888/t0(0) o38->hpfs-fsl-MDT0000-lwp-OST0000 at 192.52.98.31@tcp:12/10 lens 520/544 e 0 to 1 dl 1484147847 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 09:17:27 hpfs-fsl-oss00 kernel: Lustre: 17090:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
Jan 11 09:20:37 hpfs-fsl-oss00 kernel: Lustre: 17090:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484148011/real 1484148011]  req at ffff88101f486f00 x1556149818210048/t0(0) o38->hpfs-fsl-MDT0000-lwp-OST0000 at 192.52.98.31@tcp:12/10 lens 520/544 e 0 to 1 dl 1484148037 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 09:20:37 hpfs-fsl-oss00 kernel: Lustre: 17090:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 9 previous similar messages
Jan 11 09:25:41 hpfs-fsl-oss00 kernel: Lustre: 17090:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484148286/real 1484148286]  req at ffff88101f487b00 x1556149818210224/t0(0) o250->MGC192.52.98.30 at tcp@192.52.98.30 at tcp:26/25 lens 520/544 e 0 to 1 dl 1484148341 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 09:25:41 hpfs-fsl-oss00 kernel: Lustre: 17090:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 12 previous similar messages
Jan 11 09:30:23 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-OST0000: Connection restored to hpfs-fsl-MDT0000-mdtlov_UUID (at 192.52.98.31 at tcp)
Jan 11 09:30:23 hpfs-fsl-oss00 kernel: Lustre: Skipped 1 previous similar message
Jan 11 09:31:01 hpfs-fsl-oss00 kernel: LustreError: 167-0: hpfs-fsl-MDT0000-lwp-OST0000: This client was evicted by hpfs-fsl-MDT0000; in progress operations using this service will fail.
Jan 11 09:31:01 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-MDT0000-lwp-OST0000: Connection restored to 192.52.98.31 at tcp (at 192.52.98.31 at tcp)
Jan 11 09:31:26 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-OST0000: deleting orphan objects from 0x0:16904081 to 0x0:16904321
Jan 11 09:36:06 hpfs-fsl-oss00 kernel: Lustre: 17090:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484148911/real 1484148914]  req at ffff88101f35ad00 x1556149818210640/t0(0) o250->MGC192.52.98.30 at tcp@192.52.98.30 at tcp:26/25 lens 520/544 e 0 to 1 dl 1484148966 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 09:36:06 hpfs-fsl-oss00 kernel: Lustre: 17090:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 14 previous similar messages
Jan 11 09:47:21 hpfs-fsl-oss00 kernel: Lustre: 17090:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484149586/real 1484149586]  req at ffff88101f35e600 x1556149818211216/t0(0) o250->MGC192.52.98.30 at tcp@192.52.98.30 at tcp:26/25 lens 520/544 e 0 to 1 dl 1484149641 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 09:47:21 hpfs-fsl-oss00 kernel: Lustre: 17090:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
Jan 11 09:58:36 hpfs-fsl-oss00 kernel: Lustre: 17090:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484150261/real 1484150261]  req at ffff88101f2f9b00 x1556149818211792/t0(0) o250->MGC192.52.98.30 at tcp@192.52.98.30 at tcp:26/25 lens 520/544 e 0 to 1 dl 1484150316 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 09:58:36 hpfs-fsl-oss00 kernel: Lustre: 17090:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
Jan 11 10:09:51 hpfs-fsl-oss00 kernel: Lustre: 17090:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484150936/real 1484150936]  req at ffff88101f483f00 x1556149818212368/t0(0) o250->MGC192.52.98.30 at tcp@192.52.98.30 at tcp:26/25 lens 520/544 e 0 to 1 dl 1484150991 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 10:09:51 hpfs-fsl-oss00 kernel: Lustre: 17090:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 8 previous similar messages


And the current MGC state on the OSS:

[root at hpfs-fsl-oss00 ~]# date
Wed Jan 11 10:12:43 CST 2017
[root at hpfs-fsl-oss00 ~]# lctl dl
  0 UP osd-zfs hpfs-fsl-OST0000-osd hpfs-fsl-OST0000-osd_UUID 5
  1 UP mgc MGC192.52.98.30 at tcp 75fa2ba9-749d-e00f-84d3-e4e9b8753be3 5
  2 UP ost OSS OSS_uuid 3
  3 UP obdfilter hpfs-fsl-OST0000 hpfs-fsl-OST0000_UUID 7
  4 UP lwp hpfs-fsl-MDT0000-lwp-OST0000 hpfs-fsl-MDT0000-lwp-OST0000_UUID 5
[root at hpfs-fsl-oss00 ~]# ls /proc/fs/lustre/mgc/
MGC192.52.98.30 at tcp
[root at hpfs-fsl-oss00 ~]#



I was wondering if it might work better to remount an OST after the filesystem had failed over to the secondary MDS.  I tried that; everything below is while the MDT is still mounted on mds1:



[root at hpfs-fsl-oss00 ~]# date
Wed Jan 11 10:14:23 CST 2017
[root at hpfs-fsl-oss00 ~]# cd /etc/init.d/
[root at hpfs-fsl-oss00 init.d]# ./lustre stop local
Unmounting /mnt/lustre/local/hpfs-fsl-OST0000
[root at hpfs-fsl-oss00 init.d]# ./lnet stop
[root at hpfs-fsl-oss00 init.d]# ./lnet start
LNET configured
[root at hpfs-fsl-oss00 init.d]# ./lustre start local
Mounting oss00-0/ost-fsl on /mnt/lustre/local/hpfs-fsl-OST0000
[root at hpfs-fsl-oss00 init.d]# lctl dl
  0 UP osd-zfs hpfs-fsl-OST0000-osd hpfs-fsl-OST0000-osd_UUID 5
  1 UP mgc MGC192.52.98.30 at tcp 17af2f5d-ebd3-b57d-0c3d-9c7bc7654172 5
  2 UP ost OSS OSS_uuid 3
  3 UP obdfilter hpfs-fsl-OST0000 hpfs-fsl-OST0000_UUID 7
  4 UP lwp hpfs-fsl-MDT0000-lwp-OST0000 hpfs-fsl-MDT0000-lwp-OST0000_UUID 5
[root at hpfs-fsl-oss00 init.d]# ls /proc/fs/lustre/mgc/
MGC192.52.98.30 at tcp
[root at hpfs-fsl-oss00 init.d]# 



Same result.  MDS1 and OSS00 logs are below.  



Jan 11 10:14:42 hpfs-fsl-mds1 kernel: Lustre: 10323:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484151275/real 1484151275]  req at ffff882036ef7800 x1556242625566576/t0(0) o13->hpfs-fsl-OST0000-osc-MDT0000 at 192.52.98.32@tcp:7/4 lens 224/368 e 0 to 1 dl 1484151282 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Jan 11 10:14:42 hpfs-fsl-mds1 kernel: Lustre: hpfs-fsl-OST0000-osc-MDT0000: Connection to hpfs-fsl-OST0000 (at 192.52.98.32 at tcp) was lost; in progress operations using this service will wait for recovery to complete
Jan 11 10:14:48 hpfs-fsl-mds1 kernel: Lustre: 10312:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484151282/real 1484151282]  req at ffff88101a9ff800 x1556242625566928/t0(0) o8->hpfs-fsl-OST0000-osc-MDT0000 at 192.52.98.32@tcp:28/4 lens 520/544 e 0 to 1 dl 1484151288 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 10:15:43 hpfs-fsl-mds1 kernel: Lustre: 10312:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484151332/real 1484151332]  req at ffff88101a9fce00 x1556242625568768/t0(0) o8->hpfs-fsl-OST0000-osc-MDT0000 at 192.52.98.32@tcp:28/4 lens 520/544 e 0 to 1 dl 1484151343 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 10:17:12 hpfs-fsl-mds1 kernel: Lustre: hpfs-fsl-MDT0000: Connection restored to 192.52.98.32 at tcp (at 192.52.98.32 at tcp)
Jan 11 10:17:23 hpfs-fsl-mds1 kernel: Lustre: hpfs-fsl-OST0000-osc-MDT0000: Connection restored to 192.52.98.32 at tcp (at 192.52.98.32 at tcp)

Jan 11 10:14:30 hpfs-fsl-oss00 kernel: Lustre: Failing over hpfs-fsl-OST0000
Jan 11 10:14:30 hpfs-fsl-oss00 kernel: Lustre: server umount hpfs-fsl-OST0000 complete
Jan 11 10:15:18 hpfs-fsl-oss00 kernel: LNet: Removed LNI 10.148.0.32 at o2ib
Jan 11 10:15:20 hpfs-fsl-oss00 kernel: LNet: Removed LNI 192.52.98.32 at tcp
Jan 11 10:15:26 hpfs-fsl-oss00 kernel: LNet: HW nodes: 2, HW CPU cores: 16, npartitions: 2
Jan 11 10:15:26 hpfs-fsl-oss00 kernel: alg: No test for adler32 (adler32-zlib)
Jan 11 10:15:26 hpfs-fsl-oss00 kernel: alg: No test for crc32 (crc32-table)
Jan 11 10:15:34 hpfs-fsl-oss00 kernel: LNet: Added LNI 192.52.98.32 at tcp [8/256/0/180]
Jan 11 10:15:34 hpfs-fsl-oss00 kernel: LNet: Using FMR for registration
Jan 11 10:15:34 hpfs-fsl-oss00 kernel: LNet: Added LNI 10.148.0.32 at o2ib [8/256/0/180]
Jan 11 10:15:34 hpfs-fsl-oss00 kernel: LNet: Accept secure, port 988
Jan 11 10:15:41 hpfs-fsl-oss00 kernel: Lustre: Lustre: Build Version: 2.9.51
Jan 11 10:15:43 hpfs-fsl-oss00 kernel: LustreError: 137-5: hpfs-fsl-OST0000_UUID: not available for connect from 192.52.98.55 at tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
Jan 11 10:15:46 hpfs-fsl-oss00 kernel: Lustre: 20564:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484151341/real 1484151341]  req at ffff882020280000 x1556245476540432/t0(0) o250->MGC192.52.98.30 at tcp@192.52.98.30 at tcp:26/25 lens 520/544 e 0 to 1 dl 1484151346 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 10:16:16 hpfs-fsl-oss00 kernel: Lustre: 20564:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484151366/real 1484151366]  req at ffff880fff120000 x1556245476540496/t0(0) o250->MGC192.52.98.30 at tcp@192.52.98.30 at tcp:26/25 lens 520/544 e 0 to 1 dl 1484151376 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 10:16:22 hpfs-fsl-oss00 kernel: LustreError: 137-5: hpfs-fsl-OST0000_UUID: not available for connect from 192.52.98.31 at tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
Jan 11 10:16:29 hpfs-fsl-oss00 kernel: LustreError: 20547:0:(mgc_request.c:249:do_config_log_add()) MGC192.52.98.30 at tcp: failed processing log, type 4: rc = -110
Jan 11 10:16:33 hpfs-fsl-oss00 kernel: LustreError: 137-5: hpfs-fsl-OST0000_UUID: not available for connect from 192.52.98.55 at tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
Jan 11 10:16:35 hpfs-fsl-oss00 kernel: LustreError: 20629:0:(sec_config.c:1107:sptlrpc_target_local_read_conf()) missing llog context
Jan 11 10:16:46 hpfs-fsl-oss00 kernel: Lustre: 20564:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484151391/real 1484151391]  req at ffff880fff120300 x1556245476540528/t0(0) o250->MGC192.52.98.30 at tcp@192.52.98.30 at tcp:26/25 lens 520/544 e 0 to 1 dl 1484151406 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 10:16:52 hpfs-fsl-oss00 kernel: Lustre: 20564:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484151407/real 1484151407]  req at ffff880fff120600 x1556245476540576/t0(0) o38->hpfs-fsl-MDT0000-lwp-OST0000 at 192.52.98.30@tcp:12/10 lens 520/544 e 0 to 1 dl 1484151412 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 10:17:04 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-OST0000: Imperative Recovery not enabled, recovery window 300-900
Jan 11 10:17:12 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-OST0000: Will be in recovery for at least 5:00, or until 2 clients reconnect
Jan 11 10:17:12 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-OST0000: Connection restored to  (at 192.52.98.31 at tcp)
Jan 11 10:17:23 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-OST0000: Connection restored to  (at 192.52.98.55 at tcp)
Jan 11 10:17:23 hpfs-fsl-oss00 kernel: Lustre: hpfs-fsl-OST0000: Recovery over after 0:11, of 2 clients 2 recovered and 0 were evicted.
Jan 11 10:17:32 hpfs-fsl-oss00 kernel: Lustre: 20564:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484151432/real 1484151432]  req at ffff880fff090000 x1556245476540624/t0(0) o250->MGC192.52.98.30 at tcp@192.52.98.30 at tcp:26/25 lens 520/544 e 0 to 1 dl 1484151452 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 10:18:02 hpfs-fsl-oss00 kernel: Lustre: 20564:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484151457/real 1484151457]  req at ffff880fff090600 x1556245476540656/t0(0) o250->MGC192.52.98.30 at tcp@192.52.98.30 at tcp:26/25 lens 520/544 e 0 to 1 dl 1484151482 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jan 11 10:18:57 hpfs-fsl-oss00 kernel: Lustre: 20564:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1484151507/real 1484151507]  req at ffff880fff090f00 x1556245476540704/t0(0) o250->MGC192.52.98.30 at tcp@192.52.98.30 at tcp:26/25 lens 520/544 e 0 to 1 dl 1484151537 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
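
One more check worth running: I believe tunefs.lustre has a --dryrun mode that prints the parameters a target was formatted with, including its mgsnode list, without changing anything (a sketch; untested on our ZFS targets):

oss00# tunefs.lustre --dryrun oss00-0/ost-fsl
# Check the "Parameters:" line for the mgsnode entries; if only
# 192.52.98.30@tcp is listed, the OSTs have no failover MGS NID to try.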
