[Lustre-discuss] OSS in inactive status

Kalosi, Akos Akos.Kalosi at hp.com
Mon Feb 10 09:01:30 PST 2014


Hi Rick,
Thank you for your advice. We did the activation exercise last weekend, and it went without issues:

[root at sklusp01a ~]# lctl dl
  0 UP mgs MGS MGS 145
  1 UP mgc MGC10.214.127.54 at tcp 6cc1bf8e-85f9-3d93-5d32-be3d441076a7 5
  2 UP mdt MDS MDS_uuid 3
  3 UP lov l1-mdtlov l1-mdtlov_UUID 4
  4 UP mds l1-MDT0000 l1-MDT0000_UUID 131
  5 UP osc l1-OST0000-osc l1-mdtlov_UUID 5
  6 UP osc l1-OST0001-osc l1-mdtlov_UUID 5
  7 IN osc l1-OST0002-osc l1-mdtlov_UUID 5
  8 UP osc l1-OST0003-osc l1-mdtlov_UUID 5
  9 UP osc l1-OST0004-osc l1-mdtlov_UUID 5
10 UP osc l1-OST0005-osc l1-mdtlov_UUID 5

[root at sklusp01a ~]# lctl --device 7 activate

[root at sklusp01a ~]# lctl dl
  0 UP mgs MGS MGS 145
  1 UP mgc MGC10.214.127.54 at tcp 6cc1bf8e-85f9-3d93-5d32-be3d441076a7 5
  2 UP mdt MDS MDS_uuid 3
  3 UP lov l1-mdtlov l1-mdtlov_UUID 4
  4 UP mds l1-MDT0000 l1-MDT0000_UUID 131
  5 UP osc l1-OST0000-osc l1-mdtlov_UUID 5
  6 UP osc l1-OST0001-osc l1-mdtlov_UUID 5
  7 UP osc l1-OST0002-osc l1-mdtlov_UUID 5
  8 UP osc l1-OST0003-osc l1-mdtlov_UUID 5
  9 UP osc l1-OST0004-osc l1-mdtlov_UUID 5
10 UP osc l1-OST0005-osc l1-mdtlov_UUID 5

Best regards,
Akos

-----Original Message-----
From: Mohr Jr, Richard Frank (Rick Mohr) [mailto:rmohr at utk.edu] 
Sent: 3 February 2014 21:35
To: Kalosi, Akos
Cc: lustre-discuss at lists.lustre.org
Subject: Re: [Lustre-discuss] OSS in inactive status


Based on the log messages, it looks like the OSS server deleted the orphan objids, but for some reason the MDS didn't think the OSS server did that.  Maybe there was a networking issue that messed up the communication?  Just from these logs, I don't think you can tell what the cause was.

I have had something similar happen a couple of times to one of my Lustre file systems.  After startup, a few of the OSTs were listed as inactive.  I never could figure out what caused it, and in the end I just reactivated them and they worked fine.  Here are the steps I would take (but see my disclaimer below):

1) Run some "ping" and "lctl ping" checks between the MDS and OSS server to look for possible networking problems.

2) If the network appears to be fine, run "lctl activate" on the OST and check the logs for signs of possible errors.

3) If no errors are found, create a single-striped file on the reactivated OST and do some read/write tests to make sure it is working (example commands below).
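
Something along these lines should work for the three steps above.  The NID, device number, and OST index are just examples taken from your output, and /mnt/lustre is a placeholder mount point, so adjust them for your setup:

# 1) Connectivity checks from the MDS to the OSS serving l1-OST0002
ping 10.214.127.56
lctl ping 10.214.127.56@tcp

# 2) Reactivate the OSC device for the inactive OST on the MDS
lctl --device 7 activate

# 3) Create a file striped only on OST index 2, then read it back
lfs setstripe -c 1 -i 2 /mnt/lustre/ost2_test
dd if=/dev/zero of=/mnt/lustre/ost2_test bs=1M count=100
dd if=/mnt/lustre/ost2_test of=/dev/null bs=1M
lfs getstripe /mnt/lustre/ost2_test    # verify it really landed on OST index 2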

If the read/write tests are fine, then I think it is safe to assume that the OST is working.  But you will probably want to keep a close eye on things for a little while just in case some problem arises.
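
For the "keep an eye on things" part, something simple like this is usually enough (the syslog path is the typical RHEL location, so adjust if your logs go elsewhere):

lctl dl | grep " IN "                          # any devices dropping back to inactive?
grep LustreError /var/log/messages | tail -20  # recent Lustre errors on the MDS/OSS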

DISCLAIMER: This advice is based purely on my experience.  While these steps worked for me, I don't know the cause of your issue so your circumstances may be different than mine.  YMMV, caveat emptor, etc, etc.

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences http://www.nics.tennessee.edu


On Feb 1, 2014, at 3:35 AM, "Kalosi, Akos" <Akos.Kalosi at hp.com>
 wrote:

> Hi,
>  
> After a failover, one OSS got into INACTIVE status:
>  
> [root at sklusp01a ~]# lctl dl
>   0 UP mgs MGS MGS 145
>   1 UP mgc MGC10.214.127.54 at tcp 6cc1bf8e-85f9-3d93-5d32-be3d441076a7 5
>   2 UP mdt MDS MDS_uuid 3
>   3 UP lov l1-mdtlov l1-mdtlov_UUID 4
>   4 UP mds l1-MDT0000 l1-MDT0000_UUID 131
>   5 UP osc l1-OST0000-osc l1-mdtlov_UUID 5
>   6 UP osc l1-OST0001-osc l1-mdtlov_UUID 5
>   7 IN osc l1-OST0002-osc l1-mdtlov_UUID 5
>   8 UP osc l1-OST0003-osc l1-mdtlov_UUID 5
>   9 UP osc l1-OST0004-osc l1-mdtlov_UUID 5
> 10 UP osc l1-OST0005-osc l1-mdtlov_UUID 5
> [root at sklusp01a ~]#
>  
> The interesting part of the MDS log:
>  
> Dec 27 22:58:45 sklusp01a kernel: Lustre: 1512:0:(mds_unlink_open.c:287:mds_cleanup_pending()) l1-MDT0000: orphan 44b58f0:ec7d7d52 re-opened during recovery
> Dec 27 22:58:45 sklusp01a kernel: Lustre: 1512:0:(quota_master.c:1722:mds_quota_recovery()) Only 2/6 OSTs are active, abort quota recovery
> Dec 27 22:58:45 sklusp01a kernel: Lustre: l1-MDT0000: Recovery period over after 10:00, of 64 clients 63 recovered and 1 was evicted.
> Dec 27 22:58:45 sklusp01a kernel: Lustre: l1-MDT0000: sending delayed replies to recovered clients
> Dec 27 22:58:45 sklusp01a kernel: LustreError: 1578:0:(mds_open.c:1645:mds_close()) @@@ no handle for file close ino 72927933: cookie 0xf5ee63c8f533fd1c  req at ffff810c584cbc00 x1438686218045461/t0 o35->46a7866a-0713-ce2d-f66d-44fb9b42fef8 at NET_0x200000ad67f58_UUID:0/0 lens 408/752 e 0 to 0 dl 1388181531 ref 1 fl Interpret:/0/0 rc 0/0
> Dec 27 22:58:45 sklusp01a kernel: LustreError: 1578:0:(mds_open.c:1645:mds_close()) Skipped 3 previous similar messages
> Dec 27 22:58:45 sklusp01a kernel: LustreError: 1578:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-116)  req at ffff810c584cbc00 x1438686218045461/t0 o35->46a7866a-0713-ce2d-f66d-44fb9b42fef8 at NET_0x200000ad67f58_UUID:0/0 lens 408/560 e 0 to 0 dl 1388181531 ref 1 fl Interpret:/0/0 rc -116/0
> Dec 27 22:58:45 sklusp01a kernel: Lustre: MDS l1-MDT0000: l1-OST0004_UUID now active, resetting orphans
> Dec 27 22:58:45 sklusp01a kernel: Lustre: Skipped 1 previous similar message
> Dec 27 22:59:22 sklusp01a kernel: Lustre: 25465:0:(quota_master.c:1722:mds_quota_recovery()) Only 2/6 OSTs are active, abort quota recovery
> Dec 27 22:59:22 sklusp01a kernel: Lustre: 25465:0:(quota_master.c:1722:mds_quota_recovery()) Skipped 6 previous similar messages
> Dec 27 22:59:22 sklusp01a kernel: Lustre: l1-OST0002-osc: Connection restored to service l1-OST0002 using nid 10.214.127.56 at tcp.
> Dec 27 22:59:22 sklusp01a kernel: Lustre: MDS l1-MDT0000: l1-OST0002_UUID now active, resetting orphans
> Dec 27 22:59:22 sklusp01a kernel: Lustre: Skipped 3 previous similar messages
> Dec 27 22:59:22 sklusp01a kernel: LustreError: 3138:0:(lov_obd.c:1150:lov_clear_orphans()) error in orphan recovery on OST idx 2/6: rc = -16
> Dec 27 22:59:22 sklusp01a kernel: LustreError: 3138:0:(mds_lov.c:1057:__mds_lov_synchronize()) l1-OST0002_UUID failed at mds_lov_clear_orphans: -16
> Dec 27 22:59:22 sklusp01a kernel: LustreError: 3138:0:(mds_lov.c:1066:__mds_lov_synchronize()) l1-OST0002_UUID sync failed -16, deactivating
> Dec 27 22:59:22 sklusp01a kernel: LustreError: 3038:0:(llog_server.c:466:llog_origin_handle_cancel()) Cancel 61 of 122 llog-records failed: -2
> Dec 27 23:01:39 sklusp01a kernel: LustreError: 3002:0:(handler.c:1513:mds_handle()) operation 101 on unconnected MDS from 12345-10.214.127.216 at tcp
> Dec 27 23:01:39 sklusp01a kernel: LustreError: 3002:0:(handler.c:1513:mds_handle()) Skipped 8 previous similar messages
> Dec 27 23:01:39 sklusp01a kernel: LustreError: 3002:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-107)  req at ffff810d0162e800 x1438688843764464/t0 o101-><?>@<?>:0/0 lens 512/0 e 0 to 0 dl 1388181741 ref 1 fl Interpret:/4/0 rc -107/0
>  
> The corresponding part of the OSS log:
>  
> Dec 27 22:59:22 sklusp03a kernel: Lustre: l1-OST0002: Recovery period over after 10:04, of 65 clients 64 recovered and 1 was evicted.
> Dec 27 22:59:22 sklusp03a kernel: Lustre: l1-OST0002: sending delayed replies to recovered clients
> Dec 27 22:59:22 sklusp03a kernel: Lustre: l1-OST0002: received MDS connection from 10.214.127.54 at tcp
> Dec 27 22:59:22 sklusp03a kernel: Lustre: 15495:0:(filter.c:3127:filter_destroy_precreated()) l1-OST0002: deleting orphan objects from 53849605 to 53849665, orphan objids won't be reused any more.
>  
> How to recover from this situation?
> Is it safe to activate the OSS with lctl activate?
> Is it possible to tell why the OSS got into DEACTIVATED status?
>  
> Thanks for any hint.
> Best regards,
> Akos
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss






