[Lustre-discuss] OSTs inactive on one client (only)

Colin Faber colin_faber at xyratex.com
Mon Apr 29 17:05:09 PDT 2013


Hi Patrick,

Verify interconnect health from the affected client to the OSSs hosting
those OSTs.
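
To narrow down which servers to check first, here is a minimal sketch in
Python that groups the inactive OSTs by the OSS serving them. It is only an
illustration: the OSS-to-OST mapping and the sample `lfs osts` output are
copied from the report quoted below, while the names OSS_TO_OSTS, SAMPLE,
parse_lfs_osts and inactive_by_oss are made up for this example and are not
part of any Lustre tool.

```python
import re

# OSS -> OST index mapping, copied from the original report (an assumption
# that it still reflects the running configuration).
OSS_TO_OSTS = {
    "OSS #1": [0, 1, 2],
    "OSS #2": [3, 4, 5],
    "OSS #3": [6],
}

# Sample `lfs osts` output from the affected client, as quoted in the report.
SAMPLE = """\
OBDS::
0: lustre-OST0000_UUID ACTIVE
1: lustre-OST0001_UUID ACTIVE
2: lustre-OST0002_UUID INACTIVE
3: lustre-OST0003_UUID INACTIVE
4: lustre-OST0004_UUID INACTIVE
5: lustre-OST0005_UUID ACTIVE
6: lustre-OST0006_UUID ACTIVE
"""

# Matches lines of the form "<index>: <uuid> <state>".
LINE_RE = re.compile(r"^\s*(\d+):\s+(\S+)\s+(ACTIVE|INACTIVE)\s*$")


def parse_lfs_osts(text):
    """Return {ost_index: state} parsed from `lfs osts` style output."""
    states = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            states[int(match.group(1))] = match.group(3)
    return states


def inactive_by_oss(states, oss_to_osts):
    """Group inactive OST indices by the OSS that serves them."""
    grouped = {}
    for oss, osts in oss_to_osts.items():
        inactive = [i for i in osts if states.get(i) == "INACTIVE"]
        if inactive:
            grouped[oss] = inactive
    return grouped


if __name__ == "__main__":
    states = parse_lfs_osts(SAMPLE)
    for oss, inactive in inactive_by_oss(states, OSS_TO_OSTS).items():
        print(f"{oss}: inactive OSTs {inactive}")
```

With the sample data this points at OSS #1 (OST 2) and OSS #2 (OSTs 3 and 4),
so those are the two servers whose links from the client are worth testing.
An LNet-level check such as `lctl ping` against each server's NID (for
example the 192.168.100.103@o2ib address that appears in the log for OST0003)
may show more than the IB-level tools do.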

-cf



On Mon, Apr 29, 2013 at 5:28 PM, Patrick Shopbell <pls at astro.caltech.edu> wrote:

>
>
> Hi everyone,
> I have seen this question here before, but without a very
> satisfactory answer. One of our half a dozen clients has
> lost access to a set of OSTs:
>
>  > lfs osts
> OBDS::
> 0: lustre-OST0000_UUID ACTIVE
> 1: lustre-OST0001_UUID ACTIVE
> 2: lustre-OST0002_UUID INACTIVE
> 3: lustre-OST0003_UUID INACTIVE
> 4: lustre-OST0004_UUID INACTIVE
> 5: lustre-OST0005_UUID ACTIVE
> 6: lustre-OST0006_UUID ACTIVE
>
> All OSTs show as completely fine on the other clients, and
> the system is working there. In addition, I have run numerous
> checks of the IB network (ibhosts, ibping, etc.), and I do not
> see any networking issues.
>
> Moreover, the OSSs include:
>
>      OSS #1  -->   OST #0, #1, #2
>      OSS #2  -->   OST #3, #4, #5
>      OSS #3  -->   OST #6
>
> So, the machine is seeing two of three OSTs on OSS #1 and one
> of three OSTs on OSS #2. It is showing some OSTs on an OSS as
> active and others as inactive. So this does not seem to be a
> networking issue.
>
> I am getting a set of errors on that client periodically:
>
> Apr 29 16:21:18 abacus kernel: LustreError:
> 28707:0:(import.c:324:ptlrpc_invalidate_import()) lustre-OST0003_UUID:
> rc = -110 waiting for callback (3 != 0)
> Apr 29 16:21:18 abacus kernel: LustreError:
> 28707:0:(import.c:324:ptlrpc_invalidate_import()) Skipped 18 previous
> similar messages
> Apr 29 16:21:18 abacus kernel: LustreError:
> 28707:0:(import.c:350:ptlrpc_invalidate_import()) @@@ still on sending
> list  req@ffff8803b45c6c00 x1430098383471272/t0(0)
> o101->lustre-OST0003-osc-ffff880331f33400@192.168.100.103@o2ib:28/4 lens
> 328/352 e 0 to 0 dl 1367194410 ref 1 fl Interpret:RE/0/0 rc -5/0
> Apr 29 16:21:18 abacus kernel: LustreError:
> 28707:0:(import.c:350:ptlrpc_invalidate_import()) Skipped 61 previous
> similar messages
> Apr 29 16:21:18 abacus kernel: LustreError:
> 28707:0:(import.c:366:ptlrpc_invalidate_import()) lustre-OST0003_UUID:
> RPCs in "Unregistering" phase found (0). Network is sluggish? Waiting
> them to error out.
> Apr 29 16:21:18 abacus kernel: LustreError:
> 28707:0:(import.c:366:ptlrpc_invalidate_import()) Skipped 18 previous
> similar messages
>
> I seem to recall some talk of what happens when a client or
> two does a lot of I/O and sort of takes over. Indeed, a couple
> of the other clients are very busily using Lustre. But still,
> I would have hoped that this client (abacus) would have regained
> its connections after a few hours.
>
> Any ideas as to what I can do, short of rebooting the client?
> I am nervous about that leaving incomplete I/O.
>
> Thanks,
> Patrick Shopbell
> pls at astro.caltech.edu
>
>
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>