[Lustre-discuss] OSTs inactive on one client (only)

Patrick Shopbell pls at astro.caltech.edu
Mon Apr 29 16:28:23 PDT 2013



Hi everyone,
I have seen this question here before, but without a very
satisfactory answer. One of our half a dozen clients has
lost access to a set of OSTs:

 > lfs osts
OBDS::
0: lustre-OST0000_UUID ACTIVE
1: lustre-OST0001_UUID ACTIVE
2: lustre-OST0002_UUID INACTIVE
3: lustre-OST0003_UUID INACTIVE
4: lustre-OST0004_UUID INACTIVE
5: lustre-OST0005_UUID ACTIVE
6: lustre-OST0006_UUID ACTIVE

All OSTs show as completely fine on the other clients, and
the system is working there. In addition, I have run numerous
checks of the IB network (ibhosts, ibping, etc.), and I do not
see any networking issues.

Moreover, the OSTs are distributed across the OSSs as follows:

     OSS #1  -->   OST #0, #1, #2
     OSS #2  -->   OST #3, #4, #5
     OSS #3  -->   OST #6

So, the machine is seeing two of three OSTs on OSS #1 and one
of three OSTs on OSS #2. Since it shows some OSTs on the same
OSS as active and others as inactive, this does not seem to be
a networking issue.
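
To make that pattern explicit, here is a minimal Python sketch
that groups the "lfs osts" states shown above by OSS. The names
parse_osts, summarize, and OSS_LAYOUT are hypothetical, and the
layout table is simply copied from the list above rather than
queried from Lustre.

```python
import re

# Output of "lfs osts" on the affected client, copied from the post.
LFS_OSTS_OUTPUT = """\
OBDS::
0: lustre-OST0000_UUID ACTIVE
1: lustre-OST0001_UUID ACTIVE
2: lustre-OST0002_UUID INACTIVE
3: lustre-OST0003_UUID INACTIVE
4: lustre-OST0004_UUID INACTIVE
5: lustre-OST0005_UUID ACTIVE
6: lustre-OST0006_UUID ACTIVE
"""

# OSS number -> OST indices, as listed in the post (an assumption
# of this sketch; it is not read from the file system).
OSS_LAYOUT = {1: [0, 1, 2], 2: [3, 4, 5], 3: [6]}


def parse_osts(text):
    """Return {ost_index: state} for each 'N: name STATE' line."""
    states = {}
    for line in text.splitlines():
        m = re.match(r"\s*(\d+):\s+(\S+)\s+(ACTIVE|INACTIVE)\s*$", line)
        if m:
            states[int(m.group(1))] = m.group(3)
    return states


def summarize(states, layout):
    """Split each OSS's OSTs into active and inactive lists."""
    summary = {}
    for oss, osts in layout.items():
        summary[oss] = {
            "active": [i for i in osts if states.get(i) == "ACTIVE"],
            "inactive": [i for i in osts if states.get(i) == "INACTIVE"],
        }
    return summary


if __name__ == "__main__":
    for oss, s in summarize(parse_osts(LFS_OSTS_OUTPUT), OSS_LAYOUT).items():
        print(f"OSS #{oss}: active={s['active']} inactive={s['inactive']}")
```

An OSS that appears with both lists non-empty is reachable for
some of its OSTs but not others, which is the case for OSS #1
and OSS #2 here.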

I am getting a set of errors on that client periodically:

Apr 29 16:21:18 abacus kernel: LustreError: 28707:0:(import.c:324:ptlrpc_invalidate_import()) lustre-OST0003_UUID: rc = -110 waiting for callback (3 != 0)
Apr 29 16:21:18 abacus kernel: LustreError: 28707:0:(import.c:324:ptlrpc_invalidate_import()) Skipped 18 previous similar messages
Apr 29 16:21:18 abacus kernel: LustreError: 28707:0:(import.c:350:ptlrpc_invalidate_import()) @@@ still on sending list  req@ffff8803b45c6c00 x1430098383471272/t0(0) o101->lustre-OST0003-osc-ffff880331f33400@192.168.100.103@o2ib:28/4 lens 328/352 e 0 to 0 dl 1367194410 ref 1 fl Interpret:RE/0/0 rc -5/0
Apr 29 16:21:18 abacus kernel: LustreError: 28707:0:(import.c:350:ptlrpc_invalidate_import()) Skipped 61 previous similar messages
Apr 29 16:21:18 abacus kernel: LustreError: 28707:0:(import.c:366:ptlrpc_invalidate_import()) lustre-OST0003_UUID: RPCs in "Unregistering" phase found (0). Network is sluggish? Waiting them to error out.
Apr 29 16:21:18 abacus kernel: LustreError: 28707:0:(import.c:366:ptlrpc_invalidate_import()) Skipped 18 previous similar messages
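
For reading the return codes in these messages, a small Python
sketch follows. It assumes the codes are negated Linux errno
values (the usual kernel convention), in which case -110 is
ETIMEDOUT and -5 is EIO. The names ERRNO_NAMES and extract_rcs
are hypothetical, and the log excerpt is abbreviated from the
messages above.

```python
import re

# Linux errno values for the two codes seen in the log (assumes
# the client is a Linux host using negated errno return codes).
ERRNO_NAMES = {
    5: "EIO (I/O error)",
    110: "ETIMEDOUT (connection timed out)",
}

# Abbreviated excerpt of the client log lines quoted in the post.
LOG_EXCERPT = """\
(import.c:324:ptlrpc_invalidate_import()) lustre-OST0003_UUID: rc = -110 waiting for callback (3 != 0)
(import.c:350:ptlrpc_invalidate_import()) @@@ still on sending list ref 1 fl Interpret:RE/0/0 rc -5/0
"""


def extract_rcs(text):
    """Return the sorted, de-duplicated errno numbers found after 'rc'."""
    return sorted({int(n) for n in re.findall(r"\brc\s*=?\s*-(\d+)", text)})


if __name__ == "__main__":
    for code in extract_rcs(LOG_EXCERPT):
        print(f"rc = -{code}: {ERRNO_NAMES.get(code, 'unknown')}")
```

Read this way, the client is timing out while waiting for
outstanding requests to the affected OSTs to complete.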

I seem to recall some discussion of what happens when one or
two clients do a lot of I/O and effectively take over. Indeed,
a couple of the other clients are using Lustre very heavily.
Still, I would have hoped that this client (abacus) would have
regained its connections after a few hours.

Any ideas as to what I can do, short of rebooting the client?
I am nervous that a reboot would leave I/O incomplete.

Thanks,
Patrick Shopbell
pls at astro.caltech.edu





