[Lustre-discuss] o2ib possible network problems

Ms. Megan Larko dobsonunit at gmail.com
Mon Sep 22 08:34:15 PDT 2008


In my continuing quest (and self-education) on lustre networking (lctl
ping, and obd_ping, in particular....):

My MGS/MDS box is losing the connection to one and only one particular
OSS and then restoring in all within the same wall-clock second:
MGDS/MDS  /var/log/messages:
Sep 22 11:04:58 mds1 kernel: LustreError: Skipped 9 previous similar messages
Sep 22 11:04:58 mds1 kernel: Lustre: crew8-OST0003-osc: Connection to
service crew8-OST0003 via nid 172.18.0.15 at o2ib was lost; in progress
operations using this service will wait for recovery to complete.
Sep 22 11:04:58 mds1 kernel: Lustre: Skipped 9 previous similar messages
Sep 22 11:04:58 mds1 kernel: LustreError: 167-0: This client was
evicted by crew8-OST0003; in progress operations using this service
will fail.
Sep 22 11:04:58 mds1 kernel: LustreError: Skipped 9 previous similar messages
Sep 22 11:04:58 mds1 kernel: Lustre:
931:0:(quota_master.c:1100:mds_quota_recovery()) Not all osts are
active, abort quota recovery
Sep 22 11:04:58 mds1 kernel: Lustre: crew8-OST0003-osc: Connection
restored to service crew8-OST0003 using nid 172.18.0.15 at o2ib.
Sep 22 11:04:58 mds1 kernel: Lustre: Skipped 9 previous similar messages
Sep 22 11:04:59 mds1 kernel: Lustre: MDS crew8-MDT0000:
crew8-OST0003_UUID now active, resetting orphans

My corresponding problem OSS has a <i>processing error</i> ??? and
then resets its own connection:
OSS4 /var/log/messages:
Sep 22 11:00:16 oss4 kernel: LustreError:
4261:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error
(-107)  req at ffff81036b10ac00 x1788392/t0 o400-><?>@<?>:-1 lens 128/0
ref 0 fl Interpret:/0/0 rc -107/0
Sep 22 11:00:16 oss4 kernel: LustreError:
4261:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 12 previous
similar messages
Sep 22 11:04:59 oss4 kernel: Lustre: crew8-OST0003: received MDS
connection from 172.18.0.10 at o2ib
Sep 22 11:04:59 oss4 kernel: Lustre: Skipped 9 previous similar messages
Sep 22 11:07:20 oss4 kernel: Lustre: crew8-OST0001: haven't heard from
client crew8-mdtlov_UUID (at 172.18.0.10 at o2ib) in 391 seconds. I think
it's dead, and I am evicting it.

My client box here has the same connection error but minutes
later(!!).   Odd.   The boxes all use ntpd and sync from a common time
server here.  But the notable thing is that the obd_ping, lost
connection, eviction and then restoration all occurr with a wall-clock
minute of one another.
crew01 /var/log/messages:
Sep 22 11:16:56 cn2 kernel: LustreError: 11-0: an error occurred while
communicating with 172.18.0.15 at o2ib. The obd_ping operation failed
with -107
Sep 22 11:16:56 cn2 kernel: LustreError: 11-0: an error occurred while
communicating with 172.18.0.15 at o2ib. The obd_ping operation failed
with -107
Sep 22 11:16:56 cn2 kernel: LustreError: Skipped 4 previous similar messages
Sep 22 11:16:56 cn2 kernel: LustreError: Skipped 4 previous similar messages
Sep 22 11:16:56 cn2 kernel: Lustre:
crew8-OST0000-osc-ffff81083ea5c400: Connection to service
crew8-OST0000 via nid 172.18.0.15 at o2ib was lost; in progress
operations using this service will wait for recovery to complete.
Sep 22 11:16:56 cn2 kernel: Lustre:
crew8-OST0000-osc-ffff81083ea5c400: Connection to service
crew8-OST0000 via nid 172.18.0.15 at o2ib was lost; in progress
operations using this service will wait for recovery to complete.
Sep 22 11:16:56 cn2 kernel: Lustre: Skipped 4 previous similar messages
Sep 22 11:16:56 cn2 kernel: Lustre: Skipped 4 previous similar messages
Sep 22 11:16:56 cn2 kernel: LustreError: 167-0: This client was
evicted by crew8-OST0000; in progress operations using this service
will fail.
Sep 22 11:16:56 cn2 kernel: LustreError: 167-0: This client was
evicted by crew8-OST0000; in progress operations using this service
will fail.
Sep 22 11:16:56 cn2 kernel: LustreError: Skipped 4 previous similar messages
Sep 22 11:16:56 cn2 kernel: LustreError: Skipped 4 previous similar messages
Sep 22 11:16:56 cn2 kernel: Lustre:
crew8-OST0000-osc-ffff81083ea5c400: Connection restored to service
crew8-OST0000 using nid 172.18.0.15 at o2ib.
Sep 22 11:16:56 cn2 kernel: Lustre:
crew8-OST0000-osc-ffff81083ea5c400: Connection restored to service
crew8-OST0000 using nid 172.18.0.15 at o2ib.

I have swapped IB network cables.   The linux (CentOS 5 on all
systems) ping has no dropped packets between any of the systems on the
o2ib network.  All lctl pings return normally.   All systems are
running the same OS code---
[root at oss4 ~]# uname -a
Linux oss4.crew.local 2.6.18-53.1.13.el5_lustre.1.6.4.3smp #1 SMP Sun
Feb 17 08:38:44 EST 2008 x86_64 x86_64 x86_64 GNU/Linux

What is this "LustreError:
4261:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error
(-107)" error  on my OSS?  As the end-users are not noticing anything
and all of the activity on this one OSS is "no
communication--evicted--restored" inside of a minute, should I do
anything other than clean my becoming-voluminous logfiles more
frequently?

megan



More information about the lustre-discuss mailing list