[Lustre-discuss] o2ib possible network problems
Ms. Megan Larko
dobsonunit at gmail.com
Mon Sep 22 08:34:15 PDT 2008
In my continuing quest (and self-education) on Lustre networking (lctl
ping and obd_ping, in particular):
My MGS/MDS box is losing the connection to one and only one particular
OSS and then restoring it, all within the same wall-clock second:
MGS/MDS /var/log/messages:
Sep 22 11:04:58 mds1 kernel: LustreError: Skipped 9 previous similar messages
Sep 22 11:04:58 mds1 kernel: Lustre: crew8-OST0003-osc: Connection to
service crew8-OST0003 via nid 172.18.0.15@o2ib was lost; in progress
operations using this service will wait for recovery to complete.
Sep 22 11:04:58 mds1 kernel: Lustre: Skipped 9 previous similar messages
Sep 22 11:04:58 mds1 kernel: LustreError: 167-0: This client was
evicted by crew8-OST0003; in progress operations using this service
will fail.
Sep 22 11:04:58 mds1 kernel: LustreError: Skipped 9 previous similar messages
Sep 22 11:04:58 mds1 kernel: Lustre:
931:0:(quota_master.c:1100:mds_quota_recovery()) Not all osts are
active, abort quota recovery
Sep 22 11:04:58 mds1 kernel: Lustre: crew8-OST0003-osc: Connection
restored to service crew8-OST0003 using nid 172.18.0.15@o2ib.
Sep 22 11:04:58 mds1 kernel: Lustre: Skipped 9 previous similar messages
Sep 22 11:04:59 mds1 kernel: Lustre: MDS crew8-MDT0000:
crew8-OST0003_UUID now active, resetting orphans
My corresponding problem OSS has a "processing error" (???) and
then resets its own connection:
OSS4 /var/log/messages:
Sep 22 11:00:16 oss4 kernel: LustreError:
4261:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error
(-107) req@ffff81036b10ac00 x1788392/t0 o400-><?>@<?>:-1 lens 128/0
ref 0 fl Interpret:/0/0 rc -107/0
Sep 22 11:00:16 oss4 kernel: LustreError:
4261:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 12 previous
similar messages
Sep 22 11:04:59 oss4 kernel: Lustre: crew8-OST0003: received MDS
connection from 172.18.0.10@o2ib
Sep 22 11:04:59 oss4 kernel: Lustre: Skipped 9 previous similar messages
Sep 22 11:07:20 oss4 kernel: Lustre: crew8-OST0001: haven't heard from
client crew8-mdtlov_UUID (at 172.18.0.10@o2ib) in 391 seconds. I think
it's dead, and I am evicting it.
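The "391 seconds" in the last message can be turned back into the moment the OSS last heard from the MDS. A minimal sketch, assuming the syslog timestamps are in 2008 (syslog omits the year) and an English locale for the month abbreviation; the helper name last_heard is hypothetical:

```python
from datetime import datetime, timedelta


def last_heard(eviction_stamp: str, silent_seconds: int, year: int = 2008) -> datetime:
    """Return the time a peer was last heard from, given the syslog stamp of
    the eviction message and the silence interval that message reports."""
    # syslog stamps carry no year, so one is supplied (assumed: 2008).
    evicted_at = datetime.strptime(f"{year} {eviction_stamp}", "%Y %b %d %H:%M:%S")
    return evicted_at - timedelta(seconds=silent_seconds)


# Values copied from the OSS4 eviction message above.
print(last_heard("Sep 22 11:07:20", 391).time())  # 11:00:49
```

That places the last contact roughly half a minute after the 11:00:16 processing error on the same OSS.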
My client box here has the same connection error, but minutes
later(!!). Odd. The boxes all use ntpd and sync from a common time
server here. But the notable thing is that the obd_ping failure, lost
connection, eviction and then restoration all occur within a wall-clock
minute of one another.
crew01 /var/log/messages:
Sep 22 11:16:56 cn2 kernel: LustreError: 11-0: an error occurred while
communicating with 172.18.0.15@o2ib. The obd_ping operation failed
with -107
Sep 22 11:16:56 cn2 kernel: LustreError: 11-0: an error occurred while
communicating with 172.18.0.15@o2ib. The obd_ping operation failed
with -107
Sep 22 11:16:56 cn2 kernel: LustreError: Skipped 4 previous similar messages
Sep 22 11:16:56 cn2 kernel: LustreError: Skipped 4 previous similar messages
Sep 22 11:16:56 cn2 kernel: Lustre:
crew8-OST0000-osc-ffff81083ea5c400: Connection to service
crew8-OST0000 via nid 172.18.0.15@o2ib was lost; in progress
operations using this service will wait for recovery to complete.
Sep 22 11:16:56 cn2 kernel: Lustre:
crew8-OST0000-osc-ffff81083ea5c400: Connection to service
crew8-OST0000 via nid 172.18.0.15@o2ib was lost; in progress
operations using this service will wait for recovery to complete.
Sep 22 11:16:56 cn2 kernel: Lustre: Skipped 4 previous similar messages
Sep 22 11:16:56 cn2 kernel: Lustre: Skipped 4 previous similar messages
Sep 22 11:16:56 cn2 kernel: LustreError: 167-0: This client was
evicted by crew8-OST0000; in progress operations using this service
will fail.
Sep 22 11:16:56 cn2 kernel: LustreError: 167-0: This client was
evicted by crew8-OST0000; in progress operations using this service
will fail.
Sep 22 11:16:56 cn2 kernel: LustreError: Skipped 4 previous similar messages
Sep 22 11:16:56 cn2 kernel: LustreError: Skipped 4 previous similar messages
Sep 22 11:16:56 cn2 kernel: Lustre:
crew8-OST0000-osc-ffff81083ea5c400: Connection restored to service
crew8-OST0000 using nid 172.18.0.15@o2ib.
Sep 22 11:16:56 cn2 kernel: Lustre:
crew8-OST0000-osc-ffff81083ea5c400: Connection restored to service
crew8-OST0000 using nid 172.18.0.15@o2ib.
I have swapped IB network cables. The Linux (CentOS 5 on all
systems) ping has no dropped packets between any of the systems on the
o2ib network. All lctl pings return normally. All systems are
running the same OS code:
[root@oss4 ~]# uname -a
Linux oss4.crew.local 2.6.18-53.1.13.el5_lustre.1.6.4.3smp #1 SMP Sun
Feb 17 08:38:44 EST 2008 x86_64 x86_64 x86_64 GNU/Linux
What is this "LustreError:
4261:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error
(-107)" error on my OSS? As the end-users are not noticing anything
and all of the activity on this one OSS is "no
communication--evicted--restored" inside of a minute, should I do
anything other than clean my becoming-voluminous logfiles more
frequently?
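On the "(-107)" itself: the return codes in these messages appear to be negated Linux errno values, in which case 107 is ENOTCONN on Linux. A minimal sketch for decoding such a code on the affected host; the helper name describe_rc is hypothetical, and the numbering is platform-specific (other operating systems assign ENOTCONN a different number):

```python
import errno
import os


def describe_rc(rc: int) -> str:
    """Map a kernel-style return code (negated errno) to its name and text."""
    code = -rc if rc < 0 else rc
    # errno.errorcode maps numeric codes to symbolic names on this platform.
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


# On a Linux host this reports ENOTCONN for the code seen in the logs.
print(describe_rc(-107))
```

Read that way, the OSS is reporting that the request arrived on a connection it no longer considers established, which is consistent with the evict-and-reconnect sequence in the logs.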
megan