[Lustre-discuss] o2ib possible network problems

Tue Sep 23 01:51:21 PDT 2008

On Sep 22, 2008  11:34 -0400, Ms. Megan Larko wrote:
> My MGS/MDS box is losing the connection to one and only one particular
> OSS and then restoring it, all within the same wall-clock second:
> MGS/MDS  /var/log/messages:
> 
> My corresponding problem OSS has a "processing error" ??? and
> then resets its own connection:
> Sep 22 11:00:16 oss4 kernel: LustreError:
> 4261:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error
> (-107)  req@ffff81036b10ac00 x1788392/t0 o400-><?>@<?>:-1 lens 128/0
> ref 0 fl Interpret:/0/0 rc -107/0
> Sep 22 11:00:16 oss4 kernel: LustreError:
> 4261:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 12 previous
> similar messages
> Sep 22 11:04:59 oss4 kernel: Lustre: crew8-OST0003: received MDS
> connection from 172.18.0.10@o2ib
> Sep 22 11:04:59 oss4 kernel: Lustre: Skipped 9 previous similar messages
> Sep 22 11:07:20 oss4 kernel: Lustre: crew8-OST0001: haven't heard from
> client crew8-mdtlov_UUID (at 172.18.0.10@o2ib) in 391 seconds. I think
> it's dead, and I am evicting it.

This indicates that the MDS is trying to ping the OSS ("o400" is
OBD_PING) to show that it is alive, but the OSS is having trouble
replying to these requests, apparently because it thinks the MDS/client is
not connected (-107 = -ENOTCONN).
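
As a rough illustration of how to read such a line, here is a minimal Python
sketch that pulls the opcode ("o400") and the return code ("rc -107/0") out
of the message quoted above. The function name and the two small lookup
tables are only illustrative; they hold just the values mentioned in this
thread (opcode 400 = OBD_PING, Linux errno 107 = ENOTCONN) and are not part
of any Lustre tool.

```python
import re

# Only the values discussed in this thread; extend the tables as needed.
OPCODES = {400: "OBD_PING"}
LINUX_ERRNO = {107: "ENOTCONN"}  # errno numbering as on Linux


def decode_request(line):
    """Return (opcode_name, errno_name) for a LustreError request line,
    or None if the line does not contain both fields."""
    opc = re.search(r"\bo(\d+)->", line)      # e.g. "o400->"
    rc = re.search(r"\brc (-?\d+)/", line)    # e.g. "rc -107/0"
    if opc is None or rc is None:
        return None
    opcode = int(opc.group(1))
    errnum = abs(int(rc.group(1)))
    return (OPCODES.get(opcode, "opcode %d" % opcode),
            LINUX_ERRNO.get(errnum, "errno %d" % errnum))


sample = ("4261:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing "
          "error (-107)  req@ffff81036b10ac00 x1788392/t0 o400-><?>@<?>:-1 "
          "lens 128/0 ref 0 fl Interpret:/0/0 rc -107/0")
print(decode_request(sample))
```

Unknown opcodes or error numbers are reported by number rather than
guessed at, so the sketch stays honest about what it does not know.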

> My client box here has the same connection error but minutes
> later(!!).   Odd.   The boxes all use ntpd and sync from a common time
> server here.  But the notable thing is that the obd_ping, lost
> connection, eviction and then restoration all occur within a wall-clock
> minute of one another.

Lustre does not require the clocks to be synchronized, and the
difference in when the messages are printed on the client and the MDS is
likely just due to different timing in the skipping of repeated messages.
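
Timestamps taken from a single node's log can be compared directly, whatever
the state of clock synchronization between nodes. As a small worked example
using only the OSS lines quoted above, the sketch below subtracts the
"391 seconds" in the eviction message from its timestamp to estimate when
the OSS last heard from the MDS. The helper name is illustrative, and the
year is an assumption taken from the date of this thread, since syslog
lines do not carry one.

```python
from datetime import datetime, timedelta


def parse_syslog(stamp, year=2008):
    # syslog timestamps omit the year; 2008 is assumed from this thread.
    return datetime.strptime("%d %s" % (year, stamp), "%Y %b %d %H:%M:%S")


logged_error = parse_syslog("Sep 22 11:00:16")  # logged -107 reply error
eviction = parse_syslog("Sep 22 11:07:20")      # "haven't heard ... in 391 seconds"

# When the OSS last heard from the MDS, according to the eviction message.
last_heard = eviction - timedelta(seconds=391)
gap = int((last_heard - logged_error).total_seconds())

print(last_heard.strftime("%H:%M:%S"))
print(gap)
```

By this arithmetic the last contact falls at about 11:00:49, roughly half a
minute after the logged -107 error. That only restates what the log says; it
does not by itself explain why the pings stopped being accepted.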

> I have swapped IB network cables.   The linux (CentOS 5 on all
> systems) ping has no dropped packets between any of the systems on the
> o2ib network.  All lctl pings return normally.   All systems are
> running the same OS code---
> [root@oss4 ~]# uname -a
> Linux oss4.crew.local 2.6.18-53.1.13.el5_lustre.1.6.4.3smp #1 SMP Sun
> Feb 17 08:38:44 EST 2008 x86_64 x86_64 x86_64 GNU/Linux
> 
> What is this "LustreError:
> 4261:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error
> (-107)" error  on my OSS?  As the end-users are not noticing anything
> and all of the activity on this one OSS is "no
> communication--evicted--restored" inside of a minute, should I do
> anything other than clean my becoming-voluminous logfiles more
> frequently?

It definitely looks like something strange is going on.  You could
enable RPC tracing and dump a debug log (lctl dk /tmp/lustre.log), then
check whether the client (here the MDS) is sending a ping (o400), whether
the server is receiving it, and what happens while it is being processed.
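
Once a debug log has been dumped on each side, a short script can count the
ping RPCs in it. The Python sketch below is only a starting point and rests
on an assumption about the line format: that RPC trace lines contain
"Sending RPC", "Handling RPC" or "Handled RPC", followed by a field legend
ending in "opc" and then the colon-separated values, with the opcode last.
Check a few lines of your own dump before trusting it. The two sample lines
are made up for illustration (including the PID and the 10.0.0.2 address)
and are not taken from a real log.

```python
import re
from collections import Counter

# ASSUMED line format (verify against your own dump):
#   "... Sending RPC pname:cluuid:pid:xid:nid:opc <values>"
#   "... Handling RPC pname:cluuid+ref:pid:xid:nid:opc <values>"
# where <values> is colon-separated and ends with the opcode.
RPC_LINE = re.compile(r"\b(Sending|Handling|Handled) RPC \S+ (\S+)")


def count_rpcs(lines, opcode=400):
    """Count RPC trace lines per event type for one opcode (400 = OBD_PING)."""
    counts = Counter()
    for line in lines:
        m = RPC_LINE.search(line)
        if m and m.group(2).rsplit(":", 1)[-1] == str(opcode):
            counts[m.group(1)] += 1
    return counts


# Hypothetical sample lines in the assumed format.
sample_lines = [
    "Sending RPC pname:cluuid:pid:xid:nid:opc "
    "ptlrpcd:crew8-mdtlov_UUID:4100:1788392:10.0.0.2@o2ib:400",
    "Handling RPC pname:cluuid+ref:pid:xid:nid:opc "
    "ll_ost_01:crew8-mdtlov_UUID+5:4100:1788392:172.18.0.10@o2ib:400",
    "some unrelated debug line",
]
print(count_rpcs(sample_lines))
```

To run it against a real dump, pass an open file instead of the sample
list, for example count_rpcs(open("/tmp/lustre.log")). If the MDS dump shows
pings being sent but the OSS dump shows none being handled around the time
of the errors, the requests are being lost or rejected before processing.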

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.