[Lustre-discuss] o2ib possible network problems??

Ms. Megan Larko dobsonunit at gmail.com
Thu Sep 18 11:04:03 PDT 2008


Hello List,

I have finally got my new Lustre disk online, having rescued as much
as I could from the old, hardware-failed volume.

The new disk is mounted on new hardware, OSS4 (for Object Storage
Server 4; no, I am not imaginative).  The OSTs belong to the "crew8"
file system, and they mount happily on OSS4 as shown:

/dev/sdb1             6.3T  897G  5.1T  15% /srv/lustre/OST/crew8-OST0000
/dev/sdb2             6.3T  867G  5.1T  15% /srv/lustre/OST/crew8-OST0001
/dev/sdc1             6.3T  892G  5.1T  15% /srv/lustre/OST/crew8-OST0002
/dev/sdc2             6.3T 1003G  5.0T  17% /srv/lustre/OST/crew8-OST0003
/dev/sdd1             6.3T  907G  5.1T  15% /srv/lustre/OST/crew8-OST0004
/dev/sdd2             6.3T  877G  5.1T  15% /srv/lustre/OST/crew8-OST0005
/dev/sdi1             6.3T  916G  5.1T  16% /srv/lustre/OST/crew8-OST0006
/dev/sdi2             6.3T  920G  5.1T  16% /srv/lustre/OST/crew8-OST0007
/dev/sdj1             6.3T  901G  5.1T  15% /srv/lustre/OST/crew8-OST0008
/dev/sdj2             6.3T  895G  5.1T  15% /srv/lustre/OST/crew8-OST0009
/dev/sdk1             6.3T  878G  5.1T  15% /srv/lustre/OST/crew8-OST0010
/dev/sdk2             6.3T  891G  5.1T  15% /srv/lustre/OST/crew8-OST0011
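
For reference, each of these OSTs is mounted in the ordinary way, with the
device and mount point taken from the listing above, e.g.:

mount -t lustre /dev/sdb1 /srv/lustre/OST/crew8-OST0000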

The MGS/MDS (which currently serves two other Lustre volumes for us)
shows the following info:
[root at mds1 ~]# lctl dl
  0 UP mgs MGS MGS 13
  1 UP mgc MGC172.18.0.10 at o2ib 81039216-0261-c74d-3f2f-a504788ad8f8 5
  2 UP mdt MDS MDS_uuid 3
  3 UP lov crew2-mdtlov crew2-mdtlov_UUID 4
  4 UP mds crew2-MDT0000 crew2mds_UUID 7
  5 UP osc crew2-OST0000-osc crew2-mdtlov_UUID 5
  6 UP osc crew2-OST0001-osc crew2-mdtlov_UUID 5
  7 UP osc crew2-OST0002-osc crew2-mdtlov_UUID 5
  8 UP lov crew3-mdtlov crew3-mdtlov_UUID 4
  9 UP mds crew3-MDT0000 crew3mds_UUID 7
 10 UP osc crew3-OST0000-osc crew3-mdtlov_UUID 5
 11 UP osc crew3-OST0001-osc crew3-mdtlov_UUID 5
 12 UP osc crew3-OST0002-osc crew3-mdtlov_UUID 5
 13 UP lov crew8-mdtlov crew8-mdtlov_UUID 4
 14 UP mds crew8-MDT0000 crew8-MDT0000_UUID 9
 15 UP osc crew8-OST0000-osc crew8-mdtlov_UUID 5
 16 UP osc crew8-OST0001-osc crew8-mdtlov_UUID 5
 17 UP osc crew8-OST0002-osc crew8-mdtlov_UUID 5
 18 UP osc crew8-OST0003-osc crew8-mdtlov_UUID 5
 19 UP osc crew8-OST0004-osc crew8-mdtlov_UUID 5
 20 UP osc crew8-OST0005-osc crew8-mdtlov_UUID 5
 21 UP osc crew8-OST0006-osc crew8-mdtlov_UUID 5
 22 UP osc crew8-OST0007-osc crew8-mdtlov_UUID 5
 23 UP osc crew8-OST0008-osc crew8-mdtlov_UUID 5
 24 UP osc crew8-OST0009-osc crew8-mdtlov_UUID 5
 25 UP osc crew8-OST000a-osc crew8-mdtlov_UUID 5
 26 UP osc crew8-OST000b-osc crew8-mdtlov_UUID 5

(NOTE: the last two OSTs came up as crew8-OST000a and crew8-OST000b
rather than crew8-OST0010 and crew8-OST0011.  I don't know whether
that has anything at all to do with my issue.)
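
If it matters, the index each OST was actually formatted with can presumably
be double-checked on OSS4 without changing anything, e.g. for the last disk:

tunefs.lustre --dryrun /dev/sdk2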

The clients are forever losing this one crew8 volume (mounted on the
clients as /crewdat).
From /var/log/messages:
[larkoc at crew01 ~]$ tail /var/log/messages
Sep 18 13:53:10 crew01 kernel: Lustre:
crew8-OST0002-osc-ffff8101edbff400: Connection restored to service
crew8-OST0002 using nid 172.18.0.15 at o2ib.
Sep 18 13:53:10 crew01 kernel: Lustre: Skipped 4 previous similar messages
Sep 18 13:54:05 cn2 kernel: LustreError: 11-0: an error occurred while
communicating with 172.18.0.15 at o2ib. The obd_ping operation failed
with -107
Sep 18 13:54:05 cn2 kernel: LustreError: 11-0: an error occurred while
communicating with 172.18.0.15 at o2ib. The obd_ping operation failed
with -107
Sep 18 13:54:05 cn2 kernel: LustreError: Skipped 9 previous similar messages
Sep 18 13:54:05 cn2 kernel: LustreError: Skipped 9 previous similar messages
Sep 18 13:54:05 cn2 kernel: Lustre:
crew8-OST0003-osc-ffff81083ea5c400: Connection to service
crew8-OST0003 via nid 172.18.0.15 at o2ib was lost; in progress
operations using this service will wait for recovery to complete.
Sep 18 13:54:05 cn2 kernel: Lustre:
crew8-OST0003-osc-ffff81083ea5c400: Connection to service
crew8-OST0003 via nid 172.18.0.15 at o2ib was lost; in progress
operations using this service will wait for recovery to complete.
Sep 18 13:54:05 cn2 kernel: Lustre: Skipped 9 previous similar messages
Sep 18 13:54:05 cn2 kernel: Lustre: Skipped 9 previous similar messages

The MGS/MDS /var/log/messages reads:
[root at mds1 ~]# tail /var/log/messages
Sep 18 13:50:58 mds1 kernel: LustreError: Skipped 20 previous similar messages
Sep 18 13:50:58 mds1 kernel: Lustre: crew8-OST0005-osc: Connection to
service crew8-OST0005 via nid 172.18.0.15 at o2ib was lost; in progress
operations using this service will wait for recovery to complete.
Sep 18 13:50:58 mds1 kernel: Lustre: Skipped 20 previous similar messages
Sep 18 13:50:58 mds1 kernel: LustreError: 167-0: This client was
evicted by crew8-OST0005; in progress operations using this service
will fail.
Sep 18 13:50:58 mds1 kernel: LustreError: Skipped 20 previous similar messages
Sep 18 13:50:58 mds1 kernel: Lustre:
568:0:(quota_master.c:1100:mds_quota_recovery()) Not all osts are
active, abort quota recovery
Sep 18 13:50:58 mds1 kernel: Lustre: crew8-OST0005-osc: Connection
restored to service crew8-OST0005 using nid 172.18.0.15 at o2ib.
Sep 18 13:50:58 mds1 kernel: Lustre: Skipped 20 previous similar messages
Sep 18 13:50:58 mds1 kernel: Lustre: MDS crew8-MDT0000:
crew8-OST000b_UUID now active, resetting orphans
Sep 18 13:50:58 mds1 kernel: Lustre: Skipped 20 previous similar messages

The OSS4 box /var/log/messages:
[root at oss4 ~]# tail /var/log/messages
Sep 18 13:40:40 oss4 kernel: Lustre: crew8-OST0000: haven't heard from
client 794ff121-dfec-3934-338e-6b7f861f69b6 (at 172.18.1.2 at o2ib) in
195 seconds. I think it's dead, and I am evicting it.
Sep 18 13:40:40 oss4 kernel: Lustre: Skipped 25 previous similar messages
Sep 18 13:44:50 oss4 kernel: LustreError:
3954:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error
(-107)  req at ffff81042b8c9a00 x8274144/t0 o400-><?>@<?>:-1 lens 128/0
ref 0 fl Interpret:/0/0 rc -107/0
Sep 18 13:44:50 oss4 kernel: LustreError:
3954:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 23 previous
similar messages
Sep 18 13:50:58 oss4 kernel: Lustre: crew8-OST000b: received MDS
connection from 172.18.0.10 at o2ib
Sep 18 13:50:58 oss4 kernel: Lustre: Skipped 20 previous similar messages
Sep 18 13:51:00 oss4 kernel: Lustre: crew8-OST0006: haven't heard from
client crew8-mdtlov_UUID (at 172.18.0.10 at o2ib) in 251 seconds. I think
it's dead, and I am evicting it.
Sep 18 13:51:00 oss4 kernel: Lustre: Skipped 30 previous similar messages
Sep 18 13:55:08 oss4 kernel: LustreError:
3993:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error
(-107)  req at ffff81042d8dba00 x9085095/t0 o400-><?>@<?>:-1 lens 128/0
ref 0 fl Interpret:/0/0 rc -107/0
Sep 18 13:55:08 oss4 kernel: LustreError:
3993:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 35 previous
similar messages
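
For what it is worth, the raw LNET path to OSS4 can also be spot-checked from
mds1 or from a client with lctl ping against its NID, e.g.:

lctl ping 172.18.0.15@o2ib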

So, what I am seeing is that OSS4 is repeatedly losing network contact
with the MGS/MDS machine mds1.  Each time this happens, the clients with
the disk mounted as /crewdat are told that the connection was lost or the
client evicted, and that in-progress operations will wait for recovery to
complete.  Yet a check on mds1 of the crew8-MDT0000 target shows that no
recovery is occurring (and none has, AFAICT, since I mounted the disk on
the clients, post-recovery).
On mds1:
cat /proc/fs/lustre/mds/crew8-MDT0000/recovery_status
status: COMPLETE
recovery_start: 1221745009
recovery_end: 1221745185
recovered_clients: 1
unrecovered_clients: 0
last_transno: 33954534
replayed_requests: 0
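
There should be matching per-OST recovery_status files on OSS4 itself
(under /proc/fs/lustre/obdfilter/, if I have the path right), e.g.:

cat /proc/fs/lustre/obdfilter/crew8-OST0000/recovery_status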

I am guessing that I need to increase a Lustre client timeout value
for our o2ib connections so that the new disk stops generating these
messages (the /crewdat disk itself seems to be fine for user access).
The other two Lustre volumes on the system seem content.  Is my guess
correct?  If yes, which timeout value do I need to increase?
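
For example, if the relevant knob is the global obd timeout, I assume the
change would look roughly like this, on the servers and clients alike:

cat /proc/sys/lustre/timeout          # current value, in seconds
echo 300 > /proc/sys/lustre/timeout   # tentative increase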

Thank you,
megan


