[Lustre-discuss] o2ib possible network problems??
Ms. Megan Larko
dobsonunit at gmail.com
Thu Sep 18 11:04:03 PDT 2008
Hello List,
I have finally got my new Lustre disk on-line, having rescued as much
as I could from the old, hardware-failed volume.
The new disk is mounted on new hardware, OSS4 (for Object Storage
Server 4---no, I am not imaginative). The disk OSTs are called "crew8".
They mount happily on OSS4, as shown:
/dev/sdb1 6.3T 897G 5.1T 15% /srv/lustre/OST/crew8-OST0000
/dev/sdb2 6.3T 867G 5.1T 15% /srv/lustre/OST/crew8-OST0001
/dev/sdc1 6.3T 892G 5.1T 15% /srv/lustre/OST/crew8-OST0002
/dev/sdc2 6.3T 1003G 5.0T 17% /srv/lustre/OST/crew8-OST0003
/dev/sdd1 6.3T 907G 5.1T 15% /srv/lustre/OST/crew8-OST0004
/dev/sdd2 6.3T 877G 5.1T 15% /srv/lustre/OST/crew8-OST0005
/dev/sdi1 6.3T 916G 5.1T 16% /srv/lustre/OST/crew8-OST0006
/dev/sdi2 6.3T 920G 5.1T 16% /srv/lustre/OST/crew8-OST0007
/dev/sdj1 6.3T 901G 5.1T 15% /srv/lustre/OST/crew8-OST0008
/dev/sdj2 6.3T 895G 5.1T 15% /srv/lustre/OST/crew8-OST0009
/dev/sdk1 6.3T 878G 5.1T 15% /srv/lustre/OST/crew8-OST0010
/dev/sdk2 6.3T 891G 5.1T 15% /srv/lustre/OST/crew8-OST0011
The MGS/MDS (which currently serves two other Lustre volumes for us)
shows the following info:
[root@mds1 ~]# lctl dl
0 UP mgs MGS MGS 13
1 UP mgc MGC172.18.0.10@o2ib 81039216-0261-c74d-3f2f-a504788ad8f8 5
2 UP mdt MDS MDS_uuid 3
3 UP lov crew2-mdtlov crew2-mdtlov_UUID 4
4 UP mds crew2-MDT0000 crew2mds_UUID 7
5 UP osc crew2-OST0000-osc crew2-mdtlov_UUID 5
6 UP osc crew2-OST0001-osc crew2-mdtlov_UUID 5
7 UP osc crew2-OST0002-osc crew2-mdtlov_UUID 5
8 UP lov crew3-mdtlov crew3-mdtlov_UUID 4
9 UP mds crew3-MDT0000 crew3mds_UUID 7
10 UP osc crew3-OST0000-osc crew3-mdtlov_UUID 5
11 UP osc crew3-OST0001-osc crew3-mdtlov_UUID 5
12 UP osc crew3-OST0002-osc crew3-mdtlov_UUID 5
13 UP lov crew8-mdtlov crew8-mdtlov_UUID 4
14 UP mds crew8-MDT0000 crew8-MDT0000_UUID 9
15 UP osc crew8-OST0000-osc crew8-mdtlov_UUID 5
16 UP osc crew8-OST0001-osc crew8-mdtlov_UUID 5
17 UP osc crew8-OST0002-osc crew8-mdtlov_UUID 5
18 UP osc crew8-OST0003-osc crew8-mdtlov_UUID 5
19 UP osc crew8-OST0004-osc crew8-mdtlov_UUID 5
20 UP osc crew8-OST0005-osc crew8-mdtlov_UUID 5
21 UP osc crew8-OST0006-osc crew8-mdtlov_UUID 5
22 UP osc crew8-OST0007-osc crew8-mdtlov_UUID 5
23 UP osc crew8-OST0008-osc crew8-mdtlov_UUID 5
24 UP osc crew8-OST0009-osc crew8-mdtlov_UUID 5
25 UP osc crew8-OST000a-osc crew8-mdtlov_UUID 5
26 UP osc crew8-OST000b-osc crew8-mdtlov_UUID 5
(NOTE: the last two disks came in as crew8-OST000a and crew8-OST000b,
not crew8-OST0010 and crew8-OST0011 respectively. I don't know if
that has anything at all to do with my issue.)
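(One hedged guess about those names: Lustre prints OST indices in hexadecimal, so if those two targets were formatted with decimal indices 10 and 11 -- e.g. `mkfs.lustre --ost --index=10 ...` -- the names would come out exactly as observed:)

```shell
# Assuming the last two OSTs were given decimal indices 10 and 11 at
# mkfs time, the hexadecimal rendering matches what lctl dl shows:
printf 'crew8-OST%04x\n' 10 11
# crew8-OST000a
# crew8-OST000b
```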
The clients are forever losing this one crew8 volume (mounted on the
clients as /crewdat).
From /var/log/messages:
[larkoc@crew01 ~]$ tail /var/log/messages
Sep 18 13:53:10 crew01 kernel: Lustre:
crew8-OST0002-osc-ffff8101edbff400: Connection restored to service
crew8-OST0002 using nid 172.18.0.15@o2ib.
Sep 18 13:53:10 crew01 kernel: Lustre: Skipped 4 previous similar messages
Sep 18 13:54:05 cn2 kernel: LustreError: 11-0: an error occurred while
communicating with 172.18.0.15@o2ib. The obd_ping operation failed
with -107
Sep 18 13:54:05 cn2 kernel: LustreError: 11-0: an error occurred while
communicating with 172.18.0.15@o2ib. The obd_ping operation failed
with -107
Sep 18 13:54:05 cn2 kernel: LustreError: Skipped 9 previous similar messages
Sep 18 13:54:05 cn2 kernel: LustreError: Skipped 9 previous similar messages
Sep 18 13:54:05 cn2 kernel: Lustre:
crew8-OST0003-osc-ffff81083ea5c400: Connection to service
crew8-OST0003 via nid 172.18.0.15@o2ib was lost; in progress
operations using this service will wait for recovery to complete.
Sep 18 13:54:05 cn2 kernel: Lustre:
crew8-OST0003-osc-ffff81083ea5c400: Connection to service
crew8-OST0003 via nid 172.18.0.15@o2ib was lost; in progress
operations using this service will wait for recovery to complete.
Sep 18 13:54:05 cn2 kernel: Lustre: Skipped 9 previous similar messages
Sep 18 13:54:05 cn2 kernel: Lustre: Skipped 9 previous similar messages
The MGS/MDS /var/log/messages reads:
[root@mds1 ~]# tail /var/log/messages
Sep 18 13:50:58 mds1 kernel: LustreError: Skipped 20 previous similar messages
Sep 18 13:50:58 mds1 kernel: Lustre: crew8-OST0005-osc: Connection to
service crew8-OST0005 via nid 172.18.0.15@o2ib was lost; in progress
operations using this service will wait for recovery to complete.
Sep 18 13:50:58 mds1 kernel: Lustre: Skipped 20 previous similar messages
Sep 18 13:50:58 mds1 kernel: LustreError: 167-0: This client was
evicted by crew8-OST0005; in progress operations using this service
will fail.
Sep 18 13:50:58 mds1 kernel: LustreError: Skipped 20 previous similar messages
Sep 18 13:50:58 mds1 kernel: Lustre:
568:0:(quota_master.c:1100:mds_quota_recovery()) Not all osts are
active, abort quota recovery
Sep 18 13:50:58 mds1 kernel: Lustre: crew8-OST0005-osc: Connection
restored to service crew8-OST0005 using nid 172.18.0.15@o2ib.
Sep 18 13:50:58 mds1 kernel: Lustre: Skipped 20 previous similar messages
Sep 18 13:50:58 mds1 kernel: Lustre: MDS crew8-MDT0000:
crew8-OST000b_UUID now active, resetting orphans
Sep 18 13:50:58 mds1 kernel: Lustre: Skipped 20 previous similar messages
The OSS4 box /var/log/messages:
[root@oss4 ~]# tail /var/log/messages
Sep 18 13:40:40 oss4 kernel: Lustre: crew8-OST0000: haven't heard from
client 794ff121-dfec-3934-338e-6b7f861f69b6 (at 172.18.1.2@o2ib) in
195 seconds. I think it's dead, and I am evicting it.
Sep 18 13:40:40 oss4 kernel: Lustre: Skipped 25 previous similar messages
Sep 18 13:44:50 oss4 kernel: LustreError:
3954:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error
(-107) req@ffff81042b8c9a00 x8274144/t0 o400-><?>@<?>:-1 lens 128/0
ref 0 fl Interpret:/0/0 rc -107/0
Sep 18 13:44:50 oss4 kernel: LustreError:
3954:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 23 previous
similar messages
Sep 18 13:50:58 oss4 kernel: Lustre: crew8-OST000b: received MDS
connection from 172.18.0.10@o2ib
Sep 18 13:50:58 oss4 kernel: Lustre: Skipped 20 previous similar messages
Sep 18 13:51:00 oss4 kernel: Lustre: crew8-OST0006: haven't heard from
client crew8-mdtlov_UUID (at 172.18.0.10@o2ib) in 251 seconds. I think
it's dead, and I am evicting it.
Sep 18 13:51:00 oss4 kernel: Lustre: Skipped 30 previous similar messages
Sep 18 13:55:08 oss4 kernel: LustreError:
3993:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error
(-107) req@ffff81042d8dba00 x9085095/t0 o400-><?>@<?>:-1 lens 128/0
ref 0 fl Interpret:/0/0 rc -107/0
Sep 18 13:55:08 oss4 kernel: LustreError:
3993:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 35 previous
similar messages
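(Side note on the -107 in those messages: it is the kernel errno ENOTCONN, i.e. the connection was already gone when the reply was attempted. Quick way to decode any such errno:)

```shell
# -107 is -ENOTCONN on Linux; a Python one-liner decodes the number:
python3 -c 'import errno, os; print(errno.errorcode[107], "=", os.strerror(107))'
# ENOTCONN = Transport endpoint is not connected
```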
So---I am seeing that OSS4 is repeatedly losing its network contact
with the MGS/MDS machine mds1. Each time this occurs, the clients
that mount the disk as /crewdat are told that they have been evicted
and that in-progress operations will wait for recovery to complete. A
check on mds1 of the crew8-MDT0000 target shows that no recovery is
occurring (and has not, AFAICT, since I mounted the disk on the
clients, post-recovery).
On mds1:
cat /proc/fs/lustre/mds/crew8-MDT0000/recovery_status
status: COMPLETE
recovery_start: 1221745009
recovery_end: 1221745185
recovered_clients: 1
unrecovered_clients: 0
last_transno: 33954534
replayed_requests: 0
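(One way I know of to spot-check whether the o2ib path itself is flaky, independent of the Lustre services, is to LNET-ping the NIDs in both directions; the commands below are the 1.6-era lctl ones, using the NIDs from the logs above.)

```shell
# From mds1: ping OSS4's NID over LNET (exercises the o2ib path,
# not plain IP/ICMP):
lctl ping 172.18.0.15@o2ib

# From oss4: ping back toward the MGS/MDS:
lctl ping 172.18.0.10@o2ib

# On either node, confirm which NIDs are actually configured:
lctl list_nids
```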
I am guessing that I need to increase a Lustre client timeout value
for our o2ib connections so that the new disk stops generating these
messages (the /crewdat disk itself seems to be fine for user access).
The other two Lustre volumes on the system seem content. Is my
guess correct? If yes, which timeout value do I need to increase?
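(The knob I have been looking at is the global obd timeout; commands below are the 1.6-era ways I believe it is read and set, and the value 300 is only illustrative.)

```shell
# Current global Lustre RPC timeout, in seconds (any Lustre node):
cat /proc/sys/lustre/timeout

# Temporary bump on a running node:
echo 300 > /proc/sys/lustre/timeout

# Persistent, per-filesystem setting, run on the MGS:
lctl conf_param crew8.sys.timeout=300
```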
Thank you,
megan