[Lustre-discuss] opensm-3.2.2 and kernel-ib-1.3.1-2.6.18_92.1.17.el5_lustre.1.6.7.1smp
Ms. Megan Larko
dobsonunit at gmail.com
Thu Jan 21 09:22:25 PST 2010
Hi,
On my mds (CentOS 5.2 2.6.18-53.1.13.el5_lustre.1.6.4.3smp) I have
recently been seeing the following errors in /var/log/messages:
Jan 21 11:37:45 mds1 kernel: LustreError:
2844:0:(client.c:975:ptlrpc_expire_one_request()) @@@ timeout (sent at
1264091765, 100s ago) req@ffff81006e421200 x1237502/t0
o250->MGS@MGC192.168.64.210@o2ib_0:26 lens 304/328 ref 1 fl Rpc:/0/0
rc 0/-22
Jan 21 11:37:45 mds1 kernel: LustreError:
2844:0:(client.c:975:ptlrpc_expire_one_request()) Skipped 4 previous
similar messages
Jan 21 11:46:27 mds1 kernel: LustreError:
23713:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID
req@ffff810049133600 x1237825/t0 o101->MGS@MGC192.168.64.210@o2ib_0:26
lens 232/240 ref 1 fl Rpc:/0/0 rc 0/0
Jan 21 11:46:27 mds1 kernel: LustreError:
23713:0:(client.c:519:ptlrpc_import_delay_req()) Skipped 19 previous
similar messages
Jan 21 11:48:10 mds1 kernel: LustreError:
2844:0:(client.c:975:ptlrpc_expire_one_request()) @@@ timeout (sent at
1264092390, 100s ago) req@ffff810049133000 x1237827/t0
o250->MGS@MGC192.168.64.210@o2ib_0:26 lens 304/328 ref 1 fl Rpc:/0/0
rc 0/-22
Jan 21 11:48:10 mds1 kernel: LustreError:
2844:0:(client.c:975:ptlrpc_expire_one_request()) Skipped 4 previous
similar messages
Jan 21 11:57:11 mds1 kernel: LustreError:
23713:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID
req@ffff81003de5be00 x1238163/t0 o101->MGS@MGC192.168.64.210@o2ib_0:26
lens 232/240 ref 1 fl Rpc:/0/0 rc 0/0
Jan 21 11:57:11 mds1 kernel: LustreError:
23713:0:(client.c:519:ptlrpc_import_delay_req()) Skipped 19 previous
similar messages
Jan 21 11:58:35 mds1 kernel: LustreError:
2844:0:(client.c:975:ptlrpc_expire_one_request()) @@@ timeout (sent at
1264093015, 100s ago) req@ffff81007482ae00 x1238150/t0
o250->MGS@MGC192.168.64.210@o2ib_0:26 lens 304/328 ref 1 fl Rpc:/0/0
rc 0/-22
Jan 21 11:58:35 mds1 kernel: LustreError:
2844:0:(client.c:975:ptlrpc_expire_one_request()) Skipped 4 previous
similar messages
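The "sent at" values in the timeout lines above are Unix epoch seconds, so the spacing between timeouts can be checked with a little arithmetic. The following is only a sketch: the log lines are abbreviated copies of the three timeout messages, and the names LOG_LINES, SENT_RE, sent_times and gaps are made up for this example.

```python
import re

# Abbreviated copies of the three timeout lines from the mds1 log above
# (wrapped lines rejoined, trailing request details dropped).
LOG_LINES = [
    "Jan 21 11:37:45 mds1 kernel: LustreError: "
    "2844:0:(client.c:975:ptlrpc_expire_one_request()) "
    "@@@ timeout (sent at 1264091765, 100s ago)",
    "Jan 21 11:48:10 mds1 kernel: LustreError: "
    "2844:0:(client.c:975:ptlrpc_expire_one_request()) "
    "@@@ timeout (sent at 1264092390, 100s ago)",
    "Jan 21 11:58:35 mds1 kernel: LustreError: "
    "2844:0:(client.c:975:ptlrpc_expire_one_request()) "
    "@@@ timeout (sent at 1264093015, 100s ago)",
]

# Matches "timeout (sent at <epoch>, <n>s ago)".
SENT_RE = re.compile(r"timeout \(sent at (\d+), (\d+)s ago\)")


def sent_times(lines):
    """Return (epoch_sent, seconds_waited) for every timeout line."""
    found = []
    for line in lines:
        match = SENT_RE.search(line)
        if match:
            found.append((int(match.group(1)), int(match.group(2))))
    return found


def gaps(values):
    """Return the differences between consecutive values."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


records = sent_times(LOG_LINES)
sent = [epoch for epoch, _ in records]
print(gaps(sent))  # [625, 625]
```

For these three lines the requests were sent exactly 625 seconds apart, which matches the spacing of the syslog timestamps (11:37:45, 11:48:10, 11:58:35). That regularity suggests a periodic retry to the MGS rather than random failures, though the log alone does not prove it.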
I have seen "IMP_INVALID" error messages before, and replacing the
IB cable solved them. That does not seem to help this time.
There are zero errors on the OSS (CentOS 5.2
2.6.18-53.1.13.el5_lustre.1.6.4.3smp).
There are errors on the client (CentOS 5.3
2.6.18-92.1.17.el5_lustre.1.6.7.1smp). The /var/log/messages states:
Jan 21 11:54:21 crew kernel: Lustre: Request x280981444 sent from
MGC192.168.64.210@o2ib to NID 192.168.64.210@o2ib 5s ago has timed out
(limit 5s).
Jan 21 11:54:21 crew kernel: Lustre: Skipped 24 previous similar messages
Jan 21 11:56:32 crew last message repeated 2 times
Jan 21 12:04:46 crew kernel: Lustre: Request x280981807 sent from
MGC192.168.64.210@o2ib to NID 192.168.64.210@o2ib 5s ago has timed out
(limit 5s).
Jan 21 12:04:46 crew kernel: Lustre: Skipped 24 previous similar messages
The Lustre file system is behaving fine on the client, perhaps a
little slower than usual but not noticeably so. So this is not
critical at this time, but I would like to understand the issue.
I did change hardware recently so that my InfiniBand subnet manager
(opensm v. 3.2.2) is running on a new IB card. My understanding is
that ordinarily when opensm starts it looks for the appropriate GUID.
One is able to look at this via the "osmtest -v" command. An
inventory file (osmtest.dat) is supposed to be generated at start-up
or by the command "osmtest -fc". My osmtest commands fail with:
[root@crew ~]# osmtest -fc
Command Line Arguments
Done with args
Flow = Create Inventory
Jan 21 12:11:54 009612 [B78A6AE0] 0x7f -> Setting log level to: 0x03
Jan 21 12:11:54 009741 [B78A6AE0] 0x02 -> osm_vendor_init: 1000
pending umads specified
using default guid 0x2c9020028d1b1
Jan 21 12:11:54 016407 [B78A6AE0] 0x02 -> osm_vendor_bind: Binding to
port 0x2c9020028d1b1
Jan 21 12:11:54 026556 [B78A6AE0] 0x02 -> osmtest_validate_sa_class_port_info:
-----------------------------
SA Class Port Info:
base_ver:1
class_ver:2
cap_mask:0x600
cap_mask2:0x0
resp_time_val:0x12
-----------------------------
Jan 21 12:11:54 031248 [41FE8940] 0x01 -> __osmv_sa_mad_rcv_cb: ERR
5501: Remote error:0x0006
Jan 21 12:11:54 031263 [41FE8940] 0x01 -> osmtest_query_res_cb: ERR
0003: Error on query (IB_REMOTE_ERROR)
Jan 21 12:11:54 031287 [B78A6AE0] 0x01 -> osmtest_get_all_recs: ERR
0064: ib_query failed (IB_REMOTE_ERROR)
Jan 21 12:11:54 031298 [B78A6AE0] 0x01 -> osmtest_get_all_recs: Remote
error = IB_SA_MAD_STATUS_INSUF_COMPS
Jan 21 12:11:54 031308 [B78A6AE0] 0x01 -> osmtest_write_all_path_recs:
ERR 0025: osmtest_get_all_recs failed (IB_REMOTE_ERROR)
Jan 21 12:11:54 031318 [B78A6AE0] 0x01 -> osmtest_run: ERR 0139:
Inventory file create failed (IB_REMOTE_ERROR)
OSMTEST: TEST "Create Inventory" FAIL
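To see the order in which the failures were reported, the ERR lines can be pulled out of the osmtest output. This is a rough sketch that works on a pasted copy of the output above; the names OSMTEST_OUTPUT, ERR_RE and error_chain are invented for the example and are not part of osmtest.

```python
import re

# The ERR lines from the osmtest output above, including the mail line wraps.
OSMTEST_OUTPUT = """\
Jan 21 12:11:54 031248 [41FE8940] 0x01 -> __osmv_sa_mad_rcv_cb: ERR
5501: Remote error:0x0006
Jan 21 12:11:54 031263 [41FE8940] 0x01 -> osmtest_query_res_cb: ERR
0003: Error on query (IB_REMOTE_ERROR)
Jan 21 12:11:54 031287 [B78A6AE0] 0x01 -> osmtest_get_all_recs: ERR
0064: ib_query failed (IB_REMOTE_ERROR)
Jan 21 12:11:54 031298 [B78A6AE0] 0x01 -> osmtest_get_all_recs: Remote
error = IB_SA_MAD_STATUS_INSUF_COMPS
Jan 21 12:11:54 031308 [B78A6AE0] 0x01 -> osmtest_write_all_path_recs:
ERR 0025: osmtest_get_all_recs failed (IB_REMOTE_ERROR)
Jan 21 12:11:54 031318 [B78A6AE0] 0x01 -> osmtest_run: ERR 0139:
Inventory file create failed (IB_REMOTE_ERROR)
"""

# Matches "-> <function>: ERR <4-digit code>:" once line wraps are undone.
ERR_RE = re.compile(r"-> (\w+): ERR ([0-9A-Fa-f]{4}):")


def error_chain(text):
    """Return (function, error_code) pairs in the order they were logged."""
    flattened = " ".join(text.split())  # collapse wraps and repeated spaces
    return ERR_RE.findall(flattened)


for function, code in error_chain(OSMTEST_OUTPUT):
    print(code, function)
```

The first entry in the chain is ERR 5501 from __osmv_sa_mad_rcv_cb, and every later error reports IB_REMOTE_ERROR. Since the bind to the port and the SA Class Port Info check both appear before that point, the failure seems to come from a response to a query rather than from the bind itself, but that reading is an interpretation of the log, not a confirmed diagnosis.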
I think (not certain) that my Lustre 5s timeouts occur because my
opensm is not configured correctly. I don't know if the port GUID is
accurate. I think the port GUID should incorporate the MAC address of
the IB hardware currently in use. That would be 80:00:04:04:FE:80 on
my new hardware. I notice that the old GUID is still shown as in use
on my Silver Storm InfiniBand switch, to which all of the machines are
connected.
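One way to compare the two values is to write the default GUID from the osmtest output as eight colon-separated bytes and look for the expected address in it. This is a small sketch using only the two values quoted in this post; the function name guid_to_bytes and the constant names are made up for the example.

```python
def guid_to_bytes(guid: int) -> str:
    """Render a 64-bit GUID as eight colon-separated hex bytes."""
    raw = guid.to_bytes(8, "big")
    return ":".join(f"{byte:02x}" for byte in raw)


# From the "using default guid" line in the osmtest output.
DEFAULT_GUID = 0x2C9020028D1B1

# The address quoted in the post for the new hardware, lowercased.
EXPECTED_ADDR = "80:00:04:04:fe:80"

guid_text = guid_to_bytes(DEFAULT_GUID)
print(guid_text)                    # 00:02:c9:02:00:28:d1:b1
print(EXPECTED_ADDR in guid_text)   # False
```

The six bytes 80:00:04:04:FE:80 do not appear anywhere in the GUID that osmtest bound to. A possible explanation, offered only as an assumption, is that 80:00:04:04:FE:80 is the start of the 20-byte IPoIB hardware address as ifconfig displays it, in which case the port GUID would be in the last eight bytes of that address rather than the first six. Comparing against the GUID printed on the new card or reported by the driver would settle whether 0x2c9020028d1b1 belongs to the old or the new hardware.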
I tried to update opensm on my client to a newer version (2.3.6-2),
but the dependency check fails because the new version requires
openib-1.4.1-3.el5. Two files on my opensm server/Lustre client
conflict with the openib package: /etc/rc.d/init.d/openibd and
/etc/udev/rules.d/90-ib.rules, both from the Lustre
kernel-ib-1.3.1-2.6.18_92.1.17.el5_lustre.1.6.7.1smp package.
My questions are:
1) Am I correct in thinking that my Lustre 5s timeout messages are
from a badly configured opensm?
2) As the machine serving opensm is a Lustre client, should I boot
into a non-Lustre kernel and apply the opensm updates to see if that
would solve the Lustre timeout message problem?
3) Are there better solutions that I have not yet thought of?
TIA,
megan