[Lustre-discuss] opensm-3.2.2 and kernel-ib-1.3.1-2.6.18_92.1.17.el5_lustre.1.6.7.1smp

Ms. Megan Larko dobsonunit at gmail.com
Thu Jan 21 09:22:25 PST 2010


Hi,

On my mds (CentOS 5.2 2.6.18-53.1.13.el5_lustre.1.6.4.3smp) I have
recently been seeing the following errors in /var/log/messages:

Jan 21 11:37:45 mds1 kernel: LustreError:
2844:0:(client.c:975:ptlrpc_expire_one_request()) @@@ timeout (sent at
1264091765, 100s ago)  req@ffff81006e421200 x1237502/t0
o250->MGS@MGC192.168.64.210@o2ib_0:26 lens 304/328 ref 1 fl Rpc:/0/0
rc 0/-22
Jan 21 11:37:45 mds1 kernel: LustreError:
2844:0:(client.c:975:ptlrpc_expire_one_request()) Skipped 4 previous
similar messages
Jan 21 11:46:27 mds1 kernel: LustreError:
23713:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID
req@ffff810049133600 x1237825/t0 o101->MGS@MGC192.168.64.210@o2ib_0:26
lens 232/240 ref 1 fl Rpc:/0/0 rc 0/0
Jan 21 11:46:27 mds1 kernel: LustreError:
23713:0:(client.c:519:ptlrpc_import_delay_req()) Skipped 19 previous
similar messages
Jan 21 11:48:10 mds1 kernel: LustreError:
2844:0:(client.c:975:ptlrpc_expire_one_request()) @@@ timeout (sent at
1264092390, 100s ago)  req@ffff810049133000 x1237827/t0
o250->MGS@MGC192.168.64.210@o2ib_0:26 lens 304/328 ref 1 fl Rpc:/0/0
rc 0/-22
Jan 21 11:48:10 mds1 kernel: LustreError:
2844:0:(client.c:975:ptlrpc_expire_one_request()) Skipped 4 previous
similar messages
Jan 21 11:57:11 mds1 kernel: LustreError:
23713:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID
req@ffff81003de5be00 x1238163/t0 o101->MGS@MGC192.168.64.210@o2ib_0:26
lens 232/240 ref 1 fl Rpc:/0/0 rc 0/0
Jan 21 11:57:11 mds1 kernel: LustreError:
23713:0:(client.c:519:ptlrpc_import_delay_req()) Skipped 19 previous
similar messages
Jan 21 11:58:35 mds1 kernel: LustreError:
2844:0:(client.c:975:ptlrpc_expire_one_request()) @@@ timeout (sent at
1264093015, 100s ago)  req@ffff81007482ae00 x1238150/t0
o250->MGS@MGC192.168.64.210@o2ib_0:26 lens 304/328 ref 1 fl Rpc:/0/0
rc 0/-22
Jan 21 11:58:35 mds1 kernel: LustreError:
2844:0:(client.c:975:ptlrpc_expire_one_request()) Skipped 4 previous
similar messages
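
The "sent at" values in the timeout messages above are Unix epoch
seconds.  As a rough cross-check, a minimal Python sketch (not part of
any Lustre tooling; the function name is illustrative) converts them to
UTC, adds the 100s expiry, and reports the spacing between requests.
The comparison with the syslog timestamps assumes that the clock on
mds1 runs at UTC-5, which the logs themselves do not state.

```python
from datetime import datetime, timedelta, timezone

# "sent at" values copied from the three timeout messages in the log.
SENT_AT = [1264091765, 1264092390, 1264093015]
EXPIRY = timedelta(seconds=100)  # the "100s ago" reported in each message


def expiry_times(sent_epochs, expiry=EXPIRY):
    """Return the UTC datetime at which each request was reported expired."""
    return [
        datetime.fromtimestamp(sent, tz=timezone.utc) + expiry
        for sent in sent_epochs
    ]


def gaps(sent_epochs):
    """Return the number of seconds between consecutive requests."""
    return [later - earlier for earlier, later in zip(sent_epochs, sent_epochs[1:])]


if __name__ == "__main__":
    for moment in expiry_times(SENT_AT):
        # Assumption: mds1 logs in UTC-5, so 16:37:45 UTC is 11:37:45 local.
        print(moment.strftime("%Y-%m-%d %H:%M:%S UTC"))
    print(gaps(SENT_AT))
```

Under that assumption the expiry times line up with the syslog
timestamps, and the requests are sent at a fixed interval, which
suggests a periodic retry rather than a burst of independent failures.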

I have seen "IMP_INVALID" error messages before and solved them by
replacing the IB cable.  That does not seem to help this time.

There are zero errors on the OSS (CentOS 5.2
2.6.18-53.1.13.el5_lustre.1.6.4.3smp).

There are errors on the client (CentOS 5.3
2.6.18-92.1.17.el5_lustre.1.6.7.1smp).  The /var/log/messages states:
Jan 21 11:54:21 crew kernel: Lustre: Request x280981444 sent from
MGC192.168.64.210@o2ib to NID 192.168.64.210@o2ib 5s ago has timed out
(limit 5s).
Jan 21 11:54:21 crew kernel: Lustre: Skipped 24 previous similar messages
Jan 21 11:56:32 crew last message repeated 2 times
Jan 21 12:04:46 crew kernel: Lustre: Request x280981807 sent from
MGC192.168.64.210@o2ib to NID 192.168.64.210@o2ib 5s ago has timed out
(limit 5s).
Jan 21 12:04:46 crew kernel: Lustre: Skipped 24 previous similar messages

The Lustre disk is behaving fine on the client, perhaps slightly
slower than usual but barely noticeably so, so this is not critical at
this time.  I would like to understand the issue.

I did change hardware recently so that my InfiniBand subnet manager
(opensm v. 3.2.2) is running on a new IB card.   My understanding is
that ordinarily when opensm starts it looks for the appropriate GUID.
One is able to look at this via the "osmtest -v" command.  An
inventory file (osmtest.dat) is supposed to be generated at start-up
or by the command "osmtest -fc".   My osmtest commands fail with:

[root@crew ~]# osmtest -fc

Command Line Arguments
Done with args
        Flow = Create Inventory
Jan 21 12:11:54 009612 [B78A6AE0] 0x7f -> Setting log level to: 0x03
Jan 21 12:11:54 009741 [B78A6AE0] 0x02 -> osm_vendor_init: 1000
pending umads specified
using default guid 0x2c9020028d1b1
Jan 21 12:11:54 016407 [B78A6AE0] 0x02 -> osm_vendor_bind: Binding to
port 0x2c9020028d1b1
Jan 21 12:11:54 026556 [B78A6AE0] 0x02 -> osmtest_validate_sa_class_port_info:
-----------------------------
SA Class Port Info:
 base_ver:1
 class_ver:2
 cap_mask:0x600
 cap_mask2:0x0
 resp_time_val:0x12
-----------------------------
Jan 21 12:11:54 031248 [41FE8940] 0x01 -> __osmv_sa_mad_rcv_cb: ERR
5501: Remote error:0x0006
Jan 21 12:11:54 031263 [41FE8940] 0x01 -> osmtest_query_res_cb: ERR
0003: Error on query (IB_REMOTE_ERROR)
Jan 21 12:11:54 031287 [B78A6AE0] 0x01 -> osmtest_get_all_recs: ERR
0064: ib_query failed (IB_REMOTE_ERROR)
Jan 21 12:11:54 031298 [B78A6AE0] 0x01 -> osmtest_get_all_recs: Remote
error = IB_SA_MAD_STATUS_INSUF_COMPS
Jan 21 12:11:54 031308 [B78A6AE0] 0x01 -> osmtest_write_all_path_recs:
ERR 0025: osmtest_get_all_recs failed (IB_REMOTE_ERROR)
Jan 21 12:11:54 031318 [B78A6AE0] 0x01 -> osmtest_run: ERR 0139:
Inventory file create failed (IB_REMOTE_ERROR)
OSMTEST: TEST "Create Inventory" FAIL

I think (but am not certain) that my Lustre 5s timeouts occur because
my opensm is not configured correctly.  I don't know if the port GUID
is accurate.  I think the port GUID should incorporate the MAC address
of the IB hardware currently in use.  That would be 80:00:04:04:FE:80
on my new hardware.  I notice that the old GUID is indicated as still
in use on my Silver Storm InfiniBand switch, into which all of the
machines are connected.
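
For reference, a minimal Python sketch of how the two values quoted
above can be put into a comparable form.  It assumes that the port GUID
is a 64-bit value normally written as 16 hex digits, and that
80:00:04:04:FE:80 is the truncated start of a 20-byte IPoIB hardware
address laid out as in RFC 4391 (1 flags byte, 3 QPN bytes, then the
16-byte GID).  Neither assumption is confirmed by the output above, and
the function names are illustrative only.

```python
def normalize_guid(guid_text):
    """Return a GUID as 16 lowercase hex digits (64 bits, zero padded)."""
    value = int(guid_text, 16)  # accepts an optional "0x" prefix
    if value >= 1 << 64:
        raise ValueError("GUID does not fit in 64 bits")
    return format(value, "016x")


def split_ipoib_prefix(addr_text):
    """Split the leading bytes of an IPoIB hardware address.

    Assumed layout (RFC 4391): flags (1 byte), QPN (3 bytes), GID (16 bytes).
    Returns (flags, qpn, hex string of whatever GID bytes are present).
    """
    raw = bytes(int(part, 16) for part in addr_text.split(":"))
    if len(raw) < 4:
        raise ValueError("need at least the flags and QPN bytes")
    flags = raw[0]
    qpn = int.from_bytes(raw[1:4], "big")
    return flags, qpn, raw[4:].hex()


if __name__ == "__main__":
    # GUID as printed by osmtest ("using default guid ...").
    print(normalize_guid("0x2c9020028d1b1"))
    # Address bytes as quoted in the paragraph above.
    print(split_ipoib_prefix("80:00:04:04:FE:80"))
```

If the layout assumption holds, the FE:80 bytes are the beginning of
the GID rather than part of the GUID, and the GUID itself would sit in
the last 8 bytes of the full address, which are not shown here.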

I tried to update the client running opensm to a newer version
(2.3.6-2), but the dependency check fails because openib-1.4.1-3.el5
is required for the new version.  Two files on my opensm server/Lustre
client conflict with the openib package: /etc/rc.d/init.d/openibd and
/etc/udev/rules.d/90-ib.rules, both from the Lustre package
kernel-ib-1.3.1-2.6.18_92.1.17.el5_lustre.1.6.7.1smp.

My questions are:
1)  Am I correct in thinking that my Lustre 5s timeout messages are
from a badly configured opensm?
2)  As the machine serving opensm is a Lustre client, should I boot
into a non-Lustre kernel and apply the opensm updates to see if that
would solve the Lustre timeout message problem?
3)  Are there better solutions that I have not yet thought of?

TIA,
megan


