[Lustre-discuss] o2ib possible network problems??

Ms. Megan Larko dobsonunit@gmail.com
Sat Sep 20 19:29:11 PDT 2008


Hello!

I don't think it is a timeout issue any longer.  The timeout value is
the same for all of the Lustre systems mounted via our MGS/MDS
system, and it is rather high: currently 1000.  I got the value from
"cat /proc/sys/lustre/timeout" on the MGS/MDS box.
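
(For reference, these are the only commands I run on each node to
compare that value; "lctl get_param" is the newer interface and, if I
remember right, needs a fairly recent 1.6.x, so the /proc read is the
safe one.)

# obd timeout, readable on any server or client with Lustre loaded
cat /proc/sys/lustre/timeout

# same value via lctl on newer releases
lctl get_param timeout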

I changed the IB cable on the problem box using the same IB card, PCI
slot and slot on the IB SilverStorm switch.
The errors I now see on the clients are the same, but on the OSS
serving crew8-OST0000 through crew8-OST0011 the errors are:
ib0: multicast join failed for
ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -22
LustreError: 4346:0:(filter.c:2674:filter_destroy_precreated())
LustreError: 4486:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@
processing error -107

Perhaps it could be the IB card?   It is a Mellanox Technologies
MT25204 [InfiniHost III Lx HCA].  This is the same card in many, but
not all, of our other systems.  I can try a new IB card on Monday.
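
In the meantime, these are the quick checks I plan to run on oss4
before swapping the card.  I'm assuming the stock OFED/infiniband-diags
tools are installed and that ib0 is the only IPoIB interface on that
box:

# HCA and port state -- looking for State: Active / Physical state: LinkUp
ibstat

# lower-level view of the same ports
ibv_devinfo | grep -E 'port:|state'

# the IPoIB interface that logged the multicast join failure
ip addr show ib0

# confirm a subnet manager is running/reachable on the fabric
sminfo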

On the OSS, the following lines repeat every two minutes (from
/var/log/messages):
Sep 20 22:20:32 oss4 kernel: LustreError:
3775:0:(client.c:975:ptlrpc_expire_one_request()) @@@ timeout (sent at
1221963532, 100s ago)  req@ffff810383d5de00 x46975/t0
o250->MGS@MGC172.18.0.10@o2ib_0:26 lens 304/328 ref 1 fl Rpc:/0/0 rc
0/-22
Sep 20 22:20:32 oss4 kernel: LustreError:
3775:0:(client.c:975:ptlrpc_expire_one_request()) Skipped 5 previous
similar messages
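
If it is the fabric rather than the card, I would expect an LNet-level
ping from oss4 to the MGS NID to stall or fail intermittently as well,
so I will capture that alongside the IB checks (172.18.0.10@o2ib is
our MGS NID):

# from oss4: LNet reachability of the MGS
lctl ping 172.18.0.10@o2ib

# and the NIDs oss4 itself is advertising
lctl list_nids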


Thank you,
megan

On Sat, Sep 20, 2008 at 6:23 PM, Andreas Dilger <adilger@sun.com> wrote:
> On Sep 18, 2008  14:04 -0400, Ms. Megan Larko wrote:
>> /dev/sdk1             6.3T  878G  5.1T  15% /srv/lustre/OST/crew8-OST0010
>> /dev/sdk2             6.3T  891G  5.1T  15% /srv/lustre/OST/crew8-OST0011
>>
>>  25 UP osc crew8-OST000a-osc crew8-mdtlov_UUID 5
>>  26 UP osc crew8-OST000b-osc crew8-mdtlov_UUID 5
>>
>> (NOTE: last two disks came in as crew8-OST000a and crew8-OST000b and
>> not crew8-OST0010 and crew8-OST0011 respectively.  I don't know if
>> that has anything at all to do with my issue.)
>
> Hmm, that is a bit strange, I don't know that I've seen this before.
>
>> crew8-OST0003-osc-ffff81083ea5c400: Connection to service
>> crew8-OST0003 via nid 172.18.0.15@o2ib was lost; in progress
>> crew8-OST0003-osc-ffff81083ea5c400: Connection to service
>> crew8-OST0003 via nid 172.18.0.15@o2ib was lost; in progress
>>
>> The MGS/MDS /var/log/messages reads:
>> [root@mds1 ~]# tail /var/log/messages
>> Sep 18 13:50:58 mds1 kernel: Lustre: crew8-OST0005-osc: Connection to
>> service crew8-OST0005 via nid 172.18.0.15@o2ib was lost; in progress
>>
>> So---I am seeing that OSS4 is repeatedly losing its network contact
>> with MGS/MDS machine mds1.
>
> It is also losing connection to the crew01 client; I'd suspect some
> kind of network problem (e.g. cable).
>>
>> I am guessing that I need to increase a lustre client timeout value
>> for our o2ib connections for the new disk to not generate these
>> messages (the /crewdat disk itself seems to be fine for user access).
>
> This seems unlikely, unless you have a large cluster (e.g. 500+ clients).
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
>


