[Lustre-discuss] o2ib possible network problems??
Andreas Dilger
adilger at sun.com
Sat Sep 20 15:23:36 PDT 2008
On Sep 18, 2008 14:04 -0400, Ms. Megan Larko wrote:
> /dev/sdk1 6.3T 878G 5.1T 15% /srv/lustre/OST/crew8-OST0010
> /dev/sdk2 6.3T 891G 5.1T 15% /srv/lustre/OST/crew8-OST0011
>
> 25 UP osc crew8-OST000a-osc crew8-mdtlov_UUID 5
> 26 UP osc crew8-OST000b-osc crew8-mdtlov_UUID 5
>
> (NOTE: last two disks came in as crew8-OST000a and crew8-OST000b and
> not crew8-OST0010 and crew8-OST0011 respectively. I don't know if
> that has anything at all to do with my issue.)
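One way to check which index each OST was actually formatted with is to read the target label straight off the device. A hedged sketch — the device names are taken from the df output above, and the exact output format varies by Lustre version:

```shell
# Print the on-disk Lustre target configuration without modifying anything.
# The "Target:" / "Index:" lines show the index the OST was formatted with,
# which should match the crew8-OST00xx name the MDS reports.
tunefs.lustre --dryrun /dev/sdk1
tunefs.lustre --dryrun /dev/sdk2
```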
Hmm, that is a bit strange, I don't know that I've seen this before.
> crew8-OST0003-osc-ffff81083ea5c400: Connection to service
> crew8-OST0003 via nid 172.18.0.15 at o2ib was lost; in progress
> crew8-OST0003-osc-ffff81083ea5c400: Connection to service
> crew8-OST0003 via nid 172.18.0.15 at o2ib was lost; in progress
>
> The MGS/MDS /var/log/messages reads:
> [root@mds1 ~]# tail /var/log/messages

> Sep 18 13:50:58 mds1 kernel: Lustre: crew8-OST0005-osc: Connection to
> service crew8-OST0005 via nid 172.18.0.15 at o2ib was lost; in progress
>
> So, I am seeing that OSS4 is repeatedly losing its network contact
> with the MGS/MDS machine mds1.
It is also losing connection to the crew01 client, I'd suspect some
kind of network problem (e.g. cable).
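A quick way to separate a Lustre-level problem from a fabric problem is to ping the OSS's NID over LNet and then look at the IB port state and error counters. Illustrative commands only — the NID comes from the log messages above, and ibstat/perfquery are from the infiniband-diags package, which may not be installed:

```shell
# LNet-level ping of the OSS, run from the MDS or a client:
lctl ping 172.18.0.15@o2ib

# Check the local HCA port state and error counters for signs of a
# bad cable or flaky link (requires infiniband-diags):
ibstat
perfquery
```

Rising symbol/link error counters in perfquery output between two runs usually point at a cable or port, not at Lustre.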
>
> I am guessing that I need to increase a Lustre client timeout value
> for our o2ib connections so that the new disks stop generating these
> messages (the /crewdat disk itself seems to be fine for user access).
This seems unlikely, unless you have a large cluster (e.g. 500+ clients).
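For reference, the global obd timeout can be inspected and raised on the fly, though per the point above that is usually only needed on large clusters. A sketch under assumptions: 300 is an arbitrary example value, the setting is not persistent across remounts, and on older 1.6 releases the same tunable lives at /proc/sys/lustre/timeout:

```shell
# Show the current obd timeout (seconds):
lctl get_param timeout

# Temporarily raise it, e.g. to 300 seconds:
lctl set_param timeout=300
```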
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.