[Lustre-discuss] o2ib possible network problems??
Andreas Dilger
adilger at sun.com
Sat Sep 20 15:23:36 PDT 2008
On Sep 18, 2008 14:04 -0400, Ms. Megan Larko wrote:
> /dev/sdk1 6.3T 878G 5.1T 15% /srv/lustre/OST/crew8-OST0010
> /dev/sdk2 6.3T 891G 5.1T 15% /srv/lustre/OST/crew8-OST0011
>
> 25 UP osc crew8-OST000a-osc crew8-mdtlov_UUID 5
> 26 UP osc crew8-OST000b-osc crew8-mdtlov_UUID 5
>
> (NOTE: last two disks came in as crew8-OST000a and crew8-OST000b and
> not crew8-OST0010 and crew8-OST0011 respectively. I don't know if
> that has anything at all to do with my issue.)
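One way to check which index each OST was actually formatted with is to read the target label straight off the device. A hedged sketch — the device names are taken from the df output above, and the exact output format varies by Lustre version:

```shell
# Print the on-disk Lustre target configuration without modifying anything.
# The "Target:" / "Index:" lines show the index the OST was formatted with,
# which should match the crew8-OST00xx name the MDS reports.
tunefs.lustre --dryrun /dev/sdk1
tunefs.lustre --dryrun /dev/sdk2
```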
Hmm, that is a bit strange, I don't know that I've seen this before.
> crew8-OST0003-osc-ffff81083ea5c400: Connection to service
> crew8-OST0003 via nid 172.18.0.15 at o2ib was lost; in progress
> crew8-OST0003-osc-ffff81083ea5c400: Connection to service
> crew8-OST0003 via nid 172.18.0.15 at o2ib was lost; in progress
>
> The MGS/MDS /var/log/messages reads:
> [root@mds1 ~]# tail /var/log/messages

> Sep 18 13:50:58 mds1 kernel: Lustre: crew8-OST0005-osc: Connection to
> service crew8-OST0005 via nid 172.18.0.15 at o2ib was lost; in progress
>
> So, I am seeing that OSS4 is repeatedly losing its network contact
> with the MGS/MDS machine mds1.
It is also losing connection to the crew01 client, I'd suspect some
kind of network problem (e.g. cable).
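A quick way to separate a Lustre-level problem from a fabric problem is to ping the OSS's NID over LNet and then look at the IB port state and error counters. Illustrative commands only — the NID comes from the log messages above, and ibstat/perfquery are from the infiniband-diags package, which may not be installed:

```shell
# LNet-level ping of the OSS, run from the MDS or a client:
lctl ping 172.18.0.15@o2ib

# Check the local HCA port state and error counters for signs of a
# bad cable or flaky link (requires infiniband-diags):
ibstat
perfquery
```

Rising symbol/link error counters in perfquery output between two runs usually point at a cable or port, not at Lustre.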
>
> I am guessing that I need to increase a Lustre client timeout value
> for our o2ib connections so that the new disks stop generating these
> messages (the /crewdat disk itself seems to be fine for user access).
This seems unlikely, unless you have a large cluster (e.g. 500+ clients).
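For reference, the global obd timeout can be inspected and raised on the fly, though per the point above that is usually only needed on large clusters. A sketch under assumptions: 300 is an arbitrary example value, the setting is not persistent across remounts, and on older 1.6 releases the same tunable lives at /proc/sys/lustre/timeout:

```shell
# Show the current obd timeout (seconds):
lctl get_param timeout

# Temporarily raise it, e.g. to 300 seconds:
lctl set_param timeout=300
```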
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.