[Lustre-discuss] LustreError: 11-0: an error occurred while communicating with 192.168.16.24@o2ib. The ost_connect operation failed with -19

Dennis Nelson dnelson@sgi.com
Wed Mar 25 10:03:59 PDT 2009


On 3/25/09 11:12 AM, "Kevin Van Maren" <Kevin.Vanmaren@Sun.COM> wrote:

> Dennis,
> 
> You haven't provided enough context for people to help.
> 
> What have you done to determine if the IB fabric is working properly?

Basic functionality appears to be intact.  I can lctl ping between all of
the servers, ibdiagnet reports a clean fabric, and several runs of
ib_rdma_bw between various Lustre servers completed with good performance.
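
(For reference, a minimal sketch of the checks described above; the NID and
hostnames are taken from the listings below, and the ib_rdma_bw pairing is
only illustrative:)

# LNET-level reachability to the NID reporting the -19 error
lctl ping 192.168.16.24@o2ib

# fabric-wide diagnostics, run from a node that can reach the subnet manager
ibdiagnet

# RDMA bandwidth test between two servers: start the listener first,
# then point the client at it
ssh oss4 ib_rdma_bw &   # server side, waits for one connection
ib_rdma_bw oss4         # client side, prints bandwidth on completion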
> 
> What are hostnames and NIDs for the 10 servers (lctl list_nids)?
Executing on mds2
192.168.17.11@o2ib
Executing on mds1
192.168.16.11@o2ib
Executing on oss1
192.168.16.21@o2ib
Executing on oss2
192.168.16.22@o2ib
Executing on oss3
192.168.16.23@o2ib
Executing on oss4
192.168.16.24@o2ib
Executing on oss5
192.168.17.21@o2ib
Executing on oss6
192.168.17.22@o2ib
Executing on oss7
192.168.17.23@o2ib
Executing on oss8
192.168.17.24@o2ib
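
(A loop along these lines, assuming password-less ssh to each host, would
produce output like the listing above:)

for h in mds2 mds1 oss1 oss2 oss3 oss4 oss5 oss6 oss7 oss8; do
    echo "Executing on $h"
    ssh $h lctl list_nids
done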

> Which OSTs are on which servers?
(df columns: device, 1K-blocks, used, available, use%, mount point)
Lustre filesystems on mds2 (none mounted)
Lustre filesystems on mds1
/dev/mapper/mdt      2009362216    485528 2008876688   1% /mnt/mdt
Lustre filesystems on oss1
/dev/mapper/ost0000 1130279280   715816 1129563464   1% /mnt/ost0000
/dev/mapper/ost0001 1130279280   659436 1129619844   1% /mnt/ost0001
/dev/mapper/ost000f 1130279280   667208 1129612072   1% /mnt/ost000f
Lustre filesystems on oss2
/dev/mapper/ost0002 1130279280   697520 1129581760   1% /mnt/ost0002
/dev/mapper/ost0003 1130279280   585260 1129694020   1% /mnt/ost0003
/dev/mapper/ost0010 1130279280   600640 1129678640   1% /mnt/ost0010
Lustre filesystems on oss3
/dev/mapper/ost0004 1130279280   515628 1129763652   1% /mnt/ost0004
/dev/mapper/ost0005 1130279280   549292 1129729988   1% /mnt/ost0005
/dev/mapper/ost0011 1130279280   697956 1129581324   1% /mnt/ost0011
Lustre filesystems on oss4
/dev/mapper/ost0006 1130279280   565684 1129713596   1% /mnt/ost0006
/dev/mapper/ost0012 1130279280    482856 1129796424   1% /mnt/ost0012
/dev/mapper/ost0013 1130279280   482856 1129796424   1% /mnt/ost0013
Lustre filesystems on oss5
/dev/mapper/ost0007 1130279280   532844 1129746436   1% /mnt/ost0007
/dev/mapper/ost0008 1130279280   682308 1129596972   1% /mnt/ost0008
/dev/mapper/ost0014 1130279280   532016 1129747264   1% /mnt/ost0014
/dev/mapper/ost0015 1130279280   482856 1129796424   1% /mnt/ost0015
Lustre filesystems on oss6
/dev/mapper/ost0009 1130279280   482860 1129796420   1% /mnt/ost0009
/dev/mapper/ost000a 1130279280   585260 1129694020   1% /mnt/ost000a
/dev/mapper/ost0016 1130279280   499244 1129780036   1% /mnt/ost0016
/dev/mapper/ost0017 1130279280   482856 1129796424   1% /mnt/ost0017
Lustre filesystems on oss7
/dev/mapper/ost000b 1130279280   482852 1129796428   1% /mnt/ost000b
/dev/mapper/ost000c 1130279280   482872 1129796408   1% /mnt/ost000c
/dev/mapper/ost0018 1130279280   581172 1129698108   1% /mnt/ost0018
/dev/mapper/ost0019 1130279280   665556 1129613724   1% /mnt/ost0019
Lustre filesystems on oss8
/dev/mapper/ost000d 1130279280   687688 1129591592   1% /mnt/ost000d
/dev/mapper/ost000e 1130279280   606008 1129673272   1% /mnt/ost000e
/dev/mapper/ost001a 1130279280   511600 1129767680   1% /mnt/ost001a
/dev/mapper/ost001b 1130279280   482852 1129796428   1% /mnt/ost001b
> 
> OST4 is on a machine at 192.168.16.23
Yes, oss3.

> What machine is 192.168.16.24?  Is that the OST4 failover partner?
Yes, oss4 is the failover partner.
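
(A quick way to confirm which partner is actually serving the OST at a given
moment, assuming ssh access to both nodes, is to check the ldiskfs mounts:)

ssh oss3 'mount -t ldiskfs | grep ost0004'   # expect a hit on the primary
ssh oss4 'mount -t ldiskfs | grep ost0004'   # empty unless it has failed over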
> 
> You have a client at 192.168.16.1?

Yes, and the client hangs each time I attempt I/O.
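
(On the hanging client, a sketch of what I would check, assuming the Lustre
1.6 /proc layout and the "lustre" fsname shown in the tunefs output below:)

lctl dl                                            # device list and states
cat /proc/fs/lustre/osc/*OST0004*/ost_conn_uuid    # NID the client is using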

oss3:~ # tunefs.lustre --dryrun /dev/mapper/ost0004
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

   Read previous values:
Target:     lustre-OST0004
Index:      4
Lustre FS:  lustre
Mount type: ldiskfs
Flags:      0x2
              (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=192.168.16.11@o2ib mgsnode=192.168.17.11@o2ib
failover.node=192.168.16.24@o2ib


   Permanent disk data:
Target:     lustre-OST0004
Index:      4
Lustre FS:  lustre
Mount type: ldiskfs
Flags:      0x2
              (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=192.168.16.11@o2ib mgsnode=192.168.17.11@o2ib
failover.node=192.168.16.24@o2ib

exiting before disk write.
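
(Worth noting: -19 is -ENODEV, "No such device", which is consistent with the
connect request reaching 192.168.16.24@o2ib while lustre-OST0004 is not
mounted there.  The value can be confirmed from the kernel headers, if
installed:)

grep ENODEV /usr/include/asm-generic/errno-base.h
# -> #define ENODEV  19  /* No such device */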

> 
> Kevin
> 
> 
> Dennis Nelson wrote:
>> 
>> Hi,
>> 
>> I have encountered an issue with Lustre that has happened a couple of
>> times now.  I am beginning to suspect a problem with the IB fabric, but
>> wanted to reach out to the list to confirm my suspicions.  The odd part
>> is that even when the MDS complains that it cannot connect to a given
>> OST, lctl ping to the OSS that owns that OST works without an issue.
>> Also, the OSS in question has other OSTs which, in the latest case,
>> have not reported any errors.
>> 
>> I have attached a file with the errors that I encountered on the MDS.
>> I am running Lustre 1.6.6 with a pair of MDSs and 8 OSSs, with 28 OSTs
>> spread across the 8 OSSs.  I am using IB DDR interconnects between all
>> systems.
>> 
>> Thanks,
>> 
> 



