[Lustre-discuss] network failover with IB+eth?

Erich Focht efocht at hpce.nec.com
Tue Apr 8 01:25:45 PDT 2008


Hello,

On a setup with o2ib and Ethernet configured on both the Lustre servers
and the clients, I'd expect that unplugging the InfiniBand cable on one
of the OSSes would make the client switch over to Ethernet and continue
I/O. Unfortunately this doesn't happen: the client I/O stalls and
resumes only after the IB cable is plugged back in.
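
For reference, a two-network LNET setup of this kind is normally
declared through the lnet module options. The following is only a
sketch: the interface names ib0 and eth0 are assumptions, not copied
from our nodes.

```
# /etc/modprobe.conf (sketch; ib0 and eth0 are assumed interface names)
options lnet networks="o2ib0(ib0),tcp0(eth0)"
```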

Is there anything wrong with the setup? It uses pairwise failover
servers, so maybe that's part of the problem? Is the order of the
failnode arguments correct?

Here's what we have: (sorry for the many details...)

MGS/MGT are mounted on the same node:
 Target:     MGS
 Index:      unassigned
 Lustre FS:  lustre
 Mount type: ldiskfs
 Flags:      0x174      (MGS needs_index first_time update writeconf )
 Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
Parameters:
 failover.node=10.3.0.227@o2ib,192.168.130.227@tcp,10.3.0.226@o2ib,192.168.130.226@tcp
 mgsnode=10.3.0.227@o2ib,192.168.130.227@tcp,10.3.0.226@o2ib,192.168.130.226@tcp

 Target:     lustre-MDT0000
 Index:      0
 Lustre FS:  lustre
 Mount type: ldiskfs
 Flags:      0x1        (MDT )
 Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
Parameters:
 mgsnode=10.3.0.226@o2ib,192.168.130.226@tcp,10.3.0.227@o2ib,192.168.130.227@tcp
 failover.node=10.3.0.227@o2ib,192.168.130.227@tcp
 mdt.group_upcall=/usr/sbin/l_getgroups

OST: parameters were rewritten with tunefs.lustre:
tunefs.lustre --ost --erase-param \
 --mgsnode=10.3.0.226@o2ib0,192.168.130.226@tcp0:10.3.0.227@o2ib0,192.168.130.227@tcp0 \
 --failnode=10.3.0.229@o2ib0,192.168.130.229@tcp0 --writeconf \
 /dev/mpath/ost100


Client notices the failed OST path:
# lfs check servers
lustre-MDT0000-mdc-ffff810007107000 active.
error: check 'lustre-OST0000-osc-ffff810007107000': Connection timed out (110)

but tries to connect to the failover OSS partner instead of trying the
other network:
netptune121: LustreError: 11-0: an error occurred while communicating
  with 10.3.0.229@o2ib. The ost_connect operation failed with -19
doss2: LustreError: 137-5: UUID 'lustre-OST0000_UUID' is not available
  for connect (no target)

Thanks in advance for any hint...

Best regards,
Erich