[Lustre-discuss] Multihomed Problem, can mount o2ib but not tcp

Mike Hanby mhanby at uab.edu
Thu Oct 29 15:13:55 PDT 2009


Howdy,

I have a working Lustre file system set up using Infiniband:
 1 x MDS/MGS server
 2 x OSS/OST servers, in active/active failover
 25 x client nodes

All of these systems use Infiniband with Lustre.

Now, I have 60 older compute nodes that I'd like to add to the system. These only have Gigabit Ethernet.

I've added the tcp network to lnet (see steps below), but when I attempt to mount my Lustre file system on a TCP client it fails, and the errors in /var/log/messages seem to indicate that it's trying to use o2ib rather than tcp:

# mount -t lustre 172.20.20.30@tcp:/lustre /lustre
mount.lustre: mount 172.20.20.30@tcp:/lustre at /lustre failed: No such file or directory
Is the MGS specification correct?
Is the filesystem name correct?
If upgrading, is the copied client log valid? (see upgrade docs)

kernel: LustreError: 2860:0:(events.c:460:ptlrpc_uuid_to_peer()) No NID found for 172.20.21.30@o2ib
kernel: LustreError: 2860:0:(client.c:69:ptlrpc_uuid_to_connection()) cannot find peer 172.20.21.30@o2ib!
kernel: LustreError: 2860:0:(ldlm_lib.c:329:client_obd_setup()) can't add initial connection 
kernel: LustreError: 2860:0:(obd_config.c:370:class_setup()) setup lustre-MDT0000-mdc-ffff81007eb82400 failed (-2) 
kernel: LustreError: 2860:0:(obd_config.c:1197:class_config_llog_handler()) Err -2 on cfg command: 
kernel: LustreError: 15c-8: MGC172.20.20.30@tcp: The configuration from log 'lustre-client' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
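
To make it easier to see which NIDs the errors refer to, here is a minimal sketch that pulls the distinct NIDs out of LustreError lines. The function name lustre_error_nids is made up for illustration, and the pattern assumes IPv4-style NIDs written in their native address@network form (the list archive renders "@" as " at "). It relies on grep -o, which GNU and BSD grep provide.

```shell
#!/bin/sh
# Hypothetical helper: print each distinct NID (address@network) that
# appears on a LustreError line. Log text is read from stdin.
lustre_error_nids() {
    grep 'LustreError' |
        grep -oE '[0-9]+(\.[0-9]+){3}@[a-z0-9]+' |
        sort -u
}

# Possible use on the client (log path as mentioned in this post):
#   lustre_error_nids < /var/log/messages
```

On the messages above this would list both the tcp NID used for the mount and the o2ib NID that the client could not resolve.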

Here's my modprobe.conf entry for the OSS/MDS servers (is the order of tcp and o2ib important here?):
options lnet networks="tcp0(eth0),o2ib(ib0)"
options ko2iblnd concurrent_sends=7
options ptlrpc at_max=600 
options ost oss_num_threads=512

modprobe.conf file for the IB clients
options lnet networks="o2ib(ib0)"
options ko2iblnd concurrent_sends=7
options ptlrpc at_max=600 
options ost oss_num_threads=512

And modprobe.conf for the TCP clients
options lnet networks="tcp0(eth0)"
options ptlrpc at_max=600 
options ost oss_num_threads=512

The 'lctl list_nids' command prints the expected results on the servers and clients, listing the networks provided in the modprobe.conf file.
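
As a rough way to script that check, the sketch below tests whether a given NID is on one of the networks a node reports locally. The helper names norm_net and nid_on_local_net are hypothetical. It assumes NIDs in address@network form, one per line as 'lctl list_nids' prints them, and treats a network name without a number as network 0 (so "tcp" and "tcp0" match).

```shell
#!/bin/sh
# Hypothetical helper: normalize an LNet network name so that "tcp"
# and "tcp0" (or "o2ib" and "o2ib0") compare equal.
norm_net() {
    case "$1" in
        *[0-9]) printf '%s\n' "$1" ;;
        *)      printf '%s0\n' "$1" ;;
    esac
}

# Hypothetical helper: succeed if the network of NID $1 matches the
# network of any local NID read from stdin (one NID per line).
nid_on_local_net() {
    want=$(norm_net "${1##*@}")
    while IFS= read -r nid; do
        [ "$(norm_net "${nid##*@}")" = "$want" ] && return 0
    done
    return 1
}

# Possible use on a TCP client:
#   lctl list_nids | nid_on_local_net 172.20.20.30@tcp && echo "same network"
```

A TCP-only client would pass this check for the tcp NID of the MGS and fail it for the o2ib NID named in the kernel errors.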

I added the failover and mgsnode settings to each lun (6 luns) using the following:
tunefs.lustre --failnode=172.20.20.31@tcp --failnode=172.20.20.32 \
--mgsnode=172.20.20.30@tcp /dev/mpath/lun1

With the final parameters being:
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: failover.node=172.20.21.31@o2ib failover.node=172.20.21.32@o2ib mgsnode=172.20.21.30@o2ib failover.node=172.20.20.31@tcp failover.node=172.20.20.32@tcp mgsnode=172.20.20.30@tcp
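
Since that Parameters line is hard to read at a glance, here is a small sketch that regroups it by network. The function name params_by_net is hypothetical, and it assumes the line uses space-separated key=value pairs with NIDs in address@network form, as in the tunefs.lustre output above.

```shell
#!/bin/sh
# Hypothetical helper: read a "Parameters:" line on stdin and print one
# "network key nid" row per mgsnode/failover.node entry, sorted so that
# entries on the same network end up next to each other.
params_by_net() {
    tr ' ' '\n' |
        grep -E '^(mgsnode|failover\.node)=' |
        while IFS='=' read -r key nid; do
            printf '%s %s %s\n' "${nid##*@}" "$key" "$nid"
        done |
        sort
}

# Possible use:
#   tunefs.lustre --print /dev/mpath/lun1 | grep '^Parameters:' | params_by_net
```

For the line above this gives three o2ib rows followed by three tcp rows, which makes it easy to confirm that every target carries an mgsnode and both failover nodes on each network.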

The /etc/fstab on the TCP clients has this entry:
172.20.20.30@tcp0:/lustre    /lustre                 lustre  _netdev         0 0

I've rebooted all of the servers after making these changes, and I still can't mount from the TCP clients, although I can from the IB clients.

Any suggestions?

=================================
Mike Hanby
mhanby at uab.edu
Information Systems Specialist II
IT HPCS / Research Computing




