[Lustre-discuss] Problem with LNET and openibd on Lustre 1.8.4 while rebooting

Roy Dragseth roy.dragseth at uit.no
Wed Sep 22 05:26:46 PDT 2010


(couldn't decide on top-post or down-post here, so I deleted the whole original 
message)

We have just upgraded our Rocks cluster to use the CentOS 5.5 rpms, which 
include a complete OFED stack (v1.4.2?), so we decided to just ditch our own 
self-compiled version of OFED 1.4.1.

We then ran into the same problem with openibd hanging on shutdown.  After a 
futile attempt to inject a lustre-unload-modules service between netfs and 
openib to run lustre_rmmod, I tried to hack modprobe.conf to eject the lustre 
modules by inserting this:

remove rdma_cm /usr/sbin/lustre_rmmod && /sbin/modprobe -r --ignore-remove rdma_cm

This didn't work either, because the openibd service script uses rmmod instead 
of modprobe -r (aargghh).
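
If you want to check this on your own nodes: rmmod removes modules directly 
and never consults the "remove" rules in modprobe.conf, while modprobe -r does. 
A rough sketch for listing the unload calls in an init script follows. The 
helper name show_unload_calls and the default path /etc/init.d/openibd are my 
assumptions, so adjust them to your install.

```shell
# Hypothetical helper: print the lines of an init script that unload modules.
# rmmod ignores modprobe.conf "remove" rules; modprobe -r honours them.
# The default path is an assumption about where the openibd script lives.
show_unload_calls() {
    script="${1:-/etc/init.d/openibd}"
    # Match a bare "rmmod" or "modprobe -r", but not names such as
    # lustre_rmmod that merely contain the string.
    grep -nE '(^|[^[:alnum:]_])(rmmod|modprobe[[:space:]]+-r)([[:space:]]|$)' "$script"
}
```

If the output shows only rmmod calls, a "remove" hook in modprobe.conf will 
not be triggered by that script.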

So, the solution that seems to work is to disable openibd (chkconfig openibd 
off) and let the network initialization take care of loading the right modules 
by putting this into modprobe.conf:

alias ib0 ib_ipoib
install ib_ipoib modprobe mlx4_ib && /sbin/modprobe --ignore-install ib_ipoib

Then network startup will load the right ib modules and the netfs service will 
automatically load the lustre modules when mounting the lustre partitions.
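
The workaround above could be scripted roughly like this. It is only a sketch: 
the function name add_ipoib_rules is made up for illustration, the config path 
defaults to /etc/modprobe.conf, and the HCA module defaults to mlx4_ib as on 
our hardware. It appends the two lines only if they are not already present, 
so it should be safe to run more than once.

```shell
# Hypothetical helper: add the ib0 alias and the ib_ipoib install rule to a
# modprobe config file, skipping lines that are already there.
#   $1 = config file (assumed default: /etc/modprobe.conf)
#   $2 = HCA driver module (assumed default: mlx4_ib)
add_ipoib_rules() {
    conf="${1:-/etc/modprobe.conf}"
    hca_module="${2:-mlx4_ib}"
    alias_line="alias ib0 ib_ipoib"
    install_line="install ib_ipoib modprobe ${hca_module} && /sbin/modprobe --ignore-install ib_ipoib"
    for line in "$alias_line" "$install_line"; do
        # -x whole-line match, -F fixed string, -q quiet
        grep -qxF -- "$line" "$conf" 2>/dev/null || printf '%s\n' "$line" >> "$conf"
    done
}

# On a real node, as root, the full workaround would then be:
#   chkconfig openibd off
#   add_ipoib_rules
```

Trying it against a scratch file first (add_ipoib_rules /tmp/modprobe.test) 
lets you inspect the result before touching the real config.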

The downside might be that you will not get a clean unload of either the 
lustre or the ofed modules on shutdown/reboot.

If you run different hardware than we do, you might have to replace the 
mlx4_ib module with whatever your hardware needs.

(wasted two days on this, sometimes I make really good use of taxpayers' 
money...)

r.




