[Lustre-discuss] Problem with LNET and openibd on Lustre 1.8.4 while rebooting
Roy Dragseth
roy.dragseth at uit.no
Wed Sep 22 05:26:46 PDT 2010
(couldn't decide between top-posting and bottom-posting here, so I deleted the
whole original message)
We have just upgraded our Rocks cluster to the CentOS 5.5 rpms, which include
a complete OFED stack (v1.4.2?), so we decided to just ditch our own
self-compiled version of OFED 1.4.1.
We then ran into the same problem with openibd hanging on shutdown. After a
futile attempt to inject a lustre-unload-modules service between netfs and
openib to run lustre_rmmod, I tried to hack modprobe.conf to eject the lustre
modules by inserting this:
remove rdma_cm /usr/sbin/lustre_rmmod && /sbin/modprobe -r --ignore-remove rdma_cm
This didn't work either, because the openibd service script uses rmmod instead
of modprobe -r (aargghh).
So, the solution that seems to work is to disable openibd (chkconfig openibd
off) and let the network initialization take care of loading the right modules
by putting this into modprobe.conf:
alias ib0 ib_ipoib
install ib_ipoib modprobe mlx4_ib && /sbin/modprobe --ignore-install ib_ipoib
Then network startup will load the right ib modules and the netfs service will
automatically load the lustre modules when mounting the lustre partitions.
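For anyone who wants to script this across nodes, here is a rough sketch of adding the two lines above without duplicating them on a second run. The function name, the default path /etc/modprobe.conf and the optional driver argument are my own assumptions for illustration; only the two config lines themselves come from this post.

```shell
#!/bin/sh
# Sketch only: append the two modprobe.conf lines from this post, once each.
# add_ib_modprobe_lines is a hypothetical helper name, not an existing tool.

add_ib_modprobe_lines() {
    conf="${1:-/etc/modprobe.conf}"   # file to edit (default path is an assumption)
    driver="${2:-mlx4_ib}"            # HCA driver module; mlx4_ib is what we use

    alias_line="alias ib0 ib_ipoib"
    install_line="install ib_ipoib modprobe ${driver} && /sbin/modprobe --ignore-install ib_ipoib"

    # Make sure the file exists so grep does not complain.
    touch "$conf" || return 1

    # -x: match whole line, -F: fixed string, -q: quiet.
    # Append each line only if it is not already present.
    grep -qxF -e "$alias_line" "$conf" || printf '%s\n' "$alias_line" >> "$conf"
    grep -qxF -e "$install_line" "$conf" || printf '%s\n' "$install_line" >> "$conf"
}
```

You would run chkconfig openibd off first, as described above, and then call add_ib_modprobe_lines as root. Passing a different module name as the second argument swaps out mlx4_ib in the install line.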
The downside might be that you will not get a clean unload of either the
lustre or the ofed modules on shutdown/reboot.
If you run different hardware than we do, you might have to replace the
mlx4_ib module with whatever driver you need.
(wasted two days on this, sometimes I make really good use of taxpayers'
money...)
r.