[Lustre-discuss] Lustre over o2ib issue

Diego Moreno Diego.Moreno-Lazaro at bull.net
Wed Mar 23 03:16:15 PDT 2011


I don't think we have any other parameters; I checked and found none.
What puzzles me is why "lctl ping" fails from a client to the second
interface on a server, while "lctl ping" works in the opposite direction
once the first connection is established. There are no Lustre routers in
between.

Could it be an OFED bug? That would be strange, since standard ping works
without problems.
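
For anyone who wants to reproduce the check, here is a minimal sketch of
how the pings can be scripted per host and per LNet network. The function
name, the DRY_RUN switch and the sample addresses are only illustrative,
not something taken from our setup; by default it just prints the commands.

```shell
#!/bin/sh
# Sketch only: print (or run) "lctl ping" for every host on every network.
# DRY_RUN=1 (the default) prints the commands instead of running them.
ping_nids() {
    hosts=$1    # space-separated IPoIB addresses (illustrative)
    nets=$2     # space-separated LNet network names
    for h in $hosts; do
        for n in $nets; do
            nid="${h}@${n}"
            if [ "${DRY_RUN:-1}" = "1" ]; then
                echo "lctl ping $nid"
            else
                lctl ping "$nid" || echo "FAILED: $nid" >&2
            fi
        done
    done
}

# Example: the second interface of two servers.
ping_nids "10.50.1.7 10.50.1.8" "o2ib1"
```

Running it with DRY_RUN=0 on a client and then on a server should show
the asymmetry described above, if it is reproducible.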


[root@berlin4 ~]# cat /etc/modprobe.d/lustre.conf
install mdc /sbin/modprobe lquota >/dev/null 2>&1; /sbin/modprobe --ignore-install mdc
#install fld /sbin/modprobe ptlrpc
#install fid /sbin/modprobe ptlrpc
#install mdc /sbin/modprobe ptlrpc
#install lustre /sbin/modprobe ptlrpc
install fld /sbin/modprobe ptlrpc ; /sbin/modprobe --ignore-install fld
install lnet \
         PORTKEEPER=/sbin/keep_port_988 ;\
         [ -x $PORTKEEPER ] && killall -w `basename $PORTKEEPER` ;\
         LNET_NETWORKS_LIST=""; \
         if /sbin/lsmod|grep -qE "^elan"; then \
                     LNET_NETWORKS_LIST="elan0"; \
         fi; \
         if /sbin/lsmod|grep -qE "^ib_mthca|^mlx4_ib"; then \
             LNET_NETWORKS_LIST="o2ib(ib0),o2ib1(ib1)"; \
         fi; \
         if [ -z "${LNET_NETWORKS_LIST}" ]; then \
             LNET_OPTIONS="networks=tcp0(eth0)"; \
         else \
             LNET_OPTIONS="networks=${LNET_NETWORKS_LIST},tcp0(eth0)"; \
         fi; \
         if [ -e /etc/lustre/routers.conf ]; then . /etc/lustre/routers.conf ; fi;\
         if [ -e /etc/lustre/multirail.conf ]; then . /etc/lustre/multirail.conf ; fi;\
         /sbin/modprobe --ignore-install lnet $LNET_OPTIONS $LNET_ROUTER_OPTIONS $LNET_MULTIRAIL_OPTIONS
remove libcfs \
         PORTKEEPER=/sbin/keep_port_988 ;\
         attempt=0;\
         while [ 1 ];\
                 do\
                 rmmod `lsmod | grep libcfs | awk '{ print $4}' | tr ',' ' '` >/dev/null 2>&1;\
                 [ $? == 0 ] && break;\
                 attempt=`expr $attempt + 1`;\
                 [ $attempt -gt 4 ] && break;\
                 done;\
         modprobe -r --ignore-remove libcfs;\
         modprobe -r ldiskfs >/dev/null 2>&1;\
         if [ $? == 0 ] && [ $attempt -le 4 ] && [ -x $PORTKEEPER ]; then \
                 $PORTKEEPER > /dev/null ;\
         fi

options lpfc lpfc_sg_seg_cnt=256
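
To see what the "install lnet" rule ends up handing to modprobe without
loading anything, the option-building part can be replayed as a small
function. This is only a sketch of the logic above: the function name and
its two arguments are made up for illustration, the elan and routers.conf
branches are left out, and the router/multirail variables are cleared
first (the real rule inherits them from the environment).

```shell
#!/bin/sh
# Sketch: replay the option-building logic of the "install lnet" rule and
# print the resulting modprobe command instead of running it.
lnet_cmdline() {
    ib_loaded=$1    # "yes" if ib_mthca or mlx4_ib would show up in lsmod
    conf_file=$2    # optional file to source, e.g. a multirail.conf
                    # (give a path with a slash so "." does not search PATH)
    (
        LNET_NETWORKS_LIST=""
        LNET_ROUTER_OPTIONS=""
        LNET_MULTIRAIL_OPTIONS=""
        if [ "$ib_loaded" = "yes" ]; then
            LNET_NETWORKS_LIST="o2ib(ib0),o2ib1(ib1)"
        fi
        if [ -z "$LNET_NETWORKS_LIST" ]; then
            LNET_OPTIONS="networks=tcp0(eth0)"
        else
            LNET_OPTIONS="networks=${LNET_NETWORKS_LIST},tcp0(eth0)"
        fi
        if [ -n "$conf_file" ] && [ -e "$conf_file" ]; then
            . "$conf_file"
        fi
        # Unquoted on purpose, as in the original rule.
        echo modprobe --ignore-install lnet ${LNET_OPTIONS:-} \
            ${LNET_ROUTER_OPTIONS:-} ${LNET_MULTIRAIL_OPTIONS:-}
    )
}

lnet_cmdline yes ""
# prints: modprobe --ignore-install lnet networks=o2ib(ib0),o2ib1(ib1),tcp0(eth0)
```

Note that if the sourced file unsets LNET_OPTIONS, the "networks=" part
disappears and only the options set by that file remain on the command
line.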



[root@berlin4 ~]# cat /etc/lustre/multirail.conf

#!/bin/sh

unset LNET_OPTIONS
unset LNET_ROUTER_OPTIONS
unset LNET_MULTIRAIL_OPTIONS

#export LNET_MULTIRAIL_OPTIONS="networks=o2ib0(ib0),o2ib1(ib1)"

export LNET_ROUTER_OPTIONS="ip2nets=\"o2ib0(ib0) 10.50.0.[7-10] ; o2ib1(ib1) 10.50.1.[7-10] ; o2ib0(ib0) 10.50.*.* ; o2ib1(ib0) 10.50.*.* \""
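
To check by hand which of these ip2nets rules a given address falls
under, a small matcher like the one below can help. It is a simplified
illustration only: it handles "*", plain numbers and one [lo-hi] range
per field, and it is not LNet's own parser. The function name is made up.

```shell
#!/bin/sh
# Sketch: does a dotted-quad IP match an ip2nets-style address pattern?
# Supports "*", plain numbers and a single [lo-hi] range per field.
ip_matches() {
    pattern=$1
    ip=$2
    old_ifs=$IFS
    IFS=.
    set -f                   # keep "*" and "[..]" from being globbed
    set -- $pattern $ip      # 4 pattern fields followed by 4 IP fields
    set +f
    IFS=$old_ifs
    [ $# -eq 8 ] || return 1
    for i in 1 2 3 4; do
        eval "p=\${$i}; v=\${$(($i + 4))}"
        case $p in
            '*') ;;
            '['*-*']')
                range=${p#?}; range=${range%?}    # strip the brackets
                lo=${range%-*}; hi=${range#*-}
                [ "$v" -ge "$lo" ] && [ "$v" -le "$hi" ] || return 1 ;;
            *) [ "$p" = "$v" ] || return 1 ;;
        esac
    done
    return 0
}

ip_matches "10.50.0.[7-10]" 10.50.0.8 && echo "server rule matches"
ip_matches "10.50.*.*" 10.50.0.8 && echo "catch-all rule matches too"
```

With our rules, a server address such as 10.50.0.8 matches both its own
rule and the 10.50.*.* rules. How LNet resolves such overlaps is best
checked against the Lustre manual rather than assumed from this sketch.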



On 23/03/2011 10:55, Liang Zhen wrote:
> Hi Diego,
>
> Do you have any other module parameters for lnet and lnd?
>
> Regards
> Liang
>
>
> On Mar 22, 2011, at 9:26 PM, Diego Moreno wrote:
>
>> Hi,
>>
>> We are having this problem right now with our Lustre 2.0. We tried the
>> proposed solutions, but none of them worked.
>>
>> We have 2 QDR IB cards on 4 servers and we have to do "lctl ping" from
>> each server to every client if we want clients to connect to servers. We
>> don't have ib_mthca modules loaded because we don't have DDR cards and
>> we configured ip2nets with no result.
>>
>> Our ip2nets configuration ([7-10] interfaces are in servers, the others
>> are in clients):
>> o2ib0(ib0) 10.50.0.[7-10] ; o2ib1(ib1) 10.50.1.[7-10] ; o2ib0(ib0)
>> 10.50.*.* ; o2ib1(ib0) 10.50.*.*
>>
>> So the only way of having clients connected to servers is doing
>> something like this on every server:
>>
>> for i in $CLIENT_IB_LIST ; do
>> lctl ping $i@o2ib0
>> lctl ping $i@o2ib1
>> done
>>
>> Before "lctl ping" we get messages like this one:
>>
>> Lustre: 50389:0:(lib-move.c:1028:lnet_post_send_locked()) Dropping
>> message for 12345-10.50.1.7@o2ib1: peer not alive
>>
>> After "lctl ping" everything works right.
>>
>> Maybe I'm missing something, or this is a known bug in Lustre 2.0...
>>
>>
>> On 16/03/2011 22:13, Andreas Dilger wrote:
>>> On 2011-03-16, at 3:04 PM, Mike Hanby wrote:
>>>> Thanks, I forgot to include the card info:
>>>>
>>>> The servers each have a single IB card: dual port MT26528 QDR
>>>> o2ib0(ib0) on each server is attached to the QLogic switch (with three attached M3601Q switches, 48 attached blades)
>>>> o2ib1(ib1) on each server is attached to a stack of two M3601Q switches with 24 attached blades
>>>>
>>>> The blades connected to o2ib0 each have an MT26428 QDR IB card
>>>> The blades connected to o2ib1 each have an MT25418 DDR IB card
>>>
>>> You may also want to check out the ip2nets option for specifying the Lustre networks.  It is made to handle configuration issues like this where the interface name is not constant across client/server nodes.
>>>
>>>>
>>>> -----Original Message-----
>>>> From: lustre-discuss-bounces at lists.lustre.org [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of Nirmal Seenu
>>>> Sent: Wednesday, March 16, 2011 2:10 PM
>>>> To: lustre-discuss at lists.lustre.org
>>>> Subject: Re: [Lustre-discuss] Lustre over o2ib issue
>>>>
>>>> If you are using DDR and QDR, or any 2 different cards, in the same machine, there is no guarantee that the same IB cards get assigned to ib0 and ib1.
>>>>
>>>> To fix that problem you need to comment out the following 3 lines in /etc/init.d/openibd:
>>>>
>>>>      #for i in `grep "^driver: " /etc/sysconfig/hwconf | sed -e 's/driver: //' | grep -w "ib_mthca\\\|ib_ipath\\\|mlx4_core\\\|cxgb3\\\|iw_nes"`; do
>>>>      #    load_modules $i
>>>>      #done
>>>>
>>>> and include the following lines instead (we wanted the DDR card to be ib0 and the QDR card to be ib1):
>>>>      load_modules ib_mthca
>>>>      /bin/sleep 10
>>>>      load_modules mlx4_core
>>>>
>>>> and you will need to restart openibd once again (we included it in rc.local) to make sure that the same IB cards are assigned to the devices ib0 and ib1.
>>>>
>>>> Nirmal
>>>> _______________________________________________
>>>> Lustre-discuss mailing list
>>>> Lustre-discuss at lists.lustre.org
>>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>>
>>>
>>> Cheers, Andreas
>>> --
>>> Andreas Dilger
>>> Principal Engineer
>>> Whamcloud, Inc.
>>>
>>>
>>>
>>>
>>>
>
>
>

-- 
Diego Moreno
Bull S.A.S
1, rue de Provence
B.P. 208
38432 ECHIROLLES CEDEX
FRANCE
Phone: +33 (0) 4 76 29 71 86 (229-7186)
http://www.bull-world.com/


