[Lustre-discuss] New lustre 1.8.5 over IB problem

Gary Molenkamp gary at sharcnet.ca
Mon Dec 13 11:44:08 PST 2010


Colin Faber wrote:
> 
> 
> On 12/13/2010 11:54 AM, Gary Molenkamp wrote:
>> I'm attempting to deploy a new lustre filesystem using lustre 1.8.5, but
>> this is my first stab at incorporating an IB network.  I've deployed
>> several over tcp using 1.8.4 without issue, so I'm not sure if there is
>> an IB configuration or a 1.8.5 issue here. Any assistance would be
>> appreciated.
>>
>> This new cluster has two parallel networks:
>>     gige:  10.27.5.0/23
>>     ib  :  10.27.8.0/23
>>
>> On the lfs servers and clients, lnet is configured as:
>>     options lnet networks=o2ib0(ib0),tcp0(ib0)
>                                           ^^^^^
> Why are you assigning two different network types to the same physical
> device?

My assumption was that this told lnet when to use IPoIB vs. native IB,
but from your question, I take it that is not the case. :)

I retested with just
   options lnet networks=o2ib0(ib0)
and the error conditions below still occur.
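
For reference, a minimal sketch of an lnet line that keeps each network
type on its own device. It assumes the gige network sits on an interface
named eth0 and that the option lives in /etc/modprobe.conf; both are
hypothetical, since neither is stated in this thread.

```
# /etc/modprobe.conf fragment (file location assumed)
# o2ib0 uses native IB on ib0; tcp0 uses the gige interface (eth0 assumed)
options lnet networks=o2ib0(ib0),tcp0(eth0)
```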


>> The IB network is routable to 10/8 and clients mount other lustre
>> filesystems using 1.8.4 over tcp.
>>
>> On the MDS (with an intended failover to a secondary) the mgs,mdt
>> filesystem is created with:
>>
>>   mkfs.lustre --fsname lfs --mdt --mgs \
>>     --mkfsoptions='-i 1024 -I 512' \
>>     --failnode=10.27.9.133@o2ib0 --failnode=10.27.9.132@o2ib0  \
>>     --mountfsoptions=iopen_nopriv,user_xattr,errors=remount-ro,acl \
>>     /dev/sda
>>
>> However, this mount then fails with:
>>
>> mount.lustre: mount /dev/sda at /data/mds failed: Cannot assign
>> requested address
>>
>> An lctl shows the proper nids:
>>   10.27.9.133@o2ib
>>   10.27.9.133@tcp
>>
>> Dmesg shows a parsing error with the o2ib0 nid:
>>
>> LustreError: 159-d: Can't parse NID 'failover.node=10.27.9.133@o2ib0'
>> Lustre: Denying initial registration attempt from nid 10.27.9.133@o2ib,
>> specified as failover
>> LustreError: 9571:0:(obd_mount.c:1097:server_start_targets()) Required
>> registration failed for lfs-MDT0000: -99
>>
>> Am I specifying the failover incorrectly?  What should it be when using
>> o2ib as the primary interconnect?  If I remove the failover parameters
>> using tunefs.lustre the mount succeeds,  but clients cannot connect to
>> the mdt.
>>
>>
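
One possible reading of the "Denying initial registration attempt ...
specified as failover" line above is that the node's own NID appears in
its --failnode list. That is an assumption, not something confirmed in
this thread. A minimal sketch of a check for it follows; the helper names
normalize_nid and nid_is_local are hypothetical, and the only Lustre
command relied on is "lctl list_nids".

```shell
#!/bin/sh
# Sketch: detect whether a failover NID refers to the local node.

# lctl prints network number 0 without the digit ("o2ib" rather than
# "o2ib0"), so drop a trailing 0 that directly follows the type name.
normalize_nid() {
    printf '%s\n' "$1" | sed -E 's/@([a-z0-9]*[a-z])0$/@\1/'
}

# Succeeds (exit 0) if failover NID $1 matches any NID in the
# whitespace-separated list $2, e.g. the output of "lctl list_nids".
nid_is_local() {
    want=$(normalize_nid "$1")
    for nid in $2; do
        [ "$(normalize_nid "$nid")" = "$want" ] && return 0
    done
    return 1
}
```

On the MDS this might be used as:
   nid_is_local 10.27.9.133@o2ib0 "$(lctl list_nids)" && echo "failnode names this host"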


-- 
Gary Molenkamp			SHARCNET
Systems Administrator		University of Western Ontario
Compute/Calcul Canada		http://www.computecanada.org
gary at sharcnet.ca		http://www.sharcnet.ca
(519) 661-2111 x88429		(519) 661-4000


