[Lustre-discuss] Trying to mount lustre on a client when one or more OST is disabled

Kevin Van Maren <kevin.van.maren@oracle.com>
Tue Dec 14 14:12:34 PST 2010


The clients (and servers) get the list of NIDs for each MDT/OST device 
from the MGS at mount time.
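
You can check what is actually recorded for a target with something 
like the following, run on the server holding the device (/dev/mdtdev 
is a placeholder for your real MDT device):

   tunefs.lustre --print /dev/mdtdev

The parameters it prints should include the failover NID(s) configured 
for that target, if any were set.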

Having the clients fail to connect to 10.10.1.49 is _expected_ when the 
service is failed over
to 10.10.1.48.  However, they should succeed in connecting to 10.10.1.48 
and then you should
no longer get complaints about 10.10.1.49.
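
You can also see which NID the client is actually connected to with 
something like (adjust the device name pattern to match):

   lctl get_param mdc.umt3-MDT0000-mdc-*.import

or by reading the corresponding "import" file under /proc/fs/lustre/mdc/.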

If the clients are not failing over to 10.10.1.48, then you might not 
have the failover NID specified correctly.  Are you sure you properly 
specified the failover parameters during mkfs on the MDT, and did the 
first mount from the correct machine?
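
For reference, that setup would have looked something like this 
(device name is a placeholder, assuming your separate MGS at 
10.10.1.140):

   # run on the primary MDS, 10.10.1.48:
   mkfs.lustre --fsname=umt3 --mdt --mgsnode=10.10.1.140@tcp \
       --failnode=10.10.1.49@tcp /dev/mdtdev
   mount -t lustre /dev/mdtdev /mnt/mdt

The NID of the node that does the first mount is registered as the 
primary, and --failnode supplies the backup NID.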

If the NIDs are wrong, it is possible to correct them using --writeconf.  
See the manual (or search the list archives).
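
Roughly, the usual writeconf sequence is (device names below are 
placeholders; unmount all clients and targets first):

   tunefs.lustre --writeconf /dev/mdtdev     # on the MDS
   tunefs.lustre --writeconf /dev/ostdev     # on each OSS, for every OST

Then remount the MDT first, then the OSTs, then the clients, so the 
configuration logs are regenerated with the corrected NIDs.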

Kevin


Bob Ball wrote:
> OK, so, we rebooted 10.10.1.49 into a different, non-lustre kernel.  
> Then, to be as certain as I could be that the client did not know about 
> 10.10.1.49, I rebooted it as well.  After it was fully up (with the 
> lustre file system mount entry in /etc/fstab) I unmounted it, then 
> mounted it again as below.  And, the message still came back that it 
> was trying to contact 10.10.1.49 instead of 10.10.1.48 as it should.  
> To repeat, the dmesg is logging:
>
> Lustre: MGC10.10.1.140@tcp: Reactivating import
> Lustre: 10523:0:(obd_mount.c:1786:lustre_check_exclusion()) Excluding 
> umt3-OST0019 (on exclusion list)
> Lustre: 5936:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request 
> x1355139761832543 sent from umt3-MDT0000-mdc-ffff81062c82c400 to NID 
> 10.10.1.49@tcp 0s ago has failed due to network error (5s prior to 
> deadline).
>    req@ffff81060e4ebc00 x1355139761832543/t0 
> o38->umt3-MDT0000_UUID@10.10.1.49@tcp:12/10 lens 368/584 e 0 to 1 dl 
> 1292362202 ref 1 fl Rpc:N/0/0 rc 0/0
> Lustre: Client umt3-client has started
>
> I guess I need to know why in the world this client is still trying to 
> access 10.10.1.49.  Is there something, perhaps, on the MGS machine that 
> is causing this misdirect?  What?  And, most importantly, how do I fix 
> this?
>
> bob
>
> On 12/14/2010 3:05 PM, Bob Ball wrote:
>   
>> Well, you are absolutely right, it is a timeout talking to what it
>> THINKS is the MDT.  The thing is, it is NOT!
>>
>> We were set up for HA for the MDT, with 10.10.1.48 and 10.10.1.49
>> watching and talking to one another.  The RedHat service was
>> problematic, so right now 10.10.1.48 is the MDT, and has /mnt/mdt
>> mounted, and 10.10.1.49 is being used to do backups, and has
>> /mnt/mdt_snapshot mounted.  The actual volume is an iSCSI location.
>>
>> So, somehow, the client node has found and is talking to the wrong
>> host!  Not good.  Scary.  Got to do something about this.....
>>
>> Suggestions appreciated....
>>
>> bob
>>
>> On 12/14/2010 11:57 AM, Andreas Dilger wrote:
>>     
>>> The error message shows a timeout connecting to umt3-MDT0000 and not the OST.  The operation 38 is MDS_CONNECT, AFAIK.
>>>
>>> Cheers, Andreas
>>>
>>> On 2010-12-14, at 9:19, Bob Ball <ball@umich.edu> wrote:
>>>
>>>       
>>>> I am trying to get a lustre client to mount the service, but with one or
>>>> more OSTs disabled.  This does not appear to be working.  Lustre version
>>>> is 1.8.4.
>>>>
>>>>    mount -o localflock,exclude=umt3-OST0019 -t lustre
>>>> 10.10.1.140@tcp0:/umt3 /lustre/umt3
>>>>
>>>> dmesg on this client shows the following during the umount/mount sequence:
>>>>
>>>> Lustre: client ffff810c25c03800 umount complete
>>>> Lustre: Skipped 1 previous similar message
>>>> Lustre: MGC10.10.1.140@tcp: Reactivating import
>>>> Lustre: 450250:0:(obd_mount.c:1786:lustre_check_exclusion()) Excluding
>>>> umt3-OST0019 (on exclusion list)
>>>> Lustre: 450250:0:(obd_mount.c:1786:lustre_check_exclusion()) Skipped 1
>>>> previous similar message
>>>> Lustre: 5942:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
>>>> x1354682302740498 sent from umt3-MDT0000-mdc-ffff810628209000 to NID
>>>> 10.10.1.49@tcp 0s ago has failed due to network error (5s prior to
>>>> deadline).
>>>>     req@ffff810620e66400 x1354682302740498/t0
>>>> o38->umt3-MDT0000_UUID@10.10.1.49@tcp:12/10 lens 368/584 e 0 to 1 dl
>>>> 1292342239 ref 1 fl Rpc:N/0/0 rc 0/0
>>>> Lustre: 5942:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 1
>>>> previous similar message
>>>> Lustre: Client umt3-client has started
>>>>
>>>> When I check after the mount, using "lctl dl", I see the following,
>>>> and it is clear that the OST is active as far as this client is concerned.
>>>>
>>>>    19 UP osc umt3-OST0018-osc-ffff810628209000
>>>> 05b29472-d125-c36e-c023-e0eb76aaf353 5
>>>>    20 UP osc umt3-OST0019-osc-ffff810628209000
>>>> 05b29472-d125-c36e-c023-e0eb76aaf353 5
>>>>    21 UP osc umt3-OST001a-osc-ffff810628209000
>>>> 05b29472-d125-c36e-c023-e0eb76aaf353 5
>>>>
>>>> Two questions here.  The first, obviously, is what is wrong with this
>>>> picture?  Why can't I exclude this OST from activity on this client?  Is
>>>> it because the OSS serving that OST still has the OST active?  If the
>>>> OST were deactivated or otherwise unavailable on the OSS, would the
>>>> client mount then succeed to exclude this OST?  (OK, more than one
>>>> question in the group....)
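>>>>
>>>> (As a possible workaround, I gather I could deactivate the OSC by
>>>> hand after the mount, using the device number shown by "lctl dl"
>>>> above:
>>>>
>>>>    lctl --device 20 deactivate
>>>>
>>>> but I was hoping exclude= would do that for me at mount time.)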
>>>>
>>>> Second group, what is the correct syntax for excluding more than one
>>>> OST?  Is it a comma-separated list of exclusions, or are separate
>>>> excludes required?
>>>>
>>>>    mount -o localflock,exclude=umt3-OST0019,umt3-OST0020 -t lustre
>>>> 10.10.1.140@tcp0:/umt3 /lustre/umt3
>>>>                  or
>>>>    mount -o localflock,exclude=umt3-OST0019,exclude=umt3-OST0020 -t
>>>> lustre 10.10.1.140@tcp0:/umt3 /lustre/umt3
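>>>>                  or possibly a colon-separated list, which is what I
>>>> think the manual shows:
>>>>    mount -o localflock,exclude=umt3-OST0019:umt3-OST0020 -t lustre
>>>> 10.10.1.140@tcp0:/umt3 /lustre/umt3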
>>>>
>>>> Thanks,
>>>> bob