[Lustre-discuss] Resource "always" unavailable

Adam adam at sharcnet.ca
Tue Jun 22 20:36:53 PDT 2010


Confirmed -- if anyone else runs into this problem with a 1.8.2 client 
using a 1.6.6 server: upgrading the server to 1.8.3 will restore 
connections without requiring a umount or any other changes to the 
client (at least in my case).

Cheers,
Adam

Adam wrote:
> Ah, excellent. I'm upgrading the servers tonight, so if the upgrade is 
> successful, the problem will vanish without any changes on the clients.
>
> Thanks Jason!
>
> Adam
>
> Jason Rappleye wrote:
>   
>> On Jun 22, 2010, at 12:56 PM, Adam wrote:
>>
>>     
>>> Hello, quick question about the manual.
>>>
>>> Under the recovery section, the manual states that a client needs to
>>> invalidate all locks, or flush its saved state, in order to reconnect to
>>> a particular OSC/MDC that has evicted it.
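>>>
>>> (Is forcing a reconnect with something like 'lctl --device <devno> recover',
>>> using the device number shown by 'lctl dl', supposed to work in this case?
>>> I may be misreading the manual there.)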
>>>
>>> We've found that one of our 1.8 clients will frequently get into a state
>>> where many of the OSCs report 'Resource temporarily unavailable' after an
>>> outage on a 1.6 LFS server. The LFS can be accessed again on the client
>>> by remounting it, but the client does not recover on its own.
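>>>
>>> (As an aside, 'lfs check servers' on the client is a quick way to see
>>> which of the OSC/MDC connections are unhealthy.)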
>>>       
>> That sounds familiar. Are you using IB? There's a problem with LNet 
>> peer health detection when used with 1.8 clients and 1.6 servers. See 
>> bug 23076. I haven't tried the patch, but bug 23076 and my comments in 
>> bug 22920 describe the problem we saw at our site.
>>
>> Disabling peer health detection by setting ko2iblnd's peer_timeout 
>> option to zero works around the problem. If you're going to upgrade 
>> the servers to 1.8 at some point, it's ok to leave it at the default 
>> of 180 on the servers and set it to zero on the clients until all of 
>> the 1.6 servers have been upgraded. Then, you can reboot your clients 
>> with the default value of peer_timeout at will, allowing you to take 
>> advantage of the feature without an outage on the servers.
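>>
>> For the record, disabling it amounts to something like the following in
>> the client's modprobe configuration (the exact file varies by distro --
>> /etc/modprobe.conf, or a file under /etc/modprobe.d/):
>>
>>     options ko2iblnd peer_timeout=0
>>
>> The LNet/ko2iblnd modules have to be reloaded (or the node rebooted) for
>> it to take effect; if the parameter is exported you can then verify it
>> with 'cat /sys/module/ko2iblnd/parameters/peer_timeout'.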
>>
>> We tested that approach at our site. It worked for us, and that's how 
>> we'll be rolling it out over the next month.
>>
>> Jason
>>
>> -- 
>> Jason Rappleye
>> System Administrator
>> NASA Advanced Supercomputing Division
>> NASA Ames Research Center
>> Moffett Field, CA 94035
>>
>


-- 
Adam Munro
System Administrator  | SHARCNET | http://www.sharcnet.ca
Compute Canada | http://www.computecanada.org
519-888-4567 x36453




