[Lustre-discuss] Resource "always" unavailable
Adam
adam at sharcnet.ca
Tue Jun 22 20:36:53 PDT 2010
Confirmed -- if anyone else runs into this problem with a 1.8.2 client
and a 1.6.6 server, upgrading the server to 1.8.3 will restore
connections without requiring an umount or any other changes on the
client (at least in my case).
Cheers,
Adam
Adam wrote:
> Ah excellent. I'm upgrading the servers tonight, so if successful the
> problem will vanish without any changes on the clients.
>
> Thanks Jason!
>
> Adam
>
> Jason Rappleye wrote:
>
>> On Jun 22, 2010, at 12:56 PM, Adam wrote:
>>
>>
>>> Hello, quick question about the manual.
>>>
>>> Under the recovery section, the manual states that a client needs to
>>> invalidate all locks, or flush its saved state, in order to reconnect to
>>> a particular osc/mdc that has evicted it.
>>>
>>> We've found that one of our 1.8 clients will frequently get into a state
>>> where many of the oscs report 'Resource temporarily unavailable'
>>> errors after an outage on a 1.6 LFS server. The filesystem can be
>>> accessed again by remounting it on the client, but it does not
>>> auto-recover.
>>>
>> That sounds familiar. Are you using IB? There's a problem with LNet
>> peer health detection when used with 1.8 clients and 1.6 servers. See
>> bug 23076. I haven't tried the patch, but bug 23076 and my comments in
>> bug 22920 describe the problem we saw at our site.
>>
>> Disabling peer health detection by setting ko2iblnd's peer_timeout
>> option to zero works around the problem. If you're going to upgrade
>> the servers to 1.8 at some point, it's ok to leave it at the default
>> of 180 on the servers and set it to zero on the clients until all of
>> the 1.6 servers have been upgraded. Then, you can reboot your clients
>> with the default value of peer_timeout at will, allowing you to take
>> advantage of the feature without an outage on the servers.
>>
>> We tested that approach at our site. It worked for us, and that's how
>> we'll be rolling it out over the next month.
>>
>> Jason
>>
>> --
>> Jason Rappleye
>> System Administrator
>> NASA Advanced Supercomputing Division
>> NASA Ames Research Center
>> Moffett Field, CA 94035
>>
>
--
Adam Munro
System Administrator | SHARCNET | http://www.sharcnet.ca
Compute Canada | http://www.computecanada.org
519-888-4567 x36453