[Lustre-discuss] "fsck failed : SLES11 X86_64

Adam adam at sharcnet.ca
Tue Jun 22 18:38:19 PDT 2010


Ah, excellent. I'm upgrading the servers tonight, so if that goes well the 
problem will vanish without any changes on the clients.

Thanks Jason!

Adam

Jason Rappleye wrote:
>
> On Jun 22, 2010, at 12:56 PM, Adam wrote:
>
>> Hello, quick question about the manual.
>>
>> Under the recovery section the manual states that a client needs to
>> invalidate all locks, or flush its saved state, in order to reconnect to
>> a particular osc/mdc that has evicted it.
>>
>> We've found that one of our 1.8 clients will frequently get into a state
>> where many of the oscs report 'Resource temporarily unavailable' after an
>> outage on a 1.6 LFS server. The LFS can be accessed again on the client by
>> remounting it, but the client does not auto-recover on its own.
>
> That sounds familiar. Are you using IB? There's a problem with LNet 
> peer health detection when used with 1.8 clients and 1.6 servers. See 
> bug 23076. I haven't tried the patch, but bug 23076 and my comments in 
> bug 22920 describe the problem we saw at our site.
>
> Disabling peer health detection by setting ko2iblnd's peer_timeout 
> option to zero works around the problem. If you're going to upgrade 
> the servers to 1.8 at some point, it's ok to leave it at the default 
> of 180 on the servers and set it to zero on the clients until all of 
> the 1.6 servers have been upgraded. Then, you can reboot your clients 
> with the default value of peer_timeout at will, allowing you to take 
> advantage of the feature without an outage on the servers.
>
> We tested that approach at our site. It worked for us, and that's how 
> we'll be rolling it out over the next month.
>
> Jason
>
> -- 
> Jason Rappleye
> System Administrator
> NASA Advanced Supercomputing Division
> NASA Ames Research Center
> Moffett Field, CA 94035
>
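
For anyone else who hits this before upgrading: as far as I can tell, the
client-side workaround Jason describes comes down to a single ko2iblnd module
option. A rough sketch, assuming the modprobe.d convention (the exact file
name and path are a guess; some setups use /etc/modprobe.conf or, on SLES,
/etc/modprobe.conf.local instead):

    # /etc/modprobe.d/ko2iblnd.conf
    # Disable LNet peer health detection on the 1.8 clients while any
    # 1.6 servers remain; revert to the default (180) after the upgrade.
    options ko2iblnd peer_timeout=0

The option only takes effect once the module is reloaded, so in practice that
means unmounting Lustre and reloading the LNet modules, or simply rebooting
the client. If the parameter is exported under sysfs, the running value can
be checked with:

    cat /sys/module/ko2iblnd/parameters/peer_timeout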


-- 
Adam Munro
System Administrator  | SHARCNET | http://www.sharcnet.ca
Compute Canada | http://www.computecanada.org
519-888-4567 x36453




