[Lustre-discuss] "fsck failed : SLES11 X86_64

Tue Jun 22 13:24:00 PDT 2010

On Jun 22, 2010, at 12:56 PM, Adam wrote:

> Hello, quick question about the manual.
>
> Under the recovery section the manual states that a client needs to
> invalidate all locks, or flush its saved state, in order to reconnect
> to a particular osc/mdc that has evicted it.
>
> We've found that one of our 1.8 clients will frequently get into a
> state where many of the oscs report a 'Resource temporarily
> unavailable' state after an outage to a 1.6 LFS server. The LFS can be
> accessed again on the client by remounting the LFS, but it does not
> auto-recover.

That sounds familiar. Are you using IB? There's a problem with LNet  
peer health detection when used with 1.8 clients and 1.6 servers. See  
bug 23076. I haven't tried the patch, but bug 23076 and my comments in  
bug 22920 describe the problem we saw at our site.

Disabling peer health detection by setting ko2iblnd's peer_timeout  
option to zero works around the problem. If you're going to upgrade  
the servers to 1.8 at some point, it's ok to leave it at the default  
of 180 on the servers and set it to zero on the clients until all of  
the 1.6 servers have been upgraded. Then, you can reboot your clients  
with the default value of peer_timeout at will, allowing you to take  
advantage of the feature without an outage on the servers.
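
As a rough sketch, the client-side workaround is a single module options
line. The file it belongs in is an assumption here and depends on your
distribution: it may be /etc/modprobe.conf, /etc/modprobe.conf.local, or a
file under /etc/modprobe.d/.

```
# Clients only: disable LNet peer health detection in the o2ib LND
# while any 1.6 servers remain. (Which modprobe file this goes in is
# an assumption; use whatever your distribution reads.)
options ko2iblnd peer_timeout=0
```

Module options are read when ko2iblnd is loaded, so the change takes effect
the next time the module loads, for example after a reboot. Removing the
line later returns the client to the default of 180.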

We tested that approach at our site. It worked for us, and that's how  
we'll be rolling it out over the next month.

Jason

--
Jason Rappleye
System Administrator
NASA Advanced Supercomputing Division
NASA Ames Research Center
Moffett Field, CA 94035