[Lustre-discuss] OST targets not mountable after disabling/enabling MMP

Edward Walter ewalter at cs.cmu.edu
Mon Aug 9 13:32:12 PDT 2010


Andreas Dilger wrote:
> On 2010-08-09, at 14:11, Edward Walter wrote:
>   
>> We're continuing to test things and seeing weird behavior when we run an ost-survey, though. It looks as though the Lustre client is getting 
>> shuffled back and forth between the OSS server pairs for our OSTs. The 
>> client times out connecting to the primary server, attempts to connect to the failover server (and fails, because the OST is on the primary), and then reconnects to the primary server and finishes the survey. This behavior is not isolated to one particular OST (or client) and does not occur with every survey.
>>
>> and here's the relevant dmesg info:
>>
>> [root at compute-2-7 ~]# dmesg |grep Lustre
>> Lustre: Client data-client has started
>> Lustre: Request x121943 sent from data-OST000b-osc-ffff81041f9b2c00 to 
>> NID 172.16.1.25 at o2ib 100s ago has timed out (limit 100s).
>> Lustre: Skipped 1 previous similar message
>>     
>
> If you have a larger cluster (hundreds of clients) with 1.6.6 you have to increase the Lustre timeout value beyond 100s to cover the worst-case IO (300s is pretty typical at 1000 clients), but a value that high is longer than most cases need.
>
> What you really want is to upgrade to 1.8.x in order to get adaptive timeouts.  This allows the clients and servers to handle varying network and storage latency, instead of relying on a fixed timeout.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Technical Lead
> Oracle Corporation Canada Inc.
>   
Hi Andreas,

Our cluster is fairly modest in size (104 clients, 4 OSS, 12 OSTs, 1 
active MDS). We have plans for upgrading to 1.8.x but those plans now 
include stabilizing our 1.6.6 installation so that we can do a full 
backup before upgrading.

For now, we're doing our testing from 2-3 nodes without any of the other 
nodes mounting Lustre. This configuration was stable and reliable until 
the hard shutdown, and obviously we'd like to get back to that state 
before upgrading. Our timeout on the clients (cat 
/proc/sys/lustre/timeout) is 100s. Shouldn't that be sufficient for 2 
clients? I think something else is going on.
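For reference, here is how we check the value, and how one could raise it temporarily while diagnosing (a sketch against the standard Lustre 1.6 /proc interface; the change does not persist across a reboot or remount):

```shell
# Check the current Lustre RPC timeout (in seconds) on this client
cat /proc/sys/lustre/timeout

# Temporarily raise it, e.g. to 300s, while testing; must be run as
# root and reapplied after a reboot, since /proc settings are volatile
echo 300 > /proc/sys/lustre/timeout
```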

-Ed
