[Lustre-discuss] OST targets not mountable after disabling/enabling MMP

Mon Aug 9 13:13:14 PDT 2010

On 2010-08-09, at 14:11, Edward Walter wrote:
> We're continuing to test things and seeing weird behavior when we run an
> ost-survey though. It looks as though the lustre client is getting
> shuffled back and forth between OSS server pairs for our OSTs. The
> client times out connecting to the primary server, attempts to connect
> to the failover server (and fails because the OST is on the primary) and
> then reconnects to the primary server and finishes the survey. This
> behavior is not isolated to one particular OST (or client) and doesn't
> occur with every survey.
> 
> and here's the relevant dmesg info:
> 
> [root@compute-2-7 ~]# dmesg | grep Lustre
> Lustre: Client data-client has started
> Lustre: Request x121943 sent from data-OST000b-osc-ffff81041f9b2c00 to NID 172.16.1.25@o2ib 100s ago has timed out (limit 100s).
> Lustre: Skipped 1 previous similar message

If you have a larger cluster (hundreds of clients) running 1.6.6, you have to raise the Lustre timeout value beyond 100s to cover worst-case I/O (300s is pretty typical at 1000 clients), but a timeout that long is excessive for most cases.
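
As a minimal sketch of raising the static timeout on 1.6.x: the commands below assume they are run as root and that the filesystem name is "data" (inferred from the data-OST000b target in the dmesg output above). The 300s value is only the typical figure mentioned above, not a tuned recommendation.

```shell
# Show the current static timeout (seconds) on this node.
cat /proc/sys/lustre/timeout

# Temporary change on this node only; it does not survive a remount or reboot.
echo 300 > /proc/sys/lustre/timeout

# Persistent, filesystem-wide change; run this on the MGS node.
# "data" is an assumed fsname, substitute your own.
lctl conf_param data.sys.timeout=300
```

The value should match on clients and servers, which is why the conf_param form on the MGS is usually the safer choice.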

What you really want is to upgrade to 1.8.x in order to get adaptive timeouts. This allows the clients and servers to adjust to varying network and storage latency, instead of relying on a fixed timeout.
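
A rough sketch of checking adaptive timeouts once on 1.8.x, assuming the "timeouts" parameter files and the at_max tunable as described in the 1.8 operations manual. The fsname "data" and the 600s cap are illustrative assumptions, not values from this thread.

```shell
# On a client: show the adaptive timeout estimates kept for each OSC.
lctl get_param -n 'osc.*.timeouts'

# On an OSS: show the estimates for the bulk I/O service.
lctl get_param -n 'ost.*.ost_io.timeouts'

# Optional, on the MGS: cap the adaptive estimate at 600s.
# Setting at_max to 0 turns adaptive timeouts off and falls back to the
# static timeout. "data" is an assumed fsname.
lctl conf_param data.sys.at_max=600
```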

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.