[Lustre-discuss] OST targets not mountable after disabling/enabling MMP

laotsao 老曹 laotsao at gmail.com
Tue Aug 10 05:05:06 PDT 2010


Hi,
Could the timeout be due to your IB network?
It seems there is no harm in just increasing the timeout from 100s to
200s to see whether ost-survey finishes without any errors (a sketch
is below; see also the note on adaptive timeouts at the bottom).
After the power outage, did you check that all FC paths and IB paths
are good?
My 2c
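
A minimal sketch of bumping the timeout for a test run, using the same
/proc path Ed checks below (the value is volatile, reverts on reboot,
and should match on every client and server):

# on each client and each OSS/MDS node
echo 200 > /proc/sys/lustre/timeout
# verify
cat /proc/sys/lustre/timeout

To make it persistent, the MGS can set it filesystem-wide (assuming
your 1.6.6 honors the sys.timeout conf_param for your filesystem
"data", as in the OST names below; check the manual for your release):

# on the MGS
lctl conf_param data.sys.timeout=200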



On 8/9/2010 4:32 PM, Edward Walter wrote:
> Andreas Dilger wrote:
>> On 2010-08-09, at 14:11, Edward Walter wrote:
>>
>>> We're continuing to test things and seeing weird behavior when we run an ost-survey though. It looks as though the lustre client is getting
>>> shuffled back and forth between OSS server pairs for our OSTs. The
>>> client times out connecting to the primary server, attempts to connect to the failover server (and fails because the OST is on the primary) and then reconnects to the primary server and finishes the survey. This behavior is not isolated to one particular OST (or client) and doesn't occur with every survey.
>>>
>>> and here's the relevant dmesg info:
>>>
>>> [root at compute-2-7 ~]# dmesg |grep Lustre
>>> Lustre: Client data-client has started
>>> Lustre: Request x121943 sent from data-OST000b-osc-ffff81041f9b2c00 to
>>> NID 172.16.1.25 at o2ib 100s ago has timed out (limit 100s).
>>> Lustre: Skipped 1 previous similar message
>>>
>> If you have a larger cluster (hundreds of clients) with 1.6.6 you have to increase the lustre timeout value beyond 100s for the worst-case IO (300s is pretty typical at 1000 clients), but this is too long for most cases.
>>
>> What you really want is to upgrade to 1.8.x in order to get adaptive timeouts.  This allows the clients/servers to handle varying network and storage latency, instead of having a fixed timeout.
>>
>> Cheers, Andreas
>> --
>> Andreas Dilger
>> Lustre Technical Lead
>> Oracle Corporation Canada Inc.
>>
> Hi Andreas,
>
> Our cluster is fairly modest in size (104 clients, 4 OSS, 12 OSTs, 1
> active MDS). We have plans for upgrading to 1.8.x but those plans now
> include stabilizing our 1.6.6 installation so that we can do a full
> backup before upgrading.
>
> For now, we're doing our testing from 2-3 nodes without any of the other
> nodes mounting lustre. This configuration was stable and reliable until
> the hard shutdown. Obviously we'd like to get back to where we were
> before upgrading. Our timeout on the clients (cat
> /proc/sys/lustre/timeout) is 100s. Shouldn't this be sufficient for 2
> clients? I think something else is going on.
>
> -Ed
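
PS: on Andreas's adaptive-timeouts point -- in 1.8.x they are enabled
by default, and the tunables sit next to the static timeout in /proc
(names per the 1.8 manual; verify on your build with lctl list_param):

# on clients and servers: ceiling and floor for the adaptive timeout
echo 600 > /proc/sys/lustre/at_max
echo 0 > /proc/sys/lustre/at_min
# how much service-time history feeds the estimate, in seconds
cat /proc/sys/lustre/at_history
# at_max=0 disables adaptive timeouts, falling back to the static value
echo 0 > /proc/sys/lustre/at_max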