[Lustre-discuss] OST targets not mountable after disabling/enabling MMP

Edward Walter ewalter at cs.cmu.edu
Mon Aug 9 11:11:34 PDT 2010


OK, we're making progress. We updated our e2fsprogs to
e2fsprogs-1.41.10.sun2-0redhat. That let us clear the MMP blocks and
mount our OSTs as part of our Lustre volume. :)
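For the archives, the clearing step looks roughly like this -- the device
path below is a hypothetical placeholder, so substitute the block device
backing each OST:

```shell
#!/bin/sh
# Hypothetical device path -- substitute the block device backing the OST.
DEV=/dev/sdb1

# Confirm MMP is among the filesystem features before touching anything:
dumpe2fs -h "$DEV" | grep -i mmp

# Clear the stale MMP block (the step our older tune2fs segfaulted on);
# this needs an MMP-aware e2fsprogs, e.g. 1.41.10.sun2 or later:
tune2fs -f -E clear-mmp "$DEV"
```

Obviously only run this while the OST is unmounted on both servers of the
failover pair.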

We're continuing to test things and are seeing odd behavior when we run
an ost-survey, though. It looks as though the Lustre client is being
shuffled back and forth between the OSS server pairs for our OSTs: the
client times out connecting to the primary server, attempts to connect
to the failover server (and fails, because the OST is mounted on the
primary), and then reconnects to the primary server and finishes the
survey. This behavior isn't isolated to one particular OST (or client)
and doesn't occur on every survey run.

###

Here's an example of the error we see on the client when this occurs:

[root at compute-2-7 ~]# lfs check servers
data-MDT0000-mdc-ffff81041f9b2c00 active.
data-OST0000-osc-ffff81041f9b2c00 active.
data-OST0001-osc-ffff81041f9b2c00 active.
data-OST0002-osc-ffff81041f9b2c00 active.
data-OST0003-osc-ffff81041f9b2c00 active.
data-OST0004-osc-ffff81041f9b2c00 active.
data-OST0005-osc-ffff81041f9b2c00 active.
data-OST0006-osc-ffff81041f9b2c00 active.
data-OST0007-osc-ffff81041f9b2c00 active.
data-OST0008-osc-ffff81041f9b2c00 active.
data-OST0009-osc-ffff81041f9b2c00 active.
data-OST000a-osc-ffff81041f9b2c00 active.
error: check 'data-OST000b-osc-ffff81041f9b2c00': Resource temporarily 
unavailable (11)

###

And here's the relevant dmesg output:

[root at compute-2-7 ~]# dmesg |grep Lustre
Lustre: Client data-client has started
Lustre: Request x121943 sent from data-OST000b-osc-ffff81041f9b2c00 to 
NID 172.16.1.25 at o2ib 100s ago has timed out (limit 100s).
Lustre: Skipped 1 previous similar message
Lustre: data-OST000b-osc-ffff81041f9b2c00: Connection to service 
data-OST000b via nid 172.16.1.25 at o2ib was lost; in progress operations 
using this service will wait for recovery to complete.
Lustre: Skipped 3 previous similar messages
LustreError: 11-0: an error occurred while communicating with 
172.16.1.25 at o2ib. The ost_connect operation failed with -16
LustreError: Skipped 11 previous similar messages
Lustre: Changing connection for data-OST000b-osc-ffff81041f9b2c00 to 
172.16.1.23 at o2ib/172.16.1.23 at o2ib
Lustre: Skipped 11 previous similar messages
Lustre: 4264:0:(import.c:410:import_select_connection()) 
data-OST000b-osc-ffff81041f9b2c00: tried all connections, increasing 
latency to 6s
Lustre: 4264:0:(import.c:410:import_select_connection()) Skipped 4 
previous similar messages
LustreError: 11-0: an error occurred while communicating with 
172.16.1.25 at o2ib. The ost_connect operation failed with -16
LustreError: Skipped 1 previous similar message
Lustre: Changing connection for data-OST000b-osc-ffff81041f9b2c00 to 
172.16.1.23 at o2ib/172.16.1.23 at o2ib
Lustre: Skipped 1 previous similar message
Lustre: 4264:0:(import.c:410:import_select_connection()) 
data-OST000b-osc-ffff81041f9b2c00: tried all connections, increasing 
latency to 11s
Lustre: data-OST000b-osc-ffff81041f9b2c00: Connection restored to 
service data-OST000b using nid 172.16.1.25 at o2ib.
Lustre: Skipped 1 previous similar message
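A quick way to see how often each import is flapping between servers is to
count the "Changing connection" lines per OSC. The sample lines below are
hypothetical stand-ins for real `dmesg | grep Lustre` output:

```shell
# Hypothetical log excerpt; in practice pipe real output: dmesg | grep Lustre
cat <<'EOF' > /tmp/lustre-dmesg.txt
Lustre: Changing connection for data-OST000b-osc-ffff81041f9b2c00 to 172.16.1.23@o2ib/172.16.1.23@o2ib
Lustre: data-OST000b-osc-ffff81041f9b2c00: Connection restored to service data-OST000b using nid 172.16.1.25@o2ib.
Lustre: Changing connection for data-OST000b-osc-ffff81041f9b2c00 to 172.16.1.23@o2ib/172.16.1.23@o2ib
EOF

# Count how often each OSC import switched targets; repeated flaps on one
# OST point at that server pair rather than at the client.
grep 'Changing connection' /tmp/lustre-dmesg.txt \
  | awk '{print $5}' | sort | uniq -c | sort -rn
```

If one OSC dominates the counts across several clients, the problem is
almost certainly on that OSS pair's side.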

###

The ost-survey completes, but it's obvious that something's not right:

[root at compute-2-7 ~]# ost-survey -s 50 /lustre/
/usr/bin/ost-survey: 08/09/10 OST speed survey on /lustre/ from 
172.16.255.223 at o2ib
Number of Active OST devices : 12
Worst Read OST indx: 11 speed: 2.449542
Best Read OST indx: 3 speed: 2.512130
Read Average: 2.480302 +/- 0.018453 MB/s
Worst Write OST indx: 11 speed: 0.209190
Best Write OST indx: 4 speed: 5.595996
Write Average: 4.223409 +/- 2.038925 MB/s
Ost#  Read(MB/s)  Write(MB/s)  Read-time  Write-time
----------------------------------------------------
   0       2.481        5.527     20.152       9.046
   1       2.464        5.484     20.294       9.118
   2       2.492        5.559     20.067       8.994
   3       2.512        4.413     19.903      11.330
   4       2.476        5.596     20.190       8.935
   5       2.485        5.444     20.117       9.184
   6       2.499        5.525     20.005       9.050
   7       2.468        1.387     20.260      36.047
   8       2.494        5.468     20.047       9.144
   9       2.491        5.398     20.071       9.263
  10       2.451        0.671     20.400      74.568
  11       2.450        0.209     20.412     239.017
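To spot the outliers mechanically rather than by eye, the per-OST table can
be filtered with a short awk pass. The excerpt and the "less than half the
average" threshold below are illustrative choices, not part of ost-survey
itself:

```shell
# Hypothetical excerpt of the per-OST table (Ost# Read Write Read-t Write-t):
cat <<'EOF' > /tmp/ost-table.txt
0 2.481 5.527 20.152 9.046
7 2.468 1.387 20.260 36.047
10 2.451 0.671 20.400 74.568
11 2.450 0.209 20.412 239.017
EOF

# Flag any OST whose write speed (column 3) is under half the average:
awk '{ sum += $3; n++; ost[n] = $1; wr[n] = $3 }
     END { avg = sum / n
           for (i = 1; i <= n; i++)
             if (wr[i] < avg / 2)
               printf "OST %s: %.3f MB/s (avg %.3f)\n", ost[i], wr[i], avg }' \
    /tmp/ost-table.txt
```

On the full table above this singles out OSTs 10 and 11, which matches the
connection flapping we see on OST000b.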

###

Sorry for the wall of text here and thanks for the help everyone.

-Ed

laotsao wrote:
> http://downloads.lustre.org/public/tools/e2fsprogs/
>
>
> On 8/9/2010 11:49 AM, laotsao wrote:
>>
>> hi
>> I went through the various Lustre downloads, and it seems that
>> 1.6.7, 1.8, and 1.8.0.1 all have e2fsprogs-1.40.11-sun1, while
>> 1.8.1.1 has 1.41.6.sun1. Hope that version is good for your
>> CentOS 5.2 and kernel version.
>>
>> regards
>>
>> On 8/9/2010 11:22 AM, Ken Hornstein wrote:
>>>> Using 'tune2fs -f -E clear-mmp' causes tune2fs to segfault:
>>> Ewww .... well, not sure what to tell you about that.
>>>
>>>> Did you use a newer version of tune2fs/e2fsprogs? Our current version
>>>> is e2fsprogs-1.40.11.sun1-0redhat. Do you know if it's safe to rev up
>>>> versions on e2fsprogs while running an older lustre kernel revision 
>>>> (1.6.6)?
>>> I am using e2fsprogs-1.41.6.sun1-0suse ... and I know that is old.
>>>
>>> I was going to say that I don't know if revving up e2fsprogs is 
>>> okay, but
>>> I see that Andreas already answered that one. I can't be 100% sure that
>>> upgrading e2fsprogs _will_ solve your problem, but I think it's worth
>>> a shot.
>>>
>>> --Ken
>>> _______________________________________________
>>> Lustre-discuss mailing list
>>> Lustre-discuss at lists.lustre.org
>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
