[Lustre-discuss] MDT crash: ll_mdt at 100%

Thomas Roth t.roth at gsi.de
Tue Jul 7 04:42:46 PDT 2009


Hi,

Mag Gam wrote:
> Exactly the symptoms I had. How long were you running this for?  Also,
> how easy is it for you to reproduce this error?

The MDS-going-on-strike incidents have happened only twice since we
upgraded the cluster from Lustre 1.6.5.1 to 1.6.7.1 at the end of April.
Since last week everything seems to work fine again. The difference: I
had to move data off of one OST whose RAID announces hardware errors. To
do that, I ran "lfs find --obd <OST> /lustre/<dir>", at first massively
parallel, then with 6 processes, and for the last few directories only
step-by-step. Of course I'm bewildered that such a well defined
operation should be able to break the MDT's operation, while the things
our users do in their unlimited ingenuity did not.
On the other hand, there is that issue with switching on quota. As I
have reported earlier, "lfs quotacheck -ug" also leads to enormous loads
on the MDT, finally stopping everything.
Maybe it's more of a hardware issue.
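
For what it's worth, the throttled "lfs find" runs described above could be
scripted roughly as follows. This is only a sketch, not the script that was
actually used: the function names, the max_workers parameter and the
replaceable runner are assumptions made for illustration, and only the
"lfs find --obd <OST> <dir>" command itself comes from this mail. Whether a
lower worker count really keeps the MDT responsive is untested here.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def lfs_find_cmd(ost_uuid, directory):
    """Build the 'lfs find --obd <OST> <dir>' command as an argument list."""
    return ["lfs", "find", "--obd", ost_uuid, directory]


def run_lfs_find(ost_uuid, directory):
    """Run one search and return the file paths it prints, one per line."""
    result = subprocess.run(lfs_find_cmd(ost_uuid, directory),
                            capture_output=True, text=True, check=True)
    return result.stdout.splitlines()


def find_files_on_ost(ost_uuid, directories, max_workers=1, runner=run_lfs_find):
    """Search each directory with at most max_workers searches in flight.

    max_workers=1 is the step-by-step mode; 6 matches the middle setting
    mentioned in the mail. 'runner' is replaceable (an assumption for
    illustration) so the logic can be tried without a Lustre client.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() keeps the results in the same order as 'directories'.
        per_dir = pool.map(lambda d: runner(ost_uuid, d), directories)
        return [path for paths in per_dir for path in paths]
```

Calling find_files_on_ost("<OST>", ["/lustre/<dir>"], max_workers=6) would
then correspond to the six-process run, and the default of 1 to the
step-by-step one. Moving the files off the OST afterwards is not covered.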

> 
> This should clear up your doubts. But you said you are running at
> 1.6.7.1 which is bizarre because I was running at 1.6.7. Maybe this
> could be a different bug?
> 
> http://lists.lustre.org/pipermail/lustre-discuss/2009-April/010167.html

Well, that was the bug causing data corruption on the MDT. There were
patches for 1.6.7.0 and then the patched release 1.6.7.1 to correct that.
But now we experienced this stop of operation of the MDT. After curing
it in the way I described earlier, there were no data corruptions or
losses that could be attributed to this outage.


Regards,
Thomas


> 
> On Fri, Jul 3, 2009 at 10:44 AM, Thomas Roth<t.roth at gsi.de> wrote:
>>
>> Mag Gam wrote:
>>> http://lists.lustre.org/pipermail/lustre-discuss/2009-March/009928.html
>>>
>>> Look familiar?
>>>
>> Yes, I've read the thread - that's why I addressed you in addition to
>> the list  ;-)
>>
>> But I was not aware that this is supposed to be a bug in this particular
>> Lustre version.
>>
>> Right now the MDT stops cooperating without any ll_mdt processes going
>> up. Load is 0.5 or so on the MDT but no connections possible.
>> In the log I only noted some "still busy with 2 active RPCs" messages.
>> I just hope I don't have to writeconf the MDT again - I learned on this
>> list that this would be necessary if these RPCs are never finished.
>>
>> Regards,
>> Thomas
>>
>>
>>> On Fri, Jul 3, 2009 at 7:32 AM, Thomas Roth<t.roth at gsi.de> wrote:
>>>> Hi,
>>>>
>>>> I didn't take notice of a discussion of such problems with 1.6.7.1. Do
>>>> you know something more specific about it? We won't want to downgrade
>>>> since our users are happier after the last upgrade (1.6.5 -> 1.6.7). And
>>>> we don't have the 1.6.7.2 (Debian-) packages yet. But I could try to
>>>> speed that up and force an upgrade if you told me that 1.6.7.1 wasn't
>>>> really reliable.
>>>>
>>>> For the moment the problem seems to have been fixed by shutdown,
>>>> fs-check and writeconf of all servers.
>>>> However, I don't want to do that every other week ...
>>>>
>>>> Thanks a lot for your help,
>>>> Thomas
>>>>
>>>> Mag Gam wrote:
>>>>> Hi Tom:
>>>>>
>>>>> There was a known issue with 1.6.7.1. What I did was downgrade to
>>>>> 1.6.6 and everything worked well. Or you can try upgrading, but there
>>>>> is something def wrong with that version...
>>>>>
>>>>> If you like, I can help you offline. I should be free this weekend (I
>>>>> have a long weekend)
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Jul 2, 2009 at 8:22 AM, Thomas Roth<t.roth at gsi.de> wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> our MDT gets stuck and unresponsive with very high loads (Lustre
>>>>>> 1.6.7.1, Kernel 2.6.22, 8 Core, 32GB RAM). The only thing calling
>>>>>> attention is one ll_mdt_?? process running with 100% cpu. Nothing unusual
>>>>>> happening on the cluster before that.
>>>>>> After reboot as well as after moving the service to another server, this
>>>>>> behavior reappears. The initial stages - mounting MGS, mounting MDT,
>>>>>> recovery - work fine, but then the load goes up and the system is
>>>>>> rendered unusable.
>>>>>>
>>>>>> Atm, I don't know what to do, except shutting down all servers and
>>>>>> possibly doing a writeconf everywhere.
>>>>>>
>>>>>> I see that a similar problem was reported by Mag in March this year, but
>>>>>> no clues or solutions appeared.
>>>>>> Any ideas?
>>>>>>
>>>>>> Yours,
>>>>>> Thomas
>>>>>>
>>>> --
>>>> --------------------------------------------------------------------
>>>> Thomas Roth
>>>> Department: Informationstechnologie
>>>> Location: SB3 1.262
>>>> Phone: +49-6159-71 1453  Fax: +49-6159-71 2986
>>>>
>>>> GSI Helmholtzzentrum für Schwerionenforschung GmbH
>>>> Planckstraße 1
>>>> D-64291 Darmstadt
>>>> www.gsi.de
>>>>
>>>> Gesellschaft mit beschränkter Haftung
>>>> Sitz der Gesellschaft: Darmstadt
>>>> Handelsregister: Amtsgericht Darmstadt, HRB 1528
>>>>
>>>> Geschäftsführer: Professor Dr. Horst Stöcker
>>>>
>>>> Vorsitzende des Aufsichtsrates: Dr. Beatrix Vierkorn-Rudolph,
>>>> Stellvertreter: Ministerialdirigent Dr. Rolf Bernhardt
>>>>



