[Lustre-discuss] MDT crash: ll_mdt at 100%

Mag Gam magawake at gmail.com
Tue Jul 7 04:58:54 PDT 2009


So, are you all good now?

Thanks for the explanation, BTW!



On Tue, Jul 7, 2009 at 7:42 AM, Thomas Roth<t.roth at gsi.de> wrote:
> Hi,
>
> Mag Gam wrote:
>> Exactly the symptoms I had. How long were you running this for?  Also,
>> how easy is it for you to reproduce this error?
>
> the MDS-going-on-strike instances have happened only twice since we
> upgraded the cluster from Lustre 1.6.5.1 to 1.6.7.1 at the end of April.
> Since last week everything seems to work fine again. The difference: I
> had to move data off one OST whose RAID controller reports hardware errors. To
> do that, I ran "lfs find --obd <OST> /lustre/<dir>", at first massively
> parallel, then with 6 processes, and for the last few directories only
> step by step. Of course I'm bewildered that such a well-defined
> operation should be able to break the MDT's operation, while the things
> our users do in their unlimited ingenuity did not.
> On the other hand, there is that issue with switching on quota. As I
> have reported earlier, "lfs quotacheck -ug" also leads to enormous loads
> on the MDT, finally stopping everything.
> Maybe it's more of a hardware issue.
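
[Editor's note: the step-by-step variant of the scan described above might look roughly like the sketch below. The OST UUID, directory paths, and output file are placeholders for illustration, not the actual values from this cluster.]

```shell
# Sketch of the serialized scan described above: list files with objects
# on a suspect OST one directory at a time, so that only one namespace
# walk hits the MDS at any moment. OST name and paths are placeholders.
OST="lustre-OST0005_UUID"      # hypothetical UUID of the failing OST
OUT="/tmp/files_on_ost.txt"
: > "$OUT"

if command -v lfs >/dev/null 2>&1; then
    for dir in /lustre/*/; do
        # one lfs find at a time, instead of many in parallel
        lfs find --obd "$OST" "$dir" >> "$OUT" \
            || echo "scan of $dir failed" >&2
    done
else
    echo "lfs not available on this host" >&2
fi
```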
>
>>
>> This should clear up your doubts. But you said you are running
>> 1.6.7.1, which is bizarre because I was running 1.6.7. Maybe this
>> could be a different bug?
>>
>> http://lists.lustre.org/pipermail/lustre-discuss/2009-April/010167.html
>
> Well, that was the bug causing data corruption on the MDT. There were
> patches for 1.6.7.0 and then the patched release 1.6.7.1 to correct that.
> But now we experienced this stop of operation of the MDT. After curing
> it in the way I described earlier, there were no data corruptions or
> losses that could be attributed to this outage.
>
>
> Regards,
> Thomas
>
>
>>
>> On Fri, Jul 3, 2009 at 10:44 AM, Thomas Roth<t.roth at gsi.de> wrote:
>>>
>>> Mag Gam wrote:
>>>> http://lists.lustre.org/pipermail/lustre-discuss/2009-March/009928.html
>>>>
>>>> Look familiar?
>>>>
>>> Yes, I've read the thread - that's why I addressed you in addition to
>>> the list  ;-)
>>>
>>> But I was not aware that this is supposed to be a bug in this particular
>>> Lustre version.
>>>
>>> Right now the MDT stops cooperating without any ll_mdt processes going
>>> up to 100%. The load on the MDT is around 0.5, but no connections are possible.
>>> In the log I only noted some "still busy with 2 active RPCs" messages.
>>> I just hope I don't have to writeconf the MDT again - I learned on this
>>> list that this would be necessary if these RPCs are never finished.
>>>
>>> Regards,
>>> Thomas
>>>
>>>
>>>> On Fri, Jul 3, 2009 at 7:32 AM, Thomas Roth<t.roth at gsi.de> wrote:
>>>>> Hi,
>>>>>
>>>>> I haven't noticed any discussion of such problems with 1.6.7.1. Do
>>>>> you know something more specific about it? We won't want to downgrade
>>>>> since our users are happier after the last upgrade (1.6.5 -> 1.6.7). And
>>>>> we don't have the 1.6.7.2 (Debian-) packages yet. But I could try to
>>>>> speed that up and force an upgrade if you told me that 1.6.7.1 wasn't
>>>>> really reliable.
>>>>>
>>>>> For the moment the problem seems to have been fixed by shutdown,
>>>>> fs-check and writeconf of all servers.
>>>>> However, I don't want to do that every other week ...
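
[Editor's note: for reference, the shutdown / fs-check / writeconf cycle mentioned above is, on Lustre 1.6, roughly the sequence sketched below. Device paths and mount points are placeholders, and clients and all OSTs must be stopped before running it.]

```shell
# Rough sketch of the shutdown / fsck / writeconf cycle described above
# (Lustre 1.6). Device paths and mount points are placeholders; clients
# and all OSTs must be unmounted before running this on the MDS.
MDT_DEV="/dev/sdb1"        # placeholder: the MDT block device
MDT_MNT="/mnt/mdt"         # placeholder: the MDT mount point

if command -v tunefs.lustre >/dev/null 2>&1; then
    umount "$MDT_MNT"                       # stop the MDT
    e2fsck -f "$MDT_DEV"                    # the "fs-check" step
    tunefs.lustre --writeconf "$MDT_DEV"    # regenerate the config logs
    # ... repeat tunefs.lustre --writeconf on every OST device ...
    mount -t lustre "$MDT_DEV" "$MDT_MNT"   # remount the MDT before the OSTs
else
    echo "tunefs.lustre not available on this host" >&2
fi
```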
>>>>>
>>>>> Thanks a lot for your help,
>>>>> Thomas
>>>>>
>>>>> Mag Gam wrote:
>>>>>> Hi Tom:
>>>>>>
>>>>>> There was a known issue with 1.6.7.1. What I did was downgrade to
>>>>>> 1.6.6 and everything worked well. Or you can try upgrading, but there
>>>>>> is something definitely wrong with that version...
>>>>>>
>>>>>> If you like, I can help you offline. I should be free this weekend (I
>>>>>> have a long weekend)
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 2, 2009 at 8:22 AM, Thomas Roth<t.roth at gsi.de> wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> our MDT gets stuck and unresponsive under very high load (Lustre
>>>>>>> 1.6.7.1, kernel 2.6.22, 8 cores, 32 GB RAM). The only thing that
>>>>>>> stands out is one ll_mdt_?? process running at 100% CPU. Nothing unusual
>>>>>>> was happening on the cluster before that.
>>>>>>> After reboot as well as after moving the service to another server, this
>>>>>>> behavior reappears. The initial stages - mounting MGS, mounting MDT,
>>>>>>> recovery - work fine, but then the load goes up and the system is
>>>>>>> rendered unusable.
>>>>>>>
>>>>>>> At the moment I don't know what to do, except shut down all servers and
>>>>>>> possibly do a writeconf everywhere.
>>>>>>>
>>>>>>> I see that a similar problem was reported by Mag in March this year, but
>>>>>>> no clues or solutions appeared.
>>>>>>> Any ideas?
>>>>>>>
>>>>>>> Yours,
>>>>>>> Thomas
>>>>>>>
>>>>> --
>>>>> --------------------------------------------------------------------
>>>>> Thomas Roth
>>>>> Department: Informationstechnologie
>>>>> Location: SB3 1.262
>>>>> Phone: +49-6159-71 1453  Fax: +49-6159-71 2986
>>>>>
>>>>> GSI Helmholtzzentrum für Schwerionenforschung GmbH
>>>>> Planckstraße 1
>>>>> D-64291 Darmstadt
>>>>> www.gsi.de
>>>>>
>>>>> Gesellschaft mit beschränkter Haftung
>>>>> Sitz der Gesellschaft: Darmstadt
>>>>> Handelsregister: Amtsgericht Darmstadt, HRB 1528
>>>>>
>>>>> Geschäftsführer: Professor Dr. Horst Stöcker
>>>>>
>>>>> Vorsitzende des Aufsichtsrates: Dr. Beatrix Vierkorn-Rudolph,
>>>>> Stellvertreter: Ministerialdirigent Dr. Rolf Bernhardt
>>>>>
>>>
>
>


