[Lustre-discuss] MDT crash: ll_mdt at 100%

Thomas Roth t.roth at gsi.de
Fri Jul 3 07:44:25 PDT 2009



Mag Gam wrote:
> http://lists.lustre.org/pipermail/lustre-discuss/2009-March/009928.html
> 
> Look familiar?
> 
Yes, I've read the thread - that's why I addressed you in addition to
the list  ;-)

But I was not aware that this was supposed to be a bug in this
particular Lustre version.

Right now the MDT stops cooperating without the load of any ll_mdt
processes going up. The load on the MDT is around 0.5, but no
connections are possible.
In the log I only noticed some "still busy with 2 active RPCs" messages.
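
(As a general note, not from the original exchange: one way to check
whether an ll_mdt service thread is the culprit is to look at per-thread
CPU usage and then dump kernel stacks. These are standard Linux tools;
the ll_mdt thread-name pattern is taken from the subject of this thread.)

```shell
# Sketch: look for a spinning ll_mdt service thread.
# List per-thread CPU usage, highest first, keeping only ll_mdt_* threads.
ps -eLo pcpu,pid,comm --sort=-pcpu | awk '$3 ~ /^ll_mdt/ { print $1, $2, $3 }'

# If one thread sits near 100%, dump all kernel task stacks to the log
# (needs root and sysrq enabled), then pull the ll_mdt stacks out of dmesg.
if [ -w /proc/sysrq-trigger ]; then
    echo t > /proc/sysrq-trigger
    dmesg | grep -B2 -A15 'll_mdt'
fi
```

The sysrq-t dump is verbose but harmless; the stack of the spinning
thread usually shows where it is stuck.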
I just hope I don't have to writeconf the MDT again - I learned on this
list that this would be necessary if these RPCs never finish.
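
(For reference, a sketch of the config-log regeneration procedure
referred to here, written as a dry run that only prints the per-target
commands rather than executing them. The device paths are placeholders,
not taken from this thread; the order - unmount all targets, fsck and
`tunefs.lustre --writeconf` each one, then remount MGS first, MDT
second, OSTs last - is the standard writeconf procedure.)

```shell
# Dry-run sketch: print the commands to run on each server's target
# device after everything is unmounted. Device names are hypothetical.
writeconf_plan() {
    local dev
    for dev in "$@"; do
        echo "e2fsck -fp $dev"                  # check the backing ldiskfs
        echo "tunefs.lustre --writeconf $dev"   # regenerate the config logs
    done
}

writeconf_plan /dev/mgs_mdt_dev /dev/ost0_dev
```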

Regards,
Thomas


> 
> On Fri, Jul 3, 2009 at 7:32 AM, Thomas Roth<t.roth at gsi.de> wrote:
>> Hi,
>>
>> I didn't notice a discussion of such problems with 1.6.7.1. Do
>> you know anything more specific about it? We don't want to downgrade
>> since our users are happier after the last upgrade (1.6.5 -> 1.6.7). And
>> we don't have the 1.6.7.2 (Debian-) packages yet. But I could try to
>> speed that up and force an upgrade if you told me that 1.6.7.1 wasn't
>> really reliable.
>>
>> For the moment the problem seems to have been fixed by a shutdown,
>> fs check, and writeconf of all servers.
>> However, I don't want to do that every other week ...
>>
>> Thanks a lot for your help,
>> Thomas
>>
>> Mag Gam wrote:
>>> Hi Tom:
>>>
>>> There was a known issue with 1.6.7.1. What I did was downgrade to
>>> 1.6.6 and everything worked well. Or you can try upgrading, but there
>>> is something definitely wrong with that version...
>>>
>>> If you like, I can help you offline. I should be free this weekend (I
>>> have a long weekend)
>>>
>>>
>>>
>>> On Thu, Jul 2, 2009 at 8:22 AM, Thomas Roth<t.roth at gsi.de> wrote:
>>>> Hi all,
>>>>
>>>> our MDT gets stuck and unresponsive with very high loads (Lustre
>>>> 1.6.7.1, kernel 2.6.22, 8 cores, 32 GB RAM). The only thing drawing
>>>> attention is one ll_mdt_?? process running at 100% CPU. Nothing unusual
>>>> was happening on the cluster before that.
>>>> After a reboot, as well as after moving the service to another server,
>>>> this behavior reappears. The initial stages - mounting the MGS, mounting
>>>> the MDT, recovery - work fine, but then the load goes up and the system is
>>>> rendered unusable.
>>>>
>>>> At the moment I don't know what to do, except shut down all servers
>>>> and possibly do a writeconf everywhere.
>>>>
>>>> I see that a similar problem was reported by Mag in March this year, but
>>>> no clues or solutions appeared.
>>>> Any ideas?
>>>>
>>>> Yours,
>>>> Thomas
>>>>

-- 
--------------------------------------------------------------------
Thomas Roth
Department: Informationstechnologie
Location: SB3 1.262
Phone: +49-6159-71 1453  Fax: +49-6159-71 2986

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1
D-64291 Darmstadt
www.gsi.de

Gesellschaft mit beschränkter Haftung
Sitz der Gesellschaft: Darmstadt
Handelsregister: Amtsgericht Darmstadt, HRB 1528

Geschäftsführer: Professor Dr. Horst Stöcker

Vorsitzende des Aufsichtsrates: Dr. Beatrix Vierkorn-Rudolph,
Stellvertreter: Ministerialdirigent Dr. Rolf Bernhardt


