[Lustre-discuss] Node randomly panic
Somsak Sriprayoonsakul
somsak_sr at thaigrid.or.th
Mon Nov 26 05:36:49 PST 2007
We have about 177 client nodes.
I think the crashed happened only with OSS.
I do not have screenshot yet. How can I get the crashdump log?
Wojciech Turek wrote:
> Hi,
>
> how many clients (compute nodes) you have in your cluster? What is
> crashing randomly: clients or OSS or MDS or maybe all of them?
> Do you have screenshot of the kernel panic or crashdump log?
>
> cheers,
>
> Wojciech Turek
> On 26 Nov 2007, at 13:25, Somsak Sriprayoonsakul wrote:
>
>> No. I use stock bnx2 driver from pre-built latest kernel-lustre-1.6.3.
>>
>> I forgot to mention about oops. It's something about lustre
>> (lustre_blah_blah_blah something).
>>
>> All other nodes also use bnx2. There's no problem at all.
>>
>> Matt wrote:
>>> Somsak,
>>>
>>> Did you build your own bnx2 driver? I was getting kernel panics when
>>> hitting a certain load with Dell 1950s that also use the bnx2 driver.
>>> My solution was to grab the bnx2 source code and build it under the
>>> Lustre kernel. If you search the mailing list you'll find the mails
>>> dealing with this.
>>>
>>> If you see bnx2 mentioned in your kernel panic output, then it's
>>> probably the cause.
>>>
>>> Thanks,
>>>
>>> Matt
>>>
>>> On 26/11/2007, *Somsak Sriprayoonsakul * <somsak_sr at thaigrid.or.th
>>> <mailto:somsak_sr at thaigrid.or.th>
>>> <mailto:somsak_sr at thaigrid.or.th>> wrote:
>>>
>>> Hello,
>>>
>>> We have a 4 nodes Lustre Cluster that provides parallel file
>>> system
>>> for our 192 nodes cluster. The Lustre Cluster are CentOS 4.5, x86_64
>>> (Intel series 4000), on HP DL360-G5. The cluster that use it is
>>> ROCKS
>>> 4.2.1, on the same set of hardware. Our network is Gigabit Ethernet,
>>> using bnx2 driver. Lustre setup is
>>>
>>> storage-0-0: mgs+mdt, ost0, ost1 (backup)
>>> storage-0-1: mgs+mdt (backup), ost0 (backup), ost1
>>> storage-0-2: ost2, ost3 (backup)
>>> storage-0-3: ost2 (backup), ost3
>>>
>>> We're using heartbeat 2.0.8 base on pre-built RPM from
>>> CentOS. All
>>> backup is configure in the way that it'll not run simultaneously
>>> with
>>> primary. Note that, we enable flock and quota on Lustre.
>>>
>>> The problem we have right now is, some of the nodes are randomly
>>> panic. This happened about once a week or two week. We tolerate this
>>> stupidly by setting kernel.panic=60 and hope that the backup node
>>> will
>>> not failed within the time frame, though this is working quite well
>>> (base on user feedback, they do not know that the file system is
>>> failed). The backup node take-over OST and do recovery for about 250
>>> secs then everything back to normal.
>>>
>>> Anyways we're trying to nail down the reason why the file
>>> system is
>>> panic. I believe that information above will not suffice to
>>> track down
>>> the reason. Could someone give me a way to debug or dump some useful
>>> information that I can send to the list for later analysis?
>>> Also, does
>>> the "RECOVERING" suffice to make the file system stable? Do we
>>> need to
>>> shutdown the whole system and do e2fsck+lfsck?
>>>
>>> Also, every panic time, quota that was enabled will be
>>> disabled (lfs
>>> quota <user> /fs yield "No such process). I have to do quotaoff and
>>> quotaon again. It seems that the quota is not being turn on when
>>> OST is
>>> boot up. Is there a way to always turn this on?
>>>
>>>
>>> Thank you very much in advance
>>>
>>>
>>> --
>>>
>>>
>>> -----------------------------------------------------------------------------------
>>> Somsak Sriprayoonsakul
>>>
>>> Thai National Grid Center
>>> Software Industry Promotion Agency
>>> Ministry of ICT, Thailand
>>> somsak_sr at thaigrid.or.th <mailto:somsak_sr at thaigrid.or.th>
>>>
>>> -----------------------------------------------------------------------------------
>>>
>>>
>>> _______________________________________________
>>> Lustre-discuss mailing list
>>> Lustre-discuss at clusterfs.com <mailto:Lustre-discuss at clusterfs.com>
>>> https://mail.clusterfs.com/mailman/listinfo/lustre-discuss
>>> <https://mail.clusterfs.com/mailman/listinfo/lustre-discuss>
>>>
>>>
>>
>> --
>>
>> -----------------------------------------------------------------------------------
>> Somsak Sriprayoonsakul
>>
>> Thai National Grid Center
>> Software Industry Promotion Agency
>> Ministry of ICT, Thailand
>> somsak_sr at thaigrid.or.th <mailto:somsak_sr at thaigrid.or.th>
>> -----------------------------------------------------------------------------------
>>
>> _______________________________________________
>> Lustre-discuss mailing list
>> Lustre-discuss at clusterfs.com <mailto:Lustre-discuss at clusterfs.com>
>> https://mail.clusterfs.com/mailman/listinfo/lustre-discuss
>
> Mr Wojciech Turek
> Assistant System Manager
> University of Cambridge
> High Performance Computing service
> email: wjt27 at cam.ac.uk <mailto:wjt27 at cam.ac.uk>
> tel. +441223763517
>
>
>
--
-----------------------------------------------------------------------------------
Somsak Sriprayoonsakul
Thai National Grid Center
Software Industry Promotion Agency
Ministry of ICT, Thailand
somsak_sr at thaigrid.or.th
-----------------------------------------------------------------------------------
More information about the lustre-discuss
mailing list