[Lustre-discuss] Node randomly panic
Wojciech Turek
wjt27 at cam.ac.uk
Mon Nov 26 05:45:41 PST 2007
Hi,
On 26 Nov 2007, at 13:36, Somsak Sriprayoonsakul wrote:
> We have about 177 client nodes.
>
> I think the crashes happened only on the OSS.
We have had a similar problem. We have 600 clients and crashes happened
every 2 days. There is bug
https://bugzilla.lustre.org/show_bug.cgi?id=14293
If your kernel panic looks similar, you can be fairly certain it is the
same issue.
>
> I do not have a screenshot yet. How can I get the crashdump log?
You can try netdump
http://www.redhat.com/support/wpapers/redhat/netdump/
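
The idea is to have the oops trace and a vmcore sent over the network
to a spare machine instead of scrolling off the console. A minimal
sketch of the setup on CentOS 4 (the package names, the NETDUMPADDR key
and the dump location are written from memory, so please verify them
against the netdump documentation; 10.0.0.1 is just an example server
IP):

  # On the spare machine that will collect the dumps:
  yum install netdump-server
  passwd netdump                  # password the clients use to register
  chkconfig netdump-server on
  service netdump-server start    # dumps land under /var/crash/<client-ip>/

  # On each OSS/MDS (the netdump client):
  yum install netdump
  echo 'NETDUMPADDR=10.0.0.1' >> /etc/sysconfig/netdump
  service netdump propagate       # registers this client with the server
  chkconfig netdump on
  service netdump start

With that in place the next panic should leave a full trace (and
usually a vmcore) on the netdump server, which is much easier to attach
to a bug report than a screenshot.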
>
> Wojciech Turek wrote:
>> Hi,
>>
>> how many clients (compute nodes) do you have in your cluster? What is
>> crashing randomly: the clients, the OSS, the MDS, or maybe all of them?
>> Do you have a screenshot of the kernel panic or a crashdump log?
>>
>> cheers,
>>
>> Wojciech Turek
>>
>> On 26 Nov 2007, at 13:25, Somsak Sriprayoonsakul wrote:
>>
>>> No. I use the stock bnx2 driver from the pre-built latest
>>> kernel-lustre-1.6.3.
>>>
>>> I forgot to mention the oops. It says something about Lustre
>>> (lustre_blah_blah_blah something).
>>>
>>> All other nodes also use bnx2. There's no problem at all.
>>>
>>> Matt wrote:
>>>> Somsak,
>>>>
>>>> Did you build your own bnx2 driver? I was getting kernel panics
>>>> when hitting a certain load with Dell 1950s that also use the
>>>> bnx2 driver. My solution was to grab the bnx2 source code and
>>>> build it under the Lustre kernel. If you search the mailing
>>>> list you'll find the mails dealing with this.
>>>>
>>>> If you see bnx2 mentioned in your kernel panic output, then it's
>>>> probably the cause.
>>>>
>>>> Thanks,
>>>>
>>>> Matt
>>>>
>>>> On 26/11/2007, Somsak Sriprayoonsakul
>>>> <somsak_sr at thaigrid.or.th> wrote:
>>>>
>>>> Hello,
>>>>
>>>> We have a 4-node Lustre cluster that provides a parallel file
>>>> system for our 192-node cluster. The Lustre cluster runs CentOS
>>>> 4.5, x86_64 (Intel 4000 series), on HP DL360-G5. The cluster that
>>>> uses it runs ROCKS 4.2.1, on the same set of hardware. Our network
>>>> is Gigabit Ethernet, using the bnx2 driver. The Lustre setup is:
>>>>
>>>> storage-0-0: mgs+mdt, ost0, ost1 (backup)
>>>> storage-0-1: mgs+mdt (backup), ost0 (backup), ost1
>>>> storage-0-2: ost2, ost3 (backup)
>>>> storage-0-3: ost2 (backup), ost3
>>>>
>>>> We're using Heartbeat 2.0.8, based on the pre-built RPM from
>>>> CentOS. Each backup resource is configured so that it will not run
>>>> simultaneously with its primary. Note that we enable flock and
>>>> quota on Lustre.
>>>>
>>>> The problem we have right now is that some of the nodes panic
>>>> randomly. This happens about once every week or two. We tolerate
>>>> this crudely by setting kernel.panic=60 and hoping that the backup
>>>> node does not fail within that time frame. This is working quite
>>>> well (based on user feedback, users do not even notice that the
>>>> file system has failed). The backup node takes over the OST and
>>>> does recovery for about 250 seconds, then everything goes back to
>>>> normal.
>>>>
>>>> Anyway, we're trying to nail down the reason why the file system
>>>> panics. I believe the information above will not suffice to track
>>>> down the cause. Could someone give me a way to debug or dump some
>>>> useful information that I can send to the list for later analysis?
>>>> Also, does the "RECOVERING" state suffice to make the file system
>>>> stable, or do we need to shut down the whole system and run
>>>> e2fsck+lfsck?
>>>>
>>>> Also, every time a panic occurs, the quota that was enabled gets
>>>> disabled (lfs quota <user> /fs yields "No such process"). I have to
>>>> run quotaoff and quotaon again. It seems that quota is not being
>>>> turned on when the OST boots up. Is there a way to always turn it
>>>> on?
>>>>
>>>>
>>>> Thank you very much in advance
>>>>
>>>>
>>>> --
>>>>
>>>>
>>>> -----------------------------------------------------------------------------------
>>>> Somsak Sriprayoonsakul
>>>>
>>>> Thai National Grid Center
>>>> Software Industry Promotion Agency
>>>> Ministry of ICT, Thailand
>>>> somsak_sr at thaigrid.or.th
>>>>
>>>> -----------------------------------------------------------------------------------
>>>>
>>>>
>>>> _______________________________________________
>>>> Lustre-discuss mailing list
>>>> Lustre-discuss at clusterfs.com
>>>> https://mail.clusterfs.com/mailman/listinfo/lustre-discuss
>>>>
>>>>
>>>
>>> --
>>>
>>> -----------------------------------------------------------------------------------
>>> Somsak Sriprayoonsakul
>>>
>>> Thai National Grid Center
>>> Software Industry Promotion Agency
>>> Ministry of ICT, Thailand
>>> somsak_sr at thaigrid.or.th
>>> -----------------------------------------------------------------------------------
>>>
>>> _______________________________________________
>>> Lustre-discuss mailing list
>>> Lustre-discuss at clusterfs.com
>>> https://mail.clusterfs.com/mailman/listinfo/lustre-discuss
>>
>> Mr Wojciech Turek
>> Assistant System Manager
>> University of Cambridge
>> High Performance Computing service
>> email: wjt27 at cam.ac.uk
>> tel. +441223763517
>>
>>
>>
>
> --
>
> -----------------------------------------------------------------------------------
> Somsak Sriprayoonsakul
>
> Thai National Grid Center
> Software Industry Promotion Agency
> Ministry of ICT, Thailand
> somsak_sr at thaigrid.or.th
> -----------------------------------------------------------------------------------
>
Mr Wojciech Turek
Assistant System Manager
University of Cambridge
High Performance Computing service
email: wjt27 at cam.ac.uk
tel. +441223763517