[Lustre-discuss] Node randomly panic
Wojciech Turek
wjt27 at cam.ac.uk
Mon Nov 26 05:34:33 PST 2007
Hi,
How many clients (compute nodes) do you have in your cluster? What is
crashing randomly: the clients, the OSSs, the MDS, or all of them?
Do you have a screenshot of the kernel panic or a crash dump log?
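
If not, it is worth capturing the full oops text before the node
reboots. Two common ways are netconsole or a serial console; the
addresses, MAC and ports below are only placeholders for your network:

  # on the crashing node: send kernel messages to a remote log host
  modprobe netconsole netconsole=6665@10.0.0.21/eth0,6666@10.0.0.1/00:11:22:33:44:55

  # on the log host: record whatever arrives on that UDP port
  nc -u -l -p 6666 > panic-console.log

  # or boot the node with a serial console, e.g.
  #   console=tty0 console=ttyS0,115200
  # and capture ttyS0 from another machine
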
cheers,
Wojciech Turek
On 26 Nov 2007, at 13:25, Somsak Sriprayoonsakul wrote:
> No. I use the stock bnx2 driver from the latest pre-built
> kernel-lustre-1.6.3.
>
> I forgot to mention the oops. It's something about Lustre
> (lustre_blah_blah_blah something).
>
> All the other nodes also use bnx2 and have no problems at all.
>
> Matt wrote:
>> Somsak,
>>
>> Did you build your own bnx2 driver? I was getting kernel panics when
>> hitting a certain load with Dell 1950s that also use the bnx2 driver.
>> My solution was to grab the bnx2 source code and build it under the
>> Lustre kernel. If you search the mailing list you'll find the mails
>> dealing with this.
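>>
>> Roughly, what I did was the following (treat it as a sketch; the exact
>> tarball name and build steps depend on the driver package you grab):
>>
>>   # booted into the Lustre kernel, with its matching kernel
>>   # source/devel package installed
>>   tar xzf bnx2-<version>.tar.gz   # driver source from Broadcom
>>   cd bnx2-<version>/src
>>   make clean && make              # builds against /lib/modules/$(uname -r)/build
>>   make install                    # installs bnx2.ko for the running kernel
>>   depmod -a
>>   # reload the module (or reboot) so the new bnx2.ko is picked up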
>>
>> If you see bnx2 mentioned in your kernel panic output, then it's
>> probably the cause.
>>
>> Thanks,
>>
>> Matt
>>
>> On 26/11/2007, Somsak Sriprayoonsakul <somsak_sr at thaigrid.or.th> wrote:
>>
>> Hello,
>>
>> We have a 4-node Lustre cluster that provides a parallel file system
>> for our 192-node compute cluster. The Lustre nodes run CentOS 4.5,
>> x86_64 (Intel 4000-series CPUs), on HP DL360 G5 servers. The compute
>> cluster that uses it runs ROCKS 4.2.1 on the same set of hardware.
>> Our network is Gigabit Ethernet, using the bnx2 driver. The Lustre
>> setup is:
>>
>> storage-0-0: mgs+mdt, ost0, ost1 (backup)
>> storage-0-1: mgs+mdt (backup), ost0 (backup), ost1
>> storage-0-2: ost2, ost3 (backup)
>> storage-0-3: ost2 (backup), ost3
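>>
>> For reference, a layout like this would be formatted with the partner
>> node as the failover NID on each target, more or less like the sketch
>> below (fsname, NIDs and device names are placeholders, not our real
>> values):
>>
>>   # MGS/MDT on storage-0-0, failing over to storage-0-1
>>   mkfs.lustre --fsname=testfs --mgs --mdt \
>>       --failnode=10.1.1.2@tcp0 /dev/<mdt_device>
>>
>>   # ost0 on storage-0-0, failing over to storage-0-1; list both MGS NIDs
>>   mkfs.lustre --fsname=testfs --ost \
>>       --mgsnode=10.1.1.1@tcp0 --mgsnode=10.1.1.2@tcp0 \
>>       --failnode=10.1.1.2@tcp0 /dev/<ost0_device>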
>>
>> We're using Heartbeat 2.0.8 from the pre-built CentOS RPM. Each
>> backup is configured so that it never runs simultaneously with its
>> primary. Note that we enable flock and quota on Lustre.
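>>
>> (With a v1-style haresources setup, that mutual exclusion comes from
>> Heartbeat owning each Filesystem resource on exactly one node at a
>> time; a minimal sketch, with placeholder devices and mount points:)
>>
>>   # /etc/ha.d/haresources on the storage-0-0 / storage-0-1 pair
>>   storage-0-0 Filesystem::/dev/<mdt_dev>::/mnt/mdt::lustre Filesystem::/dev/<ost0_dev>::/mnt/ost0::lustre
>>   storage-0-1 Filesystem::/dev/<ost1_dev>::/mnt/ost1::lustre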
>>
>> The problem we have right now is that some of the nodes panic at
>> random, roughly once every week or two. We tolerate this crudely by
>> setting kernel.panic=60 and hoping that the backup node does not also
>> fail within that window. So far this works quite well (based on user
>> feedback, they do not notice that the file system has failed). The
>> backup node takes over the OST and runs recovery for about 250
>> seconds, then everything goes back to normal.
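>>
>> (That workaround is just the standard sysctl, which only makes the
>> node reboot 60 seconds after a panic; it fixes nothing:)
>>
>>   sysctl -w kernel.panic=60                     # reboot 60 s after a panic
>>   echo "kernel.panic = 60" >> /etc/sysctl.conf  # keep it across reboots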
>>
>> Anyway, we're trying to nail down why these panics happen. I believe
>> the information above will not suffice to track down the reason.
>> Could someone suggest a way to debug this, or to dump useful
>> information that I can send to the list for later analysis? Also, is
>> the "RECOVERING" phase enough to leave the file system in a stable
>> state, or do we need to shut down the whole system and run
>> e2fsck+lfsck?
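>>
>> (By e2fsck+lfsck I mean the procedure from the manual, roughly as
>> below; it needs the Lustre-patched e2fsprogs, and the device names,
>> database paths and mount point are placeholders:)
>>
>>   e2fsck -n -v --mdsdb /tmp/mdsdb /dev/<mdt_device>                        # on the MDS
>>   e2fsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/ostdb-0 /dev/<ost_device>   # on each OSS
>>   # copy the mdsdb/ostdb files to a client, then check MDS<->OST consistency:
>>   lfsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/ostdb-0 /tmp/ostdb-1 /lustre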
>>
>> Also, after every panic the quota that was enabled ends up disabled
>> (lfs quota <user> /fs yields "No such process"). I have to run
>> quotaoff and then quotaon again. It seems that quota is not being
>> turned on when the OST comes back up. Is there a way to always have
>> it turned on?
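>>
>> (This is what we run by hand after each failover, plus the permanent
>> setting I understand the 1.6 manual describes; fsname and mount point
>> are placeholders, and the parameter names are from memory, so please
>> double-check them:)
>>
>>   # manual re-enable after the OST comes back
>>   lfs quotaoff -ug /lustre
>>   lfs quotaon -ug /lustre
>>
>>   # supposed to make quota start automatically (run on the MGS node)
>>   lctl conf_param testfs.mdt.quota_type=ug
>>   lctl conf_param testfs.ost.quota_type=ug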
>>
>>
>> Thank you very much in advance
>>
>>
>> --
>>
>>
>> ---------------------------------------------------------------------
>> Somsak Sriprayoonsakul
>>
>> Thai National Grid Center
>> Software Industry Promotion Agency
>> Ministry of ICT, Thailand
>> somsak_sr at thaigrid.or.th
>>
>> ---------------------------------------------------------------------
>>
>>
>
> --
>
> ----------------------------------------------------------------------
> Somsak Sriprayoonsakul
>
> Thai National Grid Center
> Software Industry Promotion Agency
> Ministry of ICT, Thailand
> somsak_sr at thaigrid.or.th
> ----------------------------------------------------------------------
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at clusterfs.com
> https://mail.clusterfs.com/mailman/listinfo/lustre-discuss
Mr Wojciech Turek
Assistant System Manager
University of Cambridge
High Performance Computing service
email: wjt27 at cam.ac.uk
tel. +441223763517