[Lustre-discuss] Node randomly panic

Mon Nov 26 02:41:34 PST 2007

Somsak,

Did you build your own bnx2 driver? I was getting kernel panics under a
certain load on Dell 1950s, which also use the bnx2 driver.  My solution
was to grab the bnx2 source code and build it against the Lustre kernel.
If you search the mailing list you'll find the mails dealing with this.

If you see bnx2 mentioned in your kernel panic output, then it's probably
the cause.
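
To make that check concrete, here is a minimal sketch in Python. It assumes
you have already saved the panic output to a text file (from a serial
console, for example); the function names are hypothetical and nothing in it
is Lustre-specific. It only reports which lines mention the bnx2 module or a
bnx2_* function, so treat a hit as a hint rather than proof.

```python
import re

# Match "bnx2" as a module name or as the prefix of a function such as
# bnx2_<something>. The word boundaries keep it from matching other
# driver names that merely start with the same letters (e.g. "bnx2x").
BNX2_PATTERN = re.compile(r"\bbnx2(_\w+)?\b")


def find_bnx2_lines(text):
    """Return (line_number, line) pairs from panic output that mention bnx2."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if BNX2_PATTERN.search(line):
            hits.append((number, line.strip()))
    return hits


def scan_file(path):
    """Read a saved panic log (path chosen by the caller) and return the hits."""
    # errors="replace" tolerates stray bytes that console captures often contain.
    with open(path, errors="replace") as handle:
        return find_bnx2_lines(handle.read())
```

Calling scan_file("panic.txt"), where panic.txt is whatever file you saved the
output to, returns a list of line numbers and matching lines; an empty list
means bnx2 was not mentioned in that capture.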

Thanks,

Matt

On 26/11/2007, Somsak Sriprayoonsakul <somsak_sr at thaigrid.or.th> wrote:
>
> Hello,
>
>     We have a 4-node Lustre cluster that provides a parallel file system
> for our 192-node cluster. The Lustre nodes run CentOS 4.5, x86_64
> (Intel series 4000), on HP DL360-G5. The cluster that uses it runs ROCKS
> 4.2.1, on the same set of hardware. Our network is Gigabit Ethernet,
> using the bnx2 driver. The Lustre setup is:
>
> storage-0-0: mgs+mdt, ost0, ost1 (backup)
> storage-0-1: mgs+mdt (backup), ost0 (backup), ost1
> storage-0-2: ost2, ost3 (backup)
> storage-0-3: ost2 (backup), ost3
>
>     We're using heartbeat 2.0.8, based on the pre-built RPM from CentOS.
> All backups are configured so that they never run simultaneously with
> the primary. Note that we have enabled flock and quota on Lustre.
>
>     The problem we have right now is that some of the nodes panic at
> random. This happens about once every week or two. We tolerate this
> crudely by setting kernel.panic=60 and hoping that the backup node will
> not fail within that time frame, and this is working quite well
> (based on user feedback, they do not notice that the file system has
> failed). The backup node takes over the OST and does recovery for about
> 250 secs, then everything is back to normal.
>
>     Anyway, we're trying to nail down the reason why the file system is
> panicking. I believe that the information above will not suffice to
> track down the reason. Could someone give me a way to debug or dump some
> useful information that I can send to the list for later analysis? Also,
> does "RECOVERING" suffice to make the file system stable? Do we need to
> shut down the whole system and do e2fsck+lfsck?
>
>     Also, after every panic, the quota that was enabled gets disabled
> (lfs quota <user> /fs yields "No such process"). I have to do quotaoff
> and quotaon again. It seems that quota is not being turned on when the
> OST boots up. Is there a way to always turn this on?
>
>
>     Thank you very much in advance
>
>
> --
>
>
> -----------------------------------------------------------------------------------
> Somsak Sriprayoonsakul
>
> Thai National Grid Center
> Software Industry Promotion Agency
> Ministry of ICT, Thailand
> somsak_sr at thaigrid.or.th
>
> -----------------------------------------------------------------------------------
>
>