[Lustre-discuss] Node randomly panic
Somsak Sriprayoonsakul
somsak_sr at thaigrid.or.th
Mon Nov 26 05:25:53 PST 2007
No, I use the stock bnx2 driver from the pre-built latest kernel-lustre-1.6.3.
I forgot to mention the oops: it references Lustre symbols
(lustre_blah_blah_blah something).
All the other nodes also use bnx2, and they have no problems at all.
Matt wrote:
> Somsak,
>
> Did you build your own bnx2 driver? I was getting kernel panics when
> hitting a certain load with Dell 1950s that also use the bnx2 driver.
> My solution was to grab the bnx2 source code and build it under the
> Lustre kernel. If you search the mailing list you'll find the mails
> dealing with this.
>
> If you see bnx2 mentioned in your kernel panic output, then it's
> probably the cause.
>
> Thanks,
>
> Matt
>
> On 26/11/2007, Somsak Sriprayoonsakul <somsak_sr at thaigrid.or.th> wrote:
>
> Hello,
>
>     We have a four-node Lustre cluster that provides a parallel file
>     system for our 192-node compute cluster. The Lustre nodes run CentOS 4.5,
>     x86_64 (Intel 4000 series), on HP DL360 G5 hardware. The compute cluster
>     that uses it runs ROCKS 4.2.1 on the same set of hardware. Our network is
>     Gigabit Ethernet, using the bnx2 driver. The Lustre setup is:
>
> storage-0-0: mgs+mdt, ost0, ost1 (backup)
> storage-0-1: mgs+mdt (backup), ost0 (backup), ost1
> storage-0-2: ost2, ost3 (backup)
> storage-0-3: ost2 (backup), ost3
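>
>     Failover pairs like these are typically declared at format time with
>     mkfs.lustre's --failnode option. A sketch with hypothetical device names
>     and NIDs, not our exact command line:
>
>         # ost1 lives on storage-0-1; storage-0-0 is its failover partner
>         mkfs.lustre --ost --fsname=ourfs --mgsnode=storage-0-0@tcp0 \
>             --mgsnode=storage-0-1@tcp0 --failnode=storage-0-0@tcp0 /dev/sdb
>         mount -t lustre /dev/sdb /mnt/ost1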
>
>     We're using Heartbeat 2.0.8, based on the pre-built RPM from CentOS.
>     Each backup resource is configured so that it never runs simultaneously
>     with its primary. Note that we enable flock and quota on Lustre.
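>
>     One way to express this active/passive pairing is Heartbeat's v1-style
>     /etc/ha.d/haresources (still supported in 2.0.x); a sketch with
>     hypothetical devices and mount points, not our exact file:
>
>         # preferred node, then the Filesystem resource it owns
>         storage-0-0 Filesystem::/dev/mdtdev::/mnt/mdt::lustre
>         storage-0-1 Filesystem::/dev/ost1dev::/mnt/ost1::lustre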
>
>     The problem we have right now is that some of the nodes panic at
>     random, about once every week or two. We tolerate this crudely by
>     setting kernel.panic=60 and hoping that the backup node will not fail
>     within that time frame. This has actually worked quite well so far
>     (based on user feedback, users do not notice that the file system
>     failed). The backup node takes over the OSTs and performs recovery for
>     about 250 seconds, then everything returns to normal.
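>
>     The workaround itself is just a sysctl, persisted in /etc/sysctl.conf:
>
>         # reboot 60 seconds after a panic instead of hanging forever
>         kernel.panic = 60
>
>     and applied with sysctl -p.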
>
>     Anyway, we're trying to nail down why the nodes panic. I believe the
>     information above will not suffice to track down the cause. Could
>     someone suggest a way to debug or dump useful information that I can
>     send to the list for later analysis? Also, does the "RECOVERING" phase
>     suffice to make the file system stable, or do we need to shut down the
>     whole system and run e2fsck+lfsck?
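>
>     One option we are considering for capturing the full oops text is
>     netconsole, which copies console output over UDP to another host (a
>     sketch assuming a log host at 10.0.0.1 reachable via eth0; substitute
>     real addresses):
>
>         modprobe netconsole netconsole=@/eth0,6666@10.0.0.1/
>
>     with something like nc -u -l 6666 > panic.log listening on the log host.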
>
>     Also, after every panic, the quota that was enabled becomes disabled
>     (lfs quota <user> /fs yields "No such process"). I have to run quotaoff
>     and then quotaon again. It seems that quota is not turned on when the
>     OST comes back up. Is there a way to always keep it on?
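>
>     For now we re-enable quotas from a client like this (assuming the file
>     system is mounted at /fs; quotacheck only when usage may have gotten
>     out of sync):
>
>         lfs quotaoff -ug /fs
>         lfs quotacheck -ug /fs
>         lfs quotaon -ug /fs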
>
>
> Thank you very much in advance
>
>
> --
>
> -----------------------------------------------------------------------------------
> Somsak Sriprayoonsakul
>
> Thai National Grid Center
> Software Industry Promotion Agency
> Ministry of ICT, Thailand
> somsak_sr at thaigrid.or.th
> -----------------------------------------------------------------------------------
>
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at clusterfs.com
> https://mail.clusterfs.com/mailman/listinfo/lustre-discuss
>
>
--
-----------------------------------------------------------------------------------
Somsak Sriprayoonsakul
Thai National Grid Center
Software Industry Promotion Agency
Ministry of ICT, Thailand
somsak_sr at thaigrid.or.th
-----------------------------------------------------------------------------------