[Lustre-discuss] Node randomly panic

Somsak Sriprayoonsakul somsak_sr at thaigrid.or.th
Mon Nov 26 05:25:53 PST 2007


No, I use the stock bnx2 driver that ships with the latest pre-built 
kernel-lustre-1.6.3.
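
For what it's worth, the loaded driver can be confirmed with something 
like this (eth0 is just an example interface name):

    # version compiled into the loaded module
    modinfo bnx2 | grep -i version
    # driver and firmware versions as reported by the NIC itself
    ethtool -i eth0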

I forgot to mention the oops: it referenced a Lustre symbol 
(lustre_blah_blah_blah, something along those lines).

All the other nodes use bnx2 as well, with no problems at all.

Matt wrote:
> Somsak,
>
> Did you build your own bnx2 driver? I was getting kernel panics when 
> hitting a certain load with Dell 1950s that also use the bnx2 driver.  
> My solution was to grab the bnx2 source code and build it under the 
> Lustre kernel.  If you search the mailing list you'll find the mails 
> dealing with this.
>
> If you see bnx2 mentioned in your kernel panic output, then it's 
> probably the cause.
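>
> From memory, the procedure was roughly the following; the tarball name 
> and paths below are placeholders for whatever version you download 
> from Broadcom:
>
>     # build the Broadcom bnx2 source against the running Lustre kernel
>     tar xzf bnx2-<version>.tar.gz
>     cd bnx2-<version>/src
>     make clean && make
>     make install                  # installs under /lib/modules/$(uname -r)
>     depmod -a
>     rmmod bnx2 && modprobe bnx2   # or rebuild the initrd and reboot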
>
> Thanks,
>
> Matt
>
> On 26/11/2007, Somsak Sriprayoonsakul <somsak_sr at thaigrid.or.th> wrote:
>
>     Hello,
>
>     We have a 4-node Lustre cluster that provides a parallel file
>     system for our 192-node compute cluster. The Lustre nodes run
>     CentOS 4.5, x86_64 (Intel 4000 series), on HP DL360 G5 hardware.
>     The compute cluster that mounts it runs ROCKS 4.2.1 on the same
>     set of hardware. Our network is Gigabit Ethernet, using the bnx2
>     driver. The Lustre setup is:
>     storage-0-0: mgs+mdt, ost0, ost1 (backup)
>     storage-0-1: mgs+mdt (backup), ost0 (backup), ost1
>     storage-0-2: ost2, ost3 (backup)
>     storage-0-3: ost2 (backup), ost3
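>
>     (A rough sketch of how such failover pairs are formatted with
>     mkfs.lustre in 1.6; the fsname, devices, and NIDs here are
>     illustrative only:)
>
>         # on storage-0-0: combined MGS/MDT, storage-0-1 as failover
>         mkfs.lustre --fsname=fs --mgs --mdt \
>             --failnode=storage-0-1@tcp /dev/sda
>         # on storage-0-0: ost0, again with storage-0-1 as failover
>         mkfs.lustre --fsname=fs --ost --mgsnode=storage-0-0@tcp \
>             --failnode=storage-0-1@tcp /dev/sdb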
>
>     We're using Heartbeat 2.0.8 from the pre-built CentOS RPMs. Each
>     backup resource is configured so that it never runs simultaneously
>     with its primary. Note that we have flock and quota enabled on
>     Lustre.
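>
>     (As a rough illustration only, in v1-style haresources syntax and
>     with made-up device paths, one such primary/backup pair would look
>     like:)
>
>         # /etc/ha.d/haresources: preferred node, then the resource
>         storage-0-0 Filesystem::/dev/sdb::/mnt/ost0::lustre
>         storage-0-1 Filesystem::/dev/sdc::/mnt/ost1::lustre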
>
>     The problem we have right now is that some of the nodes panic at
>     random, roughly once every week or two. We tolerate this crudely
>     by setting kernel.panic=60 and hoping that the backup node does
>     not fail within that window. So far this works quite well (based
>     on user feedback, users do not even notice that the file system
>     failed): the backup node takes over the OSTs, recovery runs for
>     about 250 seconds, and then everything is back to normal.
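>
>     (The setting in question, made persistent:)
>
>         # /etc/sysctl.conf: reboot 60 seconds after a panic
>         kernel.panic = 60
>         # apply immediately without a reboot
>         sysctl -w kernel.panic=60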
>
>     Anyway, we're trying to nail down why the nodes panic. I suspect
>     the information above will not suffice to track down the cause.
>     Could someone suggest a way to debug, or useful information to
>     dump, that I could send to the list for analysis? Also, is the
>     "RECOVERING" phase enough to leave the file system stable, or do
>     we need to shut down the whole system and run e2fsck+lfsck?
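>
>     (The only capture setup I know of is something like the following,
>     with 10.1.1.254 standing in for a hypothetical log host;
>     corrections welcome:)
>
>         # stream console/oops output to a remote host over UDP
>         # (receive on the log host with e.g. nc -u -l -p 6666)
>         modprobe netconsole netconsole=@/eth0,6666@10.1.1.254/
>         # widen the Lustre debug mask, then dump the ring buffer
>         echo -1 > /proc/sys/lnet/debug
>         lctl dk /tmp/lustre-debug.log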
>
>     Also, after every panic, the quota that was enabled ends up
>     disabled (lfs quota <user> /fs yields "No such process"). I have
>     to run quotaoff and then quotaon again. It seems that quota is not
>     being turned back on when an OST boots up. Is there a way to keep
>     it always on?
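>
>     (The manual workaround each time is along these lines:)
>
>         # re-enable user and group quotas after a failover
>         lfs quotaoff -ug /fs
>         lfs quotaon -ug /fs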
>
>
>         Thank you very much in advance
>
>
>     --
>
>     -----------------------------------------------------------------------------------
>     Somsak Sriprayoonsakul
>
>     Thai National Grid Center
>     Software Industry Promotion Agency
>     Ministry of ICT, Thailand
>     somsak_sr at thaigrid.or.th
>     -----------------------------------------------------------------------------------

-- 

-----------------------------------------------------------------------------------
Somsak Sriprayoonsakul

Thai National Grid Center
Software Industry Promotion Agency
Ministry of ICT, Thailand
somsak_sr at thaigrid.or.th
-----------------------------------------------------------------------------------



