[Lustre-discuss] Node randomly panic

Mon Nov 26 05:36:49 PST 2007

We have about 177 client nodes.

I think the crashed happened only with OSS.

I do not have screenshot yet. How can I get the crashdump log?

Wojciech Turek wrote:
> Hi,
>
> how many clients (compute nodes) you have in your cluster? What is 
> crashing randomly: clients or OSS or MDS or maybe all of them?
> Do you have screenshot of the kernel panic or crashdump log?
>
> cheers,
>
> Wojciech Turek 
> On 26 Nov 2007, at 13:25, Somsak Sriprayoonsakul wrote:
>
>> No. I use stock bnx2 driver from pre-built latest kernel-lustre-1.6.3.
>>
>> I forgot to mention about oops. It's something about lustre 
>> (lustre_blah_blah_blah something).
>>
>> All other nodes also use bnx2. There's no problem at all.
>>
>> Matt wrote:
>>> Somsak,
>>>
>>> Did you build your own bnx2 driver? I was getting kernel panics when 
>>> hitting a certain load with Dell 1950s that also use the bnx2 driver.  
>>> My solution was to grab the bnx2 source code and build it under the 
>>> Lustre kernel.  If you search the mailing list you'll find the mails 
>>> dealing with this.
>>>
>>> If you see bnx2 mentioned in your kernel panic output, then it's 
>>> probably the cause.
>>>
>>> Thanks,
>>>
>>> Matt
>>>
>>> On 26/11/2007, *Somsak Sriprayoonsakul * <somsak_sr at thaigrid.or.th 
>>> <mailto:somsak_sr at thaigrid.or.th> 
>>> <mailto:somsak_sr at thaigrid.or.th>> wrote:
>>>
>>>     Hello,
>>>
>>>         We have a 4 nodes Lustre Cluster that provides parallel file
>>>     system
>>>     for our 192 nodes cluster. The Lustre Cluster are CentOS 4.5, x86_64
>>>     (Intel series 4000), on HP DL360-G5. The cluster that use it is 
>>> ROCKS
>>>     4.2.1, on the same set of hardware. Our network is Gigabit Ethernet,
>>>     using bnx2 driver. Lustre setup is
>>>
>>>     storage-0-0: mgs+mdt, ost0, ost1 (backup)
>>>     storage-0-1: mgs+mdt (backup), ost0 (backup), ost1
>>>     storage-0-2: ost2, ost3 (backup)
>>>     storage-0-3: ost2 (backup), ost3
>>>
>>>         We're using heartbeat 2.0.8 base on pre-built RPM from 
>>> CentOS. All
>>>     backup is configure in the way that it'll not run simultaneously 
>>> with
>>>     primary. Note that, we enable flock and quota on Lustre.
>>>
>>>         The problem we have right now is, some of the nodes are randomly
>>>     panic. This happened about once a week or two week. We tolerate this
>>>     stupidly by setting kernel.panic=60 and hope that the backup node
>>>     will
>>>     not failed within the time frame, though this is working quite well
>>>     (base on user feedback, they do not know that the file system is
>>>     failed). The backup node take-over OST and do recovery for about 250
>>>     secs then everything back to normal.
>>>
>>>         Anyways we're trying to nail down the reason why the file
>>>     system is
>>>     panic. I believe that information above will not suffice to 
>>> track down
>>>     the reason. Could someone give me a way to debug or dump some useful
>>>     information that I can send to the list for later analysis? 
>>> Also, does
>>>     the "RECOVERING" suffice to make the file system stable? Do we 
>>> need to
>>>     shutdown the whole system and do e2fsck+lfsck?
>>>
>>>         Also, every panic time, quota that was enabled will be
>>>     disabled (lfs
>>>     quota <user> /fs yield "No such process). I have to do quotaoff and
>>>     quotaon again. It seems that the quota is not being turn on when
>>>     OST is
>>>     boot up. Is there a way to always turn this on?
>>>
>>>
>>>         Thank you very much in advance
>>>
>>>
>>>     --
>>>
>>>     
>>> -----------------------------------------------------------------------------------
>>>     Somsak Sriprayoonsakul
>>>
>>>     Thai National Grid Center
>>>     Software Industry Promotion Agency
>>>     Ministry of ICT, Thailand
>>>     somsak_sr at thaigrid.or.th <mailto:somsak_sr at thaigrid.or.th>
>>>     
>>> -----------------------------------------------------------------------------------
>>>
>>>
>>>     _______________________________________________
>>>     Lustre-discuss mailing list
>>>     Lustre-discuss at clusterfs.com <mailto:Lustre-discuss at clusterfs.com>
>>>     https://mail.clusterfs.com/mailman/listinfo/lustre-discuss
>>>     <https://mail.clusterfs.com/mailman/listinfo/lustre-discuss>
>>>
>>>
>>
>> -- 
>>
>> -----------------------------------------------------------------------------------
>> Somsak Sriprayoonsakul
>>
>> Thai National Grid Center
>> Software Industry Promotion Agency
>> Ministry of ICT, Thailand
>> somsak_sr at thaigrid.or.th <mailto:somsak_sr at thaigrid.or.th>
>> -----------------------------------------------------------------------------------
>>
>> _______________________________________________
>> Lustre-discuss mailing list
>> Lustre-discuss at clusterfs.com <mailto:Lustre-discuss at clusterfs.com>
>> https://mail.clusterfs.com/mailman/listinfo/lustre-discuss
>
> Mr Wojciech Turek
> Assistant System Manager
> University of Cambridge
> High Performance Computing service 
> email: wjt27 at cam.ac.uk <mailto:wjt27 at cam.ac.uk>
> tel. +441223763517
>
>
>

-- 

-----------------------------------------------------------------------------------
Somsak Sriprayoonsakul

Thai National Grid Center
Software Industry Promotion Agency
Ministry of ICT, Thailand
somsak_sr at thaigrid.or.th
-----------------------------------------------------------------------------------