[Lustre-discuss] Node randomly panics

Wojciech Turek wjt27 at cam.ac.uk
Mon Nov 26 05:34:33 PST 2007


Hi,

How many clients (compute nodes) do you have in your cluster? What is
crashing randomly: the clients, the OSS nodes, the MDS, or maybe all of them?
Do you have a screenshot of the kernel panic or a crash dump log?
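
If you can't easily capture the console output at panic time, netconsole
or a serial console usually does the trick. A rough sketch (the IP
addresses, MAC address and interface names below are only placeholders
for your own setup):

    # On the node that panics: forward console messages over UDP to a
    # log host (format: srcport@srcip/dev,dstport@dstip/dstmac)
    modprobe netconsole \
        netconsole=6665@10.1.1.10/eth0,6666@10.1.1.1/00:11:22:33:44:55

    # On the log host: collect whatever arrives, including the oops text
    nc -u -l -p 6666 | tee /var/log/netconsole-storage-0-0.log

    # Alternatively, add a serial console to the kernel command line in
    # grub.conf if the servers have serial or iLO consoles:
    #   console=tty0 console=ttyS0,115200

It is also worth dumping the Lustre debug buffer on the surviving nodes
right after a failover (lctl dk /tmp/lustre-debug.log) and sending the
relevant part of that along with the panic trace.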

cheers,

Wojciech Turek
On 26 Nov 2007, at 13:25, Somsak Sriprayoonsakul wrote:

> No. I use the stock bnx2 driver from the latest pre-built kernel-lustre-1.6.3.
>
> I forgot to mention the oops. It's something in Lustre
> (a lustre_* symbol; I don't remember the exact name).
>
> All the other nodes also use bnx2 and have no problems at all.
>
> Matt wrote:
>> Somsak,
>>
>> Did you build your own bnx2 driver? I was getting kernel panics when
>> hitting a certain load with Dell 1950s that also use the bnx2 driver.
>> My solution was to grab the bnx2 source code and build it under the
>> Lustre kernel.  If you search the mailing list you'll find the mails
>> dealing with this.
>>
>> If you see bnx2 mentioned in your kernel panic output, then it's
>> probably the cause.
>>
>> Thanks,
>>
>> Matt
>>
>> On 26/11/2007, Somsak Sriprayoonsakul <somsak_sr at thaigrid.or.th> wrote:
>>
>>     Hello,
>>
>>         We have a 4-node Lustre cluster that provides a parallel file
>>     system for our 192-node compute cluster. The Lustre nodes run
>>     CentOS 4.5 x86_64 (Intel 4000 series) on HP DL360 G5 servers. The
>>     cluster that uses it runs ROCKS 4.2.1, on the same set of hardware.
>>     Our network is Gigabit Ethernet, using the bnx2 driver. The Lustre
>>     setup is:
>>
>>     storage-0-0: mgs+mdt, ost0, ost1 (backup)
>>     storage-0-1: mgs+mdt (backup), ost0 (backup), ost1
>>     storage-0-2: ost2, ost3 (backup)
>>     storage-0-3: ost2 (backup), ost3
>>
>>         We're using heartbeat 2.0.8, based on the pre-built RPM from
>>     CentOS. Each backup resource is configured so that it never runs
>>     simultaneously with its primary. Note that we enable flock and
>>     quota on Lustre.
>>
>>         The problem we have right now is that some of the nodes panic
>>     randomly. This happens about once every week or two. We tolerate
>>     this, rather crudely, by setting kernel.panic=60 and hoping that
>>     the backup node does not fail within that time frame. This is
>>     working quite well (based on user feedback, they do not even notice
>>     that the file system has failed): the backup node takes over the
>>     OST, recovery runs for about 250 seconds, and then everything is
>>     back to normal.
>>
>>         Anyway, we're trying to nail down the reason why the file
>>     system nodes panic. I believe the information above will not
>>     suffice to track down the cause. Could someone give me a way to
>>     debug or dump some useful information that I can send to the list
>>     for later analysis? Also, is the "RECOVERING" state enough to leave
>>     the file system stable, or do we need to shut down the whole system
>>     and run e2fsck+lfsck?
>>
>>         Also, every time a panic happens, the quota that was enabled
>>     ends up disabled (lfs quota <user> /fs yields "No such process").
>>     I have to run quotaoff and then quotaon again. It seems that quota
>>     is not being turned on when the OST comes back up. Is there a way
>>     to always turn it on?
>>
>>
>>         Thank you very much in advance
>>
>>
>>     --
>>
>>     ----------------------------------------------------------------------
>>     Somsak Sriprayoonsakul
>>
>>     Thai National Grid Center
>>     Software Industry Promotion Agency
>>     Ministry of ICT, Thailand
>>     somsak_sr at thaigrid.or.th
>>     ----------------------------------------------------------------------
>>
>>
>>     _______________________________________________
>>     Lustre-discuss mailing list
>>     Lustre-discuss at clusterfs.com
>>     https://mail.clusterfs.com/mailman/listinfo/lustre-discuss
>>
>>
>
> -- 
>
> -----------------------------------------------------------------------
> Somsak Sriprayoonsakul
>
> Thai National Grid Center
> Software Industry Promotion Agency
> Ministry of ICT, Thailand
> somsak_sr at thaigrid.or.th
> -----------------------------------------------------------------------
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at clusterfs.com
> https://mail.clusterfs.com/mailman/listinfo/lustre-discuss

Mr Wojciech Turek
Assistant System Manager
University of Cambridge
High Performance Computing service
email: wjt27 at cam.ac.uk
tel. +441223763517




