[Lustre-discuss] Node randomly panic

Wojciech Turek wjt27 at cam.ac.uk
Mon Nov 26 05:45:41 PST 2007


Hi,
On 26 Nov 2007, at 13:36, Somsak Sriprayoonsakul wrote:

> We have about 177 client nodes.
>
> I think the crashes happened only on the OSS.
We have had a similar problem. We have 600 clients and crashes happened  
every 2 days. There is bug
https://bugzilla.lustre.org/show_bug.cgi?id=14293
If your kernel panic looks similar, you can be almost certain that  
it is the same issue.
>
> I do not have screenshot yet. How can I get the crashdump log?
You can try netdump:
http://www.redhat.com/support/wpapers/redhat/netdump/
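Roughly (a sketch from memory of the Red Hat netdump setup; the IP address below is a placeholder and package/service names may vary on CentOS 4):

```shell
# On a separate machine that will receive the crash dumps:
yum install netdump-server
chkconfig netdump-server on
service netdump-server start     # dumps land under /var/crash/<client-ip>/

# On each OSS that panics, point the netdump client at that server:
vi /etc/sysconfig/netdump        # set NETDUMPADDR=192.168.1.10 (placeholder)
service netdump propagate        # copies the ssh key to the server
chkconfig netdump on
service netdump start
```

After the next panic, the oops text and vmcore should appear on the netdump server for analysis.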
>
> Wojciech Turek wrote:
>> Hi,
>>
>> how many clients (compute nodes) do you have in your cluster? What is  
>> crashing randomly: clients, OSS, MDS, or maybe all of them?
>> Do you have a screenshot of the kernel panic or a crashdump log?
>>
>> cheers,
>>
>> Wojciech Turek On 26 Nov 2007, at 13:25, Somsak Sriprayoonsakul  
>> wrote:
>>
>>> No. I use the stock bnx2 driver from the pre-built latest
>>> kernel-lustre-1.6.3.
>>>
>>> I forgot to mention the oops. It's something about Lustre  
>>> (lustre_blah_blah_blah something).
>>>
>>> All other nodes also use bnx2. There's no problem at all.
>>>
>>> Matt wrote:
>>>> Somsak,
>>>>
>>>> Did you build your own bnx2 driver? I was getting kernel panics  
>>>> when hitting a certain load with Dell 1950s that also use the  
>>>> bnx2 driver.  My solution was to grab the bnx2 source code and  
>>>> build it under the Lustre kernel.  If you search the mailing  
>>>> list you'll find the mails dealing with this.
>>>>
>>>> If you see bnx2 mentioned in your kernel panic output, then it's  
>>>> probably the cause.
>>>>
>>>> Thanks,
>>>>
>>>> Matt
>>>>
>>>> On 26/11/2007, *Somsak Sriprayoonsakul*  
>>>> <somsak_sr at thaigrid.or.th> wrote:
>>>>
>>>>     Hello,
>>>>
>>>>         We have a 4-node Lustre cluster that provides a parallel
>>>>     file system for our 192-node cluster. The Lustre cluster runs
>>>>     CentOS 4.5, x86_64 (Intel 4000 series), on HP DL360-G5. The
>>>>     cluster that uses it is ROCKS 4.2.1, on the same set of
>>>>     hardware. Our network is Gigabit Ethernet, using the bnx2
>>>>     driver. The Lustre setup is:
>>>>
>>>>     storage-0-0: mgs+mdt, ost0, ost1 (backup)
>>>>     storage-0-1: mgs+mdt (backup), ost0 (backup), ost1
>>>>     storage-0-2: ost2, ost3 (backup)
>>>>     storage-0-3: ost2 (backup), ost3
>>>>
>>>>         We're using heartbeat 2.0.8, based on the pre-built RPM
>>>>     from CentOS. All backups are configured so that they never run
>>>>     simultaneously with the primary. Note that we enable flock and
>>>>     quota on Lustre.
>>>>
>>>>         The problem we have right now is that some of the nodes
>>>>     randomly panic. This happens about once every week or two. We
>>>>     tolerate this crudely by setting kernel.panic=60 and hoping
>>>>     that the backup node does not fail within that time frame.
>>>>     This is working quite well (based on user feedback, they do
>>>>     not notice that the file system has failed). The backup node
>>>>     takes over the OST and does recovery for about 250 seconds,
>>>>     then everything goes back to normal.
>>>>
>>>>         Anyway, we're trying to nail down the reason why the file
>>>>     system panics. I believe the information above will not
>>>>     suffice to track down the cause. Could someone give me a way
>>>>     to debug or dump some useful information that I can send to
>>>>     the list for later analysis? Also, does the "RECOVERING" pass
>>>>     suffice to make the file system stable, or do we need to shut
>>>>     down the whole system and run e2fsck+lfsck?
>>>>
>>>>         Also, at every panic, the quota that was enabled gets
>>>>     disabled (lfs quota <user> /fs yields "No such process"). I
>>>>     have to do quotaoff and then quotaon again. It seems that
>>>>     quota is not turned on when the OST boots up. Is there a way
>>>>     to always turn it on?
>>>>
>>>>
>>>>         Thank you very much in advance
>>>>
>>>>
>>>>     --
>>>>     -----------------------------------------------------------------
>>>>     Somsak Sriprayoonsakul
>>>>
>>>>     Thai National Grid Center
>>>>     Software Industry Promotion Agency
>>>>     Ministry of ICT, Thailand
>>>>     somsak_sr at thaigrid.or.th
>>>>     -----------------------------------------------------------------
>>>>
>>>>
>>>>     _______________________________________________
>>>>     Lustre-discuss mailing list
>>>>     Lustre-discuss at clusterfs.com
>>>>     https://mail.clusterfs.com/mailman/listinfo/lustre-discuss
>>>>
>>>>
>>>
>>
>
>

Mr Wojciech Turek
Assistant System Manager
University of Cambridge
High Performance Computing service
email: wjt27 at cam.ac.uk
tel. +441223763517


