[Lustre-discuss] Node randomly panic

Somsak Sriprayoonsakul somsak_sr at thaigrid.or.th
Sun Nov 25 21:27:05 PST 2007


Hello,

    We have a 4-node Lustre cluster that provides a parallel file system
for our 192-node cluster. The Lustre nodes run CentOS 4.5, x86_64
(Intel series 4000), on HP DL360-G5. The cluster that uses it runs ROCKS
4.2.1, on the same set of hardware. Our network is Gigabit Ethernet,
using the bnx2 driver. The Lustre setup is:

storage-0-0: mgs+mdt, ost0, ost1 (backup)
storage-0-1: mgs+mdt (backup), ost0 (backup), ost1
storage-0-2: ost2, ost3 (backup)
storage-0-3: ost2 (backup), ost3
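
To make the failover pairing explicit, here is a minimal Python sketch of
the layout above. The names LAYOUT and targets_after_failure are
illustrative only (they are not part of heartbeat or Lustre); the sketch
just shows which node would serve each target after one node fails,
assuming every backup takes over cleanly.

```python
# Failover layout from the list above: target -> (primary node, backup node).
# This table is illustrative; it is not a heartbeat configuration file.
LAYOUT = {
    "mgs+mdt": ("storage-0-0", "storage-0-1"),
    "ost0": ("storage-0-0", "storage-0-1"),
    "ost1": ("storage-0-1", "storage-0-0"),
    "ost2": ("storage-0-2", "storage-0-3"),
    "ost3": ("storage-0-3", "storage-0-2"),
}


def targets_after_failure(layout, failed_node):
    """Return {node: sorted targets} after failed_node's targets move to their backups."""
    placement = {}
    for target, (primary, backup) in layout.items():
        # A target stays on its primary unless the primary is the failed node.
        node = backup if primary == failed_node else primary
        placement.setdefault(node, []).append(target)
    return {node: sorted(targets) for node, targets in placement.items()}


if __name__ == "__main__":
    for node, targets in sorted(targets_after_failure(LAYOUT, "storage-0-0").items()):
        print(node, targets)
```

With storage-0-0 down, storage-0-1 ends up serving mgs+mdt, ost0 and ost1,
while storage-0-2 and storage-0-3 are unaffected.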

    We're using heartbeat 2.0.8, based on the pre-built RPM from CentOS.
Every backup is configured so that it will not run simultaneously with
its primary. Note that we have enabled flock and quota on Lustre.

    The problem we have right now is that some of the nodes randomly
panic. This happens about once every week or two. We tolerate this
crudely by setting kernel.panic=60 and hoping that the backup node will
not fail within that time frame. This works quite well: based on user
feedback, users do not even notice that the file system has failed. The
backup node takes over the OST and does recovery for about 250 secs,
then everything is back to normal.
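
For reference, kernel.panic=60 makes the kernel reboot 60 seconds after
a panic. A minimal Python sketch for checking that the setting is present
in sysctl.conf-style text follows; the function name panic_timeout and
the SAMPLE text are hypothetical, and the sketch does not read the live
value from the running kernel.

```python
def panic_timeout(sysctl_text):
    """Return the last kernel.panic value found in sysctl.conf-style text, or None."""
    value = None
    for line in sysctl_text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, val = line.partition("=")
        if sep and key.strip() == "kernel.panic":
            value = int(val.strip())
    return value


# Hypothetical sample input, not copied from our servers.
SAMPLE = """\
# reboot automatically 60 seconds after a kernel panic
kernel.panic = 60
"""

if __name__ == "__main__":
    print(panic_timeout(SAMPLE))
```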

    Anyway, we're trying to nail down the reason why the nodes panic.
I believe the information above will not suffice to track down the
cause. Could someone suggest a way to debug or dump some useful
information that I can send to the list for later analysis? Also, does
the "RECOVERING" state suffice to make the file system stable? Or do we
need to shut down the whole system and run e2fsck+lfsck?

    Also, after every panic, quota that was enabled becomes disabled (lfs
quota <user> /fs yields "No such process"). I have to run quotaoff and
quotaon again. It seems that quota is not being turned on when the OST
boots up. Is there a way to always turn this on?
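
The manual workaround described above could be scripted roughly as
follows. This is only a sketch under assumptions: the helper names are
hypothetical, and the exact lfs arguments used here ("-u" for lfs quota,
"-ug" for quotaoff/quotaon) are assumed and should be checked against
"lfs help" on the installed Lustre version. The only string taken from
our system is the "No such process" error.

```python
import subprocess


def quota_is_off(lfs_quota_output):
    """True if the output of `lfs quota` shows the error seen after a panic."""
    return "No such process" in lfs_quota_output


def reenable_commands(mount_point):
    """Commands for the manual quotaoff/quotaon cycle (flags are assumptions)."""
    return [
        ["lfs", "quotaoff", "-ug", mount_point],
        ["lfs", "quotaon", "-ug", mount_point],
    ]


def check_and_reenable(user, mount_point, run=subprocess.run):
    """Query quota for one user; if it is off, cycle quotaoff/quotaon.

    Returns True if the re-enable commands were issued, False otherwise.
    `run` is injectable so the logic can be exercised without Lustre.
    """
    result = run(["lfs", "quota", "-u", user, mount_point],
                 capture_output=True, text=True)
    if quota_is_off(result.stdout + result.stderr):
        for cmd in reenable_commands(mount_point):
            run(cmd, check=True)
        return True
    return False
```

Such a check could be run from a client after a failover, but it only
automates the workaround; it does not explain why quota is off after the
OST restarts.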


    Thank you very much in advance


-- 

-----------------------------------------------------------------------------------
Somsak Sriprayoonsakul

Thai National Grid Center
Software Industry Promotion Agency
Ministry of ICT, Thailand
somsak_sr at thaigrid.or.th
-----------------------------------------------------------------------------------




