[Lustre-discuss] OSS panicing.....

Wed Aug 7 17:44:28 PDT 2013

On Tue, 2013-08-06 at 14:22 +0100, Phill Harvey-Smith wrote:
> Hi all,
> 
> Our OSS has started panicking in the last couple of days. It seems to be
> related to NFSv4, but I am not sure, so I am asking the group for pointers.
> 
> Firstly, a couple of screen grabs are at:
> 
> http://penguin.stats.warwick.ac.uk/~stsxab/Lustre/

It looks like an nfsd4 error in the backtrace. You should look into the
NFS side of your setup. It likely has nothing to do with Lustre (outside
of the kernel you are running). If this is a new install, the kernel you
are using may not like the NFS userspace you have, but that is just a
wild guess.

Thanks,
  Keith Mannthey 

> 
> The OSS server is currently running Ubuntu 10.04 LTS with an alien
> (Red Hat, I believe) kernel installed.
> 
> The running kernel is :
> 
> 2.6.32-131.6.1.el6_lustre.g65156ed.x86_64
> 
> I believe that it is running Lustre 1.6.x. The MDS is also set up in a
> similar manner.
> 
> The clients are a mixture of Ubuntu 10.04 LTS with Lustre 1.6.x and the 
> 3 most recent nodes are Ubuntu 12.04 LTS with Lustre 2.5.x which I built 
> recently.
> 
> The OSS has 2 RAID arrays. One is on the onboard SAS controller and holds
> two of the Lustre volumes (/home and /scratch), along with the NFS-exported
> file system on a separate XFS partition. The second RAID array is on an
> external PCIe RAID controller with an external disk array, and holds the
> other Lustre filesystem on two virtual disks.
> 
> The OSS also has a couple of NFS4 shares :
> 
> /export 
> 192.168.0.0/24(rw,async,fsid=0,crossmnt,no_root_squash,no_subtree_check) 
> 192.168.1.0/24(rw,sync,fsid=0,no_root_squash,crossmnt,no_subtree_check)
> 
> /export/software/packages-x86_64-linux-gnu 
> 192.168.0.0/24(rw,async,no_subtree_check,no_root_squash)
> 
> Which are on a separate disk.
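
Since the suspicion is on the NFS side, it may be worth comparing the options on those exports. The following is only a rough sketch in Python, assuming the entries quoted above are joined onto single lines. The helper names (parse_exports, option_differences) are made up for illustration, and the parsing is deliberately simplified: it does not handle quoted paths or line continuations. It reports which options differ between the clients of one export. A difference is not a diagnosis of the panic, just something to look at.

```python
import re

# Export entries copied from the message above, joined onto single lines.
EXPORTS = """\
/export 192.168.0.0/24(rw,async,fsid=0,crossmnt,no_root_squash,no_subtree_check) 192.168.1.0/24(rw,sync,fsid=0,no_root_squash,crossmnt,no_subtree_check)
/export/software/packages-x86_64-linux-gnu 192.168.0.0/24(rw,async,no_subtree_check,no_root_squash)
"""

# Matches "client(opt1,opt2,...)" pairs; a simplification of the real syntax.
CLIENT_RE = re.compile(r"(\S+?)\(([^)]*)\)")


def parse_exports(text):
    """Return {path: {client: set(options)}} for simple exports-style lines."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        path, _, rest = line.partition(" ")
        table[path] = {
            client: set(opts.split(","))
            for client, opts in CLIENT_RE.findall(rest)
        }
    return table


def option_differences(clients):
    """Options set for some clients of an export but not for all of them."""
    option_sets = list(clients.values())
    if not option_sets:
        return set()
    return set.union(*option_sets) - set.intersection(*option_sets)


exports = parse_exports(EXPORTS)
for path, clients in exports.items():
    diff = option_differences(clients)
    print(path, "differs in:", sorted(diff) if diff else "nothing")
```

For the entries as quoted, the only difference this turns up is async on one network versus sync on the other for /export; both networks carry fsid=0 on that export.
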
> 
> If I disable the NFS shares, the OSS server seems to stay up and
> client machines can access the Lustre file systems. But once I enable
> the NFS shares, the OSS will panic within a few minutes, which is why I
> suspect some interaction with NFS.
> 
> The odd thing is that the machine only started doing this yesterday. I have
> replaced / re-seated the RAM, CPUs and cards (Ethernet & SAS), but this
> doesn't seem to have changed anything.
> 
> I am aware that this setup is not a supported architecture (I inherited 
> custody of the cluster from a previous admin) and am planning on 
> re-installing both the OSS and MDS with (probably) CentOS, as that is 
> supported for the server. Is there anything I need to be aware of in
> planning this upgrade?
> 
> Does anyone have any clue as to what I might try? Is there an easy way I
> can check the integrity of the Lustre volumes?
> 
> Cheers.
> 
> Phill.