[Lustre-discuss] OSS panicking.....

Tue Aug 6 06:22:59 PDT 2013

Hi all,

Our OSS has started panicking in the last couple of days. It seems to be
related to NFSv4, but I am not sure, so I am asking the group for pointers.

Firstly, a couple of screen grabs are at :

http://penguin.stats.warwick.ac.uk/~stsxab/Lustre/

The OSS server is currently running Ubuntu 10.04 LTS with an alien 
(redhat I believe) kernel installed.

The running kernel is :

2.6.32-131.6.1.el6_lustre.g65156ed.x86_64

I believe that it is running Lustre 1.6.x. The MDS is also set up in a
similar manner.
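
The Lustre version can be read from the server rather than guessed. This is
a minimal sketch, assuming the server exposes a /proc/fs/lustre/version
file containing a "lustre: <version>" line. Both the path and the line format
are assumptions that may not hold on every release, and the function names
are made up for illustration:

```python
from pathlib import Path
from typing import Optional

# Assumed location of the version file on a node with Lustre modules loaded.
VERSION_FILE = Path("/proc/fs/lustre/version")


def parse_lustre_version(text: str) -> Optional[str]:
    """Return the value of the 'lustre:' line, or None if there is no such line."""
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "lustre":
            return value.strip()
    return None


def running_lustre_version(path: Path = VERSION_FILE) -> Optional[str]:
    """Read and parse the version file; None if it is missing or unreadable."""
    try:
        return parse_lustre_version(path.read_text())
    except OSError:
        return None


if __name__ == "__main__":
    print(running_lustre_version() or "Lustre version file not found")
```

Running this on the OSS, the MDS and one client of each type would show
whether the servers and clients really are on the versions listed here.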

The clients are a mixture of Ubuntu 10.04 LTS with Lustre 1.6.x and, on
the 3 most recent nodes, Ubuntu 12.04 LTS with Lustre 2.5.x, which I
built recently.

The OSS has 2 RAID arrays. The first is on the onboard SAS controller and
holds two of the Lustre volumes (/home and /scratch), along with the NFS
exported file system, which is on a separate XFS partition. The second
RAID array is on an external PCIe RAID controller with an external disk
array, and holds the other Lustre filesystem on two virtual disks.

The OSS also has a couple of NFS4 shares :

/export 
192.168.0.0/24(rw,async,fsid=0,crossmnt,no_root_squash,no_subtree_check) 
192.168.1.0/24(rw,sync,fsid=0,no_root_squash,crossmnt,no_subtree_check)

/export/software/packages-x86_64-linux-gnu 
192.168.0.0/24(rw,async,no_subtree_check,no_root_squash)

These are on a separate disk.
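
The two client specs on /export are not identical, which is easy to miss
when reading them by eye. This is a minimal sketch that parses one exports
entry of the form "path client(options) ..." and reports the options
that differ between clients. The helper names are hypothetical, and it
only handles the simple syntax used above, not the full exports(5) grammar:

```python
import re


def parse_exports_entry(entry: str) -> dict:
    """Split 'path client(opts) client(opts) ...' into a path and per-client option sets."""
    tokens = entry.split()
    path, specs = tokens[0], tokens[1:]
    clients = {}
    for spec in specs:
        match = re.fullmatch(r"([^()]+)\(([^()]*)\)", spec)
        if match is None:
            raise ValueError(f"unrecognised client spec: {spec!r}")
        client, opts = match.groups()
        clients[client] = set(opts.split(",")) if opts else set()
    return {"path": path, "clients": clients}


def option_differences(clients: dict) -> dict:
    """Return, per client, the options that are not shared by every client."""
    if not clients:
        return {}
    common = set.intersection(*clients.values())
    return {c: sorted(o - common) for c, o in clients.items() if o - common}


# The /export entry from this post, joined onto one line.
ENTRY = (
    "/export "
    "192.168.0.0/24(rw,async,fsid=0,crossmnt,no_root_squash,no_subtree_check) "
    "192.168.1.0/24(rw,sync,fsid=0,no_root_squash,crossmnt,no_subtree_check)"
)

if __name__ == "__main__":
    parsed = parse_exports_entry(ENTRY)
    print(option_differences(parsed["clients"]))
    # {'192.168.0.0/24': ['async'], '192.168.1.0/24': ['sync']}
```

So the only difference is async on 192.168.0.0/24 versus sync on
192.168.1.0/24. Whether that has any bearing on the panics is unknown.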

If I disable the NFS shares, the OSS server seems to stay up and
client machines can access the Lustre file systems. But once I enable
the NFS shares, the OSS will panic within a few minutes. This is why I
suspect some interaction with NFS.

The odd thing is that the machine only started doing this yesterday. I have
replaced / re-seated the RAM, CPUs and cards (Ethernet & SAS), but this
doesn't seem to have changed anything.

I am aware that this setup is not a supported architecture (I inherited 
custody of the cluster from a previous admin) and am planning on 
re-installing both the OSS and MDS with (probably) CentOS, as that is 
supported for the server. Is there anything I need to be aware of in 
planning this upgrade ?

Does anyone have any clue as to what I might try ? Is there an easy way I
can check the integrity of the Lustre volumes ?

Cheers.

Phill.