[Lustre-discuss] slow ls on flaky system

Herbert Fruchtl herbert.fruchtl at st-andrews.ac.uk
Tue Jan 20 03:11:12 PST 2009

We have Lustre on our cluster (CentOS 4.7). The /home filesystem holds 27 TB,
distributed over three OSS servers, each with a RAID array split into two
filesystems. It is 90% full and contains 2 million files.

It is not particularly stable: every one to three weeks the filesystem goes AWOL
and I have to reboot the machine. This morning I ran an "ls -lR" on the front-end
node (which also serves as the MDS), just to count the files, and it took more
than an hour. "top" showed "ls" using anywhere between 5% and 90% of a CPU during
that time (mostly in the 10-30% range). Is this normal?
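For what it's worth, a plain file count shouldn't need the per-file stat calls that "ls -lR" makes. A sketch of what I tried to do, assuming /home is the Lustre mount point:

```shell
# Sketch: counting files without the "ls -lR" long-listing overhead.
# Assumes /home is the Lustre mount point.

# Inode usage reported by the filesystem itself -- no directory walk at all:
df -i /home

# Or walk the tree, but skip the per-file long-listing output:
find /home -type f | wc -l
```

The "df -i" route asks the MDS for its inode counts directly, so it returns in seconds; the "find" route still walks every directory but avoids formatting a long listing for each entry.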

The crash this weekend was preceded by half a day during which the front-end
kept losing and regaining its connection to the filesystem: it worked for a
while, then "df" returned "Input/output error" or "Cannot send after transport
endpoint", then it recovered again. The filesystem seemed fine the whole time
from the compute nodes and from one external Lustre client (until it went away
completely).

I have inherited this cluster and am not an expert in filesystems. The Lustre
timeout is set to its default of 100 s. How do I find out what's wrong?
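In case it helps anyone answer: here is a sketch of the diagnostics I was planning to run from the client the next time it flakes out (paths assume the 1.x-era /proc interface; the NID below is a placeholder):

```shell
# Sketch of first-pass client-side diagnostics (1.x-era Lustre assumed):
lctl dl                          # list Lustre devices and their setup state
cat /proc/sys/lustre/timeout     # confirm the obd timeout (default 100)
lctl ping <MDS-NID>              # <MDS-NID> is a placeholder: LNET reachability
dmesg | grep -i lustre | tail    # most recent Lustre/LNET kernel messages
```

Would watching "lctl dl" and the kernel log while the connection drops be the right way to tell whether the problem is on the network (LNET) side or on the MDS itself?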

Herbert Fruchtl
Senior Scientific Computing Officer
School of Chemistry, School of Mathematics and Statistics
University of St Andrews
The University of St Andrews is a charity registered in Scotland:
No SC013532
