[Lustre-discuss] slow ls on flaky system

Andreas Dilger adilger at sun.com
Tue Jan 20 10:02:22 PST 2009


On Jan 20, 2009  11:11 +0000, Herbert Fruchtl wrote:
> It is not particularly stable. Every 1-3 weeks the filesystem goes AWOL and I
> have to reboot the machine. This morning I did an "ls -lR" on the front-end
> (which serves as MDS), just to count the files, and it took more than one hour.
> "top" showed "ls" taking up anything between 5% and 90% of a CPU during this
> time (most of the time in the 10-30% range). Is this normal?
> 
> The crash this weekend was preceded by half a day during which the front-end
> kept losing and regaining connection to the filesystem. It worked for a while,
> then "df" gave an "input/output error", or "Cannot send after transport
> endpoint", then recovered again. It seemed OK all the time from the compute
> nodes, and from one external Lustre client (until it went away completely).
> 
> I have inherited this cluster and I am not an expert in filesystems. The
> timeout is set to its default 100s. How do I find out what's wrong?

There was a series of bugs related to "statahead" in the 1.6.5.1 release
that could cause problems with "ls -lR" type workloads.  You can disable
this feature on the clients with "echo 0 > /proc/fs/lustre/llite/*/statahead_max",
or upgrade at least the clients to the 1.6.6 release.
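As a minimal sketch of how you might apply that across a cluster (assuming
you can ssh to the clients as root, and substituting your real node names
for the hypothetical client01..client04):

    #!/bin/sh
    # Hypothetical client list -- replace with your real node names.
    CLIENTS="client01 client02 client03 client04"

    for node in $CLIENTS; do
        # One statahead_max entry exists per mounted Lustre filesystem;
        # writing 0 disables the statahead feature entirely.
        ssh "$node" 'for f in /proc/fs/lustre/llite/*/statahead_max; do
            echo 0 > "$f"
        done'
    done

Note that this /proc tunable is not persistent: it reverts when the client
remounts the filesystem or reboots, so you would need to reapply it (e.g.
from a local startup script) until the clients are running 1.6.6.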

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



