[Lustre-discuss] MDS Problems
Andreas Dilger
adilger at sun.com
Fri Jun 13 14:46:44 PDT 2008
On Jun 13, 2008 16:03 -0400, Charles Taylor wrote:
> We have been running the config below on three different lustre file
> systems since early January and, for the most part, things have been
> pretty stable. We are now experiencing frequent hangs on some
> clients - particularly our interactive login nodes. All processes
> get blocked behind Lustre I/O requests. When this happens there are
> *no* messages in either dmesg or syslog on the clients. They seem
> unaware of a problem.
This is likely due to "client statahead" problems. Please disable statahead
with "echo 0 > /proc/fs/lustre/llite/*/statahead_max" on the clients.
This should also be fixed in 1.6.5.
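The same setting can also be applied from a script. This is a minimal sketch in Python, assuming the Lustre 1.6 /proc layout named above; `disable_statahead` is a hypothetical helper, not part of any Lustre tool. It writes 0 to each matching `statahead_max` file one at a time. That matters on a client that mounts more than one Lustre filesystem (the poster runs three). In that case the glob matches several files, and bash rejects a redirect whose target expands to more than one name with "ambiguous redirect". Like other /proc tunables, the value is presumably lost on remount, so something like this would typically run from a mount or boot script, as root.

```python
import glob

# Path pattern from the reply above (assumed Lustre 1.6 /proc layout).
LLITE_GLOB = "/proc/fs/lustre/llite/*/statahead_max"


def disable_statahead(pattern=LLITE_GLOB):
    """Write 0 to every statahead_max file matching pattern.

    Per-file equivalent of: echo 0 > /proc/fs/lustre/llite/*/statahead_max
    Returns the list of paths that were written.
    """
    written = []
    for path in sorted(glob.glob(pattern)):
        # One write per llite instance, so several mounts are handled.
        with open(path, "w") as f:
            f.write("0\n")
        written.append(path)
    return written


if __name__ == "__main__":
    paths = disable_statahead()
    if not paths:
        print("no llite instances found; is a Lustre filesystem mounted?")
    for p in paths:
        print("statahead disabled:", p)
```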
> 1. A ton of lustre-log.M.N files get dumped into /tmp in a short
> period of time. Most of them appear to be full of garbage and
> unprintable characters rather than thread stack traces. Many of them
> are also zero length.
The lustre-log files are not stack traces. They are dumped Lustre debug
logs, which are written in a binary format.
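Because the dumps are binary, they have to be converted before they can be read. A minimal sketch, assuming the `lctl debug_file <input> [output]` subcommand shipped with Lustre (which converts a dumped debug log to plain text) and the `/tmp/lustre-log.*` naming from the report above; the helper names are hypothetical. It skips the zero-length dumps, which hold nothing to convert, and by default only prints the commands so they can be reviewed first.

```python
import glob
import os
import subprocess

DUMP_GLOB = "/tmp/lustre-log.*"  # naming taken from the report above


def debug_file_commands(pattern=DUMP_GLOB):
    """Return one `lctl debug_file <input> <output>` command per non-empty dump."""
    commands = []
    for path in sorted(glob.glob(pattern)):
        if path.endswith(".txt"):
            continue  # text output from an earlier conversion, not a dump
        if os.path.getsize(path) == 0:
            continue  # zero-length dumps contain nothing to convert
        commands.append(["lctl", "debug_file", path, path + ".txt"])
    return commands


def convert_dumps(pattern=DUMP_GLOB, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in debug_file_commands(pattern):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    convert_dumps()  # dry run: review the commands before executing them
```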
> We have been adjusting lru_size on the clients but so far it has made
> no difference. We have "options mds mds_num_threads=512" and our
> system timeout is 1000 (sure, go ahead and flame me but if we don't do
> that we get tons of "endpoint transport failures" on the clients and
> no, there are no connectivity issues). :)
>
> We are open to suggestion and wondering if we should update the MDSs
> to 1.6.5. Can we do that safely without also upgrading the clients
> and OSTs?
In general the MDS and OSS nodes should run the same level of software,
as that is what we test, but there isn't a hard requirement for it.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.