[Lustre-discuss] MDS Problems

Andreas Dilger adilger at sun.com
Fri Jun 13 14:46:44 PDT 2008


On Jun 13, 2008  16:03 -0400, Charles Taylor wrote:
> We have been running the config below on three different lustre file  
> systems since early January and, for the most part, things have been  
> pretty stable.    We are now experiencing frequent hangs on some  
> clients - particularly our interactive login nodes.    All processes   
> get blocked behind Lustre I/O requests.   When this happens there are  
> *no* messages in either dmesg or syslog on the clients.     They seem  
> unaware of a problem.

This is likely due to client "statahead" problems.  Please disable
statahead with "echo 0 > /proc/fs/lustre/llite/*/statahead_max" on the
clients.  This should also be fixed in 1.6.5.
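
On a client that mounts several Lustre file systems, the glob in that
command matches more than one file, and some shells (bash, for example)
may refuse the redirection as ambiguous.  A minimal sketch of a loop that
writes to each matching file in turn; the function name disable_statahead
and its optional base-directory argument are hypothetical, added only for
illustration and testing:

```shell
# Hypothetical helper: write 0 to statahead_max for every Lustre client
# mount found under the given base directory.
# The base directory defaults to the path named in the post.
disable_statahead() {
    base="${1:-/proc/fs/lustre/llite}"
    for f in "$base"/*/statahead_max; do
        # If nothing matches, the pattern stays unexpanded; skip it.
        [ -e "$f" ] || continue
        echo 0 > "$f" || return 1
    done
}
```

Running "disable_statahead" as root on each client would apply the
setting to all mounted file systems.  Assuming the setting is not
persistent, it would need to be reapplied after a remount or reboot.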

> 1. A ton of lustre-log.M.N files get dumped into /tmp in a  short  
> period of time.   Most of them appear to be full of garbage and  
> unprintable characters rather than thread stack traces.   Many of them  
> are also zero length.

The lustre-log files are not stack traces.  They are dumps of the Lustre
debug log in a binary format, which is why they look like garbage when
viewed as text.
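
The dumped files can be converted to text with "lctl debug_file".  A
minimal sketch, assuming the dumps are in /tmp as described above; the
function name decode_lustre_logs and the ".txt" output suffix are
hypothetical choices made for illustration:

```shell
# Hypothetical helper: convert each non-empty lustre-log dump in a
# directory to text using "lctl debug_file <input> <output>".
decode_lustre_logs() {
    dir="${1:-/tmp}"
    for f in "$dir"/lustre-log.*; do
        # Skip files already produced by an earlier run.
        case "$f" in *.txt) continue ;; esac
        # -s is false for missing and zero-length files, so empty
        # dumps and an unexpanded pattern are both skipped.
        [ -s "$f" ] || continue
        lctl debug_file "$f" "$f.txt"
    done
}
```

Zero-length dumps contain nothing to decode, so the sketch skips them
rather than treating them as errors.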

> We have been adjusting lru_size on the clients but so far it has made  
> no difference.    We have "options mds mds_num_threads=512" and our  
> system timeout is 1000 (sure, go ahead and flame me but if we don't do  
> that we get tons of "endpoint transport failures" on the clients and  
> no, there are no connectivity issues).   :)
> 
> We are open to suggestion and wondering if we should update the MDSs  
> to 1.6.5.   Can we do that safely without also upgrading the clients  
> and OSTs?

In general the MDS and OSS nodes should run the same version of the
software, as that is what we test, but there isn't a hard requirement
for it.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



