[Lustre-discuss] Problems with MDS Crashing

Gary Brooks garybrooks at cloudaccess.net
Wed May 12 14:42:38 PDT 2010


Help Needed:

We're having trouble with our MDS server. There is nothing suspicious in the
logs; at some point log entries simply stop being written.

The scenario is as follows: we have an MDS running on DRBD, 2 OSSes, and
about 10 clients. The traffic pattern is lots of small file reads and
writes. We provision Joomla! sites. Each Joomla! site has about 17000 small
files. We write 1 new Joomla! site every 30 seconds. This happens all day
long and does not stop.
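
To put those numbers in perspective, here is a rough back-of-the-envelope
calculation of the file-creation rate they imply. It is only a sketch: it
assumes each of the 17000 files counts as exactly one create on the MDS and
ignores directories, reads, and any other metadata traffic. The function and
constant names are made up for illustration.

```python
# Rough estimate of the sustained file-creation rate implied by the workload.
# Assumption: one MDS create per file; directories and reads are ignored.

FILES_PER_SITE = 17000      # approximate small files per Joomla! site
SECONDS_PER_SITE = 30       # one new site is written every 30 seconds
SECONDS_PER_DAY = 24 * 60 * 60


def creates_per_second(files_per_site, seconds_per_site):
    """Average number of file creates per second."""
    return files_per_site / seconds_per_site


def files_per_day(files_per_site, seconds_per_site):
    """Total files written in 24 hours of non-stop provisioning."""
    sites_per_day = SECONDS_PER_DAY // seconds_per_site
    return sites_per_day * files_per_site


if __name__ == "__main__":
    rate = creates_per_second(FILES_PER_SITE, SECONDS_PER_SITE)
    total = files_per_day(FILES_PER_SITE, SECONDS_PER_SITE)
    print(f"average creates per second: {rate:.0f}")
    print(f"files written per day:      {total}")
```

Under those assumptions this works out to several hundred new files per
second, around the clock.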

During operation, the load on the MDS is around 2 (it's an 8-core machine
with RAID 10 on a 3ware card, pretty heavily equipped, and it should handle
much more). iostat shows a constant read rate of about 5 MB/s and a write
rate of 100 kB/s to 7 MB/s. There are about 5000 r/w ops per second.

Then, all of a sudden, the MDS stops responding, ssh sessions die, and only
a hard restart helps. After the restart, /var/log/messages contains only
normal information (some timeout chit-chat).

While this happens randomly, there is an almost sure way to trigger it:
issue sysctl -w lnet.debug=0 on all clients and servers. After that, the
file system becomes super responsive, the load on the MDS is still low, and
our gig-e link is well utilized (unlike when lnet logging is enabled). After
a few minutes, the MDS dies as described above.

I know that this is too little information to ask for help, but maybe you
could at least tell me where to look for more information?

Gary