[Lustre-discuss] Problems with MDS Crashing

Gregory Matthews greg.matthews at diamond.ac.uk
Tue May 18 01:10:20 PDT 2010


Gary Brooks wrote:
> Then, all of a sudden, the MDS stops responding, ssh sessions die and 
> only a hard restart helps. After the restart, /var/log/messages contains 
> only normal information (some timeout chit-chat).

Is your hardware using the bnx2 NIC driver? We have just been seeing very 
similar hangs on Lustre clients running on brand new Dell PowerEdge R610s. 
The workaround is to turn off MSI-X. A proper fix has recently been merged 
into the mainline kernel and has also been backported by Red Hat.
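
To check whether a node is affected, something along these lines should 
report which kernel driver is bound to each interface. This is only a rough 
sketch: the function names nic_driver and list_ifaces_using are made up for 
illustration, and it assumes the usual Linux sysfs layout where 
/sys/class/net/<iface>/device/driver is a symlink to the driver directory.

```shell
#!/bin/sh
# Print the kernel driver bound to a network interface by resolving the
# sysfs "driver" symlink. The sysfs root is an optional second argument
# (default /sys) so the function can also be run against a copied tree.
# nic_driver is a hypothetical helper name, not part of any tool.
nic_driver() {
    iface="$1"
    sysfs_root="${2:-/sys}"
    link="$sysfs_root/class/net/$iface/device/driver"
    # Virtual interfaces (lo, bonds, VLANs) have no device/driver link.
    [ -L "$link" ] || return 1
    basename "$(readlink "$link")"
}

# Print every interface whose driver matches the given name.
# list_ifaces_using is likewise a hypothetical helper name.
list_ifaces_using() {
    wanted="$1"
    sysfs_root="${2:-/sys}"
    for path in "$sysfs_root"/class/net/*; do
        name="${path##*/}"
        if [ "$(nic_driver "$name" "$sysfs_root")" = "$wanted" ]; then
            echo "$name"
        fi
    done
}
```

Running "list_ifaces_using bnx2" on the MDS would then list the interfaces 
to worry about. As far as I know, the bnx2 module takes a disable_msi=1 
parameter that covers MSI-X as well (for example "options bnx2 
disable_msi=1" in /etc/modprobe.conf), but please treat that as an 
assumption and confirm it with "modinfo bnx2" on your kernel before relying 
on it.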

> While this happens randomly, there is an almost sure way to trigger it: 
> issue sysctl -w lnet.debug=0 on all clients and servers, after which the 
> file system becomes super responsive, load on MDS is still low, our 
> gig-e link is well utilized (unlike when lnet logging is enabled) and 
> after a few minutes MDS dies as described above.

We have not been able to trigger it in any predictable fashion on our 
systems.

Greg

> 
> I know that this is too little information to ask for help, but maybe 
> you could at least tell me where to look for any information?
> 
> Gary


-- 
Greg Matthews            01235 778658
Senior Computer Systems Administrator
Diamond Light Source, Oxfordshire, UK


