[Lustre-discuss] mds device unhealthy - clients got stuck

Andreas Dilger adilger at sun.com
Fri Feb 6 14:23:50 PST 2009


On Feb 06, 2009  17:48 +0000, Wojciech Turek wrote:
> Today our MDS became unstable. The /proc/fs/lustre/health_check
> file reported that the mds device is not healthy. All clients connected to
> the ddn_home file system got stuck, the MDS server started to refuse client
> connections, and after some time it started to evict clients. Can
> someone help me get to the bottom of this problem? Below I have attached log
> snippets from the MDS and one client.

> Lustre-1.6.6
> RHEL4
> 2.6.9-67.0.22.EL_lustre.1.6.6smp
> 600 clients
> 24 OST/4 OSS

NB - please post plain-text emails to the mailing list.

> Feb  6 11:42:34 mds01 kernel: LustreError:
> 19469:0:(ldlm_request.c:81:ldlm_expired_completion_wait()) ### lock
> timed out (enqueued at 1233920454, 100s ago); not entering recovery in
> server code, just going back to sleep ns: mds-ddn_home-
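
As an aside, the "enqueued at" value in the message above is a Unix
timestamp, so it can be decoded to see when the lock request was queued.
A minimal Python sketch, assuming the message keeps this exact wording
(the helper name parse_lock_timeout is made up for illustration and is
not part of Lustre):

```python
import re
from datetime import datetime, timezone

# Matches the "enqueued at <epoch>, <N>s ago" fragment of the message.
ENQUEUED_RE = re.compile(r"enqueued at (\d+), (\d+)s ago")


def parse_lock_timeout(message):
    """Return (enqueue time in UTC, seconds waited), or None if no match."""
    match = ENQUEUED_RE.search(message)
    if match is None:
        return None
    enqueued = datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)
    return enqueued, int(match.group(2))


line = ("### lock timed out (enqueued at 1233920454, 100s ago); "
        "not entering recovery in server code, just going back to sleep")
enqueued, waited = parse_lock_timeout(line)
print(enqueued.isoformat(), waited)  # 2009-02-06T11:40:54+00:00 100
```

That is 100 seconds before the 11:42:34 syslog stamp, which is consistent
if the MDS clock is set to GMT.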

It looks like the MDS is stuck in a lock deadlock of some kind.
It is worthwhile to get a stack trace (sysrq-t), capture any debug
logs dumped by the initial watchdog timer, and file a bug.  If the
debug logs do not contain "dlmtrace" then it may be very hard to
debug this problem.
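
A rough sketch of those two checks in Python, to be run as root on the
MDS. The paths are assumptions to verify on your own system:
/proc/sysrq-trigger is the standard Linux sysrq interface, and
/proc/sys/lnet/debug is where I would expect the debug mask on 1.6.
The function names are made up for illustration.

```python
from pathlib import Path

# Assumed locations; verify them on your own MDS before relying on this.
SYSRQ_TRIGGER = "/proc/sysrq-trigger"  # standard Linux sysrq interface
LNET_DEBUG = "/proc/sys/lnet/debug"    # assumed Lustre 1.6 debug mask file


def debug_flag_enabled(flag, mask_path=LNET_DEBUG):
    """Return True if `flag` is one of the whitespace-separated mask entries."""
    return flag in Path(mask_path).read_text().split()


def request_task_dump(trigger_path=SYSRQ_TRIGGER):
    """Write 't' to the sysrq trigger, the file equivalent of sysrq-t."""
    with open(trigger_path, "w") as trigger:
        trigger.write("t")
```

Calling debug_flag_enabled("dlmtrace") first tells you whether lock
tracing is in the mask; request_task_dump() then asks the kernel to log
the stack of every task, which should show up in dmesg and syslog.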

The MDS will need to be rebooted in any case, which should resolve the
problem for the time being.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



