[Lustre-discuss] Watchdog triggered for pid 5383: it was inactive for 100s (+ stack trace)

Mon Dec 22 00:15:34 PST 2008

On Dec 18, 2008  13:47 -0600, Hendelman, Rob wrote:
> Is this something to be concerned about?  We have quite a few of these.  This is on our mgs/mds box.
> 
> Our mgs/mds aren't in one filesystem (separate spindles with separate spindles for journals as well), but are on the same box.
> 
> Lustre: 0:0:(linux-debug.c:168:libcfs_debug_dumpstack()) showing stack for process 5383

You left out some important messages above this one; they explain why the
watchdog was triggered.

> Call Trace:
>  [<ffffffff80097e35>] call_usermodehelper_keys+0xea/0xff
>  [<ffffffff80097e4a>] __call_usermodehelper+0x0/0x4f
>  [<ffffffff884065af>] :lvfs:upcall_cache_get_entry+0x5bf/0xa50

The thread is in upcall_cache_get_entry invoking a user-space helper, which
implies that you are using something like LDAP for users/groups on the MDS,
and that the LDAP server can't reply in a timely manner (e.g. within several
seconds).  You can put less load on your LDAP server by increasing
/proc/fs/lustre/mds/myth-MDT0000/group_expire_interval
(the number of seconds before a user->group mapping is refreshed;
default 600s).
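
As a rough illustration of that tuning step, here is a minimal Python sketch
that reads the current interval and raises it.  It assumes the proc file holds
a single integer and that the script runs as root on the MDS; the helper names
read_interval and raise_interval are made up for this example, and
"myth-MDT0000" is the MDT name from the path above, so substitute your own.

```python
from pathlib import Path

# Path quoted in the message above; "myth-MDT0000" is site-specific.
PROC_PATH = Path("/proc/fs/lustre/mds/myth-MDT0000/group_expire_interval")


def read_interval(path=PROC_PATH):
    """Return the current interval in seconds (assumes one integer in the file)."""
    return int(Path(path).read_text().split()[0])


def raise_interval(new_seconds, path=PROC_PATH):
    """Raise the interval to new_seconds, never lowering it.

    Returns the value in effect afterwards.
    """
    current = read_interval(path)
    if new_seconds <= current:
        # Already at least as long as requested; leave it alone.
        return current
    Path(path).write_text(f"{new_seconds}\n")
    return read_interval(path)
```

For example, raise_interval(1800) would move a 600s setting to 30 minutes.
Values written under /proc normally do not survive a remount or reboot, so
treat this as a way to experiment rather than a permanent setting.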

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.