[Lustre-discuss] [SPAM] Lustre 1.6.4.2 Error
Andreas Dilger
adilger at sun.com
Fri Mar 21 23:35:17 PDT 2008
On Mar 21, 2008 19:15 +0100, Dilling wrote:
> Some days ago one of my users started a lot of MATLAB jobs, flooding all
> processors on our 40-node, 2-CPU cluster (4-core system). More details and
> log files can be found in the appendix. As a result I observed some
> strange Lustre behavior: ptlrpcd used 100% of one CPU, and the second CPU
> was completely occupied by pwd, a child of the MATLAB process
> invoked by the user. I/O on Lustre was partly possible, but df reported
> access denied. A recovery with the MDT started after lustre.timeout=300 but
> did not complete. I had to reboot all nodes that showed this behavior. The
> OST showed the message:
> Mar 14 16:47:05 cn46 kernel: LustreError: 138-a: lustre-OST0003: A client
> on nid 10.128.15.2@tcp was evicted due to a lock glimpse callback to
> 10.128.15.2@tcp timed out: rc -110
> The client kernels reported soft lockups on all available cores.
> Does anyone have an idea how to prevent such behavior? Thanks for your help.
You missed an important detail right at the beginning of your woes:
ll_sai_entry_set()) ASSERTION(entry->se_stat == SA_ENTRY_UNSTATED) failed
This is a bug in the "statahead" code. This is a new feature which detects
apps doing "readdir + sequential stat" operations on a directory and starts
multiple concurrent metadata RPCs in order to hide the network latency of
the serialized "stat" operations.
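For illustration, the access pattern described above looks roughly like the following shell sketch. The function name `stat_each` is hypothetical. The glob makes the shell read the directory, and the `-e` test stats each entry in turn, which is the serialized sequence that statahead tries to overlap. It only shows the pattern, not how Lustre detects it.

```shell
# Hypothetical helper, not part of Lustre: read a directory, then stat each
# entry one after another in listing order.
stat_each() {
    dir="$1"
    for entry in "$dir"/*; do        # glob expansion reads the directory
        # -e stats the entry; it also skips the literal pattern
        # when the directory is empty
        [ -e "$entry" ] || continue
        basename "$entry"            # print the name of each entry that was stat'ed
    done
}
```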
This is known bug 15175 in our Bugzilla and is being worked on. You can
disable statahead on the clients until this is resolved:
echo 0 > /proc/fs/lustre/llite/*/statahead_count
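The wildcard in the command above expands to one file per mounted Lustre filesystem. With a single mount it works as written, but with several mounts some shells (bash, for example) reject the redirection as ambiguous. A minimal sketch of a loop that writes to each file separately follows. The function name `disable_statahead` and its optional base-directory argument are assumptions added for illustration, and on a real client it would have to run as root.

```shell
# Hypothetical helper: write 0 to every statahead_count file under a base
# directory (defaults to the llite /proc directory from the command above).
disable_statahead() {
    base="${1:-/proc/fs/lustre/llite}"
    for f in "$base"/*/statahead_count; do
        [ -e "$f" ] || continue   # no match: the pattern stays literal, skip it
        echo 0 > "$f"
    done
}
```

Calling `disable_statahead` with no argument on each client applies the setting to every mounted filesystem. As with other /proc tunables, the value should not be expected to survive a remount or reboot.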
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.