[Lustre-discuss] [SPAM] Lustre 1.6.4.2 Error

Fri Mar 21 23:35:17 PDT 2008

On Mar 21, 2008  19:15 +0100, Dilling wrote:
> Some days ago one of my users started a lot of MATLAB jobs, flooding all
> processors on our 40-node, 2-CPU cluster (4-core system). More details and
> log files can be found in the appendix. As a result I observed some
> strange Lustre behavior: ptlrpcd used 100% of one CPU, and the second CPU
> was completely occupied by pwd, a child of the MATLAB process
> invoked by the user. I/O on Lustre was partly possible, but df reported
> "access denied". A recovery with the MDT started after lustre.timeout=300 but
> did not complete. I had to reboot all nodes that showed this behavior. The
> OST showed the message:
>  Mar 14 16:47:05 cn46 kernel: LustreError: 138-a: lustre-OST0003: A client
> on nid 10.128.15.2@tcp was evicted due to a lock glimpse callback to
> 10.128.15.2@tcp timed out: rc -110
> The client kernels reported soft lockup on all available cores.
> Does anyone have an idea how to prevent such behavior? Thanks for your help.

You missed an important detail right at the beginning of your woes:

ll_sai_entry_set()) ASSERTION(entry->se_stat == SA_ENTRY_UNSTATED) failed

This is a bug in the "statahead" code.  This is a new feature which detects
apps doing "readdir + sequential stat" operations on a directory and starts
multiple concurrent metadata RPCs in order to hide the network latency of
the serialized "stat" operations.
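
For illustration, the access pattern in question looks roughly like the
sketch below: list a directory, then stat each entry in turn, which is
essentially what "ls -l" does. The function name scan_dir is hypothetical
and only serves the example; it is not part of Lustre.

```shell
# Hypothetical helper (not part of Lustre): walk a directory and stat each
# entry one after another, i.e. the "readdir + sequential stat" pattern.
scan_dir() {
    dir=${1:-.}
    count=0
    for entry in "$dir"/*; do
        # Skip the literal pattern when the directory is empty.
        [ -e "$entry" ] || continue
        # Each stat is a separate, serialized metadata lookup.
        stat "$entry" > /dev/null && count=$((count + 1))
    done
    # Print how many entries were stat'ed.
    echo "$count"
}
```

On a network filesystem each of those stat calls waits for its own round
trip unless something issues the requests ahead of time.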

This is known bug 15175 in our Bugzilla and is being worked on.  You can
disable statahead on the clients until this is resolved:

	echo 0 > /proc/fs/lustre/llite/*/statahead_count
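
If a client has more than one Lustre filesystem mounted, the wildcard
matches several files, and some shells (bash, for one) reject that as an
"ambiguous redirect". A minimal sketch of a loop that writes to each file
separately follows. The function name disable_statahead and its optional
base-directory argument are assumptions made for the example; only the
/proc path comes from the command above. It has to run as root, and since
the setting lives in /proc it presumably needs to be reapplied after a
remount or reboot.

```shell
# Hypothetical helper: write 0 to every statahead_count file under the
# llite proc directory. The base directory can be overridden for testing.
disable_statahead() {
    base=${1:-/proc/fs/lustre/llite}
    for f in "$base"/*/statahead_count; do
        # Skip the literal pattern when no Lustre filesystem is mounted.
        [ -e "$f" ] || continue
        echo 0 > "$f" || return 1
        # Show the value now in effect for this mount.
        echo "$f: $(cat "$f")"
    done
}
```

Calling disable_statahead with no argument acts on the default /proc path.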

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.