[Lustre-discuss] Timeouts and Dumps

Denise Hummel denise_hummel at nrel.gov
Mon Dec 29 07:32:50 PST 2008


Hi;

Thanks for all of the help and suggestions.  I was able to narrow the
problem down to two racks of Intel nodes.  Checking the HP ProCurve
switches in each rack showed that LACP was turned on for all of the
ports, which caused ports to be randomly blocked by LACP, with off-line
and on-line messages.  Lustre was being accurate: the nodes were timing
out (repeatedly), and I think that caused a cascade effect on the rest
of the nodes.  I turned off LACP on Dec. 24 and have not had a Lustre
timeout since.
Again, thanks for all of your help,
Denise
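
A rough way to spot this kind of port flapping is to count off-line
transitions per port in a saved copy of the switch event log.  The
sketch below assumes log lines containing "port <N> is now off-line";
that line format and the function name count_port_flaps are assumptions
for illustration, so adjust the pattern to whatever your switch
actually prints.

```shell
# Count off-line transitions per port in a switch event log read on stdin.
# ASSUMPTION: each relevant line contains "port <N> is now off-line".
# Prints one "<port> <count>" pair per line, sorted by port name.
count_port_flaps() {
    awk '
        /port [^ ]+ is now off-line/ {
            # Find the standalone word "port"; the next field is the port id.
            for (i = 1; i < NF; i++)
                if ($i == "port") { flaps[$(i + 1)]++; break }
        }
        END { for (p in flaps) print p, flaps[p] }
    ' | sort -k1,1
}
```

Usage would be something like "count_port_flaps < switch-events.log",
where switch-events.log is a hypothetical file holding the saved log.
Ports with a high count are the ones worth checking first.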

On Tue, 2008-12-23 at 20:04 -0500, Isaac Huang wrote:
> On Tue, Dec 23, 2008 at 06:45:09AM -0700, Denise Hummel wrote:
> > Hi;
> > 
> > Thanks.  I have suspected the network, however have not been able to
> > pinpoint the problem.  I have looked at the ethernet and infiniband
> > switches - found a few with IGMP turned on and some multicast issues.
> > Those have been fixed.  I am checking the network stats on the oss, mdt
> > and nodes and find a few dropped packets, however the system stats do
> > not indicate a heavy load during that time.  If anyone has any
> > suggestions on anything else I can look at please let me know.
> 
> Turning on console logging of low-level network errors may reveal
> something useful:
> 
> echo +neterror > /proc/sys/lnet/printk
> 
> Isaac



