[Lustre-discuss] Timeouts and Dumps

Denise Hummel denise_hummel at nrel.gov
Tue Dec 23 05:45:09 PST 2008


Hi,

Thanks.  I have suspected the network, but have not been able to
pinpoint the problem.  I looked at the Ethernet and InfiniBand
switches and found a few with IGMP turned on and some multicast issues;
those have been fixed.  I am now checking the network stats on the OSS, MDT
and nodes and see a few dropped packets, although the system stats do
not indicate a heavy load during that time.  If anyone has
suggestions on anything else I can look at, please let me know.
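
For comparing dropped-packet counts across the servers, something like the following could be run on each node. It is only a minimal sketch: it assumes Linux's /proc/net/dev with the usual 16 counter columns per interface, and the function names (parse_net_dev, diff_counters) are made up for illustration, not part of any Lustre tool.

```python
def parse_net_dev(text):
    """Return {interface: counters} parsed from the contents of /proc/net/dev.

    Assumes the standard layout: two header lines, then one line per
    interface with 8 receive counters followed by 8 transmit counters.
    """
    stats = {}
    for line in text.splitlines()[2:]:  # skip the two header lines
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = [int(x) for x in rest.split()]
        if len(fields) < 16:
            continue
        stats[name.strip()] = {
            "rx_packets": fields[1],
            "rx_errs": fields[2],
            "rx_drop": fields[3],
            "tx_packets": fields[9],
            "tx_errs": fields[10],
            "tx_drop": fields[11],
        }
    return stats


def diff_counters(before, after):
    """Per-interface increase of each counter between two snapshots."""
    return {
        iface: {key: after[iface][key] - before[iface][key] for key in after[iface]}
        for iface in after
        if iface in before
    }
```

Taking one snapshot before and one after a period that includes an eviction (for example by reading /proc/net/dev twice and passing both results to diff_counters) would show whether the drops actually grow during the timeouts or are just old totals.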

Thanks for all of your help,
Denise

On Mon, 2008-12-22 at 20:49 -0700, Andreas Dilger wrote:
> On Dec 22, 2008  13:22 -0700, Denise Hummel wrote:
> > Dec 22 13:00:44 oss1 kernel: LustreError: 138-a: lustre-OST0000: A
> > client on nid 172.16.100.1@tcp was evicted due to a lock blocking
> > callback to 172.16.100.1@tcp timed out: rc -107
> > Dec 22 13:00:44 oss1 kernel: LustreError:
> > 27250:0:(ost_handler.c:1065:ost_brw_write()) @@@ Eviction on bulk GET
> > req@00000100bff5c800 x91545/t0
> > 27250:0:(ost_handler.c:1205:ost_brw_write()) lustre-OST0000: ignoring
> > bulk IO comm error with
> 
> These messages could relate to network problems on the oss1 node.  That
> said, this is most interesting if only oss1 is showing these messages.
> In particular "eviction on bulk GET" indicates the network stopped working
> in the middle of a data transfer.
> 
> 
> > The messages in the syslog on the login node are:
> > lustre-OST0000-osc-000001018197f800: Connection to service
> > lustre-OST0000 via nid 172.16.100.41@tcp was lost; in progress
> > operations using this service will wait for recovery to complete.
> 
> This is just the client's version of the same issue.
> 
> 
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
> 
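
One way to check whether only oss1 is logging these evictions is to tally the eviction messages per server and client NID across the collected syslogs. The sketch below assumes each message sits on a single line in the same format as the sample quoted above (timestamp, hostname, "kernel:", then "A client on nid ... was evicted"); the regular expression and the tally_evictions name are illustrative assumptions, not anything provided by Lustre.

```python
import re
from collections import Counter

# Matches lines like:
#   Dec 22 13:00:44 oss1 kernel: LustreError: 138-a: lustre-OST0000: A client
#   on nid 172.16.100.1@tcp was evicted ...   (all on one line in syslog)
# Also accepts the " at " form that mail archives substitute for "@".
EVICTION = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+(?P<host>\S+)\s+kernel:.*"
    r"client on nid (?P<addr>[\w.]+)(?:@| at )(?P<net>\w+) was evicted"
)


def tally_evictions(lines):
    """Count eviction messages keyed by (logging host, client NID)."""
    counts = Counter()
    for line in lines:
        match = EVICTION.search(line)
        if match:
            nid = f"{match['addr']}@{match['net']}"
            counts[(match["host"], nid)] += 1
    return counts
```

If every key in the result names the same server, that points at that server's network path; if the same client NID appears under several servers, the client side is the more likely suspect.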



