[Lustre-discuss] Intermittent "obd_ping operation failed" errors

Wang Yibin Yibin.Wang at Sun.COM
Mon Mar 23 20:26:00 PDT 2009


It looks like the network of lstr-ter-OST0003 (IP: 172.16.103.26) is
quite flaky. Can you please check the stability of this particular OST?
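
To help judge how regular the failures are, the gaps between them can be
measured from the client log quoted below. The following is only a minimal
sketch in Python: the helper name ping_failure_gaps and the pattern EVENT are
made up for illustration, the year is assumed because syslog omits it, and the
"LustreError: 11-0" prefix is assumed to mark each failed obd_ping (that
prefix can also introduce failures of other operations). The -107 in those
messages is ENOTCONN on Linux.

```python
import re
from datetime import datetime

# Hypothetical pattern: the syslog timestamp on lines that open a
# "LustreError: 11-0" event. Assumption: in this log every such event is a
# failed obd_ping; the operation name itself sits on a wrapped line.
EVENT = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: LustreError: 11-0:"
)

def ping_failure_gaps(lines, year=2009):
    """Return the gaps, in seconds, between consecutive 11-0 events."""
    times = []
    for line in lines:
        m = EVENT.match(line)
        if m:
            # syslog timestamps carry no year, so one is supplied (assumed)
            stamp = f"{year} {m.group(1)}"
            times.append(datetime.strptime(stamp, "%Y %b %d %H:%M:%S"))
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]

# First line of each event, copied from the quoted log excerpt.
sample = [
    "Mar 18 10:26:39 ts-nrel-01 kernel: LustreError: 11-0: an error",
    "Mar 18 10:26:39 ts-nrel-01 kernel: LustreError: 167-0: This client was",
    "Mar 18 10:30:49 ts-nrel-01 kernel: LustreError: 11-0: an error",
    "Mar 18 10:34:59 ts-nrel-01 kernel: LustreError: 11-0: an error",
]

print(ping_failure_gaps(sample))  # [250, 250]
```

In the quoted excerpt the three failures are exactly 250 seconds apart, which
may be worth keeping in mind while checking the OST and its network path.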

On Mon, 2009-03-23 at 11:46 -0400, Roger Spellman wrote:
> I am seeing the following error, multiple times on clients trying to
> talk to a particular OST.  The errors are intermittent:  I get five to
> ten every few seconds, then none for several hours (or even several
> days), then five to ten again.
> 
>  
> 
> The servers are running Lustre 1.6.6, and are configured for IB and
> tcp (over eth0).  There are 20 IB clients, and 200 tcp/eth0 clients.
>  The system was relatively quiet while these errors were occurring.  
> 
>  
> 
> modprobe.conf contains:
> 
> options ib_mthca msi_x=1
> 
> options lnet networks="tcp0(eth0),o2ib(ib0)"
> 
> options ko2iblnd ipif_name=ib0
> 
>  
> 
> Any idea what causes this, and how to resolve it?
> 
>  
> 
> Thanks.
> 
>  
> 
> Mar 18 10:26:39 ts-nrel-01 kernel: LustreError: 11-0: an error
> occurred while communicating with 172.16.103.26@tcp. The obd_ping
> operation failed with -107
> 
> Mar 18 10:26:39 ts-nrel-01 kernel: Lustre: lstr-ter-OST0003-osc:
> Connection to service lstr-ter-OST0003 via nid 172.16.103.26@tcp was
> lost; in progress operations using this service will wait for recovery
> to complete.
> 
> Mar 18 10:26:39 ts-nrel-01 kernel: LustreError: 167-0: This client was
> evicted by lstr-ter-OST0003; in progress operations using this service
> will fail.
> 
> Mar 18 10:26:39 ts-nrel-01 kernel: Lustre: lstr-ter-OST0003-osc:
> Connection restored to service lstr-ter-OST0003 using nid
> 172.16.103.26@tcp.
> 
> Mar 18 10:26:39 ts-nrel-01 kernel: Lustre: MDS lstr-ter-MDT0000:
> lstr-ter-OST0003_UUID now active, resetting orphans
> 
> Mar 18 10:30:49 ts-nrel-01 kernel: LustreError: 11-0: an error
> occurred while communicating with 172.16.103.26@tcp. The obd_ping
> operation failed with -107
> 
> Mar 18 10:30:49 ts-nrel-01 kernel: Lustre: lstr-ter-OST0003-osc:
> Connection to service lstr-ter-OST0003 via nid 172.16.103.26@tcp was
> lost; in progress operations using this service will wait for recovery
> to complete.
> 
> Mar 18 10:30:49 ts-nrel-01 kernel: LustreError: 167-0: This client was
> evicted by lstr-ter-OST0003; in progress operations using this service
> will fail.
> 
> Mar 18 10:30:49 ts-nrel-01 kernel: Lustre: lstr-ter-OST0003-osc:
> Connection restored to service lstr-ter-OST0003 using nid
> 172.16.103.26@tcp.
> 
> Mar 18 10:30:49 ts-nrel-01 kernel: Lustre: MDS lstr-ter-MDT0000:
> lstr-ter-OST0003_UUID now active, resetting orphans
> 
> Mar 18 10:34:59 ts-nrel-01 kernel: LustreError: 11-0: an error
> occurred while communicating with 172.16.103.26@tcp. The obd_ping
> operation failed with -107
> 
> Mar 18 10:34:59 ts-nrel-01 kernel: Lustre: lstr-ter-OST0003-osc:
> Connection to service lstr-ter-OST0003 via nid 172.16.103.26@tcp was
> lost; in progress operations using this service will wait for recovery
> to complete.
> 
> Mar 18 10:34:59 ts-nrel-01 kernel: LustreError: 167-0: This client was
> evicted by lstr-ter-OST0003; in progress operations using this service
> will fail.
> 
> Mar 18 10:34:59 ts-nrel-01 kernel: Lustre: lstr-ter-OST0003-osc:
> Connection restored to service lstr-ter-OST0003 using nid
> 172.16.103.26@tcp.
> 
> Mar 18 10:34:59 ts-nrel-01 kernel: Lustre: MDS lstr-ter-MDT0000:
> lstr-ter-OST0003_UUID now active, resetting orphans
> 
>  
> 
> Roger Spellman
> 
> Staff Engineer
> 
> Terascala, Inc.
> 
> 508-588-1501
> 
> www.terascala.com <http://www.terascala.com/>
> 
>  
> 
>  
> 
>  
> 
> 



