[Lustre-discuss] Intermittent "obd_ping operation failed" errors
Wang Yibin
Yibin.Wang at Sun.COM
Mon Mar 23 20:26:00 PDT 2009
It looks like the network connection to lstr-ter-OST0003 (IP: 172.16.103.26)
is quite flaky. Can you please check the stability of this particular OST?
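
To judge whether the failures are random or recur on a schedule, it may help to measure the time between successive obd_ping failures in the client syslog. Below is a minimal Python sketch of that idea. The function name ping_failure_gaps, the regular expression, and the default year are illustrative assumptions, not part of any Lustre tool. It also assumes each syslog record sits on a single line, not wrapped as in the quoted mail.

```python
import re
from datetime import datetime

# Hypothetical pattern: captures the syslog timestamp of an obd_ping failure line.
PING_FAIL = re.compile(
    r"^(\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) .*The obd_ping operation failed with -\d+"
)


def ping_failure_gaps(lines, year=2009):
    """Return the gaps, in seconds, between successive obd_ping failures.

    Syslog timestamps carry no year, so one is assumed (default 2009).
    """
    times = []
    for line in lines:
        match = PING_FAIL.search(line)
        if match:
            stamp = f"{year} {match.group(1)}"
            times.append(datetime.strptime(stamp, "%Y %b %d %H:%M:%S"))
    # Pair each failure with the next one and take the difference.
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]
```

Fed the three obd_ping failure lines quoted below (10:26:39, 10:30:49 and 10:34:59), this gives gaps of 250 seconds each. Running it over a longer stretch of the log would show whether that spacing holds.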
On Mon, 2009-03-23 at 11:46 -0400, Roger Spellman wrote:
> I am seeing the following error multiple times on clients trying to
> talk to a particular OST. The errors are intermittent: I get five to
> ten every few seconds, then none for several hours (or even several
> days), then five to ten again.
>
>
>
> The servers are running Lustre 1.6.6, and are configured for IB and
> tcp (over eth0). There are 20 IB clients, and 200 tcp/eth0 clients.
> The system was relatively quiet while these errors were occurring.
>
>
>
> modprobe.conf contains:
>
> options ib_mthca msi_x=1
>
> options lnet networks="tcp0(eth0),o2ib(ib0)"
>
> options ko2iblnd ipif_name=ib0
>
>
>
> Any idea what causes this, and how to resolve it?
>
>
>
> Thanks.
>
>
>
> Mar 18 10:26:39 ts-nrel-01 kernel: LustreError: 11-0: an error
> occurred while communicating with 172.16.103.26@tcp. The obd_ping
> operation failed with -107
>
> Mar 18 10:26:39 ts-nrel-01 kernel: Lustre: lstr-ter-OST0003-osc:
> Connection to service lstr-ter-OST0003 via nid 172.16.103.26@tcp was
> lost; in progress operations using this service will wait for recovery
> to complete.
>
> Mar 18 10:26:39 ts-nrel-01 kernel: LustreError: 167-0: This client was
> evicted by lstr-ter-OST0003; in progress operations using this service
> will fail.
>
> Mar 18 10:26:39 ts-nrel-01 kernel: Lustre: lstr-ter-OST0003-osc:
> Connection restored to service lstr-ter-OST0003 using nid
> 172.16.103.26@tcp.
>
> Mar 18 10:26:39 ts-nrel-01 kernel: Lustre: MDS lstr-ter-MDT0000:
> lstr-ter-OST0003_UUID now active, resetting orphans
>
> Mar 18 10:30:49 ts-nrel-01 kernel: LustreError: 11-0: an error
> occurred while communicating with 172.16.103.26@tcp. The obd_ping
> operation failed with -107
>
> Mar 18 10:30:49 ts-nrel-01 kernel: Lustre: lstr-ter-OST0003-osc:
> Connection to service lstr-ter-OST0003 via nid 172.16.103.26@tcp was
> lost; in progress operations using this service will wait for recovery
> to complete.
>
> Mar 18 10:30:49 ts-nrel-01 kernel: LustreError: 167-0: This client was
> evicted by lstr-ter-OST0003; in progress operations using this service
> will fail.
>
> Mar 18 10:30:49 ts-nrel-01 kernel: Lustre: lstr-ter-OST0003-osc:
> Connection restored to service lstr-ter-OST0003 using nid
> 172.16.103.26@tcp.
>
> Mar 18 10:30:49 ts-nrel-01 kernel: Lustre: MDS lstr-ter-MDT0000:
> lstr-ter-OST0003_UUID now active, resetting orphans
>
> Mar 18 10:34:59 ts-nrel-01 kernel: LustreError: 11-0: an error
> occurred while communicating with 172.16.103.26@tcp. The obd_ping
> operation failed with -107
>
> Mar 18 10:34:59 ts-nrel-01 kernel: Lustre: lstr-ter-OST0003-osc:
> Connection to service lstr-ter-OST0003 via nid 172.16.103.26@tcp was
> lost; in progress operations using this service will wait for recovery
> to complete.
>
> Mar 18 10:34:59 ts-nrel-01 kernel: LustreError: 167-0: This client was
> evicted by lstr-ter-OST0003; in progress operations using this service
> will fail.
>
> Mar 18 10:34:59 ts-nrel-01 kernel: Lustre: lstr-ter-OST0003-osc:
> Connection restored to service lstr-ter-OST0003 using nid
> 172.16.103.26@tcp.
>
> Mar 18 10:34:59 ts-nrel-01 kernel: Lustre: MDS lstr-ter-MDT0000:
> lstr-ter-OST0003_UUID now active, resetting orphans
>
>
>
> Roger Spellman
>
> Staff Engineer
>
> Terascala, Inc.
>
> 508-588-1501
>
> www.terascala.com <http://www.terascala.com/>