[Lustre-discuss] Intermittent "obd_ping operation failed" errors

Roger Spellman roger at terascala.com
Mon Mar 23 08:46:28 PDT 2009


I am seeing the following errors repeatedly on clients trying to talk
to a particular OST.  The errors are intermittent:  I get five to ten
every few seconds, then none for several hours (or even several days),
then five to ten again.

 

The servers are running Lustre 1.6.6, and are configured for IB and tcp
(over eth0).  There are 20 IB clients, and 200 tcp/eth0 clients.  The
system was relatively quiet while these errors were occurring.  

 

modprobe.conf contains:

options ib_mthca msi_x=1
options lnet networks="tcp0(eth0),o2ib(ib0)"
options ko2iblnd ipif_name=ib0

 

Any idea what causes this, and how to resolve it?

 

Thanks.

 

Mar 18 10:26:39 ts-nrel-01 kernel: LustreError: 11-0: an error occurred while communicating with 172.16.103.26@tcp. The obd_ping operation failed with -107
Mar 18 10:26:39 ts-nrel-01 kernel: Lustre: lstr-ter-OST0003-osc: Connection to service lstr-ter-OST0003 via nid 172.16.103.26@tcp was lost; in progress operations using this service will wait for recovery to complete.
Mar 18 10:26:39 ts-nrel-01 kernel: LustreError: 167-0: This client was evicted by lstr-ter-OST0003; in progress operations using this service will fail.
Mar 18 10:26:39 ts-nrel-01 kernel: Lustre: lstr-ter-OST0003-osc: Connection restored to service lstr-ter-OST0003 using nid 172.16.103.26@tcp.
Mar 18 10:26:39 ts-nrel-01 kernel: Lustre: MDS lstr-ter-MDT0000: lstr-ter-OST0003_UUID now active, resetting orphans

Mar 18 10:30:49 ts-nrel-01 kernel: LustreError: 11-0: an error occurred while communicating with 172.16.103.26@tcp. The obd_ping operation failed with -107
Mar 18 10:30:49 ts-nrel-01 kernel: Lustre: lstr-ter-OST0003-osc: Connection to service lstr-ter-OST0003 via nid 172.16.103.26@tcp was lost; in progress operations using this service will wait for recovery to complete.
Mar 18 10:30:49 ts-nrel-01 kernel: LustreError: 167-0: This client was evicted by lstr-ter-OST0003; in progress operations using this service will fail.
Mar 18 10:30:49 ts-nrel-01 kernel: Lustre: lstr-ter-OST0003-osc: Connection restored to service lstr-ter-OST0003 using nid 172.16.103.26@tcp.
Mar 18 10:30:49 ts-nrel-01 kernel: Lustre: MDS lstr-ter-MDT0000: lstr-ter-OST0003_UUID now active, resetting orphans

Mar 18 10:34:59 ts-nrel-01 kernel: LustreError: 11-0: an error occurred while communicating with 172.16.103.26@tcp. The obd_ping operation failed with -107
Mar 18 10:34:59 ts-nrel-01 kernel: Lustre: lstr-ter-OST0003-osc: Connection to service lstr-ter-OST0003 via nid 172.16.103.26@tcp was lost; in progress operations using this service will wait for recovery to complete.
Mar 18 10:34:59 ts-nrel-01 kernel: LustreError: 167-0: This client was evicted by lstr-ter-OST0003; in progress operations using this service will fail.
Mar 18 10:34:59 ts-nrel-01 kernel: Lustre: lstr-ter-OST0003-osc: Connection restored to service lstr-ter-OST0003 using nid 172.16.103.26@tcp.
Mar 18 10:34:59 ts-nrel-01 kernel: Lustre: MDS lstr-ter-MDT0000: lstr-ter-OST0003_UUID now active, resetting orphans
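
For anyone triaging a log like the one above, a minimal sketch (not part of the original report) that pulls out the timestamps of the obd_ping failures and prints the gap between consecutive ones, which makes it easy to see whether the evictions recur on a fixed period. It assumes each syslog entry sits on a single line, that the log year is 2009 (syslog timestamps carry no year), and that the NID appears in the usual address@network form. The names SAMPLE_LOG, ping_failure_times and gaps_in_seconds are hypothetical. On Linux, -107 corresponds to ENOTCONN.

```python
import re
from datetime import datetime

# Three failure lines taken from the log above, one entry per line.
SAMPLE_LOG = """\
Mar 18 10:26:39 ts-nrel-01 kernel: LustreError: 11-0: an error occurred while communicating with 172.16.103.26@tcp. The obd_ping operation failed with -107
Mar 18 10:30:49 ts-nrel-01 kernel: LustreError: 11-0: an error occurred while communicating with 172.16.103.26@tcp. The obd_ping operation failed with -107
Mar 18 10:34:59 ts-nrel-01 kernel: LustreError: 11-0: an error occurred while communicating with 172.16.103.26@tcp. The obd_ping operation failed with -107
"""

# Matches a syslog timestamp followed, later on the same line, by the
# obd_ping failure message and its return code.
PING_FAILURE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:.*"
    r"obd_ping operation failed with (?P<rc>-?\d+)"
)


def ping_failure_times(log_text, year=2009):
    """Return the datetime of every obd_ping failure line in log_text.

    The year is an assumption supplied by the caller because syslog
    timestamps do not include one.
    """
    times = []
    for line in log_text.splitlines():
        match = PING_FAILURE.search(line)
        if match:
            stamp = f"{year} {match.group('stamp')}"
            times.append(datetime.strptime(stamp, "%Y %b %d %H:%M:%S"))
    return times


def gaps_in_seconds(times):
    """Return the number of seconds between each consecutive pair of times."""
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    print(gaps_in_seconds(ping_failure_times(SAMPLE_LOG)))
```

Run against the three failures shown here, the gaps come out as 250 seconds each. A fixed period like that may point at a timer-driven cause rather than random network loss, though that is only a hint and not a diagnosis.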

 

Roger Spellman
Staff Engineer
Terascala, Inc.
508-588-1501
www.terascala.com

 

 

 
