[Lustre-discuss] can't find mdc_open_data, but the close succeeded
Nico.Budewitz
Nico.Budewitz at aei.mpg.de
Wed Jan 28 02:14:00 PST 2009
Hi,
I had strange problems this morning:
A parallel application on 256 cores was started and restored 512 GB
from its last checkpoint this morning at 4:03. The application had to
read the last checkpoint (256 x 2 GB) from Lustre into main memory.
It seems the filesystem hung for a while; other applications that were
using Lustre at the same time could no longer open files and crashed
instantly.
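
The restart pattern is roughly the following (a simplified sketch of
what the application does, not the actual code; the path and file
naming are made up for illustration):

    /* Each of the 256 ranks opens and reads its own ~2 GB checkpoint
     * file at the same moment, so the MDS sees a burst of 256
     * concurrent opens while the OSTs serve 512 GB of reads. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char path[256];
        /* hypothetical naming scheme, one checkpoint file per rank */
        snprintf(path, sizeof(path),
                 "/lustre/checkpoints/ckpt_%04d.dat", rank);

        FILE *f = fopen(path, "rb");  /* all opens hit the MDS at once */
        if (f != NULL) {
            char *buf = malloc(1 << 20);
            /* pull the whole file into memory in 1 MiB chunks */
            while (buf != NULL && fread(buf, 1, 1 << 20, f) > 0)
                ;
            free(buf);
            fclose(f);
        }

        MPI_Finalize();
        return 0;
    }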
I received these error messages on the master node of the 256-core job:
Jan 28 04:04:07 node005 kernel: Lustre: Request x1138068 sent from aeifs2-MDT0000-mdc-ffff81021db45400 to NID 10.13.32.12@o2ib 8s ago has timed out (limit 8s).
Jan 28 04:04:24 node005 kernel: Lustre: 8957:0:(import.c:410:import_select_connection()) aeifs2-MDT0000-mdc-ffff81021db45400: tried all connections, increasing latency to 6s
Jan 28 04:04:24 node005 kernel: Lustre: Changing connection for aeifs2-MDT0000-mdc-ffff81021db45400 to 10.13.32.11@o2ib/10.13.32.11@o2ib
Jan 28 04:04:24 node005 kernel: LustreError: 167-0: This client was evicted by aeifs2-MDT0000; in progress operations using this service will fail.
Jan 28 04:04:24 node005 kernel: LustreError: 9022:0:(file.c:116:ll_close_inode_openhandle()) inode 4262393 mdc close failed: rc = -5
Jan 28 04:04:24 node005 kernel: LustreError: 10503:0:(client.c:722:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff8101f337be00 x1138066/t0 o35->aeifs2-MDT0000_UUID@10.13.32.11@o2ib:23/10 lens 296/1248 e 0 to 6 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
Jan 28 04:04:24 node005 kernel: LustreError: 9022:0:(file.c:116:ll_close_inode_openhandle()) Skipped 1 previous similar message
Jan 28 04:04:24 node005 kernel: LustreError: 15629:0:(mdc_locks.c:598:mdc_enqueue()) ldlm_cli_enqueue: -5
Jan 28 04:04:24 node005 kernel: Lustre: aeifs2-MDT0000-mdc-ffff81021db45400: Connection restored to service aeifs2-MDT0000 using nid 10.13.32.11@o2ib.
Jan 28 04:04:25 node005 kernel: LustreError: 15629:0:(mdc_request.c:741:mdc_close()) Unexpected: can't find mdc_open_data, but the close succeeded. Please tell <http://bugzilla.lustre.org/>.
Jan 28 04:04:25 node005 kernel: LustreError: 10503:0:(mdc_request.c:741:mdc_close()) Unexpected: can't find mdc_open_data, but the close succeeded. Please tell <http://bugzilla.lustre.org/>.
Jan 28 04:04:26 node005 kernel: LustreError: 15629:0:(mdc_request.c:741:mdc_close()) Unexpected: can't find mdc_open_data, but the close succeeded. Please tell <http://bugzilla.lustre.org/>.
Jan 28 04:04:27 node005 kernel: LustreError: 11-0: an error occurred while communicating with 10.13.32.11@o2ib. The mds_close operation failed with -116
Jan 28 04:04:27 node005 kernel: LustreError: 10506:0:(file.c:116:ll_close_inode_openhandle()) inode 7747211 mdc close failed: rc = -116
Jan 28 04:04:27 node005 kernel: LustreError: 10506:0:(file.c:116:ll_close_inode_openhandle()) Skipped 25 previous similar messages
I am using Lustre 1.6.6 with adaptive timeouts and OFED 1.3.1. Do you
have an idea what is going wrong? It looks like an IB network problem,
or are the OSTs too slow in this particular case? I have never had
problems before.
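
In case it helps, this is how the adaptive timeout settings can be
dumped on a client (a minimal helper, assuming the standard /proc
entries of the 1.6 adaptive timeout code; adjust the paths if your
build differs):

    #include <stdio.h>

    /* print one /proc tunable as "path value" */
    static void show(const char *path)
    {
        char buf[64];
        FILE *f = fopen(path, "r");
        if (f != NULL) {
            if (fgets(buf, sizeof(buf), f))
                printf("%-28s %s", path, buf);
            fclose(f);
        }
    }

    int main(void)
    {
        show("/proc/sys/lustre/timeout");     /* static obd_timeout */
        show("/proc/sys/lustre/at_min");      /* adaptive timeout floor */
        show("/proc/sys/lustre/at_max");      /* adaptive timeout ceiling */
        show("/proc/sys/lustre/at_history");  /* measurement window (s) */
        return 0;
    }

If I understand the adaptive timeout code correctly, the "limit 8s" in
the messages above is the adaptive value in effect at that moment, not
the static obd_timeout.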
So long!
"Du skal ikke gå over bekken etter vann"
Nico Budewitz
Cluster Administrator
Max-Planck-Institute for Gravitational Physics / Albert-Einstein-Institute
Am Muehlenberg 1, 14476 Golm
Tel.: +49 (0)331 567 7364 Fax: +49 (0)331 567 7297
http://supercomputers.aei.mpg.de