[Lustre-discuss] concurrent open() fails sporadically
Michael Sternberg
sternberg at anl.gov
Wed Oct 28 16:00:14 PDT 2009
On Oct 28, 2009, at 15:47 , Brian J. Murrell wrote:
> On Wed, 2009-10-28 at 15:38 -0500, Michael Sternberg wrote:
>> I'm seeing open() failures when attempting concurrent access in a
>> lustre fs.
>> [..]
>> A C version never failed (thus far):
>
> This might be indicative. Maybe not. Fortran might just be exposing a
> race condition that the C version is not.
> [..]
> What would be ideal is an strace of the fortran program failing so that
> we can see what the system calls did.
Great suggestion! Turns out the file in question has mode 0440, but
since the Fortran OPEN statement does not specify an access mode, the
runtime first tries to open read-write, and only then read-only.
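For readers who think in C, the fallback described above can be sketched roughly as follows. This is only an illustration of the behaviour visible in the strace output, not the Fortran runtime's actual source; the function name open_unspecified is hypothetical, and retrying only on EACCES is an assumption drawn from the trace.

```c
#include <errno.h>
#include <fcntl.h>

/* Hypothetical helper: open a file whose access mode was not specified.
 * Try read-write first; fall back to read-only only when the failure
 * is a permission error (assumption based on the trace). */
int open_unspecified(const char *path)
{
    int fd = open(path, O_RDWR);
    if (fd >= 0 || errno != EACCES)
        return fd;  /* success, or an error treated as fatal (e.g. ENOENT) */

    /* Permission denied for read-write: retry read-only. */
    return open(path, O_RDONLY);
}
```

With logic like this, a spurious ENOENT from the first open() ends the attempt immediately, while EACCES leads to the read-only retry.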
I'm using:
mpirun -np 2 bash -c 'strace -tt ./a.out 2> strace7-$$.err' > strace7.out
Here's a failure case where the first process fails, and the second
succeeds. The difference is that in the first process the initial
open(.., O_RDWR) returns with ENOENT (fatal) vs. EACCES (will
retry). If the timestamps can be trusted, the failing open() comes
0.1 ms *after* the succeeding PID's open(.., O_RDONLY).
$ tail -n 15 strace7*
==> strace7-10831.err <==
17:27:42.630621 fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0
17:27:42.630686 fstat(0, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
17:27:42.630753 open("test.dat", O_RDWR) = -1 ENOENT (No such file or directory)
17:27:42.631044 write(2, "At line ", 8At line ) = 8
17:27:42.631111 write(2, "2", 12) = 1
17:27:42.631171 write(2, " of file ", 9 of file ) = 9
17:27:42.631248 write(2, "test.f", 6test.f) = 6
17:27:42.631322 write(2, "\n", 1
) = 1
17:27:42.631385 write(2, "Fortran runtime error: ", 23Fortran runtime error: ) = 23
17:27:42.631443 write(2, "No such file or directory", 25No such file or directory) = 25
17:27:42.631500 write(2, "\n", 1
) = 1
17:27:42.631563 close(0) = 0
17:27:42.631615 exit_group(2) = ?
==> strace7-10832.err <==
17:27:42.629790 fstat(2, {st_mode=S_IFREG|0664, st_size=5542, ...}) = 0
17:27:42.629984 ioctl(2, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fff8a624490) = -1 ENOTTY (Inappropriate ioctl for device)
17:27:42.630076 stat("test.dat", {st_mode=S_IFREG|0440, st_size=805891, ...}) = 0
17:27:42.630163 fstat(2, {st_mode=S_IFREG|0664, st_size=5813, ...}) = 0
17:27:42.630235 fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 3), ...}) = 0
17:27:42.630299 fstat(0, {st_mode=S_IFCHR|0666, st_rdev=makedev(1, 3), ...}) = 0
17:27:42.630364 open("test.dat", O_RDWR) = -1 EACCES (Permission denied)
17:27:42.630648 open("test.dat", O_RDONLY) = 3
17:27:42.630921 fstat(3, {st_mode=S_IFREG|0440, st_size=805891, ...}) = 0
17:27:42.630998 ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fff8a623240) = -1 ENOTTY (Inappropriate ioctl for device)
17:27:42.631055 close(3) = 0
17:27:42.631133 write(1, " OK", 3) = 3
17:27:42.631193 write(1, "\n", 1) = 1
17:27:42.631252 close(0) = 0
17:27:42.631331 exit_group(0) = ?
A workaround for my user is to either "chmod u+w datafile" or, more
cleanly, be explicit in the Fortran open() by saying ACTION='READ'.
With best regards,
Michael