[Lustre-discuss] open() ENOENT bug

Robin Humble rjh+lustre at cita.utoronto.ca
Wed Oct 29 22:35:03 PDT 2008


Hi,

we have a user whose fortran runs, when started simultaneously, fail
about 10% of the time because Lustre sometimes returns ENOENT instead
of EACCES to a read-write open() request on a read-only file.

you might think this is an odd thing to want to do, but a standard
fortran open(3,file='file',status='old') always does a rw open() before
a ro open(), as well as a pile of other bizarre stuff. the rw open
should always fail with EACCES, and the subsequent ro open succeeds.
however if ENOENT is returned instead of EACCES then fortran exits.
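
as a rough illustration of the sequence described above, here is a minimal C sketch of a rw open that falls back to ro only on EACCES. the function name open_old_file is hypothetical and this is not taken from the gfortran or ifort runtime source; it only assumes the behaviour described in this mail (any errno other than EACCES on the rw open is treated as fatal).

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper, not from any fortran runtime: try a read-write
 * open first and fall back to read-only only when the failure is EACCES.
 * Returns a file descriptor, or -1 with errno set by the failing open(). */
int open_old_file(const char *path)
{
    int fd = open(path, O_RDWR);
    if (fd >= 0)
        return fd;              /* file was writable after all */
    if (errno != EACCES)
        return -1;              /* a spurious ENOENT ends up here */
    return open(path, O_RDONLY);
}
```

with this logic an ENOENT from the first open() never reaches the read-only fallback, which matches the failure the user sees.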

attached is a minimal C code that triggers the problem by emulating a
subset of gfortran/ifort's open() statement and checking for EACCES.

standard Sun kernel 2.6.18-53.1.14.el5_lustre.1.6.5.1smp on IB and GigE
appears to have this problem, as does a patched RHEL with 1.6.4.2 on
GigE, and all RHEL and kernel.org patchless kernels we've tested
including with Lustre cvs b1_6 from a couple of days ago. turning off
statahead doesn't change anything.

I couldn't find anything in bz.

steps to reproduce:
  # on a Lustre fs
  gcc -o openFileMinimal openFileMinimal.c
  touch file
  chmod 400 file
  ./openFileMinimal file & ./openFileMinimal file &

the correct result would be no output (all opens fail with EACCES).

typically one of the 2 processes gets ENOENT, e.g.:
   % ./openFileMinimal file & ./openFileMinimal file &
  [1] 4948
  [2] 4949
   % "No such file or directory" on existing file. loop 2/1000
  [2]+  Exit 1                  ./openFileMinimal file
  [1]-  Done                    ./openFileMinimal file
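
the two-process launch above can also be driven from a single binary. the sketch below is an assumption-laden variant of the attached test, not a replacement for it: count_bad_errno and probe are hypothetical names, and the child exit codes (0 = only EACCES seen, 1 = some other errno, 2 = rw open succeeded) follow the attached program's convention.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Child body: repeat stat() + rw open() and classify the outcome.
 * 0 = every open failed with EACCES, 1 = some other errno,
 * 2 = the rw open unexpectedly succeeded. */
static int probe(const char *path, int loops)
{
    struct stat sb;
    for (int i = 0; i < loops; i++) {
        (void)stat(path, &sb);          /* result ignored, as in the attached test */
        int fd = open(path, O_RDWR);
        if (fd >= 0) {
            close(fd);
            return 2;
        }
        if (errno != EACCES)
            return 1;
    }
    return 0;
}

/* Fork nproc children running probe() concurrently and return how many
 * of them saw an errno other than EACCES. */
int count_bad_errno(const char *path, int nproc, int loops)
{
    int bad = 0, started = 0;
    for (int p = 0; p < nproc; p++) {
        pid_t pid = fork();
        if (pid == 0)
            _exit(probe(path, loops));
        if (pid > 0)
            started++;
    }
    for (int p = 0; p < started; p++) {
        int status;
        if (wait(&status) > 0 && WIFEXITED(status) && WEXITSTATUS(status) == 1)
            bad++;
    }
    return bad;
}
```

calling count_bad_errno("file", 2, 1000) on the chmod 400 file should return 0 on a filesystem that behaves correctly; a non-zero return would correspond to the ENOENT case shown above.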

let me know if you want me to create a bz entry.

cheers,
robin
-------------- next part --------------
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <string.h>

int main(int argc, char **argv) {
   int i, f;
   struct stat b;

   if ( argc < 2 ) {
      printf( "run as an ordinary user on a file without rw access, like:\n" );
      printf( "  %s /lustre/some/root/owned/file\n", argv[0] );
      printf( "... and then run 2 copies and watch for failures\n" );
      exit(0);
   }

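   for ( i=0; i<1000; i++ ) {
      stat( argv[1], &b );   // return value ignored; part of the emulated open() sequence
      f = open( argv[1], O_RDWR );   // rw open first; expected to fail with EACCES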
      if ( f < 0 ) {
         if ( errno != EACCES ) { // EACCES is correct. other errors aren't
            printf( "\"%s\" on existing file. loop %d/1000\n", strerror(errno), i );
            exit(1);
         }
      }
      else {
         printf( "rw open succeeded - shouldn't happen - this test needs to be "
                 "run on a file with read-only access (eg. chmod 400, not owned "
                 "by you, or on a read-only Lustre fs).\n" );
         exit(2);
      }
   }
   exit(0);
}

