[Lustre-discuss] open() ENOENT bug

Robin Humble rjh+lustre at cita.utoronto.ca
Sun Nov 2 17:30:23 PST 2008


On Thu, Oct 30, 2008 at 02:05:57PM +0100, Peter Kjellstrom wrote:
>On Thursday 30 October 2008, Brian J. Murrell wrote:
>> On Thu, 2008-10-30 at 01:35 -0400, Robin Humble wrote:
>> > we have a user with simultaneously starting fortran runs that fail
>> > about 10% of the time because Lustre sometimes returns ENOENT instead
>> > of EACCES to an open() request on a read-only file.
>>
>> I can reproduce this on 1.6.6 as well using your reproducer.
>
>We have also seen this bug on our systems (reported by a user running a 
>Fortran code). We have servers with both 1.4 
>(2.6.9-55.0.9.EL_lustre.1.4.11.1smp) and 1.6 
>(2.6.18-8.1.14.el5_lustre.1.6.4.2smp) lustre.
>
>The error is seen towards both server versions from a cluster with patchless 
>1.6.5.1 clients running centos-5.2.x86_64 (2.6.18-92.1.13.el5).
>
>However the error is not seen from another cluster running _patched_ 1.6.5.1 
>on centos-4.x86_64 (2.6.9-67.0.7.EL_lustre.1.6.5.1smp).

I dug up an old 2.6.9-67.0.7.EL_lustre.1.6.5.1 + IB kernel (who'd have
thought it'd boot with a RHEL5 userland!? :-) and you are right - my
openFileMinimal test case runs without problems. i.e. 2.6.9 seems a lot
more robust than 2.6.18 and onwards.

however, when running ~10 copies of the below fortran code with the
above RHEL4 + 1.6.5.1 kernel, several of the copies of the code always
die with:
  Fortran runtime error: Stale NFS file handle

      program blah
      implicit none
      integer i
      do i=1,1000
      open(3,file='file',status='old')
      close(3)
      enddo
      stop
      end

so although my cut-down C reproducer doesn't trigger anything, it
seems Lustre still has issues with the real fortran code. the user's
jobs would probably run ok in this RHEL4 environment though, as they
don't run 10 copies at once.
it's a slightly different variant of the bug as well (different error
code), or maybe it's just a totally different bug.
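
for anyone who wants to poke at this from C, here is a rough sketch of
the open/close loop. it is NOT the openFileMinimal reproducer; the
function name is made up for illustration, and it assumes the fortran
runtime first tries O_RDWR and falls back to O_RDONLY on EACCES (which
would fit the ENOENT-instead-of-EACCES report above). it counts the
iterations where open() fails with anything else, e.g. ENOENT or
ESTALE ("Stale NFS file handle").

```c
/* Sketch only: roughly mimics open(3,file='file',status='old') on a
 * read-only file. Assumption: the runtime tries O_RDWR first and
 * retries with O_RDONLY on EACCES/EROFS. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Open and close `path` `iterations` times. Returns the number of
 * iterations in which no open() attempt succeeded. */
int count_unexpected_open_errors(const char *path, int iterations)
{
    int failures = 0;

    for (int i = 0; i < iterations; i++) {
        int fd = open(path, O_RDWR);
        if (fd < 0 && (errno == EACCES || errno == EROFS)) {
            /* expected for a read-only file: retry read-only */
            fd = open(path, O_RDONLY);
        }
        if (fd < 0) {
            /* on an existing file, ENOENT or ESTALE here is the symptom */
            fprintf(stderr, "iteration %d: open(%s): %s\n",
                    i, path, strerror(errno));
            failures++;
            continue;
        }
        close(fd);
    }
    return failures;
}
```

call it from a small main() with the file name and e.g. 1000
iterations, and start ~10 copies at once on the same read-only file on
the Lustre mount. on a healthy filesystem it should return 0.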

cheers,
robin



>
>/Peter
>
>> Can you file a bug in our bugzilla about it?  Please include your
>> reproducer program.
>>
>> b.


