[Lustre-discuss] I/O errors with NAMD

John Hammond jhammond at ices.utexas.edu
Fri Jul 23 18:47:10 PDT 2010


On 07/23/2010 06:39 PM, Rick Grubin wrote:
>
>> On 2010-07-23, at 11:53, Richard Lefebvre wrote:
>>
>>> If I had some Lustre error, it would give me a clue, but the only
>>> errors the users get is the following traceback on the
>>> application:
>>>
>>> -------------------------------------------------------------------
>>>
>>> Reason: FATAL ERROR: Error on write to binary file
>>> restart/ABCD_les4.950000.vel: Interrupted system call
>>>
>>> Fatal error on PE 0>   FATAL ERROR: Error on write to binary
>>> file restart/ABCD_les4.950000.vel: Interrupted system call
>>>
>> There was a bug just filed on EINTR and flock.  I don't have the
>> number, but a quick search should find it.  No patch as yet, but it
>> would be worthwhile to subscribe to for updates.
>>
>
> Bug 23372
>
> https://bugzilla.lustre.org/show_bug.cgi?id=23372

In the scenario of 23372, -EINTR is not returned, although it should be.

However, there are several places where -EINTR is used (internally) by
the ptlrpc layer.  And, IIRC, there are also some places where ptlrpc
return codes are (inappropriately) used as the return codes for file
operations.  So -EINTR may be generated in ptlrpc because of an eviction
or some other mishap and then returned by write(), even though no signal
was actually delivered during the call.
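
For what it's worth, an application that wants to ride out a spurious
EINTR from write() could wrap the call in a bounded retry loop.  What
follows is only a minimal sketch: write_all_retry() and
MAX_EINTR_RETRIES are names made up for illustration, they are not part
of NAMD or Lustre, and I am assuming that retrying is acceptable for the
file in question.  If the EINTR really comes from an eviction, a retry
may well fail again, which is why the loop gives up after a few
attempts instead of spinning.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical limit on consecutive EINTR retries (an assumption,
 * chosen arbitrarily for this sketch). */
#define MAX_EINTR_RETRIES 5

/*
 * Write exactly len bytes from buf to fd.
 * Retries when write() fails with EINTR, up to MAX_EINTR_RETRIES
 * times in a row, and continues after short writes.
 * Returns len on success, or -1 with errno set on failure.
 */
ssize_t write_all_retry(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    size_t left = len;
    int retries = 0;

    while (left > 0) {
        ssize_t n = write(fd, p, left);

        if (n < 0) {
            if (errno == EINTR && retries < MAX_EINTR_RETRIES) {
                retries++;          /* interrupted: try again */
                continue;
            }
            return -1;              /* real error, or too many EINTRs */
        }
        if (n == 0) {
            errno = EIO;            /* no progress: avoid looping forever */
            return -1;
        }

        p += n;                     /* short write: advance and continue */
        left -= (size_t)n;
        retries = 0;                /* progress was made, reset the count */
    }
    return (ssize_t)len;
}
```

A caller would use this in place of a bare write() and still treat a
-1 return as fatal.  It only papers over the symptom; it does not tell
you why the client returned EINTR in the first place.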

-- 
John L. Hammond, Ph.D.
ICES, The University of Texas at Austin
jhammond at ices.utexas.edu
(512) 471-9304


