[Lustre-discuss] clients gets EINTR from time to time
Francois Chassaing
fch at weborama.com
Fri Feb 25 03:18:30 PST 2011
Thanks, but in any case the logs on the MDS/MGS do not show any evicted clients.
Also, the log produced by lctl debug_kernel on the clients does not show much: I can only see the last administrative actions I took (such as setting a striping policy on a directory, creating a new server pool, ...) and four "Dropping PUT from" messages, which are unrelated because they did not occur at the times my problem happens.
I will continue to parse the debug logs and keep you posted.
Thanks
François Chassaing, Technical Director - CTO, Weborama
----- Original Message -----
From: "Kevin Van Maren" <kevin.van.maren at oracle.com>
To: "DEGREMONT Aurelien" <aurelien.degremont at cea.fr>
Cc: "Francois Chassaing" <fch at weborama.com>, lustre-discuss at lists.lustre.org
Sent: Thursday, February 24, 2011 18:43:25 GMT +01:00 Amsterdam / Berlin / Bern / Rome / Stockholm / Vienna
Subject: Re: [Lustre-discuss] clients gets EINTR from time to time
No, in case of an eviction or IO errors, EIO is returned to the
application, not EINTR.
Kevin
DEGREMONT Aurelien wrote:
> Hello
>
> From my understanding, Lustre can return EINTR for some I/O error cases.
> I think that when a client gets evicted in the middle of one of its RPCs,
> it can return EINTR to the caller.
> Could this explain your issue?
>
> Can you verify that your clients were not evicted at the same time?
>
> Aurélien
>
> Francois Chassaing wrote:
>
>> OK, thanks, that makes it clearer.
>> I had indeed mixed up (in my mind and in my words) signals and error return codes.
>> I did understand that the write()/pwrite() system call was returning the EINTR error code because it received a signal, but I assumed that the signal was sent because of an error condition somewhere in the FS.
>> This is where I now think I was wrong.
>>
>> As for your questions :
>> - I should mention that I have always had this issue, which is why I upgraded from 1.8.4 to 1.8.5, hoping this would solve it.
>> - I will try to have the SA_RESTART flag set in the app... if I can find where the signal handler is set.
>> - How can I see whether Lustre is returning EINTR for some other reason? As I said, the logs show nothing on either the MDS or the OSSs, but I haven't gone through "lctl debug_kernel" yet... which I'm going to do right away...
>>
>> My last question is: how can I tell which signal I am receiving? My app doesn't say; it just dumps out the write/pwrite error code.
>> And if there is no signal handler, then the signal should follow the "standard" actions (as per man 7 signal). On the other hand, my app does not stop or dump core, and the signal is not ignored, so it has to be handled in the code. Correct me if I'm wrong...
>>
>> At that point, you realize that I didn't write the app, nor am I a good Linux guru ;-)
>>
>> Thanks a lot.
>>
>> François Chassaing, Technical Director - CTO, Weborama
>>
>> ----- Original Message -----
>> From: "Ken Hornstein" <kenh at cmf.nrl.navy.mil>
>> To: "Francois Chassaing" <fch at weborama.com>
>> Cc: lustre-discuss at lists.lustre.org
>> Sent: Thursday, February 24, 2011 15:54:24 GMT +01:00 Amsterdam / Berlin / Bern / Rome / Stockholm / Vienna
>> Subject: Re: [Lustre-discuss] clients gets EINTR from time to time
>>
>>
>>
>>> OK, the app is used to dealing with standard disks, which is why it is not
>>> handling the EINTR signal properly.
>>>
>>>
>> I think you're misunderstanding what a "signal" is in the Unix sense.
>>
>> EINTR isn't a signal; it's a return code from the write() system call
>> that says, "Hey, you got a signal in the middle of this write() call
>> and it didn't complete". It doesn't mean that there was an error
>> writing the file; if that was happening, you'd get a (presumably
>> different) error code. Signals can be sent by the operating system,
>> but those signals are things like SIGSEGV, which basically means, "your
>> program screwed up". Programs can also send signals to each other,
>> with kill(2) and the like.
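Because EINTR only says that the call was interrupted, not that the data could not be written, an application can treat it as "try again". The following is a minimal sketch of that pattern, assuming a blocking file descriptor; `write_all` is a hypothetical helper name, not something from the application or from Lustre.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Write all len bytes to fd, retrying when write() is interrupted
 * by a signal. Returns 0 on success, or -1 on a real error with
 * errno left as set by write() (for example EIO). */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before any data moved: retry */
            return -1;          /* genuine failure, report it to the caller */
        }
        p += n;                 /* short write: advance and keep going */
        len -= (size_t)n;
    }
    return 0;
}
```

With a wrapper like this, only errors other than EINTR reach the caller, so an eviction or I/O failure reported as EIO would still be visible.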
>>
>> Now, NORMALLY system calls like write() are interrupted by signals
>> when you're writing to "slow" devices, like network sockets. According
>> to the signal(7) man page, disks are not normally considered slow
>> devices, so I can understand the application not being used to handling
>> this. And you know, now that I think about it I'm not even sure that
>> network filesystems SHOULD allow I/O system calls to be interrupted by
>> signals ... I'd have to think more about it.
>>
>> I suspect what happened is that something changed between 1.8.5 and the
>> previous version of Lustre that you were using that allowed some operations
>> to be interruptible by signals. Some things to try:
>>
>> - Check to see if you are, in fact, receiving a signal in your application
>> and Lustre isn't returning EINTR for some other reason.
>> - If you are receiving a signal, when you set the signal handler for it
>> you could use the SA_RESTART flag to restart the interrupted I/O; I think
>> that would make everything work like it did before.
>>
>> --Ken
>> _______________________________________________
>> Lustre-discuss mailing list
>> Lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>
>>
>