[Lustre-discuss] clients gets EINTR from time to time

Francois Chassaing fch at weborama.com
Fri Feb 25 03:18:30 PST 2011


Thanks, but in any case the logs on the MDS/MGS do not show any evicted clients.
Also, the log output by lctl debug_kernel on the clients does not show much: I can only see the last administrative actions I took (such as setting a striping policy on a directory or creating a new server pool) and four "Dropping PUT from" messages, which are unrelated because they did not occur at the times my problem happens.

I will continue to parse the debug logs and keep you posted.

Thanks

François Chassaing, Technical Director (CTO), Weborama

----- Original Message -----
From: "Kevin Van Maren" <kevin.van.maren at oracle.com>
To: "DEGREMONT Aurelien" <aurelien.degremont at cea.fr>
Cc: "Francois Chassaing" <fch at weborama.com>, lustre-discuss at lists.lustre.org
Sent: Thursday, February 24, 2011 18:43:25 GMT +01:00 Amsterdam / Berlin / Bern / Rome / Stockholm / Vienna
Subject: Re: [Lustre-discuss] clients gets EINTR from time to time

No, in case of an eviction or IO errors, EIO is returned to the 
application, not EINTR.
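
A minimal sketch of how an application could act on this distinction, assuming it funnels its writes through a single helper: retry when write() fails with EINTR, and report anything else (EIO included) as a real failure. The helper name write_fully is hypothetical; it is not part of Lustre or of the application discussed in this thread.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper (not from the thread): write all of buf to fd.
 * EINTR is treated as "interrupted by a signal, try again".
 * Any other error (EIO after an eviction, ENOSPC, ...) is returned
 * to the caller with errno left as write() set it. */
int write_fully(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    assert(buf != NULL || len == 0);
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* nothing was written; retry the call */
            return -1;          /* real failure: errno says which one */
        }
        p += n;                 /* partial write: advance and keep going */
        len -= (size_t)n;
    }
    return 0;
}
```

With a helper like this, an EINTR never reaches the application's error path, while an EIO still does and can be logged as a genuine I/O problem.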

Kevin


DEGREMONT Aurelien wrote:
> Hello
>
> From my understanding, Lustre can return EINTR for some I/O error cases.
> I think that when a client gets evicted in the middle of one of its RPCs,
> it can return EINTR to the caller.
> Could this explain your issue?
>
> Can you verify that your clients were not evicted at the same time?
>
> Aurélien
>
> Francois Chassaing wrote:
>
>> OK, thanks, that makes it clearer.
>> I had indeed mixed up signals and error return codes, both in my mind and in my wording.
>> I did understand that the write()/pwrite() system call was returning the EINTR error code because it had received a signal, but I assumed that the signal was sent because of an error condition somewhere in the FS.
>> This is where I now think I'm wrong.
>>  
>> As for your questions:
>> - I should mention that I have always had this issue, which is why I upgraded from 1.8.4 to 1.8.5, hoping this would solve it.
>> - I will try to have the SA_RESTART flag set in the app... if I can find where the signal handler is set.
>> - How can I tell whether Lustre is returning EINTR for some other reason? As I said, the logs show nothing on either the MDS or the OSSs, but I have not yet examined the "lctl debug_kernel" output... which I am going to do right away...
>>
>> My last question is: how can I tell which signal I am receiving? My app doesn't say; it just dumps out the write/pwrite error code.
>> If there were no signal handler, the signal would follow the "standard" actions (as per man 7 signal). However, my app does not stop or dump core, and the signal is not ignored, so it has to be handled in the code. Correct me if I'm wrong...
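
One way to find out which signal arrives, sketched under the assumption that the application can be rebuilt for a test run: install a recording handler with SA_SIGINFO and look at what it captured. The names watch_signal, last_signo, last_code and last_sender are hypothetical. The handler replaces whatever handler the application had installed for that signal, so this is for diagnosis only.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Hypothetical diagnostic state, filled in by the handler below. */
volatile sig_atomic_t last_signo = 0;   /* number of the last signal seen */
volatile sig_atomic_t last_code = 0;    /* si_code: how it was generated */
volatile sig_atomic_t last_sender = 0;  /* sender PID, when one is reported */

static void record_signal(int signo, siginfo_t *info, void *ctx)
{
    (void)ctx;
    last_signo = signo;
    last_code = info->si_code;
    /* si_pid is only meaningful for signals sent with kill() or sigqueue(). */
    if (info->si_code == SI_USER || info->si_code == SI_QUEUE)
        last_sender = (sig_atomic_t)info->si_pid;
    else
        last_sender = 0;
}

/* Install the recording handler for one signal. SA_RESTART is deliberately
 * not set, so an interrupted write() still fails with EINTR and can be
 * matched against the recorded signal number. */
int watch_signal(int signo)
{
    struct sigaction sa;

    assert(signo > 0);
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = record_signal;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}
```

After calling watch_signal() for each suspected signal (SIGALRM, SIGCHLD, SIGUSR1 and so on), last_signo can be printed next to the write()/pwrite() error code to see which signal coincided with the failure.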
>>
>> At that point, you realize that I didn't write the app, nor am I a good Linux guru ;-)
>>
>> Thanks a lot.
>>
>> François Chassaing, Technical Director (CTO), Weborama
>>
>> ----- Original Message -----
>> From: "Ken Hornstein" <kenh at cmf.nrl.navy.mil>
>> To: "Francois Chassaing" <fch at weborama.com>
>> Cc: lustre-discuss at lists.lustre.org
>> Sent: Thursday, February 24, 2011 15:54:24 GMT +01:00 Amsterdam / Berlin / Bern / Rome / Stockholm / Vienna
>> Subject: Re: [Lustre-discuss] clients gets EINTR from time to time
>>
>>> OK, the app is used to deal with standard disks, that is why it is not
>>> handling the EINTR signal properly.
>>
>> I think you're misunderstanding what a "signal" is in the Unix sense.
>>
>> EINTR isn't a signal; it's a return code from the write() system call
>> that says, "Hey, you got a signal in the middle of this write() call
>> and it didn't complete".  It doesn't mean that there was an error
>> writing the file; if that was happening, you'd get a (presumably
>> different) error code.  Signals can be sent by the operating system,
>> but those signals are things like SIGSEGV, which basically means, "your
>> program screwed up".  Programs can also send signals to each other,
>> with kill(2) and the like.
>>
>> Now, NORMALLY system calls like write() are interrupted by signals
>> when you're writing to "slow" devices, like network sockets.  According
>> to the signal(7) man page, disks are not normally considered slow
>> devices, so I can understand the application not being used to handling
>> this.  And you know, now that I think about it I'm not even sure that
>> network filesystems SHOULD allow I/O system calls to be interrupted by
>> signals ... I'd have to think more about it.
>>
>> I suspect what happened is that something changed between 1.8.5 and the
>> previous version of Lustre that you were using that allowed some operations
>> to be interruptible by signals.  Some things to try:
>>
>> - Check to see if you are, in fact, receiving a signal in your application
>>   and Lustre isn't returning EINTR for some other reason.
>> - If you are receiving a signal, when you set the signal handler for it
>>   you could use the SA_RESTART flag to restart the interrupted I/O; I think
>>   that would make everything work like it did before.
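
To illustrate the second suggestion, here is a small sketch that uses a pipe as a stand-in for a slow file descriptor (an assumption; nothing in it touches Lustre). A read() blocked on the empty pipe is interrupted by SIGALRM: without SA_RESTART it fails with EINTR, with SA_RESTART the call is restarted after the handler returns. The names read_across_signal, on_alarm and wake_fd are hypothetical, and read() is used only because it is easier to block than write(); the sigaction flags work the same way for both.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

static int wake_fd = -1;    /* write end of the demo pipe, used by the handler */

static void on_alarm(int signo)
{
    int saved_errno = errno;
    ssize_t ignored;

    (void)signo;
    /* write() is async-signal-safe. Putting one byte in the pipe lets a
     * restarted read() complete instead of blocking forever. */
    ignored = write(wake_fd, "x", 1);
    (void)ignored;
    errno = saved_errno;
}

/* Block in read() on an empty pipe, let SIGALRM arrive 100 ms later, and
 * report what read() did. Returns the read() result; *err receives errno.
 * Pass SA_RESTART or 0 as sa_flags. */
ssize_t read_across_signal(int sa_flags, int *err)
{
    int fds[2];
    struct sigaction sa;
    struct itimerval tv;
    char c;
    ssize_t n;
    int rc;

    rc = pipe(fds);
    assert(rc == 0);
    wake_fd = fds[1];

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sa.sa_flags = sa_flags;
    sigemptyset(&sa.sa_mask);
    rc = sigaction(SIGALRM, &sa, NULL);
    assert(rc == 0);

    memset(&tv, 0, sizeof tv);
    tv.it_value.tv_usec = 100000;       /* one-shot timer, 100 ms */
    rc = setitimer(ITIMER_REAL, &tv, NULL);
    assert(rc == 0);

    errno = 0;
    n = read(fds[0], &c, 1);            /* blocks until the signal arrives */
    *err = errno;

    close(fds[0]);
    close(fds[1]);
    return n;
}
```

Calling read_across_signal(0, &err) is expected to return -1 with err set to EINTR, while read_across_signal(SA_RESTART, &err) is expected to return 1 because the restarted read() picks up the byte written by the handler.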
>>
>> --Ken
>> _______________________________________________
>> Lustre-discuss mailing list
>> Lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>   
>>     
>



