[Lustre-discuss] clients gets EINTR from time to time

Francois Chassaing fch at weborama.com
Thu Feb 24 07:52:17 PST 2011


OK, thanks, that makes it clearer.
I had indeed mixed up signals and error return codes, both in my head and in my wording.
I did understand that the write()/pwrite() system call was returning the EINTR error code because the process received a signal, but I assumed that the signal was sent because of an error condition somewhere in the FS.
This is where I now think I was wrong.
 
As for your questions :
- I should mention that I have always had this issue, which is why I upgraded from 1.8.4 to 1.8.5, hoping that would solve it.
- I will try to have the SA_RESTART flag set in the app... if I can find where the signal handler is set.
- How can I tell whether Lustre is returning EINTR for some other reason? As I said, the logs show nothing on either the MDS or the OSSs, but I haven't gone through the "lctl debug_kernel" output yet... which I'm going to do right away...

My last question is: how can I tell which signal I am receiving? My app doesn't say; it just dumps out the write/pwrite error code.
And if there is no signal handler, then the signal should trigger the "standard" action (as per man 7 signal). On the other hand, my app does not stop or dump core, and the signal is not ignored, so it has to be handled in the code. Correct me if I'm wrong...
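
One way to narrow down which signals the app handles, without reading its source, is the SigCgt field of /proc/<pid>/status on Linux: a hexadecimal bitmask in which bit (n-1) is set when signal n has a handler installed. The following is only a minimal sketch under that assumption; the function names caught_signals and read_sigcgt are made up for illustration and come from neither the app nor Lustre. Attaching strace -p <pid> to the running process would also show each signal as it is delivered.

```c
#include <stdio.h>

/* Expand a SigCgt-style bitmask: bit (n-1) set means signal n has a handler.
 * Stores up to max signal numbers in out and returns how many were stored. */
int caught_signals(unsigned long long mask, int *out, int max)
{
    int count = 0;
    for (int signo = 1; signo <= 64 && count < max; signo++)
        if (mask & (1ULL << (signo - 1)))
            out[count++] = signo;
    return count;
}

/* Read the SigCgt mask of a process from /proc/<pid>/status (Linux only).
 * Returns 0 on success, -1 if the file or the field cannot be read. */
int read_sigcgt(long pid, unsigned long long *mask)
{
    char path[64], line[256];
    snprintf(path, sizeof path, "/proc/%ld/status", pid);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    int rc = -1;
    while (fgets(line, sizeof line, f)) {
        /* The field looks like "SigCgt:\t0000000000014002". */
        if (sscanf(line, "SigCgt: %llx", mask) == 1) {
            rc = 0;
            break;
        }
    }
    fclose(f);
    return rc;
}
```

Calling read_sigcgt() with the app's pid and passing the mask to caught_signals() gives the candidate signal numbers; kill -l maps them to names. This only shows which signals have handlers, not which one actually interrupted a given write().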

By now you will have realized that I didn't write the app, and that I'm no Linux guru ;-)

Thanks a lot.

François Chassaing, Directeur Technique - CTO, Weborama

----- Original Message -----
From: "Ken Hornstein" <kenh at cmf.nrl.navy.mil>
To: "Francois Chassaing" <fch at weborama.com>
Cc: lustre-discuss at lists.lustre.org
Sent: Thursday, 24 February 2011 15:54:24 GMT +01:00 Amsterdam / Berlin / Bern / Rome / Stockholm / Vienna
Subject: Re: [Lustre-discuss] clients gets EINTR from time to time

>OK, the app is used to deal with standard disks, that is why it is not
>handling the EINTR signal properly.

I think you're misunderstanding what a "signal" is in the Unix sense.

EINTR isn't a signal; it's a return code from the write() system call
that says, "Hey, you got a signal in the middle of this write() call
and it didn't complete".  It doesn't mean that there was an error
writing the file; if that was happening, you'd get a (presumably
different) error code.  Signals can be sent by the operating system,
but those signals are things like SIGSEGV, which basically means, "your
program screwed up".  Programs can also send signals to each other,
with kill(2) and the like.
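
To make the distinction concrete, here is a minimal sketch of how an application can treat EINTR as "try again" rather than as a failure. The helper name write_all is hypothetical and not taken from the application under discussion; it also handles short writes, which can happen when a signal arrives after part of the data has been transferred.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Write all len bytes of buf to fd, retrying when a signal interrupts the
 * call. Returns 0 on success, -1 on a real error (errno is set by write()). */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted before any data was written: retry */
            return -1;      /* genuine I/O error */
        }
        p += n;             /* short write: advance and keep going */
        len -= (size_t)n;
    }
    return 0;
}
```

A pwrite() variant would look the same, with the file offset advanced by n alongside the buffer pointer.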

Now, NORMALLY system calls like write() are interrupted by signals
when you're writing to "slow" devices, like network sockets.  According
to the signal(7) man page, disks are not normally considered slow
devices, so I can understand the application not being used to handling
this.  And you know, now that I think about it I'm not even sure that
network filesystems SHOULD allow I/O system calls to be interrupted by
signals ... I'd have to think more about it.

I suspect what happened is that something changed between 1.8.5 and the
previous version of Lustre that you were using that allowed some operations
to be interruptible by signals.  Some things to try:

- Check to see if you are, in fact, receiving a signal in your application
  and Lustre isn't returning EINTR for some other reason.
- If you are receiving a signal, when you set the signal handler for it
  you could use the SA_RESTART flag to restart the interrupted I/O; I think
  that would make everything work like it did before.
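
For the second suggestion, a handler installed with sigaction(2) and the SA_RESTART flag might look like the sketch below. The names on_signal and install_restarting_handler are hypothetical, and which signal the application actually receives is not known from this thread, so the signal number is left as a parameter.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Placeholder handler: the application's real handler would go here. */
static void on_signal(int signo)
{
    (void)signo;
}

/* Install on_signal for signo with SA_RESTART, so that restartable system
 * calls interrupted by this signal are resumed by the kernel instead of
 * failing with EINTR. Returns 0 on success, -1 on error. */
int install_restarting_handler(int signo)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(signo, &sa, NULL);
}
```

Whether a given call is restarted still depends on the call and on the filesystem honouring the flag, so this is something to test rather than a guaranteed fix.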

--Ken


