[Lustre-discuss] clients gets EINTR from time to time

Ken Hornstein kenh at cmf.nrl.navy.mil
Thu Feb 24 06:54:24 PST 2011


>OK, the app is used to deal with standard disks, that is why it is not
>handling the EINTR signal properly.

I think you're misunderstanding what a "signal" is in the Unix sense.

EINTR isn't a signal; it's a return code from the write() system call
that says, "Hey, you got a signal in the middle of this write() call
and it didn't complete".  It doesn't mean that there was an error
writing the file; if that was happening, you'd get a (presumably
different) error code.  Signals can be sent by the operating system,
but those signals are things like SIGSEGV, which basically means, "your
program screwed up".  Programs can also send signals to each other,
with kill(2) and the like.
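
To make the distinction concrete, here is a minimal sketch of how an
application can treat EINTR as "try again" rather than as a failure.
The helper name write_all() is hypothetical (it is not part of any
library), and the sketch assumes a blocking file descriptor:

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write all len bytes to fd, retrying when a
 * signal interrupts write(). Returns len on success, or -1 with errno
 * set on a real error.
 */
ssize_t write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    size_t left = len;

    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted before any data was written: retry */
            return -1;      /* a genuine error; errno says which one */
        }
        p += n;             /* short write: advance and keep going */
        left -= (size_t)n;
    }
    return (ssize_t)len;
}
```

A caller would use write_all(fd, buf, len) wherever it currently calls
write() and treats any negative return as fatal.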

Now, NORMALLY system calls like write() are interrupted by signals
when you're writing to "slow" devices, like network sockets.  According
to the signal(7) man page, disks are not normally considered slow
devices, so I can understand the application not being used to handling
this.  And you know, now that I think about it I'm not even sure that
network filesystems SHOULD allow I/O system calls to be interrupted by
signals ... I'd have to think more about it.

I suspect what happened is that something changed between 1.8.5 and the
previous version of Lustre that you were using that allowed some operations
to be interruptible by signals.  Some things to try:

- Check to see if you are, in fact, receiving a signal in your application
  and Lustre isn't returning EINTR for some other reason.
- If you are receiving a signal, when you set the signal handler for it
  you could use the SA_RESTART flag to restart the interrupted I/O; I think
  that would make everything work like it did before.
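
A rough sketch of both suggestions, using sigaction(2).  The names
on_signal(), got_signal and install_restarting_handler() are made up
for illustration, and whether Lustre actually restarts the interrupted
call when SA_RESTART is set is an assumption you would need to test:

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Set by the handler so the application can check whether a signal arrived. */
static volatile sig_atomic_t got_signal = 0;

/* Hypothetical handler: only records that a signal was delivered. */
static void on_signal(int signo)
{
    (void)signo;
    got_signal = 1;
}

/*
 * Install on_signal() for signo with SA_RESTART, which asks the kernel
 * to restart certain interrupted system calls instead of failing them
 * with EINTR. Returns 0 on success, -1 on error (as sigaction does).
 */
int install_restarting_handler(int signo)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(signo, &sa, NULL);
}
```

Checking got_signal after a failed write() tells you whether a signal
really arrived; if it is still 0 when you see EINTR, the error is
coming from somewhere else.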

--Ken


