[Lustre-discuss] filesystem corruption

Richard Smith Richard.Smith at Sun.COM
Tue Sep 8 16:23:58 PDT 2009


Brian J. Murrell wrote:

> Well, whatever you want to call it... when the hardware tells the
> software (Lustre) that something is on a platter, in order for Lustre to
> work properly, it MUST be physically on the platter, or be able to make
> it there in the face of other environmental issues, such as power ...

I don't think it's in dispute, at least not by me, that there is a need at
various times to ensure that data has been written to non-volatile storage.
Where I was coming from is that high-performance software should be
encouraged to take full advantage of the capabilities of the underlying
hardware, provided it can do so safely. [And under some circumstances people
may even be prepared to sacrifice safety for a performance benefit, but
that's a separate issue.]

At least in the case of SCSI, the hardware doesn't tell software (Lustre)
that something is on a platter. The hardware receives requests and tells the
software that it has obeyed them, or has failed in the attempt. A WRITE
carries with it no guarantee that the data is on non-volatile media, hence
my comment about also using SYNCHRONIZE CACHE or the FUA bit if that is
really what is wanted.
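
To make the distinction concrete, here is a minimal sketch of the command
blocks involved. It assumes the 10-byte forms of the commands; the opcodes
and the position of the FUA bit are from the SCSI block command set, while
the function names are made up for illustration. It only builds the bytes
and does not talk to a device.

```python
import struct

WRITE_10 = 0x2A              # opcode for WRITE(10)
SYNCHRONIZE_CACHE_10 = 0x35  # opcode for SYNCHRONIZE CACHE(10)
FUA_BIT = 0x08               # bit 3 of byte 1 in WRITE(10)


def write10_cdb(lba, blocks, fua=False):
    """Build a 10-byte WRITE(10) CDB (hypothetical helper).

    With fua=False the device may complete the command once the data is in
    its volatile cache. With fua=True it is asked to put the data on the
    medium before reporting completion.
    """
    flags = FUA_BIT if fua else 0
    # Layout: opcode, flags, 32-bit LBA, group number,
    # 16-bit transfer length, control byte (all big-endian).
    return struct.pack(">BBIBHB", WRITE_10, flags, lba, 0, blocks, 0)


def synchronize_cache10_cdb(lba=0, blocks=0):
    """Build a 10-byte SYNCHRONIZE CACHE(10) CDB (hypothetical helper).

    A block count of 0 covers everything from lba to the end of the medium.
    """
    return struct.pack(">BBIBHB", SYNCHRONIZE_CACHE_10, 0, lba, 0, blocks, 0)
```

The only difference between a plain WRITE and a FUA write is one bit in
byte 1, which is why a device that silently ignores it is hard to detect
from above.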

Neither SYNCHRONIZE CACHE nor the FUA bit is exposed at the application
level, but I think there's a reasonable expectation that the underlying
software will do whatever is necessary to maintain the integrity of a
filesystem. The way I interpret this is that the combination of filesystem
and device driver(s) should establish what the device is capable of (does it
implement FUA, for instance?) and then use those capabilities to maintain
integrity while maximizing performance. I'm caught out here: I was unaware
that there were devices that silently ignored FUA, and didn't know if SCSI
permitted that. If FUA can't be relied upon, then I'd expect [system]
software to use SYNCHRONIZE CACHE instead.
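
For comparison, what an application can do is ask the kernel to make a file
durable and trust the layers below to translate that into the right device
commands. A minimal sketch, assuming a POSIX system; durable_write is a
made-up name, and whether the fsync() call really ends in a cache flush on
the drive depends on the filesystem and driver, which is exactly the point
in question.

```python
import os


def durable_write(path, data):
    """Write data to path, then ask the kernel to push it to stable storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # Request that the file's data and metadata reach the device.
        os.fsync(fd)
    finally:
        os.close(fd)

    # Also fsync the containing directory so the new entry is persisted.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The application never sees FUA or SYNCHRONIZE CACHE here; it can only state
its intent and rely on the stack underneath.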

Admittedly I am out of my depth here. The block layer for a device is
supposed to implement the concept of a barrier request, and should take
steps to force a drive to write data to the media. Maybe some drivers do,
and others don't. I expect the implementation to require SYNCHRONIZE CACHE
or FUA. At a higher level, then, all that should be required is the
appropriate generation of barrier requests, assuming the underlying layer
implements them.
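
A toy model may help show what a barrier buys you. This is only a sketch
under simplifying assumptions: CachedDisk and commit_with_barriers are
invented names, not a real driver or block-layer interface, and flush()
stands in for whatever the lower layer actually issues (SYNCHRONIZE CACHE
or FUA writes).

```python
class CachedDisk:
    """Toy model of a drive with a volatile write cache."""

    def __init__(self):
        self.cache = {}   # acknowledged writes still in volatile memory
        self.media = {}   # data that would survive a power loss

    def write(self, lba, data):
        # The drive acknowledges the write as soon as it is cached.
        self.cache[lba] = data

    def flush(self):
        # Stand-in for a cache flush: everything cached reaches the media.
        self.media.update(self.cache)
        self.cache.clear()

    def power_loss(self):
        # Anything still in the volatile cache is lost.
        self.cache.clear()


def commit_with_barriers(disk, journal_blocks, commit_lba, commit_record):
    """Write journal blocks, then the commit record, flushing after each step."""
    for lba, data in journal_blocks.items():
        disk.write(lba, data)
    disk.flush()   # journal is on the media before the commit record is sent
    disk.write(commit_lba, commit_record)
    disk.flush()   # commit record is on the media before success is reported
```

Without the first flush, the model drive would be free to put the commit
record on the media before the journal blocks it describes, and a power
loss in between would leave a commit pointing at data that never arrived.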

The final piece of the puzzle, unless there's something I've overlooked, is
for appropriate warnings to be generated when the software stack cannot
verify that it can implement barriers. This could mean [and I'm only
guessing here] a situation where a device has a write-cache enabled but
provides higher layers with no means of ensuring that data is written to
physical media, in order.

Adopting the same "fail-safe" principle that railways/railroads use in
theory, this should probably be inverted: unless you see a message stating
positively that the conditions for using a write-cache have been met, it
would be prudent not to use a write-cache. At the same time, I think there
are circumstances in which the use of write-caches should be acceptable.
There's a bunch of other things that can go wrong with I/Os that none of the
above addresses, so nothing is completely risk-free.
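
The fail-safe rule, together with the fallback from FUA to SYNCHRONIZE
CACHE, could be sketched roughly as follows. The function name and its
three flags are hypothetical; how a real stack would discover or verify
these capabilities is not something I can vouch for, so they are simply
inputs here.

```python
def write_cache_decision(fua_supported=False, fua_trusted=False,
                         sync_cache_supported=False):
    """Return (allow_write_cache, flush_method).

    Every capability defaults to False, so anything that has not been
    positively confirmed counts against using the write cache.
    """
    if fua_supported and fua_trusted:
        return True, "FUA"
    if sync_cache_supported:
        # FUA is absent or cannot be relied upon: fall back to cache flushes.
        return True, "SYNCHRONIZE CACHE"
    # No confirmed way to force data to the media: fail safe, no write cache.
    return False, None
```

Called with no arguments it returns (False, None), which is the inverted
default: the cache stays off until something says it is safe.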

-- 
============================================================================
   ,-_|\   Richard Smith Staff Engineer PAE
  /     \  Sun Microsystems                   Phone : +61 3 9869 6200
richard.smith at Sun.COM                        Direct : +61 3 9869 6224
  \_,-._/  476 St Kilda Road                    Fax : +61 3 9869 6290
       v   Melbourne Vic 3004 Australia
=========================================================================== 



