[Lustre-discuss] buffering

Thu Aug 12 14:40:18 PDT 2010

On 2010-08-12, at 14:52, burlen wrote:
> Andreas Dilger wrote:
>> On 2010-08-11, at 23:36, burlen wrote: 
>>> I am interested in how write()s are buffered in Lustre on the client, server, and network in between. Specifically I'd like to understand what happens during writes when a large number of clients are making large writes to all of the OSTs on an OSS, and the buffers are inadequate to handle the outgoing/incoming data.
>> 
>> Lustre doesn't buffer dirty pages on the OSS, only on the client.  The clients are granted a "reserve" of space in each OST filesystem to ensure there is enough free space for any cached writes that they do.
> 
> If I understand the way write() typically works on Linux, during a large write(), too large to be buffered in the page cache, once the page cache is full dirty pages would be flushed to disk. The data transfer would block at that point until the dirty pages are written to disk, whence the data transfer would resume into the resulting free pages.  But in Lustre I assume that once the client's page cache is full, the dirty pages are sent over the network to the OSS where they are written to disk.

In fact, Lustre aggressively flushes dirty data from the client as soon as it can create a 1MB RPC.  Otherwise, the VM would cache dirty data for up to 30s, and if you work out the size of that cache across all clients against the aggregate network bandwidth, leaving the network idle for that long would be a huge waste of bandwidth.

> In that case, does the network layer effectively act like a buffer? So that the client may resume the data transfer into the page cache before the former set of dirty pages actually hit the disk? Or does the data transfer block until dirty pages actually reach the disk?

Lustre also limits the dirty page cache per OST far below the VM limits, for similar reasons as above.  Clients can have 32MB (default) dirty data per OST, and up to 8 RPCs (default) in flight per OST at one time.
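
To put rough numbers on those defaults, here is a small back-of-the-envelope sketch in Python. It only multiplies out the figures quoted above (1MB RPCs, 32MB dirty per OST, 8 RPCs in flight per OST). The function names and the client/OST counts in the usage note are made up for illustration, and the results are upper bounds, not measurements.

```python
MIB = 1024 * 1024

# Defaults quoted above; the constant names are illustrative, not Lustre identifiers.
RPC_SIZE = 1 * MIB             # size of one bulk write RPC
MAX_DIRTY_PER_OST = 32 * MIB   # dirty page cache a client may hold per OST
MAX_RPCS_IN_FLIGHT = 8         # concurrent RPCs a client may have per OST


def client_dirty_limit(num_osts):
    """Upper bound on dirty data one client can cache across num_osts OSTs."""
    return num_osts * MAX_DIRTY_PER_OST


def client_in_flight_limit(num_osts):
    """Upper bound on data one client can have on the wire at once."""
    return num_osts * MAX_RPCS_IN_FLIGHT * RPC_SIZE


def aggregate_dirty_limit(num_clients, num_osts):
    """Upper bound on dirty data cached by all clients together."""
    return num_clients * client_dirty_limit(num_osts)
```

For example, with a hypothetical 100 clients each writing to 4 OSTs, `client_dirty_limit(4)` is 128MB per client and `aggregate_dirty_limit(100, 4)` is 12800MB in total, while `client_in_flight_limit(4)` shows that at most 32MB per client can be on the wire at any instant.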

The network does NOT act as a buffer, since the client must keep a copy of all {meta}data in memory until it is ACK'd by the server (it is not fire & forget) so that the client can replay this RPC in case of a server crash.  The server will send an ACK (RPC reply) when it has processed the RPC along with a transaction number for that RPC, and asynchronously notifies the client that RPCs <= "last_committed_transno" have been committed to disk and they can discard their copy of the RPC.
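
As a toy illustration of that bookkeeping (not the actual Lustre client code), the Python sketch below models a client that holds on to every RPC until the server reports it committed. The class and method names are hypothetical. The only protocol details taken from the description above are the per-RPC transaction number in the reply and the "last_committed_transno" notification.

```python
class ClientReplayLog:
    """Toy model of the RPC copies a client must keep in memory."""

    def __init__(self):
        self.unacked = []       # sent, no reply from the server yet
        self.uncommitted = {}   # transno -> RPC, replied but not yet on disk

    def send(self, rpc):
        # The client keeps its copy; sending is not fire & forget.
        self.unacked.append(rpc)

    def on_reply(self, rpc, transno):
        # The server processed the RPC and assigned it a transaction number.
        self.unacked.remove(rpc)
        self.uncommitted[transno] = rpc

    def on_last_committed(self, last_committed_transno):
        # Everything with transno <= last_committed_transno is on disk,
        # so the client may discard its copy.
        for transno in [t for t in self.uncommitted if t <= last_committed_transno]:
            del self.uncommitted[transno]

    def rpcs_to_reissue(self):
        # After a server crash: uncommitted RPCs in transno order,
        # followed by RPCs that never received a reply.
        ordered = [self.uncommitted[t] for t in sorted(self.uncommitted)]
        return ordered + list(self.unacked)
```

In this model, an RPC leaves client memory only in `on_last_committed()`, never in `send()` or `on_reply()`, which is why the network cannot be treated as a buffer.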

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.