<html>


<head>


<meta http-equiv="Content-Type" content="text/html; charset=utf-8">


</head>


<body style="overflow-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;">


On Nov 4, 2025, at 14:11, Day, Timothy <timday@amazon.com> wrote:<br>


<div>


<blockquote type="cite"><br class="Apple-interchange-newline">


<div>


<div>


<blockquote type="cite">VFS OSD won’t give us everything needed to make it an OSD. Transactions are one of the issues, as Oleg mentioned.<br>


</blockquote>


<br>


We'll have to implement a lot of complexity in kernel space for<br>


the design you're suggesting. At some point, implementing<br>


a transaction log and delegating the rest to user space might<br>


be the more maintainable option? Doing that via VFS/FUSE is<br>


one option. Something similar to ublk for OSD could be another<br>


option.<br>


</div>


</div>


</blockquote>


<div><br>


</div>


If the writes to the spill device are well managed, it may be possible</div>


<div>to do it without transaction support, so long as they are not exposed</div>


<div>directly for writing to the clients.  Otherwise they essentially need</div>


<div>to be OSDs that expose transactions and recovery semantics.</div>


<div><br>


<blockquote type="cite">


<div>


<div>I'm not saying this is preferable, but is this something you've<br>


considered?<br>


<br>


<blockquote type="cite">Extended attributes would be another thing since not all file systems would support it.<br>


</blockquote>


<br>


I think it's fine to not accept filesystems that don't support EA.</div>


</div>


</blockquote>


<blockquote type="cite">


<div>


<div>Or have the OSD advertise that it doesn't support EA.<br>


</div>


</div>


</blockquote>


<div><br>


</div>


I think every storage system we care about today supports</div>


<div>EAs, tags, or similar metadata that can be used for this.</div>


<div><br>


<blockquote type="cite">


<div>


<div>


<blockquote type="cite">


<blockquote type="cite">


<blockquote type="cite">deliver better performance than block-level cache with dmcache, where<br>


recovery time is lengthy if the cache size is huge.<br>


</blockquote>


<br>


This seems speculative. Could you elaborate more on why you think<br>


this is the case?<br>


</blockquote>


<br>


DMcache doesn’t persist the bitmap that indicates which blocks hold dirty data, so an ungraceful shutdown leads to a scan of the entire cache to determine which blocks are dirty.<br>


</blockquote>


<br>


This is a failing of DMcache rather than an indication that this<br>


problem can't be solved on the block layer.<br>


</div>


</div>


</blockquote>


<div><br>


</div>


That was my original discussion with Jinshan as well.  There was a</div>


<div>similar issue with mdraid, and they ended up with persistent bitmaps</div>


<div>in a flash device.  It should be possible to manage this with logs</div>


<div>or bitmaps stored in the fast OSD device instead of the spill device.</div>
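<div><br>
</div>
<div>A minimal sketch of that idea, assuming a plain file on the fast device records which cache blocks still need to be written to the spill device. It uses one byte per block for readability (a real bitmap would pack bits), and DirtyMap and its methods are hypothetical names, not Lustre, mdraid, or dm-cache code:</div>
<pre>
```python
import os
import tempfile


class DirtyMap:
    """Hypothetical persistent dirty-block map stored on the fast device.

    One byte per cache block for readability; a real bitmap would pack bits.
    """

    def __init__(self, path, nblocks):
        self.path = path
        if not os.path.exists(path):
            # First use: all blocks start clean.
            with open(path, "wb") as f:
                f.write(bytes(nblocks))
                f.flush()
                os.fsync(f.fileno())
        with open(path, "rb") as f:
            self.state = bytearray(f.read())

    def _persist(self, block):
        # Rewrite only the changed byte and force it to stable storage.
        with open(self.path, "r+b") as f:
            f.seek(block)
            f.write(bytes([self.state[block]]))
            f.flush()
            os.fsync(f.fileno())

    def mark_dirty(self, block):
        # Must be durable before the cached write is acknowledged.
        if self.state[block] == 0:
            self.state[block] = 1
            self._persist(block)

    def mark_clean(self, block):
        # Called only after the block has reached the spill device.
        if self.state[block] == 1:
            self.state[block] = 0
            self._persist(block)

    def dirty_blocks(self):
        # Recovery path: read the map instead of scanning the whole cache.
        return [i for i, s in enumerate(self.state) if s == 1]


def demo():
    path = os.path.join(tempfile.mkdtemp(), "dirty.map")
    m = DirtyMap(path, 16)
    m.mark_dirty(3)
    m.mark_dirty(9)
    m.mark_clean(3)
    # Simulate an ungraceful restart: reload the map from disk.
    return DirtyMap(path, 16).dirty_blocks()
```
</pre>
<div>After the simulated restart, demo() reports only block 9 as dirty, so recovery cost depends on the size of the map rather than the size of the cache.</div>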


<div><br>


<blockquote type="cite">


<div>


<div>Overall, I think the concept is interesting. It reminds me of how<br>


Bcachefs handles multi-device support. Each device can be<br>


designated as holding metadata or data replicas. And you<br>


can control the promotion and migration between different<br>


targets (all managed by a migration daemon). But this design is<br>


too limited, IMHO. If we're going to accept the additional complexity<br>


in the OSD, the solution has to be extensible. What if I want to<br>


replicate to multiple targets? What if I want more than two tiers?<br>


What if I want to transparently migrate data from one spill device to<br>


another? We don't need this for the initial implementation, sure.<br>


But these seem like natural extensions.<br>


</div>


</div>


</blockquote>


<div><br>


</div>


This is essentially replicating Lustre file layouts in the end, which</div>


<div>was my original suggestion - to use FLR and/or PCC-RO foreign</div>


<div>mirror layouts for this, even if it is not directly accessible from</div>


<div>clients.  That avoids reimplementing tools/formats that already</div>


<div>exist in Lustre today for relatively little benefit.</div>


<div><br>


<blockquote type="cite">


<div>


<div>I think we need to use some kind of common API for the<br>


different devices. Even if the spill device doesn't support atomic<br>


transactions, I don't see why we couldn't still use the common OSD<br>


API and implement the migration daemon as a stacking driver on-top<br>


of that. The spill device OSD driver could be made to advertise that it<br>


doesn't support atomic transactions and may not support EA. But we<br>


get the added benefit of being able to use existing OSDs with this<br>


feature, pretty much for free.<br>


</div>


</div>


</blockquote>


<br>


</div>


<div>Also my argument.  Using the VFS directly is constraining (lack of</div>


<div>transactions), but for backends that _can_ be a full OSD (or already</div>


<div>have an OSD like ldiskfs, ZFS, memfs) it is a drop-in replacement.</div>


<div><br>


</div>


<div>There is already an OSD API to query the functionality of the backing</div>


<div>storage, so it should be straightforward to add "transaction", "xattr",</div>


<div>and other supported features to that.</div>
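<div><br>
</div>
<div>A rough sketch of how such feature flags might be consumed by the layer above. OsdFeature, BACKENDS, plan_for, and the "vfs-plain" backend are all hypothetical names for illustration and do not reflect the existing OSD API:</div>
<pre>
```python
from enum import Flag


class OsdFeature(Flag):
    """Hypothetical capability flags an OSD backend could advertise."""
    NONE = 0
    TRANSACTION = 1
    XATTR = 2


# Assumed feature sets, for illustration only.
BACKENDS = {
    "ldiskfs": OsdFeature.TRANSACTION | OsdFeature.XATTR,
    "vfs-plain": OsdFeature.NONE,
}


def plan_for(backend):
    """Decide how the upper layer compensates for missing features."""
    feats = BACKENDS[backend]
    plan = []
    if OsdFeature.TRANSACTION not in feats:
        plan.append("track updates in a log on the fast device")
    if OsdFeature.XATTR not in feats:
        plan.append("keep per-object metadata on the fast device")
    return plan
```
</pre>
<div>With these assumed tables, plan_for("ldiskfs") returns an empty list, while plan_for("vfs-plain") returns both workarounds.</div>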


<div><br>


</div>


<div>If we can implement a no-transaction osd-vfs, that would expose a</div>


<div>lot of flexibility for other reasons as well.  Possibly the osd-vfs could</div>


<div>implement a journal or other logging layer internally to make up for</div>


<div>lack of transactions, whether initially or at a later stage?</div>
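<div><br>
</div>
<div>One way such an internal logging layer could look, as a sketch only: a redo journal that records a group of updates durably before applying them to a backend without transactions, and replays them after a crash. JournaledStore, its file layout, and the crash_before_apply hook are assumptions made for illustration, and directory syncs are omitted for brevity:</div>
<pre>
```python
import json
import os
import tempfile


class JournaledStore:
    """Hypothetical redo journal in front of a store with no transactions."""

    def __init__(self, root):
        self.root = root
        self.journal = os.path.join(root, "journal.log")
        self.recover()

    def _apply(self, updates):
        # Each update overwrites a whole file, so replaying it is idempotent.
        for name, text in updates.items():
            with open(os.path.join(self.root, name), "w") as f:
                f.write(text)
                f.flush()
                os.fsync(f.fileno())

    def commit(self, updates, crash_before_apply=False):
        # 1. Make the intent durable as one journal record.
        with open(self.journal, "a") as f:
            f.write(json.dumps(updates) + "\n")
            f.flush()
            os.fsync(f.fileno())
        if crash_before_apply:
            return  # test hook: simulate losing power at this point
        # 2. Apply to the backing files, then 3. retire the journal.
        self._apply(updates)
        os.remove(self.journal)

    def recover(self):
        # Replay every complete record left behind by a crash.
        if not os.path.exists(self.journal):
            return
        with open(self.journal) as f:
            for line in f:
                if line.endswith("\n"):
                    self._apply(json.loads(line))
        os.remove(self.journal)


def demo():
    root = tempfile.mkdtemp()
    store = JournaledStore(root)
    store.commit({"data": "hello", "attr": "v1"}, crash_before_apply=True)
    JournaledStore(root)  # restart: recovery replays the record
    with open(os.path.join(root, "data")) as d:
        with open(os.path.join(root, "attr")) as a:
            return d.read(), a.read()
```
</pre>
<div>In demo(), both files appear together after the restart even though the simulated crash happened before either was written, which is the all-or-nothing behaviour a transaction would otherwise provide.</div>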


<br>


<div>


<div dir="auto" style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; overflow-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;">


<div dir="auto" style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; overflow-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;">


<div>Cheers, Andreas</div>


<div>—</div>


<div>Andreas Dilger</div>


<div>Lustre Principal Architect</div>


<div>Whamcloud/DDN</div>


</div>


<br class="Apple-interchange-newline">


</div>




</div>


<br>


</body>


</html>