[lustre-discuss] trimming flash-based external journal device

Nathan Dauchy - NOAA Affiliate nathan.dauchy at noaa.gov
Thu Aug 5 16:44:50 PDT 2021


On Thu, Aug 5, 2021 at 3:23 PM Andreas Dilger <adilger at whamcloud.com> wrote:

> On Aug 5, 2021, at 13:29, Nathan Dauchy wrote:
>
> Andreas, thanks as always for your insight.  Comments inline...
>
> On Thu, Aug 5, 2021 at 10:48 AM Andreas Dilger <adilger at whamcloud.com>
> wrote:
>
>> On Aug 5, 2021, at 09:28, Nathan Dauchy via lustre-discuss <
>> lustre-discuss at lists.lustre.org> wrote:
>>
>> Question:  Is it possible that a flash journal device on an ext4
>> filesystem can reach a point where there are not enough clean blocks to
>> write to, and they can suffer from very degraded write performance?
>>
>> ...

>> Another related question would be how to benchmark the journal device on
>> its own, particularly write performance, without losing data on an
>> existing file system; similar to the very useful obdfilter-survey tool, but
>> at a lower level.  But I am primarily looking to understand the nuances of
>> flash devices and ldiskfs external journals a bit better.
>>
>> While the external journal device has an ext4 superblock header for
>> identification (UUID/label), and a feature flag that prevents it from being
>> mounted/used directly, it is not really an ext4 filesystem, just a flat
>> "file".  You'd need to remove it from the main ext4/ldiskfs filesystem,
>> reformat it as ext4 and mount locally, and then run benchmarks (e.g. "dd"
>> would best match the JBD2 workload, or fio if you want random IOPS) against
>> it.  You could do this before/after trim (could use fstrim at this point)
>> to see if it affects the performance or not.
>>
>
> OK, thanks for confirming that there is no magic ext4 journal benchmarking
> tool.  I'll stop searching.  ;-)
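
To make sure I have the procedure straight before our next maintenance
window, here is a rough sketch of what I'd run.  Device names below are
placeholders (assuming the OST is /dev/md0 and its external journal is
/dev/nvme0n1, with the OST kept unmounted from Lustre throughout):

# tune2fs -O ^has_journal /dev/md0        (detach the external journal)
# mkfs.ext4 /dev/nvme0n1                  (reformat journal dev as plain ext4)
# mount /dev/nvme0n1 /mnt/jtest
# dd if=/dev/zero of=/mnt/jtest/seq.dat bs=1M count=4096 oflag=direct conv=fsync
# fio --name=randw --filename=/mnt/jtest/rand.dat --rw=randwrite --bs=4k \
      --ioengine=libaio --iodepth=32 --direct=1 --size=4g --runtime=60 --time_based
# fstrim -v /mnt/jtest                    (then repeat dd/fio to compare)
# umount /mnt/jtest
# mke2fs -O journal_dev -b 4096 /dev/nvme0n1   (recreate the journal device)
# tune2fs -J device=/dev/nvme0n1 /dev/md0      (reattach it to the OST)

The dd run is meant to approximate the sequential JBD2 commit stream, and
fio covers the random-write IOPS case.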
>
> Note that there *are* some journal commit statistics in
> /proc/fs/jbd2/<dev>/info that you might be able to compare between
> devices.  Probably the most interesting is "average transaction commit
> time", which is how long it takes to write the blocks to the journal device
> after the transaction starts to commit.
>
>
Oh, that is interesting!

The "average transaction commit time" seems to fluctuate, possibly with
load, and doesn't have an obvious correlation to the slower OSTs.  But
perhaps I'll look at it again when running a clean benchmark during a
future dedicated time.
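
When that window comes, capturing the stat across all the servers should
be as simple as the following (assuming the md* device naming under
/proc/fs/jbd2/ is the same on every OSS):

# pdsh -g oss "grep 'average transaction commit time' /proc/fs/jbd2/md*/info" | sort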

I _did_ find a stark difference in other metrics though:

# pdsh -g oss "grep 'handles per' /proc/fs/jbd2/md*/info" | sort
lfs4n04:   17 handles per transaction
lfs4n05:   18 handles per transaction
lfs4n06:   18 handles per transaction
lfs4n07:   18 handles per transaction
lfs4n08:   17 handles per transaction
lfs4n09:   17 handles per transaction
lfs4n10:   18 handles per transaction
lfs4n11:   18 handles per transaction
lfs4n12:   18 handles per transaction
lfs4n13:   17 handles per transaction
lfs4n16:   192 handles per transaction
lfs4n17:   178 handles per transaction
lfs4n18:   198 handles per transaction
lfs4n19:   192 handles per transaction
# pdsh -g oss "grep 'logged blocks per' /proc/fs/jbd2/md*/info" | sort
lfs4n04:   24 logged blocks per transaction
lfs4n05:   24 logged blocks per transaction
lfs4n06:   25 logged blocks per transaction
lfs4n07:   25 logged blocks per transaction
lfs4n08:   24 logged blocks per transaction
lfs4n09:   24 logged blocks per transaction
lfs4n10:   25 logged blocks per transaction
lfs4n11:   24 logged blocks per transaction
lfs4n12:   24 logged blocks per transaction
lfs4n13:   24 logged blocks per transaction
lfs4n16:   103 logged blocks per transaction
lfs4n17:   98 logged blocks per transaction
lfs4n18:   106 logged blocks per transaction
lfs4n19:   103 logged blocks per transaction

The last 4 nodes are the OSS servers for the expansion OSTs, which are
performing better.  What does that difference indicate?

Thanks again,
Nathan