[lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
Andreas Dilger
adilger at ddn.com
Wed Feb 5 04:05:11 PST 2025
To better cover the skew between different VMs running different subtests, we could change the test-framework code to run the subtests in a different order (either starting at a random offset, or in a fully random order).
This would also expose some hidden assumptions and dependencies in the subtests themselves, which would need to be fixed to avoid false test failures. However, the main goal of the boilpot testing is finding crashes/deadlocks, so if a few tests fail because of minor test issues I don't think that is a blocker.
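As a rough illustration of what I mean (just a sketch - the real change would
go into the bash test-framework, and the subtest names below are made up),
the whole thing amounts to a seeded rotation or shuffle of the subtest list:

import random

# Hypothetical subtest list; the real suites (sanity, conf-sanity, ...)
# define their own, so treat these names as placeholders.
subtests = ["test_%d" % i for i in range(1, 31)]

def rotated(tests, seed):
    """Start at a random offset but keep the relative order (smallest change)."""
    rng = random.Random(seed)
    off = rng.randrange(len(tests))
    return tests[off:] + tests[:off]

def shuffled(tests, seed):
    """Fully random order; exposes more hidden ordering assumptions."""
    rng = random.Random(seed)
    order = list(tests)
    rng.shuffle(order)
    return order

# Log the seed so a failing order can be reproduced exactly.
seed = random.SystemRandom().randrange(2**32)
print("ordering seed =", seed)
for name in rotated(subtests, seed):
    print("running", name)

Logging the seed matters once the ordering stops being deterministic,
otherwise a failure that only shows up for one particular order becomes very
hard to reproduce.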
Cheers, Andreas
> On Feb 4, 2025, at 13:38, Oleg Drokin <green at whamcloud.com> wrote:
>
> On Tue, 2025-02-04 at 17:33 +0000, Andreas Dilger wrote:
>> You overlook that Tim works for AWS, so he would not actually pay to
>> run these nodes. He could run them during machine idle time, while no
>> external customer is paying for those machines.
>
> If this could be arranged, that would be great of course, but I don't
> want to assume something of this nature unless it is explicitly stated.
> And who knows what sort of internal accounting might be in place to
> track (and approve) uses like this.
>
>> I suspect that with the random nature of the boilpot it is the total
>> number of hours of runtime that matters, not whether they are
>> contiguous. So running 24x boilpot nodes for 1h during off-peak times
>> would likely produce the same result as 24h of continuous runtime on
>> one node.
>
> Well, not exactly true. There need to be contiguous chunks of at least
> 1x the longest test run, and preferably much more (2x as the minimum is
> better?).
> If conf-sanity takes 5 hours in this setup (CPU overcommit making things
> slow and whatnot) and you only ever run for an hour, we never get to try
> most of conf-sanity.
>
> Also, 50 sessions of conf-sanity running in parallel 1x vs. 10 sessions
> running conf-sanity in parallel 5x: the latter probably wins
> coverage-wise, because over time the other conflicting VMs would deviate
> more, so the stress points in the code would fall more and more
> differently, I suspect. (But we can probably test this by running both
> setups for long enough in parallel on the same code and seeing how much
> of a crash-rate difference it makes.)
>
>>
>> Cheers, Andreas
>>
>>> On Feb 3, 2025, at 15:30, Oleg Drokin <green at whamcloud.com> wrote:
>>>
>>> On Mon, 2025-02-03 at 20:24 +0000, Oleg Drokin wrote:
>>>
>>>> at $11/hour the m7a.metal-48xl would take $264 to run for just one
>>>> day, a week is an eye-watering $1848, so running this for every
>>>> patch is not super economical I'd say.
>>>
>>> x2gd metal at $5.34 per hour makes more sense as it has more RAM
>>> (and 64 CPUs is adequate I'd say) but still quite pricey if you want
>>> to run this at any sort of scale.
>
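For reference, the per-day and per-week figures quoted above are just the
hourly rates multiplied out; a quick sketch (rates as quoted in this thread,
on-demand pricing assumed, and taking the x2gd price to be the bare-metal
variant):

# Hourly rates as quoted earlier in the thread (USD/hour, on-demand).
rates = {
    "m7a.metal-48xl": 11.00,
    "x2gd.metal": 5.34,
}

for instance, per_hour in rates.items():
    per_day = per_hour * 24
    per_week = per_day * 7
    print(f"{instance}: ${per_day:.2f}/day, ${per_week:.2f}/week")

# m7a.metal-48xl: $264.00/day, $1848.00/week
# x2gd.metal:     $128.16/day, $897.12/week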