[lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
Patrick Farrell
pfarrell at ddn.com
Tue Feb 4 15:43:50 PST 2025
Obviously Tim would have to speak to this if he can, but that's not the way things worked at OCI, and I would think it's the same at all the hyperscalers: there's no such thing as idle time, not really, or at least not like this. They work very hard to minimize idle time across the (many, many) datacenters/nodes, and time is absolutely charged for internal use (perhaps charged differently, but still). Plenty of people would love "idle" time, so there isn't any.
-Patrick
________________________________
From: lustre-devel <lustre-devel-bounces at lists.lustre.org> on behalf of Oleg Drokin <green at whamcloud.com>
Sent: Tuesday, February 4, 2025 12:38 PM
To: Andreas Dilger <adilger at ddn.com>
Cc: lustre-devel at lists.lustre.org <lustre-devel at lists.lustre.org>
Subject: Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
On Tue, 2025-02-04 at 17:33 +0000, Andreas Dilger wrote:
> You overlook that Tim works for AWS, so he would not actually pay to
> run these nodes. He could run in machine idle times while no external
> customer is paying for them.
If this could be arranged that would be great, of course, but I don't
want to assume something of this nature unless it's explicitly stated.
And who knows what sort of internal accounting might be in place to
keep track of (and approve) uses like this, too.
> I suspect with the random nature of the boilpot that it is the total
> number of hours runtime that matter, not whether they are contiguous
> or not. So running 24x boilpot nodes for 1h during off-peak times
> would likely produce the same result as 24h continuous on one node.
Well, not exactly true. There need to be continuous chunks of at least
1x the longest testrun, and preferably much more (2x is probably a
better minimum).
If conf-sanity takes 5 hours in this setup (CPU overcommit making
things slow and whatnot) and you only ever run for an hour, we never
get to try most of conf-sanity.
Also, 50 sessions running conf-sanity in parallel 1x vs.
10 sessions running conf-sanity in parallel 5x: the latter probably
wins coverage-wise, because over time the other conflicting VMs would
deviate more, so the stress points in the code would fall more and more
differently, I suspect (but we could probably test this by running both
setups long enough in parallel on the same code and seeing how much of
a crash-rate difference it makes).
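
To put rough numbers on the chunk-length point, here is a quick
back-of-envelope sketch in Python (purely illustrative; the 5-hour
conf-sanity runtime is the hypothetical figure above, and the 24
node-hours match the 24x1h vs. 1x24h comparison):

# Same total node-hours, very different coverage depending on how long
# each contiguous chunk is. Assumes a 5-hour conf-sanity run.
CONF_SANITY_HOURS = 5  # assumed wall-clock time per run under CPU overcommit

def complete_runs(nodes: int, hours_per_node: int) -> int:
    # Each node only finishes however many whole runs fit into its chunk.
    return nodes * (hours_per_node // CONF_SANITY_HOURS)

print(complete_runs(nodes=24, hours_per_node=1))   # 24 x 1h off-peak chunks -> 0 complete runs
print(complete_runs(nodes=1, hours_per_node=24))   # 1 node, 24h continuous  -> 4 complete runs
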
>
> Cheers, Andreas
>
> > On Feb 3, 2025, at 15:30, Oleg Drokin <green at whamcloud.com> wrote:
> >
> > On Mon, 2025-02-03 at 20:24 +0000, Oleg Drokin wrote:
> >
> > > at $11/hour the m7a.metal-48xl would take $264 to run for just one day,
> > > a week is an eye-watering $1848, so running this for every patch is not
> > > super economical I'd say.
> >
> > x2gd metal at $5.34 per hour makes more sense as it has more RAM (and
> > 64 CPUs is adequate I'd say) but still quite pricey if you want to run
> > this at any sort of scale.
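
For reference, the daily and weekly figures follow straight from the
quoted hourly rates; a quick sketch (instance names and rates as quoted
in the thread, not checked against current AWS pricing):

# Cost per day/week from the hourly rates quoted above.
RATES_USD_PER_HOUR = {
    "m7a.metal-48xl": 11.00,
    "x2gd.metal": 5.34,
}

for instance, rate in RATES_USD_PER_HOUR.items():
    per_day = rate * 24
    per_week = per_day * 7
    print(f"{instance}: ${per_day:.2f}/day, ${per_week:.2f}/week")
# m7a.metal-48xl: $264.00/day, $1848.00/week
# x2gd.metal: $128.16/day, $897.12/week
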
_______________________________________________
lustre-devel mailing list
lustre-devel at lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org