[lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
Oleg Drokin
green at whamcloud.com
Thu Feb 6 10:47:22 PST 2025
On Thu, 2025-02-06 at 18:24 +0000, Day, Timothy wrote:
>
> I wrote a parallel ktest runner [1] a while back that probably does
> the needed orchestration on the host side. It was originally intended
> to run sanity tests faster (mostly for the OSD stuff I was working
> on).
> But I think it could be adapted to run boilpot without much work.
> It'd probably need some daemonize mode and I'd need to validate
> that ktest actually captures all of the error modes we care about.
Aha, thanks, I'll try to look into that.
>
> Ideally, the boilpot part would be platform agnostic. The cloud
> orchestration part would just create the VM, run boilpot, and shuffle
> the crash dumps off the box. My main goal (right now) is to get
In fact I pre-process crashdumps on the boilpot and then feed a server
with that data, and if the crash is deemed "new" or interesting enough
for some other reason, it will request more data that the boilpot will
then provide.
After all, there are only so many identical known crashes one needs to
store.
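Roughly, the boilpot-side flow is something like the sketch below (the
endpoint paths, JSON fields and fingerprinting scheme here are just
illustrative placeholders, not the real crashdb protocol):

#!/usr/bin/env python3
# Illustrative sketch only: pre-process a crash into a small signature,
# ask the server whether it wants more, and upload the full dump only
# on request. All names below are placeholders.
import hashlib
import json
import urllib.request

CRASHDB = "https://knox.linuxhacker.ru/crashdb"   # placeholder API base

def fingerprint(dmesg_tail):
    # Reduce the crash to a stable signature, e.g. a hash of the
    # backtrace function names with offsets/addresses stripped.
    frames = [line.split("+")[0].strip()
              for line in dmesg_tail.splitlines()
              if line.lstrip().startswith("[<") or "RIP:" in line]
    return hashlib.sha256("\n".join(frames).encode()).hexdigest()

def report(dmesg_tail, full_dump_path):
    sig = fingerprint(dmesg_tail)
    req = urllib.request.Request(
        CRASHDB + "/report",
        data=json.dumps({"sig": sig}).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        answer = json.load(resp)
    # Only a "new" or otherwise interesting crash costs us an upload;
    # identical known crashes stop at the signature.
    if answer.get("needs_more"):
        with open(full_dump_path, "rb") as dump:
            upload = urllib.request.Request(
                CRASHDB + "/upload/" + sig,
                data=dump.read(), method="PUT")
            urllib.request.urlopen(upload)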
> something easily reproducible and get a sense of the signal/noise
> ratio on boilpot. Plus, it might be interesting to try and flush out
> bugs
After I filter out all the known and "invalid" failures, I get probably
on the order of maybe one crash a day, sometimes less, sometimes more.
The last one out of current master-next was a totally unknown one:
https://knox.linuxhacker.ru/crashdb_ui_external.py.cgi?newid=72768
This allows much higher visibility when something breaks; with the
recent https://review.whamcloud.com/c/fs/lustre-release/+/55724 all the
procfs failures were really visible (and when I changed recovery-small
to run only tests 55, 56 and 57, the frequency shot up to several
crashes every other hour).
You can also see time-sorted crashes from all sources that report to my
server here: https://knox.linuxhacker.ru/crashdb_ui_external.py.cgi
(add ?count=XXX if you want more than the default number). It only
shows "unvetted" crashes, which is something I probably need to change
eventually, but those are the most important ones, I guess.
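For example, to pull a longer listing programmatically (a trivial
sketch; 100 is just an arbitrary count, and the CGI returns an HTML
page you would still need to parse):

import urllib.request

# Ask the crashdb UI for up to 100 entries instead of the default.
# The count value here is arbitrary.
url = "https://knox.linuxhacker.ru/crashdb_ui_external.py.cgi?count=100"
with urllib.request.urlopen(url) as resp:
    print(resp.read().decode())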
> in my OSD as well [2]. It's hard to say how often I'd run it without
> first seeing how effective it is.
>
> Tim Day
>
> [1] https://github.com/tim-day-387/ktest/tree/pktest
> [2] https://review.whamcloud.com/c/fs/lustre-release/+/55594
>
>