[lustre-discuss] Inquiry Regarding LBUG Issue in Lustre 2.15.5 with RoCE and Sanity Test Coverage

Andreas Dilger adilger at ddn.com
Fri Apr 11 01:45:43 PDT 2025


On Mar 26, 2025, at 18:28, 권세훈 via lustre-discuss <lustre-discuss at lists.lustre.org> wrote:
> 
> My name is Sehoon Kwon, and I’m a developer at Gluesys, a storage solution provider based in South Korea.


Hello.

> We are currently working with Lustre version 2.15.5, and during testing in a RoCE environment, we encountered an LBUG issue. Upon checking the community issue tracker (LU-16637), we confirmed that a similar issue had been reported and resolved in a later release (Lustre 2.16).
> We also noted that there had been an effort to backport the fix to the b2_15 branch. However, based on our investigation, it appears that the patch has not yet been merged. As the stability of the fix remains unverified in this branch, we are preparing to evaluate the patch internally, referring to the Maloo-based testing you conducted as a reference.
> 
> We have backported the commit addressing LU-16637 to our ZFS-based Lustre 2.15.5 environment and successfully completed the build process, along with several other fixes.
> Following the Testing HOWTO on the Lustre Wiki, we executed sanity.sh and observed that the script includes nearly 1000 test cases. However, in some shared test logs from Whamcloud, we noticed that only around 300 tests were actually run.
> 
> We would appreciate your clarification on the following points:
>     • Are there any default test sets or predefined exclusions when running sanity.sh?
>  Alternatively, does Whamcloud maintain an internal list of commonly executed tests?

The number of subtests that are run depends on the test configuration.  The script prints a message for each subtest that is skipped, for example because it needs a newer server version, two or more MDTs or OSTs, a tool that is not installed, etc.
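As a rough illustration (assuming a single-node setup as described in the Testing HOWTO, with the test scripts installed under /usr/lib64/lustre/tests; the ONLY and EXCEPT environment variables are honored by the test framework, and the subtest numbers below are only placeholders):

cd /usr/lib64/lustre/tests
# Full run; each skipped subtest prints a SKIP message giving the reason
# (needs more MDTs/OSTs, newer server version, missing tool, ...):
sh sanity.sh
# Skip specific subtests that do not apply to your configuration:
EXCEPT="27 56" sh sanity.sh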

>     • For the 2.15 branch, is there any recommended test suite or guideline for verifying backported patches?

The tests that should be run depend on what the patch is changing.  We run nearly all of the tests for every patch (about 150h of testing across different configurations, kernels, features, etc.), unless the patch does not change any functional code and is marked "trivial", in which case it runs only about 6-8h of testing (sanity, sanity-lnet).
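For reference, the reduced test load is requested with a Test-Parameters line in the commit message (see the Commit_Comments wiki page below for the full syntax); a minimal example:

Test-Parameters: trivial

This should only be used when the patch genuinely does not change any functional code.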

>     • In addition to the sanity suite, we are aware of several other test categories.
>  If there is a commonly used baseline set for general validation, your guidance would be greatly appreciated.
> We aim to align our testing with community standards and ensure compatibility and stability, so any information or reference materials you could provide would be of great help.

Nearly all of the tests run in review testing will pass.  However, given the distributed nature of the filesystem and the fact that the tests run in VMs, some subtests fail intermittently.  It should be possible to re-run just the failed tests and have them pass.
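For example, to re-run only the subtests that failed (again assuming the single-node setup from the Testing HOWTO; subtest numbers are placeholders):

cd /usr/lib64/lustre/tests
ONLY="42a 77b" sh sanity.sh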

You are welcome to push the backported patch to the b2_15 branch of the fs/lustre repository in Gerrit.  Please follow the submission guidelines:
https://wiki.lustre.org/Submitting_Changes
https://wiki.lustre.org/Using_Gerrit
https://wiki.lustre.org/Commit_Comments
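The push itself looks roughly like this (a sketch, not a definitive command; substitute your Gerrit username, and confirm the project path against the Using_Gerrit page - the published clone URL uses fs/lustre-release):

# Push the backport as a new change for review on the b2_15 branch:
git push ssh://<username>@review.whamcloud.com:29418/fs/lustre-release HEAD:refs/for/b2_15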

Since this is a backported patch, please add the following labels to indicate that it is backported from the master branch
(any backported patch on the b2_15 branch will have these labels):

Lustre-change: {Gerrit URL of the original master patch, e.g. https://review.whamcloud.com/nnnnn}
Lustre-commit: {git commit hash of the original master patch}

and remove the existing "Reviewed-on:", "Reviewed-by: Oleg Drokin", and "Tested-by:" labels from the patch.
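Putting this together, a backport commit message would look roughly like the following (all values are placeholders; the Change-Id line is normally added by the Gerrit commit hook, and Reviewed-by lines from the original reviewers other than the gatekeeper are kept):

LU-nnnnn subsys: one-line summary from the master patch

Original commit message body describing the change.

Lustre-change: https://review.whamcloud.com/nnnnn
Lustre-commit: 0123456789abcdef0123456789abcdef01234567
Signed-off-by: Original Author <author@example.com>
Reviewed-by: Reviewer Name <reviewer@example.com>
Signed-off-by: Your Name <you@example.com>
Change-Id: I0123456789abcdef0123456789abcdef01234567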

Cheers, Andreas
—
Andreas Dilger
Lustre Principal Architect
Whamcloud/DDN