[lustre-discuss] 2.15.4 o2iblnd on RoCEv2?

Andreas Dilger adilger at whamcloud.com
Wed Jan 10 13:55:57 PST 2024


Granted that I'm not an LNet expert, but "errno: -1 descr: cannot parse net '<255:65535>' " doesn't immediately lead me to the same conclusion as "unknown interface 'ib0' " would if it were printed as the error message.  Also, "errno: -1" is "-EPERM = Operation not permitted", which doesn't give the same information as "-ENXIO = No such device or address" or even "-EINVAL = Invalid argument" would.

That said, I can't even offer a patch for this myself, since that exact error message is used in a few different places, though I suspect it is coming from lustre_lnet_config_ni().

Looking further into this, now that I've found where (I think) the error message is generated, it seems that "errno: -1" is not "-EPERM" but rather "LUSTRE_CFG_RC_BAD_PARAM".  IMHO it is a travesty to use a separate set of error numbers (and then print them after "errno:") instead of existing POSIX error codes that could fill the same role (with some creative mapping):

    #define LUSTRE_CFG_RC_NO_ERR                     0  => fine
    #define LUSTRE_CFG_RC_BAD_PARAM                 -1  => -EINVAL
    #define LUSTRE_CFG_RC_MISSING_PARAM             -2  => -EFAULT
    #define LUSTRE_CFG_RC_OUT_OF_RANGE_PARAM        -3  => -ERANGE
    #define LUSTRE_CFG_RC_OUT_OF_MEM                -4  => -ENOMEM
    #define LUSTRE_CFG_RC_GENERIC_ERR               -5  => -ENODATA
    #define LUSTRE_CFG_RC_NO_MATCH                  -6  => -ENOMSG
    #define LUSTRE_CFG_RC_MATCH                     -7  => -EXFULL
    #define LUSTRE_CFG_RC_SKIP                      -8  => -EBADSLT
    #define LUSTRE_CFG_RC_LAST_ELEM                 -9  => -ECHRNG
    #define LUSTRE_CFG_RC_MARSHAL_FAIL              -10 => -ENOSTR

I don't think "overloading" the POSIX error codes to mean something similar is worse than using random numbers to report errors.  Also, in some cases (even in lustre_lnet_config_ni()) it is using "rc = -errno", so the LUSTRE_CFG_RC_* errors are *already* conflicting with POSIX error numbers, and it is impossible to distinguish between them...
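To make the mapping above concrete, here is a rough sketch of a translation helper.  The function name lustre_cfg_rc_to_errno() is made up for illustration and is not an existing Lustre function; the LUSTRE_CFG_RC_* values are copied from the list above, and the errno names assume a Linux <errno.h>:

```c
#include <errno.h>

/* Existing values, copied from the list above. */
#define LUSTRE_CFG_RC_NO_ERR                     0
#define LUSTRE_CFG_RC_BAD_PARAM                 -1
#define LUSTRE_CFG_RC_MISSING_PARAM             -2
#define LUSTRE_CFG_RC_OUT_OF_RANGE_PARAM        -3
#define LUSTRE_CFG_RC_OUT_OF_MEM                -4
#define LUSTRE_CFG_RC_GENERIC_ERR               -5
#define LUSTRE_CFG_RC_NO_MATCH                  -6
#define LUSTRE_CFG_RC_MATCH                     -7
#define LUSTRE_CFG_RC_SKIP                      -8
#define LUSTRE_CFG_RC_LAST_ELEM                 -9
#define LUSTRE_CFG_RC_MARSHAL_FAIL              -10

/*
 * Hypothetical helper: translate a LUSTRE_CFG_RC_* value into the POSIX
 * error code proposed in the mapping above.
 */
static int lustre_cfg_rc_to_errno(int rc)
{
	switch (rc) {
	case LUSTRE_CFG_RC_NO_ERR:             return 0;
	case LUSTRE_CFG_RC_BAD_PARAM:          return -EINVAL;
	case LUSTRE_CFG_RC_MISSING_PARAM:      return -EFAULT;
	case LUSTRE_CFG_RC_OUT_OF_RANGE_PARAM: return -ERANGE;
	case LUSTRE_CFG_RC_OUT_OF_MEM:         return -ENOMEM;
	case LUSTRE_CFG_RC_GENERIC_ERR:        return -ENODATA;
	case LUSTRE_CFG_RC_NO_MATCH:           return -ENOMSG;
	case LUSTRE_CFG_RC_MATCH:              return -EXFULL;
	case LUSTRE_CFG_RC_SKIP:               return -EBADSLT;
	case LUSTRE_CFG_RC_LAST_ELEM:          return -ECHRNG;
	case LUSTRE_CFG_RC_MARSHAL_FAIL:       return -ENOSTR;
	default:
		/* Anything outside -10..0 is passed through unchanged. */
		return rc;
	}
}
```

Note that this cannot fix the ambiguity for callers that already store "rc = -errno": a genuine -EPERM (-1) is indistinguishable from LUSTRE_CFG_RC_BAD_PARAM here and would come out as -EINVAL.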

The main question is whether changing these numbers will break a user->kernel interface, or if these definitions are only in userspace?  It looks like lnetctl.c is only ever checking "!= LUSTRE_CFG_RC_NO_ERR", so maybe it is fine?  None of the proposed POSIX values overlap with the current -1..-10 values, so it would be possible to start accepting either value for the return in the user tools, and then at some point in the future start actually returning the POSIX codes...  Something for the LNet folks to figure out.
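The "accept either value" transition could look something like the sketch below.  The helper name lnet_rc_is_bad_param() is hypothetical (it is not in lnetctl.c today), and only one of the codes is shown; the compile-time check assumes a C11 compiler:

```c
#include <errno.h>
#include <stdbool.h>

/* Existing value, copied from the list above. */
#define LUSTRE_CFG_RC_BAD_PARAM                 -1

/*
 * The transition only works if the proposed POSIX value lies outside the
 * legacy -1..-10 range, so check that at compile time.
 */
_Static_assert(EINVAL > 10, "EINVAL must not overlap the legacy range");

/*
 * Hypothetical helper: during the transition, treat both the legacy value
 * and the proposed POSIX value as "bad parameter".
 */
static bool lnet_rc_is_bad_param(int rc)
{
	return rc == LUSTRE_CFG_RC_BAD_PARAM || rc == -EINVAL;
}
```

Once the tools accept both values, the library side could switch to returning -EINVAL without breaking older callers that only test "!= LUSTRE_CFG_RC_NO_ERR".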

Cheers, Andreas

On Jan 10, 2024, at 13:29, Jeff Johnson <jeff.johnson at aeoncomputing.com> wrote:

A LU ticket and patch for lnetctl or for me being an under-caffeinated
idiot? ;-)

On Wed, Jan 10, 2024 at 12:06 PM Andreas Dilger <adilger at whamcloud.com> wrote:

It would seem that the error message could be improved in this case?  Could you file an LU ticket for that with the reproducer below, and ideally along with a patch?

Cheers, Andreas

--
Andreas Dilger
Lustre Principal Architect
Whamcloud