[lustre-discuss] lnet_peer_ni_add_to_recoveryq
Chris Horn
hornc at cray.com
Mon Mar 9 09:59:28 PDT 2020
Network failures cause an interface's health value to decrement. Recovery mode is the mechanism that raises the health value back up. Interfaces in recovery mode are pinged at a regular interval by the "lnet_monitor_thread". Each successful ping increases the health value of the interface (remote or local).
When LNet is selecting the local and remote interfaces to use for a PUT or GET, it considers the health value of each interface. Healthier interfaces are preferred.
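
To make the mechanism concrete, here is a minimal toy model in Python. It is only a sketch of the behaviour described above, not LNet's actual implementation. The class and function names are hypothetical, and the numeric constants are assumptions: a ceiling of 1000 and a penalty of 100 are merely consistent with the "Health = 900" log line, and the recovery step is made up for illustration.

```python
from dataclasses import dataclass

MAX_HEALTH = 1000       # assumed ceiling for a fully healthy interface
FAILURE_PENALTY = 100   # assumed decrement per failed PUT/GET
RECOVERY_STEP = 1       # hypothetical increment per successful recovery ping


@dataclass
class Interface:
    """Toy stand-in for a local or remote network interface."""
    name: str
    health: int = MAX_HEALTH
    in_recovery: bool = False

    def record_failure(self) -> None:
        # A failed PUT or GET lowers health and puts the interface in recovery.
        self.health = max(0, self.health - FAILURE_PENALTY)
        self.in_recovery = True

    def record_ping_success(self) -> None:
        # A successful recovery ping raises health; at the ceiling, recovery ends.
        self.health = min(MAX_HEALTH, self.health + RECOVERY_STEP)
        if self.health == MAX_HEALTH:
            self.in_recovery = False


def select_interface(interfaces):
    """Prefer the healthiest interface (ties go to the first one listed)."""
    return max(interfaces, key=lambda ni: ni.health)
```

For example, after `a.record_failure()` on one of two otherwise healthy interfaces `a` and `b`, `select_interface([a, b])` returns `b` until enough successful pings have brought `a` back to full health.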
Chris Horn
On 3/9/20, 4:22 AM, "Degremont, Aurelien" <degremoa at amazon.com> wrote:
What's the impact of being in recovery mode with LNET health?
On 3/6/2020 21:12, "lustre-discuss on behalf of Chris Horn" <lustre-discuss-bounces at lists.lustre.org on behalf of hornc at cray.com> wrote:
> lneterror: 10164:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked())
> lpni <address> added to recovery queue. Health = 900
The message means that the health value of a remote peer interface has been decremented, and as a result, the interface has been put into recovery mode. This mechanism is part of the LNet health feature.
Health values are decremented when a PUT or GET fails. Usually there are other messages in the log that can tell you more about the specific failure. Depending on your network type you should probably see messages from socklnd or o2iblnd. Network congestion could certainly lead to message timeouts, which would in turn result in interfaces being placed into recovery mode.
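
If you want to see how often this happens, a small script can pull these events out of a saved log. The following is a sketch that assumes the message has exactly the shape quoted above (possibly wrapped onto a second line). The regular expression and the function name are my own, not part of any Lustre tool.

```python
import re

# Matches the message shape quoted in this thread. The function name and the
# "lpni ... added to recovery queue. Health = N" wording come from that log
# line; everything else about the pattern is an assumption.
RECOVERY_RE = re.compile(
    r"lnet_peer_ni_add_to_recoveryq_locked\(\)\)\s*"
    r"lpni (?P<nid>\S+) added to recovery queue\. Health = (?P<health>\d+)"
)


def recovery_events(log_text):
    """Return (peer address, health value) pairs found in the log text."""
    return [(m["nid"], int(m["health"])) for m in RECOVERY_RE.finditer(log_text)]


# The sample reuses the line from this thread, with the address still elided.
SAMPLE = (
    "lneterror: 10164:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked())\n"
    "lpni <address> added to recovery queue. Health = 900\n"
)

if __name__ == "__main__":
    for nid, health in recovery_events(SAMPLE):
        print(nid, health)
```

Grouping the results by peer address should show whether the failures are concentrated on a few peers or spread across the fabric, which helps when you go looking for the matching socklnd or o2iblnd messages.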
Chris Horn
On 3/6/20, 8:59 AM, "lustre-discuss on behalf of Michael Di Domenico" <lustre-discuss-bounces at lists.lustre.org on behalf of mdidomenico4 at gmail.com> wrote:
along with the aforementioned error i also see these at the same time
lustreerror: 9675:0:(obd_config.c:1428:class_modify_config())
<...>-clilov-<...>; failed to send uevent qos_threshold_rr=100
On Fri, Mar 6, 2020 at 9:39 AM Michael Di Domenico
<mdidomenico4 at gmail.com> wrote:
>
> On Fri, Mar 6, 2020 at 9:36 AM Degremont, Aurelien <degremoa at amazon.com> wrote:
> >
> > Did you see any actual error on your system?
> >
> > Because there is a patch that just decreases the verbosity level of such messages, which suggests they can be ignored.
> > https://urldefense.proofpoint.com/v2/url?u=https-3A__jira.whamcloud.com_browse_LU-2D13071&d=DwICAg&c=C5b8zRQO1miGmBeVZ2LFWg&r=hIaFpo9yRyCwkkAs6y0c7W-QqT7uZMMSOkAIByhcA-I&m=ByOR33WN61jv0rEVZTtNhUgN313iSqbgrdfakY-TAjc&s=jp8DpDcylEQYlbd9-s3efysfDy2KdLvBrptsplqR1ks&e=
> > https://urldefense.proofpoint.com/v2/url?u=https-3A__review.whamcloud.com_-23_c_37718_&d=DwICAg&c=C5b8zRQO1miGmBeVZ2LFWg&r=hIaFpo9yRyCwkkAs6y0c7W-QqT7uZMMSOkAIByhcA-I&m=ByOR33WN61jv0rEVZTtNhUgN313iSqbgrdfakY-TAjc&s=8EUQ5wHRCuFFbd4PKxQCnTB_L9IgffvkzFw4_v6MEHg&e=
>
> thanks. it's not entirely clear just yet. i'm trying to track down a
> "slow jobs" issue. i see these messages everywhere, so it might be a
> non-issue or a sign of something more pressing.
_______________________________________________
lustre-discuss mailing list
lustre-discuss at lists.lustre.org
https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.lustre.org_listinfo.cgi_lustre-2Ddiscuss-2Dlustre.org&d=DwICAg&c=C5b8zRQO1miGmBeVZ2LFWg&r=hIaFpo9yRyCwkkAs6y0c7W-QqT7uZMMSOkAIByhcA-I&m=ByOR33WN61jv0rEVZTtNhUgN313iSqbgrdfakY-TAjc&s=d36yZXUxMDJOjluQt2LUPivEkfLhScuCLIQT6Fl-Qhs&e=