[Lustre-discuss] [BUG] Lustre does not handle device removal events

Zhen, Liang liang.zhen at intel.com
Thu Nov 27 07:34:36 PST 2014


Release 2.7 will have Dynamic LNet Config (DLC), which allows users to dynamically add and remove LNet NIs from userspace, but so far we have no plan to handle this event and trigger NI removal in the kernel.

Regards
Liang

From: Jeff Johnson <jeff.johnson at aeoncomputing.com>
Date: Thursday, November 27, 2014 at 11:22 PM
To: "lustre-discuss at lists.lustre.org" <lustre-discuss at lists.lustre.org>
Subject: Re: [Lustre-discuss] [BUG] Lustre does not handle device removal events

Eli,

LNET is bound to a particular device at the time the Lustre modules are loaded. The Lustre modules need to be unloaded before unloading the driver for any device to which LNET is bound. This can be done with the Lustre init scripts, manually, or by using lustre_rmmod.

I can't speak to whether or not this will be fixed, as I don't know that the developer community sees this as broken. I'm sure someone will speak to that.

--Jeff

On Wednesday, November 26, 2014, Eli Cohen <eli at dev.mellanox.co.il> wrote:
Hi,
we installed Lustre over rdma_cm on our system. When we tried to
unload the InfiniBand drivers we got this call trace:

LNetError: 131-3: Received notification of device removal
Please shutdown LNET to allow this to proceed
INFO: task modprobe:6236 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
modprobe      D 000000000000000a     0  6236   6189 0x00000000
ffff881fb14b5c18 0000000000000086 0000000000000000 ffffffffa03c3f33
ffff881fb14b5c38 ffff881fb14b5c08 ffff881fb14b5ba8 ffff882028df5c70
ffff882013637ab8 ffff881fb14b5fd8 000000000000fb88 ffff882013637ab8
Call Trace:
[<ffffffffa03c3f33>] ? libcfs_debug_vmsg2+0x5d3/0xbc0 [libcfs]
[<ffffffff8150e555>] schedule_timeout+0x215/0x2e0
[<ffffffff8150e1d3>] wait_for_common+0x123/0x180
[<ffffffff81063310>] ? default_wake_function+0x0/0x20
[<ffffffff8150e2ed>] wait_for_completion+0x1d/0x20
[<ffffffffa02c10be>] cma_remove_one+0x18e/0x210 [rdma_cm]
[<ffffffffa024660f>] ib_unregister_device+0x4f/0x100 [ib_core]
[<ffffffff81063310>] ? default_wake_function+0x0/0x20
[<ffffffffa0316689>] mlx5_ib_remove+0x19/0x50 [mlx5_ib]
[<ffffffffa02f4245>] mlx5_remove_device+0x75/0x90 [mlx5_core]
[<ffffffffa02f4633>] mlx5_unregister_interface+0x43/0x80 [mlx5_core]
[<ffffffffa0328955>] __exit_compat+0x15/0xe2 [mlx5_ib]
[<ffffffff810b4814>] sys_delete_module+0x194/0x260
[<ffffffff8151311e>] ? do_page_fault+0x3e/0xa0
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

I saw this piece of code in the callback function
kiblnd_cm_callback:

        case RDMA_CM_EVENT_DEVICE_REMOVAL:
                LCONSOLE_ERROR_MSG(0x131,
                                   "Received notification of device removal\n"
                                   "Please shutdown LNET to allow this to proceed\n");
                /* Can't remove network from underneath LNET for now, so I
                 * have to ignore this */
                return 0;

which suggests that device removal events are not handled. Is there a
plan to fix this?
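
For illustration only, here is a minimal user-space sketch of why a handler that returns 0 on device removal could leave the unload blocked. It assumes, based on the trace above and not confirmed in this thread, that the remover waits in cma_remove_one until every connection ID on the device has been released, and that a non-zero return from the handler asks the caller to destroy the ID. All sim_* and handler_* names are hypothetical stand-ins, not the real rdma_cm or o2iblnd API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel types; NOT the real rdma_cm API. */
enum sim_event { SIM_EVENT_ESTABLISHED, SIM_EVENT_DEVICE_REMOVAL };

struct sim_cm_id {
    int alive;  /* 1 while this ID still pins the device */
};

typedef int (*sim_handler)(struct sim_cm_id *id, enum sim_event ev);

/* Mirrors the quoted callback: report the event and ignore it.
 * Returning 0 keeps the ID alive. */
int handler_ignore(struct sim_cm_id *id, enum sim_event ev)
{
    (void)id;
    (void)ev;
    return 0;
}

/* Assumed alternative: a non-zero return on removal asks the
 * caller to destroy the ID on the handler's behalf. */
int handler_release(struct sim_cm_id *id, enum sim_event ev)
{
    (void)id;
    return ev == SIM_EVENT_DEVICE_REMOVAL ? 1 : 0;
}

/* Deliver a removal event to every live ID and destroy those whose
 * handler returns non-zero. Returns the number of IDs that still pin
 * the device; a non-zero result models a remover that keeps waiting. */
size_t sim_remove_device(struct sim_cm_id *ids, size_t n, sim_handler cb)
{
    size_t pinned = 0;

    assert(ids != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!ids[i].alive)
            continue;
        if (cb(&ids[i], SIM_EVENT_DEVICE_REMOVAL) != 0)
            ids[i].alive = 0;   /* caller destroys the ID */
        if (ids[i].alive)
            pinned++;
    }
    return pinned;
}
```

In this model, calling sim_remove_device with handler_ignore leaves every live ID pinned, which corresponds to modprobe sitting in wait_for_completion, while handler_release brings the count to zero. A real fix would also have to tear down the LNet state built on those connections, which this sketch does not attempt.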

--
------------------------------
Jeff Johnson
Co-Founder
Aeon Computing

jeff.johnson at aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x1001   f: 858-412-3845
m: 619-204-9061

4170 Morena Boulevard, Suite D - San Diego, CA 92117

High-Performance Computing / Lustre Filesystems / Scale-out Storage