[Lustre-discuss] [BUG] Lustre does not handle device removal events

Jeff Johnson jeff.johnson at aeoncomputing.com
Thu Nov 27 07:22:27 PST 2014


Eli,

LNET is bound to a particular device at the time the Lustre modules are
loaded. The Lustre modules need to be unloaded before the driver for any
device LNET is bound to is unloaded. This can be done with the Lustre init
scripts, manually, or with lustre_rmmod.
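
As a rough sketch of that ordering, a small guard can refuse to unload the
InfiniBand driver while LNET is still loaded. It assumes the LNET module
shows up as "lnet" in /proc/modules. The function names and the MODULES_FILE
variable are made up for illustration and are not part of Lustre.

```shell
#!/bin/sh
# Sketch only: block an IB driver unload while LNET is still loaded.
# MODULES_FILE is a hypothetical override so the check can be tried
# against a sample file instead of the live /proc/modules.
MODULES_FILE="${MODULES_FILE:-/proc/modules}"

# Return 0 if the named module is listed in MODULES_FILE.
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "$MODULES_FILE"
}

# Print "blocked" if lnet (assumed module name) is loaded, "ok" otherwise.
ib_unload_check() {
    if module_loaded lnet; then
        echo "blocked"
    else
        echo "ok"
    fi
}

# Unload the driver module given as $1 only when LNET is already gone.
safe_ib_unload() {
    if [ "$(ib_unload_check)" = "blocked" ]; then
        echo "lnet is loaded; unmount Lustre and run lustre_rmmod first" >&2
        return 1
    fi
    modprobe -r "$1"
}
```

With the Lustre filesystems unmounted and lustre_rmmod already run, this
would be called as "safe_ib_unload mlx5_ib" (the driver name is taken from
the trace below).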

I can't speak to whether or not this will be fixed as I don't know that the
developer community sees this as being broken. I'm sure someone will speak
to that.

--Jeff

On Wednesday, November 26, 2014, Eli Cohen <eli at dev.mellanox.co.il> wrote:

> Hi,
> we installed Lustre over rdma_cm on our system. When we tried to
> unload the InfiniBand drivers we got this call trace:
>
> LNetError: 131-3: Received notification of device removal
> Please shutdown LNET to allow this to proceed
> INFO: task modprobe:6236 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> modprobe      D 000000000000000a     0  6236   6189 0x00000000
> ffff881fb14b5c18 0000000000000086 0000000000000000 ffffffffa03c3f33
> ffff881fb14b5c38 ffff881fb14b5c08 ffff881fb14b5ba8 ffff882028df5c70
> ffff882013637ab8 ffff881fb14b5fd8 000000000000fb88 ffff882013637ab8
> Call Trace:
> [<ffffffffa03c3f33>] ? libcfs_debug_vmsg2+0x5d3/0xbc0 [libcfs]
> [<ffffffff8150e555>] schedule_timeout+0x215/0x2e0
> [<ffffffff8150e1d3>] wait_for_common+0x123/0x180
> [<ffffffff81063310>] ? default_wake_function+0x0/0x20
> [<ffffffff8150e2ed>] wait_for_completion+0x1d/0x20
> [<ffffffffa02c10be>] cma_remove_one+0x18e/0x210 [rdma_cm]
> [<ffffffffa024660f>] ib_unregister_device+0x4f/0x100 [ib_core]
> [<ffffffff81063310>] ? default_wake_function+0x0/0x20
> [<ffffffffa0316689>] mlx5_ib_remove+0x19/0x50 [mlx5_ib]
> [<ffffffffa02f4245>] mlx5_remove_device+0x75/0x90 [mlx5_core]
> [<ffffffffa02f4633>] mlx5_unregister_interface+0x43/0x80 [mlx5_core]
> [<ffffffffa0328955>] __exit_compat+0x15/0xe2 [mlx5_ib]
> [<ffffffff810b4814>] sys_delete_module+0x194/0x260
> [<ffffffff8151311e>] ? do_page_fault+0x3e/0xa0
> [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
>
> I saw this piece of code in the callback function
> kiblnd_cm_callback:
>
>         case RDMA_CM_EVENT_DEVICE_REMOVAL:
>                 LCONSOLE_ERROR_MSG(0x131,
>                                    "Received notification of device removal\n"
>                                    "Please shutdown LNET to allow this to proceed\n");
>                 /* Can't remove network from underneath LNET for now, so I have
>                  * to ignore this */
>                 return 0;
>
> which suggests that device removal events are not handled. Is there a
> plan to fix this?


-- 
------------------------------
Jeff Johnson
Co-Founder
Aeon Computing

jeff.johnson at aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x1001   f: 858-412-3845
m: 619-204-9061

4170 Morena Boulevard, Suite D - San Diego, CA 92117

High-Performance Computing / Lustre Filesystems / Scale-out Storage