[Lustre-discuss] [BUG] Lustre does not handle device removal events

Eli Cohen eli at dev.mellanox.co.il
Wed Nov 26 22:49:56 PST 2014


Hi,
We installed Lustre over rdma_cm on our system. When we tried to
unload the InfiniBand drivers, we got this call trace:

LNetError: 131-3: Received notification of device removal
Please shutdown LNET to allow this to proceed
INFO: task modprobe:6236 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
modprobe      D 000000000000000a     0  6236   6189 0x00000000
ffff881fb14b5c18 0000000000000086 0000000000000000 ffffffffa03c3f33
ffff881fb14b5c38 ffff881fb14b5c08 ffff881fb14b5ba8 ffff882028df5c70
ffff882013637ab8 ffff881fb14b5fd8 000000000000fb88 ffff882013637ab8
Call Trace:
[<ffffffffa03c3f33>] ? libcfs_debug_vmsg2+0x5d3/0xbc0 [libcfs]
[<ffffffff8150e555>] schedule_timeout+0x215/0x2e0
[<ffffffff8150e1d3>] wait_for_common+0x123/0x180
[<ffffffff81063310>] ? default_wake_function+0x0/0x20
[<ffffffff8150e2ed>] wait_for_completion+0x1d/0x20
[<ffffffffa02c10be>] cma_remove_one+0x18e/0x210 [rdma_cm]
[<ffffffffa024660f>] ib_unregister_device+0x4f/0x100 [ib_core]
[<ffffffff81063310>] ? default_wake_function+0x0/0x20
[<ffffffffa0316689>] mlx5_ib_remove+0x19/0x50 [mlx5_ib]
[<ffffffffa02f4245>] mlx5_remove_device+0x75/0x90 [mlx5_core]
[<ffffffffa02f4633>] mlx5_unregister_interface+0x43/0x80 [mlx5_core]
[<ffffffffa0328955>] __exit_compat+0x15/0xe2 [mlx5_ib]
[<ffffffff810b4814>] sys_delete_module+0x194/0x260
[<ffffffff8151311e>] ? do_page_fault+0x3e/0xa0
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

I saw this piece of code in the callback function
kiblnd_cm_callback:

        case RDMA_CM_EVENT_DEVICE_REMOVAL:
                LCONSOLE_ERROR_MSG(0x131,
                                   "Received notification of device removal\n"
                                   "Please shutdown LNET to allow this to proceed\n");
                /* Can't remove network from underneath LNET for now, so I have
                 * to ignore this */
                return 0;

which suggests that device removal events are not handled. Is there a
plan to fix this?
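
To illustrate why the unload blocks, here is a minimal user-space model.
It is not Lustre or kernel code, and every name in it (model_*,
handler_ignore, handler_release) is made up for illustration. It assumes,
as I understand the rdma_cm convention, that a non-zero return from the
event callback asks rdma_cm to destroy the id, and that device removal
can only finish once no ids remain on the device, which is consistent
with cma_remove_one sitting in wait_for_completion in the trace above.

```c
/* User-space model only; all names are hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_event { MODEL_EVENT_DEVICE_REMOVAL };

/* Stand-in for a connection-manager id bound to a device. */
struct model_cm_id {
        bool destroyed;
};

/* Stand-in for a device with a fixed set of ids attached to it. */
struct model_device {
        struct model_cm_id *ids;
        size_t              nids;
};

typedef int (*model_handler)(struct model_cm_id *id, enum model_event ev);

/* Mirrors the behaviour quoted above: ignore the event and return 0. */
static int handler_ignore(struct model_cm_id *id, enum model_event ev)
{
        (void)id;
        (void)ev;
        return 0;
}

/* Hypothetical alternative: return non-zero so the id gets destroyed. */
static int handler_release(struct model_cm_id *id, enum model_event ev)
{
        (void)id;
        return ev == MODEL_EVENT_DEVICE_REMOVAL ? 1 : 0;
}

/*
 * Deliver the removal event to every live id on the device.
 * Returns the number of ids still alive afterwards; in this model the
 * removal can complete only when that number is 0.
 */
static size_t model_remove_device(struct model_device *dev, model_handler h)
{
        size_t remaining = 0;

        assert(dev != NULL && h != NULL);
        for (size_t i = 0; i < dev->nids; i++) {
                struct model_cm_id *id = &dev->ids[i];

                if (id->destroyed)
                        continue;
                if (h(id, MODEL_EVENT_DEVICE_REMOVAL) != 0)
                        id->destroyed = true;   /* id released */
                else
                        remaining++;            /* id still pins the device */
        }
        return remaining;
}
```

With handler_ignore every id stays alive, so the model never reaches the
"0 remaining" state, which corresponds to modprobe hanging. With
handler_release the count drops to 0. A real fix in Lustre would also
have to tear down the connections and the network interface that use the
device, which this sketch does not attempt.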


