[Lustre-discuss] Problem with LNET and openibd on Lustre 1.8.4 while rebooting

Nirmal Seenu nirmal at fnal.gov
Thu Sep 9 09:56:36 PDT 2010


I just upgraded Lustre from 1.8.1.1 to 1.8.4, and I can no longer reboot my Lustre clients cleanly. I am using the latest RHEL kernel with
the openibd that ships as part of that RHEL kernel, plus the patchless Lustre client installed from the tarball.

The Lustre client gets unmounted cleanly, but the system deadlocks once the openibd driver is removed. As a workaround, I had to modify the
openibd stop script to include "umount lustre" and "lustre_rmmod".
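
A minimal sketch of what that addition to the openibd stop script might look like. The function name lustre_shutdown and the RUN variable are hypothetical, introduced here only for illustration; the sketch assumes lustre_rmmod (shipped with the Lustre client) is on PATH and that the function is called before the stop action starts unloading the InfiniBand modules.

```shell
#!/bin/sh
# Sketch only: release LNET's hold on the IB device before openibd unloads it.
# RUN is a hypothetical dry-run hook: set RUN=echo to print the commands
# instead of executing them.

lustre_shutdown() {
    # Unmount every mounted filesystem of type lustre.
    ${RUN:-} umount -a -t lustre || return 1
    # Unload the Lustre modules; this should also take LNET down
    # (assumption: lustre_rmmod is installed and on PATH).
    ${RUN:-} lustre_rmmod || return 1
}
```

Running it as "RUN=echo lustre_shutdown" prints the two commands without touching the system, which is a safe way to check the ordering before editing the real init script.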

The following is the console output I get when I try to reboot the Lustre client:

Scientific Linux SLF release 5.3 (Lederman)
Kernel 2.6.18-194.11.1.el5 on an x86_64

INIT:Shutting down smartd: [  OK  ]
Stopping atd: [  OK  ]
Shutting down process accounting:  [  OK  ]
Stopping xinetd: [  OK  ]
Stopping autofs:  Stopping automount: [  OK  ]
[  OK  ]
Stopping acpi daemon: [  OK  ]
Shutting down ntpd: [  OK  ]
Unmounting network block filesystems:  LustreError: 3697:0:(ldlm_request.c:1025:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
LustreError: 3697:0:(ldlm_request.c:1587:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Lustre: client ffff81020f145400 umount complete
[  OK  ]
Unmounting NFS filesystems:  [  OK  ]
Stopping system message bus: [  OK  ]
Stopping RPC idmapd: [  OK  ]
Stopping NFS locking: [  OK  ]
Stopping NFS statd: [  OK  ]
Stopping portmap: [  OK  ]
Stopping PC/SC smart card daemon (pcscd): [  OK  ]
Shutting down kernel logger: [  OK  ]
Shutting down system logger: [  OK  ]
Unloading OpenIB kernel modules:NET: Unregistered protocol family 27

Failed to unload rdma_cm

Failed to unload ib_cm

Failed to unload iw_cm
LustreError: 131-3: Received notification of device removal
Please shutdown LNET to allow this to proceed
INFO: task rmmod:4151 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
rmmod         D ffff810227061420     0  4151   3795                     (NOTLB)
  ffff81021c8ddce8 0000000000000082 000000000000000f 0000000000000292
  00000000000000ef 0000000000000001 ffff81020ecdd100 ffff8102271ef040
  0000004a957c4bd9 000000000095dc57 ffff81020ecdd2e8 0000000480076646
Call Trace:
  [<ffffffff80063167>] wait_for_completion+0x79/0xa2
  [<ffffffff8008cfa1>] default_wake_function+0x0/0xe
  [<ffffffff80063b05>] mutex_lock+0xd/0x1d
  [<ffffffff8838d155>] :rdma_cm:cma_remove_one+0x171/0x1a2
  [<ffffffff80076525>] do_flush_tlb_all+0x0/0x6a
  [<ffffffff8817d5f0>] :ib_core:ib_unregister_device+0x30/0xdb
  [<ffffffff881a918a>] :ib_mthca:__mthca_remove_one+0x30/0x11a
  [<ffffffff80063b05>] mutex_lock+0xd/0x1d
  [<ffffffff881a928c>] :ib_mthca:mthca_remove_one+0x18/0x25
  [<ffffffff8015daeb>] pci_device_remove+0x24/0x3a
  [<ffffffff801c7a3e>] __device_release_driver+0x9f/0xe9
  [<ffffffff801c7e04>] driver_detach+0xad/0x101
  [<ffffffff801c6ffe>] bus_remove_driver+0x6f/0x92
  [<ffffffff801c7e8b>] driver_unregister+0xd/0x16
  [<ffffffff8015ddb4>] pci_unregister_driver+0x2a/0x79
  [<ffffffff881bc398>] :ib_mthca:mthca_cleanup+0x10/0x16
  [<ffffffff800a6674>] sys_delete_module+0x196/0x1c5
  [<ffffffff8005d116>] system_call+0x7e/0x83


Nirmal


