[Lustre-discuss] Problem with LNET and openibd on Lustre 1.8.4 while rebooting

Nirmal Seenu nirmal at fnal.gov
Thu Sep 9 12:33:57 PDT 2010


Lustre does get unmounted before the NFS filesystems, as the log messages show. The problem is that LNET is still up when the openibd modules
are removed.
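
A minimal sketch of the kind of workaround described in the original message (unmount Lustre and run lustre_rmmod before openibd unloads its modules). The helper names run and stop_lustre_before_ib and the DRY_RUN variable are made up for illustration, "umount -a -t lustre" is an assumption about what "umount lustre" meant, and lustre_rmmod is assumed to take LNET down along with the Lustre modules. It is not the actual change made to the openibd script.

```shell
#!/bin/sh
# Hypothetical helper, intended to be called at the start of the "stop"
# path of the openibd init script. Names here are illustrative only.

run() {
    # With DRY_RUN=1, print the command instead of executing it.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

stop_lustre_before_ib() {
    # Unmount every mounted Lustre filesystem first (assumed form of
    # the "umount lustre" step).
    run umount -a -t lustre || return 1
    # Then unload the Lustre modules; this is assumed to shut down LNET
    # so the IB device can be released when openibd removes its modules.
    run lustre_rmmod || return 1
}
```

Setting DRY_RUN=1 and calling stop_lustre_before_ib prints the two commands without touching the system, which makes it possible to check the ordering before wiring it into a real shutdown path.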

Nirmal

On 09/09/2010 02:28 PM, Andreas Dilger wrote:
> On 2010-09-09, at 10:56, Nirmal Seenu wrote:
>> I just upgraded Lustre from 1.8.1.1 to 1.8.4 and I can't reboot my Lustre clients cleanly anymore. I am using the latest RHEL kernel and
>> the openibd that comes as part of that RHEL kernel, plus the patchless Lustre client installed from the tarball.
>>
>> The Lustre client gets unmounted cleanly, but the system deadlocks once the openibd driver is removed. I had to modify the openibd stop script to
>> include "umount lustre" and "lustre_rmmod" as a workaround.
>
> If you put "_netdev" in the Lustre mount options, the shutdown scripts should unmount it before trying to stop the networking.
>
>
>> The following is the error message that I get when I try to reboot the lustre client:
>>
>> Scientific Linux SLF release 5.3 (Lederman)
>> Kernel 2.6.18-194.11.1.el5 on an x86_64
>>
>> INIT:Shutting down smartd: [  OK  ]
>> Stopping atd: [  OK  ]
>> Shutting down process accounting:  [  OK  ]
>> Stopping xinetd: [  OK  ]
>> Stopping autofs:  Stopping automount: [  OK  ]
>> [  OK  ]
>> Stopping acpi daemon: [  OK  ]
>> Shutting down ntpd: [  OK  ]
>> Unmounting network block filesystems:  LustreError: 3697:0:(ldlm_request.c:1025:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
>> LustreError: 3697:0:(ldlm_request.c:1587:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
>> Lustre: client ffff81020f145400 umount complete
>> [  OK  ]
>> Unmounting NFS filesystems:  [  OK  ]
>> Stopping system message bus: [  OK  ]
>> Stopping RPC idmapd: [  OK  ]
>> Stopping NFS locking: [  OK  ]
>> Stopping NFS statd: [  OK  ]
>> Stopping portmap: [  OK  ]
>> Stopping PC/SC smart card daemon (pcscd): [  OK  ]
>> Shutting down kernel logger: [  OK  ]
>> Shutting down system logger: [  OK  ]
>> Unloading OpenIB kernel modules:NET: Unregistered protocol family 27
>>
>> Failed to unload rdma_cm
>>
>> Failed to unload ib_cm
>>
>> Failed to unload iw_cm
>> LustreError: 131-3: Received notification of device removal
>> Please shutdown LNET to allow this to proceed
>> INFO: task rmmod:4151 blocked for more than 120 seconds.
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> rmmod         D ffff810227061420     0  4151   3795                     (NOTLB)
>>   ffff81021c8ddce8 0000000000000082 000000000000000f 0000000000000292
>>   00000000000000ef 0000000000000001 ffff81020ecdd100 ffff8102271ef040
>>   0000004a957c4bd9 000000000095dc57 ffff81020ecdd2e8 0000000480076646
>> Call Trace:
>>   [<ffffffff80063167>] wait_for_completion+0x79/0xa2
>>   [<ffffffff8008cfa1>] default_wake_function+0x0/0xe
>>   [<ffffffff80063b05>] mutex_lock+0xd/0x1d
>>   [<ffffffff8838d155>] :rdma_cm:cma_remove_one+0x171/0x1a2
>>   [<ffffffff80076525>] do_flush_tlb_all+0x0/0x6a
>>   [<ffffffff8817d5f0>] :ib_core:ib_unregister_device+0x30/0xdb
>>   [<ffffffff881a918a>] :ib_mthca:__mthca_remove_one+0x30/0x11a
>>   [<ffffffff80063b05>] mutex_lock+0xd/0x1d
>>   [<ffffffff881a928c>] :ib_mthca:mthca_remove_one+0x18/0x25
>>   [<ffffffff8015daeb>] pci_device_remove+0x24/0x3a
>>   [<ffffffff801c7a3e>] __device_release_driver+0x9f/0xe9
>>   [<ffffffff801c7e04>] driver_detach+0xad/0x101
>>   [<ffffffff801c6ffe>] bus_remove_driver+0x6f/0x92
>>   [<ffffffff801c7e8b>] driver_unregister+0xd/0x16
>>   [<ffffffff8015ddb4>] pci_unregister_driver+0x2a/0x79
>>   [<ffffffff881bc398>] :ib_mthca:mthca_cleanup+0x10/0x16
>>   [<ffffffff800a6674>] sys_delete_module+0x196/0x1c5
>>   [<ffffffff8005d116>] system_call+0x7e/0x83
>>
>>
>> Nirmal
>> _______________________________________________
>> Lustre-discuss mailing list
>> Lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>
>
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Technical Lead
> Oracle Corporation Canada Inc.
>


