[Lustre-discuss] lustre + samba question

Nikolay Kvetsinski nkvecinski at gmail.com
Wed Sep 4 07:17:51 PDT 2013


I don't want to start a new thread, but something fishy is going on. The
client I wrote to you about exports Lustre via Samba for some Windows users.
The client has a 10 Gigabit Ethernet NIC. Usually it works fine, but sometimes
this happens:

Sep  4 16:07:22 cache kernel: LustreError: Skipped 1 previous similar
message
Sep  4 16:07:22 cache kernel: LustreError: 167-0:
lustre0-MDT0000-mdc-ffff88062edb0c00: This client was evicted by
lustre0-MDT0000; in progress operations using this service will fail.
Sep  4 16:07:22 cache kernel: LustreError:
7411:0:(mdc_locks.c:840:mdc_enqueue()) ldlm_cli_enqueue: -5
Sep  4 16:07:22 cache kernel: LustreError:
7411:0:(file.c:159:ll_close_inode_openhandle()) inode 144115380745406215
mdc close failed: rc = -108
Sep  4 16:07:22 cache smbd[7580]: [2013/09/04 16:07:22.451276,  0]
smbd/process.c:2440(keepalive_fn)
Sep  4 16:07:22 cache kernel: LustreError:
7411:0:(file.c:159:ll_close_inode_openhandle()) inode 144115307009623849
mdc close failed: rc = -108
Sep  4 16:07:22 cache kernel: LustreError:
7411:0:(file.c:159:ll_close_inode_openhandle()) Skipped 1 previous similar
message
Sep  4 16:07:22 cache kernel: LustreError:
6553:0:(mdc_locks.c:840:mdc_enqueue()) ldlm_cli_enqueue: -108
Sep  4 16:07:22 cache kernel: LustreError:
6553:0:(mdc_locks.c:840:mdc_enqueue()) Skipped 21 previous similar messages
Sep  4 16:07:22 cache kernel: LustreError:
6517:0:(vvp_io.c:1228:vvp_io_init()) lustre0: refresh file layout
[0x200002b8a:0x119dd:0x0] error -108.
Sep  4 16:07:22 cache kernel: LustreError:
6517:0:(vvp_io.c:1228:vvp_io_init()) lustre0: refresh file layout
[0x200002b8a:0x119dd:0x0] error -108.
Sep  4 16:07:22 cache kernel: LustreError:
6519:0:(dir.c:389:ll_get_dir_page()) lock enqueue: [0x200000007:0x1:0x0] at
0: rc -108
Sep  4 16:07:22 cache kernel: LustreError: 6519:0:(dir.c:595:ll_dir_read())
error reading dir [0x200000007:0x1:0x0] at 0: rc -108
Sep  4 16:07:22 cache kernel: LustreError:
6553:0:(dir.c:389:ll_get_dir_page()) lock enqueue: [0x200000007:0x1:0x0] at
0: rc -108
Sep  4 16:07:22 cache kernel: LustreError: 6553:0:(dir.c:595:ll_dir_read())
error reading dir [0x200000007:0x1:0x0] at 0: rc -108
Sep  4 16:07:22 cache smbd[6517]: [2013/09/04 16:07:22.469729,  0]
smbd/dfree.c:137(sys_disk_free)
Sep  4 16:07:22 cache smbd[6517]:   disk_free: sys_fsusage() failed. Error
was : Cannot send after transport endpoint shutdown
Sep  4 16:07:22 cache kernel: LustreError:
6517:0:(lmv_obd.c:1289:lmv_statfs()) can't stat MDS #0
(lustre0-MDT0000-mdc-ffff88062edb0c00), error -108
Sep  4 16:07:22 cache kernel: LustreError:
6517:0:(llite_lib.c:1610:ll_statfs_internal()) md_statfs fails: rc = -108
Sep  4 16:07:22 cache smbd[6517]: [2013/09/04 16:07:22.470531,  0]
smbd/dfree.c:137(sys_disk_free)
Sep  4 16:07:22 cache smbd[6517]:   disk_free: sys_fsusage() failed. Error
was : Cannot send after transport endpoint shutdown
Sep  4 16:07:22 cache smbd[6517]: [2013/09/04 16:07:22.470922,  0]
smbd/dfree.c:137(sys_disk_free)
Sep  4 16:07:22 cache smbd[6517]:   disk_free: sys_fsusage() failed. Error
was : Cannot send after transport endpoint shutdown
Sep  4 16:07:22 cache kernel: LustreError:
6517:0:(lmv_obd.c:1289:lmv_statfs()) can't stat MDS #0
(lustre0-MDT0000-mdc-ffff88062edb0c00), error -108
Sep  4 16:07:22 cache kernel: LustreError:
6553:0:(statahead.c:1397:is_first_dirent()) error reading dir
[0x200000007:0x1:0x0] at 0: [rc -108] [parent 6553]
Sep  4 16:07:22 cache kernel: LustreError:
6553:0:(statahead.c:1397:is_first_dirent()) error reading dir
[0x200000007:0x1:0x0] at 0: [rc -108] [parent 6553]
Sep  4 16:07:22 cache smbd[8958]: [2013/09/04 16:07:22.517112,  0]
smbd/dfree.c:137(sys_disk_free)
Sep  4 16:07:22 cache smbd[8958]:   disk_free: sys_fsusage() failed. Error
was : Cannot send after transport endpoint shutdown
Sep  4 16:07:22 cache smbd[8958]: [2013/09/04 16:07:22.517496,  0]
smbd/dfree.c:137(sys_disk_free)

Eventually it connects again to the MDS:

Sep  4 16:07:31 cache kernel: LustreError:
6582:0:(ldlm_resource.c:811:ldlm_resource_complain()) Resource:
ffff880a125f7e40 (8589945618/117308/0/0) (rc: 1)
Sep  4 16:07:31 cache kernel: LustreError:
6582:0:(ldlm_resource.c:1423:ldlm_resource_dump()) --- Resource:
ffff880a125f7e40 (8589945618/117308/0/0) (rc: 2)
Sep  4 16:07:31 cache kernel: Lustre: lustre0-MDT0000-mdc-ffff88062edb0c00:
Connection restored to lustre0-MDT0000 (at 192.168.11.23 at tcp)

But now the load average goes through the roof: 25 or even 50 on a 16-CPU
machine. And shortly after that:

Sep  4 16:08:14 cache kernel: INFO: task ptlrpcd_6:2425 blocked for more
than 120 seconds.
Sep  4 16:08:14 cache kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep  4 16:08:14 cache kernel: ptlrpcd_6     D 000000000000000c     0  2425
     2 0x00000080
Sep  4 16:08:14 cache kernel: ffff880c34ab7a10 0000000000000046
0000000000000000 ffffffffa0a01736
Sep  4 16:08:14 cache kernel: ffff880c34ab79d0 ffffffffa09fc199
ffff88062e274000 ffff880c34cf8000
Sep  4 16:08:14 cache kernel: ffff880c34aae638 ffff880c34ab7fd8
000000000000fb88 ffff880c34aae638
Sep  4 16:08:14 cache kernel: Call Trace:
Sep  4 16:08:14 cache kernel: [<ffffffffa0a01736>] ?
ksocknal_queue_tx_locked+0x136/0x530 [ksocklnd]
Sep  4 16:08:14 cache kernel: [<ffffffffa09fc199>] ?
ksocknal_find_conn_locked+0x159/0x290 [ksocklnd]
Sep  4 16:08:14 cache kernel: [<ffffffff8150f1ee>]
__mutex_lock_slowpath+0x13e/0x180
Sep  4 16:08:14 cache kernel: [<ffffffff8150f08b>] mutex_lock+0x2b/0x50
Sep  4 16:08:14 cache kernel: [<ffffffffa0751ecf>]
cl_lock_mutex_get+0x6f/0xd0 [obdclass]
Sep  4 16:08:14 cache kernel: [<ffffffffa0b5b469>]
lovsub_parent_lock+0x49/0x120 [lov]
Sep  4 16:08:14 cache kernel: [<ffffffffa0b5c60f>]
lovsub_lock_modify+0x7f/0x1e0 [lov]
Sep  4 16:08:14 cache kernel: [<ffffffffa07514d8>]
cl_lock_modify+0x98/0x310 [obdclass]
Sep  4 16:08:14 cache kernel: [<ffffffffa0748dae>] ?
cl_object_attr_unlock+0xe/0x20 [obdclass]
Sep  4 16:08:14 cache kernel: [<ffffffffa0ac1e52>] ?
osc_lock_lvb_update+0x1a2/0x470 [osc]
Sep  4 16:08:14 cache kernel: [<ffffffffa0ac2302>]
osc_lock_granted+0x1e2/0x2b0 [osc]
Sep  4 16:08:14 cache kernel: [<ffffffffa0ac30b0>]
osc_lock_upcall+0x3f0/0x5e0 [osc]
Sep  4 16:08:14 cache kernel: [<ffffffffa0ac2cc0>] ?
osc_lock_upcall+0x0/0x5e0 [osc]
Sep  4 16:08:14 cache kernel: [<ffffffffa0aa3876>]
osc_enqueue_fini+0x106/0x240 [osc]
Sep  4 16:08:14 cache kernel: [<ffffffffa0aa82c2>]
osc_enqueue_interpret+0xe2/0x1e0 [osc]
Sep  4 16:08:14 cache kernel: [<ffffffffa0884d2c>]
ptlrpc_check_set+0x2ac/0x1b20 [ptlrpc]
Sep  4 16:08:14 cache kernel: [<ffffffffa08b1c7b>]
ptlrpcd_check+0x53b/0x560 [ptlrpc]
Sep  4 16:08:14 cache kernel: [<ffffffffa08b21a3>] ptlrpcd+0x233/0x390
[ptlrpc]
Sep  4 16:08:14 cache kernel: [<ffffffff81063310>] ?
default_wake_function+0x0/0x20
Sep  4 16:08:14 cache kernel: [<ffffffffa08b1f70>] ? ptlrpcd+0x0/0x390
[ptlrpc]
Sep  4 16:08:14 cache kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Sep  4 16:08:14 cache kernel: [<ffffffffa08b1f70>] ? ptlrpcd+0x0/0x390
[ptlrpc]
Sep  4 16:08:14 cache kernel: [<ffffffffa08b1f70>] ? ptlrpcd+0x0/0x390
[ptlrpc]
Sep  4 16:08:14 cache kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

Sometimes it's the smbd daemon that spits out a similar call trace. I'm using
Lustre 2.4.0-2.6.32_358.6.2.el6.x86_64_gd3f91c4.x86_64. This is the only
client I use for exporting Lustre via Samba; no other client is having
errors or issues like that. Sometimes I can see that this particular client
disconnects from some of the OSTs too. I'll test the NIC to see if there
are any hardware problems, but besides that, does anyone have any clues or
hints they want to share with me?
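
In case it is useful, these are roughly the Lustre-side checks I'm planning to
run on this client while I test the NIC. They are standard lctl/lfs commands,
but the exact parameter names can differ between Lustre versions, and the
lru_size idea at the end is only a guess on my part, not a confirmed fix:

# Import state of the MDC and OSCs on this client, looking for repeated
# DISCONN/EVICTED transitions in the state history:
lctl get_param mdc.*.state osc.*.state

# Quick connectivity check from this client against all servers:
lfs check servers

# Eviction messages that made it into the kernel log:
dmesg | grep -i evict

# How many LDLM locks this client is caching; a Samba-exporting client can
# hold a lot of locks, so I may experiment with a smaller lru_size here
# (again, just a guess, not a recommendation from anyone):
lctl get_param ldlm.namespaces.*.lock_count ldlm.namespaces.*.lru_size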


Cheers,



On Tue, Jun 25, 2013 at 8:55 AM, Nikolay Kvetsinski <nkvecinski at gmail.com> wrote:

> Thank you all for your quick responses. Unfortunately my own stupidity
> got the better of me this time ... the MDS server was not a domain member.
> After joining it to the domain, everything works OK.
>
> Cheers,
> :(
>
>
> On Mon, Jun 24, 2013 at 8:48 PM, Michael Watters <wattersmt at gmail.com> wrote:
>
>> Does your UID in Windows match the UID on the Samba server?  Does the
>> Samba account exist?  I ran into similar issues with NFS.
>>
>>
>> On Mon, Jun 24, 2013 at 5:59 AM, Nikolay Kvetsinski <nkvecinski at gmail.com> wrote:
>>
>>> Hello guys,
>>>
>>> I'm using the latest feature release
>>> (lustre-2.4.0-2.6.32_358.6.2.el6_lustre.g230b174.x86_64_gd3f91c4.x86_64.rpm)
>>> on CentOS 6.4. Lustre itself is working fine, but when I export it via
>>> Samba and try to connect with a Windows 7 client I get:
>>>
>>> Jun 24 12:53:14 R-82L kernel: LustreError:
>>> 2326:0:(mdc_locks.c:840:mdc_enqueue()) ldlm_cli_enqueue: -13
>>> Jun 24 12:53:16 R-82L kernel: LustreError:
>>> 2326:0:(mdc_locks.c:840:mdc_enqueue()) ldlm_cli_enqueue: -13
>>> Jun 24 12:53:16 R-82L kernel: LustreError:
>>> 2326:0:(mdc_locks.c:840:mdc_enqueue()) Skipped 7 previous similar messages
>>> Jun 24 12:53:16 R-82L kernel: LustreError:
>>> 2326:0:(mdc_locks.c:840:mdc_enqueue()) ldlm_cli_enqueue: -13
>>> Jun 24 12:53:16 R-82L kernel: LustreError:
>>> 2326:0:(mdc_locks.c:840:mdc_enqueue()) Skipped 7 previous similar messages
>>>
>>> And on the Windows client I get a "You don't have permission to access
>>> ....." error. Permissions are 777. I created a local share just to test
>>> the Samba server, and it's working. The error pops up only when trying to
>>> access the Samba share backed by Lustre.
>>>
>>> Any help will be greatly appreciated.
>>>
>>> Cheers,
>>>