[lustre-discuss] Lustre 2.12.0 and locking problems

Amir Shehata amir.shehata.whamcloud at gmail.com
Tue Mar 5 14:15:21 PST 2019


Take a look at this: https://jira.whamcloud.com/browse/LU-11840
Let me know if this is the same issue you're seeing.
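
If it is, one possible workaround (just a sketch, not a confirmed fix; the ticket has the authoritative guidance, and the persistence path below is an assumption based on the stock lnet service) is to turn off dynamic peer discovery on the 2.12 nodes so they stop probing every interface on a peer:

lnetctl set discovery 0            # runtime toggle, run on each 2.12 node
lnetctl export > /etc/lnet.conf    # optionally persist the running config (assumes lnet.service imports /etc/lnet.conf)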

On Tue, 5 Mar 2019 at 14:04, Amir Shehata <amir.shehata.whamcloud at gmail.com>
wrote:

> Hi Riccardo,
>
> It's not LNet Health. It's Dynamic Discovery. What's happening is that
> 2.12 is discovering all the interfaces on the peer. That's why you see all
> the interfaces in the peer show.
>
> Multi-Rail doesn't enable o2ib. It just sees it. If the node doing the
> discovery has only tcp, then it should never try to connect over o2ib.
>
> Are you able to do a "lnetctl ping 172.21.48.250 at tcp" from the MDS
> multiple times? Do you see the ping failing intermittently?
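>
> For example, a quick loop like this (just a sketch; 172.21.48.250@tcp is the peer NID from your output) would show whether the ping fails intermittently:
>
> for i in $(seq 1 20); do
>     lnetctl ping 172.21.48.250@tcp || echo "ping $i failed"   # report each failed attempt
>     sleep 1
> done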
>
> What should happen is that when the MDS (running 2.12) tries to talk to
> the peer you have identified, it'll discover its interfaces, but it should
> then realize that it can only reach the peer on the tcp network, since
> that's the only network configured on the MDS.
>
> It might help if you configure only LNet on the MDS and the peer
> and run a simple:
> lctl set_param debug=+"net neterror"
> lnetctl ping <>
> lctl dk >log
>
> If you can share the debug output, it'll help to pinpoint the problem.
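>
> In full, the capture might look something like this (a sketch; adjust the NID and log path to your setup):
>
> lctl set_param debug=+"net neterror"   # add net/neterror messages to the debug mask
> lctl clear                             # empty the kernel debug buffer first
> lnetctl ping 172.21.48.250@tcp         # reproduce the traffic of interest
> lctl dk > /tmp/lnet-debug.log          # dump the debug buffer to a file you can share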
>
> thanks
> amir
>
> On Tue, 5 Mar 2019 at 12:30, Riccardo Veraldi <
> Riccardo.Veraldi at cnaf.infn.it> wrote:
>
>> I think I figured out the problem.
>> My problem is related to the LNet Network Health feature:
>> https://jira.whamcloud.com/browse/LU-9120
>> The Lustre MDS and the Lustre client, both running 2.12.0, negotiate a
>> Multi-Rail peer connection, while this does not happen with the other
>> clients (2.10.5). So what happens is that both IB and tcp are being used
>> during transfers.
>> tcp is used only to connect to the MDS and IB only to connect to the OSSes;
>> in any case, Multi-Rail is enabled by default between the MDS, OSS and client.
>> This messes up the situation. The MDS has only one TCP interface and
>> cannot communicate over IB, yet in "lnetctl peer show" an @o2ib NID
>> shows up when it should not. At that point the MDS tries to connect to
>> the client over IB, which will never work because there is no IB on the
>> MDS.
>>
>> MDS LNet configuration:
>>
>> net:
>>      - net type: lo
>>        local NI(s):
>>          - nid: 0 at lo
>>            status: up
>>      - net type: tcp
>>        local NI(s):
>>          - nid: 172.21.49.233 at tcp
>>            status: up
>>            interfaces:
>>                0: eth0
>>
>> but if I look at "lnetctl peer show" I see:
>>
>>     - primary nid: 172.21.52.88 at o2ib
>>        Multi-Rail: True
>>        peer ni:
>>          - nid: 172.21.48.250 at tcp
>>            state: NA
>>          - nid: 172.21.52.88 at o2ib
>>            state: NA
>>          - nid: 172.21.48.250 at tcp1
>>            state: NA
>>          - nid: 172.21.48.250 at tcp2
>>            state: NA
>>
>> There should be no o2ib NID there, but Multi-Rail for some reason adds it.
>> I do not have problems with the other (non-2.12.0) clients.
>>
>> How can I disable Multi-Rail on 2.12.0?
>>
>> thank you
>>
>>
>>
>> On 3/5/19 12:14 PM, Patrick Farrell wrote:
>> > Riccardo,
>> >
>> > Since 2.12 is still a relatively new maintenance release, it would be
>> > helpful if you could open an LU and provide more detail there - such as
>> > what clients were doing, whether you were using any new features (like
>> > DoM or FLR), and the full dmesg from the clients and servers involved in
>> > these evictions.
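>> >
>> > For example (just a sketch), on each involved node something like:
>> >
>> >     dmesg -T > /tmp/$(hostname)-dmesg.txt   # kernel log with readable timestamps, to attach to the LU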
>> >
>> > - Patrick
>> >
>> > On 3/5/19, 11:50 AM, "lustre-discuss on behalf of Riccardo Veraldi" <lustre-discuss-bounces at lists.lustre.org on behalf of Riccardo.Veraldi at cnaf.infn.it> wrote:
>> >
>> >      Hello,
>> >
>> >      I have quite a big issue on my Lustre 2.12.0 MDS/MDT.
>> >
>> >      Clients moving data to the OSS run into a locking problem I have
>> >      never seen before.
>> >
>> >      The clients are mostly 2.10.5 except for one which is 2.12.0, but
>> >      regardless of the client version the problem is still there.
>> >
>> >      So these are the errors I see on the MDS/MDT. When this happens
>> >      everything just hangs. If I reboot the MDS everything goes back to
>> >      normal, but it has already happened twice in 3 days and it is
>> >      disruptive.
>> >
>> >      Any hints?
>> >
>> >      Is it feasible to downgrade from 2.12.0 to 2.10.6?
>> >
>> >      thanks
>> >
>> >      Mar  5 11:10:33 psmdsana1501 kernel: Lustre: 7898:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1551813033/real 1551813033] req at ffff9fdcbecd0300 x1626845000210688/t0(0) o104->ana15-MDT0000 at 172.21.52.87@o2ib:15/16 lens 296/224 e 0 to 1 dl 1551813044 ref 1 fl Rpc:eX/0/ffffffff rc 0/-1
>> >      Mar  5 11:10:33 psmdsana1501 kernel: Lustre: 7898:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 50552576 previous similar messages
>> >      Mar  5 11:13:03 psmdsana1501 kernel: LustreError: 7898:0:(ldlm_lockd.c:682:ldlm_handle_ast_error()) ### client (nid 172.21.52.87 at o2ib) failed to reply to blocking AST (req at ffff9fdcbecd0300 x1626845000210688 status 0 rc -110), evict it ns: mdt-ana15-MDT0000_UUID lock: ffff9fde9b6873c0/0x9824623d2148ef38 lrc: 4/0,0 mode: PR/PR res: [0x2000013a9:0x1d347:0x0].0x0 bits 0x13/0x0 rrc: 5 type: IBT flags: 0x60200400000020 nid: 172.21.52.87 at o2ib remote: 0xd8efecd6e7621e63 expref: 8 pid: 7898 timeout: 333081 lvb_type: 0
>> >      Mar  5 11:13:03 psmdsana1501 kernel: LustreError: 138-a: ana15-MDT0000: A client on nid 172.21.52.87 at o2ib was evicted due to a lock blocking callback time out: rc -110
>> >      Mar  5 11:13:03 psmdsana1501 kernel: LustreError: 5321:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 150s: evicting client at 172.21.52.87 at o2ib ns: mdt-ana15-MDT0000_UUID lock: ffff9fde9b6873c0/0x9824623d2148ef38 lrc: 3/0,0 mode: PR/PR res: [0x2000013a9:0x1d347:0x0].0x0 bits 0x13/0x0 rrc: 5 type: IBT flags: 0x60200400000020 nid: 172.21.52.87 at o2ib remote: 0xd8efecd6e7621e63 expref: 9 pid: 7898 timeout: 0 lvb_type: 0
>> >      Mar  5 11:13:04 psmdsana1501 kernel: Lustre: ana15-MDT0000: Connection restored to 59c5a826-f4e9-0dd0-8d4f-08c204f25941 (at 172.21.52.87 at o2ib)
>> >      Mar  5 11:15:34 psmdsana1501 kernel: LustreError: 7898:0:(ldlm_lockd.c:682:ldlm_handle_ast_error()) ### client (nid 172.21.52.142 at o2ib) failed to reply to blocking AST (req at ffff9fde2d393600 x1626845000213776 status 0 rc -110), evict it ns: mdt-ana15-MDT0000_UUID lock: ffff9fde9b6858c0/0x9824623d2148efee lrc: 4/0,0 mode: PR/PR res: [0x2000013ac:0x1:0x0].0x0 bits 0x13/0x0 rrc: 3 type: IBT flags: 0x60200400000020 nid: 172.21.52.142 at o2ib remote: 0xbb35541ea6663082 expref: 9 pid: 7898 timeout: 333232 lvb_type: 0
>> >      Mar  5 11:15:34 psmdsana1501 kernel: LustreError: 138-a: ana15-MDT0000: A client on nid 172.21.52.142 at o2ib was evicted due to a lock blocking callback time out: rc -110
>> >      Mar  5 11:15:34 psmdsana1501 kernel: LustreError: 5321:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 151s: evicting client at 172.21.52.142 at o2ib ns: mdt-ana15-MDT0000_UUID lock: ffff9fde9b6858c0/0x9824623d2148efee lrc: 3/0,0 mode: PR/PR res: [0x2000013ac:0x1:0x0].0x0 bits 0x13/0x0 rrc: 3 type: IBT flags: 0x60200400000020 nid: 172.21.52.142 at o2ib remote: 0xbb35541ea6663082 expref: 10 pid: 7898 timeout: 0 lvb_type: 0
>> >      Mar  5 11:15:34 psmdsana1501 kernel: Lustre: ana15-MDT0000: Connection restored to 9d49a115-646b-c006-fd85-000a4b90019a (at 172.21.52.142 at o2ib)
>> >      Mar  5 11:20:33 psmdsana1501 kernel: Lustre: 7898:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1551813633/real 1551813633] req at ffff9fdcc2a95100 x1626845000222624/t0(0) o104->ana15-MDT0000 at 172.21.52.87@o2ib:15/16 lens 296/224 e 0 to 1 dl 1551813644 ref 1 fl Rpc:eX/2/ffffffff rc 0/-1
>> >      Mar  5 11:20:33 psmdsana1501 kernel: Lustre: 7898:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 23570550 previous similar messages
>> >      Mar  5 11:22:46 psmdsana1501 kernel: LustreError: 7898:0:(ldlm_lockd.c:682:ldlm_handle_ast_error()) ### client (nid 172.21.52.87 at o2ib) failed to reply to blocking AST (req at ffff9fdcc2a95100 x1626845000222624 status 0 rc -110), evict it ns: mdt-ana15-MDT0000_UUID lock: ffff9fde86ffdf80/0x9824623d2148f23a lrc: 4/0,0 mode: PR/PR res: [0x2000013ae:0x1:0x0].0x0 bits 0x13/0x0 rrc: 3 type: IBT flags: 0x60200400000020 nid: 172.21.52.87 at o2ib remote: 0xd8efecd6e7621eb7 expref: 9 pid: 7898 timeout: 333665 lvb_type: 0
>> >      Mar  5 11:22:46 psmdsana1501 kernel: LustreError: 138-a: ana15-MDT0000: A client on nid 172.21.52.87 at o2ib was evicted due to a lock blocking callback time out: rc -110
>> >      Mar  5 11:22:46 psmdsana1501 kernel: LustreError: 5321:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 150s: evicting client at 172.21.52.87 at o2ib ns: mdt-ana15-MDT0000_UUID lock: ffff9fde86ffdf80/0x9824623d2148f23a lrc: 3/0,0 mode: PR/PR res: [0x2000013ae:0x1:0x0].0x0 bits 0x13/0x0 rrc: 3 type: IBT flags: 0x60200400000020 nid: 172.21.52.87 at o2ib remote: 0xd8efecd6e7621eb7 expref: 10 pid: 7898 timeout: 0 lvb_type: 0
>> >
>> >