[lustre-discuss] new client - failover mds: no connection

Brett Lee brettlee.lustre at gmail.com
Tue Oct 24 10:24:35 PDT 2017


Hi Thomas, nice to see you are still active in the Lustre community.

To your question: I don't have an answer, but raising the timeout may only be
masking the root cause, perhaps a system or network issue. I always start by
checking hostname resolution.  :)
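
A minimal sketch of that first round of checks, assuming a bash shell on the
new client. The helper names (mgs_nids, nid_addr, check_mgs_nids) are made up
for illustration; `getent hosts` and `lctl ping` are the only real tools
called, and the mount source is the one from the original mail. It only shows
whether each MGS NID resolves and answers an LNet ping; it does not explain
why the client never moved on to the second NID.

```shell
# Hypothetical helpers; source this file, then call check_mgs_nids.

# Print one MGS NID per line from a mount source such as
# "10.20.1.198@o2ib5:10.20.1.199@o2ib5:/test".
mgs_nids() {
    local spec=${1%:/*}            # drop the trailing ":/<fsname>"
    printf '%s\n' "$spec" | tr ':' '\n'
}

# Print the address part of a NID (everything before the '@').
nid_addr() {
    printf '%s\n' "${1%%@*}"
}

# For each MGS NID: try name resolution, then an LNet ping.
check_mgs_nids() {
    local nid addr
    for nid in $(mgs_nids "$1"); do
        addr=$(nid_addr "$nid")
        # For a bare IP this is a reverse lookup and may legitimately fail.
        if getent hosts "$addr" >/dev/null 2>&1; then
            echo "$nid: $addr resolves"
        else
            echo "$nid: $addr has no hosts/DNS entry"
        fi
        if ! command -v lctl >/dev/null 2>&1; then
            echo "$nid: lctl not found, skipping LNet ping"
        elif lctl ping "$nid" >/dev/null 2>&1; then
            echo "$nid: LNet ping ok"
        else
            echo "$nid: LNet ping FAILED"
        fi
    done
}
```

Usage would be something like
`check_mgs_nids 10.20.1.198@o2ib5:10.20.1.199@o2ib5:/test`, run as root on
the client with the Lustre modules loaded.
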
On Oct 24, 2017 11:08 AM, "Thomas Roth" <t.roth at gsi.de> wrote:

> Sorry to have bothered you - works now.
>
> I set /sys/fs/lustre/timeout to 3000, quite brutally, to make things go
> very slowly, and after 25 minutes the mount was there.
>
> Which control, i.e. which timeout parameter, _should_ I have tuned instead
> in such a situation?
>
> Regards,
> Thomas
>
> On 10/24/2017 06:26 PM, Thomas Roth wrote:
>
>> Hi all,
>>
>> in a Lustre 2.10, CentOS 7.4 test system, I have a pair of MDS, format
>> command was
>>
>>  > mkfs.lustre --mgs --mdt --fsname=test --index=0 \
>>      --servicenode=10.20.1.198@o2ib5 --servicenode=10.20.1.199@o2ib5 \
>>      --mgsnode=10.20.1.198@o2ib5 --mgsnode=10.20.1.199@o2ib5 \
>>      /dev/drbd0
>>
>> I added some OSS and clients, everything working.
>>
>> Then I switched off 10.20.1.198 and mounted my MGS/MDT on 10.20.1.199.
>> All OSS and clients connected, everything working.
>>
>> Now I try to add a client that was never there before,
>>  > mount -t lustre 10.20.1.198@o2ib5:10.20.1.199@o2ib5:/test /lustre/test
>>
>> But this client only tries to connect to 10.20.1.198@o2ib5 - and fails.
>> The log says
>>
>> LNet: 47655:0:(o2iblnd_cb.c:2672:kiblnd_check_reconnect()) 10.20.1.198@o2ib5: reconnect (invalid service id), 12, 12, msg_size: 4096, queue_depth: 8/-1, max_frags: 256/-1
>> LNet: 47655:0:(o2iblnd_cb.c:2698:kiblnd_rejected()) 10.20.1.198@o2ib5 rejected: no listener at 987
>> ...
>> LustreError: 48560:0:(mgc_request.c:251:do_config_log_add()) MGC10.20.1.198@o2ib5: failed processing log, type 1: rc = -5
>> LNet: 48427:0:(o2iblnd_cb.c:3207:kiblnd_check_conns()) Timed out tx for 10.20.1.198@o2ib5: 4301501 seconds
>> Lustre: 48441:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1508861258/real 1508861264] req@ffff88103dc78000 x1582155623825424/t0(0) o250->MGC10.20.1.198@o2ib5@10.20.1.198@o2ib5:26/25 lens 520/544 e 0 to 1 dl 1508861408 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
>>
>>
>> all of which seems logical but not wanted - where is my 10.20.1.199@o2ib5?
>>
>> Of course I can 'lctl ping 10.20.1.199@o2ib5'.
>> And I have since unmounted on one of the older clients, unloaded the
>> Lustre modules, and mounted again - works.
>>
>>
>> Regards,
>> Thomas
>>
>>