[lustre-discuss] Filesystem hanging....
Stephane Thiell
sthiell at stanford.edu
Sat Aug 13 19:09:32 PDT 2016
Hi Phill,
I understand that you’re running master on your clients (tag v2_8_56 was created 4 days ago) and 2.1 on the servers? Running master in production is already a challenge. Also, Lustre has never been good at cross-version compatibility. For example, it is possible to make 2.1 servers work with 2.5 clients and 2.5 servers work with 2.7 clients, though additional patches may be needed.
I would say try to reduce the gap: upgrade your servers and/or try an official Lustre release on your clients.
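To put a number on the gap, the warning the client prints at mount time already carries both versions. Below is a minimal sketch, assuming the kernel log line has the same shape as the one in the journal excerpt quoted further down; other Lustre versions may word it differently. The helper names `parse_versions` and `minor_gap` are made up for illustration and are not part of any Lustre tool.

```python
import re

# Shape of the mount-time warning, taken from the journal excerpt in this
# thread. This is an assumption: other releases may phrase it differently.
WARNING_RE = re.compile(
    r"Server \S+ version \((?P<server>[\d.]+)\) is much older than client\. "
    r"Consider upgrading server \((?P<client>[\w.]+)\)"
)


def _major_minor(text):
    """Turn '2.8.56_26_g6fad3ab' or '2.1.0.0' into a (major, minor) tuple."""
    parts = text.split(".")
    return int(parts[0]), int(parts[1])


def parse_versions(line):
    """Return ((server_major, server_minor), (client_major, client_minor)),
    or None if the line is not the version warning."""
    match = WARNING_RE.search(line)
    if match is None:
        return None
    return _major_minor(match.group("server")), _major_minor(match.group("client"))


def minor_gap(server, client):
    """Difference in minor version numbers, or None if the majors differ."""
    if server[0] != client[0]:
        return None
    return client[1] - server[1]


if __name__ == "__main__":
    line = (
        "Aug 12 12:57:10 test-r710 kernel: Lustre: Server MGS version (2.1.0.0) "
        "is much older than client. Consider upgrading server (2.8.56_26_g6fad3ab)"
    )
    server, client = parse_versions(line)
    print(server, client, minor_gap(server, client))  # (2, 1) (2, 8) 7
```

By the same measure, the pairs mentioned above (2.1 servers with 2.5 clients, 2.5 servers with 2.7 clients) are gaps of 4 and 2, against 7 here.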
All the best,
Stephane
> On Aug 12, 2016, at 5:37 AM, Phill Harvey-Smith <p.harvey-smith at warwick.ac.uk> wrote:
>
> On 11/08/2016 16:10, Colin Faber wrote:
>>> First glance indicates you're having network connectivity problems,
>>> (possibly driver issue with your NIC?)
>
> I don't seem to have had any problems with any other services running on the cluster, and there are no messages in the journal or the /var/log files relating to network errors.
>
> Oddly though, when the /home filesystem hangs, the /storage and /scratch filesystems, also served by the same Lustre servers, continue to respond without problems.
>
> What does seem to have some bearing on it is that the first few writes seem to succeed and then it hangs. Though it was first noticed through Samba, it also appears to happen when logged in to the console directly.
>
>>> (Check MTU settings, etc?)
>
> Pasting as a quotation, as it stops Thunderbird from wrapping the text.....
>
>> root@test-r710:~# ifconfig
>> eno1      Link encap:Ethernet  HWaddr 00:26:b9:84:c7:8d
>>           inet addr:192.168.1.80  Bcast:192.168.1.255  Mask:255.255.255.0
>>           inet6 addr: fe80::226:b9ff:fe84:c78d/64 Scope:Link
>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>           RX packets:8516 errors:0 dropped:0 overruns:0 frame:0
>>           TX packets:23199 errors:0 dropped:0 overruns:0 carrier:0
>>           collisions:0 txqueuelen:1000
>>           RX bytes:5297958 (5.2 MB)  TX bytes:3222616 (3.2 MB)
>>
>> eno2      Link encap:Ethernet  HWaddr 00:26:b9:84:c7:8f
>>           inet addr:192.168.0.80  Bcast:192.168.0.255  Mask:255.255.255.0
>>           inet6 addr: fe80::226:b9ff:fe84:c78f/64 Scope:Link
>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>           RX packets:1374513 errors:0 dropped:0 overruns:0 frame:0
>>           TX packets:168485 errors:0 dropped:0 overruns:0 carrier:0
>>           collisions:0 txqueuelen:1000
>>           RX bytes:2026863011 (2.0 GB)  TX bytes:21861558 (21.8 MB)
>>
>> eno4      Link encap:Ethernet  HWaddr 00:26:b9:84:c7:93
>>           inet addr:137.205.232.159  Bcast:137.205.232.255  Mask:255.255.255.128
>>           inet6 addr: fe80::226:b9ff:fe84:c793/64 Scope:Link
>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>           RX packets:11483 errors:0 dropped:0 overruns:0 frame:0
>>           TX packets:10560 errors:0 dropped:0 overruns:0 carrier:0
>>           collisions:0 txqueuelen:1000
>>           RX bytes:3504764 (3.5 MB)  TX bytes:5731764 (5.7 MB)
>
>
>> root@test-r710:~# route -n
>> Kernel IP routing table
>> Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
>> 0.0.0.0         137.205.232.254 0.0.0.0         UG    0      0        0 eno4
>> 137.205.232.128 0.0.0.0         255.255.255.128 U     0      0        0 eno4
>> 192.168.0.0     0.0.0.0         255.255.255.0   U     0      0        0 eno2
>> 192.168.1.0     0.0.0.0         255.255.255.0   U     0      0        0 eno1
>
> Lustre mounts in fstab:
>> # Lustre mounted
>> 192.168.0.4@tcp0:/storage /storage lustre defaults,_netdev,flock 0 0
>> 192.168.0.4@tcp0:/home /home lustre defaults,_netdev,flock 0 0
>> 192.168.0.4@tcp0:/scratch /scratch lustre defaults,_netdev,flock 0 0
>
> I've also tried compiling the latest source and installing those modules (Lustre: Build Version: 2.8.56_26_g6fad3ab). This does not seem to have the problem with MATLAB (mentioned about a month or so ago), but it still has the hanging problem.
>
> The Lustre startup logs in the journal are here:
>> Aug 12 12:57:10 test-r710 kernel: Lustre: Lustre: Build Version: 2.8.56_26_g6fad3ab
>> Aug 12 12:57:10 test-r710 kernel: Lustre: Server MGS version (2.1.0.0) is much older than client. Consider upgrading server (2.8.56_26_g6fad3ab)
>> Aug 12 12:57:10 test-r710 kernel: Lustre: Trying to mount a client with IR setting not compatible with current mgc. Force to use current mgc setting that is IR disabled.
>> Aug 12 12:57:10 test-r710 kernel: Lustre: Mounted home-client
>
>
> Cheers.
>
> Phill.
>
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org