[lustre-discuss] Filesystem hanging....

Stephane Thiell sthiell at stanford.edu
Sat Aug 13 19:09:32 PDT 2016


Hi Phil,

I understand that you’re running master on your clients (tag v2_8_56 was created 4 days ago) and 2.1 on the servers? Running master in production is already a challenge. Also, Lustre has never been good at cross-version compatibility. For example, it is possible to make 2.1 servers work with 2.5 clients, and 2.5 servers work with 2.7 clients, but even then additional patches may be needed.

I would say try to reduce the gap: upgrade your servers and/or try an official Lustre release on your clients…
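
To put a number on the gap, here is a minimal sketch in Python that reads it off the "is much older than client" kernel message quoted further down. The message format is assumed from that single log line, and the function names (parse_version, version_gap) are made up for illustration. It also assumes the second parenthesised value is the client version, which matches the Build Version line in the same log.

```python
import re

# Numeric prefix of a Lustre version string, e.g. "2.1.0.0" or
# "2.8.56_26_g6fad3ab" (the git-describe suffix after "_" is ignored).
VERSION_RE = re.compile(r"(\d+)\.(\d+)(?:\.(\d+))?(?:\.(\d+))?")

# Assumed message format, copied from the single log line in this thread.
MISMATCH_RE = re.compile(
    r"Server (\S+) version \(([^)]+)\) is much older than client\. "
    r"Consider upgrading server \(([^)]+)\)"
)


def parse_version(text):
    """Return (major, minor, patch, fix) as ints, missing fields as 0."""
    m = VERSION_RE.match(text)
    if m is None:
        raise ValueError("not a Lustre version string: %r" % text)
    return tuple(int(g) if g is not None else 0 for g in m.groups())


def version_gap(log_line):
    """Return (target, server_version, client_version, minor_gap) or None."""
    m = MISMATCH_RE.search(log_line)
    if m is None:
        return None
    target, server_raw, client_raw = m.groups()
    server = parse_version(server_raw)
    client = parse_version(client_raw)
    # Only compare minor numbers when the major numbers agree.
    minor_gap = client[1] - server[1] if client[0] == server[0] else None
    return target, server, client, minor_gap


if __name__ == "__main__":
    line = ("Aug 12 12:57:10 test-r710 kernel: Lustre: Server MGS version "
            "(2.1.0.0) is much older than client. Consider upgrading server "
            "(2.8.56_26_g6fad3ab)")
    print(version_gap(line))
```

On the line from the journal this reports a minor-version gap of 7, compared with 4 and 2 for the 2.1/2.5 and 2.5/2.7 combinations mentioned above.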

All the best,
Stephane


> On Aug 12, 2016, at 5:37 AM, Phill Harvey-Smith <p.harvey-smith at warwick.ac.uk> wrote:
> 
> On 11/08/2016 16:10, Colin Faber wrote:
>>> First glance indicates you're having network connectivity problems,
>>> (possibly driver issue with your NIC?)
> 
> I don't seem to have had any problems with any other services running on the cluster, and there are no messages in the journal or the /var/log files relating to network errors.
> 
> Oddly though, when the /home filesystem hangs, the /storage and /scratch filesystems, also served by the same Lustre servers, continue to respond without problems.
> 
> What does seem to have some bearing on it is that the first few writes seem to succeed and then it will hang. Though it was first noticed through Samba, it also appears to happen when logged in to the console directly.
> 
>>> (Check MTU settings, etc?)
> 
> Pasting as quotation as it stops thunderbird from wrapping the text.....
> 
>> root@test-r710:~# ifconfig
>> eno1      Link encap:Ethernet  HWaddr 00:26:b9:84:c7:8d
>>          inet addr:192.168.1.80  Bcast:192.168.1.255  Mask:255.255.255.0
>>          inet6 addr: fe80::226:b9ff:fe84:c78d/64 Scope:Link
>>          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>          RX packets:8516 errors:0 dropped:0 overruns:0 frame:0
>>          TX packets:23199 errors:0 dropped:0 overruns:0 carrier:0
>>          collisions:0 txqueuelen:1000
>>          RX bytes:5297958 (5.2 MB)  TX bytes:3222616 (3.2 MB)
>> 
>> eno2      Link encap:Ethernet  HWaddr 00:26:b9:84:c7:8f
>>          inet addr:192.168.0.80  Bcast:192.168.0.255  Mask:255.255.255.0
>>          inet6 addr: fe80::226:b9ff:fe84:c78f/64 Scope:Link
>>          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>          RX packets:1374513 errors:0 dropped:0 overruns:0 frame:0
>>          TX packets:168485 errors:0 dropped:0 overruns:0 carrier:0
>>          collisions:0 txqueuelen:1000
>>          RX bytes:2026863011 (2.0 GB)  TX bytes:21861558 (21.8 MB)
>> 
>> eno4      Link encap:Ethernet  HWaddr 00:26:b9:84:c7:93
>>          inet addr:137.205.232.159  Bcast:137.205.232.255  Mask:255.255.255.128
>>          inet6 addr: fe80::226:b9ff:fe84:c793/64 Scope:Link
>>          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>          RX packets:11483 errors:0 dropped:0 overruns:0 frame:0
>>          TX packets:10560 errors:0 dropped:0 overruns:0 carrier:0
>>          collisions:0 txqueuelen:1000
>>          RX bytes:3504764 (3.5 MB)  TX bytes:5731764 (5.7 MB)
> 
> 
>> root@test-r710:~# route -n
>> Kernel IP routing table
>> Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
>> 0.0.0.0         137.205.232.254 0.0.0.0         UG    0      0        0 eno4
>> 137.205.232.128 0.0.0.0         255.255.255.128 U     0      0        0 eno4
>> 192.168.0.0     0.0.0.0         255.255.255.0   U     0      0        0 eno2
>> 192.168.1.0     0.0.0.0         255.255.255.0   U     0      0        0 eno1
> 
> Lustre mounts in fstab :
>> # Lustre mounted
>> 192.168.0.4@tcp0:/storage       /storage        lustre  defaults,_netdev,flock 0 0
>> 192.168.0.4@tcp0:/home          /home           lustre  defaults,_netdev,flock 0 0
>> 192.168.0.4@tcp0:/scratch       /scratch        lustre  defaults,_netdev,flock 0 0
> 
> I've also tried compiling the latest source and installing those modules (Lustre: Build Version: 2.8.56_26_g6fad3ab). This does not seem to have the problem with MATLAB (mentioned about a month or so ago), but still has the hanging problem.
> 
> The Lustre startup logs in the journal are here :
>> Aug 12 12:57:10 test-r710 kernel: Lustre: Lustre: Build Version: 2.8.56_26_g6fad3ab
>> Aug 12 12:57:10 test-r710 kernel: Lustre: Server MGS version (2.1.0.0) is much older than client. Consider upgrading server (2.8.56_26_g6fad3ab)
>> Aug 12 12:57:10 test-r710 kernel: Lustre: Trying to mount a client with IR setting not compatible with current mgc. Force to use current mgc setting that is IR disabled.
>> Aug 12 12:57:10 test-r710 kernel: Lustre: Mounted home-client
> 
> 
> Cheers.
> 
> Phill.
> 
> 
> 
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


