[lustre-discuss] Help troubleshooting a scalability issue

Fri Jan 20 06:20:37 PST 2017

Hi

I have a Lustre cluster composed of 1 MDS and 2 OSS servers.
Clients are both physical machines (~ 25 boxes) and virtual machines 
(instantiated on an OpenStack cluster). These virtual machines are 
dynamically created and destroyed as needed (we have a mechanism that 
provides this automatic elasticity). They access the Lustre cluster 
through a NAT.

The problems start when the number of virtual machines reaches a 
certain value (about 130 - 140).
At that point we can no longer mount Lustre on new clients, and access 
to the Lustre file system becomes very slow.

In the OSS and MDS syslogs I see a lot of errors, such as:

Request sent has timed out for slow reply
bulk GET failed
Request sent has failed due to network error
lock blocking callback time out

I saved a copy of these syslogs (Lustre-related messages only, and only 
for a time slot when the problem happened) at:

https://dl.dropboxusercontent.com/u/7639059/LustreLog/lustre-mds.txt
https://dl.dropboxusercontent.com/u/7639059/LustreLog/lustre-oss-01.txt
https://dl.dropboxusercontent.com/u/7639059/LustreLog/lustre-oss-03.txt

In this example, 10.64.22.248 is a new VM that is not able to mount the 
Lustre file system.
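
For anyone who wants to go through the logs quickly, here is a minimal 
sketch (Python) of how they could be summarized: it counts the lines 
containing each of the four error messages quoted above and collects the 
lines that mention a given address. The function name summarize and the 
idea of passing a locally downloaded copy of a log (for example 
lustre-mds.txt) on the command line are assumptions made for illustration; 
the matching is plain substring search, so it makes no assumption about 
the exact syslog line format.

```python
import sys
from collections import Counter

# Error messages quoted in this post; matched as plain substrings.
PATTERNS = [
    "Request sent has timed out for slow reply",
    "bulk GET failed",
    "Request sent has failed due to network error",
    "lock blocking callback time out",
]


def summarize(lines, address=None):
    """Count lines containing each pattern.

    If address is given, also collect the lines that mention it.
    Note: this is a substring match, so "10.64.22.248" would also
    match a longer address that merely contains it.
    """
    counts = Counter({pattern: 0 for pattern in PATTERNS})
    mentions = []
    for line in lines:
        for pattern in PATTERNS:
            if pattern in line:
                counts[pattern] += 1
        if address is not None and address in line:
            mentions.append(line.rstrip("\n"))
    return counts, mentions


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python summarize.py lustre-mds.txt 10.64.22.248
    address = sys.argv[2] if len(sys.argv) > 2 else None
    with open(sys.argv[1], errors="replace") as log:
        counts, mentions = summarize(log, address)
    for pattern, n in counts.items():
        print(f"{n:8d}  {pattern}")
    if address is not None:
        print(f"{len(mentions)} lines mention {address}")
```

Comparing the counts between the MDS and OSS logs, and looking at which 
messages appear next to the address of the failing VM, might help narrow 
down where the timeouts start.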

The network is not saturated when the problem happens, and the Lustre 
servers don't appear heavily loaded.

I would appreciate any hints that could help in troubleshooting this issue.

Thanks, Massimo
