[lustre-discuss] Help troubleshooting a scalability issue
Massimo Sgaravatto
massimo.sgaravatto at pd.infn.it
Fri Jan 20 06:20:37 PST 2017
Hi
I have a Lustre cluster composed of 1 MDS and 2 OSS servers.
Clients are both physical machines (~25 boxes) and virtual machines
(instantiated on an OpenStack cluster). These virtual machines are
created and destroyed dynamically as needed (we have a mechanism that
provides this elasticity automatically). They access the Lustre cluster
through a NAT.
Problems start when the number of virtual machines reaches a certain
value (about 130 - 140): we can no longer mount Lustre on new clients,
and access to the Lustre file system becomes very slow.
In the OSS and MDS syslogs I see a lot of errors, such as:
Request sent has timed out for slow reply
bulk GET failed
Request sent has failed due to network error
lock blocking callback time out
I saved a copy of these syslogs (only the Lustre-related entries, and
only for a time slot when the problem occurred) at:
https://dl.dropboxusercontent.com/u/7639059/LustreLog/lustre-mds.txt
https://dl.dropboxusercontent.com/u/7639059/LustreLog/lustre-oss-01.txt
https://dl.dropboxusercontent.com/u/7639059/LustreLog/lustre-oss-03.txt
In this example, 10.64.22.248 is a new VM that is unable to mount the
Lustre file system.
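As a rough way to see which of the messages listed above dominate in the
saved syslogs, something like the following Python sketch could be used.
It assumes each log entry is a single line containing the message text
verbatim. The names tally_errors and tally_file and the optional address
filter are purely illustrative and are not part of any Lustre tooling.

```python
from collections import Counter

# The four message fragments quoted above, matched as plain substrings.
PATTERNS = [
    "Request sent has timed out for slow reply",
    "bulk GET failed",
    "Request sent has failed due to network error",
    "lock blocking callback time out",
]


def tally_errors(lines, address=None):
    """Count the lines containing each fragment.

    If `address` is given, only lines that mention it are counted.
    This is a plain substring test (an assumption made for simplicity),
    so "10.64.22.24" would also match "10.64.22.248".
    """
    counts = Counter({pattern: 0 for pattern in PATTERNS})
    for line in lines:
        if address is not None and address not in line:
            continue
        for pattern in PATTERNS:
            if pattern in line:
                counts[pattern] += 1
    return counts


def tally_file(path, address=None):
    """Run tally_errors over a saved syslog file (hypothetical helper)."""
    with open(path, errors="replace") as fh:
        return tally_errors(fh, address)
```

For example, tally_file("lustre-mds.txt", address="10.64.22.248") would
count only the entries that mention the VM that fails to mount, which
can then be compared against the totals for the whole file.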
The network is not saturated when the problem happens, and the Lustre
servers do not appear to be heavily loaded.
I would appreciate any hints that could help in troubleshooting this issue.
Thanks, Massimo