[lustre-discuss] Help troubleshooting a scalability issue
esr+lustre at mail.hebrew.edu
Sun Jan 22 11:02:59 PST 2017
On Fri, Jan 20, 2017 at 4:20 PM, Massimo Sgaravatto <
massimo.sgaravatto at pd.infn.it> wrote:
> I have a Lustre cluster composed of 1 MDS and 2 OSS servers.
> Clients are both physical machines (~ 25 boxes) and virtual machines
> (instantiated on an OpenStack cluster). These virtual machines are
> dynamically created and destroyed as needed (we have machinery that
> provides such automatic elasticity). They access the Lustre cluster through
> a NAT.
Did you check if you are running out of available ports to maintain open
connections etc.? What about the 'switching' capacity of the virtual
switch/router? The throughput on the interface? RAM/CPU usage of the
These are not really on the Lustre side of things, but they could be
causing trouble and can be ruled out fairly easily.
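
One quick way to look at the connection question from the server side is to
count established TCP connections per remote address. Below is a rough
sketch, not a definitive tool. It assumes LNet is listening on TCP port 988
(the usual default, but check your own configuration), that the server is a
little-endian Linux host, and that the input is in /proc/net/tcp format. The
names parse_addr and count_peers are made up for this example.

```python
import socket
import struct
from collections import Counter

LUSTRE_PORT = 988  # assumption: default LNet TCP port; adjust if yours differs


def parse_addr(hex_addr):
    """Decode an address such as '0A16400A:03DC' from /proc/net/tcp.

    Assumes a little-endian host, where the IPv4 address is printed
    as the hex form of a little-endian 32-bit integer.
    """
    ip_hex, port_hex = hex_addr.split(":")
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    return ip, int(port_hex, 16)


def count_peers(proc_net_tcp_text, port=LUSTRE_PORT):
    """Count ESTABLISHED connections to local `port`, grouped by remote IP."""
    peers = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) < 4:
            continue
        _, local_port = parse_addr(fields[1])
        remote_ip, _ = parse_addr(fields[2])
        # State "01" is TCP_ESTABLISHED in /proc/net/tcp.
        if fields[3] == "01" and local_port == port:
            peers[remote_ip] += 1
    return peers
```

On an OSS or MDS you could call count_peers(open("/proc/net/tcp").read()).
If all the VMs sit behind the NAT, they should show up under the NAT's
address, so watching that one count as VMs are added may show whether it
stops growing around the point where mounts begin to fail.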
> We start having problems when the number of virtual machines reaches a
> certain value (about 130 - 140).
> At that point we can no longer mount Lustre on new clients, and access
> to the Lustre file system from existing clients becomes very slow.
> In the OSS and MDS syslogs I see a lot of errors, such as:
> Request sent has timed out for slow reply
> bulk GET failed
> Request sent has failed due to network error
> lock blocking callback time out
> I saved a copy of these syslogs (just related to Lustre, and just for a
> time slot when the problem happened).
> In this example 10.64.22.248 is a new VM that is not able to mount the
> lustre filesystem.
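
To see which of those errors dominate, and whether they cluster around one
client, it may help to tally them from the saved syslogs. A minimal sketch
follows. The message fragments are copied from the report above; the
function name tally and the idea of filtering on a client address are my
own assumptions, and the real log lines may word things differently.

```python
import re
from collections import Counter

# Message fragments as quoted in the report; real log text may differ.
PATTERNS = [
    "Request sent has timed out for slow reply",
    "bulk GET failed",
    "Request sent has failed due to network error",
    "lock blocking callback time out",
]


def tally(lines, client=None):
    """Count how many lines contain each fragment.

    If `client` is given (e.g. "10.64.22.248"), only lines that mention
    that exact address are counted.
    """
    client_re = None
    if client is not None:
        # Avoid matching the address inside a longer one (e.g. 110.64.22.248).
        client_re = re.compile(r"(?<![\d.])" + re.escape(client) + r"(?!\d)")
    counts = Counter()
    for line in lines:
        if client_re is not None and not client_re.search(line):
            continue
        for pattern in PATTERNS:
            if pattern in line:
                counts[pattern] += 1
    return counts
```

For example, tally(open(path), client="10.64.22.248") on each server's log,
where path is wherever you saved the copy, would show which messages
mention the VM that cannot mount.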
> The network is not saturated when the problem happens, and the Lustre
> servers don't appear heavily loaded.
> I would appreciate any hints that could help in troubleshooting this issue.
> Thanks, Massimo