[Lustre-discuss] Large scale delete results in lag on clients

Fri Aug 7 05:25:33 PDT 2009

On Fri, Aug 7, 2009 at 6:45 AM, Arden Wiebe <albert682 at yahoo.com> wrote:
> --- On Thu, 8/6/09, Andreas Dilger <adilger at sun.com> wrote:
> > Jim McCusker wrote:

> > > We have a 15 TB Lustre volume across 4 OSTs and we recently deleted over 4
> > > million files from it in order to free up the 80 GB MDT/MDS (going from 100%
> > > capacity on it to 81%). As a result, after the rm completed, there is
> > > significant lag on most file system operations (but fast access once it
> > > occurs) even after the two servers that host the targets were rebooted. It
> > > seems to clear up for a little while after reboot, but comes back after some
> > > time.
> > >
> > > Any ideas?
> >
> > The Lustre unlink processing is somewhat asynchronous, so you may still be
> > catching up with unlinks.  You can check this by looking at the OSS service
> > RPC stats file to see if there are still object destroys being processed
> > by the OSTs.  You could also just check the system load/io on the OSTs to
> > see how busy they are in a "no load" situation.
> >
> >
> > > For the curious, we host a large image archive (almost 400k images) and do
> > > research on processing them. We had a lot of intermediate files that we
> > > needed to clean up:
> > >
> > >  http://krauthammerlab.med.yale.edu/imagefinder (currently laggy and
> > > unresponsive due to this problem)
> > >
>
> Jim, from the web side it seems responsive.  Are you actually serving the images from the Lustre cluster?  I have run a few searches for "Purified HIV Electron Microscope" and your project returns 15 pages of results with great links to full abstracts almost instantly, but obviously none with real purified HIV electron microscope images similar to a real pathogenic virus like http://krauthammerlab.med.yale.edu/imagefinder/Figure.external?sp=62982&state:Figure=BrO0ABXcRAAAAAQAACmRvY3VtZW50SWRzcgARamF2YS5sYW5nLkludGVnZXIS4qCk94GHOAIAAUkABXZhbHVleHIAEGphdmEubGFuZy5OdW1iZXKGrJUdC5TgiwIAAHhwAAD2Cg%3D%3D

The images and the Lucene index are both served from the Lustre
cluster (as is just about everything else on our network). I think
Andreas is right, it seems to have cleared itself up. You're seeing
typical performance. If you don't find what you're looking for, you
can expand your search to the full text, abstract, or title using the
checkboxes below the search box. Of course, the lack of images in
search has more to do with the availability of open access papers on
the topic than the performance of lustre. :-)
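
For anyone who wants to confirm Andreas's point about destroys still
being processed, here is a rough sketch that compares two snapshots of
an OSS stats file and reports how many object destroys happened in
between. The stats path and the counter name "ost_destroy" are
assumptions about a 1.6/1.8-era /proc layout and should be checked
against your own servers; the sample text is invented for illustration
and is not output from our system.

```python
import time

# ASSUMPTION: typical location of the OSS service RPC stats; verify locally.
STATS_PATH = "/proc/fs/lustre/ost/OSS/ost/stats"

# Invented sample data in the assumed "<name> <count> samples [unit] ..." layout.
SAMPLE_BEFORE = """snapshot_time 1249647933.100000 secs.usecs
req_waittime 90210 samples [usec] 3 51234 8123456
ost_destroy 5234 samples [usec] 210 90412 1123456
"""
SAMPLE_AFTER = """snapshot_time 1249647943.100000 secs.usecs
req_waittime 90410 samples [usec] 3 51234 8223456
ost_destroy 5290 samples [usec] 210 90412 1183456
"""


def parse_stats(text):
    """Return {counter_name: sample_count} for every counter line."""
    counts = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep only lines shaped like "<name> <count> samples ...".
        if len(parts) >= 3 and parts[2] == "samples" and parts[1].isdigit():
            counts[parts[0]] = int(parts[1])
    return counts


def destroys_between(before, after, counter="ost_destroy"):
    """Number of destroy RPCs counted between two stats snapshots."""
    return parse_stats(after).get(counter, 0) - parse_stats(before).get(counter, 0)


def watch(path=STATS_PATH, interval=10):
    """Read the stats file twice, `interval` seconds apart, on a live OSS."""
    with open(path) as f:
        before = f.read()
    time.sleep(interval)
    with open(path) as f:
        after = f.read()
    return destroys_between(before, after)
```

A non-zero result from watch() on an otherwise idle OSS would suggest
the OSTs are still working through the unlink backlog; a result of zero
over several intervals would suggest they have caught up.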

> Have you physically separated your MDS/MDT from the MGS portion on different servers?  I somehow doubt you overlooked this, but if they are not separated for some reason, that could be a cause of unresponsiveness on the client side.  Again, if you're serving up the images from the cluster, I find it works great.

This server started life as a 1.4.x server, so the MGS is still on the
same partition as MDS/MDT. We have one server with the MGS, MDS/MDT,
and two OSTs, and another server with two more OSTs. The first server
also provides NFS and SMB services for the volume in question. I know
that we're not supposed to mount the volume on a server that provides
it, but limited budget means limited servers, and performance has been
excellent except for this one problem.

Jim
--
Jim McCusker
Programmer Analyst
Krauthammer Lab, Pathology Informatics
Yale School of Medicine
james.mccusker at yale.edu | (203) 785-6330
http://krauthammerlab.med.yale.edu