[Lustre-devel] releasing BKL in lustre_fill_super
jeremy.filizetti at gmail.com
Mon Nov 8 14:16:01 PST 2010
I've had a chance to take a longer look at this, and I think I was wrong
about the BKL. I still don't see where it would be getting released, but the
problem appears to be that all OBDs are using the same MGC from a server.
In server_start_targets, server_mgc_set_fs acquires the cl_mgc_sem, holds it
through lustre_process_log, and releases it with server_mgc_clear_fs after
that. As a result, all of our mounts that are started at the same time are
waiting for the cl_mgc_sem semaphore, and each OBD has to process its llog
one at a time. When you have OSTs near capacity, as in bug 18456, the first
write when processing the llog can take minutes to complete.
I don't see any easy way to fix this because they are all using the same
sb->lsi->lsi_mgc. I was thinking maybe some of these structures could just
modify a copy of that data instead of the actual structure itself, but there
are so many functions called that it's hard to see whether anything would be
using it. Any ideas for a way to work around this?
On Wed, Nov 3, 2010 at 11:57 AM, Ashley Pittman <apittman at ddn.com> wrote:
> On 2 Nov 2010, at 07:40, Andreas Dilger wrote:
> > On 2010-10-28, at 21:07, Jeremy Filizetti wrote:
> >> I've seen a lot of issues with mounting all of our OSTs on an OSS taking
> an excessive amount of time. Most of the individual OST mount time was
> related to bug 18456, but we still see mount times of minutes per OST with
> the relevant patches. At mount time the llog does a small write, which ends
> up scanning nearly the entire 7+ TB OST to find the desired block and
> complete the write.
> >> To reduce startup time, mounting multiple OSTs simultaneously would help,
> but during that process it looks like the code path is still holding the big
> kernel lock from the mount system call. During that time all other mount
> commands are in an uninterruptible sleep (D state). Based on the
> discussions from bug 23790, it doesn't appear that Lustre relies on the BKL,
> so would it be reasonable to call unlock_kernel in lustre_fill_super, or at
> least before lustre_start_mgc, and lock it again before the return, so that
> multiple OSTs could be mounting at the same time? I think the same thing
> would apply to unmounting, but I haven't looked at the code path there.
> > IIRC, the BKL is held at mount time to avoid potential races with
> mounting the same device multiple times. However, the risk of this is
> pretty small, and can be controlled on an OSS, which has limited access.
> Also, this code is being removed in newer kernels, as I don't think it is
> needed by most filesystems.
> > I _think_ it should be OK, but YMMV.
> I've been thinking about this and can't make up my mind whether it's a good
> idea or not. We often see mount times in the ten-minute region, so anything
> we can do to speed them up is a good thing. Still, I find it hard to believe
> the core kernel mount code would accept you doing this behind its back, and
> I'd be surprised if it worked.
> Then again - when we were discussing this yesterday, is the mount command
> *really* holding the BKL for the entire duration? Surely if this lock were
> being held for minutes we'd notice it in other ways, because other kernel
> paths that require this lock would block?
> Lustre-devel mailing list
> Lustre-devel at lists.lustre.org
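For reference, the change Jeremy proposes in the quoted message would look
roughly like the following pseudocode-level sketch. It is not actual Lustre
source: the surrounding lustre_fill_super structure is simplified and error
handling is omitted. lock_kernel/unlock_kernel are the BKL API from
linux/smp_lock.h in kernels of that era.

```c
/* Sketch only, not real Lustre code.  The VFS mount path enters
 * ->fill_super with the BKL held on these kernels; dropping it around
 * the slow MGC startup would let several OST mounts run in parallel. */
#include <linux/smp_lock.h>

static int lustre_fill_super(struct super_block *sb, void *data, int silent)
{
        int rc;

        /* ... parse mount options, set up lsi, etc. ... */

        unlock_kernel();           /* drop the BKL before the slow part */
        rc = lustre_start_mgc(sb); /* llog processing may take minutes */
        lock_kernel();             /* retake it before returning to the VFS */

        /* ... start targets, etc. ... */
        return rc;
}
```

As Ashley notes above, whether the generic mount code tolerates the lock
being dropped and retaken behind its back is exactly the open question.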