<div>I've had a chance to take a longer look at this and I think I was wrong about the BKL.  I still don't see where it would be getting released, but the real problem appears to be that all OBDs on a server are using the same MGC.</div>


<div> </div>

<div>In server_start_targets, server_mgc_set_fs acquires the cl_mgc_sem, holds it through lustre_process_log, and releases it with server_mgc_clear_fs after that.  As a result, all of our mounts that are started at the same time end up waiting on the cl_mgc_sem semaphore, so each OBD has to process its llog one at a time.  When you have OSTs near capacity, as in bug 18456, the first write when processing the llog can take minutes to complete.</div>
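<div>To make the effect concrete, here is a rough userspace sketch of the pattern.  It is not Lustre code: every name starting with fake_ is made up, a pthread mutex stands in for cl_mgc_sem, and a short sleep stands in for the slow first llog write.  It starts several "mounts" at once and records how many were ever in the log-processing step at the same time.</div>

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <time.h>

/* Hypothetical stand-in for the single MGC shared by every target. */
struct fake_mgc {
    pthread_mutex_t sem;        /* plays the role of cl_mgc_sem */
    pthread_mutex_t stat_lock;  /* protects the two counters below */
    int in_log;                 /* threads currently "processing the llog" */
    int max_in_log;             /* high-water mark of in_log */
};

/* Stand-in for lustre_process_log: record concurrency, then be slow. */
static void fake_process_log(struct fake_mgc *mgc)
{
    struct timespec ts = { .tv_sec = 0, .tv_nsec = 20 * 1000 * 1000 };

    pthread_mutex_lock(&mgc->stat_lock);
    mgc->in_log++;
    if (mgc->in_log > mgc->max_in_log)
        mgc->max_in_log = mgc->in_log;
    pthread_mutex_unlock(&mgc->stat_lock);

    nanosleep(&ts, NULL);       /* simulated slow first llog write */

    pthread_mutex_lock(&mgc->stat_lock);
    mgc->in_log--;
    pthread_mutex_unlock(&mgc->stat_lock);
}

/* One simulated mount: lock, process the log, unlock. */
static void *fake_start_target(void *arg)
{
    struct fake_mgc *mgc = arg;

    pthread_mutex_lock(&mgc->sem);    /* like server_mgc_set_fs */
    fake_process_log(mgc);            /* like lustre_process_log */
    pthread_mutex_unlock(&mgc->sem);  /* like server_mgc_clear_fs */
    return NULL;
}

/* Start ntargets mounts together; return the peak number that were
 * inside fake_process_log at the same moment. */
int max_concurrent_log_processing(int ntargets)
{
    struct fake_mgc mgc;
    pthread_t tids[64];
    int i;

    assert(ntargets > 0 && ntargets <= 64);
    mgc.in_log = 0;
    mgc.max_in_log = 0;
    pthread_mutex_init(&mgc.sem, NULL);
    pthread_mutex_init(&mgc.stat_lock, NULL);

    for (i = 0; i < ntargets; i++)
        pthread_create(&tids[i], NULL, fake_start_target, &mgc);
    for (i = 0; i < ntargets; i++)
        pthread_join(tids[i], NULL);

    pthread_mutex_destroy(&mgc.sem);
    pthread_mutex_destroy(&mgc.stat_lock);
    return mgc.max_in_log;
}
```

<div>Because every thread takes the same lock around the slow step, the peak never goes above one no matter how many mounts are started, so the total time is roughly the sum of the individual llog processing times.</div>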


<div> </div>

<div>I don't see any easy way to fix this because they are all using the same sb->lsi->lsi_mgc.  I was thinking maybe some of these code paths could just modify a copy of that data instead of the actual structure itself, but so many functions are called along the way that it's hard to tell whether anything else depends on the shared structure.</div>
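<div>Very roughly, the copy idea would look something like the sketch below.  Again this is not Lustre code: the fake_ structures and their fields are invented for illustration, and it assumes the slow step only needs the fields that were copied, which is exactly the part I haven't been able to verify.</div>

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

/* Hypothetical subset of the per-filesystem state kept on the shared MGC. */
struct fake_mgc_state {
    char fsname[16];
    int  target_index;
};

struct fake_shared_mgc {
    pthread_mutex_t sem;            /* plays the role of cl_mgc_sem */
    struct fake_mgc_state state;
};

/* Hold the lock only long enough to copy the shared state. */
void fake_mgc_snapshot(struct fake_shared_mgc *mgc,
                       struct fake_mgc_state *copy)
{
    pthread_mutex_lock(&mgc->sem);
    *copy = mgc->state;
    pthread_mutex_unlock(&mgc->sem);
}

/* The slow per-target work modifies only its private copy, lock not held. */
int fake_process_log_on_copy(struct fake_mgc_state *copy, int index)
{
    copy->target_index = index;
    return copy->target_index;
}

/* Returns 1 if the shared state is unchanged after two targets
 * have each worked on their own copy. */
int fake_demo_copy_leaves_shared_untouched(void)
{
    struct fake_shared_mgc mgc;
    struct fake_mgc_state a, b;
    int ok;

    pthread_mutex_init(&mgc.sem, NULL);
    memset(&mgc.state, 0, sizeof(mgc.state));
    strcpy(mgc.state.fsname, "testfs");
    mgc.state.target_index = -1;

    fake_mgc_snapshot(&mgc, &a);
    fake_mgc_snapshot(&mgc, &b);
    assert(fake_process_log_on_copy(&a, 0) == 0);
    assert(fake_process_log_on_copy(&b, 1) == 1);

    ok = mgc.state.target_index == -1 &&
         strcmp(a.fsname, b.fsname) == 0;
    pthread_mutex_destroy(&mgc.sem);
    return ok;
}
```

<div>The lock is then held only for the duration of a struct copy, so the slow work for different targets could overlap.  Whether that is safe depends on nothing further down the call chain reading the shared structure directly.</div>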


<div> </div>

<div>Any ideas for a way to work around this?</div>

<div> </div>

<div>Jeremy <br></div>

<div class="gmail_quote">On Wed, Nov 3, 2010 at 11:57 AM, Ashley Pittman <span dir="ltr">&lt;<a href="mailto:apittman@ddn.com">apittman@ddn.com</a>&gt;</span> wrote:<br>

<blockquote style="BORDER-LEFT: #ccc 1px solid; MARGIN: 0px 0px 0px 0.8ex; PADDING-LEFT: 1ex" class="gmail_quote">

<div>

<div></div>

<div class="h5"><br>On 2 Nov 2010, at 07:40, Andreas Dilger wrote:<br><br>> On 2010-10-28, at 21:07, Jeremy Filizetti wrote:<br>>> I've seen a lot of issues with mounting all of our OSTs on an OSS taking an excessive amount of time.  Most of the individual OST mount time was related to bug 18456, but we still see mount times take minutes per OST with the relevant patches.  At mount time the llog does a small write which ends up scanning nearly our entire 7+ TB OSTs to find the desired block and complete the write.<br>

>><br>>> To reduce startup time mounting multiple OSTs simultaneously would help, but during that process it looks like the code path is still holding the big kernel lock from the mount system call.  During that time all other mount commands are in an uninterruptible sleep (D state).  Based on the discussions from bug 23790 it doesn't appear that Lustre relies on the BKL so would it be reasonable to call unlock_kernel in lustre_fill_super or at least before lustre_start_mgc and lock it again before the return so multiple OSTs could be mounting at the same time?  I think the same thing would apply to unmounting but I haven't looked at the code path there.<br>

><br>> IIRC, the BKL is held at mount time to avoid potential races with mounting the same device multiple times.  However, the risk of this is pretty small, and can be controlled on an OSS, which has limited access.  Also, this code is being removed in newer kernels, as I don't think it is needed by most filesystems.<br>

><br>> I _think_ it should be OK, but YMMV.<br><br></div></div>I've been thinking about this and can't make up my mind whether it's a good idea or not.  We often see mount times in the ten-minute region, so anything we can do to speed them up is a good thing.  I find it hard to believe the core kernel mount code would accept you doing this behind its back, though, and I'd be surprised if it worked.<br>

<br>Then again, as came up when we were discussing this yesterday: is the mount command *really* holding the BKL for the entire duration?  Surely if this lock were being held for minutes we'd notice it in other ways, because other kernel paths that require the lock would block?<br>

<br>Ashley.<br>

</blockquote></div><br>