[Lustre-discuss] Autoconf/Automake problem when running on Lustre 1.8.2 filesystem

Ashley Nicholls spookyinc at gmail.com
Thu Jun 3 10:50:37 PDT 2010


Hello all,

A little background info:
We have a cluster of fourteen servers (each with 8 cores) running builds for
employees at my workplace. We primarily used NFSv3 as a way of providing a
single file-system for all workstations and servers to share files. Each
build is run in a separate directory so cache coherence shouldn't be an
issue. When the cluster was full we noticed lots of jobs failing while
attempting to copy or move results from one pass of the build to the next
(once again, a 'job' is run on a single machine from beginning to end, so
coherency shouldn't be an issue). In most cases the error was along the
lines of 'File doesn't exist', but if you checked for the file manually it
was present with all the correct permissions. Having tried NFSv4 (and some
other file-systems I won't name) and found the same results, we decided to
implement a coherent clustering file-system that would also provide more
scalability.

Setup and problem:
After much trial and error I have managed to set up a small Lustre system
consisting of one MDS and one OSS. All machines involved are CentOS 5-based
and run the 2.6.18-164.11.1.el5_lustre.1.8.2 kernel. This setup appears to
work correctly, but a build now fails in a way that it didn't before.

One of our builds that uses autoconf started failing with the error:
"Can't locate auto/Autom4te/XFile/msg.al in @INC"

While looking at similar problems encountered on OpenBSD with automake on a
lockless NFS system, I found a suggested workaround. Adding the lines
'use Autom4te::Channels qw(msg);' and 'use Automake::Channels qw(msg);' to
the respective XFile.pm makes autoconf more verbose, and it produces:
"autom4te: cannot lock autom4te.cache/requests with mode 2: Function not
implemented"
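
To check whether flock() itself is refused on a given mount, independent of
autoconf, a small probe along these lines might help. This is only a sketch:
the function name probe_flock and the scratch file name are made up for
illustration. "Function not implemented" is the text for ENOSYS, and on Linux
LOCK_EX has the value 2, which matches the "mode 2" in the message.

```python
import errno
import fcntl
import os
import sys


def probe_flock(directory):
    """Try an exclusive flock() on a scratch file in `directory`.

    Returns None if the lock was granted, otherwise the symbolic errno
    name (for example "ENOSYS" for "Function not implemented").
    """
    # Hypothetical scratch file name, created and removed by this probe.
    path = os.path.join(directory, "flock_probe.tmp")
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        try:
            # LOCK_EX is the exclusive lock mode; LOCK_NB avoids blocking.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as exc:
            return errno.errorcode.get(exc.errno, str(exc.errno))
        fcntl.flock(fd, fcntl.LOCK_UN)
        return None
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    result = probe_flock(target)
    status = "ok" if result is None else "failed with " + result
    print(f"{target}: flock {status}")
```

Running it once against a directory on the Lustre mount and once against a
local directory should show whether the lock failure is specific to the
mount.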

Autoconf also no longer fails and goes on to build the configure file. But
now we come to the interesting part: the script runs ./configure directly
after running autoconf, and this always fails the first time with the message:
"./configure: /bin/sh: bad interpreter: Text file busy"

If you manually re-run the script it works correctly! I have tried the flock
and localflock options on the Lustre mount, but this doesn't appear to make
any difference. After googling around I can't find anyone who has had a
similar problem.
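
"Text file busy" is ETXTBSY, which execve() returns on Linux when the file
being executed is still open for writing somewhere. Since a second run
succeeds, one way to gather data (a stopgap, not a fix) would be a wrapper
that retries only on that error and reports how often it happened. The
sketch below is hypothetical: run_with_etxtbsy_retry, its parameters and the
injectable runner are names chosen for illustration, and whether a short
delay is enough on Lustre is an assumption.

```python
import errno
import subprocess
import time


def run_with_etxtbsy_retry(argv, attempts=5, delay=0.2, runner=subprocess.run):
    """Run argv, retrying only when exec fails with ETXTBSY.

    Returns (result, failures), where failures counts the ETXTBSY errors
    seen before the command could finally be started. Any other OSError,
    or ETXTBSY on the last allowed attempt, is re-raised.
    """
    failures = 0
    while True:
        try:
            return runner(argv), failures
        except OSError as exc:
            if exc.errno != errno.ETXTBSY:
                raise
            failures += 1
            if failures >= attempts:
                raise
            # Give whoever still holds the file open for writing time to close it.
            time.sleep(delay)
```

For example, run_with_etxtbsy_retry(["./configure"]) in place of the plain
./configure call would log how many attempts were needed, which might show
whether the file stays "busy" for a fixed time after autoconf exits.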

Has anyone experienced this before, or does anyone have any clues as to how
I can go about debugging this?

Thanks,
---------
Ashley Nicholls
Open all hailing frequencies and broadcast in all known languages. Including
Welsh.
- Arnold J. Rimmer
In my many years I have come to a conclusion that one useless man is a
shame, two is a law firm, and three or more is a congress.
- John Adams