[Lustre-discuss] mpi-io support
Kumaran Rajaram
krajaram at sgi.com
Thu May 8 13:43:54 PDT 2008
Marty,
If my understanding is right, when multiple clients issue non-collective
I/O and their data buffers are vectors of small non-overlapping file
regions, ROMIO uses its data-sieving algorithm instead of performing 'n'
seeks + reads/writes. For a data-sieving write, the extent of the request
is first read into a big buffer, the individual write vectors are
memcpy'd into that buffer, and then a single BIG write is performed.
Before performing the data-sieving write, ROMIO locks the portion of the
file corresponding to the data-sieving buffer size, does the seek +
write, and then unlocks the file range. This ensures file integrity.
ROMIO relies on ADIO file-system-specific locking (in this case Lustre),
so if the underlying file system does not support fcntl() locks, you see
errors when the extents of non-collective writes from multiple clients
overlap.
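In plain POSIX terms, the read-modify-write under a byte-range lock looks roughly like the sketch below. The function and struct names are mine, not ROMIO's actual identifiers, and the fixed-size buffer stands in for the data-sieving buffer:

```c
/* Sketch of a ROMIO-style data-sieving write under a whole-extent
 * fcntl() byte-range lock.  Illustrative only. */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* One small, non-overlapping write vector. */
struct sieve_vec {
    off_t       offset;  /* file offset of this small region */
    size_t      len;     /* bytes to write there             */
    const char *buf;     /* source data                      */
};

/* Write 'n' small vectors covering [lo, hi) with one read-modify-write. */
int sieved_write(int fd, const struct sieve_vec *v, int n,
                 off_t lo, off_t hi)
{
    size_t extent = (size_t)(hi - lo);
    char big[8192];                     /* the "data-sieving buffer" */
    int rc = -1;
    if (extent > sizeof big)
        return -1;

    /* Lock the byte range so a concurrent read-modify-write from
     * another client cannot interleave with ours. */
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                        .l_start = lo, .l_len = (off_t)extent };
    if (fcntl(fd, F_SETLKW, &fl) == -1)
        return -1;

    /* 1. Read the whole extent into the big buffer (may be short at EOF). */
    if (pread(fd, big, extent, lo) >= 0) {
        /* 2. memcpy each small write vector into place. */
        for (int i = 0; i < n; i++)
            memcpy(big + (v[i].offset - lo), v[i].buf, v[i].len);
        /* 3. One single BIG write back. */
        if (pwrite(fd, big, extent, lo) == (ssize_t)extent)
            rc = 0;
    }

    /* Unlock the range. */
    fl.l_type = F_UNLCK;
    fcntl(fd, F_SETLK, &fl);
    return rc;
}
```

Without the lock, two clients sieving overlapping extents would each read a stale extent and overwrite the other's bytes on the write-back.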
The easy solution would be to replace the non-collective MPI-IO calls
with collective MPI-IO calls. The two-phase collective I/O algorithm
ensures file integrity without relying on file locking, since each
process writes to a big non-overlapping region during the second phase.
Or, if you have to use non-collective I/O, maybe implement the ad_lustre
fcntl exclusive lock using:

i)
fcntl(EXCL_LOCK) --> open(lock_file, O_CREAT | O_EXCL) + close
fcntl(UNLOCK)    --> unlink(lock_file)

ii)
fcntl(EXCL_LOCK) --> MPI_Win_lock()
fcntl(UNLOCK)    --> MPI_Win_unlock()

Of course, for option ii) you need to create a one-sided shared buffer
in rank 0 when the file is opened with MPI_File_open, and destroy the
buffer during MPI_File_close().
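Option i) can be sketched with plain POSIX calls. O_EXCL makes the create atomic, so no fcntl() support is needed from the file system; the function names and the back-off interval below are my own choices, not part of any ad_lustre code:

```c
/* Sketch: emulate an exclusive lock with an O_CREAT|O_EXCL lock file. */
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* fcntl(EXCL_LOCK) equivalent: block until we own the lock file.
 * O_EXCL guarantees only one process at a time can create it. */
int excl_lock(const char *lock_file)
{
    for (;;) {
        int fd = open(lock_file, O_CREAT | O_EXCL | O_WRONLY, 0600);
        if (fd >= 0) {
            close(fd);          /* lock acquired */
            return 0;
        }
        if (errno != EEXIST)
            return -1;          /* real error, not contention */
        usleep(10000);          /* lock held elsewhere: back off, retry */
    }
}

/* fcntl(UNLOCK) equivalent: remove the lock file. */
int excl_unlock(const char *lock_file)
{
    return unlink(lock_file);
}
```

One caveat with this scheme: if a client dies while holding the lock, the stale lock file has to be cleaned up out of band.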
HTH,
-Kums
________________________________
From: lustre-discuss-bounces at lists.lustre.org
[mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of Marty
Barnaby
Sent: Thursday, May 08, 2008 12:35 PM
Cc: lustre-discuss at clusterfs.com
Subject: Re: [Lustre-discuss] mpi-io support
To return to this discussion: in recent testing, I have found that
writing to a Lustre FS via a higher-level library, like PNetCDF, fails
because the default value for romio_ds_write is not disable. This is
set in the MPICH code in the file /src/mpi/romio/adio/common/ad_hints.c.
I believe it has something to do with locking issues. I'm not sure how
best to handle this; I'd prefer that the data-sieving default be
disable, though I don't know all the implications there. Maybe
ad_lustre_open would be a place where the _ds_ hints are set to disable.
Marty Barnaby
Weikuan Yu wrote:
    Andreas Dilger wrote:
        On Mar 11, 2008 16:10 -0600, Marty Barnaby wrote:
            I'm not actually sure what ROMIO abstract device the
            multiple CFS deployments I utilize were defined with.
            Probably just UFS, or maybe NFS. Did you have a recommended
            option yourself?
        The UFS driver is the one used for Lustre if no other one
        exists.
            Besides the fact that most of the ADIO drivers that were
            created over the years are completely obsolete and could be
            cleaned from ROMIO, what will the new one for Lustre offer?
            Particularly with respect to controls via the lfs utility
            that I can already get?
        There is improved collective I/O that aligns the I/O on Lustre
        stripe boundaries. Also, the hints given to the MPI-IO layer
        (before open, not after) result in Lustre picking a better
        stripe count/size.
    In addition, the one integrated into MPICH2-1.0.7 contains direct
    I/O support. Lockless I/O support was purged due to my lack of
    confidence in low-level file-system support, but it can be revived
    when possible.
--
Weikuan Yu <+> 1-865-574-7990
http://ft.ornl.gov/~wyu/