[Lustre-discuss] mpi-io support

Kumaran Rajaram krajaram at sgi.com
Thu May 8 13:43:54 PDT 2008


Marty,

 

If my understanding is right, when multiple clients issue non-collective
I/O and their data buffers describe a vector of small, non-overlapping
file regions, ROMIO uses its data-sieving algorithm instead of
performing 'n' separate seeks + reads/writes. For a data-sieving write,
the full extent of the request is first read into a big buffer, the
individual write vectors are memcpy'd into that buffer, and then a
single BIG write is performed. Before performing the data-sieving write,
ROMIO locks the portion of the file covered by the data-sieving buffer
size, does the seek + write, and then unlocks that file range. This
preserves file integrity. ROMIO relies on the ADIO file-system-specific
locking (in this case Lustre). So if the underlying file system does not
support fcntl() locks, you see errors when the extents of the
non-collective writes from multiple clients overlap.
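
Roughly, a data-sieving write does the following (a conceptual sketch,
not the actual ROMIO code; the struct and variable names here are
illustrative):

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

struct wvec { off_t off; const char *buf; size_t len; };

/* All write vectors in iov[] fall within [ext_off, ext_off + ext_len). */
static void sieve_write(int fd, off_t ext_off, size_t ext_len,
                        const struct wvec *iov, int n)
{
    struct flock lk = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                        .l_start = ext_off, .l_len = (off_t)ext_len };
    char *sieve_buf = malloc(ext_len);

    fcntl(fd, F_SETLKW, &lk);                 /* lock the file range    */
    pread(fd, sieve_buf, ext_len, ext_off);   /* read the whole extent  */

    for (int i = 0; i < n; i++)               /* merge the small writes */
        memcpy(sieve_buf + (iov[i].off - ext_off), iov[i].buf, iov[i].len);

    pwrite(fd, sieve_buf, ext_len, ext_off);  /* one BIG write          */

    lk.l_type = F_UNLCK;                      /* unlock the range       */
    fcntl(fd, F_SETLK, &lk);
    free(sieve_buf);
}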

 

The easy solution would be to replace the non-collective MPI-IO calls
with collective MPI-IO calls. The two-phase collective I/O algorithm
should ensure file integrity and does not rely on file locking, since
each process writes to a big non-overlapping region during the second
phase.
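
For example, switching an independent write to its collective
counterpart is usually a one-line change (sketch only; buf, count, and
offset are whatever your application already uses):

#include <mpi.h>

void write_block(MPI_File fh, MPI_Offset offset, double *buf, int count)
{
    MPI_Status status;

    /* Independent write: ROMIO may data-sieve and needs fcntl locks.  */
    /* MPI_File_write_at(fh, offset, buf, count, MPI_DOUBLE, &status); */

    /* Collective two-phase write: no file locking required.           */
    MPI_File_write_at_all(fh, offset, buf, count, MPI_DOUBLE, &status);
}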

 

Or, if you have to use non-collective I/O, maybe implement an fcntl
exclusive lock in ad_lustre using one of the following:

i) 

fcntl(EXCL_LOCK) --> open(lock_file, O_CREAT | O_EXCL) + close

fcntl(UNLOCK) --> unlink(lock_file)
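
A minimal sketch of that lock-file emulation (the lock-file path and the
retry interval are illustrative):

#include <fcntl.h>
#include <unistd.h>

/* Acquire: spin until we win the O_CREAT | O_EXCL race. */
static void lockfile_acquire(const char *lock_path)
{
    int fd;
    while ((fd = open(lock_path, O_CREAT | O_EXCL | O_WRONLY, 0600)) < 0)
        usleep(1000);  /* another client holds the lock; retry */
    close(fd);
}

/* Release: unlinking the file lets the next client's open() succeed. */
static void lockfile_release(const char *lock_path)
{
    unlink(lock_path);
}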

 

ii) 

fcntl(EXCL_LOCK) --> MPI_Win_lock()

fcntl(UNLOCK)    --> MPI_Win_unlock()

Of course, you need to create a one-sided window (exposing a buffer on
rank 0) when the file is opened with MPI_File_open, and destroy it
during MPI_File_close().
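
A minimal sketch of that window-based lock, assuming the window is
created alongside MPI_File_open and rank 0 exposes a small buffer (note
that how strictly MPI_Win_lock serializes callers when no RMA operation
is issued inside the epoch can vary between MPI implementations):

#include <mpi.h>

static MPI_Win lock_win;

void mpiio_lock_init(MPI_Comm comm)           /* call at MPI_File_open  */
{
    static char dummy;
    int rank;
    MPI_Comm_rank(comm, &rank);
    MPI_Win_create(&dummy, rank == 0 ? 1 : 0, 1,
                   MPI_INFO_NULL, comm, &lock_win);
}

/* Exclusive lock on rank 0's window serializes the critical section. */
void mpiio_lock(void)   { MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, lock_win); }
void mpiio_unlock(void) { MPI_Win_unlock(0, lock_win); }

void mpiio_lock_fini(void)                    /* call at MPI_File_close */
{
    MPI_Win_free(&lock_win);
}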

 

HTH,

-Kums

 

________________________________

From: lustre-discuss-bounces at lists.lustre.org
[mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of Marty
Barnaby
Sent: Thursday, May 08, 2008 12:35 PM
Cc: lustre-discuss at clusterfs.com
Subject: Re: [Lustre-discuss] mpi-io support

 

To return to this discussion: in recent testing, I have found that
writing to a Lustre FS via a higher-level library, like PNetCDF, fails
because the default value for romio_ds_write is not disable. This is set
in the MPICH code in the file /src/mpi/romio/adio/common/ad_hints.c.

I believe it has something to do with locking issues. I'm not sure how
best to handle this; I'd prefer the data-sieving default be disable,
though I don't know all the implications there. Maybe ad_lustre_open
would be the place where the _ds_ hints are set to disable.
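
In the meantime, data sieving can be disabled per open through the
MPI_Info hints ROMIO already understands (romio_ds_write and
romio_ds_read; the file name below is just an example):

#include <mpi.h>

MPI_Info info;
MPI_File fh;

MPI_Info_create(&info);
MPI_Info_set(info, "romio_ds_write", "disable");  /* no data-sieving writes */
MPI_Info_set(info, "romio_ds_read",  "disable");  /* no data-sieving reads  */

MPI_File_open(MPI_COMM_WORLD, "output.nc",
              MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
MPI_Info_free(&info);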

Marty Barnaby


Weikuan Yu wrote:

Andreas Dilger wrote:

	On Mar 11, 2008  16:10 -0600, Marty Barnaby wrote:

		I'm not actually sure what ROMIO abstract device the multiple CFS
		deployments I utilize were defined with. Probably just UFS, or
		maybe NFS. Did you have a recommended option yourself?

	The UFS driver is the one used for Lustre if no other one exists.

		Besides the fact that most of the adio that were created over the
		years are completely obsolete and could be cleaned from ROMIO,
		what will the new one for Lustre offer? Particularly with respect
		to controls via the lfs utility that I can already get?

	There is improved collective IO that aligns the IO on Lustre stripe
	boundaries.  Also the hints given to the MPIIO layer (before open,
	not after) result in Lustre picking a better stripe count/size.

In addition, the one integrated into MPICH2-1.0.7 contains direct I/O
support. Lockless I/O support was purged due to my lack of confidence in
low-level file system support. But it can be revived when possible.

--
Weikuan Yu <+> 1-865-574-7990
http://ft.ornl.gov/~wyu/
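
(For reference, the stripe-related hints mentioned above are passed the
same way, through an MPI_Info object before the file is created; the
hint names are the "striping_factor" and "striping_unit" hints ROMIO
recognizes, and the values below are only illustrative:)

#include <mpi.h>

MPI_Info info;
MPI_File fh;

MPI_Info_create(&info);
MPI_Info_set(info, "striping_factor", "16");      /* stripe over 16 OSTs */
MPI_Info_set(info, "striping_unit",   "1048576"); /* 1 MiB stripe size   */

/* Hints must be supplied at create time to influence the file layout. */
MPI_File_open(MPI_COMM_WORLD, "stripes.dat",
              MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
MPI_Info_free(&info);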
 
  

 
