[lustre-discuss] Lustre striping and MPI

John Bauer bauerj at iodoctors.com
Fri Oct 28 10:08:31 PDT 2016


Andreas

Thanks for the reply.  I should have clarified that my setting 
*stripe_count=rank* was purely for debugging purposes, so I could tell 
which of the 4 ranks actually set the striping when the test case 
failed.  Normally the stripe_count is user selectable and would be the 
same for all ranks.  I was hoping that the first rank to reach the 
open/set-stripe sequence would do what's needed, and the later-arriving 
ranks would simply open the already-striped file.  It doesn't matter 
which rank gets there first, since they would all be requesting the 
same striping.

There are several reasons that llapi_file_open() does not satisfy my 
needs.  Most notably, when my I/O library intercepts (using LD_PRELOAD) 
functions such as the mkstemps() family, and some of the stdio opens, I 
can't necessarily replicate the open that would have occurred.  This 
has been discussed at length already on lustre-discuss.
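
For context, the interception uses the standard dlsym(RTLD_NEXT) 
pattern.  A minimal sketch of an open() interposer (the generic 
pattern, not the actual I/O Doctors library; build it as a shared 
object and point LD_PRELOAD at it):

#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <stdarg.h>
#include <sys/types.h>

int open(const char *path, int flags, ...)
{
    static int (*real_open)(const char *, int, ...);
    if (!real_open)   /* resolve the "next" open, i.e. libc's */
        real_open = (int (*)(const char *, int, ...))
                    dlsym(RTLD_NEXT, "open");

    mode_t mode = 0;
    if (flags & O_CREAT) {   /* the mode argument only exists with O_CREAT */
        va_list ap;
        va_start(ap, flags);
        mode = (mode_t)va_arg(ap, int);
        va_end(ap);
    }
    /* ... striping could be applied here around the real open ... */
    return real_open(path, flags, mode);
}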

Thanks again,

John


On 10/26/2016 5:54 PM, Dilger, Andreas wrote:
>
> Doing all of the ioctl() handling directly in your application is not 
> a great idea, as that will not allow using a bunch of new features 
> that are in the pipeline (e.g. progressive file layouts, file-level 
> redundancy, etc.).  It would be a lot better to use the provided 
> llapi_file_create() or llapi_layout_*() functions to isolate your 
> application from the underlying implementation of how the file layout 
> is set.
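>
> A minimal sketch of that route, assuming <lustre/lustreapi.h> is 
> available and you link with -llustreapi:
>
> #include <fcntl.h>
> #include <lustre/lustreapi.h>
>
> int create_striped(const char *path, int stripe_count)
> {
>     /* stripe_size 1 MiB, stripe_offset -1 (MDT picks the OSTs),
>      * stripe_pattern 0 (default RAID0 layout) */
>     int rc = llapi_file_create(path, 1048576, -1, stripe_count, 0);
>     if (rc < 0)
>         return rc;                    /* negative errno on failure */
>     return open(path, O_RDWR, 0640);  /* plain open; layout already set */
> }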
>
> Specifics about your implementation:
>
> - it is only possible to set the layout on a file once, when it is 
> first created, so doing this from multiple threads for a single shared 
> file is broken.  You should do that only from rank 0 (see the sketch 
> after this list).
>
> - it is possible to create a separate file for each thread/rank, but 
> you probably don't want to set the stripe *count* == rank for each 
> file.  It doesn't make sense to create a bunch of different files for 
> the same application, each one with a different stripe count.  You 
> probably meant to set the stripe_offset == rank so that the load is 
> spread evenly across all OSTs?
>
> - as a caveat for the above, specifying the OST index directly == rank 
> can cause problems, compared to just allowing the MDT to select the 
> OST indices for each file itself.  If num_ranks < ost_count then only 
> the first num_ranks OSTs would ever be used, and space usage on the 
> OSTs would be imbalanced. Also, if some OST is offline or overloaded 
> your application would not be able to create new files, while this can 
> be avoided by allowing the MDT to select the OST index for each file.  
> With one file per rank it is best to use stripe_count = 1 for all 
> files, since you already have parallelism at the application level.
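>
> To make the rank 0 recommendation concrete, a sketch of the 
> create-then-barrier pattern for a single shared file 
> (llapi_file_create() as above; the stripe parameters are 
> placeholders):
>
> #include <mpi.h>
> #include <fcntl.h>
> #include <lustre/lustreapi.h>
>
> int open_shared(const char *path, int rank)
> {
>     if (rank == 0)  /* only rank 0 creates the file and sets the layout */
>         llapi_file_create(path, 1048576, -1, 4, 0);
>     MPI_Barrier(MPI_COMM_WORLD);      /* layout exists before anyone opens */
>     return open(path, O_RDWR, 0640);  /* no O_CREAT race, no ioctl() */
> }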
>
> Cheers, Andreas
>
> -- 
>
> Andreas Dilger
>
> Lustre Principal Architect
>
> Intel High Performance Data Division
>
> On 2016/10/26, 06:51, John Bauer <bauerj at iodoctors.com> wrote:
>
> All
>
> I am running a 4-rank MPI job where all the ranks open the file, 
> attempt to set the striping with ioctl(), and then do a small write.  
> Intermittently, I get errors on the write() and ioctl().  This is a 
> synthetic test case, boiled down from a much larger real-world job.  
> Note that I set the stripe_count to rank+1 so I can tell which of the 
> ranks actually set the striping.
>
> I have determined that I only get the write failure when the ioctl 
> also failed with "No data available".  It also strikes me that at most 
> one rank reports "File exists".  With a 4-rank job, I would expect the 
> normal behavior to be that one rank works as expected (no error) and 
> the other three report "File exists".
>
> Is this expected behavior?
>
> rank=1 doIO() -1=ioctl(fd=9) No data available
> rank=1 doIO() -1=write(fd=9) Bad file descriptor
> rank=3 doIO() -1=ioctl(fd=9) File exists
>
> oflags = O_CREAT|O_TRUNC|O_RDWR
>
> #include <stdio.h>
> #include <string.h>
> #include <errno.h>
> #include <unistd.h>
> #include <fcntl.h>
> #include <sys/ioctl.h>
> #include <lustre/lustre_user.h> /* O_LOV_DELAY_CREATE, LL_IOC_LOV_SETSTRIPE */
>
> void
> doIO(const char *fileName, int rank){
>    int status ;
>    /* delay layout creation so it can be set explicitly below */
>    int fd=open(fileName, O_RDWR|O_TRUNC|O_CREAT|O_LOV_DELAY_CREATE, 0640) ;
>    if( fd < 0 ) return ;
>
>    /* stripe_count is rank+1 purely so the failing rank can be identified */
>    struct lov_user_md opts = {0};
>    opts.lmm_magic          = LOV_USER_MAGIC;
>    opts.lmm_stripe_size    = 1048576;
>    opts.lmm_stripe_offset  = -1 ;
>    opts.lmm_stripe_count   = rank+1 ;
>    opts.lmm_pattern        = 0 ;
>
>    status = ioctl( fd, LL_IOC_LOV_SETSTRIPE, &opts );
>    if( status < 0 )
>       fprintf(stderr,"rank=%d %s() %d=ioctl(fd=%d) %s\n",
>               rank,__func__,status,fd,strerror(errno));
>
>    const char *string = "this is it\n" ;
>    int nc = strlen(string) ;
>    status = write( fd, string, nc ) ;
>    if( status != nc )
>       fprintf(stderr,"rank=%d %s() %d=write(fd=%d) %s\n",
>               rank,__func__,status,fd,status<0?strerror(errno):"");
>
>    status = close(fd) ;
>    if( status < 0 )
>       fprintf(stderr,"rank=%d %s() %d=close(fd=%d) %s\n",
>               rank,__func__,status,fd,strerror(errno));
> }
>
> -- 
> I/O Doctors, LLC
> 507-766-0378
> bauerj at iodoctors.com

-- 
I/O Doctors, LLC
507-766-0378
bauerj at iodoctors.com
