[lustre-discuss] Lustre striping and MPI

Dilger, Andreas andreas.dilger at intel.com
Wed Oct 26 15:54:43 PDT 2016

Doing all of the ioctl() handling directly in your application is not a great idea, because it prevents you from using several new features that are in the pipeline (e.g. progressive file layouts, file-level redundancy).  It would be much better to use the provided llapi_file_create() or llapi_layout_*() interfaces, which isolate your application from the underlying implementation of how the file layout is set.
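As a rough illustration of the llapi route, here is a minimal sketch, not a definitive implementation.  The helper name create_striped() and the HAVE_LUSTREAPI build macro are made up for this example; only llapi_file_create() itself comes from liblustreapi.  Without the macro, the sketch falls back to a plain exclusive create so that it still builds on a machine with no Lustre headers.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#ifdef HAVE_LUSTREAPI            /* hypothetical build flag, not a Lustre macro */
#include <lustre/lustreapi.h>
#endif

/* Hypothetical helper: create `path` with an explicit layout.
 * Returns 0 on success or a negative errno value on failure. */
static int create_striped(const char *path, unsigned long long stripe_size,
                          int stripe_count)
{
#ifdef HAVE_LUSTREAPI
    /* stripe_offset = -1 lets the MDT pick the starting OST;
     * stripe_pattern = 0 requests the default pattern. */
    return llapi_file_create(path, stripe_size, -1, stripe_count, 0);
#else
    /* Fallback for non-Lustre builds: an exclusive create, so a second
     * call on the same path reports -EEXIST instead of silently succeeding.
     * The stripe parameters are ignored here. */
    (void)stripe_size;
    (void)stripe_count;
    int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0640);
    if (fd < 0)
        return -errno;
    close(fd);
    return 0;
#endif
}
```

Something like create_striped("/mnt/lustre/out.dat", 1048576, 4) would then be called once, before any rank opens the file for I/O; the path here is only a placeholder.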

Specifics about your implementation:
- It is only possible to set the layout on a file once, when it is first created, so doing this from multiple threads for a single shared file is broken.  You should do that only from rank 0.
- It is possible to create a separate file for each thread/rank, but you probably don't want to set the stripe *count* == rank for each file.  It doesn't make sense to create a bunch of different files for the same application, each one with a different stripe count.  You probably meant to set the stripe_offset == rank so that the load is spread evenly across all OSTs?
- As a caveat for the above, specifying the OST index directly == rank can cause problems, compared to just allowing the MDT to select the OST indices for each file itself.  If num_ranks < ost_count then only the first num_ranks OSTs would ever be used, and space usage on the OSTs would be imbalanced.  Also, if some OST is offline or overloaded, your application would not be able to create new files on it, whereas this is avoided by allowing the MDT to select the OST index for each file.  With one file per rank it is best to use stripe_count = 1 for all files, since you already have parallelism at the application level.
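To make the imbalance point concrete, here is a small arithmetic sketch.  It does not talk to Lustre at all; pinned_idle_osts() is a hypothetical helper that simply tallies where objects would land if every rank pinned its file to OST index == rank over a number of job runs.

```c
#include <stddef.h>

/* Hypothetical helper: count objects per OST when each rank pins its
 * one-stripe file to OST index == rank, repeated over `runs` job runs.
 * `counts` must have room for ost_count entries.
 * Returns the number of OSTs that never receive an object. */
static size_t pinned_idle_osts(size_t num_ranks, size_t ost_count,
                               size_t runs, size_t *counts)
{
    size_t idle = 0;

    for (size_t ost = 0; ost < ost_count; ost++)
        counts[ost] = 0;

    for (size_t run = 0; run < runs; run++)
        for (size_t rank = 0; rank < num_ranks && rank < ost_count; rank++)
            counts[rank]++;      /* this rank's file lands on OST[rank] */

    for (size_t ost = 0; ost < ost_count; ost++)
        if (counts[ost] == 0)
            idle++;

    return idle;
}
```

With 4 ranks and 16 OSTs, every run adds one object to each of OSTs 0-3 and nothing to the other 12, which is the space imbalance described above.  Leaving stripe_offset at -1 hands that placement decision to the MDT instead.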

Cheers, Andreas
Andreas Dilger
Lustre Principal Architect
Intel High Performance Data Division

On 2016/10/26, 06:51, "John Bauer" <bauerj at iodoctors.com> wrote:


I am running a 4 rank MPI job where all the ranks do an open of the file, attempt to set the striping with ioctl() and then do a small write.  Intermittently, I get errors on the write() and ioctl().  This is a synthetic test case, boiled down from a much larger real world job.  Note that I set the stripe_count to rank+1 so I can tell which of the ranks actually set the striping.

I have determined that I only get the write failure when the ioctl also failed with "No data available".  It also strikes me that, at most, only one rank reports "File exists".  With a 4 rank job, I would expect the normal behavior to be that 1 rank works as expected (no error) and the other 3 report "File exists".

Is this expected behavior?

rank=1 doIO() -1=ioctl(fd=9) No data available
rank=1 doIO() -1=write(fd=9) Bad file descriptor
rank=3 doIO() -1=ioctl(fd=9) File exists


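#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
/* Lustre user header for LOV_USER_MAGIC, LL_IOC_LOV_SETSTRIPE and
 * O_LOV_DELAY_CREATE; its install path varies between Lustre versions. */
#include <lustre/lustre_user.h>

void doIO(const char *fileName, int rank){
   int status ;
   /* O_LOV_DELAY_CREATE defers object allocation until the layout is set */
   int fd=open(fileName, O_RDWR|O_TRUNC|O_CREAT|O_LOV_DELAY_CREATE, 0640 ) ;
   if( fd < 0 ) return ;

   struct lov_user_md opts = {0};
   opts.lmm_magic = LOV_USER_MAGIC;
   opts.lmm_stripe_size    = 1048576;
   opts.lmm_stripe_offset  = -1 ;
   opts.lmm_stripe_count   = rank+1 ;   /* rank+1 identifies which rank set the striping */
   opts.lmm_pattern        = 0 ;

   status = ioctl ( fd , LL_IOC_LOV_SETSTRIPE, &opts);
   if(status<0)fprintf(stderr,"rank=%d %s() %d=ioctl(fd=%d) %s\n",rank,__func__,status,fd,strerror(errno));

   const char *string = "this is it\n" ;
   int nc = strlen(string) ;
   status = write( fd, string, nc ) ;
   if( status != nc ) fprintf(stderr,"rank=%d %s() %d=write(fd=%d) %s\n",rank,__func__,status,fd,status<0?strerror(errno):"");
   status = close(fd) ;
   if(status<0)fprintf(stderr,"rank=%d %s() %d=close(fd=%d) %s\n",rank,__func__,status,fd,strerror(errno));
}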


I/O Doctors, LLC


bauerj at iodoctors.com