[Lustre-discuss] I/O on cluster with lustre

Mon Nov 30 16:40:16 PST 2009

On 2009-11-26, at 12:40, Goranka Bilalbegovic wrote:
> Recently the cluster I use for computing was updated to VMware with
> the Lustre file system. The cluster uses Oscar 6.0.3, Sun Grid Engine
> 6.2u3, Nagios, Ganglia, and InfiniBand 10 Gb/s. Nodes access the file
> system over Ethernet via the Lustre InfiniBand/Ethernet router.
>
> I used to run one type of job as:
> ---
> #$ -N name
> #$ -o namesys.out
> #$ -e namesys.err
> #$ -pe mpi 2
> #$ -cwd
> #$ -v LD_LIBRARY_PATH
> mpirun -machinefile $TMPDIR/machines -np $NSLOTS /path/.../code.x << EOF
> name.in
> name.out
> EOF
> ---
>
> This is for an open source package (written in Fortran plus some C
> utilities), and this way of running it was recommended by the authors.
> It worked on the previous version of the cluster, but it does not
> run on the new Lustre filesystem. It starts, but then stays in the
> queue forever.

Without more information it is impossible to know what the problem
is.  There shouldn't be any problem with running executables from
Lustre.

General debugging steps that should be followed (not strictly related  
to this problem):
- presumably the Lustre filesystem is accessible from within your VM
   and is working fine other than this job launch problem?
- try to run the job by hand to see if it really is a Lustre problem
   or if it is related to the batch scheduler or something else
- check /var/log/messages to see if there are Lustre (or other) errors
- do "echo t > /proc/sysrq-trigger" to dump the stacks of all processes
   on the system, and see where your job is stuck
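
The steps above can be sketched as a few shell helpers. This is only a
rough illustration, not a tested procedure for this cluster: the function
names are made up for the example, the grep pattern is a guess at what
Lustre-related log lines contain, and the last helper needs root.

```shell
#!/bin/sh
# Hypothetical helpers for the debugging steps listed above.
# Function names, default paths, and the grep pattern are assumptions.

# Step 1: list mount points whose filesystem type is "lustre".
# /proc/mounts has one "device mountpoint fstype options ..." entry per line.
lustre_mounts() {
    awk '$3 == "lustre" { print $2 }' "${1:-/proc/mounts}"
}

# Step 2 has no helper: run the mpirun line from the job script by hand,
# outside the batch scheduler, and see whether it hangs there too.

# Step 3: show the most recent log lines that mention Lustre or LNet.
lustre_log_errors() {
    grep -iE 'lustre|lnet' "${1:-/var/log/messages}" | tail -n 20
}

# Step 4: dump the stacks of all tasks to the kernel log, then show the
# tail of that log. Must be run as root.
dump_stacks() {
    echo t > /proc/sysrq-trigger
    dmesg | tail -n 200
}
```

On a compute node one would source this file and call lustre_mounts and
lustre_log_errors first, and only use dump_stacks (as root) while the job
is actually hung, since the stack dump can be large.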

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.