[Lustre-discuss] I/O on cluster with lustre
Andreas Dilger
adilger at sun.com
Mon Nov 30 16:40:16 PST 2009
On 2009-11-26, at 12:40, Goranka Bilalbegovic wrote:
> Recently the cluster I am using for computing has been updated to
> use VMware with the Lustre file system. The cluster uses: Oscar 6.0.3,
> Sun Grid Engine 6.2u3, Nagios, Ganglia, InfiniBand 10 Gb/s. Nodes
> access the file system using Ethernet via the Lustre InfiniBand/
> Ethernet router.
>
> I used to run one type of job as:
> ---
> #$ -N name
> #$ -o namesys.out
> #$ -e namesys.err
> #$ -pe mpi 2
> #$ -cwd
> #$ -v LD_LIBRARY_PATH
> mpirun -machinefile $TMPDIR/machines -np $NSLOTS /path/.../code.x <<
> EOF
> name.in
> name.out
> EOF
> ---
>
> This is for an open source package (written in Fortran plus some C
> utilities) and this way of running was recommended by the authors. It
> was working on the previous version of the cluster, but it does not
> run on the new Lustre filesystem. It starts, but then stays in the
> queue forever.
Without more information it is impossible to know what the problem
is. There shouldn't be any problem with running executables from
Lustre.

General debugging steps that should be followed (not strictly related
to this problem):
- presumably the Lustre filesystem is accessible from within your VM
and is working fine other than this job launch problem?
- try to run the job by hand to see if it really is a Lustre problem
or if it is related to the batch scheduler or something else
- check /var/log/messages to see if there are Lustre (or other) errors
- do "echo t > /proc/sysrq-trigger" to dump the stacks of all processes
on the system, and see where your job is stuck
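
As a rough sketch, the mount check, the log check and the stack dump
above could be wrapped up as below. The function names are made up for
illustration, and it assumes syslog goes to /var/log/messages and that
you have root on the node for the sysrq step:

```shell
#!/bin/sh
# Hypothetical helper functions for the debugging steps above.

# List mounted Lustre filesystems (prints nothing if none are mounted).
lustre_mounts() {
    mount -t lustre
}

# Print Lustre/LNet lines from a syslog file.
# Lustre kernel messages are prefixed "Lustre:" or "LustreError:".
# The default path is an assumption; pass another file as $1 if needed.
lustre_log_lines() {
    grep -E 'Lustre|LNet' "${1:-/var/log/messages}"
}

# Dump the stacks of all tasks into the kernel log (needs root),
# then show the most recent part of that log.
dump_stacks() {
    echo t > /proc/sysrq-trigger
    dmesg | tail -n 200
}
```

After sourcing the file, something like "lustre_log_lines | tail -n 50"
shows the most recent Lustre messages, and "dump_stacks" run as root
shows where each process is sleeping. The stack dump can be long, so
look for the name of your executable in the output.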
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.