[Lustre-discuss] Issues with Lustre Client 1.8.4 and Server 1.8.1.1

Robin Humble robin.humble+lustre at anu.edu.au
Tue Oct 19 21:48:58 PDT 2010


Hi Jagga,

On Wed, Oct 13, 2010 at 02:33:35PM -0700, Jagga Soorma wrote:
..
>start seeing this issue.  All my clients are setup with SLES11 and the same
>packages with the exception of a newer kernel in the 1.8.4 environment due
>to the lustre dependency:
>
>reshpc208:~ # uname -a
>Linux reshpc208 2.6.27.39-0.3-default #1 SMP 2009-11-23 12:57:38 +0100 x86_64 x86_64 x86_64 GNU/Linux
...
>open("/proc/9598/stat", O_RDONLY)       = 6
>read(6, "9598 (gsnap) S 9596 9589 9589 0 "..., 1023) = 254
>close(6)                                = 0
>open("/proc/9598/status", O_RDONLY)     = 6
>read(6, "Name:\tgsnap\nState:\tS (sleeping)\n"..., 1023) = 1023
>close(6)                                = 0
>open("/proc/9598/cmdline", O_RDONLY)    = 6
>read(6,

did you get any further with this?

we've just seen something very similar: D state hung processes, and
an strace of ps that hung at the same place.
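
fwiw, the quoted strace above shows ps stalling right after opening
/proc/9598/cmdline. a quick way to confirm you're hitting the same
spot (9598 is just the example pid from the trace above):

  # trace only the open/read calls; if the task's mm is wedged,
  # the read of /proc/<pid>/cmdline never returns
  strace -e trace=open,read ps -p 9598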

in the end our hang appeared to be /dev/shm related, and running
'ipcs -ma' magically caused all the D state processes to continue...
we don't have a good explanation for why that works. it looks like a
generic kernel shm deadlock, possibly unrelated to Lustre.
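
in case it's useful, roughly the sequence we used (just a sketch;
the awk filter is one way of many to spot D state tasks):

  # list uninterruptible tasks plus the kernel symbol they're
  # sleeping in (wchan)
  ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'

  # dump SysV shared memory info; oddly this alone was enough to
  # unwedge the stuck processes for us
  ipcs -ma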

sys_shmdt features in the hung process tracebacks that the kernel
prints out.
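
if the hung task watchdog isn't printing them for you, you can ask
for tracebacks directly (assumes sysrq is enabled; the per-task
stack file needs a 2.6.29+ kernel with CONFIG_STACKTRACE):

  # dump all task stacks to the kernel log, then search for shmdt
  echo t > /proc/sysrq-trigger
  dmesg | grep -B 5 shmdt

  # or per task on newer kernels
  cat /proc/<pid>/stack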

if you run 'lsof', do you see lots of /dev/shm entries for your app?
the app we saw run into trouble was using HP-MPI, which is common in
commercial packages. does gsnap use HP-MPI?
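
we just eyeballed it with something like the below (lsof options
vary between versions, so plain grep is the safe bet):

  # count open /dev/shm files per command/pid
  lsof 2>/dev/null | grep /dev/shm | awk '{print $1, $2}' | sort | uniq -c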

we are running vanilla 2.6.32.* kernels with Lustre 1.8.4 clients on
this cluster.

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility


