[Lustre-discuss] Issues with Lustre Client 1.8.4 and Server 1.8.1.1

Jagga Soorma jagga13 at gmail.com
Tue Oct 19 22:05:01 PDT 2010


Hey Robin,

We are still looking into this issue and trying to figure out what is causing
the problem.  We are using an application called gsnap that does use OpenMPI
and Rmpi.  The next time this happens I will definitely look at lsof and check
for any /dev/shm related entries.  I will report back with more
information.
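
For reference, here is roughly what I plan to run on a node the next time a
process wedges (the <pid> below is just a placeholder for the stuck gsnap
process):

  # list any open /dev/shm entries held by the stuck process
  lsof -p <pid> | grep /dev/shm

  # or scan the whole node for /dev/shm users
  lsof /dev/shm

  # show every process currently stuck in uninterruptible (D) sleep
  ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'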

What OS are you running on your clients?  What did you end up doing as the
long-term fix for this issue?  I am thinking of downgrading to the
2.6.27.29-0.1 kernel and the 1.8.1.1 Lustre client.
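
For completeness, this is what I have been capturing from each client before
deciding on the downgrade (the exact package names depend on how your Lustre
client build was packaged, so they may differ on your systems):

  # kernel currently booted on the client
  uname -r

  # Lustre client packages installed on the node
  rpm -qa | grep -i lustre

  # Lustre version reported by the loaded client modules
  cat /proc/fs/lustre/version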

Regards,
-J

On Tue, Oct 19, 2010 at 9:48 PM, Robin Humble
<robin.humble+lustre at anu.edu.au> wrote:

> Hi Jagga,
>
> On Wed, Oct 13, 2010 at 02:33:35PM -0700, Jagga Soorma wrote:
> ..
> >start seeing this issue.  All my clients are set up with SLES11 and the same
> >packages, with the exception of a newer kernel in the 1.8.4 environment due
> >to the Lustre dependency:
> >
> >reshpc208:~ # uname -a
> >Linux reshpc208 2.6.27.39-0.3-default #1 SMP 2009-11-23 12:57:38 +0100 x86_64 x86_64 x86_64 GNU/Linux
> ...
> >open("/proc/9598/stat", O_RDONLY)       = 6
> >read(6, "9598 (gsnap) S 9596 9589 9589 0 "..., 1023) = 254
> >close(6)                                = 0
> >open("/proc/9598/status", O_RDONLY)     = 6
> >read(6, "Name:\tgsnap\nState:\tS (sleeping)\n"..., 1023) = 1023
> >close(6)                                = 0
> >open("/proc/9598/cmdline", O_RDONLY)    = 6
> >read(6,
>
> Did you get any further with this?
>
> We've just seen something similar: we had D-state hung processes,
> and an strace of ps hung at the same place.
>
> In the end our hang appeared to be /dev/shm related, and an 'ipcs -ma'
> magically caused all the D-state processes to continue... we don't have
> a good idea why this might be.  It looks kinda like a generic kernel shm
> deadlock, possibly unrelated to Lustre.
>
> sys_shmdt features in the hung process tracebacks that the kernel
> prints out.
>
> If you do 'lsof', do you see lots of /dev/shm entries for your app?
> The app we saw run into trouble was using HPMPI, which is common in
> commercial packages.  Does gsnap use HPMPI?
>
> We are running vanilla 2.6.32.* kernels with Lustre 1.8.4 clients on
> this cluster.
>
> cheers,
> robin
> --
> Dr Robin Humble, HPC Systems Analyst, NCI National Facility