[Lustre-discuss] Issues with Lustre Client 1.8.4 and Server 1.8.1.1

Jagga Soorma jagga13 at gmail.com
Wed Oct 20 07:41:50 PDT 2010


Robin,

This does not seem to help us at all.  I was not able to find any /dev/shm
related entries in the lsof output, and after running 'ipcs -ma' my gsnap
process went from a D state to an S state.  However, now my nscd daemon has
entered the D state.

--
# ipcs -ma

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch
status

------ Semaphore Arrays --------
key        semid      owner      perms      nsems

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
--
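
For reference, here is roughly what I am running to spot the hung processes
and check for the /dev/shm usage you mentioned (the lsof filtering is just my
best guess at what to look for):

--
# list processes stuck in uninterruptible sleep (D state)
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /D/'

# look for /dev/shm entries held open by the application
lsof 2>/dev/null | grep /dev/shm

# dump all SysV shared memory segments, semaphores and message queues
ipcs -ma
--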

-J

On Tue, Oct 19, 2010 at 10:05 PM, Jagga Soorma <jagga13 at gmail.com> wrote:

> Hey Robin,
>
> We are still looking into this issue and trying to figure out what is
> causing the problem.  We are using an application called gsnap that uses
> openmpi and RMPI.  The next time this happens I will definitely look at lsof
> and see if there are any /dev/shm related entries.  Will report back with
> more information.
>
> What OS are you guys using on your clients?  What did you guys end up doing
> as the long-term fix for this issue?  I am thinking of downgrading to the
> 2.6.27.29-0.1 kernel and the 1.8.1.1 Lustre client.
>
> Regards,
> -J
>
>
> On Tue, Oct 19, 2010 at 9:48 PM, Robin Humble <robin.humble+lustre at anu.edu.au> wrote:
>
>> Hi Jagga,
>>
>> On Wed, Oct 13, 2010 at 02:33:35PM -0700, Jagga Soorma wrote:
>> ..
>> >start seeing this issue.  All my clients are set up with SLES11 and the
>> >same packages with the exception of a newer kernel in the 1.8.4
>> >environment due to the lustre dependency:
>> >
>> >reshpc208:~ # uname -a
>> >Linux reshpc208 2.6.27.39-0.3-default #1 SMP 2009-11-23 12:57:38 +0100
>> >x86_64 x86_64 x86_64 GNU/Linux
>> ...
>> >open("/proc/9598/stat", O_RDONLY)       = 6
>> >read(6, "9598 (gsnap) S 9596 9589 9589 0 "..., 1023) = 254
>> >close(6)                                = 0
>> >open("/proc/9598/status", O_RDONLY)     = 6
>> >read(6, "Name:\tgsnap\nState:\tS (sleeping)\n"..., 1023) = 1023
>> >close(6)                                = 0
>> >open("/proc/9598/cmdline", O_RDONLY)    = 6
>> >read(6,
>>
>> did you get any further with this?
>>
>> we've just seen something similar in that we had D state hung processes
>> and a strace of ps hung at the same place.
>>
>> in the end our hang appeared to be /dev/shm related, and an 'ipcs -ma'
>> magically caused all the D state processes to continue... we don't have
>> a good idea why this might be. looks kinda like a generic kernel shm
>> deadlock, possibly unrelated to Lustre.
>>
>> sys_shmdt features in the hung process tracebacks that the kernel
>> prints out.
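>>
>> (a rough sketch of how we capture those tracebacks, assuming SysRq is
>> enabled and the kernel exposes /proc/<pid>/stack:
>>
>> # ask the kernel to log stack traces of all blocked (D state) tasks
>> echo w > /proc/sysrq-trigger
>> dmesg | tail
>>
>> # or inspect a single hung pid directly
>> cat /proc/<pid>/stack
>> )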
>>
>> if you do 'lsof' do you see lots of /dev/shm entries for your app?
>> the app we saw run into trouble was using HPMPI which is common in
>> commercial packages. does gsnap use HPMPI?
>>
>> we are running vanilla 2.6.32.* kernels with Lustre 1.8.4 clients on
>> this cluster.
>>
>> cheers,
>> robin
>> --
>> Dr Robin Humble, HPC Systems Analyst, NCI National Facility