[Lustre-discuss] Issues with Lustre Client 1.8.4 and Server 1.8.1.1

Jagga Soorma jagga13 at gmail.com
Wed Oct 13 14:33:35 PDT 2010


Hey Guys,

I have 16 clients running Lustre 1.8.1.1 and 8 new clients running 1.8.4.
My server is still running Lustre 1.8.1.1 (RHEL 5.3).  I deployed these
new nodes a few weeks ago and have started seeing some user processes
go into an uninterruptible state.  When the same workload runs on
the 1.8.1.1 clients it is fine, but when we run it on our 1.8.4 clients we
start seeing this issue.  All my clients are set up with SLES11 and the same
packages, with the exception of a newer kernel in the 1.8.4 environment due
to the Lustre dependency:

reshpc208:~ # uname -a
Linux reshpc208 2.6.27.39-0.3-default #1 SMP 2009-11-23 12:57:38 +0100
x86_64 x86_64 x86_64 GNU/Linux
reshpc208:~ # rpm -qa | grep -i lustre
lustre-client-modules-1.8.4-2.6.27_39_0.3_lustre.1.8.4_default
lustre-client-1.8.4-2.6.27_39_0.3_lustre.1.8.4_default
reshpc208:~ # rpm -qa | grep -i kernel-ib
kernel-ib-1.5.1-2.6.27.39_0.3_default

Running ps hangs on the system, and I have to close and reopen my
session to the affected system.  The application (gsnap) runs from the
Lustre filesystem and does all of its I/O to the Lustre fs.  Here is an
strace showing where ps hangs:

--
output from "strace ps -ef"

doing an ls in /proc/9598 just hangs the session as well

..snip..
open("/proc/9597/cmdline", O_RDONLY)    = 6
read(6, "sh\0-c\0/gne/home/coryba/bin/gsnap"..., 2047) = 359
close(6)                                = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0
write(1, "degenhj2  9597  9595  0 11:35 ? "..., 130degenhj2  9597  9595  0
11:35 ?        00:00:00 sh -c /gne/home/coryba/bin/gsnap -M 3 -t 16 -m 3 -n
1 -d mm9 -e 1000 -E 1000 --pa
) = 130
stat("/proc/9598", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
open("/proc/9598/stat", O_RDONLY)       = 6
read(6, "9598 (gsnap) S 9596 9589 9589 0 "..., 1023) = 254
close(6)                                = 0
open("/proc/9598/status", O_RDONLY)     = 6
read(6, "Name:\tgsnap\nState:\tS (sleeping)\n"..., 1023) = 1023
close(6)                                = 0
open("/proc/9598/cmdline", O_RDONLY)    = 6
read(6,
--

The "t" before "gsnap" is part of "\t", or a "tab" character.  It looks like
GSNAP was trying to open a file or read from it.
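
Since ps blocks on the cmdline read, a small script that reads only
/proc/<pid>/stat can still list process states on an affected node.  This is
a minimal sketch, assuming stat stays readable for the stuck process as it
did in the trace above (only the cmdline read blocked).  The function names
are my own, not part of any tool:

```python
import os


def parse_stat_line(text):
    """Split the start of a /proc/<pid>/stat line into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the first '(' and the last ')' instead of splitting on spaces.
    open_paren = text.index("(")
    close_paren = text.rindex(")")
    pid = int(text[:open_paren])
    comm = text[open_paren + 1:close_paren]
    state = text[close_paren + 1:].split()[0]
    return pid, comm, state


def list_process_states(proc_root="/proc"):
    """Return (pid, comm, state) for every process, never touching cmdline."""
    results = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                results.append(parse_stat_line(handle.read()))
        except (OSError, ValueError, IndexError):
            # The process exited in the meantime, or the line was unparsable.
            continue
    return results


if __name__ == "__main__":
    for pid, comm, state in sorted(list_process_states()):
        # "D" is uninterruptible sleep; flag it so stuck tasks stand out.
        marker = "  <-- uninterruptible" if state == "D" else ""
        print(f"{pid:>7} {state} {comm}{marker}")
```

Note that a task can show state S in stat while reads of its cmdline still
block, as 9598 does above, so the D flag alone may not catch every stuck
process.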

I don't see any recent Lustre-specific errors in my logs (the ones from Oct
10th are expected):

--
..snip..
Oct 10 12:52:01 reshpc208 kernel: Lustre:
12933:0:(import.c:517:import_select_connection())
reshpcfs-MDT0000-mdc-ffff88200d5d2400: tried all connections, increasing
latency to 2s
Oct 10 12:52:01 reshpc208 kernel: Lustre:
12933:0:(import.c:517:import_select_connection()) Skipped 2 previous similar
messages
Oct 10 12:52:29 reshpc208 kernel: LustreError: 166-1: MGC10.0.250.44@o2ib3:
Connection to service MGS via nid 10.0.250.44@o2ib3 was lost; in progress
operations using this service will fail.
Oct 10 12:52:43 reshpc208 kernel: Lustre:
12932:0:(import.c:855:ptlrpc_connect_interpret()) MGS@10.0.250.44@o2ib3
changed server handle from 0x816b8508159f149f to 0x816b850815ae68ab
Oct 10 12:52:43 reshpc208 kernel: Lustre: MGC10.0.250.44@o2ib3: Reactivating
import
Oct 10 12:52:43 reshpc208 kernel: Lustre: MGC10.0.250.44@o2ib3: Connection
restored to service MGS using nid 10.0.250.44@o2ib3.
Oct 10 12:52:43 reshpc208 kernel: Lustre: Skipped 1 previous similar message
Oct 10 12:52:45 reshpc208 kernel: LustreError: 11-0: an error occurred while
communicating with 10.0.250.44@o2ib3. The obd_ping operation failed with
-107
--
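
For reference, the trailing number in "operation failed with -107" is a
negated errno value.  On Linux, 107 is ENOTCONN ("Transport endpoint is not
connected"), which fits the MGS connection having just been lost.  Here is a
rough sketch for decoding such lines.  It assumes the machine running it uses
the same errno numbering as the client kernel, and the helper name is made up
for illustration:

```python
import errno
import os
import re

# Matches the tail of a LustreError 11-0 message, tolerating the line wrap
# that mail clients or syslog viewers may insert before the number.
FAILED_OP = re.compile(r"The (\w+) operation failed with\s+(-?\d+)")


def decode_failed_operation(log_text):
    """Return (operation, errno number, symbolic name, description) or None."""
    match = FAILED_OP.search(log_text)
    if match is None:
        return None
    operation = match.group(1)
    code = abs(int(match.group(2)))
    name = errno.errorcode.get(code, "UNKNOWN")
    return operation, code, name, os.strerror(code)


if __name__ == "__main__":
    sample = (
        "LustreError: 11-0: an error occurred while communicating with "
        "10.0.250.44@o2ib3. The obd_ping operation failed with\n-107"
    )
    print(decode_failed_operation(sample))
```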

Again, we don't have any issues on our 1.8.1.1 clients, and this seems to
happen only on our 1.8.4 clients.  Any assistance would be greatly
appreciated.

Has anyone seen anything similar to this?  Should I just revert to
1.8.1.1 on these new nodes?  When is 1.8.5 supposed to come out?  I would
prefer to jump to SLES 11 SP1.

Thanks,
-J