<div id="qhide_281631" style="display: block;" class="qt">Robin,<br><br>I don't believe that is the cause; otherwise I would be seeing this on my older 1.8.1.1 clients as well.  Yes, /proc/sys/vm/zone_reclaim_mode is set to 0 on all of my old and new clients.  As I mentioned, the only things I changed were the lustre client rpms and the kernel, which had to change because of the lustre client's dependency on it.  Other than that, all packages are the same.<br>

<br>Another thing to note is that my user home directories sit in the lustre filesystem, and we did have a failover occur over the past weekend.  I am not sure whether this would cause any issues, but I did see the lustre client re-establish its connection with the lustre servers, and I have not seen any other lustre-specific errors since.<br>

<br>-J<br><br>>On Wed, Oct 13, 2010 at 02:33:35PM -0700, Jagga Soorma wrote: <br> >>Doing a ps just hangs on the system and I need to just close and reopen a <br> >>session to the affected system.  The application (gsnap) is running from the <br>

 >>lustre filesystem and doing all IO to the lustre fs.  Here is a strace of <br> >>where ps hangs: <br><br>>one possible cause of hung processes (that's not Lustre related) is the <br></div> >VM tying itself in knots. are your clients NUMA machines? <br>

 >is /proc/sys/vm/zone_reclaim_mode = 0? <br> <p>>I guess this explanation is a bit unlikely if your only change is the <br> >client kernel version, but you don't say what you changed it from and <br> >I'm not familiar with SLES, so the possibility is there, and it's an <br>

 >easy fix (or actually a dodgy workaround) if that's the problem. <br> </p>>-- <br> >Dr Robin Humble, HPC Systems Analyst, NCI National Facility <br><br><div class="gmail_quote">On Wed, Oct 13, 2010 at 3:46 PM, Jagga Soorma <span dir="ltr"><<a href="mailto:jagga13@gmail.com">jagga13@gmail.com</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">Okay, so one thing I noticed in both instances is that there was a Metadata server outage a few days before the users complained about these issues.  The clients re-established the connection once the metadata services were brought back online.  My understanding was that processes would simply hang while the storage was unavailable and then carry on once the lustre filesystem was made available again.  Am I incorrect in this assumption?  Could the outage have left these processes in this hung state?  Again, it does not seem like all processes across all nodes were affected.<br>


<br>Thanks,<br>-Simran<div><div></div><div class="h5"><br><br><div class="gmail_quote">On Wed, Oct 13, 2010 at 2:33 PM, Jagga Soorma <span dir="ltr"><<a href="mailto:jagga13@gmail.com" target="_blank">jagga13@gmail.com</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">

Hey Guys,<br><br>I have 16 clients running lustre 1.8.1.1 and 8 new clients running 1.8.4.  My server is still running lustre 1.8.1.1 (RHEL 5.3).  I deployed these new nodes a few weeks ago and have started seeing some user processes go into an uninterruptible state.  When the same workload is run on the 1.8.1.1 clients it runs fine, but when we run it on our 1.8.4 clients we start seeing this issue.  All my clients are set up with SLES11 and the same packages, with the exception of a newer kernel in the 1.8.4 environment due to the lustre dependency:<br>


<br>reshpc208:~ # uname -a<br>Linux reshpc208 2.6.27.39-0.3-default #1 SMP 2009-11-23 12:57:38 +0100 x86_64 x86_64 x86_64 GNU/Linux<br>reshpc208:~ # rpm -qa | grep -i lustre<br>lustre-client-modules-1.8.4-2.6.27_39_0.3_lustre.1.8.4_default<br>


lustre-client-1.8.4-2.6.27_39_0.3_lustre.1.8.4_default<br>reshpc208:~ # rpm -qa | grep -i kernel-ib<br>kernel-ib-1.5.1-2.6.27.39_0.3_default<br><br>Doing a ps just hangs on the system and I need to just close and reopen a session to the affected system.  The application (gsnap) is running from the lustre filesystem and doing all IO to the lustre fs.  Here is a strace of where ps hangs:<br>


<br>--<br>output from "strace ps -ef"<br><br>doing an ls in /proc/9598 just hangs the session as well<br><br>..snip..<br>open("/proc/9597/cmdline", O_RDONLY)    = 6<br>read(6, "sh\0-c\0/gne/home/coryba/bin/gsnap"..., 2047) = 359<br>


close(6)                                = 0<br>stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0<br>stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0<br>write(1, "degenhj2  9597  9595  0 11:35 ? "..., 130degenhj2  9597  9595  0 11:35 ?        00:00:00 sh -c /gne/home/coryba/bin/gsnap -M 3 -t 16 -m 3 -n 1 -d mm9 -e 1000 -E 1000 --pa<br>


) = 130<br>stat("/proc/9598", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0<br>open("/proc/9598/stat", O_RDONLY)       = 6<br>read(6, "9598 (gsnap) S 9596 9589 9589 0 "..., 1023) = 254<br>close(6)                                = 0<br>


open("/proc/9598/status", O_RDONLY)     = 6<br>read(6, "Name:\tgsnap\nState:\tS (sleeping)\n"..., 1023) = 1023<br>close(6)                                = 0<br>open("/proc/9598/cmdline", O_RDONLY)    = 6<br>


read(6, <br>--<br><br>The "t" before "gsnap" is part of "\t", or a "tab" character.  It looks like GSNAP was trying to open a file or read from it.<br><br>I don't see any recent lustre specific errors in my logs (The ones from Oct 10th are expected):<br>


<br>--<br>..snip..<br>Oct 10 12:52:01 reshpc208 kernel: Lustre: 12933:0:(import.c:517:import_select_connection()) reshpcfs-MDT0000-mdc-ffff88200d5d2400: tried all connections, increasing latency to 2s<br>Oct 10 12:52:01 reshpc208 kernel: Lustre: 12933:0:(import.c:517:import_select_connection()) Skipped 2 previous similar messages<br>


Oct 10 12:52:29 reshpc208 kernel: LustreError: 166-1: MGC10.0.250.44@o2ib3: Connection to service MGS via nid 10.0.250.44@o2ib3 was lost; in progress operations using this service will fail.<br>Oct 10 12:52:43 reshpc208 kernel: Lustre: 12932:0:(import.c:855:ptlrpc_connect_interpret()) MGS@10.0.250.44@o2ib3 changed server handle from 0x816b8508159f149f to 0x816b850815ae68ab<br>


Oct 10 12:52:43 reshpc208 kernel: Lustre: MGC10.0.250.44@o2ib3: Reactivating import<br>Oct 10 12:52:43 reshpc208 kernel: Lustre: MGC10.0.250.44@o2ib3: Connection restored to service MGS using nid 10.0.250.44@o2ib3.<br>Oct 10 12:52:43 reshpc208 kernel: Lustre: Skipped 1 previous similar message<br>


Oct 10 12:52:45 reshpc208 kernel: LustreError: 11-0: an error occurred while communicating with 10.0.250.44@o2ib3. The obd_ping operation failed with -107<br>--<br><br>Again, we don't have any issues on our 1.8.1.1 clients; this seems to happen only on our 1.8.4 clients.  Any assistance would be greatly appreciated.<br>


<br>Has anyone seen anything similar to this?  Should I just revert to 1.8.1.1 on these new nodes?  When is 1.8.5 supposed to come out?  I would prefer to jump to SLES 11 SP1.<br><br>Thanks,<br><font color="#888888">-J<br>


</font></blockquote></div><br>

</div></div></blockquote></div><br>
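In the strace output above, ps managed to read /proc/9598/stat and /proc/9598/status and only blocked on the read of /proc/9598/cmdline. Based on that observation alone, one way to get a process listing on an affected node might be to read only the stat files and never open cmdline. The sketch below does that. The function names parse_stat_line and scan_states are made up for illustration, and it is an assumption (not something confirmed in this thread) that stat stays readable for every hung process.

```python
import os


def parse_stat_line(text):
    """Split one /proc/PID/stat line into (pid, comm, state, ppid)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the first '(' and the last ')' instead of splitting naively.
    open_idx = text.index("(")
    close_idx = text.rindex(")")
    pid = int(text[:open_idx].strip())
    comm = text[open_idx + 1:close_idx]
    rest = text[close_idx + 1:].split()
    return pid, comm, rest[0], int(rest[1])


def scan_states(proc_root="/proc"):
    """Collect (pid, comm, state, ppid) for every numeric entry under proc_root.

    Only the 'stat' file is opened; 'cmdline' is deliberately never touched.
    """
    if not os.path.isdir(proc_root):
        return []
    results = []
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, name, "stat")) as fh:
                results.append(parse_stat_line(fh.read()))
        except (OSError, ValueError, IndexError):
            # The process exited between listdir() and open(),
            # or the line could not be parsed; skip it.
            continue
    return results


if __name__ == "__main__":
    for pid, comm, state, ppid in sorted(scan_states()):
        print(pid, ppid, state, comm)
```

Note that in the strace gsnap was reported as "S (sleeping)", so filtering the output for state D would not necessarily catch the stuck process; comparing the full listing against the pid where ps stops is probably more useful.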