[Lustre-discuss] I/O errors with NAMD

Wojciech Turek wjt27 at cam.ac.uk
Fri Jul 23 12:27:38 PDT 2010


Hi Larry,

From my experience, if the application is doing I/O and the server evicts
the node the application is running on, this will definitely result in an EIO
error being sent to the application, hence the "input/output error" message in
the application's standard output.
In the case of my cluster the eviction always happens with particular
applications, and this behaviour is very reproducible. I have checked the
cluster network but it doesn't seem to have any congestion at the time of
eviction. This problem started after upgrading from 1.6.6 to 1.8.3.
Currently, for the applications affected by this problem, we work around it
by using the compute nodes' local disks, but this is not ideal and hopefully
we will see some progress on this case soon.

Best regards,

Wojciech

On 23 July 2010 15:41, Larry <tsrjzq at gmail.com> wrote:

> There are many kinds of reasons for a server to evict a client: maybe a
> network error, maybe a ptlrpcd bug. But in my experience, the only
> time I see the I/O error is when running namd on a Lustre filesystem.
> I sometimes see other "evict" events, but none of them
> results in an I/O error. So besides the "evict client" event, there may be
> something else causing the "I/O error".
>
> On Fri, Jul 23, 2010 at 6:54 PM, Wojciech Turek <wjt27 at cam.ac.uk> wrote:
> > There is a similar thread on this mailing list:
> > http://groups.google.com/group/lustre-discuss-list/browse_thread/thread/afe24159554cd3ff/8b37bababf848123?lnk=gst&q=I%2FO+error+on+clients#
> > Also there is a bug open which reports similar problem:
> > https://bugzilla.lustre.org/show_bug.cgi?id=23190
> >
> >
> >
> > On 23 July 2010 10:02, Larry <tsrjzq at gmail.com> wrote:
> >>
> >> We sometimes have the same problem when running namd on Lustre; the
> >> console log suggests a file lock expired, but I don't know why.
> >>
> >> On Fri, Jul 23, 2010 at 8:12 AM, Wojciech Turek <wjt27 at cam.ac.uk> wrote:
> >> > Hi Richard,
> >> >
> >> > If the cause of the I/O errors is Lustre, there will be some message
> >> > in the logs. I am seeing a similar problem with some applications that
> >> > run on our cluster. The symptoms are always the same: just before the
> >> > application crashes with an I/O error, the node gets evicted with a
> >> > message like this:
> >> >  LustreError: 167-0: This client was evicted by ddn_data-OST000f; in
> >> > progress operations using this service will fail.
> >> >
> >> > The OSS that mounts the OST from the above message has the following
> >> > line in its log:
> >> > LustreError: 0:0:(ldlm_lockd.c:305:waiting_locks_callback()) ### lock
> >> > callback timer expired after 101s: evicting client at 10.143.5.9@tcp
> >> > ns: filter-ddn_data-OST000f_UUID lock: ffff81021a84ba00/0x744b1dd4481e38b2
> >> > lrc: 3/0,0 mode: PR/PR res: 34959884/0 rrc: 2 type: EXT
> >> > [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20
> >> > remote: 0x1d34b900a905375d expref: 9 pid: 1506 timeout 8374258376
> >> >
> >> > Can you please check your logs for similar messages?
> >> >
> >> > Best regards
> >> >
> >> > Wojciech
> >> >
> >> > On 22 July 2010 23:43, Andreas Dilger <andreas.dilger at oracle.com> wrote:
> >> >>
> >> >> On 2010-07-22, at 14:59, Richard Lefebvre wrote:
> >> >> > I have a problem with the scalable molecular dynamics software
> >> >> > NAMD. It writes restart files once in a while, but sometimes the
> >> >> > binary write crashes. When it crashes is not constant; the only
> >> >> > constant thing is that it happens when it writes to our Lustre file
> >> >> > system. When it writes to anything else, it is fine. I can't seem
> >> >> > to find any errors in /var/log/messages. Has anyone had any
> >> >> > problems with NAMD?
> >> >>
> >> >> Rarely has anyone complained about Lustre not providing error
> >> >> messages when there is a problem, so if there is nothing in
> >> >> /var/log/messages on either the client or the server then it is hard
> >> >> to know whether it is a Lustre problem or not...
> >> >>
> >> >> If possible, you could try running the application under strace
> >> >> (limited to the I/O calls, or it would produce far too much data) to
> >> >> see which system call the error is coming from.
> >> >>
> >> >> Cheers, Andreas
> >> >> --
> >> >> Andreas Dilger
> >> >> Lustre Technical Lead
> >> >> Oracle Corporation Canada Inc.
> >> >>
> >> >> _______________________________________________
> >> >> Lustre-discuss mailing list
> >> >> Lustre-discuss at lists.lustre.org
> >> >> http://lists.lustre.org/mailman/listinfo/lustre-discuss
> >> >
> >> >
> >> >
> >> >
> >> >
> >
> >
> >
> >
> >
> >
>



-- 
Wojciech Turek

Assistant System Manager

