[Lustre-discuss] I/O errors with NAMD

Wojciech Turek wjt27 at cam.ac.uk
Fri Jul 23 03:54:25 PDT 2010


There is a similar thread on this mailing list:
http://groups.google.com/group/lustre-discuss-list/browse_thread/thread/afe24159554cd3ff/8b37bababf848123?lnk=gst&q=I%2FO+error+on+clients#
There is also an open bug that reports a similar problem:
https://bugzilla.lustre.org/show_bug.cgi?id=23190
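
For anyone checking their own logs for the eviction messages quoted further
down in this thread, here is a minimal sketch in Python. It is only an
illustration: the function name find_evictions and both regular expressions
are my own, written against the two messages quoted below, and other Lustre
versions may word these messages differently. The server-side pattern accepts
the client NID written with "@" or with " at ", which is how it appears in
this archive.

```python
import re

# Client side, e.g. "This client was evicted by ddn_data-OST000f; ..."
CLIENT_EVICTED = re.compile(r"This client was evicted by (?P<target>[^;\s]+)")

# Server side, e.g. "lock callback timer expired after 101s: evicting client at <NID>"
# The NID separator may be "@" or " at " (the latter is the list archive's rendering).
SERVER_EVICTING = re.compile(
    r"lock callback timer expired after (?P<secs>\d+)s: "
    r"evicting client at (?P<addr>\S+?)(?:@| at )(?P<net>\w+)"
)


def find_evictions(lines):
    """Return a list of (kind, detail) tuples for eviction-related log lines.

    `lines` is any iterable of strings, for example an open log file.
    Assumes each log message sits on a single line, as in /var/log/messages.
    """
    hits = []
    for line in lines:
        m = CLIENT_EVICTED.search(line)
        if m:
            hits.append(("client-evicted", m.group("target")))
            continue
        m = SERVER_EVICTING.search(line)
        if m:
            detail = f"{m.group('addr')}@{m.group('net')} after {m.group('secs')}s"
            hits.append(("server-evicting", detail))
    return hits
```

It could be run on a client or an OSS with something like
find_evictions(open("/var/log/messages", errors="replace")); matching the
client-side target name against the server-side entries should show whether
an eviction lines up with the time of the application's I/O error.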



On 23 July 2010 10:02, Larry <tsrjzq at gmail.com> wrote:

> We sometimes see the same problem when running NAMD on Lustre. The
> console log suggests that a file lock expired, but I don't know why.
>
> On Fri, Jul 23, 2010 at 8:12 AM, Wojciech Turek <wjt27 at cam.ac.uk> wrote:
> > Hi Richard,
> >
> > If the cause of the I/O errors is Lustre, there will be some message in the
> > logs. I am seeing a similar problem with some applications that run on our
> > cluster. The symptoms are always the same: just before the application
> > crashes with an I/O error, the node gets evicted with a message like this:
> >  LustreError: 167-0: This client was evicted by ddn_data-OST000f; in progress operations using this service will fail.
> >
> > The OSS that mounts the OST from the above message has the following line in
> > its log:
> > LustreError: 0:0:(ldlm_lockd.c:305:waiting_locks_callback()) ### lock callback timer expired after 101s: evicting client at 10.143.5.9@tcp ns: filter-ddn_data-OST000f_UUID lock: ffff81021a84ba00/0x744b1dd4481e38b2 lrc: 3/0,0 mode: PR/PR res: 34959884/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20 remote: 0x1d34b900a905375d expref: 9 pid: 1506 timeout 8374258376
> >
> > Can you please check your logs for similar messages?
> >
> > Best regards
> >
> > Wojciech
> >
> > On 22 July 2010 23:43, Andreas Dilger <andreas.dilger at oracle.com> wrote:
> >>
> >> On 2010-07-22, at 14:59, Richard Lefebvre wrote:
> >> > I have a problem with the scalable molecular dynamics software NAMD. It
> >> > writes restart files once in a while, but sometimes the binary write
> >> > crashes. When it crashes is not consistent. The only constant is that
> >> > it happens when it writes to our Lustre file system. When it writes to
> >> > something else, it is fine. I can't seem to find any errors in any of the
> >> > /var/log/messages files. Has anyone had any problems with NAMD?
> >>
> >> Rarely has anyone complained about Lustre not providing error messages
> >> when there is a problem, so if there is nothing in /var/log/messages on
> >> either the client or the server then it is hard to know whether it is a
> >> Lustre problem or not...
> >>
> >> If possible, you could try running the application under strace (limited
> >> to the IO calls, or it would be much too much data) to see which system
> >> call the error is coming from.
> >>
> >> Cheers, Andreas
> >> --
> >> Andreas Dilger
> >> Lustre Technical Lead
> >> Oracle Corporation Canada Inc.
> >>
> >> _______________________________________________
> >> Lustre-discuss mailing list
> >> Lustre-discuss at lists.lustre.org
> >> http://lists.lustre.org/mailman/listinfo/lustre-discuss