[Lustre-discuss] I/O errors with NAMD

Wojciech Turek wjt27 at cam.ac.uk
Thu Jul 22 17:12:38 PDT 2010


Hi Richard,

If Lustre is the cause of the I/O errors, there will be some message in the
logs. I am seeing a similar problem with some applications that run on our
cluster. The symptoms are always the same: just before the application crashes
with an I/O error, the node gets evicted with a message like this:
 LustreError: 167-0: This client was evicted by ddn_data-OST000f; in
progress operations using this service will fail.

The OSS that mounts the OST named in the above message has the following line
in its log:
LustreError: 0:0:(ldlm_lockd.c:305:waiting_locks_callback()) ### lock
callback timer expired after 101s: evicting client at 10.143.5.9@tcp ns:
filter-ddn_data-OST000f_UUID lock: ffff81021a84ba00/0x744b1dd4481e38b2
lrc: 3/0,0 mode: PR/PR res: 34959884/0 rrc: 2 type: EXT
[0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20 remote:
0x1d34b900a905375d expref: 9 pid: 1506 timeout 8374258376

Can you please check your logs for similar messages?
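
To make the log check concrete, here is a minimal Python sketch that scans
syslog-style lines for the two kinds of message shown above. The patterns are
derived only from the two sample lines in this message, the function name
find_evictions is made up for illustration, and other Lustre versions may word
these messages differently, so treat it as a starting point rather than a
complete checker.

```python
import re

# Patterns derived from the two sample messages above (assumption: other
# Lustre versions may phrase them differently).
CLIENT_EVICTED = re.compile(
    r"LustreError: 167-0: This client was evicted by (?P<target>[^;\s]+)"
)
# Accepts both "10.143.5.9@tcp" and the archive-style "10.143.5.9 at tcp".
SERVER_EVICTING = re.compile(
    r"lock callback timer expired after (?P<secs>\d+)s: "
    r"evicting client at (?P<nid>\S+?)(?:@| at )(?P<net>\w+)"
)


def find_evictions(lines):
    """Return a list of (side, details) tuples for eviction messages."""
    hits = []
    for line in lines:
        m = CLIENT_EVICTED.search(line)
        if m:
            # Seen on the client that was evicted.
            hits.append(("client", {"target": m.group("target")}))
            continue
        m = SERVER_EVICTING.search(line)
        if m:
            # Seen on the OSS that evicted the client.
            hits.append(("server", {
                "nid": m.group("nid"),
                "net": m.group("net"),
                "timeout_s": int(m.group("secs")),
            }))
    return hits
```

On a client or OSS, something like
find_evictions(open("/var/log/messages", errors="replace")) would list any
matches. It assumes each message sits on a single line in the log file, not
wrapped as in this mail.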

Best regards

Wojciech

On 22 July 2010 23:43, Andreas Dilger <andreas.dilger at oracle.com> wrote:

> On 2010-07-22, at 14:59, Richard Lefebvre wrote:
> > I have a problem with the scalable molecular dynamics software NAMD. It
> > writes restart files once in a while, but sometimes the binary write
> > crashes. The timing of the crash is not constant. The only constant is
> > that it happens when it writes to our Lustre file system. When it writes
> > to something else, it is fine. I can't seem to find any errors in any of
> > the /var/log/messages files. Has anyone had any problems with NAMD?
>
> Rarely has anyone complained about Lustre not providing error messages when
> there is a problem, so if there is nothing in /var/log/messages on either
> the client or the server then it is hard to know whether it is a Lustre
> problem or not...
>
> If possible, you could try running the application under strace (limited to
> the IO calls, or it would be much too much data) to see which system call
> the error is coming from.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Technical Lead
> Oracle Corporation Canada Inc.
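
Regarding the strace suggestion quoted above, here is a minimal Python sketch
for picking the failed calls out of a saved trace. It assumes the trace was
written to a file, e.g. with "strace -f -e trace=write,fsync,close -o
namd.trace <namd command>", and that failed calls appear in the usual strace
form "write(3, ..., 4096) = -1 EIO (Input/output error)", optionally prefixed
by a PID. The file name namd.trace and the function name failed_calls are
hypothetical.

```python
import re

# Matches a failed call in typical strace output, for example:
#   12345 write(3, "abc"..., 4096) = -1 EIO (Input/output error)
# The leading PID is optional (it appears when following child processes).
FAILED_CALL = re.compile(
    r"^(?:\[pid\s+(?P<pid1>\d+)\]\s+|(?P<pid2>\d+)\s+)?"
    r"(?P<syscall>\w+)\((?P<args>.*)\)\s+=\s+-1\s+"
    r"(?P<errno>E[A-Z0-9]+)\s+\((?P<msg>[^)]*)\)"
)


def failed_calls(lines):
    """Yield (pid, syscall, errno, message) for each failed call in lines."""
    for line in lines:
        m = FAILED_CALL.match(line.strip())
        if m:
            # pid is None when the line carries no PID prefix.
            pid = m.group("pid1") or m.group("pid2")
            yield (pid, m.group("syscall"), m.group("errno"), m.group("msg"))
```

Running list(failed_calls(open("namd.trace"))) would then show which system
call returns the error and with which errno.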