[Lustre-discuss] I/O errors with NAMD

Larry tsrjzq at gmail.com
Fri Jul 23 02:02:46 PDT 2010


We sometimes have the same problem when running NAMD on Lustre. The
console log suggests that a file lock expired, but I don't know why.
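
For anyone who wants to check their own logs for the two messages quoted below, here is a minimal sketch. It assumes each log entry sits on a single line, as it would in /var/log/messages, and the regular expressions are modelled only on the two LustreError lines in this thread, so other Lustre versions may word them differently. The function name find_evictions is made up for illustration.

```python
import re

# Patterns modelled on the two messages quoted in this thread (assumption:
# other Lustre releases may phrase these lines differently).
CLIENT_EVICTED = re.compile(
    r"LustreError: 167-0: This client was evicted by (\S+?);"
)
SERVER_EVICTING = re.compile(
    r"lock callback timer expired after (\d+)s: evicting client at (\S+)"
)


def find_evictions(lines):
    """Return a list of (kind, detail) tuples for eviction-related lines."""
    hits = []
    for line in lines:
        m = CLIENT_EVICTED.search(line)
        if m:
            # Client side: detail is the name of the evicting target.
            hits.append(("client_evicted", m.group(1)))
            continue
        m = SERVER_EVICTING.search(line)
        if m:
            # Server side: detail is (timer in seconds, client NID).
            hits.append(("server_evicting", (int(m.group(1)), m.group(2))))
    return hits
```

To use it, pass an open log file, for example `find_evictions(open("/var/log/messages", errors="replace"))`, on both the client and the OSS, then compare timestamps of any hits with the time of the NAMD crash.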

On Fri, Jul 23, 2010 at 8:12 AM, Wojciech Turek <wjt27 at cam.ac.uk> wrote:
> Hi Richard,
>
> If the cause of the I/O errors is Lustre, there will be some message in the
> logs. I am seeing a similar problem with some applications that run on our
> cluster. The symptoms are always the same: just before the application crashes
> with an I/O error, the node gets evicted with a message like this:
>  LustreError: 167-0: This client was evicted by ddn_data-OST000f; in
> progress operations using this service will fail.
>
> The OSS that mounts the OST from the above message has the following line in
> its log:
> LustreError: 0:0:(ldlm_lockd.c:305:waiting_locks_callback()) ### lock
> callback timer expired after 101s: evicting client at 10.143.5.9@tcp  ns:
> filter-ddn_data-OST000f_UUID lock: ffff81021a84ba00/0x744b1dd4481e38b2
> lrc: 3/0,0 mode: PR/PR res: 34959884/0 rrc: 2 type: EXT
> [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20 remote:
> 0x1d34b900a905375d expref: 9 pid: 1506 timeout 8374258376
>
> Can you please check your logs for similar messages?
>
> Best regards
>
> Wojciech
>
> On 22 July 2010 23:43, Andreas Dilger <andreas.dilger at oracle.com> wrote:
>>
>> On 2010-07-22, at 14:59, Richard Lefebvre wrote:
>> > I have a problem with the scalable molecular dynamics software NAMD. It
>> > writes restart files once in a while, but sometimes the binary write
>> > crashes. When it crashes is not consistent. The only constant is that
>> > it happens when it writes to our Lustre file system. When it writes to
>> > something else, it is fine. I can't seem to find any errors in any of the
>> > /var/log/messages files. Has anyone had any problems with NAMD?
>>
>> Rarely has anyone complained about Lustre not providing error messages
>> when there is a problem, so if there is nothing in /var/log/messages on
>> either the client or the server then it is hard to know whether it is a
>> Lustre problem or not...
>>
>> If possible, you could try running the application under strace (limited
>> to the IO calls, or it would be much too much data) to see which system call
>> the error is coming from.
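
A rough sketch of how the strace suggestion could be followed up: capture a trace limited to I/O calls, for example with `strace -f -e trace=open,read,write,close,fsync -o namd.strace <command>` (the output file name namd.strace and the exact list of calls are only an illustration), then count the calls that returned -1. The parser below assumes plain strace output without timestamp options, and the function name failed_syscalls is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as:
#   write(5, "..."..., 4096) = -1 EIO (Input/output error)
# with an optional pid prefix ("1506 " or "[pid  1506] ") as added by strace -f.
FAILED_CALL = re.compile(
    r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?"    # optional pid prefix
    r"(?P<call>\w+)\(.*\)\s+=\s+-1\s+"  # syscall name, arguments, -1 return
    r"(?P<errno>E[A-Z0-9]+)\b"          # symbolic errno, e.g. EIO
)


def failed_syscalls(lines):
    """Count failed system calls, keyed by (syscall name, errno name)."""
    counts = Counter()
    for line in lines:
        m = FAILED_CALL.match(line)
        if m:
            counts[(m.group("call"), m.group("errno"))] += 1
    return counts
```

Running `failed_syscalls(open("namd.strace"))` and looking at the most common entries should show which system call the I/O error comes from, which can then be matched against any eviction messages in the logs.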
>>
>> Cheers, Andreas
>> --
>> Andreas Dilger
>> Lustre Technical Lead
>> Oracle Corporation Canada Inc.
>>
>> _______________________________________________
>> Lustre-discuss mailing list
>> Lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss


