[Lustre-discuss] I/O errors with NAMD

Richard Lefebvre Richard.Lefebvre at rqchp.qc.ca
Fri Jul 23 10:53:45 PDT 2010


If I had some Lustre error, it would give me a clue, but the only error 
the users get is the following traceback from the application:

-------------------------------------------------------------------
Reason: FATAL ERROR: Error on write to binary file
restart/ABCD_les4.950000.vel: Interrupted system call

Fatal error on PE 0> FATAL ERROR: Error on write to binary file
restart/ABCD_les4.950000.vel: Interrupted system call
"


"
FINISHED WRITING RESTART COORDINATES
WRITING VELOCITIES TO RESTART FILE AT STEP 950000
FATAL ERROR: Error on write to binary file restart/ABCD_les4.950000.vel:
Interrupted system call
[0] Stack Traceback:
   [0] CmiAbort+0x5f  [0xabb81b]
   [1] _Z8NAMD_errPKc+0x9d  [0x5099e9]
   [2] _ZN6Output17write_binary_fileEPciP6Vector+0xb0  [0x916d7e]
   [3] _ZN6Output25output_restart_velocitiesEiiP6Vector+0x249  [0x918d93]
   [4] _ZN6Output8velocityEiiP6Vector+0xdb  [0x918a53]
   [5]
_ZN24CkIndex_CollectionMaster40_call_receiveVelocities_CollectVectorMsgEPvP16CollectionMaster+0x16c
  [0x51c184]
   [6] CkDeliverMessageFree+0x21  [0xa2f899]
   [7] _Z15_processHandlerPvP11CkCoreState+0x50f  [0xa2ef63]
   [8] CsdScheduleForever+0xa5  [0xabc631]
   [9] CsdScheduler+0x1c  [0xabc232]
   [10] _ZN7BackEnd7suspendEv+0xb  [0x511ddd]
   [11] _ZN9ScriptTcl7Tcl_runEPvP10Tcl_InterpiPPc+0x140  [0x971ddc]
   [12] TclInvokeStringCommand+0x91  [0xae0518]
   [13] /share/apps/namd/NAMD_2.7b2_Linux-x86_64-ibverbs/namd2 [0xb16368]
   [14] Tcl_EvalEx+0x176  [0xb169ab]
   [15] Tcl_EvalFile+0x134  [0xb0e3b4]
   [16] _ZN9ScriptTcl3runEPc+0x14  [0x9714da]
   [17] _Z18after_backend_initiPPc+0x22b  [0x50da7b]
   [18] main+0x3a  [0x50d81a]
   [19] __libc_start_main+0xf4  [0x398661d974]
   [20] _ZNSt8ios_base4InitD1Ev+0x4a  [0x508bda]
"

When I told the user to use a slower NFS file system instead, the 
problem didn't occur.

As someone else commented about file locking, the "flock" parameter was 
recently added to the Lustre mount options (for another application), 
but NAMD still has problems.
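For what it's worth, the "Interrupted system call" in the traceback is write(2) failing with EINTR, which suggests the application's write path gives up on the first interruption instead of retrying. A minimal sketch of an EINTR-safe write loop, assuming plain POSIX semantics (this is illustrative only, not NAMD's actual code; `full_write` is a hypothetical helper name):

```c
/* Sketch: write exactly `count` bytes, retrying interrupted (EINTR)
 * and short writes instead of treating them as fatal errors. */
#include <errno.h>
#include <unistd.h>
#include <fcntl.h>
#include <string.h>
#include <assert.h>

static ssize_t full_write(int fd, const void *buf, size_t count)
{
    const char *p = buf;
    size_t left = count;

    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted before any data was written: retry */
            return -1;      /* genuine I/O error */
        }
        p += n;             /* short write: advance and write the rest */
        left -= (size_t)n;
    }
    return (ssize_t)count;
}
```

An application using a loop like this would only see a real error (and could then report errno meaningfully) rather than aborting on a transient signal interruption.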

Richard


On 07/22/2010 08:12 PM, Wojciech Turek wrote:
> Hi Richard,
>
> If the cause of the I/O errors is Lustre, there will be some message in
> the logs. I am seeing a similar problem with some applications that run
> on our cluster. The symptoms are always the same: just before the
> application crashes with an I/O error, the node gets evicted with a
> message like this:
>   LustreError: 167-0: This client was evicted by ddn_data-OST000f; in
> progress operations using this service will fail.
>
> The OSS that mounts the OST from the above message has following line in
> the log:
> LustreError: 0:0:(ldlm_lockd.c:305:waiting_locks_callback()) ### lock
> callback timer expired after 101s: evicting client at 10.143.5.9 at tcp
> ns: filter-ddn_data-OST000f_UUID lock: ffff81021a84ba00/0x744b1dd44
> 81e38b2 lrc: 3/0,0 mode: PR/PR res: 34959884/0 rrc: 2 type: EXT
> [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20
> remote: 0x1d34b900a905375d expref: 9 pid: 1506 timeout 8374258376
>
> Can you please check your logs for similar messages?
>
> Best regards
>
> Wojciech
>
> On 22 July 2010 23:43, Andreas Dilger <andreas.dilger at oracle.com
> <mailto:andreas.dilger at oracle.com>> wrote:
>
>     On 2010-07-22, at 14:59, Richard Lefebvre wrote:
>      > I have a problem with the Scalable molecular dynamics software
>      > NAMD. It writes restart files once in a while, but sometimes the
>      > binary write crashes. When it crashes is not constant; the only
>      > constant is that it happens when it writes to our Lustre file
>      > system. When it writes to anything else, it is fine. I can't seem
>      > to find any errors in any of the /var/log/messages. Has anyone
>      > had any problems with NAMD?
>
>     Rarely has anyone complained about Lustre not providing error
>     messages when there is a problem, so if there is nothing in
>     /var/log/messages on either the client or the server then it is hard
>     to know whether it is a Lustre problem or not...
>
>     If possible, you could try running the application under strace
>     (limited to the I/O calls, or it would produce far too much data)
>     to see which system call the error is coming from.
>
>     Cheers, Andreas
>     --
>     Andreas Dilger
>     Lustre Technical Lead
>     Oracle Corporation Canada Inc.
>
>     _______________________________________________
>     Lustre-discuss mailing list
>     Lustre-discuss at lists.lustre.org <mailto:Lustre-discuss at lists.lustre.org>
>     http://lists.lustre.org/mailman/listinfo/lustre-discuss
>
>
>


-- 
Richard Lefebvre, Sys-admin, RQCHP, (514)343-6111 x5313    "Don't Panic"
Richard.Lefebvre at rqchp.qc.ca                                -- THGTTG
RQCHP (rqchp.ca) --------------------- Calcul Canada (computecanada.org)


