[Lustre-discuss] Ref.: Re: Ref.: Re: [ROMIOReq #940] a new Lustre ADIO driver

pascal.deveze at bull.net pascal.deveze at bull.net
Fri Jun 5 08:34:38 PDT 2009


Rob,

> i_noncontig and noncontig require fcntl() locks, which Lustre supports
> only if you mount with a special mount option (I don't remember what
> that is). Was that the cause of the 'abort'? (It should be pretty
> clear from the error messages.)

mount | grep lustre:
XX.XX.XX.XX at tcp:/romio on /mnt/romio type lustre (rw,user_xattr,acl,flock)
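
For reference, here is a minimal sketch of how one could check that fcntl() byte-range locks actually work on a given mount point, independently of the ROMIO tests. It is written in Python for brevity; `fcntl.lockf` is Python's wrapper around the fcntl() locking calls. The helper name `can_fcntl_lock` is hypothetical and not part of ROMIO or its test suite, and the errno values mentioned in the comment are an assumption about how an unsupported mount would report the failure.

```python
import errno
import fcntl
import os
import tempfile


def can_fcntl_lock(directory):
    """Return True if an exclusive fcntl() byte-range lock can be taken
    on a scratch file created in `directory` (hypothetical helper)."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        try:
            # Non-blocking exclusive lock on the first byte of the file.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB, 1, 0, os.SEEK_SET)
        except OSError as exc:
            # Assumption: a file system mounted without lock support
            # reports an error such as ENOSYS or ENOLCK here.
            print("fcntl lock failed:", errno.errorcode.get(exc.errno, exc.errno))
            return False
        # Release the lock again before cleaning up.
        fcntl.lockf(fd, fcntl.LOCK_UN, 1, 0, os.SEEK_SET)
        return True
    finally:
        os.close(fd)
        os.unlink(path)
```

Calling `can_fcntl_lock("/mnt/romio")` on a client would then show whether the lock request itself is being rejected, which would separate a mount-option problem from a problem in the driver.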

Error message (same for i_noncontig and noncontig):
rank 0 in job 180  inti12_52233   caused collective abort of all ranks
  exit status of rank 0: killed by signal 9

>
> If coll_test passed, that's very good progress.
>
> Can you tell me more about the noncontig_coll and noncontig_coll2 test
> failures?

noncontig_coll: the data in the file is wrong
Process 1: buf 1 is 0, should be 5001
Process 1: buf 3 is 0, should be 5003
Process 1: buf 5 is 0, should be 5005
Process 1: buf 7 is 0, should be 5007
Process 1: buf 9 is 0, should be 5009
Process 1: buf 11 is 0, should be 5011
Process 1: buf 13 is 0, should be 5013
Process 1: buf 15 is 0, should be 5015
Process 1: buf 17 is 0, should be 5017
Process 1: buf 19 is 0, should be 5019
Process 1: buf 21 is 0, should be 5021
Process 1: buf 23 is 0, should be 5023
.............
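
A small sketch for summarizing mismatch lines of this form, in case it helps when comparing runs. It only parses the text of the messages; the names `PATTERN` and `summarize` are hypothetical, and the sample holds just the first three lines shown above. For those lines, every failing index is odd, every value read is 0, and the expected value is 5000 plus the index.

```python
import re

# The first three mismatch lines reported above, copied verbatim.
LOG = """\
Process 1: buf 1 is 0, should be 5001
Process 1: buf 3 is 0, should be 5003
Process 1: buf 5 is 0, should be 5005
"""

# One mismatch line: rank, buffer index, value read, value expected.
PATTERN = re.compile(r"Process (\d+): buf (\d+) is (-?\d+), should be (-?\d+)")


def summarize(log):
    """Return (rank, index, got, expected) tuples for every mismatch line."""
    return [tuple(int(g) for g in m.groups()) for m in PATTERN.finditer(log)]


rows = summarize(LOG)
# Simple checks on the pattern of failures in the sample.
all_odd = all(index % 2 == 1 for _, index, _, _ in rows)
all_zero = all(got == 0 for _, _, got, _ in rows)
offset_ok = all(expected == 5000 + index for _, index, _, expected in rows)
print(len(rows), all_odd, all_zero, offset_ok)
```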

noncontig_coll2:
Fatal error in PMPI_Barrier: Other MPI error, error stack:
PMPI_Barrier(476)..................: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier(82)...................:
MPIC_Sendrecv(164).................:
MPIC_Wait(405).....................:
MPIDI_CH3I_Progress(149)...........:
MPID_nem_mpich2_blocking_recv(1074):
MPID_nem_tcp_connpoll(1667)........:
state_commrdy_handler(1517)........:
MPID_nem_tcp_recv_handler(1413)....: socket closed
Fatal error in PMPI_Barrier: Other MPI error, error stack:
PMPI_Barrier(476)............: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier(82).............:
MPIC_Sendrecv(158)...........:
MPID_Isend(113)..............: failure occurred while attempting to send an eager message
MPIDI_CH3_iSend(29)..........:
MPID_nem_tcp_iSendContig(371): writev to socket failed - Broken pipe
rank 0 in job 183  inti12_52233   caused collective abort of all ranks

I'll investigate more next week.

Pascal





