[Lustre-discuss] Lustre locking up on login/interactive nodes

Brian J. Murrell Brian.Murrell at Sun.COM
Mon Jul 21 08:51:55 PDT 2008


On Mon, 2008-07-21 at 11:43 -0400, Brock Palen wrote:
> Every so often Lustre locks up. It will recover eventually. The
> processes show up in state 'D' (uninterruptible I/O wait). In this case
> it was 'ar' making an archive.
> 
> Dmesg then shows:

Syslog is usually a better place to get these messages from, as it
timestamps them and so gives some context as to when they occurred.

> Lustre: nobackup-MDT0000-mdc-00000101fc467800: Connection to service  
> nobackup-MDT0000 via nid 141.212.30.184@tcp was lost; in progress
> operations using this service will wait for recovery to complete.
> LustreError: 167-0: This client was evicted by nobackup-MDT0000; in  
> progress operations using this service will fail.
> LustreError: 17575:0:(client.c:519:ptlrpc_import_delay_req()) @@@  
> IMP_INVALID  req@0000010189e2f400 x912452/t0
> o101->nobackup-MDT0000_UUID@141.212.30.184@tcp:12 lens 488/768 ref 1
> fl Rpc:P/0/0 rc 0/0
> LustreError: 17575:0:(mdc_locks.c:423:mdc_finish_enqueue())  
> ldlm_cli_enqueue: -108
> LustreError: 27076:0:(client.c:519:ptlrpc_import_delay_req()) @@@  
> IMP_INVALID  req@00000101ed528a00 x912464/t0
> o101->nobackup-MDT0000_UUID@141.212.30.184@tcp:12 lens 440/768 ref 1
> fl Rpc:/0/0 rc 0/0
> LustreError: 27076:0:(mdc_locks.c:423:mdc_finish_enqueue())  
> ldlm_cli_enqueue: -108
> LustreError: 27489:0:(file.c:97:ll_close_inode_openhandle()) inode  
> 12653753 mdc close failed: rc = -108
> LustreError: 27489:0:(file.c:97:ll_close_inode_openhandle()) inode  
> 12195682 mdc close failed: rc = -108
> LustreError: 27489:0:(file.c:97:ll_close_inode_openhandle()) Skipped  
> 46 previous similar messages
> Lustre: nobackup-MDT0000-mdc-00000101fc467800: Connection restored to
> service nobackup-MDT0000 using nid 141.212.30.184@tcp.
> LustreError: 11-0: an error occurred while communicating with
> 141.212.30.184@tcp. The mds_close operation failed with -116
> LustreError: 11-0: an error occurred while communicating with
> 141.212.30.184@tcp. The mds_close operation failed with -116
> LustreError: 26930:0:(file.c:97:ll_close_inode_openhandle()) inode  
> 11441446 mdc close failed: rc = -116
> LustreError: 26930:0:(file.c:97:ll_close_inode_openhandle()) Skipped  
> 113 previous similar messages

This looks like a pretty standard eviction.  Probably the most
interesting information is on the node that did the evicting (the MDS
here).  If its log doesn't contain much other than a "have not heard
from", then you have a node that is either disappearing from the network
or getting wedged enough to stop sending pings (or any other traffic in
lieu of them).
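
As an aside, the negative numbers in the quoted client log read as
negated Linux errno values.  Below is a rough sketch of decoding them,
assuming the generic Linux errno numbering (108 = ESHUTDOWN,
116 = ESTALE).  The LINUX_ERRNO table, the RC_PATTERN regex and the
decode_rc helper are made up for illustration and are not part of
Lustre; the sample lines are the quoted messages with the mail wrapping
undone.

```python
import re

# Assumption: generic Linux errno numbering (asm-generic/errno.h).
# Only the two values that appear in the quoted log are listed.
LINUX_ERRNO = {
    108: ("ESHUTDOWN", "Cannot send after transport endpoint shutdown"),
    116: ("ESTALE", "Stale file handle"),
}

# Hypothetical pattern covering the three phrasings in the quoted log:
# "rc = -108", "ldlm_cli_enqueue: -108" and "failed with -116".
RC_PATTERN = re.compile(r"(?:rc = |ldlm_cli_enqueue: |failed with )-(\d+)")


def decode_rc(text):
    """Return sorted, unique (number, name, description) tuples found in text."""
    found = {int(m.group(1)) for m in RC_PATTERN.finditer(text)}
    unknown = ("UNKNOWN", "not in table")
    return [(n, *LINUX_ERRNO.get(n, unknown)) for n in sorted(found)]


SAMPLE = """\
LustreError: 17575:0:(mdc_locks.c:423:mdc_finish_enqueue()) ldlm_cli_enqueue: -108
LustreError: 27489:0:(file.c:97:ll_close_inode_openhandle()) inode 12653753 mdc close failed: rc = -108
LustreError: 11-0: an error occurred while communicating with 141.212.30.184@tcp. The mds_close operation failed with -116
"""

for number, name, description in decode_rc(SAMPLE):
    print(f"-{number}: {name} ({description})")
```

Read that way, the -108 errors logged while the connection was down are
consistent with requests being refused on an invalidated import, and
the -116 errors after the reconnect are consistent with closes on
handles the server no longer recognises.  That is an interpretation of
the errno names, not something the log states outright.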

b.
