[Lustre-discuss] Lustre access locking up login nodes

Brock Palen brockp at umich.edu
Fri May 16 12:48:46 PDT 2008


I have seen this behavior a few times: under heavy I/O, Lustre simply
stops responding and dmesg shows the following:

LustreError: 3976:0:(events.c:134:client_bulk_callback()) event type 0, status -5, desc 000001012ce12000
LustreError: 11-0: an error occurred while communicating with 141.212.30.184@tcp. The mds_statfs operation failed with -107
LustreError: Skipped 1 previous similar message
Lustre: nobackup-MDT0000-mdc-00000100e9e9ac00: Connection to service nobackup-MDT0000 via nid 141.212.30.184@tcp was lost; in progress operations using this service will wait for recovery to complete.
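
For reference, the negative numbers in these messages look like negated
Linux errno values, which is the usual kernel convention. Here is a rough
Python sketch for decoding them. The helper names and the regular
expression are my own, not part of Lustre, and the errno numbering is
platform dependent, so run it on a Linux host:

```python
import errno
import os
import re

# Assumption: the codes of interest follow "status " or "failed with ",
# as in the two LustreError lines quoted above.
STATUS_RE = re.compile(r"(?:status|failed with) (-\d+)")


def statuses_in(line):
    """Return every negative status code mentioned in one log line."""
    return [int(code) for code in STATUS_RE.findall(line)]


def decode_status(status):
    """Map a negated errno value to (symbolic name, description)."""
    code = abs(status)
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


if __name__ == "__main__":
    sample = [
        "LustreError: 3976:0:(events.c:134:client_bulk_callback()) "
        "event type 0, status -5, desc 000001012ce12000",
        "LustreError: 11-0: an error occurred while communicating with "
        "141.212.30.184@tcp. The mds_statfs operation failed with -107",
    ]
    for line in sample:
        for status in statuses_in(line):
            name, text = decode_status(status)
            print(status, name, text)
```

On Linux, 5 is EIO (input/output error) and 107 is ENOTCONN (transport
endpoint is not connected), which would fit a failed bulk transfer
followed by a lost MDS connection.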


There are no network connection issues between the login nodes.
When this happens, the client does not recover until we reboot the
node.  It also happens occasionally on the compute nodes, but I see it
most often on the login hosts.

If I go to the Lustre mount and run ls, it hangs indefinitely.  In the
past Lustre usually recovered on its own after a problem like this, but
more and more often it does not, and we see these bulk errors followed
by MDS errors.
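
To check whether a mount is responding without tying up a login shell,
something like the following might help. It is only a sketch: it runs a
statvfs call in a child process and gives up after a timeout. The
function name, the timeout, and the /nobackup path are assumptions for
illustration, not taken from our setup:

```python
import subprocess
import sys

# One-line child program: ask the kernel for filesystem statistics.
PROBE = "import os, sys; os.statvfs(sys.argv[1])"


def mount_responds(path, timeout_s=10.0):
    """Return True if statvfs(path) completes within timeout_s seconds."""
    proc = subprocess.Popen(
        [sys.executable, "-c", PROBE, path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        # Exit code 0 means the statvfs call returned without error.
        return proc.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # Send SIGKILL but do not wait again: a child stuck in
        # uninterruptible I/O may not exit until the mount recovers.
        proc.kill()
        return False


if __name__ == "__main__":
    # "/nobackup" is a hypothetical mount point; substitute the real one.
    print(mount_responds("/nobackup", timeout_s=10.0))
```

The probe runs in a separate process on purpose, so that a hung mount
blocks only the child and the caller gets an answer after the timeout.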

We are using Lustre 1.6.x.


Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985





