[Lustre-discuss] Problems with NAT and/or version upgrade

Jeff Darcy jeffd at sicortex.com
Mon Feb 11 06:15:42 PST 2008


In our lab, we have several 108-node and 972-node systems, and we're 
using a small (four-node Opteron 2.6.16-27-0.9_lustre-1.6.0.1smp) Lustre 
server setup to provide some shared space for applications and such.  
Because there are so many client subnets involved and we don't want the 
server routing tables to get out of control, we use NAT for most of 
these.  I did notice that the server seems to know the "private" 
internal addresses of the clients as part of their NIDs, but it seemed 
to work.  Lately, though, especially since we upgraded clients from 
Linux 2.6.15 to 2.6.18 and from Lustre 1.5.99beta to 1.6.3, we've been 
seeing some instability on the servers.  The MDS has crashed repeatedly, 
requiring manual cleanup of last_rcvd to recover.  In about the same 
time period, clients have also started complaining about "Cannot send 
after transport endpoint shutdown (143)" and generally acting flaky.  
Has anybody else seen anything similar, either as part of a version 
upgrade or related to using NAT?
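
To make the NID observation above concrete, here is a minimal Python sketch (not Lustre code, and not a diagnosis) of the mismatch I mean: the address a client embeds in its NID versus the source address the server actually sees after NAT. The function names and all addresses are hypothetical and for illustration only; the only thing taken from Lustre is the "address@network" NID form.

```python
import ipaddress


def parse_nid(nid):
    """Split an 'address@network' NID string into (ip address, network name)."""
    address, sep, network = nid.partition("@")
    if not sep or not address or not network:
        raise ValueError(f"not a NID: {nid!r}")
    return ipaddress.ip_address(address), network


def nid_matches_source(nid, observed_source):
    """Return True if the address in the NID equals the source IP the server sees."""
    address, _network = parse_nid(nid)
    return address == ipaddress.ip_address(observed_source)


# Hypothetical (NID, source address seen by the server) pairs.
clients = [
    ("10.1.0.17@tcp0", "192.168.50.1"),       # client behind NAT
    ("192.168.50.20@tcp0", "192.168.50.20"),  # directly routed client
]

for nid, source in clients:
    status = "ok" if nid_matches_source(nid, source) else "NID/source mismatch"
    print(f"{nid} seen from {source}: {status}")
```

In the first pair the server would hold a NID carrying a private address it cannot route back to, which is the situation I am describing; whether that is actually behind the crashes is exactly what I am asking.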


