[Lustre-discuss] Problems with NAT and/or version upgrade
Jeff Darcy
jeffd at sicortex.com
Mon Feb 11 06:15:42 PST 2008
In our lab, we have several 108-node and 972-node systems, and we're
using a small (four-node Opteron 2.6.16-27-0.9_lustre-1.6.0.1smp) Lustre
server setup to provide some shared space for applications and such.
Because there are so many client subnets involved and we don't want the
server routing tables to get out of control, we use NAT for most of
these. I did notice that the server seems to know the "private"
internal addresses of the clients as part of their NIDs, but it seemed
to work. Lately, though, especially since we upgraded clients from
Linux 2.6.15 to 2.6.18 and from Lustre 1.5.99beta to 1.6.3, we've been
seeing some instability on the servers. The MDS has crashed repeatedly,
requiring manual cleanup of last_rcvd to recover. In about the same
time period, clients have also started complaining about "Cannot send
after transport endpoint shutdown (143)" and generally acting flaky.
Has anybody else seen anything similar, either as part of a version
upgrade or related to using NAT?