[Lustre-discuss] failover problems

John White jwhite at lbl.gov
Fri Dec 11 16:48:06 PST 2009


Please disregard.  I just realized the difference between ':' and ',' as separators when running these commands.
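
For readers of the archive, here is a minimal sketch of why the separator matters. It assumes the convention given in the mount.lustre syntax (mgsnode[:mgsnode], with each mgsnode being mgsnid[,mgsnid]): ',' joins several NIDs that belong to one node, while ':' separates different nodes. The helper group_nids and the with_colon string are hypothetical illustrations, not the exact commands that were finally run.

```python
def group_nids(spec):
    """Split a NID spec into nodes (':'-separated), each a list of NIDs (','-separated).

    Assumption: ',' = more NIDs of the same node, ':' = a different node.
    """
    return [
        [nid.strip() for nid in node.split(",") if nid.strip()]
        for node in spec.split(":")
        if node.strip()
    ]


# The --failnode value from the second mkfs.lustre command in the quoted message.
as_written = "10.4.200.0@o2ib,10.4.200.1@o2ib,10.4.200.0@tcp0,10.4.200.1@tcp0"

# Hypothetical variant that uses ':' between the two MDS hosts.
with_colon = "10.4.200.0@o2ib,10.4.200.0@tcp0:10.4.200.1@o2ib,10.4.200.1@tcp0"

for label, spec in (("as written", as_written), ("with colon", with_colon)):
    nodes = group_nids(spec)
    print(f"{label}: {len(nodes)} node(s)")
    for i, nids in enumerate(nodes):
        print(f"  node {i}: {nids}")
```

Under that assumption, the comma-only string reads as one node with four NIDs, which would be consistent with the client never trying a second MDS.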

On Dec 11, 2009, at 11:42 AM, John White wrote:

> So we have a cluster with an MGT and 2 MDTs.  Each has a NID on o2ib and tcp, and each is dual-connected to 2 MDSs.  We created the MGT and MDTs with the following commands:
> mkfs.lustre --mgs --reformat --failnode=10.4.200.0@o2ib,10.4.200.1@o2ib --failnode=10.4.200.0@tcp0,10.4.200.1@tcp0 /dev/dm-0
> mkfs.lustre --mdt --mgsnode=10.4.200.0@o2ib --fsname=lrc --reformat --failnode=10.4.200.0@o2ib,10.4.200.1@o2ib,10.4.200.0@tcp0,10.4.200.1@tcp0 /dev/dm-1
> mkfs.lustre --mdt --mgsnode=10.4.200.0@o2ib --fsname=nano --reformat --failnode=10.4.200.1@o2ib,10.4.200.0@o2ib,10.4.200.1@tcp0,10.4.200.0@tcp0 /dev/dm-2
> 
> The host cluster starts and mounts the LUNs just fine.  I mount TCP-connected clients with both MGSs specified.  The client fails over to the secondary MDS for the MGT just fine but keeps failing on the MDT.  It just keeps trying the old MDS NIDs:
> Lustre: Changing connection for lrc-MDT0000-mdc-ffff8101d57ad400 to 10.4.200.0@o2ib/10.0.200.0@tcp
> 
> Ideas?
> ----------------
> John White
> High Performance Computing Services (HPCS)
> (510) 486-7307
> One Cyclotron Rd, MS: 50B-3209C
> Lawrence Berkeley National Lab
> Berkeley, CA 94720

----------------
John White
High Performance Computing Services (HPCS)
(510) 486-7307
One Cyclotron Rd, MS: 50B-3209C
Lawrence Berkeley National Lab
Berkeley, CA 94720