[Lustre-discuss] Starting a new MGS/MDS

Andreas Dilger adilger@sun.com
Fri Sep 5 12:38:51 PDT 2008


On Sep 05, 2008  11:11 -0400, Aaron Knister wrote:
> Does the new MDS actually have an MGS running? FYI, you only need
> one MGS per Lustre setup. In the commands you issued, it doesn't look
> like you actually set up an MGS on the host "mds2". Can you run an
> "lctl dl" on mds2 and send the output?

There are tradeoffs between having a single MGS for multiple filesystems,
and having one MGS per filesystem (assuming different MDS nodes).  In
general, there isn't much benefit to sharing an MGS between multiple MDS
nodes, and the drawback is that it is a single point of failure, so you
may as well have one per MDS.
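
For reference, "lctl dl" lists the Lustre devices that are actually set up
on a node; on a host running a combined MGS/MDS you would expect an "mgs"
device near the top of the list.  The listing below is only an illustrative
sketch (device numbers, UUIDs and reference counts are placeholders and will
differ on a real system):

  [root@mds1 ~]# lctl dl
    0 UP mgs MGS MGS 9
    1 UP mgc MGC172.18.0.9@o2ib 1a2b3c4d-5e6f-7a8b-9c0d-1e2f3a4b5c6d 5
    2 UP mdt MDS MDS_uuid 3
    3 UP lov crew8-mdtlov crew8-mdtlov_UUID 4
    4 UP mds crew8-MDT0000 crew8-MDT0000_UUID 5

If no "mgs" device appears in that list on mds2, then no MGS is running
there, regardless of which modules are loaded.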

> On Sep 4, 2008, at 4:54 PM, Ms. Megan Larko wrote:
> 
> > Hi,
> >
> > I have a new MGS/MDS that I would like to start.  It runs the same
> > CentOS 5 kernel 2.6.18-53.1.13.el5 with lustre-1.6.4.3smp as my other
> > boxes.  Initially I had an IP address that was already used elsewhere
> > in our group, so I changed it using the tunefs.lustre command below
> > for the new MDT.
> >
> > [root@mds2 ~]# tunefs.lustre --erase-params --writeconf --mgsnode=ic-mds2@o2ib /dev/sdd1
> > checking for existing Lustre data: found CONFIGS/mountdata
> > Reading CONFIGS/mountdata
> >
> >   Read previous values:
> > Target:     crew8-MDTffff
> > Index:      unassigned
> > Lustre FS:  crew8
> > Mount type: ldiskfs
> > Flags:      0x71
> >              (MDT needs_index first_time update )
> > Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
> > Parameters: mgsnode=172.18.0.9@o2ib
> >
> >
> >   Permanent disk data:
> > Target:     crew8-MDTffff
> > Index:      unassigned
> > Lustre FS:  crew8
> > Mount type: ldiskfs
> > Flags:      0x171
> >              (MDT needs_index first_time update writeconf )
> > Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
> > Parameters: mgsnode=172.18.0.16@o2ib
> >
> > Writing CONFIGS/mountdata
> >
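
In the tunefs.lustre output above the target only carries the MDT role
(Flags: ... MDT ...); nothing in that command creates an MGS target on mds2,
and --mgsnode= only tells the MDT where to find an MGS.  Purely as an
illustration of the concept, a combined MGS/MDT is normally created at
format time roughly like this (mkfs.lustre REFORMATS the device and would
destroy the existing MDT data, so this is a sketch, not a suggested fix for
this particular disk):

  [root@mds2 ~]# mkfs.lustre --fsname=crew8 --mgs --mdt /dev/sdd1
  [root@mds2 ~]# mount -t lustre /dev/sdd1 /srv/lustre/mds/crew8-MDT0000

Alternatively, a standalone MGS can live on its own small device
(mkfs.lustre --mgs /dev/<device>) and be mounted before any MDT or OST that
points at it.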
> > Next I try to mount this new MDT onto the system....
> > [root@mds2 ~]# mount -t lustre /dev/sdd1 /srv/lustre/mds/crew8-MDT0000
> > mount.lustre: mount /dev/sdd1 at /srv/lustre/mds/crew8-MDT0000 failed:
> > Input/output error
> > Is the MGS running?
> >
> > Umm... yeah, I thought the MGS was running.
> >
> > [root@mds2 ~]# tail /var/log/messages
> > Sep  4 16:28:08 mds2 kernel: LDISKFS-fs: mounted filesystem with
> > ordered data mode.
> > Sep  4 16:28:13 mds2 kernel: LustreError:
> > 3526:0:(client.c:975:ptlrpc_expire_one_request()) @@@ timeout (sent at
> > 1220560088, 5s ago)  req at ffff81042f109000 x3/t0
> > o250->MGS@MGC172.18.0.16@o2ib_0:26 lens 240/272 ref 1 fl Rpc:/0/0 rc
> > 0/-22
> > Sep  4 16:28:13 mds2 kernel: LustreError:
> > 3797:0:(obd_mount.c:954:server_register_target()) registration with
> > the MGS failed (-5)
> > Sep  4 16:28:13 mds2 kernel: LustreError:
> > 3797:0:(obd_mount.c:1054:server_start_targets()) Required registration
> > failed for crew8-MDTffff: -5
> > Sep  4 16:28:13 mds2 kernel: LustreError: 15f-b: Communication error
> > with the MGS.  Is the MGS running?
> > Sep  4 16:28:13 mds2 kernel: LustreError:
> > 3797:0:(obd_mount.c:1570:server_fill_super()) Unable to start targets:
> > -5
> > Sep  4 16:28:13 mds2 kernel: LustreError:
> > 3797:0:(obd_mount.c:1368:server_put_super()) no obd crew8-MDTffff
> > Sep  4 16:28:13 mds2 kernel: LustreError:
> > 3797:0:(obd_mount.c:119:server_deregister_mount()) crew8-MDTffff not
> > registered
> > Sep  4 16:28:13 mds2 kernel: Lustre: server umount crew8-MDTffff  
> > complete
> > Sep  4 16:28:13 mds2 kernel: LustreError:
> > 3797:0:(obd_mount.c:1924:lustre_fill_super()) Unable to mount  (-5)
> >
> > The o2ib network is up.  It is pingable from bash and via lctl.  I can
> > reach it from the node itself and from other computers on this local
> > subnet.
> >
> > [root@mds2 ~]# lctl
> > lctl > ping 172.18.0.16@o2ib
> > 12345-0@lo
> > 12345-172.18.0.16@o2ib
> > lctl > ping 172.18.0.15@o2ib
> > 12345-0@lo
> > 12345-172.18.0.15@o2ib
> > lctl > quit
> >
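lctl ping only confirms that LNET can reach an address; to see which NIDs
the local node itself has configured, "lctl list_nids" can be used.  A
hypothetical session (the exact NIDs depend on the lnet options line in
modprobe.conf):

  [root@mds2 ~]# lctl list_nids
  172.18.0.16@o2ib
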
> > On this network there are no firewalls, since the computers use only
> > non-routable IP addresses, so there is no firewall issue that I am
> > aware of...
> > [root@mds2 ~]# iptables -L
> > -bash: iptables: command not found
> >
> > The only oddity I have found is that the modules on my working MGS/MDS
> > show higher use counts than the modules on my new MGS/MDT.
> >
> > Correctly functioning MGS/MDT:
> > [root@mds1 ~]# lsmod | grep mgs
> > mgs                   181512  1
> > mgc                    86744  2 mgs
> > ptlrpc                659512  8 osc,mds,mgs,mgc,lustre,lov,lquota,mdc
> > obdclass              542200  13 osc,mds,fsfilt_ldiskfs,mgs,mgc,lustre,lov,lquota,mdc,ptlrpc
> > lvfs                   84712  12 osc,mds,fsfilt_ldiskfs,mgs,mgc,lustre,lov,lquota,mdc,ptlrpc,obdclass
> > libcfs                183128  14 osc,mds,fsfilt_ldiskfs,mgs,mgc,lustre,lov,lquota,mdc,ko2iblnd,ptlrpc,obdclass,lnet,lvfs
> > [root@mds1 ~]# lsmod | grep osc
> > osc                   172136  11
> > ptlrpc                659512  8 osc,mds,mgs,mgc,lustre,lov,lquota,mdc
> > obdclass              542200  13 osc,mds,fsfilt_ldiskfs,mgs,mgc,lustre,lov,lquota,mdc,ptlrpc
> > lvfs                   84712  12 osc,mds,fsfilt_ldiskfs,mgs,mgc,lustre,lov,lquota,mdc,ptlrpc,obdclass
> > libcfs                183128  14 osc,mds,fsfilt_ldiskfs,mgs,mgc,lustre,lov,lquota,mdc,ko2iblnd,ptlrpc,obdclass,lnet,lvfs
> > [root@mds1 ~]# lsmod | grep lnet
> > lnet                  255656  4 lustre,ko2iblnd,ptlrpc,obdclass
> > libcfs                183128  14 osc,mds,fsfilt_ldiskfs,mgs,mgc,lustre,lov,lquota,mdc,ko2iblnd,ptlrpc,obdclass,lnet,lvfs
> >
> > Failing MGS/MDT:
> > [root@mds2 ~]# lsmod | grep mgs
> > mgs                   181512  0
> > mgc                    86744  1 mgs
> > ptlrpc                659512  8 osc,lustre,lov,mdc,mds,lquota,mgs,mgc
> > obdclass              542200  10 osc,lustre,lov,mdc,fsfilt_ldiskfs,mds,lquota,mgs,mgc,ptlrpc
> > lvfs                   84712  12 osc,lustre,lov,mdc,fsfilt_ldiskfs,mds,lquota,mgs,mgc,ptlrpc,obdclass
> > libcfs                183128  14 osc,lustre,lov,mdc,fsfilt_ldiskfs,mds,lquota,mgs,mgc,ko2iblnd,ptlrpc,obdclass,lnet,lvfs
> > [root@mds2 ~]# lsmod | grep osc
> > osc                   172136  0
> > ptlrpc                659512  8 osc,lustre,lov,mdc,mds,lquota,mgs,mgc
> > obdclass              542200  10 osc,lustre,lov,mdc,fsfilt_ldiskfs,mds,lquota,mgs,mgc,ptlrpc
> > lvfs                   84712  12 osc,lustre,lov,mdc,fsfilt_ldiskfs,mds,lquota,mgs,mgc,ptlrpc,obdclass
> > libcfs                183128  14 osc,lustre,lov,mdc,fsfilt_ldiskfs,mds,lquota,mgs,mgc,ko2iblnd,ptlrpc,obdclass,lnet,lvfs
> > [root@mds2 ~]# lsmod | grep lnet
> > lnet                  255656  4 lustre,ko2iblnd,ptlrpc,obdclass
> > libcfs                183128  14 osc,lustre,lov,mdc,fsfilt_ldiskfs,mds,lquota,mgs,mgc,ko2iblnd,ptlrpc,obdclass,lnet,lvfs
> >
> > The failing MGS/MDT shows a use count of 0 next to mgs, not 1 like the
> > working MGS/MDT.  The osc module shows 11 in the working version and 0
> > in the non-working version.  The lnet counts are the same, as are most
> > of the other module comparisons.  Am I missing something at the
> > mgs/mgc/osc module level?  Or are those counts just indicating that
> > the modules are actually in use on my good MGS/MDT?
> >
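The lsmod numbers are just module use counts, i.e. how many Lustre devices
and dependent modules currently reference each module, so the 0 next to mgs
on mds2 most likely means no MGS device has been set up on that node, and
the 11 next to osc on mds1 reflects devices that are actually configured
there.  A quick, illustrative way to check is to list the devices rather
than the modules:

  [root@mds2 ~]# lctl dl | grep -i mgs

Empty output here would mean there is no MGS device on mds2, which matches
the "Is the MGS running?" error.
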
> > Even leaving the IB cabling aside (I'm working on the MGS/MDS itself),
> > why can I not mount a new MDT?  Why do I see the message "Is the MGS
> > running?" when I am actually on the MGS/MDS itself?
> >
> > Also, I get the same result if I attempt to mount an OST on an OSS
> > that points to this new MGS/MDT.  The OST won't even mount locally on
> > the OSS without successful communication with its associated MGS/MDT.
> >
> > Any and all suggestions gratefully appreciated.
> >
> > megan

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



