[Lustre-discuss] Starting a new MGS/MDS
Ms. Megan Larko
dobsonunit at gmail.com
Thu Sep 4 13:54:33 PDT 2008
Hi,
I have a new MGS/MDS that I would like to start. It runs the same CentOS 5
kernel (2.6.18-53.1.13.el5) and lustre-1.6.4.3smp as my other boxes.
Initially it had an IP address that was already in use elsewhere in our
group, so I changed it using the tunefs.lustre command below for the new MDT.
[root@mds2 ~]# tunefs.lustre --erase-params --writeconf
--mgsnode=ic-mds2@o2ib /dev/sdd1
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata
Read previous values:
Target: crew8-MDTffff
Index: unassigned
Lustre FS: crew8
Mount type: ldiskfs
Flags: 0x71
(MDT needs_index first_time update )
Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
Parameters: mgsnode=172.18.0.9@o2ib
Permanent disk data:
Target: crew8-MDTffff
Index: unassigned
Lustre FS: crew8
Mount type: ldiskfs
Flags: 0x171
(MDT needs_index first_time update writeconf )
Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
Parameters: mgsnode=172.18.0.16@o2ib
Writing CONFIGS/mountdata
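As a sanity check, the rewritten label can be re-read without touching the
disk using tunefs.lustre's read-only --print mode. The sketch below runs
over a hypothetical two-line capture of that output (on the real node I
would pipe the live command instead of the sample variable):

```shell
# Hypothetical capture of `tunefs.lustre --print /dev/sdd1` output;
# --print only reads CONFIGS/mountdata, it changes nothing.
sample="Target: crew8-MDTffff
Parameters: mgsnode=172.18.0.16@o2ib"

# Pull out the mgsnode parameter to confirm it points at the intended NID.
echo "$sample" | grep -o 'mgsnode=[^ ]*'
```

If the printed NID does not match the MGS node's actual NID, registration
would fail in exactly the way shown below.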
Next I try to mount this new MDT onto the system....
[root@mds2 ~]# mount -t lustre /dev/sdd1 /srv/lustre/mds/crew8-MDT0000
mount.lustre: mount /dev/sdd1 at /srv/lustre/mds/crew8-MDT0000 failed:
Input/output error
Is the MGS running?
Ummm... yeah, I thought the MGS was running.
[root@mds2 ~]# tail /var/log/messages
Sep 4 16:28:08 mds2 kernel: LDISKFS-fs: mounted filesystem with
ordered data mode.
Sep 4 16:28:13 mds2 kernel: LustreError:
3526:0:(client.c:975:ptlrpc_expire_one_request()) @@@ timeout (sent at
1220560088, 5s ago) req@ffff81042f109000 x3/t0
o250->MGS@MGC172.18.0.16@o2ib_0:26 lens 240/272 ref 1 fl Rpc:/0/0 rc
0/-22
Sep 4 16:28:13 mds2 kernel: LustreError:
3797:0:(obd_mount.c:954:server_register_target()) registration with
the MGS failed (-5)
Sep 4 16:28:13 mds2 kernel: LustreError:
3797:0:(obd_mount.c:1054:server_start_targets()) Required registration
failed for crew8-MDTffff: -5
Sep 4 16:28:13 mds2 kernel: LustreError: 15f-b: Communication error
with the MGS. Is the MGS running?
Sep 4 16:28:13 mds2 kernel: LustreError:
3797:0:(obd_mount.c:1570:server_fill_super()) Unable to start targets:
-5
Sep 4 16:28:13 mds2 kernel: LustreError:
3797:0:(obd_mount.c:1368:server_put_super()) no obd crew8-MDTffff
Sep 4 16:28:13 mds2 kernel: LustreError:
3797:0:(obd_mount.c:119:server_deregister_mount()) crew8-MDTffff not
registered
Sep 4 16:28:13 mds2 kernel: Lustre: server umount crew8-MDTffff complete
Sep 4 16:28:13 mds2 kernel: LustreError:
3797:0:(obd_mount.c:1924:lustre_fill_super()) Unable to mount (-5)
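To check whether the MGS service itself is started (as opposed to the node
merely being reachable), I have been looking at lctl's device list; a
combined MGS/MDS should show an "UP mgs" line once the MGT is mounted. The
sketch below uses a hypothetical capture of `lctl dl` output standing in
for the live command:

```shell
# Hypothetical `lctl dl` output from a node with a started MGS;
# on the real MGS/MDS, run `lctl dl` directly instead.
devices="0 UP mgs MGS MGS 5
1 UP mgc MGC172.18.0.16@o2ib 12345678-abcd-ef01-2345-6789abcdef01 5"

# The presence of an "UP mgs" device means the MGS service is attached.
if echo "$devices" | grep -q ' UP mgs '; then
    echo "MGS device is up"
else
    echo "MGS device is not attached -- mount the MGS/MGT first"
fi
```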
The o2ib network is up. It is pingable via both bash and lctl, and I can
reach it from the node itself and from other computers on this local
subnet.
[root@mds2 ~]# lctl
lctl > ping 172.18.0.16@o2ib
12345-0@lo
12345-172.18.0.16@o2ib
lctl > ping 172.18.0.15@o2ib
12345-0@lo
12345-172.18.0.15@o2ib
lctl > quit
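I have also been comparing the NIDs this node answers on (from
`lctl list_nids`) against the mgsnode= parameter written by tunefs.lustre,
since a mismatch there would produce exactly this "Is the MGS running?"
error. A sketch, with a hypothetical capture in place of the live command:

```shell
# Hypothetical capture of `lctl list_nids` on the MGS/MDS node;
# on the real node, substitute: local_nids=$(lctl list_nids)
local_nids="172.18.0.16@o2ib"
mgsnode="172.18.0.16@o2ib"   # taken from the tunefs.lustre Parameters: line

# The mgsnode NID must match, exactly, one NID the MGS node listens on.
if echo "$local_nids" | grep -qx "$mgsnode"; then
    echo "mgsnode matches a local NID"
else
    echo "mgsnode does not match any local NID"
fi
```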
There are no firewalls on this network, since the computers use only
non-routable IP addresses, so there is no firewall issue that I am aware
of...
[root@mds2 ~]# iptables -L
-bash: iptables: command not found
The only oddity I have found is that the modules on my working MGS/MDS
show higher use counts than the modules on my new MGS/MDT.
Correctly functioning MGS/MDT:
[root@mds1 ~]# lsmod | grep mgs
mgs 181512 1
mgc 86744 2 mgs
ptlrpc 659512 8 osc,mds,mgs,mgc,lustre,lov,lquota,mdc
obdclass 542200 13
osc,mds,fsfilt_ldiskfs,mgs,mgc,lustre,lov,lquota,mdc,ptlrpc
lvfs 84712 12
osc,mds,fsfilt_ldiskfs,mgs,mgc,lustre,lov,lquota,mdc,ptlrpc,obdclass
libcfs 183128 14
osc,mds,fsfilt_ldiskfs,mgs,mgc,lustre,lov,lquota,mdc,ko2iblnd,ptlrpc,obdclass,lnet,lvfs
[root@mds1 ~]# lsmod | grep osc
osc 172136 11
ptlrpc 659512 8 osc,mds,mgs,mgc,lustre,lov,lquota,mdc
obdclass 542200 13
osc,mds,fsfilt_ldiskfs,mgs,mgc,lustre,lov,lquota,mdc,ptlrpc
lvfs 84712 12
osc,mds,fsfilt_ldiskfs,mgs,mgc,lustre,lov,lquota,mdc,ptlrpc,obdclass
libcfs 183128 14
osc,mds,fsfilt_ldiskfs,mgs,mgc,lustre,lov,lquota,mdc,ko2iblnd,ptlrpc,obdclass,lnet,lvfs
[root@mds1 ~]# lsmod | grep lnet
lnet 255656 4 lustre,ko2iblnd,ptlrpc,obdclass
libcfs 183128 14
osc,mds,fsfilt_ldiskfs,mgs,mgc,lustre,lov,lquota,mdc,ko2iblnd,ptlrpc,obdclass,lnet,lvfs
Failing MGS/MDT:
[root@mds2 ~]# lsmod | grep mgs
mgs 181512 0
mgc 86744 1 mgs
ptlrpc 659512 8 osc,lustre,lov,mdc,mds,lquota,mgs,mgc
obdclass 542200 10
osc,lustre,lov,mdc,fsfilt_ldiskfs,mds,lquota,mgs,mgc,ptlrpc
lvfs 84712 12
osc,lustre,lov,mdc,fsfilt_ldiskfs,mds,lquota,mgs,mgc,ptlrpc,obdclass
libcfs 183128 14
osc,lustre,lov,mdc,fsfilt_ldiskfs,mds,lquota,mgs,mgc,ko2iblnd,ptlrpc,obdclass,lnet,lvfs
[root@mds2 ~]# lsmod | grep osc
osc 172136 0
ptlrpc 659512 8 osc,lustre,lov,mdc,mds,lquota,mgs,mgc
obdclass 542200 10
osc,lustre,lov,mdc,fsfilt_ldiskfs,mds,lquota,mgs,mgc,ptlrpc
lvfs 84712 12
osc,lustre,lov,mdc,fsfilt_ldiskfs,mds,lquota,mgs,mgc,ptlrpc,obdclass
libcfs 183128 14
osc,lustre,lov,mdc,fsfilt_ldiskfs,mds,lquota,mgs,mgc,ko2iblnd,ptlrpc,obdclass,lnet,lvfs
[root@mds2 ~]# lsmod | grep lnet
lnet 255656 4 lustre,ko2iblnd,ptlrpc,obdclass
libcfs 183128 14
osc,lustre,lov,mdc,fsfilt_ldiskfs,mds,lquota,mgs,mgc,ko2iblnd,ptlrpc,obdclass,lnet,lvfs
The failing MGS/MDT shows a use count of 0 for mgs, where the working
MGS/MDT shows 1, and osc shows 11 on the working node versus 0 on the
failing one. The lnet counts are the same, as are most of the other module
comparisons. Am I missing something at the mgs/mgc/osc module level? Or do
those counts simply indicate that the modules are actually in use on my
good MGS/MDT?
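For what it's worth, my reading is that the third lsmod column is the
kernel's module use count, so 0 would just mean nothing currently holds a
reference (i.e. the service never started), not that the module is broken.
A small sketch over the sample lines from the two nodes above:

```shell
# The third lsmod column is the module use count (references held).
# These are the mgs lines captured from the two nodes above.
working="mgs 181512 1"
failing="mgs 181512 0"

# Extract and label the use counts for comparison.
echo "$working" | awk '{print "working mgs use count:", $3}'
echo "$failing" | awk '{print "failing mgs use count:", $3}'
```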
Even setting the IB cabling aside (I am working on the MGS/MDS itself),
why can I not mount a new MDT? Why do I see the message "Is the MGS
running?" when I am actually on the MGS/MDS itself?
I also get the same result if I attempt to mount an OST on an OSS that
points to this new MGS/MDT: the OST will not even mount locally on the OSS
without successful communication with its associated MGS/MDT.
Any and all suggestions are gratefully appreciated.
megan