[Lustre-discuss] recover borked mds

Andreas Dilger adilger at sun.com
Thu Aug 20 14:09:16 PDT 2009


On Aug 20, 2009  09:09 -0400, Brock Palen wrote:
> Some additional details:
> I mounted the mds as ldiskfs and deleted the files in OBJECTS/* and
> CATALOGS, then remounted as lustre; same issue.
> I also did a writeconf and restarted all the servers, and saw messages on
> the MGS that new config logs were being created, but still the same error
> on the mds trying to start up.
> Is there a way to get lustre to stop trying to open
> 0xf150010:80d24629: ?  And not go through recovery?

With the exception of the CATALOGS file, I'm not sure what else would
be looking for an llog like this.  You could try touching that filename
in OBJECTS, but I'm not sure that would help.
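
If you want to try that, a rough outline would be something like the
following.  Note that the device path and mount point are placeholders
for your setup, and I'm only guessing at how the llog object in OBJECTS
is named, so treat this as a sketch rather than a recipe:

  # with the MDS stopped, mount the MDT device as plain ldiskfs
  mount -t ldiskfs /dev/MDTDEV /mnt/mdt
  # look at what llog objects already exist
  ls /mnt/mdt/OBJECTS
  # create an empty placeholder for the log it is complaining about
  # (the exact filename matching 0xf150010:80d24629 is a guess)
  touch /mnt/mdt/OBJECTS/<file named for 0xf150010:80d24629>
  umount /mnt/mdt
  # then try mounting as lustre again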

> If not,  can I format a new mds,  and just untar  ROOTS/  and apply  
> the extended attributes to ROOTS from the old mds filesystem?

Well, it depends on where the request is coming from.  Deleting the
CATALOGS file should be essentially the same as (and much faster than)
what you propose above.
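
For completeness, what you describe is more or less the file-level MDT
backup/restore procedure.  A rough sketch, with placeholder device names
and paths (check the manual for your version before relying on it):

  # backup: mount the old MDT as ldiskfs, save the files and their EAs
  mount -t ldiskfs /dev/OLDMDT /mnt/old
  cd /mnt/old
  getfattr -R -d -m '.*' -e hex -P . > /tmp/ea.bak
  tar czf /tmp/mdt.tgz --sparse .

  # restore: format the new MDT, mount it as ldiskfs, unpack, restore EAs
  mount -t ldiskfs /dev/NEWMDT /mnt/new
  cd /mnt/new
  tar xzpf /tmp/mdt.tgz
  setfattr --restore=/tmp/ea.bak

But again, just removing CATALOGS and remounting should get you to the
same place without copying the whole MDT.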

> On Aug 19, 2009, at 12:57 PM, Brock Palen wrote:
> > After a network event (switches bouncing), it looks like our mds got
> > borked somewhere from all the random failovers (the switches went up
> > and down rapidly over a few hours).
> >
> > Now we cannot mount the mds; when we do, we get the following errors:
> >
> > Aug 19 12:37:39 mds2 kernel: LustreError: 137-5: UUID 'nobackup-
> > MDT0000_UUID' is not available  for connect (no target)
> > Aug 19 12:37:39 mds2 kernel: LustreError: 7455:0:(ldlm_lib.c:
> > 1619:target_send_reply_msg()) @@@ processing error (-19)
> > req at 000001037c9db600 x85226/t0 o38-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl
> > 1250699959 ref 1 fl Interpret:/0/0 rc -19/0
> > Aug 19 12:37:39 mds2 kernel: LustreError: 137-5: UUID 'nobackup-
> > MDT0000_UUID' is not available  for connect (no target)
> > Aug 19 12:37:39 mds2 kernel: LustreError: 7456:0:(ldlm_lib.c:
> > 1619:target_send_reply_msg()) @@@ processing error (-19)
> > req at 00000104163a6000 x47117/t0 o38-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl
> > 1250699959 ref 1 fl Interpret:/0/0 rc -19/0
> > Aug 19 12:37:39 mds2 kernel: LustreError: 137-5: UUID 'nobackup-
> > MDT0000_UUID' is not available  for connect (no target)
> > Aug 19 12:37:39 mds2 kernel: LustreError: Skipped 11 previous similar messages
> > Aug 19 12:37:39 mds2 kernel: LustreError: 7468:0:(ldlm_lib.c:
> > 1619:target_send_reply_msg()) @@@ processing error (-19)
> > req at 0000010350a4d200 x81788/t0 o38-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl
> > 1250699959 ref 1 fl Interpret:/0/0 rc -19/0
> > Aug 19 12:37:39 mds2 kernel: LustreError: 7468:0:(ldlm_lib.c:
> > 1619:target_send_reply_msg()) Skipped 11 previous similar messages
> > Aug 19 12:37:40 mds2 kernel: LustreError: 137-5: UUID 'nobackup-
> > MDT0000_UUID' is not available  for connect (no target)
> > Aug 19 12:37:40 mds2 kernel: LustreError: Skipped 18 previous similar
> > messages
> > Aug 19 12:37:40 mds2 kernel: LustreError: 7455:0:(ldlm_lib.c:
> > 1619:target_send_reply_msg()) @@@ processing error (-19)
> > req at 0000010414dc1850 x81855/t0 o38-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl
> > 1250699960 ref 1 fl Interpret:/0/0 rc -19/0
> > Aug 19 12:37:40 mds2 kernel: LustreError: 7455:0:(ldlm_lib.c:1619:target_send_reply_msg())
> > Skipped 18 previous similar messages
> > Aug 19 12:37:42 mds2 kernel: LustreError: 137-5: UUID 'nobackup-
> > MDT0000_UUID' is not available  for connect (no target)
> > Aug 19 12:37:42 mds2 kernel: LustreError: Skipped 42 previous similar
> > messages
> > Aug 19 12:37:42 mds2 kernel: LustreError: 7466:0:(ldlm_lib.c:
> > 1619:target_send_reply_msg()) @@@ processing error (-19)
> > req at 000001037c9db600 x77144/t0 o38-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl
> > 1250699962 ref 1 fl Interpret:/0/0 rc -19/0
> > Aug 19 12:37:42 mds2 kernel: LustreError: 7466:0:(ldlm_lib.c:
> > 1619:target_send_reply_msg()) Skipped 42 previous similar messages
> > Aug 19 12:37:43 mds2 kernel: Lustre: Request x3 sent from
> > MGC10.164.3.246 at tcp to NID 10.164.3.246 at tcp 5s ago has timed out
> > (limit 5s).
> > Aug 19 12:37:43 mds2 kernel: Lustre: Changing connection for
> > MGC10.164.3.246 at tcp to MGC10.164.3.246 at tcp_1/0 at lo
> > Aug 19 12:37:43 mds2 kernel: Lustre: Enabling user_xattr
> > Aug 19 12:37:43 mds2 kernel: Lustre: 7524:0:(mds_fs.c:
> > 493:mds_init_server_data()) RECOVERY: service nobackup-MDT0000, 439
> > recoverable clients, last_transno 3647966566
> > Aug 19 12:37:43 mds2 kernel: Lustre: MDT nobackup-MDT0000 now serving
> > dev (nobackup-MDT0000/57dddb69-2475-b551-4100-e045f91ce38c), but will
> > be in recovery for at least 5:00, or until 439 clients reconnect.
> > During this time new clients will not be allowed to connect. Recovery
> > progress can be monitored by watching
> > /proc/fs/lustre/mds/nobackup-MDT0000/recovery_status.
> > Aug 19 12:37:43 mds2 kernel: Lustre: 7524:0:(lproc_mds.c:
> > 273:lprocfs_wr_group_upcall()) nobackup-MDT0000: group upcall set to /
> > usr/sbin/l_getgroups
> > Aug 19 12:37:43 mds2 kernel: Lustre: nobackup-MDT0000.mdt: set
> > parameter group_upcall=/usr/sbin/l_getgroups
> > Aug 19 12:37:43 mds2 kernel: Lustre: 7524:0:(mds_lov.c:1070:mds_notify()) MDS nobackup-
> > MDT0000: in recovery, not resetting orphans on nobackup-OST0000_UUID
> > Aug 19 12:37:43 mds2 kernel: Lustre: nobackup-MDT0000: temporarily
> > refusing client connection from 10.164.1.104 at tcp
> > Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_lvfs.c:
> > 612:llog_lvfs_create()) error looking up logfile 0xf150010:0x80d24629:
> > rc -2
> > Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_cat.c:
> > 176:llog_cat_id2handle()) error opening log id 0xf150010:80d24629:  
> > rc -2
> > Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_obd.c:
> > 262:cat_cancel_cb()) Cannot find handle for log 0xf150010
> > Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(llog_obd.c:
> > 329:llog_obd_origin_setup()) llog_process with cat_cancel_cb failed:  
> > -2
> > Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(osc_request.c:
> > 3664:osc_llog_init()) failed LLOG_MDS_OST_ORIG_CTXT
> > Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(osc_request.c:
> > 3675:osc_llog_init()) osc 'nobackup-OST0000-osc' tgt 'nobackup-
> > MDT0000' cnt 1 catid 00000101e1d979e8 rc=-2
> > Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(osc_request.c:
> > 3677:osc_llog_init()) logid 0xf150002:0x9642a0ac
> > Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(lov_log.c:
> > 230:lov_llog_init()) error osc_llog_init idx 0 osc 'nobackup-OST0000-
> > osc' tgt 'nobackup-MDT0000' (rc=-2)
> > Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(mds_log.c:
> > 220:mds_llog_init()) lov_llog_init err -2
> > Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(llog_obd.c:
> > 417:llog_cat_initialize()) rc: -2
> > Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(lov_obd.c:
> > 727:lov_add_target()) add failed (-2), deleting nobackup-OST0000_UUID
> > Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(obd_config.c:
> > 1093:class_config_llog_handler()) Err -2 on cfg command:
> > Aug 19 12:37:43 mds2 kernel: Lustre:    cmd=cf00d 0:nobackup-mdtlov
> > 1:nobackup-OST0000_UUID  2:0  3:1
> > Aug 19 12:37:43 mds2 kernel: LustreError: 15c-8: MGC10.164.3.246 at tcp:
> > The configuration from log 'nobackup-MDT0000' failed (-2). This may be
> > the result of communication errors between this node and the MGS, a bad
> > configuration, or other errors. See the syslog for more information.
> > Aug 19 12:37:43 mds2 kernel: LustreError: 7438:0:(obd_mount.c:
> > 1113:server_start_targets()) failed to start server nobackup- 
> > MDT0000: -2
> > Aug 19 12:37:44 mds2 kernel: LustreError: 7438:0:(obd_mount.c:
> > 1623:server_fill_super()) Unable to start targets: -2
> > Aug 19 12:37:44 mds2 kernel: Lustre: Failing over nobackup-MDT0000
> > Aug 19 12:37:44 mds2 kernel: Lustre: *** setting obd nobackup-MDT0000
> > device 'unknown-block(8,16)' read-only ***
> >
> > We have run e2fsck on the volume, which found and corrected a few
> > errors, but the problem persists.  We also tried mounting with
> > -o abort_recov; this resulted in an assertion (LBUG) and did not work.
> > Any thoughts?  The lines:
> > Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_lvfs.c:
> > 612:llog_lvfs_create()) error looking up logfile 0xf150010:0x80d24629:
> > rc -2
> > Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_cat.c:
> > 176:llog_cat_id2handle()) error opening log id 0xf150010:80d24629:  
> > rc -2
> > Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_obd.c:
> > 262:cat_cancel_cb()) Cannot find handle for log 0xf150010
> >
> > catch my attention.
> > Thanks; we are running 1.6.6.
> >
> >
> > Brock Palen
> > www.umich.edu/~brockp
> > Center for Advanced Computing
> > brockp at umich.edu
> > (734)936-1985
> >
> >
> >
> 
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



