[Lustre-discuss] recover borked mds

Brock Palen brockp at umich.edu
Thu Aug 20 06:09:38 PDT 2009


Some additional details:
I mounted the MDS as ldiskfs and deleted the files in OBJECTS/* and
CATALOGS, then remounted as lustre: same issue.
I also did a writeconf and restarted all the servers. I saw messages on
the MGS that new config logs were being created, but I still get the
same error on the MDS when it tries to start up.
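
For anyone reading along: the rc values in the log quoted below are negated Linux errno numbers, so rc -2 is ENOENT (no such file or directory) and rc -19 is ENODEV (no such device). Here is a minimal Python sketch for pulling the log id and rc out of one of those lines. The helper names are made up for illustration and are not part of Lustre, and the sample line is the llog_cat_id2handle() entry from the log joined onto a single line.

```python
import errno
import os
import re

# Sample entry from the log quoted below, joined onto one line.
SAMPLE = ("LustreError: 7525:0:(llog_cat.c:176:llog_cat_id2handle()) "
          "error opening log id 0xf150010:80d24629: rc -2")

# Hypothetical patterns: the two halves of the log id are captured as-is,
# without assuming what each half means inside Lustre.
LOG_ID_RE = re.compile(r"log id (0x[0-9a-fA-F]+):([0-9a-fA-F]+)")
RC_RE = re.compile(r"\brc[ :=]+(-?\d+)")


def describe_rc(rc):
    """Return (errno name, message) for a kernel-style negative return code."""
    code = abs(rc)
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


def parse_llog_error(line):
    """Return (first half, second half, rc) from one log line, or None."""
    log_id = LOG_ID_RE.search(line)
    rc = RC_RE.search(line)
    if not log_id or not rc:
        return None
    return log_id.group(1), log_id.group(2), int(rc.group(1))


if __name__ == "__main__":
    first, second, rc = parse_llog_error(SAMPLE)
    name, text = describe_rc(rc)
    print(f"log id {first}:{second} failed with rc {rc} ({name}: {text})")
```

Read this way, the startup failure is a lookup of a log file that no longer exists, not an I/O error.
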
Is there a way to get lustre to stop trying to open
0xf150010:80d24629 and not go through recovery?

If not, can I format a new MDS, untar ROOTS/ into it, and apply
the extended attributes to ROOTS from the old MDS filesystem?
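
To make the second idea concrete, here is a rough Python sketch of only the mechanical part: copying every extended attribute from an old tree onto the same relative paths in a new one. It is an illustration under assumptions, not a tested Lustre recovery procedure. The function name and the injectable list/get/set hooks are hypothetical, the real calls (os.listxattr, os.getxattr, os.setxattr) are Linux-only, and trusted.* attributes need root. Whether a freshly formatted MDS would accept a tree restored this way is exactly the open question. The usual command-line tools for this job are getfattr and setfattr --restore.

```python
import os


def copy_xattrs(src_root, dst_root, listx=None, getx=None, setx=None):
    """Copy all extended attributes from src_root onto matching paths
    under dst_root. Returns the number of attributes written.

    listx/getx/setx are hypothetical hooks so the logic can be exercised
    without a real xattr-capable filesystem; they default to the os calls.
    """
    listx = listx or os.listxattr
    getx = getx or os.getxattr
    setx = setx or os.setxattr

    def copy_one(src, dst):
        count = 0
        for attr in listx(src, follow_symlinks=False):
            value = getx(src, attr, follow_symlinks=False)
            setx(dst, attr, value, follow_symlinks=False)
            count += 1
        return count

    # The top-level directory carries attributes too.
    copied = copy_one(src_root, dst_root)
    for dirpath, dirnames, filenames in os.walk(src_root):
        for name in dirnames + filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(dst_root, os.path.relpath(src, src_root))
            if not os.path.lexists(dst):
                continue  # nothing was untarred at this path, so skip it
            copied += copy_one(src, dst)
    return copied
```

A call would look like copy_xattrs("/mnt/old_mds/ROOTS", "/mnt/new_mds/ROOTS"), where both mount points are placeholders.
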

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985



On Aug 19, 2009, at 12:57 PM, Brock Palen wrote:

> After a network event (switches bouncing), it looks like our MDS got
> borked somewhere from all the random failovers (the switches came up
> and down rapidly over a few hours).
>
> Now we cannot mount the MDS. When we try, we get the following errors:
>
> Aug 19 12:37:39 mds2 kernel: LustreError: 137-5: UUID 'nobackup-MDT0000_UUID' is not available  for connect (no target)
> Aug 19 12:37:39 mds2 kernel: LustreError: 7455:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19) req@000001037c9db600 x85226/t0 o38-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1250699959 ref 1 fl Interpret:/0/0 rc -19/0
> Aug 19 12:37:39 mds2 kernel: LustreError: 137-5: UUID 'nobackup-MDT0000_UUID' is not available  for connect (no target)
> Aug 19 12:37:39 mds2 kernel: LustreError: 7456:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19) req@00000104163a6000 x47117/t0 o38-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1250699959 ref 1 fl Interpret:/0/0 rc -19/0
> Aug 19 12:37:39 mds2 kernel: LustreError: 137-5: UUID 'nobackup-MDT0000_UUID' is not available  for connect (no target)
> Aug 19 12:37:39 mds2 kernel: LustreError: Skipped 11 previous similar messages
> Aug 19 12:37:39 mds2 kernel: LustreError: 7468:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19) req@0000010350a4d200 x81788/t0 o38-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1250699959 ref 1 fl Interpret:/0/0 rc -19/0
> Aug 19 12:37:39 mds2 kernel: LustreError: 7468:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 11 previous similar messages
> Aug 19 12:37:40 mds2 kernel: LustreError: 137-5: UUID 'nobackup-MDT0000_UUID' is not available  for connect (no target)
> Aug 19 12:37:40 mds2 kernel: LustreError: Skipped 18 previous similar messages
> Aug 19 12:37:40 mds2 kernel: LustreError: 7455:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19) req@0000010414dc1850 x81855/t0 o38-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1250699960 ref 1 fl Interpret:/0/0 rc -19/0
> Aug 19 12:37:40 mds2 kernel: LustreError: 7455:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 18 previous similar messages
> Aug 19 12:37:42 mds2 kernel: LustreError: 137-5: UUID 'nobackup-MDT0000_UUID' is not available  for connect (no target)
> Aug 19 12:37:42 mds2 kernel: LustreError: Skipped 42 previous similar messages
> Aug 19 12:37:42 mds2 kernel: LustreError: 7466:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19) req@000001037c9db600 x77144/t0 o38-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1250699962 ref 1 fl Interpret:/0/0 rc -19/0
> Aug 19 12:37:42 mds2 kernel: LustreError: 7466:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 42 previous similar messages
> Aug 19 12:37:43 mds2 kernel: Lustre: Request x3 sent from MGC10.164.3.246@tcp to NID 10.164.3.246@tcp 5s ago has timed out (limit 5s).
> Aug 19 12:37:43 mds2 kernel: Lustre: Changing connection for MGC10.164.3.246@tcp to MGC10.164.3.246@tcp_1/0@lo
> Aug 19 12:37:43 mds2 kernel: Lustre: Enabling user_xattr
> Aug 19 12:37:43 mds2 kernel: Lustre: 7524:0:(mds_fs.c:493:mds_init_server_data()) RECOVERY: service nobackup-MDT0000, 439 recoverable clients, last_transno 3647966566
> Aug 19 12:37:43 mds2 kernel: Lustre: MDT nobackup-MDT0000 now serving dev (nobackup-MDT0000/57dddb69-2475-b551-4100-e045f91ce38c), but will be in recovery for at least 5:00, or until 439 clients reconnect. During this time new clients will not be allowed to connect. Recovery progress can be monitored by watching /proc/fs/lustre/mds/nobackup-MDT0000/recovery_status.
> Aug 19 12:37:43 mds2 kernel: Lustre: 7524:0:(lproc_mds.c:273:lprocfs_wr_group_upcall()) nobackup-MDT0000: group upcall set to /usr/sbin/l_getgroups
> Aug 19 12:37:43 mds2 kernel: Lustre: nobackup-MDT0000.mdt: set parameter group_upcall=/usr/sbin/l_getgroups
> Aug 19 12:37:43 mds2 kernel: Lustre: 7524:0:(mds_lov.c:1070:mds_notify()) MDS nobackup-MDT0000: in recovery, not resetting orphans on nobackup-OST0000_UUID
> Aug 19 12:37:43 mds2 kernel: Lustre: nobackup-MDT0000: temporarily refusing client connection from 10.164.1.104@tcp
> Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_lvfs.c:612:llog_lvfs_create()) error looking up logfile 0xf150010:0x80d24629: rc -2
> Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_cat.c:176:llog_cat_id2handle()) error opening log id 0xf150010:80d24629: rc -2
> Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_obd.c:262:cat_cancel_cb()) Cannot find handle for log 0xf150010
> Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(llog_obd.c:329:llog_obd_origin_setup()) llog_process with cat_cancel_cb failed: -2
> Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(osc_request.c:3664:osc_llog_init()) failed LLOG_MDS_OST_ORIG_CTXT
> Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(osc_request.c:3675:osc_llog_init()) osc 'nobackup-OST0000-osc' tgt 'nobackup-MDT0000' cnt 1 catid 00000101e1d979e8 rc=-2
> Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(osc_request.c:3677:osc_llog_init()) logid 0xf150002:0x9642a0ac
> Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(lov_log.c:230:lov_llog_init()) error osc_llog_init idx 0 osc 'nobackup-OST0000-osc' tgt 'nobackup-MDT0000' (rc=-2)
> Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(mds_log.c:220:mds_llog_init()) lov_llog_init err -2
> Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(llog_obd.c:417:llog_cat_initialize()) rc: -2
> Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(lov_obd.c:727:lov_add_target()) add failed (-2), deleting nobackup-OST0000_UUID
> Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(obd_config.c:1093:class_config_llog_handler()) Err -2 on cfg command:
> Aug 19 12:37:43 mds2 kernel: Lustre:    cmd=cf00d 0:nobackup-mdtlov 1:nobackup-OST0000_UUID  2:0  3:1
> Aug 19 12:37:43 mds2 kernel: LustreError: 15c-8: MGC10.164.3.246@tcp: The configuration from log 'nobackup-MDT0000' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
> Aug 19 12:37:43 mds2 kernel: LustreError: 7438:0:(obd_mount.c:1113:server_start_targets()) failed to start server nobackup-MDT0000: -2
> Aug 19 12:37:44 mds2 kernel: LustreError: 7438:0:(obd_mount.c:1623:server_fill_super()) Unable to start targets: -2
> Aug 19 12:37:44 mds2 kernel: Lustre: Failing over nobackup-MDT0000
> Aug 19 12:37:44 mds2 kernel: Lustre: *** setting obd nobackup-MDT0000 device 'unknown-block(8,16)' read-only ***
>
> We have run e2fsck on the volume; it found a few errors and corrected
> them, but the problem persists. We also tried mounting with
> -o abort_recov; this resulted in an assertion (LBUG) and does not work.
> Any thoughts? The lines:
> Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_lvfs.c:612:llog_lvfs_create()) error looking up logfile 0xf150010:0x80d24629: rc -2
> Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_cat.c:176:llog_cat_id2handle()) error opening log id 0xf150010:80d24629: rc -2
> Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_obd.c:262:cat_cancel_cb()) Cannot find handle for log 0xf150010
>
> catch my attention.
> Thanks. We are running Lustre 1.6.6.
>
>
> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> brockp at umich.edu
> (734)936-1985
>
>
>



