[Lustre-discuss] recover borked mds

Brock Palen brockp at umich.edu
Thu Aug 20 14:13:32 PDT 2009


Andreas, sorry I missed your reply yesterday.
Here is how we fixed it:

We deleted OBJECTS/* and CATALOGS,
shut down all the OSTs.
At that point the MDS mounted correctly with -o abort_recov.
We remounted the OSTs (with recovery) and all worked well. I have since
enabled (re)quotas and heartbeat, and bounced the servers a few times. All
appears well.
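For anyone hitting the same llog errors, the sequence above looks roughly like this as commands. This is a sketch, not verbatim from our systems: the device names and mount points are placeholders, so adapt them to your setup, and take a backup of the MDT first.

```shell
# Sketch of the recovery sequence described above. /dev/sdb (MDT),
# /dev/sdc (OST), and the mount points are example names, not ours.

# 1. Stop the MDS and all OSTs, then mount the MDT backing device
#    as plain ldiskfs so the llog files can be reached directly.
mount -t ldiskfs /dev/sdb /mnt/mdt

# 2. Remove the llog catalog files; the MDS recreates them on the
#    next Lustre mount.
rm /mnt/mdt/OBJECTS/*
rm /mnt/mdt/CATALOGS
umount /mnt/mdt

# 3. With every OST still shut down, mount the MDT as Lustre,
#    skipping client recovery.
mount -t lustre -o abort_recov /dev/sdb /mnt/mdt

# 4. Once the MDS is up, bring the OSTs back with normal recovery.
mount -t lustre /dev/sdc /mnt/ost0
```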

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985



On Aug 20, 2009, at 9:09 AM, Brock Palen wrote:

> Some additional details:
> I mounted the MDS as ldiskfs and deleted the files in OBJECTS/* and
> CATALOGS, then remounted as Lustre; same issue.
> I also did a writeconf and restarted all the servers; I saw messages on
> the MGS that new config logs were being created, but still the same error
> on the MDS trying to start up.
> Is there a way to get Lustre to stop trying to open
> 0xf150010:80d24629:, and not go through recovery?
>
> If not, can I format a new MDS, and just untar ROOTS/ and apply
> the extended attributes to ROOTS from the old MDS filesystem?
>
> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> brockp at umich.edu
> (734)936-1985
>
>
>
> On Aug 19, 2009, at 12:57 PM, Brock Palen wrote:
>
>> After a network event (switches bouncing), it looks like our MDS got
>> borked somewhere from all the random failovers (the switches came up and
>> went down rapidly over a few hours).
>>
>> Now we cannot mount the MDS; when we do, we get the following
>> errors:
>>
>> Aug 19 12:37:39 mds2 kernel: LustreError: 137-5: UUID 'nobackup-
>> MDT0000_UUID' is not available  for connect (no target)
>> Aug 19 12:37:39 mds2 kernel: LustreError: 7455:0:(ldlm_lib.c:
>> 1619:target_send_reply_msg()) @@@ processing error (-19)
>> req at 000001037c9db600 x85226/t0 o38-><?>@<?>:0/0 lens 304/0 e 0 to 0  
>> dl
>> 1250699959 ref 1 fl Interpret:/0/0 rc -19/0
>> Aug 19 12:37:39 mds2 kernel: LustreError: 137-5: UUID 'nobackup-
>> MDT0000_UUID' is not available  for connect (no target)
>> Aug 19 12:37:39 mds2 kernel: LustreError: 7456:0:(ldlm_lib.c:
>> 1619:target_send_reply_msg()) @@@ processing error (-19)
>> req at 00000104163a6000 x47117/t0 o38-><?>@<?>:0/0 lens 304/0 e 0 to 0  
>> dl
>> 1250699959 ref 1 fl Interpret:/0/0 rc -19/0
>> Aug 19 12:37:39 mds2 kernel: LustreError: 137-5: UUID 'nobackup-
>> MDT0000_UUID' is not available  for connect (no target)
>> Aug 19 12:37:39 mds2 kernel: LustreError: Skipped 11 previous similar messages
>> Aug 19 12:37:39 mds2 kernel: LustreError: 7468:0:(ldlm_lib.c:
>> 1619:target_send_reply_msg()) @@@ processing error (-19)
>> req at 0000010350a4d200 x81788/t0 o38-><?>@<?>:0/0 lens 304/0 e 0 to 0  
>> dl
>> 1250699959 ref 1 fl Interpret:/0/0 rc -19/0
>> Aug 19 12:37:39 mds2 kernel: LustreError: 7468:0:(ldlm_lib.c:
>> 1619:target_send_reply_msg()) Skipped 11 previous similar messages
>> Aug 19 12:37:40 mds2 kernel: LustreError: 137-5: UUID 'nobackup-
>> MDT0000_UUID' is not available  for connect (no target)
>> Aug 19 12:37:40 mds2 kernel: LustreError: Skipped 18 previous similar
>> messages
>> Aug 19 12:37:40 mds2 kernel: LustreError: 7455:0:(ldlm_lib.c:
>> 1619:target_send_reply_msg()) @@@ processing error (-19)
>> req at 0000010414dc1850 x81855/t0 o38-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl
>> 1250699960 ref 1 fl Interpret:/0/0 rc -19/0
>> Aug 19 12:37:40 mds2 kernel: LustreError: 7455:0:(ldlm_lib.c:1619:target_send_reply_msg())
>> Skipped 18 previous similar messages
>> Aug 19 12:37:42 mds2 kernel: LustreError: 137-5: UUID 'nobackup-
>> MDT0000_UUID' is not available  for connect (no target)
>> Aug 19 12:37:42 mds2 kernel: LustreError: Skipped 42 previous similar
>> messages
>> Aug 19 12:37:42 mds2 kernel: LustreError: 7466:0:(ldlm_lib.c:
>> 1619:target_send_reply_msg()) @@@ processing error (-19)
>> req at 000001037c9db600 x77144/t0 o38-><?>@<?>:0/0 lens 304/0 e 0 to 0  
>> dl
>> 1250699962 ref 1 fl Interpret:/0/0 rc -19/0
>> Aug 19 12:37:42 mds2 kernel: LustreError: 7466:0:(ldlm_lib.c:
>> 1619:target_send_reply_msg()) Skipped 42 previous similar messages
>> Aug 19 12:37:43 mds2 kernel: Lustre: Request x3 sent from
>> MGC10.164.3.246 at tcp to NID 10.164.3.246 at tcp 5s ago has timed out
>> (limit 5s).
>> Aug 19 12:37:43 mds2 kernel: Lustre: Changing connection for
>> MGC10.164.3.246 at tcp to MGC10.164.3.246 at tcp_1/0 at lo
>> Aug 19 12:37:43 mds2 kernel: Lustre: Enabling user_xattr
>> Aug 19 12:37:43 mds2 kernel: Lustre: 7524:0:(mds_fs.c:
>> 493:mds_init_server_data()) RECOVERY: service nobackup-MDT0000, 439
>> recoverable clients, last_transno 3647966566
>> Aug 19 12:37:43 mds2 kernel: Lustre: MDT nobackup-MDT0000 now serving
>> dev (nobackup-MDT0000/57dddb69-2475-b551-4100-e045f91ce38c), but will
>> be in recovery for at least 5:00, or
>> until 439 clients reconnect. During this time new clients will not be
>> allowed to connect. Recovery progress can be monitored by watching /
>> proc/fs/lustre/mds/nobackup-MDT0000/rec
>> overy_status.
>> Aug 19 12:37:43 mds2 kernel: Lustre: 7524:0:(lproc_mds.c:
>> 273:lprocfs_wr_group_upcall()) nobackup-MDT0000: group upcall set  
>> to /
>> usr/sbin/l_getgroups
>> Aug 19 12:37:43 mds2 kernel: Lustre: nobackup-MDT0000.mdt: set
>> parameter group_upcall=/usr/sbin/l_getgroups
>> Aug 19 12:37:43 mds2 kernel: Lustre: 7524:0:(mds_lov.c:1070:mds_notify()) MDS nobackup-
>> MDT0000: in recovery, not resetting orphans on nobackup-OST0000_UUID
>> Aug 19 12:37:43 mds2 kernel: Lustre: nobackup-MDT0000: temporarily
>> refusing client connection from 10.164.1.104 at tcp
>> Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_lvfs.c:
>> 612:llog_lvfs_create()) error looking up logfile  
>> 0xf150010:0x80d24629:
>> rc -2
>> Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_cat.c:
>> 176:llog_cat_id2handle()) error opening log id 0xf150010:80d24629:
>> rc -2
>> Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_obd.c:
>> 262:cat_cancel_cb()) Cannot find handle for log 0xf150010
>> Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(llog_obd.c:
>> 329:llog_obd_origin_setup()) llog_process with cat_cancel_cb failed:
>> -2
>> Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(osc_request.c:
>> 3664:osc_llog_init()) failed LLOG_MDS_OST_ORIG_CTXT
>> Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(osc_request.c:
>> 3675:osc_llog_init()) osc 'nobackup-OST0000-osc' tgt 'nobackup-
>> MDT0000' cnt 1 catid 00000101e1d979e8 rc=-2
>> Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(osc_request.c:
>> 3677:osc_llog_init()) logid 0xf150002:0x9642a0ac
>> Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(lov_log.c:
>> 230:lov_llog_init()) error osc_llog_init idx 0 osc 'nobackup-OST0000-
>> osc' tgt 'nobackup-MDT0000' (rc=-2)
>> Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(mds_log.c:
>> 220:mds_llog_init()) lov_llog_init err -2
>> Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(llog_obd.c:
>> 417:llog_cat_initialize()) rc: -2
>> Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(lov_obd.c:
>> 727:lov_add_target()) add failed (-2), deleting nobackup-OST0000_UUID
>> Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(obd_config.c:
>> 1093:class_config_llog_handler()) Err -2 on cfg command:
>> Aug 19 12:37:43 mds2 kernel: Lustre:    cmd=cf00d 0:nobackup-mdtlov
>> 1:nobackup-OST0000_UUID  2:0  3:1
>> Aug 19 12:37:43 mds2 kernel: LustreError: 15c-8: MGC10.164.3.246 at tcp:
>> The configuration from log 'nobackup-MDT0000' failed (-2). This may  
>> be
>> the result of communication errors b
>> etween this node and the MGS, a bad configuration, or other errors.
>> See the syslog for more information.
>> Aug 19 12:37:43 mds2 kernel: LustreError: 7438:0:(obd_mount.c:
>> 1113:server_start_targets()) failed to start server nobackup-
>> MDT0000: -2
>> Aug 19 12:37:44 mds2 kernel: LustreError: 7438:0:(obd_mount.c:
>> 1623:server_fill_super()) Unable to start targets: -2
>> Aug 19 12:37:44 mds2 kernel: Lustre: Failing over nobackup-MDT0000
>> Aug 19 12:37:44 mds2 kernel: Lustre: *** setting obd nobackup-MDT0000
>> device 'unknown-block(8,16)' read-only ***
>>
>> We have run e2fsck on the volume; it found a few errors, which we corrected,
>> but the problem persists. We also tried mounting with -o abort_recov;
>> this resulted in an assertion failure (LBUG) and does not work.
>> Any thoughts? The lines:
>> Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_lvfs.c:
>> 612:llog_lvfs_create()) error looking up logfile  
>> 0xf150010:0x80d24629:
>> rc -2
>> Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_cat.c:
>> 176:llog_cat_id2handle()) error opening log id 0xf150010:80d24629:
>> rc -2
>> Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_obd.c:
>> 262:cat_cancel_cb()) Cannot find handle for log 0xf150010
>>
>> catch my attention.
>> Thanks; we are running 1.6.6.
>>
>>
>> Brock Palen
>> www.umich.edu/~brockp
>> Center for Advanced Computing
>> brockp at umich.edu
>> (734)936-1985
>>
>>
>>
>> _______________________________________________
>> Lustre-discuss mailing list
>> Lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>
>>
>
>
>



