[Lustre-discuss] recover borked mds
Brock Palen
brockp at umich.edu
Wed Aug 19 09:57:45 PDT 2009
After a network event (switches bouncing), it looks like our MDS got
borked somewhere in all the random failovers (the switches came up and
went down rapidly over a few hours).
Now we cannot mount the MDS; when we try, we get the following errors:
Aug 19 12:37:39 mds2 kernel: LustreError: 137-5: UUID 'nobackup-
MDT0000_UUID' is not available for connect (no target)
Aug 19 12:37:39 mds2 kernel: LustreError: 7455:0:(ldlm_lib.c:
1619:target_send_reply_msg()) @@@ processing error (-19)
req at 000001037c9db600 x85226/t0 o38-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl
1250699959 ref 1 fl Interpret:/0/0 rc -19/0
Aug 19 12:37:39 mds2 kernel: LustreError: 137-5: UUID 'nobackup-
MDT0000_UUID' is not available for connect (no target)
Aug 19 12:37:39 mds2 kernel: LustreError: 7456:0:(ldlm_lib.c:
1619:target_send_reply_msg()) @@@ processing error (-19)
req at 00000104163a6000 x47117/t0 o38-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl
1250699959 ref 1 fl Interpret:/0/0 rc -19/0
Aug 19 12:37:39 mds2 kernel: LustreError: 137-5: UUID 'nobackup-
MDT0000_UUID' is not available for connect (no target)
Aug 19 12:37:39 mds2 kernel: LustreError: Skipped 11 previous similar
messages
Aug 19 12:37:39 mds2 kernel: LustreError: 7468:0:(ldlm_lib.c:
1619:target_send_reply_msg()) @@@ processing error (-19)
req at 0000010350a4d200 x81788/t0 o38-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl
1250699959 ref 1 fl Interpret:/0/0 rc -19/0
Aug 19 12:37:39 mds2 kernel: LustreError: 7468:0:(ldlm_lib.c:
1619:target_send_reply_msg()) Skipped 11 previous similar messages
Aug 19 12:37:40 mds2 kernel: LustreError: 137-5: UUID 'nobackup-
MDT0000_UUID' is not available for connect (no target)
Aug 19 12:37:40 mds2 kernel: LustreError: Skipped 18 previous similar
messages
Aug 19 12:37:40 mds2 kernel: LustreError: 7455:0:(ldlm_lib.c:
1619:target_send_reply_msg()) @@@ processing error (-19)
req at 0000010414dc1850 x81855/t0 o38-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl
1250699960 ref 1 fl Interpret:/0/0 rc -19/0
Aug 19 12:37:40 mds2 kernel: LustreError: 7455:0:(ldlm_lib.c:
1619:target_send_reply_msg()) Skipped 18 previous similar messages
Aug 19 12:37:42 mds2 kernel: LustreError: 137-5: UUID 'nobackup-
MDT0000_UUID' is not available for connect (no target)
Aug 19 12:37:42 mds2 kernel: LustreError: Skipped 42 previous similar
messages
Aug 19 12:37:42 mds2 kernel: LustreError: 7466:0:(ldlm_lib.c:
1619:target_send_reply_msg()) @@@ processing error (-19)
req at 000001037c9db600 x77144/t0 o38-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl
1250699962 ref 1 fl Interpret:/0/0 rc -19/0
Aug 19 12:37:42 mds2 kernel: LustreError: 7466:0:(ldlm_lib.c:
1619:target_send_reply_msg()) Skipped 42 previous similar messages
Aug 19 12:37:43 mds2 kernel: Lustre: Request x3 sent from
MGC10.164.3.246 at tcp to NID 10.164.3.246 at tcp 5s ago has timed out
(limit 5s).
Aug 19 12:37:43 mds2 kernel: Lustre: Changing connection for
MGC10.164.3.246 at tcp to MGC10.164.3.246 at tcp_1/0 at lo
Aug 19 12:37:43 mds2 kernel: Lustre: Enabling user_xattr
Aug 19 12:37:43 mds2 kernel: Lustre: 7524:0:(mds_fs.c:
493:mds_init_server_data()) RECOVERY: service nobackup-MDT0000, 439
recoverable clients, last_transno 3647966566
Aug 19 12:37:43 mds2 kernel: Lustre: MDT nobackup-MDT0000 now serving
dev (nobackup-MDT0000/57dddb69-2475-b551-4100-e045f91ce38c), but will
be in recovery for at least 5:00, or until 439 clients reconnect.
During this time new clients will not be allowed to connect. Recovery
progress can be monitored by watching
/proc/fs/lustre/mds/nobackup-MDT0000/recovery_status.
Aug 19 12:37:43 mds2 kernel: Lustre: 7524:0:(lproc_mds.c:
273:lprocfs_wr_group_upcall()) nobackup-MDT0000: group upcall set to
/usr/sbin/l_getgroups
Aug 19 12:37:43 mds2 kernel: Lustre: nobackup-MDT0000.mdt: set
parameter group_upcall=/usr/sbin/l_getgroups
Aug 19 12:37:43 mds2 kernel: Lustre: 7524:0:(mds_lov.c:
1070:mds_notify()) MDS nobackup-MDT0000: in recovery, not resetting
orphans on nobackup-OST0000_UUID
Aug 19 12:37:43 mds2 kernel: Lustre: nobackup-MDT0000: temporarily
refusing client connection from 10.164.1.104 at tcp
Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_lvfs.c:
612:llog_lvfs_create()) error looking up logfile 0xf150010:0x80d24629:
rc -2
Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_cat.c:
176:llog_cat_id2handle()) error opening log id 0xf150010:80d24629: rc -2
Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_obd.c:
262:cat_cancel_cb()) Cannot find handle for log 0xf150010
Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(llog_obd.c:
329:llog_obd_origin_setup()) llog_process with cat_cancel_cb failed: -2
Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(osc_request.c:
3664:osc_llog_init()) failed LLOG_MDS_OST_ORIG_CTXT
Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(osc_request.c:
3675:osc_llog_init()) osc 'nobackup-OST0000-osc' tgt 'nobackup-
MDT0000' cnt 1 catid 00000101e1d979e8 rc=-2
Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(osc_request.c:
3677:osc_llog_init()) logid 0xf150002:0x9642a0ac
Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(lov_log.c:
230:lov_llog_init()) error osc_llog_init idx 0 osc 'nobackup-OST0000-
osc' tgt 'nobackup-MDT0000' (rc=-2)
Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(mds_log.c:
220:mds_llog_init()) lov_llog_init err -2
Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(llog_obd.c:
417:llog_cat_initialize()) rc: -2
Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(lov_obd.c:
727:lov_add_target()) add failed (-2), deleting nobackup-OST0000_UUID
Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(obd_config.c:
1093:class_config_llog_handler()) Err -2 on cfg command:
Aug 19 12:37:43 mds2 kernel: Lustre: cmd=cf00d 0:nobackup-mdtlov
1:nobackup-OST0000_UUID 2:0 3:1
Aug 19 12:37:43 mds2 kernel: LustreError: 15c-8: MGC10.164.3.246 at tcp:
The configuration from log 'nobackup-MDT0000' failed (-2). This may be
the result of communication errors between this node and the MGS, a
bad configuration, or other errors. See the syslog for more
information.
Aug 19 12:37:43 mds2 kernel: LustreError: 7438:0:(obd_mount.c:
1113:server_start_targets()) failed to start server nobackup-MDT0000: -2
Aug 19 12:37:44 mds2 kernel: LustreError: 7438:0:(obd_mount.c:
1623:server_fill_super()) Unable to start targets: -2
Aug 19 12:37:44 mds2 kernel: Lustre: Failing over nobackup-MDT0000
Aug 19 12:37:44 mds2 kernel: Lustre: *** setting obd nobackup-MDT0000
device 'unknown-block(8,16)' read-only ***
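For anyone reading the rc values in the log above: they are negative Linux errno codes. A minimal Python sketch for decoding them follows; the helper name describe_rc is made up for illustration and is not part of any Lustre tooling.

```python
import errno
import os


def describe_rc(rc: int) -> str:
    """Translate a negative kernel-style return code into its errno name and text.

    describe_rc is a hypothetical helper for illustration only.
    """
    code = abs(rc)
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"rc {rc}: {name} ({os.strerror(code)})"


# The two codes that appear in the log: -19 on the connect replies, -2 on the llog setup.
for rc in (-19, -2):
    print(describe_rc(rc))
```

On Linux, -19 maps to ENODEV and -2 to ENOENT, which matches the "no target" and "error looking up logfile" wording of the messages.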
We have run e2fsck on the volume; it found a few errors and corrected them,
but the problem persists. We also tried mounting with -o abort_recov;
this resulted in an assertion (LBUG) and did not work.
Any thoughts? The lines:
Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_lvfs.c:
612:llog_lvfs_create()) error looking up logfile 0xf150010:0x80d24629:
rc -2
Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_cat.c:
176:llog_cat_id2handle()) error opening log id 0xf150010:80d24629: rc -2
Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_obd.c:
262:cat_cancel_cb()) Cannot find handle for log 0xf150010
catch my attention.
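Entries like these can be pulled out of a mail-wrapped syslog excerpt mechanically. The following is a rough Python sketch, assuming every entry starts with a "Mon DD HH:MM:SS" timestamp and that LustreError entries carry a "pid:0:(file.c:line:function())" prefix as in the log above. The function names unwrap and llog_errors are made up for illustration. Rejoining wrapped lines with a space will also split any token that was wrapped mid-word.

```python
import re

# Location prefix seen in LustreError entries: pid:0:(file.c:line:function()) message
LOCATION = re.compile(
    r"LustreError: (\d+):\d+:\((\w+\.c):\s*(\d+):(\w+)\(\)\)\s*(.*)"
)
# A new syslog entry starts with a timestamp such as "Aug 19 12:37:43 ".
STAMP = re.compile(r"^[A-Z][a-z]{2} +\d+ \d\d:\d\d:\d\d ")


def unwrap(lines):
    """Rejoin hard-wrapped lines into one string per syslog entry."""
    entry = ""
    for line in lines:
        if STAMP.match(line) and entry:
            yield entry
            entry = line.strip()
        else:
            entry = (entry + " " + line.strip()).strip()
    if entry:
        yield entry


def llog_errors(lines):
    """Return (function, message) for each LustreError raised from an llog_*.c file."""
    found = []
    for entry in unwrap(lines):
        m = LOCATION.search(entry)
        if m and m.group(2).startswith("llog_"):
            found.append((m.group(4), m.group(5)))
    return found


sample = """Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_lvfs.c:
612:llog_lvfs_create()) error looking up logfile 0xf150010:0x80d24629:
rc -2
Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_obd.c:
262:cat_cancel_cb()) Cannot find handle for log 0xf150010
Aug 19 12:37:43 mds2 kernel: Lustre: Enabling user_xattr""".splitlines()

for function, message in llog_errors(sample):
    print(function, "->", message)
```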
Thanks. We are running Lustre 1.6.6.
Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985