[Lustre-discuss] recover borked mds

Brock Palen brockp at umich.edu
Wed Aug 19 09:57:45 PDT 2009


After a network event (switches bouncing up and down rapidly over a
few hours), it looks like our MDS got borked somewhere during all the
random failovers.

Now we cannot mount the MDS. When we try, we get the following errors:

Aug 19 12:37:39 mds2 kernel: LustreError: 137-5: UUID 'nobackup-MDT0000_UUID' is not available  for connect (no target)
Aug 19 12:37:39 mds2 kernel: LustreError: 7455:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19)  req@000001037c9db600 x85226/t0 o38-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1250699959 ref 1 fl Interpret:/0/0 rc -19/0
Aug 19 12:37:39 mds2 kernel: LustreError: 137-5: UUID 'nobackup-MDT0000_UUID' is not available  for connect (no target)
Aug 19 12:37:39 mds2 kernel: LustreError: 7456:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19)  req@00000104163a6000 x47117/t0 o38-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1250699959 ref 1 fl Interpret:/0/0 rc -19/0
Aug 19 12:37:39 mds2 kernel: LustreError: 137-5: UUID 'nobackup-MDT0000_UUID' is not available  for connect (no target)
Aug 19 12:37:39 mds2 kernel: LustreError: Skipped 11 previous similar messages
Aug 19 12:37:39 mds2 kernel: LustreError: 7468:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19)  req@0000010350a4d200 x81788/t0 o38-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1250699959 ref 1 fl Interpret:/0/0 rc -19/0
Aug 19 12:37:39 mds2 kernel: LustreError: 7468:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 11 previous similar messages
Aug 19 12:37:40 mds2 kernel: LustreError: 137-5: UUID 'nobackup-MDT0000_UUID' is not available  for connect (no target)
Aug 19 12:37:40 mds2 kernel: LustreError: Skipped 18 previous similar messages
Aug 19 12:37:40 mds2 kernel: LustreError: 7455:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19)  req@0000010414dc1850 x81855/t0 o38-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1250699960 ref 1 fl Interpret:/0/0 rc -19/0
Aug 19 12:37:40 mds2 kernel: LustreError: 7455:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 18 previous similar messages
Aug 19 12:37:42 mds2 kernel: LustreError: 137-5: UUID 'nobackup-MDT0000_UUID' is not available  for connect (no target)
Aug 19 12:37:42 mds2 kernel: LustreError: Skipped 42 previous similar messages
Aug 19 12:37:42 mds2 kernel: LustreError: 7466:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19)  req@000001037c9db600 x77144/t0 o38-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1250699962 ref 1 fl Interpret:/0/0 rc -19/0
Aug 19 12:37:42 mds2 kernel: LustreError: 7466:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 42 previous similar messages
Aug 19 12:37:43 mds2 kernel: Lustre: Request x3 sent from MGC10.164.3.246@tcp to NID 10.164.3.246@tcp 5s ago has timed out (limit 5s).
Aug 19 12:37:43 mds2 kernel: Lustre: Changing connection for MGC10.164.3.246@tcp to MGC10.164.3.246@tcp_1/0@lo
Aug 19 12:37:43 mds2 kernel: Lustre: Enabling user_xattr
Aug 19 12:37:43 mds2 kernel: Lustre: 7524:0:(mds_fs.c:493:mds_init_server_data()) RECOVERY: service nobackup-MDT0000, 439 recoverable clients, last_transno 3647966566
Aug 19 12:37:43 mds2 kernel: Lustre: MDT nobackup-MDT0000 now serving dev (nobackup-MDT0000/57dddb69-2475-b551-4100-e045f91ce38c), but will be in recovery for at least 5:00, or until 439 clients reconnect. During this time new clients will not be allowed to connect. Recovery progress can be monitored by watching /proc/fs/lustre/mds/nobackup-MDT0000/recovery_status.
Aug 19 12:37:43 mds2 kernel: Lustre: 7524:0:(lproc_mds.c:273:lprocfs_wr_group_upcall()) nobackup-MDT0000: group upcall set to /usr/sbin/l_getgroups
Aug 19 12:37:43 mds2 kernel: Lustre: nobackup-MDT0000.mdt: set parameter group_upcall=/usr/sbin/l_getgroups
Aug 19 12:37:43 mds2 kernel: Lustre: 7524:0:(mds_lov.c:1070:mds_notify()) MDS nobackup-MDT0000: in recovery, not resetting orphans on nobackup-OST0000_UUID
Aug 19 12:37:43 mds2 kernel: Lustre: nobackup-MDT0000: temporarily refusing client connection from 10.164.1.104@tcp
Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_lvfs.c:612:llog_lvfs_create()) error looking up logfile 0xf150010:0x80d24629: rc -2
Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_cat.c:176:llog_cat_id2handle()) error opening log id 0xf150010:80d24629: rc -2
Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_obd.c:262:cat_cancel_cb()) Cannot find handle for log 0xf150010
Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(llog_obd.c:329:llog_obd_origin_setup()) llog_process with cat_cancel_cb failed: -2
Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(osc_request.c:3664:osc_llog_init()) failed LLOG_MDS_OST_ORIG_CTXT
Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(osc_request.c:3675:osc_llog_init()) osc 'nobackup-OST0000-osc' tgt 'nobackup-MDT0000' cnt 1 catid 00000101e1d979e8 rc=-2
Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(osc_request.c:3677:osc_llog_init()) logid 0xf150002:0x9642a0ac
Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(lov_log.c:230:lov_llog_init()) error osc_llog_init idx 0 osc 'nobackup-OST0000-osc' tgt 'nobackup-MDT0000' (rc=-2)
Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(mds_log.c:220:mds_llog_init()) lov_llog_init err -2
Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(llog_obd.c:417:llog_cat_initialize()) rc: -2
Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(lov_obd.c:727:lov_add_target()) add failed (-2), deleting nobackup-OST0000_UUID
Aug 19 12:37:43 mds2 kernel: LustreError: 7524:0:(obd_config.c:1093:class_config_llog_handler()) Err -2 on cfg command:
Aug 19 12:37:43 mds2 kernel: Lustre:    cmd=cf00d 0:nobackup-mdtlov  1:nobackup-OST0000_UUID  2:0  3:1
Aug 19 12:37:43 mds2 kernel: LustreError: 15c-8: MGC10.164.3.246@tcp: The configuration from log 'nobackup-MDT0000' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Aug 19 12:37:43 mds2 kernel: LustreError: 7438:0:(obd_mount.c:1113:server_start_targets()) failed to start server nobackup-MDT0000: -2
Aug 19 12:37:44 mds2 kernel: LustreError: 7438:0:(obd_mount.c:1623:server_fill_super()) Unable to start targets: -2
Aug 19 12:37:44 mds2 kernel: Lustre: Failing over nobackup-MDT0000
Aug 19 12:37:44 mds2 kernel: Lustre: *** setting obd nobackup-MDT0000 device 'unknown-block(8,16)' read-only ***

We have run e2fsck on the volume; it found a few errors and corrected
them, but the problem persists.  We also tried mounting with
-o abort_recov; this resulted in an assertion (LBUG) and does not work.
Any thoughts?  The lines:
Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_lvfs.c:612:llog_lvfs_create()) error looking up logfile 0xf150010:0x80d24629: rc -2
Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_cat.c:176:llog_cat_id2handle()) error opening log id 0xf150010:80d24629: rc -2
Aug 19 12:37:43 mds2 kernel: LustreError: 7525:0:(llog_obd.c:262:cat_cancel_cb()) Cannot find handle for log 0xf150010

catch my attention.
Thanks.  We are running 1.6.6.
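
For reading the rc values in the log above: Lustre reports kernel-style negative errno codes, so -19 and -2 can be looked up in the standard errno table. A minimal sketch for decoding them, assuming Linux errno numbering (the helper name decode_rc is only illustrative and is not part of any Lustre tool):

```python
import errno
import os


def decode_rc(rc):
    """Map a kernel-style negative return code to (errno name, description)."""
    code = abs(rc)
    # errno.errorcode maps numeric codes to symbolic names such as 'ENOENT'.
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


# The two return codes that appear in the log excerpt.
for rc in (-19, -2):
    name, text = decode_rc(rc)
    print(f"rc {rc}: {name} ({text})")
```

On Linux, -19 decodes to ENODEV and -2 to ENOENT, which matches the "no target" connect errors and the "error looking up logfile" messages respectively.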


Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985





