[Lustre-discuss] osc lost on MDS server

Lu Wang wanglu at ihep.ac.cn
Thu Nov 12 19:24:12 PST 2009


     We brought the 2 servers back into the cluster. After 15 hours of running, we got these errors in /var/log/messages:

Nov 13 10:37:04 beshome01 kernel: LustreError: 2359:0:(llog_obd.c:211:llog_add()) No ctxt
Nov 13 10:37:04 beshome01 kernel: LustreError: 2359:0:(llog_obd.c:211:llog_add()) Skipped 2 previous similar messages
Nov 13 10:39:49 beshome01 kernel: LustreError: 2360:0:(llog_obd.c:211:llog_add()) No ctxt
Nov 13 10:39:49 beshome01 kernel: LustreError: 2360:0:(llog_obd.c:211:llog_add()) Skipped 16 previous similar messages
Nov 13 10:52:43 beshome01 kernel: LustreError: 2332:0:(llog_obd.c:211:llog_add()) No ctxt
Nov 13 10:52:43 beshome01 kernel: LustreError: 2332:0:(llog_obd.c:211:llog_add()) Skipped 265 previous similar messages
Nov 13 10:53:46 beshome01 kernel: LustreError: 2346:0:(llog_obd.c:211:llog_add()) No ctxt
Nov 13 10:53:46 beshome01 kernel: LustreError: 2346:0:(llog_obd.c:211:llog_add()) Skipped 105 previous similar messages
Nov 13 10:54:38 beshome01 kernel: LustreError: 2335:0:(llog_obd.c:211:llog_add()) No ctxt
Nov 13 10:54:38 beshome01 kernel: LustreError: 2335:0:(llog_obd.c:211:llog_add()) Skipped 4 previous similar messages
Nov 13 10:56:04 beshome01 kernel: LustreError: 2356:0:(llog_obd.c:211:llog_add()) No ctxt
Nov 13 10:56:04 beshome01 kernel: LustreError: 2356:0:(llog_obd.c:211:llog_add()) Skipped 5 previous similar messages
Nov 13 10:59:26 beshome01 kernel: LustreError: 2357:0:(llog_obd.c:211:llog_add()) No ctxt
Nov 13 10:59:26 beshome01 kernel: LustreError: 2357:0:(llog_obd.c:211:llog_add()) Skipped 3 previous similar messages
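
Because the kernel rate-limits repeated messages, the number of printed "No ctxt" lines understates how often llog_add() actually failed. The following is a minimal sketch, not a Lustre tool, that adds up the printed and suppressed counts. The function name count_no_ctxt and the two-pair sample are illustrative assumptions; the log lines themselves are copied from the excerpt above.

```python
import re

# Matches the kernel's rate-limit summary, e.g.
# "... Skipped 265 previous similar messages"
SKIPPED = re.compile(r"Skipped (\d+) previous similar messages?")


def count_no_ctxt(lines):
    """Return (printed, suppressed) counts for llog_add() 'No ctxt' errors."""
    printed = 0
    suppressed = 0
    for line in lines:
        if "llog_add()" not in line:
            continue
        match = SKIPPED.search(line)
        if match:
            # Summary line: the kernel dropped this many identical messages.
            suppressed += int(match.group(1))
        elif line.rstrip().endswith("No ctxt"):
            printed += 1
    return printed, suppressed


sample = [
    "Nov 13 10:37:04 beshome01 kernel: LustreError: 2359:0:(llog_obd.c:211:llog_add()) No ctxt",
    "Nov 13 10:37:04 beshome01 kernel: LustreError: 2359:0:(llog_obd.c:211:llog_add()) Skipped 2 previous similar messages",
    "Nov 13 10:39:49 beshome01 kernel: LustreError: 2360:0:(llog_obd.c:211:llog_add()) No ctxt",
    "Nov 13 10:39:49 beshome01 kernel: LustreError: 2360:0:(llog_obd.c:211:llog_add()) Skipped 16 previous similar messages",
]

print(count_no_ctxt(sample))
```

Applied to all seven pairs in the excerpt, the same count gives 7 printed and 400 suppressed messages within about 22 minutes.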

     Since we still have 2 "UP" osc devices on the MDS, and all the osc devices are "UP" on the Lustre clients, users feel the system is back to normal. However, new objects can only be created on 2 OSTs. When the write I/O load increases, we get:
   
 Nov 12 18:50:28 beshome01 kernel: Lustre: 2599:0:(filter_io_26.c:714:filter_commitrw_write()) besfs2-OST0002: slow i_mutex 30s


------------------				 
Lu Wang
2009-11-13

-------------------------------------------------------------
From: Lu Wang
Sent: 2009-11-12 17:52:05
To: lustre-discuss
Cc:
Subject: Re: [Lustre-discuss] osc lost on MDS server

Hi list,
	We have tried again to recover the system to a consistent state, with the following steps:
1. Pulled out the 10Gbit Ethernet links connecting to the computing cluster, and connected the 2 servers with a direct Ethernet link. This step isolated the 2 servers from the computing cluster to avoid interference from running clients (which may have been unmounted uncleanly).
2. umount all the OSTs.
3. umount the MDT.
4. Mount the MDT as ldiskfs, rm all files under CONFIGS (these files were confirmed to be wrong), and umount.
5. Run tunefs.lustre --erase-params --mgs --mdt --fsname=besfs2 --writeconf /dev/sda1 on the MDT device. This command returned a fatal error saying it assumed this was an upgrade from 1.4 to 1.6. It tried to copy the client file from /tmp/****/LOG/ but failed, so it made a log file from "last_rcvd".
6. We ignored the error and mounted the MDT successfully. lctl dl showed 5 devices.
7. Mount every OST as ldiskfs, rm all files under CONFIGS, and umount.
8. Run tunefs.lustre --erase-params --ost --mgsnode=192.168.50.50 --index=<old index> --fsname=besfs2 --writeconf /dev/sd* on each OST. This command also reported the assumption about a 1.4 to 1.6 upgrade, and then made a log file from "last_rcvd". However, there was no fatal error.
9. We mounted the OSTs one by one.
10. This time we could see an osc for every OST; however, only 2 osc are UP, the other 5 are IN.
  0 UP mgs MGS MGS 7
  1 UP mgc MGC192.168.50.50@tcp 26aae9d0-202e-abf3-3cb0-746eea59d7a4 5
  2 UP mdt MDS MDS_uuid 3
  3 UP lov besfs2-mdtlov besfs2-mdtlov_UUID 4
  4 UP mds besfs2-MDT0000 besfs2-MDT0000_UUID 3
  5 IN osc besfs2-OST0000-osc besfs2-mdtlov_UUID 5
  6 UP osc besfs2-OST0001-osc besfs2-mdtlov_UUID 5
  7 IN osc besfs2-OST0003-osc besfs2-mdtlov_UUID 5
  8 UP osc besfs2-OST0002-osc besfs2-mdtlov_UUID 5
  9 IN osc besfs2-OST0004-osc besfs2-mdtlov_UUID 5
 10 IN osc besfs2-OST0005-osc besfs2-mdtlov_UUID 5
 11 IN osc besfs2-OST0006-osc besfs2-mdtlov_UUID 5
 12 UP ost OSS OSS_uuid 3
 13 UP obdfilter besfs2-OST0000 besfs2-OST0000_UUID 5
 14 UP obdfilter besfs2-OST0001 besfs2-OST0001_UUID 5
 15 UP obdfilter besfs2-OST0002 besfs2-OST0002_UUID 5
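
With more OSTs, picking out the osc devices that are not UP by eye gets tedious. Below is a minimal sketch that filters a device listing like the one above. It assumes the whitespace-separated column order seen in this output (index, state, type, name, UUID, refcount); the function name inactive_oscs is made up for illustration, and the sample is a subset of the listing above.

```python
def inactive_oscs(lctl_dl_output):
    """Return the names of osc devices whose state is not UP."""
    names = []
    for line in lctl_dl_output.splitlines():
        fields = line.split()
        # Assumed columns: index, state, type, name, uuid, refcount
        if len(fields) < 4 or fields[2] != "osc":
            continue
        if fields[1] != "UP":
            names.append(fields[3])
    return names


sample = """\
  4 UP mds besfs2-MDT0000 besfs2-MDT0000_UUID 3
  5 IN osc besfs2-OST0000-osc besfs2-mdtlov_UUID 5
  6 UP osc besfs2-OST0001-osc besfs2-mdtlov_UUID 5
  7 IN osc besfs2-OST0003-osc besfs2-mdtlov_UUID 5
  8 UP osc besfs2-OST0002-osc besfs2-mdtlov_UUID 5
"""

print(inactive_oscs(sample))
```

On a live MDS the input could come from running lctl dl through subprocess, which is left out here so the sketch runs anywhere.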

We found these errors in /var/log/messages:
kernel: LustreError: 2407:0:(llog_lvfs.c:612:llog_lvfs_create()) error looking up logfile 0xbc28013:0xf77a298: rc -2
Nov 12 16:57:03 beshome01 kernel: LustreError: 2407:0:(llog_cat.c:176:llog_cat_id2handle()) error opening log id 0xbc28013:f77a298: rc -2
Nov 12 16:57:03 beshome01 kernel: LustreError: 2407:0:(llog_obd.c:262:cat_cancel_cb()) Cannot find handle for log 0xbc28013
Nov 12 16:57:03 beshome01 kernel: LustreError: 2398:0:(llog_obd.c:329:llog_obd_origin_setup()) llog_process with cat_cancel_cb failed: -2
Nov 12 16:57:03 beshome01 kernel: LustreError: 2398:0:(osc_request.c:3664:osc_llog_init()) failed LLOG_MDS_OST_ORIG_CTXT
Nov 12 16:57:03 beshome01 kernel: LustreError: 2398:0:(osc_request.c:3675:osc_llog_init()) osc 'besfs2-OST0000-osc' tgt 'besfs2-MDT0000' cnt 1 catid 00000104110f1ce8 rc=-2
Nov 12 16:57:03 beshome01 kernel: LustreError: 2398:0:(osc_request.c:3677:osc_llog_init()) logid 0xbc28002:0x9a60e39f
Nov 12 16:57:03 beshome01 kernel: LustreError: 2398:0:(lov_log.c:230:lov_llog_init()) error osc_llog_init idx 0 osc 'besfs2-OST0000-osc' tgt 'besfs2-MDT0000' (rc=-2)
Nov 12 16:57:03 beshome01 kernel: LustreError: 2398:0:(mds_log.c:220:mds_llog_init()) lov_llog_init err -2
Nov 12 16:57:03 beshome01 kernel: LustreError: 2398:0:(llog_obd.c:417:llog_cat_initialize()) rc: -2
Nov 12 16:57:03 beshome01 kernel: LustreError: 2398:0:(mds_lov.c:916:__mds_lov_synchronize()) besfs2-OST0000_UUID failed at update_mds: -2
Nov 12 16:57:03 beshome01 kernel: LustreError: 2398:0:(mds_lov.c:959:__mds_lov_synchronize()) besfs2-OST0000_UUID sync failed -2, deactivating
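
The rc values in these kernel messages are negated Linux errno codes, so rc -2 can be decoded with the Python standard library. This is a small sketch; the helper name describe_rc is an assumption made for illustration.

```python
import errno
import os


def describe_rc(rc):
    """Map a negative kernel return code to (symbolic name, description)."""
    code = abs(rc)
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


# Every failure in the excerpt above reports rc -2.
name, message = describe_rc(-2)
print(name, "-", message)
```

Here -2 decodes to ENOENT, which matches the first message in the chain: the llog file referenced by the catalog could not be found, and the failure then propagates up to the osc being deactivated.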




Any ideas?




------------------				 
Lu Wang
2009-11-12

-------------------------------------------------------------
From: huangql
Sent: 2009-11-12 09:25:44
To: Andreas Dilger; Lu Wang
Cc: lustre-discuss
Subject: Re: Re: [Lustre-discuss] osc lost on MDS server

Hi, dear list,

We have done steps 1 to 5, but we can still see only 2 of the 7 osc devices on the MDS, although we can see all 7 osc devices on a mounted client. We then ran e2fsck on the OSTs, but got the same result.

We also hit another problem: after doing steps 1 to 4, unmounting the ldiskfs filesystem, and then running step 5, we got the logs below.
We have no idea why there are 63 clients, or where the MDS gets the client information from, given that we removed CONFIGS/* and took the network links down.

Nov 12 09:02:37 beshome01 kernel: Lustre: 2474:0:(mds_fs.c:493:mds_init_server_data()) RECOVERY: service besfs2-MDT0000, 63 recoverable clients, last_transno 118077355
Nov 12 09:02:37 beshome01 kernel: Lustre: MDT besfs2-MDT0000 now serving dev (besfs2-MDT0000/5ceb6ad6-e810-9fae-4862-8ed0913bf7e7), but will be in recovery for at least 5:00, or until 63 clients reconnect. During this time new clients will not be allowed to connect. Recovery progress can be monitored by watching /proc/fs/lustre/mds/besfs2-MDT0000/recovery_status.
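
The recovery_status file named in the message above can be read programmatically instead of being watched by hand. The sketch below assumes the file consists of "key: value" lines; the exact field names depend on the Lustre version, and the sample text is made up purely to exercise the parser, not captured from this system. The function names are illustrative.

```python
def parse_status(text):
    """Parse 'key: value' lines into a dict, ignoring lines without a colon."""
    status = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            status[key.strip()] = value.strip()
    return status


def read_recovery_status(
    path="/proc/fs/lustre/mds/besfs2-MDT0000/recovery_status",
):
    """Read and parse the recovery status file (path taken from the log)."""
    with open(path) as handle:
        return parse_status(handle.read())


# Hypothetical sample content, for illustration only.
sample = "status: RECOVERING\nconnected_clients: 0/63\n"
print(parse_status(sample))
```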
2009-11-12 



huangql 



From: Andreas Dilger
Sent: 2009-11-12 06:19:53
To: Lu Wang
Cc: lustre-discuss
Subject: Re: [Lustre-discuss] osc lost on MDS server
 
On 2009-11-11, at 05:59, Lu Wang wrote:
> Dear list,
>  Our MDS lost 5 of its 7 osc devices after a reconfiguration:
>  1. umount all OSTs
>  2. tunefs.lustre --writeconf --mgs --mdt /dev/sda1
>  3. mount -t ldiskfs mdtdevice mountpoint
>  4. rm CONFIGS/*
>  5. mount -t lustre mdtdevice mountpoint
Where is it documented to delete all of the files in CONFIGS?
This undoes the action of step #2 above, and isn't a good idea.
Presumably there was also a step 4b to unmount the filesystem
of type ldiskfs?
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
_______________________________________________
Lustre-discuss mailing list
Lustre-discuss at lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss