[lustre-discuss] Inactivated ost still showing up on the mds

Sean Brisbane sean.brisbane at physics.ox.ac.uk
Tue Feb 2 03:47:45 PST 2016


Dear All,

I am trying to do similar things to Kurt at the same time. I have attempted to decommission another OST since this thread started.

The symptom is that when I try to create a file, the operation hangs indefinitely:

touch /lustre/atlas25/atlas/testfile

I have tried this with the OST both mounted and unmounted.

Does anyone have any other pointers?

For the OSTs I want to decommission, none of these options work for me and the filesystem hangs indefinitely (in some cases I waited 20 minutes).  The OST is healthy as far as I know; it's just on old, out-of-warranty hardware, which is why I want to decommission it.  This process has previously worked for other OSTs in the filesystem.  In this new case, the OST being decommissioned is the OST with the lowest index in the filesystem — could this be the difference?
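For what it's worth, when the OST is still healthy, the usual approach is to drain it before deactivating it: stop new allocations on the MDS, then migrate existing objects off it from a client. A minimal sketch, assuming the filesystem is mounted at /lustre/atlas25 and the target is the OST from this thread (exact lfs find flags vary by Lustre version — check your lfs-find man page):

```shell
# On the MDS: stop new files landing on the OST.
lctl set_param osc.atlas25-OST0033-osc-MDT0000.active=0

# On a client: find files with objects on that OST and move them to
# other OSTs; lfs_migrate copies each file and swaps it into place.
lfs find /lustre/atlas25 --ost atlas25-OST0033_UUID -print0 |
    xargs -0 -r lfs_migrate -y
```

Once the migration finishes and the OST is empty, the deactivate/unmount steps below should no longer strand any files.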



On the clients (thanks to this thread for this); the ffff… segment of the parameter name is an instance hash, so a wildcard (llite.*.lazystatfs=1) avoids hardcoding it:

lctl set_param llite.atlas25-ffff880205397c00.lazystatfs=1

On the mds:

lctl set_param -P osc.atlas25-OST0033-osc-MDT0000.active=0

Or on the mgt (!= mds) and the clients:

lctl set_param osc.atlas25-OST0033-osc-MDT0000.active=0
lctl device 7 deactivate
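The device number passed to the deactivate step varies per node and per boot, so it is worth looking it up from `lctl dl` rather than hardcoding "7". A small sketch — the `parse_device_index` helper name is mine, not from the thread, and the sample use assumes the OST name above:

```shell
# Print the local device index of the osc/osp entry for a given OST
# target, reading `lctl dl` output on stdin. The first field of each
# `lctl dl` line is the device index.
parse_device_index() {
    awk -v t="$1" '$0 ~ (t "-osc") { print $1; exit }'
}

# Typical use on the mds:
#   lctl --device "$(lctl dl | parse_device_index atlas25-OST0033)" deactivate
```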



Thanks,
Sean

________________________________
From: lustre-discuss [lustre-discuss-bounces at lists.lustre.org] on behalf of Kurt Strosahl [strosahl at jlab.org]
Sent: 26 January 2016 18:31
To: Alexander I Kulyavtsev
Cc: <lustre-discuss at lists.lustre.org>
Subject: Re: [lustre-discuss] Inactivated ost still showing up on the mds

Unfortunately it was the pool under the OST that was corrupted, not the OST. I couldn't import it due to corruption on the pool.

Kurt J. Strosahl
System Administrator
Scientific Computing Group, Thomas Jefferson National Accelerator Facility

----- Original Message -----
From: Alexander I Kulyavtsev
To: Kurt Strosahl
Cc: Alexander I Kulyavtsev
Sent: Tue, 26 Jan 2016 13:23:20 -0500 (EST)
Subject: Re: [lustre-discuss] Inactivated ost still showing up on the mds

Hi Kurt,

Probably too late if you unlinked the files: did you do a zfs snapshot on the MDT and the damaged OST before removing files? If so, it may be possible to mount the OST's zfs as a regular zfs and pull out the objects corresponding to files, using an mdt zfs snapshot to get the fids.

Alex.

On Jan 22, 2016, at 7:39 AM, Kurt Strosahl wrote:

> Good Morning,
>
> The real issue here is that the OST was decommissioned because the zpool on which it resided died, which left about 30TB of data (and possibly several million files) to be scrubbed.
>
> The steps I took were as follows... I set active=0 on the mds, and then set lazystatfs=1 on the mds and the clients so that df commands wouldn't hang.
>
> I don't see in the documentation where you have to set the ost to active=0 on every client, did I miss that? Also that is a marked change from 1.8, where deactivating an OST just required active=0 on the mds.
>
> w/r,
> Kurt
>
> ----- Original Message -----
> From: "Sean Brisbane"
> To: "Kurt Strosahl" , "Chris Hunter"
> Cc: lustre-discuss at lists.lustre.org
> Sent: Friday, January 22, 2016 4:33:41 AM
> Subject: RE: Inactivated ost still showing up on the mds
>
> Dear Kurt,
>
> I'm not sure if this is exactly what you were trying to do, but when I decommission an OST I also deactivate the OST on the client, which means that nothing on the OST will be accessible but the filesystem will carry on happily.
>
> lctl set_param osc.lustresystem-OST00NN-osc*.active=0
>
> Thanks,
> Sean
> ________________________________________
> From: lustre-discuss [lustre-discuss-bounces at lists.lustre.org] on behalf of Kurt Strosahl [strosahl at jlab.org]
> Sent: 21 January 2016 18:09
> To: Chris Hunter
> Cc: lustre-discuss at lists.lustre.org
> Subject: Re: [lustre-discuss] Inactivated ost still showing up on the mds
>
> Good Afternoon Chris,
>
> I have already run the active=0 command on the mds, is there another step? From my testing under 2.5.3 the clients will hang indefinitely without using the lazystatfs=1.
>
> Our major issue at present is that when the OST died it had a fair amount of data on it (closing in on 2M files lost), and it seems like the client gets into a bad state when calls are made repeatedly to files that are lost (but still have their ost index information). As the crawl has unlinked files the number of errors has dropped, as have client crashes.
>
> w/r,
> Kurt
>
> ----- Original Message -----
> From: "Chris Hunter"
> To: lustre-discuss at lists.lustre.org
> Cc: "Kurt Strosahl"
> Sent: Thursday, January 21, 2016 12:50:03 PM
> Subject: [lustre-discuss] Inactivated ost still showing up on the mds
>
> Hi Kurt,
> For reference, when an underlying OST object is missing, this is the error message generated on our MDS (lustre 2.5):
>> Lustre: 12752:0:(mdd_object.c:1983:mdd_dir_page_build()) build page failed: -5!
>
> I suspect until you update the MGS info the MDS will still connect to the deactive OST.
>
> My experience is that sometimes the recipe to deactivate an OST works flawlessly, while other times the clients hang on the "df" command and time out on file access. I guess the order in which you run the commands (ie. client vs server) is important.
>
> regards,
> chris hunter
>
>> From: Kurt Strosahl
>> To: lustre-discuss at lists.lustre.org
>> Subject: [lustre-discuss] Inactivated ost still showing up on the mds
>>
>> All,
>>
>> Continuing the issues that I reported yesterday... I found that by unlinking lost files I was able to stop the below error from occurring, which gives me hope that systems will stop crashing once all the lost files are scrubbed.
>>
>> LustreError: 7676:0:(sec.c:379:import_sec_validate_get()) import ffff880623098800 (NEW) with no sec
>> LustreError: 7971:0:(sec.c:379:import_sec_validate_get()) import ffff880623098800 (NEW) with no sec
>>
>> I do note that the inactivated ost doesn't seem to ever REALLY go away. After I removed an ost from my test system I noticed that the mds still showed it...
>>
>> On a client hooked up to the test system:
>> client: lfs df
>> UUID                 1K-blocks     Used    Available  Use%  Mounted on
>> testL-MDT0000_UUID   1819458432    10112   1819446272   0%  /testlustre[MDT:0]
>> testL-OST0000_UUID  57914433152    12672  57914418432   0%  /testlustre[OST:0]
>> testL-OST0001_UUID  57914433408    12672  57914418688   0%  /testlustre[OST:1]
>> testL-OST0002_UUID  57914433408    12672  57914418688   0%  /testlustre[OST:2]
>> OST0003             : inactive device
>> testL-OST0004_UUID  57914436992   144896  57914290048   0%  /testlustre[OST:4]
>>
>> On the mds it still shows as up when I do lctl dl:
>> mds: lctl dl | grep OST0003
>>   22 UP osp testL-OST0003-osc-MDT0000 testL-MDT0000-mdtlov_UUID 5
>>
>> So I stopped the test system, ran lctl dl again (getting no results), and restarted it. Once the system was back up I still saw OST3 marked as UP with lctl dl:
>> mds: lctl dl | grep OST0003
>>   11 UP osp testL-OST0003-osc-MDT0000 testL-MDT0000-mdtlov_UUID 5
>>
>> Why does the mds still think that this OST is up?
>>
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
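On any node where a deactivation was applied, the setting can be verified by reading the parameter back. A quick sketch, using the target names from the test system in this thread (a permanently removed OST additionally needs its configuration log entries regenerated, e.g. via tunefs.lustre --writeconf, per the Lustre manual):

```shell
# 1 = active, 0 = deactivated; checked on the node where it was set.
lctl get_param osc.testL-OST0003-osc-MDT0000.active

# The osc/osp device may still be listed (even "UP") in lctl dl until
# the configuration logs are rewritten.
lctl dl | grep OST0003
```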

