[lustre-discuss] Inactivated ost still showing up on the mds

Kurt Strosahl strosahl at jlab.org
Tue Feb 2 06:36:09 PST 2016


Good Morning Sean,

   That is a bit scary, as we have older OSTs in our production system that will be decommissioned in the not-too-distant future... so I ran this scenario through my test environment.
   I noticed some differences between how you removed the OST from your system and how I remove them from mine.

You did:
mds -> lctl set_param -P osc.atlas25-OST0033-osc-MDT0000.active=0
mgs/clients -> lctl set_param  osc.atlas25-OST0033-osc-MDT0000.active=0

whereas I used lctl conf_param on the combined mgs/mdt and did not run any commands on the clients (aside from setting lazystatfs=1, which only has to be done to cover existing mount points):
mds ~> lctl conf_param testL.llite.lazystatfs=1
mds ~> lctl conf_param testL-OST0007.osc.active=0
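
For what it's worth, I believe the deactivation can be reversed the same way by setting the parameter back to 1, though I've only ever tried that in my test environment:

mds ~> lctl conf_param testL-OST0007.osc.active=1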

I'm also not sure what you are doing with this line...
lctl device 7 deactivate
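
My guess is that's meant to be lctl --device <devno> deactivate, with the device number taken from the output of lctl dl on the mds, something along the lines of:

mds ~> lctl dl | grep OST0033
  7 UP osc atlas25-OST0033-osc-MDT0000 ...
mds ~> lctl --device 7 deactivate

but I haven't needed that form myself, so take it with a grain of salt.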

My procedure, run just as I was composing this email...

Test client:
~> sudo lctl set_param llite.*.lazystatfs=1
llite.testL-ffff88057855cc00.lazystatfs=1
~> lfs df                                                                
UUID                   1K-blocks        Used   Available Use% Mounted on               
testL-MDT0000_UUID    1819458432       10624  1819445760   0% /testlustre[MDT:0]       
testL-OST0000_UUID   57914437120       12672 57914422400   0% /testlustre[OST:0]       
testL-OST0001_UUID   57914437120       12672 57914422400   0% /testlustre[OST:1]       
testL-OST0002_UUID   57914437120       12672 57914422400   0% /testlustre[OST:2]       
testL-OST0003_UUID   57914437120       12672 57914422400   0% /testlustre[OST:3]       
testL-OST0004_UUID   57914437120       10624 57914424448   0% /testlustre[OST:4]       
testL-OST0005_UUID   37778032384        3072 37778017408   0% /testlustre[OST:5]       
testL-OST0006_UUID   37778032384        3200 37778027136   0% /testlustre[OST:6]       
testL-OST0007_UUID   30293705088        3072 30293699968   0% /testlustre[OST:7]       

filesystem summary:  395421955456       70656 395421858560   0% /testlustre

Combined mdt/mgs:
~> lctl conf_param testL.llite.lazystatfs=1
~> lctl conf_param testL-OST0007.osc.active=0
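
To double-check that the change took effect on the mds, I usually just read the parameter back (the exact parameter name may differ slightly between versions):

~> lctl get_param osc.testL-OST0007-osc-MDT0000.active
osc.testL-OST0007-osc-MDT0000.active=0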

Back to the test client:
~> lfs df
UUID                   1K-blocks        Used   Available Use% Mounted on
testL-MDT0000_UUID    1819458432       10752  1819445632   0% /testlustre[MDT:0]
testL-OST0000_UUID   57914437120       12672 57914422400   0% /testlustre[OST:0]
testL-OST0001_UUID   57914437120       12672 57914422400   0% /testlustre[OST:1]
testL-OST0002_UUID   57914437120       12672 57914422400   0% /testlustre[OST:2]
testL-OST0003_UUID   57914437120       12672 57914422400   0% /testlustre[OST:3]
testL-OST0004_UUID   57914437120       10624 57914424448   0% /testlustre[OST:4]
testL-OST0005_UUID   37778032384        3456 37778026880   0% /testlustre[OST:5]
testL-OST0006_UUID   37778032256        3200 37778027008   0% /testlustre[OST:6]
OST0007             : inactive device

filesystem summary:  365128250240       67968 365128167936   0% /testlustre

And creating files from a test client...
~> /usr/mpi/gcc/openmpi-1.4.3/bin/mpirun -np 12 -bynode -machinefile ./nodelist ./ior -F -e -m -g -i 10 -t 1024k -b 15g -o /testlustre/broadtest/
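
As an extra check, lfs getstripe on any of the resulting files should only ever show obdidx values for the active OSTs, never 7 (the exact file names depend on how ior numbers them with -F):

~> lfs getstripe /testlustre/broadtest/<one of the ior files>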

Then watching it create files:
~> lfs df /testlustre/
UUID                   1K-blocks        Used   Available Use% Mounted on
testL-MDT0000_UUID    1819458432       11136  1819445248   0% /testlustre[MDT:0]
testL-OST0000_UUID   57914435712    11547904 57901246208   0% /testlustre[OST:0]
testL-OST0001_UUID   57914435712    10687616 57903165824   0% /testlustre[OST:1]
testL-OST0002_UUID   57914435712    10206592 57903029120   0% /testlustre[OST:2]
testL-OST0003_UUID   57914435712    12177664 57892853760   0% /testlustre[OST:3]
testL-OST0004_UUID   57914437120       10624 57914424448   0% /testlustre[OST:4]
testL-OST0005_UUID   37778032384    59419776 37707723392   0% /testlustre[OST:5]
testL-OST0006_UUID   37778032384    57212544 37720566528   0% /testlustre[OST:6]
OST0007             : inactive device

filesystem summary:  365128244736   161262720 364943009280   0% /testlustre

I also unmounted and remounted the filesystem on the test node, just to make sure that wouldn't be a surprise.
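
On the test node that was just the usual umount and remount; mds@tcp below is a stand-in for whatever NID your mgs actually lives on:

~> sudo umount /testlustre
~> sudo mount -t lustre mds@tcp:/testL /testlustre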

Ignore OST4; I'm trying to get failover to work with that OST (and as you can see it isn't working).

Respectfully,
Kurt J. Strosahl
System Administrator, Scientific Computing Group,
Thomas Jefferson National Accelerator Facility

----- Original Message -----
From: "Sean Brisbane" <sean.brisbane at physics.ox.ac.uk>
To: "Kurt Strosahl" <strosahl at jlab.org>, "aik" <aik at fnal.gov>
Cc: "<lustre-discuss at lists.lustre.org>" <lustre-discuss at lists.lustre.org>
Sent: Tuesday, February 2, 2016 6:47:45 AM
Subject: RE: [lustre-discuss] Inactivated ost still showing up on the mds

Dear All,

I am trying to do similar things to Kurt at the same time. I have attempted to decommission another OST since this thread started.

The symptom is that when I try to create a file, the operation hangs indefinitely:


touch /lustre/atlas25/atlas/testfile

I have tried this with the OST mounted.
I have also tried this with the OST unmounted.

Does anyone have any other pointers?

For the OSTs I want to decommission, none of these options work for me, and the filesystem hangs indefinitely (in some cases I waited 20 mins).  The OST is healthy as far as I know; it's just old, out-of-warranty hardware, which is why I want to decommission it.  This process has previously worked for other OSTs in the filesystem.  In this new case, the OST being decommissioned is the one with the lowest index in the filesystem; could that be the difference?



On clients (thanks to this thread for this)
lctl set_param llite.atlas25-ffff880205397c00.lazystatfs=1

on mds:

lctl set_param -P osc.atlas25-OST0033-osc-MDT0000.active=0

or on mgt (!=mds) and clients:

lctl set_param  osc.atlas25-OST0033-osc-MDT0000.active=0
lctl device 7 deactivate



Thanks,
Sean

>Unfortunately it was the pool under the OST that was corrupted, not the OST. I couldn't import it
>due to corruption on the pool.
>Kurt J. Strosahl
>System Administrator, Scientific Computing Group,
>Thomas Jefferson National Accelerator Facility

