[lustre-discuss] Inactivated ost still showing up on the mds
Kurt Strosahl
strosahl at jlab.org
Tue Feb 2 06:36:09 PST 2016
Good Morning Sean,
That is a bit scary, as we have older OSTs in our production system that will be decommissioned in the not-too-distant future... so I ran this scenario through my test environment.
I noticed some differences between how you removed the OST from the system and how I remove them.
You did:
mds -> lctl set_param -P osc.atlas25-OST0033-osc-MDT0000.active=0
mgs/clients -> lctl set_param osc.atlas25-OST0033-osc-MDT0000.active=0
whereas I used lctl conf_param on the combined mgs/mdt and didn't run any commands on the clients (aside from setting lazystatfs=1, which only has to be done to cover existing mount points):
mds ~> lctl conf_param testL.llite.lazystatfs=1
mds ~> lctl conf_param testL-OST0007.osc.active=0
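If you want to confirm on the mds that the deactivation actually took, something like this should print 0 for the deactivated target (the osc name below just follows the usual <fsname>-OSTxxxx-osc-MDT0000 pattern, so adjust as needed):

mds ~> lctl get_param osc.testL-OST0007-osc-MDT0000.active
osc.testL-OST0007-osc-MDT0000.active=0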
I'm also not sure what you are doing with this line...
lctl device 7 deactivate
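If that's meant to be lctl --device 7 deactivate, it might be worth checking what device 7 actually is on the node where you ran it, e.g.:

~> lctl dl | grep OST0033

since the device numbers lctl dl prints are local to each node, index 7 on the mds won't necessarily be the same device as index 7 on a client.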
My procedure, run just as I was composing this email...
Test client:
~> sudo lctl set_param llite.*.lazystatfs=1
llite.testL-ffff88057855cc00.lazystatfs=1
~> lfs df
UUID 1K-blocks Used Available Use% Mounted on
testL-MDT0000_UUID 1819458432 10624 1819445760 0% /testlustre[MDT:0]
testL-OST0000_UUID 57914437120 12672 57914422400 0% /testlustre[OST:0]
testL-OST0001_UUID 57914437120 12672 57914422400 0% /testlustre[OST:1]
testL-OST0002_UUID 57914437120 12672 57914422400 0% /testlustre[OST:2]
testL-OST0003_UUID 57914437120 12672 57914422400 0% /testlustre[OST:3]
testL-OST0004_UUID 57914437120 10624 57914424448 0% /testlustre[OST:4]
testL-OST0005_UUID 37778032384 3072 37778017408 0% /testlustre[OST:5]
testL-OST0006_UUID 37778032384 3200 37778027136 0% /testlustre[OST:6]
testL-OST0007_UUID 30293705088 3072 30293699968 0% /testlustre[OST:7]
filesystem summary: 395421955456 70656 395421858560 0% /testlustre
Combined mdt/mgs:
~> lctl conf_param testL.llite.lazystatfs=1
~> lctl conf_param testL-OST0007.osc.active=0
Back to the test client:
~> lfs df
UUID 1K-blocks Used Available Use% Mounted on
testL-MDT0000_UUID 1819458432 10752 1819445632 0% /testlustre[MDT:0]
testL-OST0000_UUID 57914437120 12672 57914422400 0% /testlustre[OST:0]
testL-OST0001_UUID 57914437120 12672 57914422400 0% /testlustre[OST:1]
testL-OST0002_UUID 57914437120 12672 57914422400 0% /testlustre[OST:2]
testL-OST0003_UUID 57914437120 12672 57914422400 0% /testlustre[OST:3]
testL-OST0004_UUID 57914437120 10624 57914424448 0% /testlustre[OST:4]
testL-OST0005_UUID 37778032384 3456 37778026880 0% /testlustre[OST:5]
testL-OST0006_UUID 37778032256 3200 37778027008 0% /testlustre[OST:6]
OST0007 : inactive device
filesystem summary: 365128250240 67968 365128167936 0% /testlustre
And creating files from a test client...
~> /usr/mpi/gcc/openmpi-1.4.3/bin/mpirun -np 12 -bynode -machinefile ./nodelist ./ior -F -e -m -g -i 10 -t 1024k -b 15g -o /testlustre/broadtest/
Then watching it create files:
~> lfs df /testlustre/
UUID 1K-blocks Used Available Use% Mounted on
testL-MDT0000_UUID 1819458432 11136 1819445248 0% /testlustre[MDT:0]
testL-OST0000_UUID 57914435712 11547904 57901246208 0% /testlustre[OST:0]
testL-OST0001_UUID 57914435712 10687616 57903165824 0% /testlustre[OST:1]
testL-OST0002_UUID 57914435712 10206592 57903029120 0% /testlustre[OST:2]
testL-OST0003_UUID 57914435712 12177664 57892853760 0% /testlustre[OST:3]
testL-OST0004_UUID 57914437120 10624 57914424448 0% /testlustre[OST:4]
testL-OST0005_UUID 37778032384 59419776 37707723392 0% /testlustre[OST:5]
testL-OST0006_UUID 37778032384 57212544 37720566528 0% /testlustre[OST:6]
OST0007 : inactive device
filesystem summary: 365128244736 161262720 364943009280 0% /testlustre
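As a spot check that none of the new files landed on the deactivated ost, lfs getstripe on the files ior just wrote should never report index 7 (the glob below is illustrative, ior picks its own file names):

~> lfs getstripe -i /testlustre/broadtest/*

(-i prints just the starting OST index for each file; with a stripe count of 1 that's the whole layout, otherwise drop the -i and eyeball the obdidx column.)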
I also unmounted and remounted the filesystem on the test node, just to make sure a remount wouldn't spring any surprises.
Ignore OST4; I'm trying to get failover to work with that OST (and as you can see it isn't working).
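One other thing that might be worth checking on your end, Sean, since the hang is on create: whether the directory you're touching in has a default layout that pins new files to that ost, e.g.:

~> lfs getstripe -d /lustre/atlas25/atlas

If the default stripe_offset there is set to the index of the deactivated ost (rather than -1), creates may keep trying to allocate on it. It could also be worth looking at the import state for that osc on a client:

~> lctl get_param osc.atlas25-OST0033-osc-*.import

to see whether the client still thinks the connection is live.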
Respectfully,
Kurt J. Strosahl
System Administrator Scientific Computing Group,
Thomas Jefferson National Accelerator Facility
----- Original Message -----
From: "Sean Brisbane" <sean.brisbane at physics.ox.ac.uk>
To: "Kurt Strosahl" <strosahl at jlab.org>, "aik" <aik at fnal.gov>
Cc: "<lustre-discuss at lists.lustre.org>" <lustre-discuss at lists.lustre.org>
Sent: Tuesday, February 2, 2016 6:47:45 AM
Subject: RE: [lustre-discuss] Inactivated ost still showing up on the mds
Dear All,
I am trying to do similar things to Kurt at the same time. I have attempted to decommission another OST since this thread started.
The symptom is that when I try to create a file, the operation hangs indefinitely.
touch /lustre/atlas25/atlas/testfile
I have tried this with the OST mounted.
I have also tried this with the OST unmounted.
Does anyone have any other pointers?
For the OSTs I want to decommission, none of these options work for me, and the filesystem hangs indefinitely (in some cases I waited 20 mins). The OST is healthy as far as I know; it's just on old, out-of-warranty hardware, which is why I want to decommission it. This process has previously worked for other OSTs in the filesystem. In this new case, the OST being decommissioned is the one with the lowest index in the filesystem; could that be the difference?
On clients (thanks to this thread for this):
lctl set_param llite.atlas25-ffff880205397c00.lazystatfs=1
on mds:
lctl set_param -P osc.atlas25-OST0033-osc-MDT0000.active=0
or on mgt (!=mds) and clients:
lctl set_param osc.atlas25-OST0033-osc-MDT0000.active=0
lctl device 7 deactivate
Thanks,
Sean
> Unfortunately it was the pool under the OST that was corrupted, not the OST. I couldn't import it due to corruption on the pool.
>
> Kurt J. Strosahl
> System Administrator Scientific Computing Group,
> Thomas Jefferson National Accelerator Facility