[Lustre-discuss] Problem deactivating OSTs
Nirmal Seenu
nirmal at fnal.gov
Thu Oct 6 13:17:58 PDT 2011
I am having trouble deactivating the OSTs following the instructions on the page:
http://wiki.lustre.org/manual/LustreManual18_HTML/ConfiguringLustre.html#50651184_pgfId-1298977
All our servers run RHEL 5.5 with Lustre 1.8.6 on kernel 2.6.18-238.12.1.el5_lustre.g266a955.
The Lustre clients are still at version 1.8.5.
I am trying to deactivate the OSTs listed below, and ran the following command for each one (device 66 shown):
lctl --device 66 deactivate
66 IN osc lqcdproj-OST003c-osc lqcdproj-mdtlov_UUID 5
67 IN osc lqcdproj-OST003d-osc lqcdproj-mdtlov_UUID 5
68 IN osc lqcdproj-OST003e-osc lqcdproj-mdtlov_UUID 5
69 IN osc lqcdproj-OST003f-osc lqcdproj-mdtlov_UUID 5
70 IN osc lqcdproj-OST0040-osc lqcdproj-mdtlov_UUID 5
71 IN osc lqcdproj-OST0041-osc lqcdproj-mdtlov_UUID 5
72 IN osc lqcdproj-OST0042-osc lqcdproj-mdtlov_UUID 5
73 IN osc lqcdproj-OST0043-osc lqcdproj-mdtlov_UUID 5
74 IN osc lqcdproj-OST0044-osc lqcdproj-mdtlov_UUID 5
75 IN osc lqcdproj-OST0045-osc lqcdproj-mdtlov_UUID 5
76 IN osc lqcdproj-OST0046-osc lqcdproj-mdtlov_UUID 5
77 IN osc lqcdproj-OST0047-osc lqcdproj-mdtlov_UUID 5
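For reference, the per-device commands can be generated from a listing like the one above instead of being typed one at a time. This is only a rough sketch: the function name osc_cmds is hypothetical, it assumes the listing has the column layout shown above (device number first, type third, device name fourth), and it only prints the commands so they can be reviewed before anything is run.

```shell
#!/bin/sh
# Hypothetical helper: read a device listing on stdin and print one
# "lctl --device N <action>" command per osc device whose name matches
# the given pattern. It only prints the commands (dry run).
osc_cmds() {
    action=$1      # "deactivate" or "activate"
    pattern=$2     # regex matched against the device name (4th column)
    awk -v action="$action" -v pat="$pattern" '
        $3 == "osc" && $4 ~ pat { printf "lctl --device %s %s\n", $1, action }
    '
}

# Example using two lines from the listing above:
printf '%s\n' \
    '66 IN osc lqcdproj-OST003c-osc lqcdproj-mdtlov_UUID 5' \
    '67 IN osc lqcdproj-OST003d-osc lqcdproj-mdtlov_UUID 5' |
    osc_cmds deactivate 'OST003[cd]-osc$'
```

The printed lines can be piped to sh once they look right; keeping the helper as a dry run avoids deactivating the wrong device by accident.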
A few minutes after I execute this command I see the following evictions on the OSSs:
- - - - - - - - - - - - - - dslustre11 - - - - - - - - - - - - - -
Lustre: lqcdproj-OST003c: haven't heard from client lqcdproj-mdtlov_UUID (at 172.19.11.211@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Lustre: Skipped 5 previous similar messages
- - - - - - - - - - - - - - dslustre12 - - - - - - - - - - - - - -
Lustre: lqcdproj-OST0044: haven't heard from client lqcdproj-mdtlov_UUID (at 172.19.11.211@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Lustre: Skipped 5 previous similar messages
Lustre: lqcdproj-OST0046: haven't heard from client lqcdproj-mdtlov_UUID (at 172.19.11.211@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Lustre: Skipped 2 previous similar messages
The OSTs receive an MDS connection again once I re-activate them by running the following command for each OST:
lctl --device 66 activate
- - - - - - - - - - - - - - dslustre11 - - - - - - - - - - - - - -
Lustre: lqcdproj-OST003c: received MDS connection from 172.19.11.211@tcp
Lustre: Skipped 2 previous similar messages
Lustre: lqcdproj-OST003f: received MDS connection from 172.19.11.211@tcp
Lustre: Skipped 1 previous similar message
- - - - - - - - - - - - - - dslustre12 - - - - - - - - - - - - - -
Lustre: lqcdproj-OST0042: received MDS connection from 172.19.11.211@tcp
Lustre: lqcdproj-OST0047: received MDS connection from 172.19.11.211@tcp
Lustre: Skipped 4 previous similar messages
The Lustre clients eventually lose their connection to the above OSTs, and "lfs check servers" reports the following error:
error: check 'lqcdproj-OST003e-osc-ffff88021ee57800' Resource temporarily unavailable
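To list every affected target at once, the error lines can be filtered down to just the names. This is a small sketch that assumes each failure is reported in the same "error: check '<name>' <message>" form as the line above; the function name failed_targets is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read "lfs check servers" output on stdin and print
# the quoted target name from every line of the form
#   error: check '<name>' <message>
failed_targets() {
    sed -n "s/^error: check '\([^']*\)'.*/\1/p"
}

# Example using the error line above:
printf '%s\n' \
    "error: check 'lqcdproj-OST003e-osc-ffff88021ee57800' Resource temporarily unavailable" |
    failed_targets
```

On a client this would be used as "lfs check servers 2>&1 | failed_targets", with stderr merged in because I have not checked which stream the errors are written to.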
I would really appreciate it if someone could point me to the correct procedure for "Removing an OST from the File System".
Does running "lfs check servers" after doing a "lctl conf_param <OST name>.osc.active=0" still crash the node, or has that problem been fixed over the last few versions?
Thanks in advance for your help.
Nirmal