[Lustre-discuss] Problem deactivating OSTs

Nirmal Seenu nirmal at fnal.gov
Thu Oct 6 13:17:58 PDT 2011


I am having trouble deactivating the OSTs following the instructions on the page:
http://wiki.lustre.org/manual/LustreManual18_HTML/ConfiguringLustre.html#50651184_pgfId-1298977

All our servers run RHEL 5.5 with Lustre 1.8.6 on the kernel 2.6.18-238.12.1.el5_lustre.g266a955.

The Lustre clients are still at version 1.8.5.

I am trying to deactivate the OSTs listed below, and I ran the following command for each of them, substituting the device number:

lctl --device 66 deactivate

  66 IN osc lqcdproj-OST003c-osc lqcdproj-mdtlov_UUID 5
  67 IN osc lqcdproj-OST003d-osc lqcdproj-mdtlov_UUID 5
  68 IN osc lqcdproj-OST003e-osc lqcdproj-mdtlov_UUID 5
  69 IN osc lqcdproj-OST003f-osc lqcdproj-mdtlov_UUID 5
  70 IN osc lqcdproj-OST0040-osc lqcdproj-mdtlov_UUID 5
  71 IN osc lqcdproj-OST0041-osc lqcdproj-mdtlov_UUID 5
  72 IN osc lqcdproj-OST0042-osc lqcdproj-mdtlov_UUID 5
  73 IN osc lqcdproj-OST0043-osc lqcdproj-mdtlov_UUID 5
  74 IN osc lqcdproj-OST0044-osc lqcdproj-mdtlov_UUID 5
  75 IN osc lqcdproj-OST0045-osc lqcdproj-mdtlov_UUID 5
  76 IN osc lqcdproj-OST0046-osc lqcdproj-mdtlov_UUID 5
  77 IN osc lqcdproj-OST0047-osc lqcdproj-mdtlov_UUID 5
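
For reference, running the same command across device indexes 66 through 77 could be sketched as the loop below. This is a minimal dry run that only prints the commands, not a recommended procedure. The helper name print_deactivate_cmds is hypothetical, and the sketch assumes the device indexes are contiguous, as they are in the listing above.

```shell
# Dry run: print one deactivate command per device index; nothing is executed.
# print_deactivate_cmds is a hypothetical helper name, not a Lustre tool.
print_deactivate_cmds() {
    first=$1
    last=$2
    dev=$first
    while [ "$dev" -le "$last" ]; do
        echo "lctl --device $dev deactivate"
        dev=$((dev + 1))
    done
}

# Device indexes 66 through 77, as in the listing above.
print_deactivate_cmds 66 77
```

Printing the commands first makes it possible to check the index range against the device listing before anything is run.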

A few minutes after I execute this command I see the following evictions on the OSSs:

- - - - - - - - - - - - - - dslustre11 - - - - - - - - - - - - - -
Lustre: lqcdproj-OST003c: haven't heard from client lqcdproj-mdtlov_UUID (at 172.19.11.211@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Lustre: Skipped 5 previous similar messages

- - - - - - - - - - - - - - dslustre12 - - - - - - - - - - - - - -
Lustre: lqcdproj-OST0044: haven't heard from client lqcdproj-mdtlov_UUID (at 172.19.11.211@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Lustre: Skipped 5 previous similar messages
Lustre: lqcdproj-OST0046: haven't heard from client lqcdproj-mdtlov_UUID (at 172.19.11.211@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Lustre: Skipped 2 previous similar messages


The OSSs then log a new MDS connection once I re-activate those OSTs by running the following command for each OST:
lctl --device 66 activate

- - - - - - - - - - - - - - dslustre11 - - - - - - - - - - - - - -
Lustre: lqcdproj-OST003c: received MDS connection from 172.19.11.211@tcp
Lustre: Skipped 2 previous similar messages

Lustre: lqcdproj-OST003f: received MDS connection from 172.19.11.211@tcp
Lustre: Skipped 1 previous similar message

- - - - - - - - - - - - - - dslustre12 - - - - - - - - - - - - - -
Lustre: lqcdproj-OST0042: received MDS connection from 172.19.11.211@tcp
Lustre: lqcdproj-OST0047: received MDS connection from 172.19.11.211@tcp
Lustre: Skipped 4 previous similar messages


The Lustre clients eventually lose their connection to the above OSTs, and "lfs check servers" reports the following error:
error: check 'lqcdproj-OST003e-osc-ffff88021ee57800' Resource temporarily unavailable

I would really appreciate it if someone could point me to the correct procedure for "Removing an OST from the File System".

Does running "lfs check servers" after doing a "lctl conf_param <OST name>.osc.active=0" still crash the node, or has that problem been fixed over the last few versions?
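
To make the conf_param form in the question concrete, here is a minimal sketch of how the command string might be built for the OSTs above. It assumes the OST name is the osc device name with the trailing "-osc" removed (for example lqcdproj-OST003c-osc becomes lqcdproj-OST003c). The helper name print_conf_param_cmd is hypothetical, and the sketch only prints the command. Whether running it is safe on these versions is exactly the open question.

```shell
# Dry run: turn an osc device name from the device listing into the matching
# conf_param command string. Nothing is executed; the command is only printed.
# print_conf_param_cmd is a hypothetical helper name, not a Lustre tool.
print_conf_param_cmd() {
    osc_name=$1                 # e.g. lqcdproj-OST003c-osc
    ost_name=${osc_name%-osc}   # assumption: OST name = osc name minus "-osc"
    echo "lctl conf_param ${ost_name}.osc.active=0"
}

# Two of the device names from the listing, as an illustration.
for osc in lqcdproj-OST003c-osc lqcdproj-OST003d-osc; do
    print_conf_param_cmd "$osc"
done
```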

Thanks in advance for your help.
Nirmal


