[lustre-discuss] Problems moving an OSS from an old Lustre installation to a new one

Massimo Sgaravatto massimo.sgaravatto at pd.infn.it
Mon Jul 27 22:17:20 PDT 2015


Hi

We are migrating from an old Lustre installation composed of 1 MDS and 2 
OSS to a new Lustre 2.5.3 installation.

For this second installation we installed a new MDS and a new OST from 
scratch, and we migrated the data from the old Lustre system.


Problems started when we tried to "move" an OSS from the old installation 
to the new one.

For this OSS server we reinstalled the operating system from scratch 
(keeping the same hostname and IP address).
Then for the OSTs we formatted the file systems using commands such as:


  mkfs.lustre --reformat --fsname=cmswork \
      --mgsnode=t2-mds-01.lnl.infn.it@tcp0 --ost --param ost.quota_type=ug \
      --index=3 --mkfsoptions='-i 65536' /dev/mapper/MD1200_1p1


(t2-mds-01.lnl.infn.it is the new MDS)
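
For readers matching the command above against the log messages: Lustre names each OST after the file system name plus the index as four hexadecimal digits, so --fsname=cmswork with --index=3 should correspond to the cmswork-OST0003 target seen in the logs. A minimal sketch of that mapping (ost_target_name is a hypothetical helper written for illustration, not a Lustre tool):

```shell
# Hypothetical helper: build the OST target name from the file system
# name and the decimal index given to mkfs.lustre.
ost_target_name() {
    # $1 = fsname, $2 = OST index (decimal)
    printf '%s-OST%04x\n' "$1" "$2"
}

ost_target_name cmswork 3    # prints cmswork-OST0003
```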

and then we mounted the file systems

Apparently this worked.


After a while we realized that in the syslog of this "moved" OSS there 
were messages such as:

Jul 25 10:54:02 t2-oss-03 kernel: Lustre: cmswork-OST0003: haven't heard 
from client cmswork-MDT0000-mdtlov_UUID (at 10.60.16.8@tcp) in 232 
seconds. I think it's dead, and I am evicting it. exp ffff8803123bf400, 
cur 1437814442 expire 1437814292 last 1437814210


10.60.16.8 is the IP address of the old MDS!


I have no idea why the OSS was expecting communication from it.
At any rate, I unmounted the MGS and MDT file systems on this old MDS.


After a while users started complaining about problems with some (not 
all) files written to the new OSTs, e.g.:

# ls -l
/lustre/cmswork/ronchese/pat_ntu/cmssw53B_slc6/dev08tmp/src/PDAnalysis/EDM/bin/ntu.root

ls: cannot access
/lustre/cmswork/ronchese/pat_ntu/cmssw53B_slc6/dev08tmp/src/PDAnalysis/EDM/bin/ntu.root:
Cannot allocate memory


In the syslog of the client:

Jul 26 08:01:09 t2-ui-13 kernel: LustreError: 11-0:
cmswork-OST0003-osc-ffff880818e50000: Communicating with 10.60.16.9@tcp,
operation ldlm_enqueue failed with -12.


10.60.16.9 is the IP of the "moved" OSS.
In its syslog:


Jul 26 08:01:09 t2-oss-03 kernel: LustreError:
8114:0:(ldlm_resource.c:1188:ldlm_resource_get()) cmswork-OST0003:
lvbo_init failed for resource 0xb9:0x0: rc = -2
Jul 26 08:01:09 t2-oss-03 kernel: LustreError:
8114:0:(ldlm_resource.c:1188:ldlm_resource_get()) Skipped 1 previous
similar message


Reading:

https://jira.hpdd.intel.com/browse/LU-4034

I guess that memory is not the real problem. The problem is that the 
object was not found on the OST.
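
To make that reasoning explicit: the client reports -12 while the OSS reports rc = -2, and on Linux these are the negated errno values ENOMEM and ENOENT. A minimal sketch that decodes only these two codes (lustre_rc_name is a hypothetical helper, and only the values appearing in the logs above are handled):

```shell
# Hypothetical helper: map the negative return codes seen in the logs
# to Linux errno names. Only the two values from the logs are covered.
lustre_rc_name() {
    # Strip a leading minus sign, then look up the errno number.
    case "${1#-}" in
        2)  echo "ENOENT (No such file or directory)" ;;
        12) echo "ENOMEM (Cannot allocate memory)" ;;
        *)  echo "unknown" ;;
    esac
}

lustre_rc_name -2     # rc reported by lvbo_init on the OSS
lustre_rc_name -12    # code returned to the client by ldlm_enqueue
```

The "Cannot allocate memory" text printed by ls is simply the standard message for errno 12.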


Some interesting messages found in the syslog of the "moved" OSS:

Jul 24 14:56:25 t2-oss-03 kernel: Lustre: cmswork-OST0003: Received MDS 
connection from 10.60.16.8@tcp, removing former export from 10.60.16.38@tcp

Jul 24 14:56:27 t2-oss-03 kernel: Lustre: cmswork-OST0003: already 
connected client cmswork-MDT0000-mdtlov_UUID (at 10.60.16.8@tcp) with 
handle 0xdb376ec08bf7d020. Rejecting client with the same UUID trying to 
reconnect with handle 0x6dffb49bb9b3bc70

10.60.16.8 is the IP of the old MDS
10.60.16.38 is the IP of the new MDS
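
A rough sketch of how such syslog lines could be labelled by which MDS they refer to, using the two addresses given above. The function and variable names are hypothetical, and the script only looks at the first IPv4 address in each line, so it is an illustration rather than a complete log parser:

```shell
# Addresses taken from the message above.
OLD_MDS=10.60.16.8
NEW_MDS=10.60.16.38

# Hypothetical helper: report whether the first IPv4 address found in a
# log line belongs to the old MDS, the new MDS, or some other host.
classify_mds_nid() {
    ip=$(printf '%s\n' "$1" | grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' | head -n 1)
    case "$ip" in
        "$OLD_MDS") echo "old MDS ($ip)" ;;
        "$NEW_MDS") echo "new MDS ($ip)" ;;
        "")         echo "no address found" ;;
        *)          echo "other host ($ip)" ;;
    esac
}

classify_mds_nid "cmswork-OST0003: Received MDS connection from 10.60.16.8@tcp"
```

Run over the OSS syslog (for example line by line in a while-read loop), this would show how often the old MDS still appears as the peer.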


For the time being we have disabled the OSTs hosted on the "moved" OSS so 
that new objects are not written there.


Any idea what the problem is and how we could recover the system?



Thanks, Massimo
