[lustre-discuss] Problems moving an OSS from an old Lustre installation to a new one
Massimo Sgaravatto
massimo.sgaravatto at pd.infn.it
Mon Jul 27 22:17:20 PDT 2015
Hi,
We are migrating from an old Lustre installation composed of 1 MDS and 2
OSS servers to a new Lustre 2.5.3 installation.
For the new installation we installed from scratch a new MDS and a
new OST, and we migrated the data from the old Lustre system.
Problems started when we tried to "move" an OSS from the old installation
to the new one.
For this OSS server we reinstalled the operating system from scratch
(keeping the same hostname and IP address).
Then for the OSTs we formatted the file systems using commands such as:
mkfs.lustre --reformat --fsname=cmswork
--mgsnode=t2-mds-01.lnl.infn.it@tcp0 --ost --param ost.quota_type=ug
--index=3 --mkfsoptions='-i 65536' /dev/mapper/MD1200_1p1
(t2-mds-01.lnl.infn.it is the new MDS)
and then we mounted the file systems.
Apparently this worked.
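
For readers matching the mkfs.lustre command to the log lines that follow:
the --index=3 and --fsname=cmswork options correspond to the target label
cmswork-OST0003 seen in the syslog. A minimal sketch of that mapping,
assuming the usual Lustre convention of a four-digit hexadecimal index
(the helper name ost_target_name is hypothetical, used only for
illustration):

```python
def ost_target_name(fsname: str, index: int) -> str:
    """Build an OST target label from the file system name and OST index.

    Assumes the index is rendered as four lowercase hexadecimal digits.
    """
    return f"{fsname}-OST{index:04x}"


# Values taken from the mkfs.lustre command above.
name = ost_target_name("cmswork", 3)
print(name)
```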
After a while we realized that in the syslog of this "moved" OSS there
were messages such as:
Jul 25 10:54:02 t2-oss-03 kernel: Lustre: cmswork-OST0003: haven't heard
from client cmswork-MDT0000-mdtlov_UUID (at 10.60.16.8@tcp) in 232
seconds. I think it's dead, and I am evicting it. exp ffff8803123bf400,
cur 1437814442 expire 1437814292 last 1437814210
10.60.16.8 is the IP address of the old MDS!
We have no idea why the OSS was expecting communications from it.
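
The numbers at the end of the eviction message are Unix timestamps, and
they can be checked against the "232 seconds" in the text. The sketch
below only does the arithmetic on the values quoted above; it does not
claim anything about how Lustre computes the expiry internally.

```python
import re
from datetime import datetime, timezone

# Tail of the eviction message quoted above.
TAIL = (
    "in 232 seconds. I think it's dead, and I am evicting it. "
    "exp ffff8803123bf400, cur 1437814442 expire 1437814292 last 1437814210"
)

match = re.search(r"in (\d+) seconds.*cur (\d+) expire (\d+) last (\d+)", TAIL)
silent, cur, expire, last = (int(g) for g in match.groups())

since_last = cur - last      # seconds since the last message from the client
past_expire = cur - expire   # seconds elapsed since the "expire" timestamp
cur_utc = datetime.fromtimestamp(cur, tz=timezone.utc)

print(since_last, past_expire, cur_utc.isoformat())
```

The difference cur - last equals the 232 seconds reported in the message,
and cur converts to 08:54:02 UTC on 25 July 2015, which matches the
10:54:02 syslog timestamp if the OSS clock is on UTC+2.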
At any rate, on this old MDS I unmounted the MGS and MDT file systems.
After a while users complained that there were problems with some (not
all) files written to the new OSTs, e.g.:
# ls -l
/lustre/cmswork/ronchese/pat_ntu/cmssw53B_slc6/dev08tmp/src/PDAnalysis/EDM/bin/ntu.root
ls: cannot access
/lustre/cmswork/ronchese/pat_ntu/cmssw53B_slc6/dev08tmp/src/PDAnalysis/EDM/bin/ntu.root:
Cannot allocate memory
In the syslog of the client:
Jul 26 08:01:09 t2-ui-13 kernel: LustreError: 11-0:
cmswork-OST0003-osc-ffff880818e50000: Communicating with 10.60.16.9@tcp,
operation ldlm_enqueue failed with -12.
10.60.16.9 is the IP of the "moved" OSS.
In its syslog:
Jul 26 08:01:09 t2-oss-03 kernel: LustreError:
8114:0:(ldlm_resource.c:1188:ldlm_resource_get()) cmswork-OST0003:
lvbo_init failed for resource 0xb9:0x0: rc = -2
Jul 26 08:01:09 t2-oss-03 kernel: LustreError:
8114:0:(ldlm_resource.c:1188:ldlm_resource_get()) Skipped 1 previous
similar message
Reading:
https://jira.hpdd.intel.com/browse/LU-4034
I guess that memory is not the real problem. The problem is that the
object was not found on the OST.
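
The two return codes in the logs above (-12 on the client, rc = -2 on the
OSS) can be decoded as negated errno values, which is how kernel code
conventionally reports errors. A small sketch using Python's standard
errno module (the helper name describe is hypothetical; the message
strings come from the local platform's C library):

```python
import errno
import os


def describe(rc: int) -> tuple[str, str]:
    """Map a negative kernel-style return code to (errno name, message)."""
    code = -rc
    return errno.errorcode[code], os.strerror(code)


for rc in (-12, -2):
    name, message = describe(rc)
    print(rc, name, message)
```

On Linux, -12 is ENOMEM, which is where the "Cannot allocate memory" from
ls comes from, while -2 is ENOENT, which is consistent with the object
not being found on the OST.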
Some interesting messages found in the syslog of the "moved" OSS:
Jul 24 14:56:25 t2-oss-03 kernel: Lustre: cmswork-OST0003: Received MDS
connection from 10.60.16.8@tcp, removing former export from 10.60.16.38@tcp
Jul 24 14:56:27 t2-oss-03 kernel: Lustre: cmswork-OST0003: already
connected client cmswork-MDT0000-mdtlov_UUID (at 10.60.16.8@tcp) with
handle 0xdb376ec08bf7d020. Rejecting client with the same UUID trying to
reconnect with handle 0x6dffb49bb9b3bc70
10.60.16.8 is the IP of the old MDS
10.60.16.38 is the IP of the new MDS
For the time being we disabled the OSTs hosted on the "moved" OSS so that
new objects are not written there.
Any idea what the problem is and how we could recover the system?
Thanks, Massimo