[lustre-discuss] Problems moving an OSS from an old Lustre installation to a new one

Massimo Sgaravatto massimo.sgaravatto at pd.infn.it
Tue Jul 28 03:49:20 PDT 2015


I forgot to say that the filesystem names in the old and in the new 
Lustre installations are the same.



On 28/07/2015 07:17, Massimo Sgaravatto wrote:
> Hi
>
> We are migrating from an old Lustre installation composed of 1 MDS and 2
> OSS to a new Lustre 2.5.3 installation.
>
> For this second installation we installed from scratch a new MDS + a
> new OST, and we migrated the data from the old Lustre system.
>
>
> Problems started when we tried to "move" an OSS from the old installation
> to the new one.
>
> For this OSS server we reinstalled the operating system from scratch
> (keeping the same hostname and IP address).
> Then for the OSTs we formatted the file systems using commands such as:
>
>
>   mkfs.lustre --reformat --fsname=cmswork
> --mgsnode=t2-mds-01.lnl.infn.it@tcp0 --ost --param ost.quota_type=ug
> --index=3 --mkfsoptions='-i 65536' /dev/mapper/MD1200_1p1
>
>
> (t2-mds-01.lnl.infn.it is the new MDS)
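
The --fsname and --index values in the command above are what show up later in the logs as the target name cmswork-OST0003. A minimal sketch of that mapping, assuming the usual Lustre convention of the filesystem name, a dash, the target type, and the index as four hexadecimal digits (target_name is a hypothetical helper for illustration, not a Lustre tool):

```python
def target_name(fsname: str, index: int, kind: str = "OST") -> str:
    """Build a Lustre-style target name such as 'cmswork-OST0003'.

    Assumption: the name is '<fsname>-<kind><index as 4 hex digits>'.
    """
    return f"{fsname}-{kind}{index:04x}"


# Values taken from the mkfs.lustre command quoted above.
print(target_name("cmswork", 3))

# The MDT that appears in the log messages.
print(target_name("cmswork", 0, kind="MDT"))
```

Because both installations use the same filesystem name, a target with a given index carries the same name in the old and in the new system.
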
>
> and then we mounted the file systems
>
> Apparently this worked.
>
>
> After a while we realized that in the syslog of this "moved" OSS there
> were messages such as:
>
> Jul 25 10:54:02 t2-oss-03 kernel: Lustre: cmswork-OST0003: haven't heard
> from client cmswork-MDT0000-mdtlov_UUID (at 10.60.16.8@tcp) in 232
> seconds. I think it's dead, and I am evicting it. exp ffff8803123bf400,
> cur 1437814442 expire 1437814292 last 1437814210
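
The three numbers at the end of that message can be checked against the "232 seconds" it reports. A small sketch, assuming cur, expire and last are Unix timestamps in seconds (an interpretation that the arithmetic supports, not something the message states):

```python
from datetime import datetime, timezone

# Values copied from the eviction message quoted above.
cur, expire, last = 1437814442, 1437814292, 1437814210

silence = cur - last    # seconds since the OSS last heard from this client
overdue = cur - expire  # seconds past the expiry deadline

print(silence)  # 232, the figure reported in the message
print(overdue)  # 150

# The same instant in UTC, for comparison with the syslog timestamp.
cur_utc = datetime.fromtimestamp(cur, tz=timezone.utc).isoformat()
print(cur_utc)  # 2015-07-25T08:54:02+00:00
```

The UTC time is two hours behind the 10:54:02 in the syslog line, which would fit a host logging in Central European Summer Time. So the OSS had heard from the client at 10.60.16.8 about four minutes before it evicted it.
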
>
>
> 10.60.16.8 is the IP address of the old MDS!
>
>
> I have no idea why the OSS was expecting communications from it.
> At any rate, on this old MDS I unmounted the MGS and MDT file systems.
>
>
> After a while users complained about problems with some (not all) of
> the files written to the new OSTs, e.g.:
>
> # ls -l /lustre/cmswork/ronchese/pat_ntu/cmssw53B_slc6/dev08tmp/src/PDAnalysis/EDM/bin/ntu.root
> ls: cannot access /lustre/cmswork/ronchese/pat_ntu/cmssw53B_slc6/dev08tmp/src/PDAnalysis/EDM/bin/ntu.root: Cannot allocate memory
>
>
> In the syslog of the client:
>
> Jul 26 08:01:09 t2-ui-13 kernel: LustreError: 11-0:
> cmswork-OST0003-osc-ffff880818e50000: Communicating with 10.60.16.9@tcp,
> operation ldlm_enqueue failed with -12.
>
>
> 10.60.16.9 is the IP of the "moved" OSS.
> In its syslog:
>
>
> Jul 26 08:01:09 t2-oss-03 kernel: LustreError:
> 8114:0:(ldlm_resource.c:1188:ldlm_resource_get()) cmswork-OST0003:
> lvbo_init failed for resource 0xb9:0x0: rc = -2
> Jul 26 08:01:09 t2-oss-03 kernel: LustreError:
> 8114:0:(ldlm_resource.c:1188:ldlm_resource_get()) Skipped 1 previous
> similar message
>
>
> Reading:
>
> https://jira.hpdd.intel.com/browse/LU-4034
>
> I guess the memory is not the real problem. The problem is that the
> object was not found in the OST.
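
The two return codes in the logs above can be decoded to check this reading. A minimal sketch, assuming the usual kernel convention that rc values are negated errno numbers (errno_name is a hypothetical helper for illustration):

```python
import errno
import os


def errno_name(rc: int) -> str:
    """Map a kernel-style return code (negative errno) to its symbolic name."""
    return errno.errorcode[-rc]


# -12 is what the client reported; -2 is what the OSS logged for lvbo_init.
for rc in (-12, -2):
    print(rc, errno_name(rc), os.strerror(-rc))
```

On Linux, -12 is ENOMEM, whose message is the "Cannot allocate memory" that ls printed, and -2 is ENOENT ("No such file or directory"). That fits the guess that the client-side memory error is how a missing object on the OST ends up being reported.
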
>
>
> Some interesting messages found in the syslog of the "moved" OSS:
>
> Jul 24 14:56:25 t2-oss-03 kernel: Lustre: cmswork-OST0003: Received MDS
> connection from 10.60.16.8@tcp, removing former export from 10.60.16.38@tcp
>
> Jul 24 14:56:27 t2-oss-03 kernel: Lustre: cmswork-OST0003: already
> connected client cmswork-MDT0000-mdtlov_UUID
> (at 10.60.16.8@tcp) with handle 0xdb376ec08bf7d020. Rejecting client
> with the same UUID trying to reconnect with
> handle 0x6dffb49bb9b3bc70
>
> 10.60.16.8 is the IP of the old MDS
> 10.60.16.38 is the IP of the new MDS
>
>
> For the time being we have disabled the OSTs hosted on the "moved" OSS so
> that new objects are not written there.
>
>
> Any idea what the problem is and how we could recover the system?
>
>
>
> Thanks, Massimo
>
>
>
