[lustre-discuss] Questions about migrate OSTs from ldiskfs to zfs

Alexander I Kulyavtsev aik at fnal.gov
Tue Mar 1 15:26:37 PST 2016


Is there a way to run a 2.5.3-llnl server with a 1.8.9 client?
Or do you have a hint as to what could cause an LBUG when mounting two Lustre file systems (old 1.8.8 and new 2.5.3) from a 1.8.9 client? There is no ldiskfs on the new system, ZFS only.

I'm copying my previous posting at the end of this mail for the specific error. Is it related to the FID format changes and FID namespace ranges in 1.8 / 2.4 / 2.5?

Thanks, Alex.

Subject: lustre 1.8.9 client with LLNL server 2.5.3  LBUG
Date: Wed, 27 Jan 2016 15:29:24 -0600

Does anyone have experience running a Lustre 1.8.9 client with an LLNL 2.5.3 server (ZFS)?

I was almost instantly getting an LBUG related to an IGIF FID assertion after the mount:

dsg0515 kernel: LustreError: 30899:0:(mdc_fid.c:334:fid_le_to_cpu()) ASSERTION(fid_is_igif(dst) || fid_ver(dst) == 0) failed: [0x293e750000006ada:0x70f8b3:0xa721a500]
(full stack dump at the end of email).
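
As a rough aid for reading these messages, here is a minimal Python sketch that pulls the FID out of a log line and evaluates the same condition the assertion states (FID is IGIF, or its version field is 0). The IGIF sequence range used here (12 through 0xffffffff) is an assumption based on the Lustre 2.x headers; the 1.8.9 client may define the range differently. The names `Fid`, `parse_fid`, `is_igif` and `assertion_holds` are hypothetical helpers, not Lustre code.

```python
import re
from typing import NamedTuple

# ASSUMPTION: IGIF sequence range as in Lustre 2.x headers; the 1.8.9
# client source may use different bounds.
FID_SEQ_IGIF = 12
FID_SEQ_IGIF_MAX = 0xFFFFFFFF


class Fid(NamedTuple):
    seq: int  # 64-bit sequence
    oid: int  # 32-bit object id
    ver: int  # 32-bit version


# Matches "[0xSEQ:0xOID:0xVER]", tolerating spaces around the colons.
FID_RE = re.compile(
    r"\[(0x[0-9a-fA-F]+)\s*:\s*(0x[0-9a-fA-F]+)\s*:\s*(0x[0-9a-fA-F]+)\]"
)


def parse_fid(text: str) -> Fid:
    """Extract the first [seq:oid:ver] triple found in a log line."""
    m = FID_RE.search(text)
    if m is None:
        raise ValueError("no FID found in %r" % text)
    return Fid(*(int(g, 16) for g in m.groups()))


def is_igif(fid: Fid) -> bool:
    """True if the sequence falls in the (assumed) IGIF range."""
    return FID_SEQ_IGIF <= fid.seq <= FID_SEQ_IGIF_MAX


def assertion_holds(fid: Fid) -> bool:
    """Mirror of: fid_is_igif(dst) || fid_ver(dst) == 0."""
    return is_igif(fid) or fid.ver == 0


if __name__ == "__main__":
    line = ("ASSERTION(fid_is_igif(dst) || fid_ver(dst) == 0) failed: "
            "[0x293e750000006ada:0x70f8b3:0xa721a500]")
    fid = parse_fid(line)
    print(fid, "assertion holds:", assertion_holds(fid))
```

Under that assumed range, both FIDs quoted in this mail fail the check because their sequence is above 0xffffffff and their version field is nonzero.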

This happened only when I tried to mount two Lustre file systems (the old 1.8.9 servers and the new 2.5.3 servers) on the same 1.8.9 client during tests last summer. The new 2.5.3 system was freshly formatted, with only a little data written to it from a 2.5.3 client.
I would like to try the LLNL 2.5.3 server with a 1.8.9 client again.

Apparently I'm missing something obvious.
I realize it is not a supported or "tested" configuration, but we have been successfully running a similar configuration with Intel's last GA release, the 2.5.3 server, for more than half a year, with HPC clusters doing IO on both Lustre file systems and a few nodes doing 'cp' between the old and new ones, checksumming and stats.

We still need the double mount (1.8 and 2.5) for another month, until we finish the migration. We will need to run 1.8.9 clients for six more months. I'm trying to reassess whether I can use 2.5.3-llnl Lustre on reinstalled servers in this configuration.

Lustre 1.8.9 (or 2.5.3) client with LLNL 2.5.3 server only - runs fine.
Lustre 1.8.9 client mounting both 1.8 servers and Intel's 2.5.3 servers - runs fine.
Lustre 1.8.9 client mounting both 1.8 servers and LLNL 2.5.3 servers - crashes after mount or after a few operations.
I was able to make it last longer by mounting in a certain order and doing "ls" on a few existing files, but it crashes some time later during IO.

The reported FIDs look real, but I also saw
[0xdead000000100100 :0x200200 :0xdead0000]
[0x5a5a5a5a5a5a5a5a :0x5a5a5a5a :0x5a5a5a5a]

which correspond to poison values such as
CONFIG_ILLEGAL_POINTER_VALUE
# define LI_POISON ((int)0x5a5a5a5a)
or the like.
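
One way to see why those two FIDs look like poison rather than data is to overlay the 16-byte FID layout (64-bit seq, 32-bit oid, 32-bit ver, little-endian) on poisoned memory. The sketch below is only an illustration, assuming an x86_64 Linux kernel where CONFIG_ILLEGAL_POINTER_VALUE is 0xdead000000000000 and the list poison pointers are 0x00100100 and 0x00200200 plus that value; the helper name `fid_from_bytes` is hypothetical. It does not identify where in the client the stale memory comes from.

```python
import struct

# ASSUMPTION: x86_64 defaults for the kernel's poison constants.
POISON_POINTER_DELTA = 0xDEAD000000000000  # CONFIG_ILLEGAL_POINTER_VALUE
LIST_POISON1 = 0x00100100 + POISON_POINTER_DELTA  # list_head.next after list_del()
LIST_POISON2 = 0x00200200 + POISON_POINTER_DELTA  # list_head.prev after list_del()
LI_POISON = 0x5A5A5A5A  # value quoted in the mail


def fid_from_bytes(raw: bytes):
    """Interpret 16 bytes as (seq: u64, oid: u32, ver: u32), little-endian."""
    return struct.unpack("<QII", raw)


# 16 bytes as they would sit in memory for a deleted list_head (next, prev).
deleted_list_head = struct.pack("<QQ", LIST_POISON1, LIST_POISON2)

# 16 bytes of a buffer filled with the 0x5a poison pattern.
poisoned_buffer = struct.pack("<I", LI_POISON) * 4

for name, raw in (("deleted list_head", deleted_list_head),
                  ("0x5a-poisoned buffer", poisoned_buffer)):
    seq, oid, ver = fid_from_bytes(raw)
    print("%-22s [%#x:%#x:%#x]" % (name, seq, oid, ver))
```

Read this way, the first FID matches a deleted list entry and the second matches a poisoned buffer, which would be consistent with the client decoding a FID from memory that no longer holds one. That is an interpretation of the values, not a confirmed diagnosis.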

I tried comparing branch 2.5.3-llnl with the Whamcloud 2_5 branch at tag 2.5.3, and also at tag 2.5.3.90.
I did not find any commit messages related to IGIF FIDs among the commits that differ, though I guess a patch could contain a code change not mentioned in its commit message that I missed.

I would appreciate any hints on where to look to make it work, and on what difference is causing this LBUG.

Thanks in advance, Alex.


Jun  1 15:01:10 dsg0515 kernel: LustreError: 4541:0:(mdc_fid.c:334:fid_le_to_cpu()) ASSERTION(fid_is_igif(dst) || fid_ver(dst) == 0) failed: [0x600000005:0x7:0xffffffff]
Jun  1 15:01:10 dsg0515 kernel: LustreError: 4541:0:(mdc_fid.c:334:fid_le_to_cpu()) LBUG
Jun  1 15:01:10 dsg0515 kernel: Pid: 4541, comm: ls
Jun  1 15:01:10 dsg0515 kernel:
Jun  1 15:01:10 dsg0515 kernel: Call Trace:
Jun  1 15:01:10 dsg0515 kernel:  [<ffffffff810df1c4>] ? generic_permission+0x24/0xc0
Jun  1 15:01:10 dsg0515 kernel:  [<ffffffffa0eb3847>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
Jun  1 15:01:10 dsg0515 kernel:  [<ffffffffa0eb3de6>] lbug_with_loc+0x76/0xe0 [libcfs]
Jun  1 15:01:10 dsg0515 kernel:  [<ffffffffa1135ee5>] fid_le_to_cpu+0xa5/0xb0 [mdc]
Jun  1 15:01:10 dsg0515 kernel:  [<ffffffffa11a3c45>] ll_readdir+0x935/0xb00 [lustre]
Jun  1 15:01:10 dsg0515 kernel:  [<ffffffff810d5b47>] ? nameidata_to_filp+0x57/0x70
Jun  1 15:01:10 dsg0515 kernel:  [<ffffffff810af1d9>] ? __inc_zone_state+0x9/0x70
Jun  1 15:01:10 dsg0515 kernel:  [<ffffffff810a3099>] ? __lru_cache_add+0x9/0x70
Jun  1 15:01:10 dsg0515 kernel:  [<ffffffff810a3119>] ? lru_cache_add_lru+0x19/0x40
Jun  1 15:01:10 dsg0515 kernel:  [<ffffffff810e5710>] ? filldir+0x0/0xf0
Jun  1 15:01:10 dsg0515 kernel:  [<ffffffff810e5710>] ? filldir+0x0/0xf0
Jun  1 15:01:10 dsg0515 kernel:  [<ffffffff810e58ac>] vfs_readdir+0xac/0xd0
Jun  1 15:01:10 dsg0515 kernel:  [<ffffffff810e5b66>] sys_getdents+0x86/0xe0
Jun  1 15:01:10 dsg0515 kernel:  [<ffffffff81420def>] ? page_fault+0x1f/0x30
Jun  1 15:01:10 dsg0515 kernel:  [<ffffffff8100b2fb>] system_call_fastpath+0x16/0x1b
Jun  1 15:01:10 dsg0515 kernel:
Jun  1 15:01:10 dsg0515 kernel: LustreError: dumping log to /tmp/lustre-log.1433188870.4541


On Mar 1, 2016, at 3:44 PM, Drokin, Oleg <oleg.drokin at intel.com> wrote:


On Mar 1, 2016, at 4:14 PM, Christopher J. Morrone wrote:

On 03/01/2016 09:18 AM, Alexander I Kulyavtsev wrote:

is tag 2.5.3.90 considered stable?

No.  Generally speaking you do not want to use anything with number 50
or greater for the fourth number unless you are helping out with testing
during the development process.

I think you are mixing up things and it is the 3rd number at 50 or above
that is the development code.


2.5.3 was the last official release on branch b2_5 before it was
discontinued.


In this case 2.5.3.90 is "almost" 2.5.4 but not quite.
We wanted to have a tag in b2_5 before commits there ceased so that we can refer
to it by a version number vs the "tip of b2_5".
It contains various fixes on top of 2.5.3, but as far as I know, it did not
undergo the actual release testing that the point release would normally undergo.

Since the b2_5 at the time was a maintenance branch, we mostly tried to place
important fixes there that should not have broken anything.
But due to lack of proper release testing this could not be guaranteed by us,
so using it is still a bit of a leap of faith, but not quite as much as say
2.5.51.0 or 2.6.55.0 or the like.

Bye,
   Oleg
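
The version-numbering rule of thumb from the reply above can be written down as a small Python sketch. It encodes only what the thread says: a third number of 50 or above marks development code, and a tag such as 2.5.3.90 is a snapshot of a maintenance branch that did not go through release testing. Treating any nonzero fourth number as such a snapshot is an assumption generalized from that single example, and the function name and labels are hypothetical. Vendor suffixes such as "-llnl" are not handled.

```python
def classify_lustre_version(version: str) -> str:
    """Classify a dotted Lustre version string per the thread's rule of thumb.

    Returns one of the hypothetical labels:
      "development"          third number >= 50 (e.g. 2.5.51.0, 2.6.55.0)
      "maintenance snapshot" nonzero fourth number on a release (e.g. 2.5.3.90);
                             ASSUMPTION generalized from that one example
      "release"              everything else (e.g. 2.5.3)
    """
    parts = [int(p) for p in version.split(".")]
    if not 3 <= len(parts) <= 4:
        raise ValueError("expected 3 or 4 dotted numbers, got %r" % version)
    third = parts[2]
    fourth = parts[3] if len(parts) == 4 else 0
    if third >= 50:
        return "development"
    if fourth != 0:
        return "maintenance snapshot"
    return "release"


if __name__ == "__main__":
    for v in ("2.5.3", "2.5.3.90", "2.5.51.0", "2.6.55.0"):
        print(v, "->", classify_lustre_version(v))
```

The third number is checked first so that development tags are never mistaken for maintenance snapshots, whatever their fourth number is.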

