[lustre-discuss] lustre 1.8.9 client with LLNL server 2.5.3 LBUG

Alexander I Kulyavtsev aik at fnal.gov
Wed Jan 27 13:29:24 PST 2016


Does anyone have experience running a Lustre 1.8.9 client with an LLNL 2.5.3 server (ZFS)?

I got an LBUG from an IGIF FID assertion almost immediately after the mount:

dsg0515 kernel: LustreError: 30899:0:(mdc_fid.c:334:fid_le_to_cpu()) ASSERTION(fid_is_igif(dst) || fid_ver(dst) == 0) failed: [0x293e750000006ada:0x70f8b3:0xa721a500]
(full stack dump at the end of email).
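
For reference, the failing check can be replayed offline on the FID printed in the log. This is only a rough Python sketch, not the client code: it assumes the bracketed triple is [seq:oid:ver] and borrows the IGIF sequence range (12 to 0xffffffff) from the 2.x headers, which the 1.8.9 client may define differently. The helper names are made up for illustration.

```python
import re

# Assumed IGIF sequence range, borrowed from the Lustre 2.x headers
# (FID_SEQ_IGIF .. FID_SEQ_IGIF_MAX); the 1.8.9 client may differ.
FID_SEQ_IGIF = 12
FID_SEQ_IGIF_MAX = 0xFFFFFFFF

# Matches "[0x...:0x...:0x...]", tolerating spaces around the colons.
FID_RE = re.compile(
    r"\[\s*(0x[0-9a-fA-F]+)\s*:\s*(0x[0-9a-fA-F]+)\s*:\s*(0x[0-9a-fA-F]+)\s*\]"
)


def parse_fid(text):
    """Return (seq, oid, ver) from the first '[seq:oid:ver]' triple in text."""
    match = FID_RE.search(text)
    if match is None:
        raise ValueError("no FID triple found")
    return tuple(int(group, 16) for group in match.groups())


def fid_is_igif(seq):
    """True if seq falls in the assumed IGIF range."""
    return FID_SEQ_IGIF <= seq <= FID_SEQ_IGIF_MAX


def assertion_holds(fid):
    """Mirror of the logged condition: fid_is_igif(dst) || fid_ver(dst) == 0."""
    seq, _oid, ver = fid
    return fid_is_igif(seq) or ver == 0


line = (
    "LustreError: 30899:0:(mdc_fid.c:334:fid_le_to_cpu()) "
    "ASSERTION(fid_is_igif(dst) || fid_ver(dst) == 0) failed: "
    "[0x293e750000006ada:0x70f8b3:0xa721a500]"
)
fid = parse_fid(line)
print([hex(field) for field in fid], "assertion holds:", assertion_holds(fid))
```

Under these assumptions the logged FID fails on both counts: the sequence is far outside the IGIF range and the version field is non-zero.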

This happened only when I tried to mount two Lustre file systems (the old 1.8.9 servers and the new 2.5.3 servers) on the same 1.8.9 client during tests last summer. The new 2.5.3 system was freshly formatted, with only a little data written from a 2.5.3 client.
I would like to try the LLNL 2.5.3 server with a 1.8.9 client again.

Apparently I'm missing something obvious.
I realize it is not a supported or "tested" configuration, but we have been successfully running a similar configuration with Intel's last GA release of the 2.5.3 server for more than half a year, with HPC clusters doing IO on both Lustre file systems and a few nodes doing 'cp' between the old and new ones, checksumming and stats.

We still need the double mount (1.8 and 2.5) for another month, until we finish the migration. We will need to run 1.8.9 clients for six more months. I'm trying to reassess whether I can use LLNL's 2.5.3 Lustre on the reinstalled servers in this configuration.

lustre 1.8.9 (or 2.5.3) client with LLNL server 2.5.3 only - runs fine.
lustre 1.8.9 client mounting both 1.8 servers and intel's 2.5.3 servers - runs fine.
lustre 1.8.9 client mounting both 1.8 servers and llnl 2.5.3 servers - crashes after the mount or after a few operations.
I was able to make it last longer by mounting in a certain order and doing "ls" on a few existing files, but it crashes some time later during IO.

The reported FIDs look real, but I also saw
[0xdead000000100100 :0x200200 :0xdead0000]
[0x5a5a5a5a5a5a5a5a :0x5a5a5a5a :0x5a5a5a5a]

which correspond to poison values such as
CONFIG_ILLEGAL_POINTER_VALUE
# define LI_POISON ((int)0x5a5a5a5a)
or similar.
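
To separate FIDs that look real from ones that look like poisoned memory, a small heuristic along these lines may help when going through the logs. It is a Python sketch under stated assumptions: the 0x5a5a5a5a pattern is the LI_POISON value quoted above, and the 0xdead prefix is simply what the first FID above shows. The function name and the field-splitting rule are hypothetical, not anything taken from the Lustre source.

```python
LI_POISON = 0x5A5A5A5A          # value of the LI_POISON define quoted above
POINTER_POISON_PREFIX = 0xDEAD  # leading 16 bits seen in 0xdead000000100100


def looks_poisoned(seq, oid, ver):
    """Heuristic: True if any FID field carries a known poison pattern.

    seq is treated as a 64-bit value, oid and ver as 32-bit values.
    """
    # Split the 64-bit sequence into halves so that a repeated 32-bit
    # fill pattern is caught in either half.
    words = [seq & 0xFFFFFFFF, seq >> 32, oid, ver]
    if LI_POISON in words:
        return True
    # A poisoned pointer shows up as a 0xdead prefix on the 64-bit
    # sequence or on the 32-bit version field.
    return (seq >> 48) == POINTER_POISON_PREFIX or (ver >> 16) == POINTER_POISON_PREFIX


samples = [
    (0xDEAD000000100100, 0x200200, 0xDEAD0000),
    (0x5A5A5A5A5A5A5A5A, 0x5A5A5A5A, 0x5A5A5A5A),
    (0x600000005, 0x7, 0xFFFFFFFF),
]
for seq, oid, ver in samples:
    print(hex(seq), hex(oid), hex(ver), "poisoned:", looks_poisoned(seq, oid, ver))
```

The first two samples are flagged and the third is not. That would be consistent with some of the FIDs being read from freed or uninitialized memory rather than from a valid reply.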

I tried to compare the 2.5.3-llnl branch with the Whamcloud branch 2_5 at tag 2.5.3, and also with tag 2.5.3.90.
I did not find anything related to IGIF FIDs in the commit messages of the commits that differ, though I guess I may have missed a code change in some patch that its commit message does not mention.
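
One way to catch a code change that the commit message does not mention is to search the patches themselves rather than the messages. The Python sketch below only builds the git command line; it does not run it. The ref names are written the way they appear in this mail and may not match the actual refs in a given clone, and the helper name and the example paths are illustrative assumptions. The options used (`--left-right`, `-G<regex>`, and the three-dot symmetric difference) are standard `git log` options.

```python
import shlex


def pickaxe_cmd(ref_a, ref_b, pattern, paths=()):
    """Build a 'git log' command listing commits on either side of
    ref_a...ref_b whose patch adds or removes a line matching pattern."""
    cmd = [
        "git", "log", "--oneline",
        "--left-right",          # mark which side each commit belongs to
        f"-G{pattern}",          # search patch text, not commit messages
        f"{ref_a}...{ref_b}",    # symmetric difference of the two refs
    ]
    if paths:
        cmd += ["--", *paths]
    return cmd


# Ref names and paths below are assumptions for illustration only.
cmd = pickaxe_cmd("2.5.3", "2.5.3-llnl", "igif", ["lustre/mdc", "lustre/include"])
print(shlex.join(cmd))
```

The printed command can be pasted into a shell inside the repository. The pattern is case-sensitive as written.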

I would appreciate any hints on where to look to make it work, and on what difference is causing this LBUG.

Thanks in advance, Alex.


Jun  1 15:01:10 dsg0515 kernel: LustreError: 4541:0:(mdc_fid.c:334:fid_le_to_cpu()) ASSERTION(fid_is_igif(dst) || fid_ver(dst) == 0) failed: [0x600000005:0x7:0xffffffff]
Jun  1 15:01:10 dsg0515 kernel: LustreError: 4541:0:(mdc_fid.c:334:fid_le_to_cpu()) LBUG
Jun  1 15:01:10 dsg0515 kernel: Pid: 4541, comm: ls
Jun  1 15:01:10 dsg0515 kernel:
Jun  1 15:01:10 dsg0515 kernel: Call Trace:
Jun  1 15:01:10 dsg0515 kernel:  [<ffffffff810df1c4>] ? generic_permission+0x24/0xc0
Jun  1 15:01:10 dsg0515 kernel:  [<ffffffffa0eb3847>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
Jun  1 15:01:10 dsg0515 kernel:  [<ffffffffa0eb3de6>] lbug_with_loc+0x76/0xe0 [libcfs]
Jun  1 15:01:10 dsg0515 kernel:  [<ffffffffa1135ee5>] fid_le_to_cpu+0xa5/0xb0 [mdc]
Jun  1 15:01:10 dsg0515 kernel:  [<ffffffffa11a3c45>] ll_readdir+0x935/0xb00 [lustre]
Jun  1 15:01:10 dsg0515 kernel:  [<ffffffff810d5b47>] ? nameidata_to_filp+0x57/0x70
Jun  1 15:01:10 dsg0515 kernel:  [<ffffffff810af1d9>] ? __inc_zone_state+0x9/0x70
Jun  1 15:01:10 dsg0515 kernel:  [<ffffffff810a3099>] ? __lru_cache_add+0x9/0x70
Jun  1 15:01:10 dsg0515 kernel:  [<ffffffff810a3119>] ? lru_cache_add_lru+0x19/0x40
Jun  1 15:01:10 dsg0515 kernel:  [<ffffffff810e5710>] ? filldir+0x0/0xf0
Jun  1 15:01:10 dsg0515 kernel:  [<ffffffff810e5710>] ? filldir+0x0/0xf0
Jun  1 15:01:10 dsg0515 kernel:  [<ffffffff810e58ac>] vfs_readdir+0xac/0xd0
Jun  1 15:01:10 dsg0515 kernel:  [<ffffffff810e5b66>] sys_getdents+0x86/0xe0
Jun  1 15:01:10 dsg0515 kernel:  [<ffffffff81420def>] ? page_fault+0x1f/0x30
Jun  1 15:01:10 dsg0515 kernel:  [<ffffffff8100b2fb>] system_call_fastpath+0x16/0x1b
Jun  1 15:01:10 dsg0515 kernel:
Jun  1 15:01:10 dsg0515 kernel: LustreError: dumping log to /tmp/lustre-log.1433188870.4541



