[lustre-discuss] nodes crash during ior test
Brian Andrus
toomuchit at gmail.com
Mon Aug 7 23:34:33 PDT 2017
Had another crash where a client rebooted.
Here is the full dmesg from that event:
[181902.731655] BUG: unable to handle kernel NULL pointer dereference at (null)
[181902.731710] IP: [<ffffffff8168e99a>] _raw_spin_unlock+0xa/0x30
[181902.731749] PGD 0
[181902.731766] Oops: 0002 [#1] SMP
[181902.731788] Modules linked in: osc(OE) mgc(OE) lustre(OE) lmv(OE) fld(OE) mdc(OE) fid(OE) lov(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) nfsv3 nfs fscache sfc mtd vfat fat intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ipmi_devintf ipmi_si iTCO_wdt iTCO_vendor_support sb_edac pcspkr ipmi_msghandler sg edac_core shpchp ioatdma wmi acpi_power_meter mei_me mei i2c_i801 lpc_ich nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod sr_mod crc_t10dif cdrom crct10dif_generic crct10dif_pclmul crct10dif_common crc32c_intel mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops i2c_algo_bit ttm ahci libahci drm ixgbe libata i2c_core mdio ptp pps_core
[181902.732221] dca fjes dm_mirror dm_region_hash dm_log dm_mod [last unloaded: mtd]
[181902.732260] CPU: 14 PID: 18830 Comm: socknal_sd01_05 Tainted: G OE ------------ 3.10.0-514.26.2.el7.x86_64 #1
[181902.732309] Hardware name: NEC Express5800/R120f-1M [N8100-2210F]/MS-S0901, BIOS 5.0.8022 06/22/2015
[181902.732351] task: ffff881031e72f10 ti: ffff88102ae40000 task.ti: ffff88102ae40000
[181902.732392] RIP: 0010:[<ffffffff8168e99a>] [<ffffffff8168e99a>] _raw_spin_unlock+0xa/0x30
[181902.732434] RSP: 0018:ffff88102ae43c38 EFLAGS: 00010206
[181902.732459] RAX: ffff88103dd7ef70 RBX: ffff88103dd7eec0 RCX: 0000000000000000
[181902.732492] RDX: 000000000000d173 RSI: 00000000000068b8 RDI: 0000000000000000
[181902.732524] RBP: ffff88102ae43c50 R08: 00000000105d41fb R09: 0000000000005610
[181902.732557] R10: 0000000070000000 R11: 0000000000000000 R12: 00000000000345c0
[181902.732589] R13: ffff88102a70b200 R14: ffff88203d1eb674 R15: ffff8818286b9810
[181902.732622] FS: 0000000000000000(0000) GS:ffff88203f200000(0000) knlGS:0000000000000000
[181902.732658] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[181902.732685] CR2: 0000000000000000 CR3: 00000000019be000 CR4: 00000000001407e0
[181902.732717] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[181902.732750] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[181902.732783] Stack:
[181902.732795] ffffffffa076d8b6 ffff88103261d200 ffff88203d1eb600 ffff88102ae43c90
[181902.732834] ffffffffa07f4a91 ffff8818286b9800 ffff88103261d200 0000000000000001
[181902.732873] ffff8820357155c0 0000000000000000 ffff88103261d210 ffff88102ae43cc0
[181902.732911] Call Trace:
[181902.732945] [<ffffffffa076d8b6>] ? cfs_percpt_unlock+0x36/0xc0 [libcfs]
[181902.732995] [<ffffffffa07f4a91>] lnet_return_tx_credits_locked+0x211/0x480 [lnet]
[181902.733037] [<ffffffffa07e7800>] lnet_msg_decommit+0xd0/0x6c0 [lnet]
[181902.733073] [<ffffffffa07e8189>] lnet_finalize+0x1e9/0x690 [lnet]
[181902.733110] [<ffffffffa050ef45>] ksocknal_tx_done+0x85/0x1c0 [ksocklnd]
[181902.733145] [<ffffffffa0517277>] ksocknal_handle_zcack+0x137/0x1e0 [ksocklnd]
[181902.733181] [<ffffffffa0512cf1>] ksocknal_process_receive+0x3a1/0xd90 [ksocklnd]
[181902.733219] [<ffffffffa0513a6e>] ksocknal_scheduler+0xee/0x670 [ksocklnd]
[181902.733255] [<ffffffff810b1b20>] ? wake_up_atomic_t+0x30/0x30
[181902.733286] [<ffffffffa0513980>] ? ksocknal_recv+0x2a0/0x2a0 [ksocklnd]
[181902.733318] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
[181902.733344] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[181902.733377] [<ffffffff81697758>] ret_from_fork+0x58/0x90
[181902.734492] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[181902.735590] Code: 90 8d 8a 00 00 02 00 89 d0 f0 0f b1 0f 39 d0 75 ea b8 01 00 00 00 5d c3 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 0f 1f 44 00 00 <66> 83 07 02 c3 90 8b 37 f0 66 83 07 02 f6 47 02 01 74 f1 55 48
[181902.737942] RIP [<ffffffff8168e99a>] _raw_spin_unlock+0xa/0x30
[181902.739079] RSP <ffff88102ae43c38>
[181902.740189] CR2: 0000000000000000
Brian Andrus
On 8/7/2017 7:17 AM, Brian Andrus wrote:
>
> There were actually several:
>
> On an OSS:
>
> [447314.138709] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
> [543262.189674] BUG: unable to handle kernel NULL pointer dereference at (null)
> [16397.115830] BUG: unable to handle kernel NULL pointer dereference at (null)
>
>
> On 2 separate clients:
>
> [65404.590906] BUG: unable to handle kernel NULL pointer dereference at (null)
> [72095.972732] BUG: unable to handle kernel paging request at 0000002029b0e000
>
> Brian Andrus
>
>
>
> On 8/4/2017 10:49 AM, Patrick Farrell wrote:
>>
>> Brian,
>>
>>
>> What is the actual crash? Null pointer, failed assertion/LBUG...?
>> Probably just a few more lines back in the log would show that.
>>
>>
>> Also, Lustre 2.10 has been released, you might benefit from switching
>> to that. There are almost certainly more bugs in this pre-2.10
>> development version you're running than in the release.
>>
>>
>> - Patrick
>>
>> ------------------------------------------------------------------------
>> *From:* lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on
>> behalf of Brian Andrus <toomuchit at gmail.com>
>> *Sent:* Friday, August 4, 2017 12:12:59 PM
>> *To:* lustre-discuss at lists.lustre.org
>> *Subject:* [lustre-discuss] nodes crash during ior test
>> All,
>>
>> I am trying to run some ior benchmarking on a small system.
>>
>> It only has 2 OSSes.
>> I have been having trouble where one of the clients reboots and produces
>> a crash dump somewhat arbitrarily. Most runs complete, but roughly every
>> fifth run a client reboots, and it is not always the same client.
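>>
>> (A minimal sketch of an ior run of this kind; the MPI layout, block and
>> transfer sizes, iteration count, and the /mnt/lustre mount point are all
>> illustrative assumptions, not details taken from this thread:)

```shell
# Hypothetical file-per-process ior run over a Lustre mount.
# 16 MPI ranks, 4 GiB per rank written in 1 MiB transfers,
# repeated 5 times to reproduce the intermittent crash.
mpirun -np 16 --hostfile ./hosts \
    ior -a POSIX -w -r \
        -b 4g -t 1m \
        -F \
        -i 5 \
        -o /mnt/lustre/ior_testfile
```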
>>
>> The call trace seems to point to lnet:
>>
>>
>> [72095.973865] Call Trace:
>> [72095.973892] [<ffffffffa070e856>] ? cfs_percpt_unlock+0x36/0xc0 [libcfs]
>> [72095.973936] [<ffffffffa0779851>] lnet_return_tx_credits_locked+0x211/0x480 [lnet]
>> [72095.973973] [<ffffffffa076c770>] lnet_msg_decommit+0xd0/0x6c0 [lnet]
>> [72095.974006] [<ffffffffa076d0f9>] lnet_finalize+0x1e9/0x690 [lnet]
>> [72095.974037] [<ffffffffa06baf45>] ksocknal_tx_done+0x85/0x1c0 [ksocklnd]
>> [72095.974068] [<ffffffffa06c3277>] ksocknal_handle_zcack+0x137/0x1e0 [ksocklnd]
>> [72095.974101] [<ffffffffa06becf1>] ksocknal_process_receive+0x3a1/0xd90 [ksocklnd]
>> [72095.974134] [<ffffffffa06bfa6e>] ksocknal_scheduler+0xee/0x670 [ksocklnd]
>> [72095.974165] [<ffffffff810b1b20>] ? wake_up_atomic_t+0x30/0x30
>> [72095.974193] [<ffffffffa06bf980>] ? ksocknal_recv+0x2a0/0x2a0 [ksocklnd]
>> [72095.974222] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
>> [72095.974244] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
>> [72095.974272] [<ffffffff81697758>] ret_from_fork+0x58/0x90
>> [72095.974296] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
>>
>> I am currently using Lustre 2.9.59_15_g107b2cb built for kmod.
>>
>> Is there something I can do to track this down and hopefully remedy it?
>>
>> Brian Andrus
>>
>> _______________________________________________
>> lustre-discuss mailing list
>> lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>