[lustre-discuss] nodes crash during ior test
Brian Andrus
toomuchit at gmail.com
Mon Aug 7 23:34:33 PDT 2017
Had another crash where a client rebooted.
Here is the full dmesg from that event:
[181902.731655] BUG: unable to handle kernel NULL pointer dereference at (null)
[181902.731710] IP: [<ffffffff8168e99a>] _raw_spin_unlock+0xa/0x30
[181902.731749] PGD 0
[181902.731766] Oops: 0002 [#1] SMP
[181902.731788] Modules linked in: osc(OE) mgc(OE) lustre(OE) lmv(OE) fld(OE) mdc(OE) fid(OE) lov(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) nfsv3 nfs fscache sfc mtd vfat fat intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ipmi_devintf ipmi_si iTCO_wdt iTCO_vendor_support sb_edac pcspkr ipmi_msghandler sg edac_core shpchp ioatdma wmi acpi_power_meter mei_me mei i2c_i801 lpc_ich nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod sr_mod crc_t10dif cdrom crct10dif_generic crct10dif_pclmul crct10dif_common crc32c_intel mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops i2c_algo_bit ttm ahci libahci drm ixgbe libata i2c_core mdio ptp pps_core
[181902.732221] dca fjes dm_mirror dm_region_hash dm_log dm_mod [last unloaded: mtd]
[181902.732260] CPU: 14 PID: 18830 Comm: socknal_sd01_05 Tainted: G OE ------------ 3.10.0-514.26.2.el7.x86_64 #1
[181902.732309] Hardware name: NEC Express5800/R120f-1M [N8100-2210F]/MS-S0901, BIOS 5.0.8022 06/22/2015
[181902.732351] task: ffff881031e72f10 ti: ffff88102ae40000 task.ti: ffff88102ae40000
[181902.732392] RIP: 0010:[<ffffffff8168e99a>] [<ffffffff8168e99a>] _raw_spin_unlock+0xa/0x30
[181902.732434] RSP: 0018:ffff88102ae43c38 EFLAGS: 00010206
[181902.732459] RAX: ffff88103dd7ef70 RBX: ffff88103dd7eec0 RCX: 0000000000000000
[181902.732492] RDX: 000000000000d173 RSI: 00000000000068b8 RDI: 0000000000000000
[181902.732524] RBP: ffff88102ae43c50 R08: 00000000105d41fb R09: 0000000000005610
[181902.732557] R10: 0000000070000000 R11: 0000000000000000 R12: 00000000000345c0
[181902.732589] R13: ffff88102a70b200 R14: ffff88203d1eb674 R15: ffff8818286b9810
[181902.732622] FS: 0000000000000000(0000) GS:ffff88203f200000(0000) knlGS:0000000000000000
[181902.732658] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[181902.732685] CR2: 0000000000000000 CR3: 00000000019be000 CR4: 00000000001407e0
[181902.732717] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[181902.732750] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[181902.732783] Stack:
[181902.732795] ffffffffa076d8b6 ffff88103261d200 ffff88203d1eb600 ffff88102ae43c90
[181902.732834] ffffffffa07f4a91 ffff8818286b9800 ffff88103261d200 0000000000000001
[181902.732873] ffff8820357155c0 0000000000000000 ffff88103261d210 ffff88102ae43cc0
[181902.732911] Call Trace:
[181902.732945] [<ffffffffa076d8b6>] ? cfs_percpt_unlock+0x36/0xc0 [libcfs]
[181902.732995] [<ffffffffa07f4a91>] lnet_return_tx_credits_locked+0x211/0x480 [lnet]
[181902.733037] [<ffffffffa07e7800>] lnet_msg_decommit+0xd0/0x6c0 [lnet]
[181902.733073] [<ffffffffa07e8189>] lnet_finalize+0x1e9/0x690 [lnet]
[181902.733110] [<ffffffffa050ef45>] ksocknal_tx_done+0x85/0x1c0 [ksocklnd]
[181902.733145] [<ffffffffa0517277>] ksocknal_handle_zcack+0x137/0x1e0 [ksocklnd]
[181902.733181] [<ffffffffa0512cf1>] ksocknal_process_receive+0x3a1/0xd90 [ksocklnd]
[181902.733219] [<ffffffffa0513a6e>] ksocknal_scheduler+0xee/0x670 [ksocklnd]
[181902.733255] [<ffffffff810b1b20>] ? wake_up_atomic_t+0x30/0x30
[181902.733286] [<ffffffffa0513980>] ? ksocknal_recv+0x2a0/0x2a0 [ksocklnd]
[181902.733318] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
[181902.733344] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[181902.733377] [<ffffffff81697758>] ret_from_fork+0x58/0x90
[181902.734492] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[181902.735590] Code: 90 8d 8a 00 00 02 00 89 d0 f0 0f b1 0f 39 d0 75 ea b8 01 00 00 00 5d c3 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 0f 1f 44 00 00 <66> 83 07 02 c3 90 8b 37 f0 66 83 07 02 f6 47 02 01 74 f1 55 48
[181902.737942] RIP [<ffffffff8168e99a>] _raw_spin_unlock+0xa/0x30
[181902.739079] RSP <ffff88102ae43c38>
[181902.740189] CR2: 0000000000000000
Brian Andrus
On 8/7/2017 7:17 AM, Brian Andrus wrote:
>
> There were actually several:
>
> On an OSS:
>
> [447314.138709] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
> [543262.189674] BUG: unable to handle kernel NULL pointer dereference at (null)
> [16397.115830] BUG: unable to handle kernel NULL pointer dereference at (null)
>
>
> On 2 separate clients:
>
> [65404.590906] BUG: unable to handle kernel NULL pointer dereference at (null)
> [72095.972732] BUG: unable to handle kernel paging request at 0000002029b0e000
>
> Brian Andrus
>
>
>
> On 8/4/2017 10:49 AM, Patrick Farrell wrote:
>>
>> Brian,
>>
>>
>> What is the actual crash? Null pointer, failed assertion/LBUG...?
>> Probably just a few more lines back in the log would show that.
>>
>>
>> Also, Lustre 2.10 has been released, you might benefit from switching
>> to that. There are almost certainly more bugs in this pre-2.10
>> development version you're running than in the release.
>>
>>
>> - Patrick
>>
>> ------------------------------------------------------------------------
>> *From:* lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on
>> behalf of Brian Andrus <toomuchit at gmail.com>
>> *Sent:* Friday, August 4, 2017 12:12:59 PM
>> *To:* lustre-discuss at lists.lustre.org
>> *Subject:* [lustre-discuss] nodes crash during ior test
>> All,
>>
>> I am trying to run some ior benchmarking on a small system.
>>
>> It only has 2 OSSes.
>> I have been having trouble where one of the clients reboots and produces
>> a crash dump somewhat arbitrarily. Most runs complete, but roughly every
>> fifth run a client reboots, and it is not always the same client.
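>>
>> (A minimal sketch of an ior run of this kind; the MPI layout, block and
>> transfer sizes, iteration count, and the /mnt/lustre mount point are all
>> illustrative assumptions, not details taken from this thread:)

```shell
# Hypothetical file-per-process ior run over a Lustre mount.
# 16 MPI ranks, 4 GiB per rank written in 1 MiB transfers,
# repeated 5 times to reproduce the intermittent crash.
mpirun -np 16 --hostfile ./hosts \
    ior -a POSIX -w -r \
        -b 4g -t 1m \
        -F \
        -i 5 \
        -o /mnt/lustre/ior_testfile
```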
>>
>> The call trace seems to point to lnet:
>>
>>
>> [72095.973865] Call Trace:
>> [72095.973892] [<ffffffffa070e856>] ? cfs_percpt_unlock+0x36/0xc0 [libcfs]
>> [72095.973936] [<ffffffffa0779851>] lnet_return_tx_credits_locked+0x211/0x480 [lnet]
>> [72095.973973] [<ffffffffa076c770>] lnet_msg_decommit+0xd0/0x6c0 [lnet]
>> [72095.974006] [<ffffffffa076d0f9>] lnet_finalize+0x1e9/0x690 [lnet]
>> [72095.974037] [<ffffffffa06baf45>] ksocknal_tx_done+0x85/0x1c0 [ksocklnd]
>> [72095.974068] [<ffffffffa06c3277>] ksocknal_handle_zcack+0x137/0x1e0 [ksocklnd]
>> [72095.974101] [<ffffffffa06becf1>] ksocknal_process_receive+0x3a1/0xd90 [ksocklnd]
>> [72095.974134] [<ffffffffa06bfa6e>] ksocknal_scheduler+0xee/0x670 [ksocklnd]
>> [72095.974165] [<ffffffff810b1b20>] ? wake_up_atomic_t+0x30/0x30
>> [72095.974193] [<ffffffffa06bf980>] ? ksocknal_recv+0x2a0/0x2a0 [ksocklnd]
>> [72095.974222] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
>> [72095.974244] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
>> [72095.974272] [<ffffffff81697758>] ret_from_fork+0x58/0x90
>> [72095.974296] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
>>
>> I am currently using Lustre 2.9.59_15_g107b2cb built for kmod.
>>
>> Is there something I can do to track this down and hopefully remedy it?
>>
>> Brian Andrus
>>
>> _______________________________________________
>> lustre-discuss mailing list
>> lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>