[lustre-discuss] nodes crash during ior test

Brian Andrus toomuchit at gmail.com
Mon Aug 7 07:17:44 PDT 2017


There were actually several:

On an OSS:

[447314.138709] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
[543262.189674] BUG: unable to handle kernel NULL pointer dereference at (null)
[16397.115830] BUG: unable to handle kernel NULL pointer dereference at (null)


On 2 separate clients:

[65404.590906] BUG: unable to handle kernel NULL pointer dereference at (null)
[72095.972732] BUG: unable to handle kernel paging request at 0000002029b0e000
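
If it helps, I can pull more context out of the crash dumps with the
crash utility. A minimal sketch (the vmlinux debuginfo and vmcore paths
are examples; substitute the ones matching the actual kernel and dump):

    # open the dump against the matching debuginfo kernel
    crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux \
          /var/crash/<host>-<date>/vmcore

    crash> log   # full kernel log, including the lines before the BUG
    crash> bt    # backtrace of the task that crashed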

Brian Andrus



On 8/4/2017 10:49 AM, Patrick Farrell wrote:
>
> Brian,
>
>
> What is the actual crash? Null pointer, failed assertion/LBUG...? A
> few more lines further back in the log would probably show that.
>
>
> Also, Lustre 2.10 has been released; you might benefit from switching
> to that. The pre-2.10 development version you're running almost
> certainly has more bugs than the release.
>
>
> - Patrick
>
> ------------------------------------------------------------------------
> From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on
> behalf of Brian Andrus <toomuchit at gmail.com>
> Sent: Friday, August 4, 2017 12:12:59 PM
> To: lustre-discuss at lists.lustre.org
> Subject: [lustre-discuss] nodes crash during ior test
> All,
>
> I am trying to run some IOR benchmarking on a small system.
>
> It only has 2 OSSes.
> I have been having trouble where one of the clients reboots and writes
> a crash dump, seemingly at random. Most runs complete fine, but roughly
> every fifth run a client reboots, and it is not always the same client.
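>
> For reference, the runs look roughly like this (a sketch; the rank
> count, block/transfer sizes, and output path are examples, not the
> exact values I use):
>
>     # file-per-process write+read, 1 GiB per rank, 1 MiB transfers
>     mpirun -np 16 ior -a POSIX -w -r -b 1g -t 1m -F -o /mnt/lustre/ior.out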
>
> The call trace seems to point to lnet:
>
>
> [72095.973865] Call Trace:
> [72095.973892]  [<ffffffffa070e856>] ? cfs_percpt_unlock+0x36/0xc0 [libcfs]
> [72095.973936]  [<ffffffffa0779851>] lnet_return_tx_credits_locked+0x211/0x480 [lnet]
> [72095.973973]  [<ffffffffa076c770>] lnet_msg_decommit+0xd0/0x6c0 [lnet]
> [72095.974006]  [<ffffffffa076d0f9>] lnet_finalize+0x1e9/0x690 [lnet]
> [72095.974037]  [<ffffffffa06baf45>] ksocknal_tx_done+0x85/0x1c0 [ksocklnd]
> [72095.974068]  [<ffffffffa06c3277>] ksocknal_handle_zcack+0x137/0x1e0 [ksocklnd]
> [72095.974101]  [<ffffffffa06becf1>] ksocknal_process_receive+0x3a1/0xd90 [ksocklnd]
> [72095.974134]  [<ffffffffa06bfa6e>] ksocknal_scheduler+0xee/0x670 [ksocklnd]
> [72095.974165]  [<ffffffff810b1b20>] ? wake_up_atomic_t+0x30/0x30
> [72095.974193]  [<ffffffffa06bf980>] ? ksocknal_recv+0x2a0/0x2a0 [ksocklnd]
> [72095.974222]  [<ffffffff810b0a4f>] kthread+0xcf/0xe0
> [72095.974244]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
> [72095.974272]  [<ffffffff81697758>] ret_from_fork+0x58/0x90
> [72095.974296]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
>
> I am currently using Lustre 2.9.59_15_g107b2cb, built as kmod packages.
>
> Is there something I can do to track this down and hopefully remedy it?
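>
> So far the best idea I have is to turn up the LNet debug logging before
> a run so there is more context when it happens (a sketch; the debug
> mask, buffer size, and output path are just what I have been trying):
>
>     lctl set_param debug=+net      # add LNet tracing to the debug mask
>     lctl set_param debug_mb=256    # enlarge the in-memory debug buffer
>     # after a crash, dump the buffer on a surviving node:
>     lctl dk /tmp/lustre-debug.log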
>
> Brian Andrus
>
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


