[lustre-discuss] nodes crash during ior test
Brian Andrus
toomuchit at gmail.com
Mon Aug 7 07:17:44 PDT 2017
There were actually several:
On an OSS:
[447314.138709] BUG: unable to handle kernel NULL pointer dereference at
0000000000000020
[543262.189674] BUG: unable to handle kernel NULL pointer dereference
at (null)
[16397.115830] BUG: unable to handle kernel NULL pointer dereference
at (null)
On 2 separate clients:
[65404.590906] BUG: unable to handle kernel NULL pointer dereference
at (null)
[72095.972732] BUG: unable to handle kernel paging request at
0000002029b0e000
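In case it helps, a hedged sketch of pulling the context ahead of those BUG lines out of the kernel log (the log path is an assumption; on systemd hosts, save `journalctl -k -b -1` for the previous boot to a file first):

```shell
#!/bin/sh
# Print the lines leading up to each oops so the faulting function and
# any earlier LustreError / LBUG messages are visible.
# LOG is an assumption: point it at /var/log/messages, or at a file
# captured from `journalctl -k -b -1` (previous boot's kernel log).
LOG=${LOG:-/var/log/messages}
grep -B 25 'BUG: unable to handle' "$LOG"
```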
Brian Andrus
On 8/4/2017 10:49 AM, Patrick Farrell wrote:
>
> Brian,
>
>
> What is the actual crash? Null pointer, failed assertion/LBUG...?
> Probably just a few more lines back in the log would show that.
>
>
> Also, Lustre 2.10 has been released; you might benefit from switching
> to it. There are almost certainly more bugs in the pre-2.10
> development version you're running than in the release.
>
>
> - Patrick
>
> ------------------------------------------------------------------------
> *From:* lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on
> behalf of Brian Andrus <toomuchit at gmail.com>
> *Sent:* Friday, August 4, 2017 12:12:59 PM
> *To:* lustre-discuss at lists.lustre.org
> *Subject:* [lustre-discuss] nodes crash during ior test
> All,
>
> I am trying to run some ior benchmarking on a small system.
>
> It only has 2 OSSes.
> I have been having trouble where one of the clients reboots and
> writes a crash dump, seemingly at random. The runs work most of the
> time, but roughly every fifth run a client reboots, and it is not
> always the same client.
>
> The call trace seems to point to lnet:
>
>
> [72095.973865] Call Trace:
> [72095.973892] [<ffffffffa070e856>] ? cfs_percpt_unlock+0x36/0xc0
> [libcfs]
> [72095.973936] [<ffffffffa0779851>]
> lnet_return_tx_credits_locked+0x211/0x480 [lnet]
> [72095.973973] [<ffffffffa076c770>] lnet_msg_decommit+0xd0/0x6c0 [lnet]
> [72095.974006] [<ffffffffa076d0f9>] lnet_finalize+0x1e9/0x690 [lnet]
> [72095.974037] [<ffffffffa06baf45>] ksocknal_tx_done+0x85/0x1c0
> [ksocklnd]
> [72095.974068] [<ffffffffa06c3277>] ksocknal_handle_zcack+0x137/0x1e0
> [ksocklnd]
> [72095.974101] [<ffffffffa06becf1>]
> ksocknal_process_receive+0x3a1/0xd90 [ksocklnd]
> [72095.974134] [<ffffffffa06bfa6e>] ksocknal_scheduler+0xee/0x670
> [ksocklnd]
> [72095.974165] [<ffffffff810b1b20>] ? wake_up_atomic_t+0x30/0x30
> [72095.974193] [<ffffffffa06bf980>] ? ksocknal_recv+0x2a0/0x2a0
> [ksocklnd]
> [72095.974222] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
> [72095.974244] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
> [72095.974272] [<ffffffff81697758>] ret_from_fork+0x58/0x90
> [72095.974296] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
>
> I am currently using Lustre 2.9.59_15_g107b2cb, built for kmod.
>
> Is there something I can do to track this down and hopefully remedy it?
>
> Brian Andrus
>
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
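One sketch of an answer to the "track this down" question: widen the LNet debug mask before the next ior run, then dump the in-kernel debug buffer. This assumes the stock lctl utility is present on the nodes; the parameter names below are standard Lustre tunables, but verify them against your version. (Since the clients panic, also consider configuring kdump and walking the saved vmcore with the crash utility.)

```shell
#!/bin/sh
# Hedged sketch, assuming lctl is installed (Lustre client or server).
# Exits quietly on a non-Lustre host so the script is safe to run anywhere.
if ! command -v lctl >/dev/null 2>&1; then
    echo "lctl not found; run this on a Lustre node" >&2
    exit 0
fi
lctl set_param debug=+net      # add LNet network tracing to the debug mask
lctl set_param debug_mb=256    # enlarge the in-kernel debug buffer (MB)
# ... reproduce with ior, then dump (and clear) the debug log:
lctl dk /tmp/lustre-debug.log
```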