[lustre-discuss] nodes crash during ior test

E.S. Rosenberg esr+lustre at mail.hebrew.edu
Mon Aug 7 05:56:39 PDT 2017


OT:
Can we create a wiki page or some other form of knowledge pooling on
benchmarking lustre?

Right now I'm using slides from 2009 as my source, which may not be ideal...

http://wiki.lustre.org/images/4/40/Wednesday_shpc-2009-benchmarking.pdf

OT2:
Did I miss the release announcement or was 2.10 never announced on this
list?

Thanks!
Eli

On Fri, Aug 4, 2017 at 8:49 PM, Patrick Farrell <paf at cray.com> wrote:

> Brian,
>
> What is the actual crash?  Null pointer, failed assertion/LBUG...?
> Probably just a few more lines back in the log would show that.
>
>
> Also, Lustre 2.10 has been released, you might benefit from switching to
> that.  There are almost certainly more bugs in this pre-2.10 development
> version you're running than in the release.
>
>
> - Patrick
> ------------------------------
> *From:* lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on
> behalf of Brian Andrus <toomuchit at gmail.com>
> *Sent:* Friday, August 4, 2017 12:12:59 PM
> *To:* lustre-discuss at lists.lustre.org
> *Subject:* [lustre-discuss] nodes crash during ior test
>
> All,
>
> I am trying to run some ior benchmarking on a small system.
>
> It only has 2 OSSes.
> I have been having some trouble where one of the clients will reboot and
> do a crash dump somewhat arbitrarily. The runs will work most of the
> time, but every 5 or so times, a client reboots and it is not always the
> same client.
>
> The call trace seems to point to lnet:
>
>
> [72095.973865] Call Trace:
> [72095.973892]  [<ffffffffa070e856>] ? cfs_percpt_unlock+0x36/0xc0 [libcfs]
> [72095.973936]  [<ffffffffa0779851>] lnet_return_tx_credits_locked+0x211/0x480 [lnet]
> [72095.973973]  [<ffffffffa076c770>] lnet_msg_decommit+0xd0/0x6c0 [lnet]
> [72095.974006]  [<ffffffffa076d0f9>] lnet_finalize+0x1e9/0x690 [lnet]
> [72095.974037]  [<ffffffffa06baf45>] ksocknal_tx_done+0x85/0x1c0 [ksocklnd]
> [72095.974068]  [<ffffffffa06c3277>] ksocknal_handle_zcack+0x137/0x1e0 [ksocklnd]
> [72095.974101]  [<ffffffffa06becf1>] ksocknal_process_receive+0x3a1/0xd90 [ksocklnd]
> [72095.974134]  [<ffffffffa06bfa6e>] ksocknal_scheduler+0xee/0x670 [ksocklnd]
> [72095.974165]  [<ffffffff810b1b20>] ? wake_up_atomic_t+0x30/0x30
> [72095.974193]  [<ffffffffa06bf980>] ? ksocknal_recv+0x2a0/0x2a0 [ksocklnd]
> [72095.974222]  [<ffffffff810b0a4f>] kthread+0xcf/0xe0
> [72095.974244]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
> [72095.974272]  [<ffffffff81697758>] ret_from_fork+0x58/0x90
> [72095.974296]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
>
> I am currently using lustre 2.9.59_15_g107b2cb built for kmod
>
> Is there something I can do to track this down and hopefully remedy it?
>
> Brian Andrus
>
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
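
Regarding Patrick's point that the actual crash reason is probably a few lines before the call trace: a minimal sketch, assuming the client's console output or crash-dump dmesg has been saved to a plain-text file, of pulling out the lines that precede the first "Call Trace:" marker. The function name and the default of 20 context lines are hypothetical choices for illustration, not part of any Lustre or kernel tooling.

```python
def lines_before_call_trace(log_lines, context=20):
    """Return up to `context` lines preceding the first 'Call Trace:' line.

    `log_lines` is a list of strings (e.g. from file.read().splitlines()).
    Returns an empty list if no 'Call Trace:' marker is found.
    """
    for i, line in enumerate(log_lines):
        if "Call Trace:" in line:
            # Slice stops just before the marker line itself.
            return log_lines[max(0, i - context):i]
    return []


# Usage (assumes a saved log at a path of your choosing):
#   with open(path) as f:
#       for line in lines_before_call_trace(f.read().splitlines()):
#           print(line)
```

The lines this returns are where a null pointer dereference, failed assertion, or LBUG message would normally show up, if one was logged.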