[lustre-discuss] nodes crash during ior test

Patrick Farrell paf at cray.com
Fri Aug 4 10:49:01 PDT 2017


Brian,


What is the actual crash? A null pointer dereference, a failed assertion/LBUG...? The lines just before the call trace in the log would probably show that.
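
If it helps, here is a rough Python sketch for pulling the likely crash reason out of a saved console or dmesg log: for every "Call Trace:" line it looks back a fixed number of lines for common crash markers. The function name, the pattern list, and the 30-line window are my own assumptions for illustration, not anything Lustre ships, and the pattern list is certainly not exhaustive.

```python
import re

# Markers that often precede a kernel call trace (assumed, non-exhaustive list).
REASON_PATTERNS = [
    re.compile(r"BUG: unable to handle kernel"),  # e.g. NULL pointer dereference
    re.compile(r"kernel BUG at"),
    re.compile(r"general protection fault"),
    re.compile(r"Kernel panic"),
    re.compile(r"ASSERTION\("),                   # failed Lustre assertion
    re.compile(r"\bLBUG\b"),                      # Lustre LBUG
]


def crash_reasons(log_lines, context=30):
    """Return a list of (trace_line, reason_lines) pairs.

    For each line containing "Call Trace:", reason_lines holds every line
    among the preceding `context` lines that matches a pattern above.
    """
    results = []
    for i, line in enumerate(log_lines):
        if "Call Trace:" not in line:
            continue
        window = log_lines[max(0, i - context):i]
        hits = [
            prev.rstrip("\n")
            for prev in window
            if any(p.search(prev) for p in REASON_PATTERNS)
        ]
        results.append((line.rstrip("\n"), hits))
    return results


# Usage (the path is hypothetical; point it at your own saved log):
#   with open("vmcore-dmesg.txt") as f:
#       for trace, reasons in crash_reasons(f.readlines()):
#           print(trace, reasons)
```

An empty reason list for a trace just means none of these markers appeared in the window, in which case reading the raw lines above the trace by hand is still the way to go.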


Also, Lustre 2.10 has been released; you might benefit from switching to it.  There are almost certainly more bugs in the pre-2.10 development version you're running than in the release.


- Patrick

________________________________
From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on behalf of Brian Andrus <toomuchit at gmail.com>
Sent: Friday, August 4, 2017 12:12:59 PM
To: lustre-discuss at lists.lustre.org
Subject: [lustre-discuss] nodes crash during ior test

All,

I am trying to run some ior benchmarking on a small system.

It only has 2 OSSes.
I have been having trouble where one of the clients will crash, write a
crash dump, and reboot, seemingly at random. Most runs complete, but
roughly one run in five a client reboots, and it is not always the same
client.

The call trace seems to point to lnet:


[72095.973865] Call Trace:
[72095.973892]  [<ffffffffa070e856>] ? cfs_percpt_unlock+0x36/0xc0 [libcfs]
[72095.973936]  [<ffffffffa0779851>] lnet_return_tx_credits_locked+0x211/0x480 [lnet]
[72095.973973]  [<ffffffffa076c770>] lnet_msg_decommit+0xd0/0x6c0 [lnet]
[72095.974006]  [<ffffffffa076d0f9>] lnet_finalize+0x1e9/0x690 [lnet]
[72095.974037]  [<ffffffffa06baf45>] ksocknal_tx_done+0x85/0x1c0 [ksocklnd]
[72095.974068]  [<ffffffffa06c3277>] ksocknal_handle_zcack+0x137/0x1e0 [ksocklnd]
[72095.974101]  [<ffffffffa06becf1>] ksocknal_process_receive+0x3a1/0xd90 [ksocklnd]
[72095.974134]  [<ffffffffa06bfa6e>] ksocknal_scheduler+0xee/0x670 [ksocklnd]
[72095.974165]  [<ffffffff810b1b20>] ? wake_up_atomic_t+0x30/0x30
[72095.974193]  [<ffffffffa06bf980>] ? ksocknal_recv+0x2a0/0x2a0 [ksocklnd]
[72095.974222]  [<ffffffff810b0a4f>] kthread+0xcf/0xe0
[72095.974244]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[72095.974272]  [<ffffffff81697758>] ret_from_fork+0x58/0x90
[72095.974296]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140

I am currently using Lustre 2.9.59_15_g107b2cb built for kmod.

Is there something I can do to track this down and hopefully remedy it?

Brian Andrus
