[Lustre-discuss] obdfilter-survey crashing

robert spam.robert at risefx.com
Wed Jan 5 10:41:33 PST 2011


Thank you Wojciech, Alexey, Parinay and John,

It looks like the controller (Areca 1280) is the problem. In the logs I
found a corresponding error message (arcmsr6:...). Googling this shows that
a lot of users have problems with Areca controllers under heavy load.
The OSS crash can be reproduced with tiobench running 32 threads each on
two clients (!). So the problem might not be directly related to
obdfilter-survey; the survey just triggers a totally different problem.
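
One way to look for such controller messages on an OSS is to filter the
kernel log for the driver name. This is only a sketch: the helper name
arcmsr_messages is made up for illustration, the log path in the comments
is an assumption for a CentOS 5 install, and the exact message text will
differ from system to system.

```shell
#!/bin/sh
# Hypothetical helper: print log lines that mention the Areca driver
# (arcmsr). Reads the files given as arguments, or stdin when none are given.
arcmsr_messages() {
    grep -i 'arcmsr' "$@"
}

# Possible use on an OSS (paths are assumptions, adjust as needed):
#   arcmsr_messages /var/log/messages
#   dmesg | arcmsr_messages
```

If the filter prints nothing, grep exits with a non-zero status, which
makes the helper easy to use in a simple "if" check.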

Wojciech, I read about this. The odd thing is that the speed per process
is constant even with more processes on the same or another client. The
system is obviously capable of much more but shows a limit per reader
process.

Alexey, I was not able to capture the sysrq+t output into a file in my test
installation, and after discovering the arcmsr message I followed up on
that first.

Parinay, I tried to trace the behaviour with strace, but I get no
output apart from the "attached" message, and obdfilter-survey does not
continue afterwards.

John, panic_on_lbug is not set, and today I saw that the system freezes
after 1-2 hours even without interrupting obdfilter-survey.
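
For reference, a quick way to check whether panic_on_lbug is set is to
read the corresponding proc entry. The sketch below assumes the Lustre 1.8
location /proc/sys/lnet/panic_on_lbug; the function name and the optional
path argument are hypothetical and only there to make the check easy to
try out on a test file.

```shell
#!/bin/sh
# Hypothetical helper: report whether panic_on_lbug is enabled.
# The default path is an assumption based on Lustre 1.8.
lbug_panic_state() {
    f=${1:-/proc/sys/lnet/panic_on_lbug}
    if [ ! -r "$f" ]; then
        # Entry missing or unreadable (e.g. Lustre modules not loaded)
        echo "unknown"
        return 1
    fi
    if [ "$(cat "$f")" = "1" ]; then
        echo "set"
    else
        echo "not set"
    fi
}

# Possible use on an OSS:
#   lbug_panic_state
```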

I will run a test with a different controller in the next few days and
will post log info if the problem persists.

Thanks again!

Robert

On 05.01.2011 18:25, John Hammond wrote:
> On 01/04/2011 02:14 PM, robert wrote:
>> Hi Everyone!
>>
>> I just set up a Lustre system on CentOS 5.5 with Lustre 1.8.5. There are
>> three identical OSSs with four OSTs each.
>>
>> After having fantastic write rates but low read rates, I ran the
>> obdfilter-survey script to get a hint of what may cause this.
>>
>> Unfortunately, obdfilter-survey in case=disk mode freezes on two of my
>> three OSSs at the write task of the "4 objs, 16 threads" line and leaves
>> the system in an unstable state requiring a reboot. The other OSS runs
>> through the script without problems. To rule out a problem in the
>> system's setup, I booted one of the bad OSSs with the working OSS's disk,
>> with the same faulty result. Creating a new filesystem on all OSTs of
>> one of the problem OSSs did not do the trick either.
>>
>> Any ideas what may cause this behavior? Thanks!
> Do you have panic_on_lbug set?
>
> It's easy to LBUG Lustre by interrupting (Ctrl-C/SIGINT/Arrivederci Roma) a
> running obdfilter-survey.  Using 1.8.4 on RHEL 5.5:
>
> [root at oss21 obdfilter-survey]# nobjhi=2 thrhi=2 size=1024 case=disk sh obdfilter-survey
> Wed Jan  5 10:51:05 CST 2011 Obdfilter-survey for case=disk from oss21.ranger.tacc.utexas.edu
> ost  6 sz  6291456K rsz 1024K obj    6 thr    6 write
> ^C
>
> [root at oss21 ~]# dmesg
> [87251.960393] Lustre: 11759:(echo_client.c:1409:echo_client_cleanup()) ASSERTION(eco->eco_refcount == 0) failed
> [87251.960451] Lustre: 11759:(echo_client.c:1409:echo_client_cleanup()) LBUG()
> [87251.960482] Pid: 11759, comm: lctl
> ...
>
> See https://bugzilla.lustre.org/show_bug.cgi?id=21745
>



