[Lustre-discuss] OSS crashes
Thomas Roth
t.roth at gsi.de
Wed Jul 23 08:20:08 PDT 2008
Hi,
Brian J. Murrell wrote:
> On Wed, 2008-07-23 at 13:56 +0200, Thomas Roth wrote:
>> Hi all,
>
> Hi,
>
>> I've experienced reproducible OSS crashes with 1.6.5 but also
>> 1.6.4.3/1.6.4.2. The cluster is running Debian Etch64, kernel 2.6.22.
>> The OSS are file servers with two OSTs.
>> I'm now testing it by just using one OSS in the system (but encountered
>> the problem first with 9 OSS), mounting Lustre on 4 clients and writing
>> to it using the stress utility: "stress -d 2 --hdd-noclean --hdd-bytes 5M "
>>
>> Once the OSTs are filled up to > 60%, the machine will just stop working.
>
> Hrm. "stop working" and "crash" are two different things. Can we get
> clarification or more detail on exactly what does happen to the OSS at
> this point? Is the OSS still up and running? Can you log into it? Can
> you do an "ls -l /" and it returns successfully?
Well, in these cases the machine is simply dead: the jobs writing via
Lustre have stopped with "write failed: Input/output error", I can't get
into the machine either via ssh or via the console, and the only thing I
can do is a hard reset. That's why I suspected the hardware first.
>> Jul 22 21:23:52 kernel: Lustre:
>> 25706:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0000:
>> slow i_mutex 30s
>> Jul 22 21:24:10 kernel: Lustre:
>> 25692:0:(lustre_fsfilt.h:246:fsfilt_brw_start_log()) gsilust-OST0000:
>> slow journal start 37s
>> Jul 22 21:24:10 kernel: Lustre:
>> 25692:0:(filter_io_26.c:713:filter_commitrw_write()) gsilust-OST0000:
>> slow brw_start 37s
>> Jul 22 21:24:10 kernel: Lustre:
>> 25697:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0000:
>> slow i_mutex 37s
>> Jul 22 21:46:55 kernel: Lustre:
>> 25680:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0001:
>> slow i_mutex 31s
>> Jul 22 21:46:55 kernel: Lustre:
>> 25680:0:(filter_io_26.c:700:filter_commitrw_write()) Skipped 2 previous
>> similar messages
>> Jul 22 21:47:06 kernel: Lustre:
>> 25733:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0001:
>> slow i_mutex 30s
>> Jul 22 21:47:10 kernel: Lustre:
>> 25744:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0001:
>> slow i_mutex 31s
>> Jul 22 21:47:15 kernel: Lustre:
>> 25729:0:(lustre_fsfilt.h:246:fsfilt_brw_start_log()) gsilust-OST0001:
>> slow journal start 30s
>> Jul 22 21:47:15 kernel: Lustre:
>> 25729:0:(filter_io_26.c:713:filter_commitrw_write()) gsilust-OST0001:
>> slow brw_start 30s
>> Jul 22 21:47:54 kernel: Lustre:
>> 25662:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0001:
>> slow i_mutex 36s
>> Jul 22 21:48:30 kernel: Lustre:
>> 25721:0:(lustre_fsfilt.h:246:fsfilt_brw_start_log()) gsilust-OST0001:
>> slow journal start 33s
>> Jul 22 21:48:30 kernel: Lustre:
>> 25721:0:(filter_io_26.c:713:filter_commitrw_write()) gsilust-OST0001:
>> slow brw_start 33s
>> Jul 22 21:48:30 kernel: Lustre:
>> 25736:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0001:
>> slow i_mutex 33s
>
> These are indicating that your OSTs are too slow. Maybe you have
> oversubscribed the number of OST threads your hardware can handle, or
> maybe the OST hardware has slowed down/degraded at the point this
> happens.
Interesting. These were 32 single processes "spinning on
write()/unlink()", as the man page of stress says.
I only looked at the network traffic, and it was not as high as in
other tests: the servers are connected via 1 Gbit Ethernet links, but I
saw no more than 20-30 MB/s. Internally, the RAID controller and disks
can handle much more. The load on the servers was less than 8.
And as I mentioned, I ran the same test on the machines locally,
although I do not remember how many parallel stress jobs I employed.
Where else could I look for overloaded hardware capacities? Any way to
find out about the number of OST threads our hardware can handle?
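(For reference, a sketch of how the OSS service thread count can be
inspected and capped, assuming Lustre 1.6's /proc interface and the
ost module option; the exact paths and option names may differ between
versions:)

```shell
# Sketch for Lustre 1.6 -- paths/options may differ on other versions.
# Inspect the current OST I/O thread counts via the proc interface:
cat /proc/fs/lustre/ost/OSS/ost_io/threads_started
cat /proc/fs/lustre/ost/OSS/ost_io/threads_max

# To cap the number of OSS threads, set the ost module option at load
# time, e.g. in /etc/modprobe.conf (64 is only an illustrative value):
#   options ost oss_num_threads=64
```

Lowering the thread count and re-running the stress test is one way to
see whether thread oversubscription is what drives the OSTs into the
"slow i_mutex"/"slow journal start" state.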
So far I have not tried any other pattern/utility for these tests: our
users are well known to be more demanding than any test program.
That's why we want to employ Lustre in the first place: to let the users
fire from hundreds of clients, concurrently, at a large data file space
(instead of killing NFS servers).
>> Some of my OSS I managed to crash with a trace in kern.log, a known bug
>> in ext3/ext4 code I think:
>> Jul 14 21:41:19 kernel: uh! busy PA
>
> This looks like bug 14322, Fixed in 1.6.5.
>
> b.
Yeah, I don't think I have seen this message on the 1.6.5 system. I just
wasn't sure whether the servers might have died before they were able to
write to the logs.
Thanks for your reply,
Thomas