[Lustre-discuss] OSS crashes
Thomas Roth
t.roth at gsi.de
Wed Jul 23 08:20:08 PDT 2008
Hi,
Brian J. Murrell wrote:
> On Wed, 2008-07-23 at 13:56 +0200, Thomas Roth wrote:
>> Hi all,
>
> Hi,
>
>> I've experienced reproducible OSS crashes with 1.6.5 but also
>> 1.6.4.3/1.6.4.2. The cluster is running Debian Etch64, kernel 2.6.22.
>> The OSS are file servers with two OSTs.
>> I'm now testing it by just using one OSS in the system (but encountered
>> the problem first with 9 OSS), mounting Lustre on 4 clients and writing
>> to it using the stress utility: "stress -d 2 --hdd-noclean --hdd-bytes 5M "
>>
>> Once the OSTs are filled up to > 60%, the machine will just stop working.
>
> Hrm. "stop working" and "crash" are two different things. Can we get
> clarification or more detail on exactly what does happen to the OSS at
> this point? Is the OSS still up and running? Can you log into it? Can
> you do an "ls -l /" and it returns successfully?
Well, in these cases the machine is simply dead: the jobs writing via
Lustre have stopped with "write failed: Input/output error", I can't get
into the machine either via ssh or via the console, and the only thing I
can do is a hard reset. That's why I suspected the hardware first.
>> Jul 22 21:23:52 kernel: Lustre:
>> 25706:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0000:
>> slow i_mutex 30s
>> Jul 22 21:24:10 kernel: Lustre:
>> 25692:0:(lustre_fsfilt.h:246:fsfilt_brw_start_log()) gsilust-OST0000:
>> slow journal start 37s
>> Jul 22 21:24:10 kernel: Lustre:
>> 25692:0:(filter_io_26.c:713:filter_commitrw_write()) gsilust-OST0000:
>> slow brw_start 37s
>> Jul 22 21:24:10 kernel: Lustre:
>> 25697:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0000:
>> slow i_mutex 37s
>> Jul 22 21:46:55 kernel: Lustre:
>> 25680:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0001:
>> slow i_mutex 31s
>> Jul 22 21:46:55 kernel: Lustre:
>> 25680:0:(filter_io_26.c:700:filter_commitrw_write()) Skipped 2 previous
>> similar messages
>> Jul 22 21:47:06 kernel: Lustre:
>> 25733:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0001:
>> slow i_mutex 30s
>> Jul 22 21:47:10 kernel: Lustre:
>> 25744:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0001:
>> slow i_mutex 31s
>> Jul 22 21:47:15 kernel: Lustre:
>> 25729:0:(lustre_fsfilt.h:246:fsfilt_brw_start_log()) gsilust-OST0001:
>> slow journal start 30s
>> Jul 22 21:47:15 kernel: Lustre:
>> 25729:0:(filter_io_26.c:713:filter_commitrw_write()) gsilust-OST0001:
>> slow brw_start 30s
>> Jul 22 21:47:54 kernel: Lustre:
>> 25662:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0001:
>> slow i_mutex 36s
>> Jul 22 21:48:30 kernel: Lustre:
>> 25721:0:(lustre_fsfilt.h:246:fsfilt_brw_start_log()) gsilust-OST0001:
>> slow journal start 33s
>> Jul 22 21:48:30 kernel: Lustre:
>> 25721:0:(filter_io_26.c:713:filter_commitrw_write()) gsilust-OST0001:
>> slow brw_start 33s
>> Jul 22 21:48:30 kernel: Lustre:
>> 25736:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0001:
>> slow i_mutex 33s
>
> These are indicating that your OSTs are too slow. Maybe you have
> oversubscribed the number of OST threads your hardware can handle, or
> maybe the OST hardware has slowed down/degraded at the point this
> happens.
Interesting. These were 32 single processes "spinning on
write()/unlink()", as the man page of stress says.
I only looked at the network traffic, and it was not as high as in
other tests: the servers are connected via 1 Gbit Ethernet links, but I
saw no more than 20-30 MB/s. Internally, the RAID controller and disks
can handle much more. The load on the servers was less than 8.
And as I mentioned, I ran the same test on the machines locally,
although I do not remember how many parallel stress jobs I employed.
Where else could I look for overloaded hardware capacities? Any way to
find out about the number of OST threads our hardware can handle?
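(For reference, a sketch of how the OSS service thread count can be
inspected and capped, assuming Lustre 1.6's /proc interface and the
ost module option; the exact paths and option names may differ between
versions:)

```shell
# Sketch for Lustre 1.6 -- paths/options may differ on other versions.
# Inspect the current OST I/O thread counts via the proc interface:
cat /proc/fs/lustre/ost/OSS/ost_io/threads_started
cat /proc/fs/lustre/ost/OSS/ost_io/threads_max

# To cap the number of OSS threads, set the ost module option at load
# time, e.g. in /etc/modprobe.conf (64 is only an illustrative value):
#   options ost oss_num_threads=64
```

Lowering the thread count and re-running the stress test is one way to
see whether thread oversubscription is what drives the OSTs into the
"slow i_mutex"/"slow journal start" state.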
So far I have not tried any other pattern/utility for these tests: our
users are well known to be more demanding than any test program.
That's why we want to employ Lustre in the first place: to let the users
fire from hundreds of clients, concurrently, at a large data file space
(instead of killing NFS servers).
>> Some of my OSS I managed to crash with a trace in kern.log, a known bug
>> in ext3/ext4 code I think:
>> Jul 14 21:41:19 kernel: uh! busy PA
>
> This looks like bug 14322, Fixed in 1.6.5.
>
> b.
Yeah, I don't think I have seen this message on the 1.6.5 system. I just
wasn't sure whether the servers might have died before they were able to
write to the logs.
Thanks for your reply,
Thomas