[Lustre-discuss] OSS crashes

Brian J. Murrell Brian.Murrell at Sun.COM
Wed Jul 23 07:51:23 PDT 2008


On Wed, 2008-07-23 at 13:56 +0200, Thomas Roth wrote:
> Hi all,

Hi,

> I've experienced reproducible OSS crashes with 1.6.5, but also with
> 1.6.4.3/1.6.4.2. The cluster is running Debian Etch64, kernel 2.6.22.
> The OSSes are file servers with two OSTs each.
> I'm now testing with just one OSS in the system (though I first
> encountered the problem with 9 OSSes), mounting Lustre on 4 clients and
> writing to it with the stress utility: "stress -d 2 --hdd-noclean --hdd-bytes 5M"
> 
> Once the OSTs are more than 60% full, the machine just stops working.

Hrm.  "Stop working" and "crash" are two different things.  Can we get
clarification or more detail on exactly what happens to the OSS at that
point?  Is the OSS still up and running?  Can you log into it?  Can you
run "ls -l /" and have it return successfully?
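
Something along these lines, run from another node, would help tell a
hard crash from a hang (rough sketch; "oss1" is just a placeholder for
your OSS hostname):

  ssh oss1 uptime              # does the node respond at all?
  ssh oss1 'ls -l /'           # does basic filesystem access still work?
  ssh oss1 'dmesg | tail -50'  # any oops/LBUG in the kernel log?

If you can still get a shell but commands hang, "echo t >
/proc/sysrq-trigger" on the OSS will dump the stack of every task to the
kernel log, which usually shows where the OST threads are stuck.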

> Jul 22 21:23:52 kernel: Lustre: 25706:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0000: slow i_mutex 30s
> Jul 22 21:24:10 kernel: Lustre: 25692:0:(lustre_fsfilt.h:246:fsfilt_brw_start_log()) gsilust-OST0000: slow journal start 37s
> Jul 22 21:24:10 kernel: Lustre: 25692:0:(filter_io_26.c:713:filter_commitrw_write()) gsilust-OST0000: slow brw_start 37s
> Jul 22 21:24:10 kernel: Lustre: 25697:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0000: slow i_mutex 37s
> Jul 22 21:46:55 kernel: Lustre: 25680:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0001: slow i_mutex 31s
> Jul 22 21:46:55 kernel: Lustre: 25680:0:(filter_io_26.c:700:filter_commitrw_write()) Skipped 2 previous similar messages
> Jul 22 21:47:06 kernel: Lustre: 25733:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0001: slow i_mutex 30s
> Jul 22 21:47:10 kernel: Lustre: 25744:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0001: slow i_mutex 31s
> Jul 22 21:47:15 kernel: Lustre: 25729:0:(lustre_fsfilt.h:246:fsfilt_brw_start_log()) gsilust-OST0001: slow journal start 30s
> Jul 22 21:47:15 kernel: Lustre: 25729:0:(filter_io_26.c:713:filter_commitrw_write()) gsilust-OST0001: slow brw_start 30s
> Jul 22 21:47:54 kernel: Lustre: 25662:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0001: slow i_mutex 36s
> Jul 22 21:48:30 kernel: Lustre: 25721:0:(lustre_fsfilt.h:246:fsfilt_brw_start_log()) gsilust-OST0001: slow journal start 33s
> Jul 22 21:48:30 kernel: Lustre: 25721:0:(filter_io_26.c:713:filter_commitrw_write()) gsilust-OST0001: slow brw_start 33s
> Jul 22 21:48:30 kernel: Lustre: 25736:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0001: slow i_mutex 33s

These indicate that your OSTs are too slow.  Maybe you have
oversubscribed the number of OST threads your hardware can handle, or
maybe the OST hardware has slowed down/degraded by the time this
happens.
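
If it is thread oversubscription, you can cap the number of OSS I/O
threads with a module option and watch the disks while the test runs.
A sketch, assuming the 1.6 "ost" module parameter and proc layout (pick
a thread count that matches your spindles, not the 128 shown here):

  # on the OSS, in /etc/modprobe.d/lustre (or /etc/modprobe.conf)
  options ost oss_num_threads=128

  # how many service threads actually started (path may vary by version)
  cat /proc/fs/lustre/ost/OSS/ost_io/threads_started

  # watch for saturated OST devices (%util, await) during the stress run
  iostat -x 5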

> I managed to crash some of my OSSes with a trace in kern.log, a known 
> bug in the ext3/ext4 code, I think:
> Jul 14 21:41:19 kernel: uh! busy PA
> Jul 14 21:41:19 kernel:
> Jul 14 21:41:19 kernel: Call Trace:
> Jul 14 21:41:19 kernel: [<ffffffff8857c5ce>] :ldiskfs:ldiskfs_mb_discard_group_preallocations+0x2ae/0x400
> Jul 14 21:41:19 kernel: [<ffffffff8857c75a>] :ldiskfs:ldiskfs_mb_discard_preallocations+0x3a/0x70
> Jul 14 21:41:19 kernel: [<ffffffff8857cc0d>] :ldiskfs:ldiskfs_mb_new_blocks+0x24d/0x270
> Jul 14 21:41:19 kernel: [<ffffffff802a1025>] __find_get_block_slow+0x2f/0xf1
> Jul 14 21:41:19 kernel: [<ffffffff885c90ef>] :fsfilt_ldiskfs:ldiskfs_ext_new_extent_cb+0x4ef/0x640
> Jul 14 21:41:19 kernel: [<ffffffff802a1428>] __getblk+0x1d/0x20d
> Jul 14 21:41:19 kernel: [<ffffffff88577801>] :ldiskfs:ldiskfs_ext_walk_space+0x131/0x250
> Jul 14 21:41:19 kernel: [<ffffffff885c8c00>] :fsfilt_ldiskfs:ldiskfs_ext_new_extent_cb+0x0/0x640
> Jul 14 21:41:19 kernel: [<ffffffff885c48c4>] :fsfilt_ldiskfs:fsfilt_map_nblocks+0xa4/0x150
> Jul 14 21:41:19 kernel: [<ffffffff8845095c>] :ksocklnd:ksocknal_alloc_tx+0x2c/0x2a0
> Jul 14 21:41:19 kernel: [<ffffffff885f24db>] :obdfilter:filter_direct_io+0x12b/0xd60
> Jul 14 21:41:19 kernel: [<ffffffff885f3e9d>] :obdfilter:filter_commitrw_write+0x7bd/0x2640
> Jul 14 21:41:19 kernel: [<ffffffff883bdbbe>] :ptlrpc:ldlm_resource_foreach+0x6e/0x3a0
> Jul 14 21:41:19 kernel: [<ffffffff8023a0a4>] lock_timer_base+0x26/0x4b
> Jul 14 21:41:19 kernel: [<ffffffff885aec4e>] :ost:ost_brw_write+0x15be/0x2990
> Jul 14 21:41:19 kernel: [<ffffffff8022c506>] default_wake_function+0x0/0xe
> Jul 14 21:41:19 kernel: [<ffffffff885b32f5>] :ost:ost_handle+0x2745/0x5ed0
> Jul 14 21:41:19 kernel: [<ffffffff8027db5c>] cache_alloc_refill+0x94/0x1e8
> Jul 14 21:41:19 kernel: [<ffffffff8022a52c>] find_busiest_group+0x255/0x6cf
> Jul 14 21:41:19 kernel: [<ffffffff80246a60>] do_gettimeofday+0x2e/0x9e
> Jul 14 21:41:19 kernel: [<ffffffff8023a0a4>] lock_timer_base+0x26/0x4b
> Jul 14 21:41:19 kernel: [<ffffffff8834d088>] :obdclass:class_handle2object+0x88/0x180
> Jul 14 21:41:19 kernel: [<ffffffff883e41d0>] :ptlrpc:lustre_swab_ptlrpc_body+0x0/0x90
> Jul 14 21:41:19 kernel: [<ffffffff883e1c9e>] :ptlrpc:lustre_swab_buf+0xbe/0xf0
> Jul 14 21:41:19 kernel: [<ffffffff8023a1ef>] __mod_timer+0xb6/0xc4
> Jul 14 21:41:19 kernel: [<ffffffff883ec38f>] :ptlrpc:ptlrpc_main+0x130f/0x1ce0
> Jul 14 21:41:19 kernel: [<ffffffff8022c506>] default_wake_function+0x0/0xe
> Jul 14 21:41:19 kernel: [<ffffffff8020aba8>] child_rip+0xa/0x12
> Jul 14 21:41:19 kernel: [<ffffffff883eb080>] :ptlrpc:ptlrpc_main+0x0/0x1ce0
> Jul 14 21:41:19 kernel: [<ffffffff8020ab9e>] child_rip+0x0/0x12

This looks like bug 14322, which was fixed in 1.6.5.
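
Since you mention seeing crashes with 1.6.5 as well, it would be worth
double-checking which Lustre version the crashing OSS is actually
running.  On a stock 1.6 install something like this should show it
(rough sketch; paths/commands may differ on your build):

  cat /proc/fs/lustre/version
  lctl lustre_build_version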

b.
