[Lustre-discuss] OSS crashes

Thomas Roth t.roth at gsi.de
Wed Jul 23 04:56:56 PDT 2008


Hi all,

I've experienced reproducible OSS crashes with 1.6.5, but also with 
1.6.4.3/1.6.4.2. The cluster is running Debian Etch64, kernel 2.6.22. 
The OSS are file servers with two OSTs each.
I'm now testing with just one OSS in the system (though I first 
encountered the problem with 9 OSS), mounting Lustre on 4 clients and 
writing to it with the stress utility: "stress -d 2 --hdd-noclean --hdd-bytes 5M"
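
For reference, what each of the 4 clients runs boils down to roughly the 
following (the MGS nid and mount point are placeholders, not our actual names):

    # sketch of the per-client test load; <mgsnid> and /mnt/gsilust are placeholders
    mount -t lustre <mgsnid>@tcp0:/gsilust /mnt/gsilust
    cd /mnt/gsilust
    # 2 hdd workers, 5 MB per file; --hdd-noclean keeps the written files
    # around, so the OSTs gradually fill up
    stress -d 2 --hdd-noclean --hdd-bytes 5M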

Once the OSTs are filled up to > 60%, the machine will just stop working.

There are no traces in any of the logs that would relate directly to the 
moment of failure.

I have now repeated the procedure with 9 of these machines, all of them 
SuperMicro X7DB8 16-slot file servers with 2 Intel Xeon E5320 quad-cores 
and 8 GB RAM, and with one older SuperMicro X7DB8 with 2 dual-core Xeons 
and 4 GB RAM on a Lustre 1.6.4.2 test system. All of these machines have 
two 3ware 9650 RAID controllers and 500 GB WD disks in RAID 5.

Subsequently I reformatted the OST partition with plain ext3 and ran the 
stress test locally on the machine (see the sketch below): no failure, 
and the partition filled to 100% without a problem.
All of this seems to indicate that it is not (solely) a hardware problem.
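
For completeness, the local comparison run was roughly the following 
(device and mount point are placeholders, and the reformat of course 
destroys the OST contents):

    # sketch of the local ext3 test; /dev/sdX and /mnt/test are placeholders
    mkfs.ext3 /dev/sdX                          # former OST partition, now plain ext3
    mount /dev/sdX /mnt/test
    cd /mnt/test
    stress -d 2 --hdd-noclean --hdd-bytes 5M    # same load as from the Lustre clients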

Prior to the most recent crash, the following was found in /var/log/kern.log:

Jul 22 21:23:52 kernel: Lustre: 25706:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0000: slow i_mutex 30s
Jul 22 21:24:10 kernel: Lustre: 25692:0:(lustre_fsfilt.h:246:fsfilt_brw_start_log()) gsilust-OST0000: slow journal start 37s
Jul 22 21:24:10 kernel: Lustre: 25692:0:(filter_io_26.c:713:filter_commitrw_write()) gsilust-OST0000: slow brw_start 37s
Jul 22 21:24:10 kernel: Lustre: 25697:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0000: slow i_mutex 37s
Jul 22 21:46:55 kernel: Lustre: 25680:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0001: slow i_mutex 31s
Jul 22 21:46:55 kernel: Lustre: 25680:0:(filter_io_26.c:700:filter_commitrw_write()) Skipped 2 previous similar messages
Jul 22 21:47:06 kernel: Lustre: 25733:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0001: slow i_mutex 30s
Jul 22 21:47:10 kernel: Lustre: 25744:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0001: slow i_mutex 31s
Jul 22 21:47:15 kernel: Lustre: 25729:0:(lustre_fsfilt.h:246:fsfilt_brw_start_log()) gsilust-OST0001: slow journal start 30s
Jul 22 21:47:15 kernel: Lustre: 25729:0:(filter_io_26.c:713:filter_commitrw_write()) gsilust-OST0001: slow brw_start 30s
Jul 22 21:47:54 kernel: Lustre: 25662:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0001: slow i_mutex 36s
Jul 22 21:48:30 kernel: Lustre: 25721:0:(lustre_fsfilt.h:246:fsfilt_brw_start_log()) gsilust-OST0001: slow journal start 33s
Jul 22 21:48:30 kernel: Lustre: 25721:0:(filter_io_26.c:713:filter_commitrw_write()) gsilust-OST0001: slow brw_start 33s
Jul 22 21:48:30 kernel: Lustre: 25736:0:(filter_io_26.c:700:filter_commitrw_write()) gsilust-OST0001: slow i_mutex 33s


However, these messages may well have appeared while the machine was 
still working normally, so they do not necessarily point to the moment 
of failure.
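
For what it's worth, the timing can be checked by looking at the last 
entries kern.log received before the box went silent, e.g. (the time 
window is just an example):

    # last kernel messages before the hang
    grep '^Jul 22 21:4' /var/log/kern.log | tail -n 30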

On some of the crashed OSS I did find a trace in kern.log, which I think 
points to a known bug in the ext3/ext4 code:
Jul 14 21:41:19 kernel: uh! busy PA
Jul 14 21:41:19 kernel:
Jul 14 21:41:19 kernel: Call Trace:
Jul 14 21:41:19 kernel: [<ffffffff8857c5ce>] :ldiskfs:ldiskfs_mb_discard_group_preallocations+0x2ae/0x400
Jul 14 21:41:19 kernel: [<ffffffff8857c75a>] :ldiskfs:ldiskfs_mb_discard_preallocations+0x3a/0x70
Jul 14 21:41:19 kernel: [<ffffffff8857cc0d>] :ldiskfs:ldiskfs_mb_new_blocks+0x24d/0x270
Jul 14 21:41:19 kernel: [<ffffffff802a1025>] __find_get_block_slow+0x2f/0xf1
Jul 14 21:41:19 kernel: [<ffffffff885c90ef>] :fsfilt_ldiskfs:ldiskfs_ext_new_extent_cb+0x4ef/0x640
Jul 14 21:41:19 kernel: [<ffffffff802a1428>] __getblk+0x1d/0x20d
Jul 14 21:41:19 kernel: [<ffffffff88577801>] :ldiskfs:ldiskfs_ext_walk_space+0x131/0x250
Jul 14 21:41:19 kernel: [<ffffffff885c8c00>] :fsfilt_ldiskfs:ldiskfs_ext_new_extent_cb+0x0/0x640
Jul 14 21:41:19 kernel: [<ffffffff885c48c4>] :fsfilt_ldiskfs:fsfilt_map_nblocks+0xa4/0x150
Jul 14 21:41:19 kernel: [<ffffffff8845095c>] :ksocklnd:ksocknal_alloc_tx+0x2c/0x2a0
Jul 14 21:41:19 kernel: [<ffffffff885f24db>] :obdfilter:filter_direct_io+0x12b/0xd60
Jul 14 21:41:19 kernel: [<ffffffff885f3e9d>] :obdfilter:filter_commitrw_write+0x7bd/0x2640
Jul 14 21:41:19 kernel: [<ffffffff883bdbbe>] :ptlrpc:ldlm_resource_foreach+0x6e/0x3a0
Jul 14 21:41:19 kernel: [<ffffffff8023a0a4>] lock_timer_base+0x26/0x4b
Jul 14 21:41:19 kernel: [<ffffffff885aec4e>] :ost:ost_brw_write+0x15be/0x2990
Jul 14 21:41:19 kernel: [<ffffffff8022c506>] default_wake_function+0x0/0xe
Jul 14 21:41:19 kernel: [<ffffffff885b32f5>] :ost:ost_handle+0x2745/0x5ed0
Jul 14 21:41:19 kernel: [<ffffffff8027db5c>] cache_alloc_refill+0x94/0x1e8
Jul 14 21:41:19 kernel: [<ffffffff8022a52c>] find_busiest_group+0x255/0x6cf
Jul 14 21:41:19 kernel: [<ffffffff80246a60>] do_gettimeofday+0x2e/0x9e
Jul 14 21:41:19 kernel: [<ffffffff8023a0a4>] lock_timer_base+0x26/0x4b
Jul 14 21:41:19 kernel: [<ffffffff8834d088>] :obdclass:class_handle2object+0x88/0x180
Jul 14 21:41:19 kernel: [<ffffffff883e41d0>] :ptlrpc:lustre_swab_ptlrpc_body+0x0/0x90
Jul 14 21:41:19 kernel: [<ffffffff883e1c9e>] :ptlrpc:lustre_swab_buf+0xbe/0xf0
Jul 14 21:41:19 kernel: [<ffffffff8023a1ef>] __mod_timer+0xb6/0xc4
Jul 14 21:41:19 kernel: [<ffffffff883ec38f>] :ptlrpc:ptlrpc_main+0x130f/0x1ce0
Jul 14 21:41:19 kernel: [<ffffffff8022c506>] default_wake_function+0x0/0xe
Jul 14 21:41:19 kernel: [<ffffffff8020aba8>] child_rip+0xa/0x12
Jul 14 21:41:19 kernel: [<ffffffff883eb080>] :ptlrpc:ptlrpc_main+0x0/0x1ce0
Jul 14 21:41:19 kernel: [<ffffffff8020ab9e>] child_rip+0x0/0x12


Neither the "slow ..." messages nor the "uh! busy PA" trace show up in 
every case of a crashed OSS, so I have no idea whether they are related 
at all.

In any case, any hints would be appreciated,
Thomas




