[Lustre-discuss] panic on jbd:journal_dirty_metadata

Michael Sternberg sternberg@anl.gov
Mon Jul 19 16:00:42 PDT 2010


Hello,

I run OSSs with external journal partitions, and since lustre-1.8.1 I have been getting frustrating panics on the OSSs about once or twice a week, as follows:

	:libcfs:cfs_alloc ...
	:lvfs:lprocfs_counter_add ...
	...

	RIP [<ffffffff88031e64>] :jbd:journal_dirty_metadata+0x7f/0x1e3
	  RSP <ffff8101f99c3670>
	  <0>Kernel panic - not syncing: Fatal exception

	(full graphical screenshot attached, hoping it passes through)
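
In case the screenshot does not survive the list: one way to capture the full panic text is netconsole. A minimal sketch only -- the interface name, log-host IP, and MAC below are placeholders, not my actual setup:

	# route kernel/panic console output to a remote syslog host over UDP
	modprobe netconsole netconsole=@/eth0,514@172.17.120.100/00:11:22:33:44:55
	# on the log host: run syslogd with remote reception enabled ("syslogd -r"),
	# or simply listen with "nc -u -l 514"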

Clients sometimes report:

	Message from syslogd@ at Mon Jul 19 04:11:46 2010 ...
	login4 kernel: journal commit I/O error


I recently updated to 1.8.3, at which point I e2fsck'd and re-initialized the external journals, but I still get these panics. I use 2 OSSs with heartbeat failover; each one "owns" and normally serves 2 OSTs (4 OSTs total), coming from 4 LUNs on a RAID unit with dual controllers. All OSTs use ldiskfs (pre-ext4 proper, if I understand correctly), with the journals located on partitions of 2 further LUNs. I use a variant of the script at bug 20807 to account for the different device numbers of the external journals on the two OSSs. Failover usually works, eventually, after a load peak of up to 100 on the OSS taking over and a series of messages about hung threads (see below).
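
The gist of that script, as a sketch (device names are examples, not my actual paths; the OST must be unmounted and clean):

	#!/bin/sh
	# ldiskfs records the external journal by major:minor device number,
	# which differs between the two OSSs, so re-point the OST at the
	# journal's device number on *this* node before mounting it.
	OST=/dev/mapper/ost0000            # OST LUN (example name)
	JDEV=/dev/mapper/ost0000_journal   # its external journal partition (example name)

	tune2fs -f -O ^has_journal $OST    # drop the stale journal reference
	tune2fs -j -J device=$JDEV $OST    # re-attach at the local device number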


Is there anything I could do besides giving up on external journals?  My data stores are RAID1, and the journal disks are a single pair of disks also in RAID1.
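
For reference, reverting an OST to an internal journal should go roughly like this; a sketch only, with an example device name, run with the OST unmounted:

	umount /mnt/ost0000                              # or fail the OST over first
	tune2fs -f -O ^has_journal /dev/mapper/ost0000   # detach the external journal
	e2fsck -fy /dev/mapper/ost0000                   # full check before re-journaling
	tune2fs -j -J size=400 /dev/mapper/ost0000       # create a 400 MB internal journal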

I had difficulty locating further information when googling "journal_dirty_metadata" as it pertains to lustre/ldiskfs specifically. There are old discussions at:

	https://bugzilla.redhat.com/show_bug.cgi?id=183119	2007/2008 - kernel 2.4.7
	http://oss.oracle.com/pipermail/ocfs2-users/2010-January/004113.html	(ahem)


With best regards,
Michael


[root@mds01 ~]# cat /proc/fs/lustre/version
lustre: 1.8.3
kernel: patchless_client
build:  1.8.3-20100409182943-PRISTINE-2.6.18-164.11.1.el5_lustre.1.8.3

[root@mds01 ~]# uname -r
2.6.18-164.11.1.el5_lustre.1.8.3

[root@mds01 ~]# lctl dl
 0 UP mgs MGS MGS 725
 1 UP mgc MGC172.17.120.1@o2ib 9642cdcd-4955-ca05-4e85-8a9f6d10c027 5
 2 UP mdt MDS MDS_uuid 3
 7 UP lov sandbox-mdtlov sandbox-mdtlov_UUID 4
 8 UP mds sandbox-MDT0000 sandbox-MDT0000_UUID 719
 9 UP osc sandbox-OST0000-osc sandbox-mdtlov_UUID 5
10 UP osc sandbox-OST0001-osc sandbox-mdtlov_UUID 5

[root@mds01 ~]# ssh mds02 lctl dl
 0 UP mgc MGC172.17.120.1@o2ib 12c07f8c-f1e7-f739-9983-3c3aa2ec492a 5
 1 UP mdt MDS MDS_uuid 3
 2 UP lov carbonfs-mdtlov carbonfs-mdtlov_UUID 4
 3 UP mds carbonfs-MDT0000 carbonfs-MDT0000_UUID 719
 4 UP osc carbonfs-OST0001-osc carbonfs-mdtlov_UUID 5
 5 UP osc carbonfs-OST0000-osc carbonfs-mdtlov_UUID 5

[root@oss01 ~]# lctl dl
 0 UP mgc MGC172.17.120.1@o2ib a89ba7f9-2a12-ff77-0321-151a1addf043 5
 1 UP ost OSS OSS_uuid 3
 2 UP obdfilter sandbox-OST0000 sandbox-OST0000_UUID 721
 3 UP obdfilter carbonfs-OST0000 carbonfs-OST0000_UUID 721
 4 UP obdfilter sandbox-OST0001 sandbox-OST0001_UUID 721
 5 UP obdfilter carbonfs-OST0001 carbonfs-OST0001_UUID 721

[client]# lctl dl
 0 UP mgc MGC172.17.120.1@o2ib dadd88bf-fbad-d933-b02a-a539fd8abfea 5
 1 UP lov sandbox-clilov-ffff8101da93c400 30094b3a-b246-e667-ef8a-f6690e4d051c 4
 2 UP mdc sandbox-MDT0000-mdc-ffff8101da93c400 30094b3a-b246-e667-ef8a-f6690e4d051c 5
 3 UP osc sandbox-OST0000-osc-ffff8101da93c400 30094b3a-b246-e667-ef8a-f6690e4d051c 5
 4 UP osc sandbox-OST0001-osc-ffff8101da93c400 30094b3a-b246-e667-ef8a-f6690e4d051c 5
 5 UP lov carbonfs-clilov-ffff81041dc0f800 4aaa2ddd-eee8-564f-ea48-b9c36a428eb9 4
 6 UP mdc carbonfs-MDT0000-mdc-ffff81041dc0f800 4aaa2ddd-eee8-564f-ea48-b9c36a428eb9 5
 7 UP osc carbonfs-OST0001-osc-ffff81041dc0f800 4aaa2ddd-eee8-564f-ea48-b9c36a428eb9 5
 8 UP osc carbonfs-OST0000-osc-ffff81041dc0f800 4aaa2ddd-eee8-564f-ea48-b9c36a428eb9 5


--------------------------------------------------
"thread hung" messages:
--------------------------------------------------
Jul 19 04:01:05 oss01 kernel: Lustre: carbonfs-OST0001: Recovery period over after 1:05, of 359 clients 358 recovered and 0 were evicted.
Jul 19 04:01:05 oss01 kernel: Lustre: carbonfs-OST0001: sending delayed replies to recovered clients
Jul 19 04:01:05 oss01 kernel: Lustre: carbonfs-OST0001: received MDS connection from 172.17.120.2@o2ib
Jul 19 04:03:36 oss01 kernel: LustreError: 0:0:(ldlm_lockd.c:305:waiting_locks_callback()) ### lock callback timer expired after 151s: evicting client at 172.17.0.235@o2ib  ns: filter-carbonfs-OST0001_UUID lock: ffff810178b8d600/0x80a4d28c4aff67ec lrc: 3/0,0 mode: PW/PW res: 152472/0 rrc: 5 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x10020 remote: 0x4701b34288b619f7 expref: 12 pid: 6806 timeout 4356934523
Jul 19 04:04:16 oss01 kernel: Lustre: 6757:0:(ldlm_lib.c:804:target_handle_connect()) carbonfs-OST0001: exp ffff8101ecf94e00 already connecting
Jul 19 04:04:16 oss01 kernel: Lustre: 6757:0:(ldlm_lib.c:804:target_handle_connect()) Skipped 38 previous similar messages
Jul 19 04:04:25 oss01 kernel: Lustre: Service thread pid 6145 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Jul 19 04:04:25 oss01 kernel: Lustre: Skipped 2 previous similar messages
Jul 19 04:04:25 oss01 kernel: Lustre: 0:0:(linux-debug.c:264:libcfs_debug_dumpstack()) showing stack for process 6145
Jul 19 04:04:25 oss01 kernel: Lustre: 0:0:(linux-debug.c:264:libcfs_debug_dumpstack()) Skipped 2 previous similar messages
Jul 19 04:04:25 oss01 kernel: ll_ost_io_38  D 0000000000000000     0  6145      1          6146  6144 (L-TLB)
Jul 19 04:04:25 oss01 kernel:  ffff810214dcb730 0000000000000046 ffff81021158b220 ffff81017fe44cc0
Jul 19 04:04:25 oss01 kernel:  ffff81017fe44cc0 000000000000000a ffff810214d4a7e0 ffff810214c0f7e0
Jul 19 04:04:25 oss01 kernel:  000038708ddb5ba3 0000000000001a95 ffff810214d4a9c8 000000042fe56000
Jul 19 04:04:25 oss01 kernel: Call Trace:
Jul 19 04:04:25 oss01 kernel:  [<ffffffff88036769>] :jbd:log_wait_commit+0xa3/0xf5
Jul 19 04:04:25 oss01 kernel:  [<ffffffff800a00be>] autoremove_wake_function+0x0/0x2e
Jul 19 04:04:25 oss01 kernel:  [<ffffffff88cc900b>] :fsfilt_ldiskfs:fsfilt_ldiskfs_commit_wait+0xab/0xd0
Jul 19 04:04:25 oss01 kernel:  [<ffffffff88d0798f>] :obdfilter:filter_commitrw_write+0x19ef/0x2980
Jul 19 04:04:25 oss01 kernel:  [<ffffffff88ca4ed5>] :ost:ost_checksum_bulk+0x385/0x5b0
Jul 19 04:04:26 oss01 kernel:  [<ffffffff88ca4c38>] :ost:ost_checksum_bulk+0xe8/0x5b0
Jul 19 04:04:26 oss01 kernel:  [<ffffffff88cab8da>] :ost:ost_brw_write+0x1c1a/0x23b0
Jul 19 04:04:26 oss01 kernel:  [<ffffffff889ec238>] :ptlrpc:ptlrpc_send_reply+0x5c8/0x5e0
Jul 19 04:04:26 oss01 kernel:  [<ffffffff889b7b10>] :ptlrpc:target_committed_to_req+0x40/0x120
Jul 19 04:04:26 oss01 kernel:  [<ffffffff88ca781d>] :ost:ost_brw_read+0x189d/0x1a70
Jul 19 04:04:26 oss01 kernel:  [<ffffffff889f06c5>] :ptlrpc:lustre_msg_get_version+0x35/0xf0
Jul 19 04:04:26 oss01 kernel:  [<ffffffff8008c86b>] default_wake_function+0x0/0xe
Jul 19 04:04:26 oss01 kernel:  [<ffffffff889f0788>] :ptlrpc:lustre_msg_check_version_v2+0x8/0x20
Jul 19 04:04:26 oss01 kernel:  [<ffffffff88caec1e>] :ost:ost_handle+0x2bae/0x53d0
Jul 19 04:04:26 oss01 kernel:  [<ffffffff889cde89>] :ptlrpc:__ldlm_add_waiting_lock+0x189/0x1c0
Jul 19 04:04:26 oss01 kernel:  [<ffffffff889cf10e>] :ptlrpc:ldlm_refresh_waiting_lock+0xbe/0x110
Jul 19 04:04:26 oss01 kernel:  [<ffffffff88ca2571>] :ost:ost_prolong_locks_iter+0x151/0x180
Jul 19 04:04:26 oss01 kernel:  [<ffffffff889fd9b5>] :ptlrpc:ptlrpc_server_handle_request+0xaa5/0x1140
Jul 19 04:04:26 oss01 kernel:  [<ffffffff80062ff8>] thread_return+0x62/0xfe
Jul 19 04:04:26 oss01 kernel:  [<ffffffff8003dafe>] lock_timer_base+0x1b/0x3c
Jul 19 04:04:26 oss01 kernel:  [<ffffffff8001ca74>] __mod_timer+0xb0/0xbe
Jul 19 04:04:26 oss01 kernel:  [<ffffffff88a01408>] :ptlrpc:ptlrpc_main+0x1258/0x1420
Jul 19 04:04:26 oss01 kernel:  [<ffffffff8008c86b>] default_wake_function+0x0/0xe
Jul 19 04:04:26 oss01 kernel:  [<ffffffff800b7076>] audit_syscall_exit+0x336/0x362
Jul 19 04:04:26 oss01 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Jul 19 04:04:26 oss01 kernel:  [<ffffffff88a001b0>] :ptlrpc:ptlrpc_main+0x0/0x1420
Jul 19 04:04:26 oss01 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
Jul 19 04:04:26 oss01 kernel: 
--------------------------------------------------
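
Since the hung threads sit in jbd:log_wait_commit, i.e. waiting on journal I/O, one sanity check is whether each OST still references the intended journal device (example device names again):

	dumpe2fs -h /dev/mapper/ost0000 | grep -i journal   # shows Journal UUID / device
	dumpe2fs -h /dev/mapper/ost0000_journal | head      # features should list journal_dev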

-------------- next part --------------
A non-text attachment was scrubbed...
Name: HPC_2010-06-30_oss02_panic-out.png
Type: image/png
Size: 29628 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20100719/bbfce767/attachment.png>

