[Lustre-discuss] panic on jbd:journal_dirty_metadata

Thu Jul 22 07:16:27 PDT 2010

Hi Michael,

This looks like the problem we had some time ago after upgrading to 1.8.3:

https://bugzilla.lustre.org/show_bug.cgi?id=22889

Best regards

Wojciech

On 20 July 2010 00:00, Michael Sternberg <sternberg at anl.gov> wrote:

> Hello,
>
> I use OSSs with external journal partitions, and since lustre-1.8.1 I have
> been getting frustrating panics on the OSSs about once or twice a week:
>
>        :libcfs:cfs_alloc ...
>        :lvfs:lprocfs_counter_add ...
>        ...
>
>        RIP [<ffffffff88031e64>] :jbd:journal_dirty_metadata+0x7f/0x1e3
>          RSP <ffff8101f99c3670>
>          <0>Kernel panic - not syncing: Fatal exception
>
>        (full graphical screenshot attached, hoping it passes through)
>
> Clients sometimes report:
>
>        Message from syslogd@ at Mon Jul 19 04:11:46 2010 ...
>        login4 kernel: journal commit I/O error
>
>
> I recently updated to 1.8.3, at which point I e2fsck'd and re-initialized
> the external journals, but I still get those panics. I use 2 OSSs with
> heartbeat failover; each one "owns" and normally serves 2 OSTs (4 OSTs
> total), coming from 4 LUNs on a RAID unit with dual controllers. All OSTs
> use ldiskfs (pre-ext4 proper, if I understand correctly) with the journals
> located on partitions of 2 further LUNs. I use a variant of the script at
> bug 20807 to account for the different device numbers of the external
> journals on the two OSSs. Failover usually works, eventually, after a load
> peak of up to 100 on the OSS taking over and messages about hung threads
> (see below).
>
>
> Is there anything I could do besides giving up on external journals?  My
> data stores are RAID1, and the journal disks are a single pair of disks also
> in RAID1.
>
> I had difficulty finding further information when googling
> "journal_dirty_metadata" as it pertains to lustre/ldiskfs specifically.
> There are old discussions at:
>
>        https://bugzilla.redhat.com/show_bug.cgi?id=183119
>          (2007/2008 - kernel 2.4.7)
>
>        http://oss.oracle.com/pipermail/ocfs2-users/2010-January/004113.html
>          (ahem)
>
>
> With best regards,
> Michael
>
>
> [root@mds01 ~]# cat /proc/fs/lustre/version
> lustre: 1.8.3
> kernel: patchless_client
> build:  1.8.3-20100409182943-PRISTINE-2.6.18-164.11.1.el5_lustre.1.8.3
>
> [root@mds01 ~]# uname -r
> 2.6.18-164.11.1.el5_lustre.1.8.3
>
> [root@mds01 ~]# lctl dl
>  0 UP mgs MGS MGS 725
>  1 UP mgc MGC172.17.120.1@o2ib 9642cdcd-4955-ca05-4e85-8a9f6d10c027 5
>  2 UP mdt MDS MDS_uuid 3
>  7 UP lov sandbox-mdtlov sandbox-mdtlov_UUID 4
>  8 UP mds sandbox-MDT0000 sandbox-MDT0000_UUID 719
>  9 UP osc sandbox-OST0000-osc sandbox-mdtlov_UUID 5
> 10 UP osc sandbox-OST0001-osc sandbox-mdtlov_UUID 5
>
> [root@mds01 ~]# ssh mds02 lctl dl
>  0 UP mgc MGC172.17.120.1@o2ib 12c07f8c-f1e7-f739-9983-3c3aa2ec492a 5
>  1 UP mdt MDS MDS_uuid 3
>  2 UP lov carbonfs-mdtlov carbonfs-mdtlov_UUID 4
>  3 UP mds carbonfs-MDT0000 carbonfs-MDT0000_UUID 719
>  4 UP osc carbonfs-OST0001-osc carbonfs-mdtlov_UUID 5
>  5 UP osc carbonfs-OST0000-osc carbonfs-mdtlov_UUID 5
>
> [root@oss01 ~]# lctl dl
>  0 UP mgc MGC172.17.120.1@o2ib a89ba7f9-2a12-ff77-0321-151a1addf043 5
>  1 UP ost OSS OSS_uuid 3
>  2 UP obdfilter sandbox-OST0000 sandbox-OST0000_UUID 721
>  3 UP obdfilter carbonfs-OST0000 carbonfs-OST0000_UUID 721
>  4 UP obdfilter sandbox-OST0001 sandbox-OST0001_UUID 721
>  5 UP obdfilter carbonfs-OST0001 carbonfs-OST0001_UUID 721
>
> [client]# lctl dl
>  0 UP mgc MGC172.17.120.1@o2ib dadd88bf-fbad-d933-b02a-a539fd8abfea 5
>  1 UP lov sandbox-clilov-ffff8101da93c400 30094b3a-b246-e667-ef8a-f6690e4d051c 4
>  2 UP mdc sandbox-MDT0000-mdc-ffff8101da93c400 30094b3a-b246-e667-ef8a-f6690e4d051c 5
>  3 UP osc sandbox-OST0000-osc-ffff8101da93c400 30094b3a-b246-e667-ef8a-f6690e4d051c 5
>  4 UP osc sandbox-OST0001-osc-ffff8101da93c400 30094b3a-b246-e667-ef8a-f6690e4d051c 5
>  5 UP lov carbonfs-clilov-ffff81041dc0f800 4aaa2ddd-eee8-564f-ea48-b9c36a428eb9 4
>  6 UP mdc carbonfs-MDT0000-mdc-ffff81041dc0f800 4aaa2ddd-eee8-564f-ea48-b9c36a428eb9 5
>  7 UP osc carbonfs-OST0001-osc-ffff81041dc0f800 4aaa2ddd-eee8-564f-ea48-b9c36a428eb9 5
>  8 UP osc carbonfs-OST0000-osc-ffff81041dc0f800 4aaa2ddd-eee8-564f-ea48-b9c36a428eb9 5
>
>
> --------------------------------------------------
> "thread hung" messages:
> --------------------------------------------------
> Jul 19 04:01:05 oss01 kernel: Lustre: carbonfs-OST0001: Recovery period
> over after 1:05, of 359 clients 358 recovered and 0 were evicted.
> Jul 19 04:01:05 oss01 kernel: Lustre: carbonfs-OST0001: sending delayed
> replies to recovered clients
> Jul 19 04:01:05 oss01 kernel: Lustre: carbonfs-OST0001: received MDS
> connection from 172.17.120.2@o2ib
> Jul 19 04:03:36 oss01 kernel: LustreError:
> 0:0:(ldlm_lockd.c:305:waiting_locks_callback()) ### lock callback timer
> expired after 151s: evicting client at 172.17.0.235@o2ib  ns:
> filter-carbonfs-OST0001_UUID lock: ffff810178b8d600/0x80a4d28c4aff67ec
> lrc: 3/0,0 mode: PW/PW res: 152472/0 rrc: 5 type: EXT
> [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x10020
> remote: 0x4701b34288b619f7 expref: 12 pid: 6806 timeout 4356934523
> Jul 19 04:04:16 oss01 kernel: Lustre:
> 6757:0:(ldlm_lib.c:804:target_handle_connect()) carbonfs-OST0001: exp
> ffff8101ecf94e00 already connecting
> Jul 19 04:04:16 oss01 kernel: Lustre:
> 6757:0:(ldlm_lib.c:804:target_handle_connect()) Skipped 38 previous similar
> messages
> Jul 19 04:04:25 oss01 kernel: Lustre: Service thread pid 6145 was inactive
> for 200.00s. The thread might be hung, or it might only be slow and will
> resume later. Dumping the stack trace for debugging purposes:
> Jul 19 04:04:25 oss01 kernel: Lustre: Skipped 2 previous similar messages
> Jul 19 04:04:25 oss01 kernel: Lustre:
> 0:0:(linux-debug.c:264:libcfs_debug_dumpstack()) showing stack for process
> 6145
> Jul 19 04:04:25 oss01 kernel: Lustre:
> 0:0:(linux-debug.c:264:libcfs_debug_dumpstack()) Skipped 2 previous similar
> messages
> Jul 19 04:04:25 oss01 kernel: ll_ost_io_38  D 0000000000000000     0
>  6145      1          6146  6144 (L-TLB)
> Jul 19 04:04:25 oss01 kernel:  ffff810214dcb730 0000000000000046
> ffff81021158b220 ffff81017fe44cc0
> Jul 19 04:04:25 oss01 kernel:  ffff81017fe44cc0 000000000000000a
> ffff810214d4a7e0 ffff810214c0f7e0
> Jul 19 04:04:25 oss01 kernel:  000038708ddb5ba3 0000000000001a95
> ffff810214d4a9c8 000000042fe56000
> Jul 19 04:04:25 oss01 kernel: Call Trace:
> Jul 19 04:04:25 oss01 kernel:
>  [<ffffffff88036769>] :jbd:log_wait_commit+0xa3/0xf5
> Jul 19 04:04:25 oss01 kernel:  [<ffffffff800a00be>]
> autoremove_wake_function+0x0/0x2e
> Jul 19 04:04:25 oss01 kernel:  [<ffffffff88cc900b>]
> :fsfilt_ldiskfs:fsfilt_ldiskfs_commit_wait+0xab/0xd0
> Jul 19 04:04:25 oss01 kernel:  [<ffffffff88d0798f>]
> :obdfilter:filter_commitrw_write+0x19ef/0x2980
> Jul 19 04:04:25 oss01 kernel:  [<ffffffff88ca4ed5>]
> :ost:ost_checksum_bulk+0x385/0x5b0
> Jul 19 04:04:26 oss01 kernel:
>  [<ffffffff88ca4c38>] :ost:ost_checksum_bulk+0xe8/0x5b0
> Jul 19 04:04:26 oss01 kernel:  [<ffffffff88cab8da>]
> :ost:ost_brw_write+0x1c1a/0x23b0
> Jul 19 04:04:26 oss01 kernel:  [<ffffffff889ec238>]
> :ptlrpc:ptlrpc_send_reply+0x5c8/0x5e0
> Jul 19 04:04:26 oss01 kernel:  [<ffffffff889b7b10>]
> :ptlrpc:target_committed_to_req+0x40/0x120
> Jul 19 04:04:26 oss01 kernel:  [<ffffffff88ca781d>]
> :ost:ost_brw_read+0x189d/0x1a70
> Jul 19 04:04:26 oss01 kernel:
>  [<ffffffff889f06c5>] :ptlrpc:lustre_msg_get_version+0x35/0xf0
> Jul 19 04:04:26 oss01 kernel:  [<ffffffff8008c86b>]
> default_wake_function+0x0/0xe
> Jul 19 04:04:26 oss01 kernel:  [<ffffffff889f0788>]
> :ptlrpc:lustre_msg_check_version_v2+0x8/0x20
> Jul 19 04:04:26 oss01 kernel:  [<ffffffff88caec1e>]
> :ost:ost_handle+0x2bae/0x53d0
> Jul 19 04:04:26 oss01 kernel:  [<ffffffff889cde89>]
> :ptlrpc:__ldlm_add_waiting_lock+0x189/0x1c0
> Jul 19 04:04:26 oss01 kernel:
>  [<ffffffff889cf10e>] :ptlrpc:ldlm_refresh_waiting_lock+0xbe/0x110
> Jul 19 04:04:26 oss01 kernel:  [<ffffffff88ca2571>]
> :ost:ost_prolong_locks_iter+0x151/0x180
> Jul 19 04:04:26 oss01 kernel:  [<ffffffff889fd9b5>]
> :ptlrpc:ptlrpc_server_handle_request+0xaa5/0x1140
> Jul 19 04:04:26 oss01 kernel:  [<ffffffff80062ff8>] thread_return+0x62/0xfe
> Jul 19 04:04:26 oss01 kernel:  [<ffffffff8003dafe>]
> lock_timer_base+0x1b/0x3c
> Jul 19 04:04:26 oss01 kernel:  [<ffffffff8001ca74>]
> __mod_timer+0xb0/0xbe
> Jul 19 04:04:26 oss01 kernel:  [<ffffffff88a01408>]
> :ptlrpc:ptlrpc_main+0x1258/0x1420
> Jul 19 04:04:26 oss01 kernel:  [<ffffffff8008c86b>]
> default_wake_function+0x0/0xe
> Jul 19 04:04:26 oss01 kernel:  [<ffffffff800b7076>]
> audit_syscall_exit+0x336/0x362
> Jul 19 04:04:26 oss01 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
> Jul 19 04:04:26 oss01 kernel:  [<ffffffff88a001b0>]
> :ptlrpc:ptlrpc_main+0x0/0x1420
> Jul 19 04:04:26 oss01 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
> Jul 19 04:04:26 oss01 kernel:
> --------------------------------------------------
>
>
>
>
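
When comparing a panic or hung-thread trace like the one quoted above
against a bug report, it can help to reduce the trace to a list of
module/symbol/offset entries. The following is only a minimal Python sketch,
not a Lustre or kernel tool: the names strip_quote_prefix and parse_frames
are made up for illustration, and the regular expression assumes the
2.6.18-era format seen in this thread, i.e. "[<address>] :module:symbol+0xOFF/0xSIZE"
with the ":module:" part absent for core kernel symbols. It tolerates
entries that mail wrapping has split across lines or run together.

```python
import re

# Assumed frame format (taken from the traces quoted in this thread):
#   [<ffffffff88031e64>] :jbd:journal_dirty_metadata+0x7f/0x1e3
# The ":module:" prefix is optional; core kernel symbols do not have one.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]{16})>\]\s+"
    r"(?::(?P<module>\w+):)?"
    r"(?P<symbol>[\w.]+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)


def strip_quote_prefix(text):
    """Remove a leading '> ' mail-quote marker from every line."""
    return re.sub(r"(?m)^> ?", "", text)


def parse_frames(text):
    """Return one dict per stack frame found anywhere in the text."""
    frames = []
    for m in FRAME_RE.finditer(strip_quote_prefix(text)):
        frames.append({
            "addr": int(m.group("addr"), 16),
            "module": m.group("module") or "kernel",
            "symbol": m.group("symbol"),
            "offset": int(m.group("offset"), 16),
            "size": int(m.group("size"), 16),
        })
    return frames


if __name__ == "__main__":
    sample = (
        ">        RIP [<ffffffff88031e64>] "
        ":jbd:journal_dirty_metadata+0x7f/0x1e3\n"
    )
    for frame in parse_frames(sample):
        print("%(module)s:%(symbol)s +%(offset)#x of %(size)#x" % frame)
```

Filtering the result for module "jbd" gives a quick list of the journal
functions involved, which is usually the part worth quoting in a bug report.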

-- 
Wojciech Turek

Assistant System Manager

High Performance Computing Service
University of Cambridge
Email: wjt27 at cam.ac.uk
Tel: (+)44 1223 763517