[Lustre-discuss] Kernel BUG causing kernel panic kernel 2.6.9-55.0.9.EL_lustre.1.6.3smp

Wojciech Turek wjt27 at cam.ac.uk
Mon Nov 19 08:26:37 PST 2007


Dear All,

We are experiencing frequent OSS crashes.
We have 4 OSSes, and each OSS serves 6 OSTs to 600 clients. We observe
random OSS crashes every 1-2 days; see the console output captured
during a crash below.
Does this look familiar to any of you? We have seen the same crashes
with Lustre 1.6.2.



Nov 18 15:17:21 storage08 heartbeat: [25566]: info: Checking status of STONITH
Nov 18 15:17:21 storage08 heartbeat: [24250]: info: Exiting STONITH-stat process

Kernel BUG at mballoc:3352
invalid operand: 0000 [1] SMP
CPU 0
Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) ost(U) mgc(U) ldiskfs(U) lustre(U) lov(U) mdc(U) lquota(U) ptlrpc(U) obdclass(U) lvfs(U) sg(U) ksocklnd(U) lnet(U) libcfs(U) cxgb3(U) ipmi_si(U) ipmi_devintf(U) ipmi_msghandler(U) md5(U) ipv6(U) autofs4(U) i2c_nforce2(U) i2c_amd756(U) i2c_isa(U) i2c_amd8111(U) i2c_i801(U) i2c_core(U) mptctl(U) dm_mirror(U) dm_round_robin(U) dm_multipath(U) dm_mod(U) sr_mod(U) usb_storage(U) joydev(U) button(U) battery(U) ac(U) uhci_hcd(U) ehci_hcd(U) hw_random(U) qla2400(U) qla2xxx(U) scsi_transport_fc(U) ata_piix(U) ext3(U) jbd(U) xfs(U) tg3(U) s2io(U) nfs(U) nfs_acl(U) lockd(U) sunrpc(U) mptsas(U) mptscsi(U) mptbase(U) megaraid_sas(U) e1000(U) bnx2(U) sd_mod(U)
Pid: 9070, comm: ll_ost_io_151 Tainted: GF     2.6.9-55.0.9.EL_lustre.1.6.3smp
RIP: 0010:[<ffffffffa05e2923>] <ffffffffa05e2923>{:ldiskfs:ldiskfs_mb_generate_from_pa+179}
RSP: 0018:00000100c9721268  EFLAGS: 00010297
RAX: 0000000000002177 RBX: 0000000000000000 RCX: 00000100c9721288
RDX: 0000000000000000 RSI: 0000000000002178 RDI: 0000010077ce42b0
RBP: 0000010077ce4290 R08: 00000100c9721280 R09: 01ff80000007c008
R10: 0000080000000000 R11: ffffffffffffffff R12: 0000010077ce42b0
R13: 000001007fb09000 R14: 0000000000000000 R15: 00000100ad763c28
FS:  0000002a95565b00(0000) GS:ffffffff804a6700(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000002a984c80e8 CR3: 0000000000101000 CR4: 00000000000006e0
Process ll_ost_io_151 (pid: 9070, threadinfo 00000100c9720000, task 00000100c96f1800)
Stack: 0000000000001000 0000000000002177 00000100b5196400 0000000000002178
       0000000000000000 0000000000002177 00000100a56d6ee0 0000000000000000
       0000000000002177 0000000000000000
Call Trace:
       <ffffffffa05e310a>{:ldiskfs:ldiskfs_mb_init_cache+1898}
       <ffffffffa05e3340>{:ldiskfs:ldiskfs_mb_load_buddy+304}
       <ffffffffa05e96e2>{:ldiskfs:ldiskfs_mb_free_blocks+626}
       <ffffffffa0180920>{:jbd:journal_get_write_access+48}
       <ffffffff801589d9>{find_get_page+65}
       <ffffffff801798e7>{__find_get_block_slow+62}
       <ffffffff8017a097>{__find_get_block+162}
       <ffffffffa0180920>{:jbd:journal_get_write_access+48}
       <ffffffffa05c9933>{:ldiskfs:ldiskfs_free_blocks+163}
       <ffffffffa05e165a>{:ldiskfs:ldiskfs_remove_blocks+282}
       <ffffffffa05e0ff4>{:ldiskfs:ldiskfs_ext_remove_space+1508}
       <ffffffffa05ce27c>{:ldiskfs:ldiskfs_mark_inode_dirty+76}
       <ffffffffa05e1f80>{:ldiskfs:ldiskfs_ext_truncate+368}
       <ffffffffa05cfcb5>{:ldiskfs:ldiskfs_truncate+309}
       <ffffffff80167df9>{unmap_mapping_range+339}
       <ffffffffa05ce11a>{:ldiskfs:ldiskfs_mark_iloc_dirty+1034}
       <ffffffff80167ea4>{vmtruncate+162}
       <ffffffff80191c88>{inode_setattr+41}
       <ffffffffa05cf5bc>{:ldiskfs:ldiskfs_setattr+444}
       <ffffffffa062ae72>{:fsfilt_ldiskfs:fsfilt_ldiskfs_setattr+386}
       <ffffffffa064af7b>{:obdfilter:filter_destroy+3131}
       <ffffffffa0456da0>{:ptlrpc:ldlm_completion_ast+0}
       <ffffffff802f069d>{tcp_rcv_established+2099}
       <ffffffffa047bd83>{:ptlrpc:lustre_msg_add_version+83}
       <ffffffffa047d205>{:ptlrpc:lustre_msg_check_version+69}
       <ffffffffa061a25d>{:ost:ost_handle+6397}
       <ffffffff802dfc76>{ip_rcv+1046}
       <ffffffff802c6861>{netif_receive_skb+791}
       <ffffffffa031a9ba>{:cxgb3:lro_flush_session+154}
       <ffffffffa035fb58>{:lnet:lnet_match_blocked_msg+920}
       <ffffffffa0485b4c>{:ptlrpc:ptlrpc_server_handle_request+3036}
       <ffffffffa033cbae>{:libcfs:lcw_update_time+30}
       <ffffffff8013f448>{__mod_timer+293}
       <ffffffffa04881d8>{:ptlrpc:ptlrpc_main+2504}
       <ffffffff80133566>{default_wake_function+0}
       <ffffffffa0486860>{:ptlrpc:ptlrpc_retry_rqbds+0}
       <ffffffffa0486860>{:ptlrpc:ptlrpc_retry_rqbds+0}
       <ffffffff80110de3>{child_rip+8}
       <ffffffffa0487810>{:ptlrpc:ptlrpc_main+0}
       <ffffffff80110ddb>{child_rip+0}

Code: 0f 0b d2 bb 5e a0 ff ff ff ff 18 0d 90 8b 4c 24 20 8d 34 0b
RIP <ffffffffa05e2923>{:ldiskfs:ldiskfs_mb_generate_from_pa+179} RSP <00000100c9721268>

 <0>Kernel panic - not syncing: Oops
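For anyone reading the trace: "Kernel BUG at mballoc:3352" together with
"invalid operand: 0000" is the signature of a BUG_ON()-style assertion
firing on line 3352 of the ldiskfs mballoc code, inside
ldiskfs_mb_generate_from_pa(), while filter_destroy was truncating an
object and freeing its blocks. The snippet below is only a minimal
userspace sketch of that pattern, not the ldiskfs source; the names
generate_from_pa, pa_group and group, and the idea that the failed check
compares a preallocation's block group against the group being loaded,
are assumptions based on the upstream ext3/ext4 mballoc code this patch
derives from, not quoted from Lustre 1.6.3.

/* Userspace sketch of a kernel BUG_ON(): report file/line, then trap.
 * In the kernel, BUG() executes an invalid instruction, which is why the
 * oops above says "invalid operand" and names the file and line. */
#include <stdio.h>
#include <stdlib.h>

#define BUG_ON(cond)                                              \
        do {                                                      \
                if (cond) {                                       \
                        fprintf(stderr, "Kernel BUG at %s:%d\n",  \
                                __FILE__, __LINE__);              \
                        abort();                                  \
                }                                                 \
        } while (0)

/* Hypothetical consistency check of the kind assumed above: a
 * preallocation descriptor must belong to the block group whose bitmap
 * is being rebuilt; a mismatch means the in-memory allocator state is
 * inconsistent, so the code gives up rather than write a bad bitmap. */
static void generate_from_pa(unsigned long pa_group, unsigned long group)
{
        BUG_ON(pa_group != group);
        /* ... mark the preallocation's blocks as used in the bitmap ... */
}

int main(void)
{
        generate_from_pa(7, 7);   /* consistent: no output           */
        generate_from_pa(7, 9);   /* inconsistent: prints and aborts */
        return 0;
}

If the real check is of that kind, the allocator detected inconsistent
preallocation state for the group and panicked deliberately, which would
point at an mballoc/ldiskfs bug rather than at hardware.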


Best regards

Wojciech Turek

Mr Wojciech Turek
Assistant System Manager
University of Cambridge
High Performance Computing service
email: wjt27 at cam.ac.uk
tel. +441223763517




