[Lustre-discuss] OSS crashed with PANIC

wanglu wanglu at ihep.ac.cn
Mon Oct 10 01:22:28 PDT 2011


Hi all, 

   One of our OSS nodes, running Lustre kernel "2.6.18-194.17.1.el5_lustre.1.8.5", crashed today.

 crash> sys
      KERNEL: /usr/lib/debug/lib/modules/2.6.18-194.17.1.el5_lustre.1.8.5//vmlinux
    DUMPFILE: /var/crash/2011-10-10-15:30/vmcore
        CPUS: 8
        DATE: Mon Oct 10 15:29:15 2011
      UPTIME: 05:30:52
LOAD AVERAGE: 3.74, 2.14, 1.57
       TASKS: 983
    NODENAME: 
     RELEASE: 2.6.18-194.17.1.el5_lustre.1.8.5
     VERSION: #1 SMP Mon Nov 15 15:48:43 MST 2010
     MACHINE: x86_64  (2399 Mhz)
      MEMORY: 23.6 GB
       PANIC: "Oops: 0000 [1] SMP " (check log for details)

Here is the end of crash dump log:

Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP: 
 [<ffffffff8895e7c6>] :obdfilter:filter_preprw+0x1746/0x1e00
PGD 2f8e86067 PUD 31c2c4067 PMD 0 
Oops: 0000 [1] SMP 
last sysfs file: /devices/pci0000:00/0000:00:00.0/irq
CPU 4 
Modules linked in: autofs4(U) hidp(U) obdfilter(U) fsfilt_ldiskfs(U) ost(U) mgc(U) ldiskfs(U) jbd2(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) rfcomm(U) l2cap(U) bluetooth(U) sunrpc(U) dm_multipath(U) scsi_dh(U) video(U) backlight(U) sbs(U) power_meter(U) hwmon(U) i2c_ec(U) i2c_core(U) dell_wmi(U) wmi(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) parport_pc(U) lp(U) parport(U) joydev(U) ixgbe(U) 8021q(U) hpilo(U) sg(U) shpchp(U) dca(U) serio_raw(U) pcspkr(U) bnx2(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_mem_cache(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_log(U) dm_mod(U) usb_storage(U) lpfc(U) scsi_transport_fc(U) cciss(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Pid: 4252, comm: ll_ost_io_12 Tainted: G      2.6.18-194.17.1.el5_lustre.1.8.5 #1
RIP: 0010:[<ffffffff8895e7c6>]  [<ffffffff8895e7c6>] :obdfilter:filter_preprw+0x1746/0x1e00
RSP: 0018:ffff81030bdcd8c0  EFLAGS: 00010206
RAX: 0000000000000021 RBX: 0000000000000000 RCX: ffff810011017300
RDX: ffff8101067b4c90 RSI: 000000000000000e RDI: 3533313130323331
RBP: ffff81030bdd1388 R08: ffff81061ff40b03 R09: 0000000000001000
R10: 0000000000000000 R11: 00000000000200d2 R12: 000000000000007e
R13: 000000000007e000 R14: 0000000000000100 R15: 0000000000000100
FS:  00002ba793935220(0000) GS:ffff81010af99240(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 00000002f92d1000 CR4: 00000000000006e0
Process ll_ost_io_12 (pid: 4252, threadinfo ffff81030bdcc000, task ffff81030bd17080)
Stack:  0000000000000000 ffff81031fc503c0 ffff8102c44c7200 0000000000000000
 ffff81031fc503c0 00020000c0a83281 00020000ca7a214e ffffffff88539543
 0000000000000000 ffffffff885a0d80 ffff8102c44c7200 ffffffff8853ba03
Call Trace:
 [<ffffffff88539543>] :lnet:lnet_ni_send+0x93/0xd0
 [<ffffffff885a0d80>] :obdclass:class_handle2object+0xe0/0x170
 [<ffffffff8853ba03>] :lnet:lnet_send+0x9a3/0x9d0
 [<ffffffff8002b84a>] truncate_inode_pages_range+0x222/0x2ba
 [<ffffffff88908ffc>] :ost:ost_brw_write+0xf9c/0x2480
 [<ffffffff8864a658>] :ptlrpc:ptlrpc_send_reply+0x5c8/0x5e0
 [<ffffffff886158b0>] :ptlrpc:target_committed_to_req+0x40/0x120
 [<ffffffff8864eb05>] :ptlrpc:lustre_msg_get_version+0x35/0xf0
 [<ffffffff8864ea15>] :ptlrpc:lustre_msg_get_opc+0x35/0xf0
 [<ffffffff8008cf93>] default_wake_function+0x0/0xe
 [<ffffffff8864ebc8>] :ptlrpc:lustre_msg_check_version_v2+0x8/0x20
 [<ffffffff8890d08e>] :ost:ost_handle+0x2bae/0x55b0
 [<ffffffff80150d56>] __next_cpu+0x19/0x28
 [<ffffffff800767ae>] smp_send_reschedule+0x4e/0x53
 [<ffffffff8865e15a>] :ptlrpc:ptlrpc_server_handle_request+0x97a/0xdf0
 [<ffffffff8865e8a8>] :ptlrpc:ptlrpc_wait_event+0x2d8/0x310
 [<ffffffff8008b3bd>] __wake_up_common+0x3e/0x68
 [<ffffffff8865f817>] :ptlrpc:ptlrpc_main+0xf37/0x10f0
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff8865e8e0>] :ptlrpc:ptlrpc_main+0x0/0x10f0
 [<ffffffff8005dfa7>] child_rip+0x0/0x11
Code: 44 39 23 7e 0c 48 83 c5 28 e9 56 fd ff ff 45 31 ed 48 8d bc 
RIP  [<ffffffff8895e7c6>] :obdfilter:filter_preprw+0x1746/0x1e00
 RSP <ffff81030bdcd8c0>



and here is the bt result:
crash> bt
PID: 4252   TASK: ffff81030bd17080  CPU: 4   COMMAND: "ll_ost_io_12"
 #0 [ffff81030bdcd620] crash_kexec at ffffffff800ad9c4
 #1 [ffff81030bdcd6e0] __die at ffffffff80065157
 #2 [ffff81030bdcd720] do_page_fault at ffffffff80066dd7
 #3 [ffff81030bdcd810] error_exit at ffffffff8005dde9
    [exception RIP: filter_preprw+5958]
    RIP: ffffffff8895e7c6  RSP: ffff81030bdcd8c0  RFLAGS: 00010206
    RAX: 0000000000000021  RBX: 0000000000000000  RCX: ffff810011017300
    RDX: ffff8101067b4c90  RSI: 000000000000000e  RDI: 3533313130323331
    RBP: ffff81030bdd1388   R8: ffff81061ff40b03   R9: 0000000000001000
    R10: 0000000000000000  R11: 00000000000200d2  R12: 000000000000007e
    R13: 000000000007e000  R14: 0000000000000100  R15: 0000000000000100
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #4 [ffff81030bdcd8f8] lnet_ni_send at ffffffff88539543
 #5 [ffff81030bdcd918] lnet_send at ffffffff8853ba03
 #6 [ffff81030bdcd9d8] truncate_inode_pages_range at ffffffff8002b84a
 #7 [ffff81030bdcdaf8] ptlrpc_send_reply at ffffffff8864a658
 #8 [ffff81030bdcdc18] lustre_msg_get_version at ffffffff8864eb05
 #9 [ffff81030bdcdc48] lustre_msg_check_version_v2 at ffffffff8864ebc8
#10 [ffff81030bdcdca8] ost_handle at ffffffff8890d08e
#11 [ffff81030bdcde38] ptlrpc_wait_event at ffffffff8865e8a8
#12 [ffff81030bdcdf48] kernel_thread at ffffffff8005dfb1
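
One detail that may help with diagnosis: the faulting RDI value (3533313130323331) consists entirely of printable ASCII digit bytes, which often indicates that a pointer field was overwritten by string data before being dereferenced. This interpretation is only a guess, not a confirmed diagnosis; a quick sketch of decoding the register value (x86-64 is little-endian, so the low byte comes first in memory):

```python
# Decode the faulting RDI register value from the oops above
# as 8 little-endian bytes of ASCII text.
val = 0x3533313130323331  # RDI at the time of the crash

data = val.to_bytes(8, "little")  # bytes in memory order
print(data.decode("ascii"))       # prints "13201135"
```

Every byte falls in the ASCII digit range 0x30-0x39, which is unlikely for a valid kernel pointer and suggests memory corruption rather than a simple logic error.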



When the crash happened, the machine seemed to be in good condition: low load, low memory usage, low iowait...
Do you have any suggestions for avoiding this kind of crash?
Thank you very much!
 
---------------------------------------------------------
Lu Wang
Computing Center, Institute of High Energy Physics
Chinese Academy of Sciences
Beijing, 100049    P. R. China
Tel: (86)010-88236010 ext 105
E-mail: Lu.Wang at ihep.ac.cn
---------------------------------------------------------
         

