[Lustre-discuss] Help
Pankaj Dorlikar
pankajd at cdac.in
Mon Nov 29 12:10:53 PST 2010
Hi,
We have Lustre client 1.6.6 installed on a node running RHEL 5.2. The node
crashed and generated a vmcore file. Some of the logs from the vmcore follow.
The error is:
ib_cm: req timeout_ms 34816 > 32768, decreasing
LustreError: 13790:0:(ldlm_request.c:996:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
LustreError: 13790:0:(ldlm_request.c:996:ldlm_cli_cancel_req()) Skipped 18 previous similar messages
LustreError: 13790:0:(ldlm_request.c:1605:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
LustreError: 13790:0:(ldlm_request.c:1605:ldlm_cli_cancel_list()) Skipped 18 previous similar messages
LustreError: 13790:0:(ldlm_request.c:996:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
LustreError: 13790:0:(ldlm_request.c:1605:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Lustre: client ffff81200f886400 umount complete
Lustre: Request x21503806 sent from MGC172.31.65.49@o2ib to NID 172.31.65.49@o2ib 100s ago has timed out (limit 100s).
LustreError: 166-1: MGC172.31.65.49@o2ib: Connection to service MGS via nid 172.31.65.49@o2ib was lost; in progress operations using this service will fail.
Lustre: Request x21504004 sent from MGC172.31.65.49@o2ib to NID 172.31.65.49@o2ib 100s ago has timed out (limit 100s).
Lustre: Request x21504206 sent from MGC172.31.65.49@o2ib to NID 172.31.65.49@o2ib 100s ago has timed out (limit 100s).
Lustre: Request x21504405 sent from MGC172.31.65.49@o2ib to NID 172.31.65.49@o2ib 100s ago has timed out (limit 100s).
Lustre: Request x21504583 sent from MGC172.31.65.49@o2ib to NID 172.31.65.49@o2ib 100s ago has timed out (limit 100s).
Lustre: Request x21504796 sent from MGC172.31.65.49@o2ib to NID 172.31.65.49@o2ib 100s ago has timed out (limit 100s).
Lustre: Request x21505000 sent from MGC172.31.65.49@o2ib to NID 172.31.65.49@o2ib 100s ago has timed out (limit 100s).
Lustre: 19021:0:(import.c:736:ptlrpc_connect_interpret()) MGS@MGC172.31.65.49@o2ib_0 changed server handle from 0x818c15f164eefdf6 to 0x818c15f1f24cadd8 but is still in recovery
Lustre: MGC172.31.65.49@o2ib: Reactivating import
Lustre: MGC172.31.65.49@o2ib: Connection restored to service MGS using nid 172.31.65.49@o2ib.
general protection fault: 0000 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:01.0/0000:03:00.0/0000:04:01.0/0000:07:00.0/0000:08:00.0/irq
CPU 3
Modules linked in: iptable_raw iptable_filter iptable_mangle iptable_nat
ip_nat ip_conntrack nfnetlink ip_tables nfsd exportfs auth_rpcgss
xt_tcpudp nfs lockd fscache nfs_acl x_tables ipmi_devintf ipmi_si
ipmi_msghandler mgc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U)
ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) hpilo(U)
sunrpc rdma_ucm(U) rds(U) ib_ucm(U) ib_sdp(U)
rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ipoib_helper(U) ib_cm(U)
ib_sa(U) ipv6 xfrm_nalgo crypto_api ib_uverbs(U) ib_umad(U) mlx4_ib(U)
ib_mthca(U) ib_mad(U) ib_core(U) dm_mirror dm_multipath dm_mod video sbs
backlight i2c_ec i2c_core button battery asus_acpi acpi_memhotplug ac
parport_pc lp parport joydev mlx4_core(U) bnx2(U) ide_cd shpchp e1000e
serio_raw cdrom pcspkr ata_piix libata cciss(U) sd_mod scsi_mod ext3 jbd
uhci_hcd ohci_hcd ehci_hcd
Pid: 14147, comm: ldlm_cb_02 Tainted: G 2.6.18-92.el5 #1
RIP: 0010:[<ffffffff88625121>] [<ffffffff88625121>] :ptlrpc:lock_res_and_lock+0x41/0xe0
RSP: 0018:ffff811ff7593ce0 EFLAGS: 00010206
RAX: ffff811d4538f800 RBX: 5a5a5a5a5a5a5a5a RCX: 0000000000000001
RDX: ffffc2001053f790 RSI: 0000000000000000 RDI: ffff811d4538f800
RBP: ffff811db60e80c0 R08: 0000000000000000 R09: ffff812003815400
R10: 0000000000000000 R11: 0000000000000001 R12: ffff811d4538f800
R13: ffff810170bd3a00 R14: 000000004ce47162 R15: ffff811ff8a64c50
FS: 00002b2883022220(0000) GS:ffff81202ff1c640(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000000c655f28 CR3: 000000201a1fb000 CR4: 00000000000006e0
Process ldlm_cb_02 (pid: 14147, threadinfo ffff811ff7592000, task ffff81202fa85100)
Stack: ffff811d4538f800 ffff811ff8a64c50 ffff811ff84ecc50 ffffffff88646ba8
ffff810009040a80 ffff811ff8ea8820 ffff811ff7593d30 ffff811ff84ecbb8
0000000000000010 ffffffff8866e0da ffff811ff7e04dc0 ffffffff8866530e
Call Trace:
[<ffffffff88646ba8>] :ptlrpc:ldlm_callback_handler+0x10a8/0x1ae0
[<ffffffff8866e0da>] :ptlrpc:ptlrpc_check_req+0x1a/0x110
[<ffffffff8866530e>] :ptlrpc:lustre_msg_get_handle+0x2e/0xe0
[<ffffffff886702c2>] :ptlrpc:ptlrpc_server_handle_request+0x992/0x1040
[<ffffffff80062efb>] thread_return+0x0/0xdf
[<ffffffff8006d7bf>] do_gettimeofday+0x50/0x92
[<ffffffff88520466>] :libcfs:lcw_update_time+0x16/0x100
[<ffffffff80089241>] __wake_up_common+0x3e/0x68
[<ffffffff886732dc>] :ptlrpc:ptlrpc_main+0xe0c/0xf90
[<ffffffff8008ac03>] default_wake_function+0x0/0xe
[<ffffffff800b4326>] audit_syscall_exit+0x31b/0x336
[<ffffffff8005dfb1>] child_rip+0xa/0x11
[<ffffffff886724d0>] :ptlrpc:ptlrpc_main+0x0/0xf90
[<ffffffff8005dfa7>] child_rip+0x0/0x11
Code: f7 43 08 fc ff ff ff 74 26 b9 7f 01 00 00 48 c7 c2 40 ad 68
RIP [<ffffffff88625121>] :ptlrpc:lock_res_and_lock+0x41/0xe0
RSP <ffff811ff7593ce0>
crash> sys
KERNEL: ../vmlinux
DUMPFILE: ./vmcore
CPUS: 16
DATE: Thu Nov 18 05:50:50 2010
UPTIME: 21 days, 12:07:43
LOAD AVERAGE: 0.04, 0.03, 0.00
TASKS: 379
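
As a rough aid for reading the register dump above, the sketch below scans oops text for general-purpose registers whose value is one repeated fill byte. Such a value often suggests a pointer loaded from freed or uninitialized memory. The function name and the table of fill bytes are illustrative assumptions: 0x5a and 0x6b are the Linux slab-debug poison values, and Lustre is also understood to fill memory with 0x5a when freeing it, but this should be checked against the kernel and Lustre sources actually in use.

```python
import re

# Assumed fill bytes (verify against your kernel/Lustre sources):
#   0x5a - Linux slab POISON_INUSE; also believed to be Lustre's fill on free
#   0x6b - Linux slab POISON_FREE
POISON_BYTES = {
    0x5A: "0x5a fill (uninitialized slab object or memory poisoned on free)",
    0x6B: "0x6b fill (freed slab object)",
}

# Matches x86_64 register dumps such as "RBX: 5a5a5a5a5a5a5a5a" or
# "R08: 0000000000000000". Segment-prefixed values (RIP, RSP) and control
# registers (CR0, CR2, ...) are deliberately not matched.
REG_RE = re.compile(r"\b(R[A-Z0-9]{1,2}):\s*([0-9a-fA-F]{16})\b")


def find_poisoned_registers(oops_text):
    """Return {register: description} for registers holding a poison pattern."""
    hits = {}
    for name, value in REG_RE.findall(oops_text):
        raw = bytes.fromhex(value)
        # A poison pattern is the same known fill byte repeated 8 times.
        if len(set(raw)) == 1 and raw[0] in POISON_BYTES:
            hits[name] = POISON_BYTES[raw[0]]
    return hits


if __name__ == "__main__":
    sample = (
        "RAX: ffff811d4538f800 RBX: 5a5a5a5a5a5a5a5a RCX: 0000000000000001\n"
        "RDX: ffffc2001053f790 RSI: 0000000000000000 RDI: ffff811d4538f800\n"
    )
    for reg, why in find_poisoned_registers(sample).items():
        print(reg, "->", why)
```

Run against the register lines quoted in this message, the sketch flags only RBX. That would be consistent with lock_res_and_lock dereferencing a pointer read from memory that had already been freed, though the vmcore itself is needed to confirm it.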
Warm Regards,
Pankaj