[Lustre-discuss] Help

Pankaj Dorlikar pankajd at cdac.in
Mon Nov 29 12:10:53 PST 2010


Hi,
We have Lustre client 1.6.6 installed on a node running RHEL 5.2. The node
crashed and generated a vmcore file. Part of the log from the vmcore follows.

The error is:

ib_cm: req timeout_ms 34816 > 32768, decreasing
LustreError: 13790:0:(ldlm_request.c:996:ldlm_cli_cancel_req()) Got rc 
-108 from cancel RPC: canceling anyway
LustreError: 13790:0:(ldlm_request.c:996:ldlm_cli_cancel_req()) Skipped 18 
previous similar messages
LustreError: 13790:0:(ldlm_request.c:1605:ldlm_cli_cancel_list()) 
ldlm_cli_cancel_list: -108
LustreError: 13790:0:(ldlm_request.c:1605:ldlm_cli_cancel_list()) Skipped 
18 previous similar messages
LustreError: 13790:0:(ldlm_request.c:996:ldlm_cli_cancel_req()) Got rc 
-108 from cancel RPC: canceling anyway
LustreError: 13790:0:(ldlm_request.c:1605:ldlm_cli_cancel_list()) 
ldlm_cli_cancel_list: -108
Lustre: client ffff81200f886400 umount complete
Lustre: Request x21503806 sent from MGC172.31.65.49@o2ib to NID
172.31.65.49@o2ib 100s ago has timed out (limit 100s).
LustreError: 166-1: MGC172.31.65.49@o2ib: Connection to service MGS via
nid 172.31.65.49@o2ib was lost; in progress operations using this service
will fail.
Lustre: Request x21504004 sent from MGC172.31.65.49@o2ib to NID
172.31.65.49@o2ib 100s ago has timed out (limit 100s).
Lustre: Request x21504206 sent from MGC172.31.65.49@o2ib to NID
172.31.65.49@o2ib 100s ago has timed out (limit 100s).
Lustre: Request x21504405 sent from MGC172.31.65.49@o2ib to NID
172.31.65.49@o2ib 100s ago has timed out (limit 100s).
Lustre: Request x21504583 sent from MGC172.31.65.49@o2ib to NID
172.31.65.49@o2ib 100s ago has timed out (limit 100s).
Lustre: Request x21504796 sent from MGC172.31.65.49@o2ib to NID
172.31.65.49@o2ib 100s ago has timed out (limit 100s).
Lustre: Request x21505000 sent from MGC172.31.65.49@o2ib to NID
172.31.65.49@o2ib 100s ago has timed out (limit 100s).
Lustre: 19021:0:(import.c:736:ptlrpc_connect_interpret())
MGS@MGC172.31.65.49@o2ib_0 changed server handle from 0x818c15f164eefdf6
to 0x818c15f1f24cadd8
but is still in recovery
Lustre: MGC172.31.65.49@o2ib: Reactivating import
Lustre: MGC172.31.65.49@o2ib: Connection restored to service MGS using nid
172.31.65.49@o2ib.
general protection fault: 0000 [1] SMP
last sysfs file: 
/devices/pci0000:00/0000:00:01.0/0000:03:00.0/0000:04:01.0/0000:07:00.0/0000:08:00.0/irq
CPU 3
Modules linked in: iptable_raw iptable_filter iptable_mangle iptable_nat 
ip_nat ip_conntrack nfnetlink ip_tables nfsd exportfs auth_rpcgss 
xt_tcpudp nfs lockd fscache nfs_acl x_tables ipmi_devintf ipmi_si
ipmi_msghandler mgc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U)
ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) hpilo(U)
sunrpc rdma_ucm(U) rds(U) ib_ucm(U) ib_sdp(U)
rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ipoib_helper(U) ib_cm(U) 
ib_sa(U) ipv6 xfrm_nalgo crypto_api ib_uverbs(U) ib_umad(U) mlx4_ib(U) 
ib_mthca(U) ib_mad(U) ib_core(U) dm_mirror dm_multipath dm_mod video sbs 
backlight i2c_ec i2c_core button battery asus_acpi acpi_memhotplug ac 
parport_pc lp parport joydev mlx4_core(U) bnx2(U) ide_cd shpchp e1000e 
serio_raw cdrom pcspkr ata_piix libata cciss(U) sd_mod scsi_mod ext3 jbd 
uhci_hcd ohci_hcd ehci_hcd
Pid: 14147, comm: ldlm_cb_02 Tainted: G      2.6.18-92.el5 #1
RIP: 0010:[<ffffffff88625121>]  [<ffffffff88625121>] 
:ptlrpc:lock_res_and_lock+0x41/0xe0
RSP: 0018:ffff811ff7593ce0  EFLAGS: 00010206
RAX: ffff811d4538f800 RBX: 5a5a5a5a5a5a5a5a RCX: 0000000000000001
RDX: ffffc2001053f790 RSI: 0000000000000000 RDI: ffff811d4538f800
RBP: ffff811db60e80c0 R08: 0000000000000000 R09: ffff812003815400
R10: 0000000000000000 R11: 0000000000000001 R12: ffff811d4538f800
R13: ffff810170bd3a00 R14: 000000004ce47162 R15: ffff811ff8a64c50
FS:  00002b2883022220(0000) GS:ffff81202ff1c640(0000) 
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000000c655f28 CR3: 000000201a1fb000 CR4: 00000000000006e0
Process ldlm_cb_02 (pid: 14147, threadinfo ffff811ff7592000, task 
ffff81202fa85100)
Stack:  ffff811d4538f800 ffff811ff8a64c50 ffff811ff84ecc50 
ffffffff88646ba8
  ffff810009040a80 ffff811ff8ea8820 ffff811ff7593d30 ffff811ff84ecbb8
  0000000000000010 ffffffff8866e0da ffff811ff7e04dc0 ffffffff8866530e
Call Trace:
  [<ffffffff88646ba8>] :ptlrpc:ldlm_callback_handler+0x10a8/0x1ae0
  [<ffffffff8866e0da>] :ptlrpc:ptlrpc_check_req+0x1a/0x110
  [<ffffffff8866530e>] :ptlrpc:lustre_msg_get_handle+0x2e/0xe0
  [<ffffffff886702c2>] :ptlrpc:ptlrpc_server_handle_request+0x992/0x1040
  [<ffffffff80062efb>] thread_return+0x0/0xdf
  [<ffffffff8006d7bf>] do_gettimeofday+0x50/0x92
  [<ffffffff88520466>] :libcfs:lcw_update_time+0x16/0x100
  [<ffffffff80089241>] __wake_up_common+0x3e/0x68
  [<ffffffff886732dc>] :ptlrpc:ptlrpc_main+0xe0c/0xf90
  [<ffffffff8008ac03>] default_wake_function+0x0/0xe
  [<ffffffff800b4326>] audit_syscall_exit+0x31b/0x336
  [<ffffffff8005dfb1>] child_rip+0xa/0x11
  [<ffffffff886724d0>] :ptlrpc:ptlrpc_main+0x0/0xf90
  [<ffffffff8005dfa7>] child_rip+0x0/0x11


Code: f7 43 08 fc ff ff ff 74 26 b9 7f 01 00 00 48 c7 c2 40 ad 68
RIP  [<ffffffff88625121>] :ptlrpc:lock_res_and_lock+0x41/0xe0
  RSP <ffff811ff7593ce0>
crash> sys
       KERNEL: ../vmlinux
     DUMPFILE: ./vmcore
         CPUS: 16
         DATE: Thu Nov 18 05:50:50 2010
       UPTIME: 21 days, 12:07:43
LOAD AVERAGE: 0.04, 0.03, 0.00
        TASKS: 379


Warm Regards,
Pankaj
