[Lustre-discuss] ll_ost_creat_* goes berserk (100% cpu used - OST disabled)

Alexey Lyashkov alexey.lyashkov at clusterstor.com
Fri Aug 13 11:54:02 PDT 2010


 Adrian,

In general a soft lockup isn't an error; it is just a notice that some operation took too long (more than 10s, I think).
The attached soft-lockup report says the OST is busy creating objects after an MDS<>OST reconnect, so I think you have very busy disks or an overloaded node.
To avoid that soft lockup you can reduce the maximal number of precreated objects via procfs (if my memory is correct).
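
For example, on the MDS you could look at the per-OST creation tunables and lower the batch size. The parameter names below are from memory for 1.8, so please verify what actually exists under /proc/fs/lustre/osc/ on your MDS before using them:

    # on the MDS: show the current precreate batch settings for this OST
    # (names assumed from memory -- check /proc/fs/lustre/osc/ for the real files)
    lctl get_param osc.*OST0005*.create_count
    lctl get_param osc.*OST0005*.max_create_count

    # lower the batch so a single precreate request does less work on the OST
    lctl set_param osc.*OST0005*.create_count=32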

Why the clients reconnect is a different question, but it can be related to the soft lockup:
if you have slow disks, a client can be disconnected before its request is processed, and that in-flight request then blocks the reconnect from that client.
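
You can also check whether the timeouts give the OST enough room before clients give up and reconnect. Something like the following (paths are from memory for 1.8 with adaptive timeouts enabled, adjust to what your nodes actually expose):

    # how long a request may take before the client resends / reconnects
    cat /proc/sys/lustre/timeout     # static obd_timeout, in seconds
    cat /proc/sys/lustre/at_max      # adaptive-timeout ceiling (0 = AT disabled)

    # as a temporary test, raise the ceiling on servers and clients
    echo 900 > /proc/sys/lustre/at_max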

On Aug 13, 2010, at 21:29, Adrian Ulrich wrote:
> Oh, sorry for mixing this up.
> 
> 'lctl df' doesn't show much new stuff:
> 
> 00010000:00000400:0:1281721514.362102:0:13008:0:(ldlm_lib.c:541:target_handle_reconnect()) lustre1-OST0005:
> 9f880a5e-2331-07a8-8611-d6e3102f466e reconnecting
> 00010000:00000400:0:1281721514.362107:0:13008:0:(ldlm_lib.c:835:target_handle_connect()) lustre1-OST0005:
> refuse reconnection from 9f880a5e-2331-07a8-8611-d6e3102f466e at 10.201.48.12@o2ib to 0xffff8101c93b6000; still busy with 1 active RPCs
> 00010000:00020000:
> 00010000:00000400:7:1281721525.880767:0:11822:0:(ldlm_lib.c:541:target_handle_reconnect()) lustre1-OST0005:
> lustre1-mdtlov_UUID reconnecting
> 


> 
>> please post the soft-lockup report. One possibility is that the MDS asks that OST to create too many objects, or that the OST has too many reconnects.
> 
> LustreError: 12972:0:(ldlm_lib.c:1863:target_send_reply_msg()) Skipped 71 previous similar messages
> BUG: soft lockup - CPU#4 stuck for 59s! [ll_ost_creat_00:11833]
> CPU 4:
> Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) ost(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ko2iblnd(U)
> ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) raid456(U) xor(U) raid1(U) netconsole(U) lockd(U) sunrpc(U) rdma_ucm(U) qlgc_vnic(U) ib_sdp(U) rdma_cm
> (U) iw_cm(U) ib_addr(U) ib_ipoib(U) ipoib_helper(U) ib_cm(U) ib_sa(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) ib_uverbs(U) ib_umad(U) iw_nes(U) iw_cxgb3(U) 
> cxgb3(U) ib_ipath(U) ib_mthca(U) mptctl(U) dm_mirror(U) dm_multipath(U) scsi_dh(U) video(U) hwmon(U) backlight(U) sbs(U) i2c_ec(U) button(U) battery(U) 
> asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) mlx4_ib(U) ib_mad(U) ib_core(U) joydev(U) sr_mod(U) cdrom(U) sg(U) tpm_infineon(U) 
> tpm(U) tpm_bios(U) mlx4_core(U) i5000_edac(U) edac_mc(U) i2c_i801(U) pcspkr(U) e1000e(U) i2c_core(U) serio_raw(U) dm_raid45(U) dm_message(U) 
> dm_region_hash(U) dm_log(U) dm_mod(U) dm_mem_cache(U) usb_storage(U) ahci(U) ata_piix(U) libata(U) mptsas(U) scsi_transport_sas(U) mptfc(U) 
> scsi_transport_fc(U) mptspi(U) mptscsih(U) mptbase(U) scsi_transport_spi(U) shpchp(U) aacraid(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) 
> ohci_hcd(U) ehci_hcd(U)
> Pid: 11833, comm: ll_ost_creat_00 Tainted: G      2.6.18-128.7.1.el5_lustre.1.8.1.1 #1
> RIP: 0010:[<ffffffff88b49af4>]  [<ffffffff88b49af4>] :ldiskfs:ldiskfs_find_entry+0x1d4/0x5c0
> RSP: 0018:ffff8101e715d500  EFLAGS: 00000202
> RAX: 0000000000000000 RBX: 0000000000000008 RCX: 000000000c91a9f1
> RDX: ffff8101e8893800 RSI: ffff8101e715d4e8 RDI: ffff81010773e838
> RBP: 0000000000000002 R08: ffff81017bd9cff8 R09: ffff81017bd9c000
> R10: ffff810216dfb000 R11: 000000004c6578dc R12: ffff81017d41e6d0
> R13: ffffffff80063b4c R14: ffff8101e715d5b8 R15: ffffffff80014fae
> FS:  00002ab1d8b97220(0000) GS:ffff81021fc74bc0(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 0000000009c9c178 CR3: 0000000000201000 CR4: 00000000000006e0
> 
> Call Trace:
> [<ffffffff8001a1b8>] vsnprintf+0x559/0x59e
> [<ffffffff8005ba4d>] cache_alloc_refill+0x106/0x186
> [<ffffffff88b4bf63>] :ldiskfs:ldiskfs_lookup+0x53/0x290
> [<ffffffff800366e8>] __lookup_hash+0x10b/0x130
> [<ffffffff800e2c9b>] lookup_one_len+0x53/0x61
> [<ffffffff88bd71ed>] :obdfilter:filter_fid2dentry+0x42d/0x730
> [<ffffffff88bd3383>] :obdfilter:filter_statfs+0x273/0x350
> [<ffffffff8026f08b>] __down_trylock+0x44/0x4e
> [<ffffffff88bd2f00>] :obdfilter:filter_parent_lock+0x20/0x220
> [<ffffffff88bd7d43>] :obdfilter:filter_precreate+0x843/0x19e0
> [<ffffffff887e0923>] :lnet:lnet_ni_send+0x93/0xd0
> [<ffffffff8000d0bd>] dput+0x23/0x10a
> [<ffffffff88be1e19>] :obdfilter:filter_create+0x10b9/0x15e0
> [<ffffffff887e6ad2>] :lnet:LNetPut+0x702/0x800
> [<ffffffff888e9a13>] :ptlrpc:ptl_send_buf+0x3f3/0x5b0
> [<ffffffff888ef394>] :ptlrpc:lustre_msg_add_version+0x34/0x110
> [<ffffffff888ea198>] :ptlrpc:ptlrpc_send_reply+0x5c8/0x5e0
> [<ffffffff888f1f69>] :ptlrpc:lustre_pack_reply+0x29/0xb0
> [<ffffffff88ba161d>] :ost:ost_handle+0x131d/0x5a70
> [<ffffffff8001a1b8>] vsnprintf+0x559/0x59e
> [<ffffffff88797978>] :libcfs:libcfs_debug_vmsg2+0x6e8/0x990
> [<ffffffff80148e8c>] __next_cpu+0x19/0x28
> [<ffffffff80088f36>] find_busiest_group+0x20d/0x621
> [<ffffffff888f3f05>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
> [<ffffffff80089d8d>] enqueue_task+0x41/0x56
> [<ffffffff888f8c1d>] :ptlrpc:ptlrpc_check_req+0x1d/0x110
> [<ffffffff888fb357>] :ptlrpc:ptlrpc_server_handle_request+0xa97/0x1170
> [<ffffffff8003d382>] lock_timer_base+0x1b/0x3c
> [<ffffffff8008881d>] __wake_up_common+0x3e/0x68
> [<ffffffff888fee08>] :ptlrpc:ptlrpc_main+0x1218/0x13e0
> [<ffffffff8008a3f3>] default_wake_function+0x0/0xe
> [<ffffffff8005dfb1>] child_rip+0xa/0x11
> [<ffffffff888fdbf0>] :ptlrpc:ptlrpc_main+0x0/0x13e0
> [<ffffffff8005dfa7>] child_rip+0x0/0x11
> 
> 
> Btw: We are running 1.8.1.1 (with rhel kernel 2.6.18-128.7.1.el5_lustre.1.8.1.1)
> 
> Regards,
> Adrian



