[Lustre-discuss] mds server crashing

Mag Gam magawake at gmail.com
Fri Mar 13 18:47:22 PDT 2009


We are having a problem with an MDS server (the same box also hosts 1 OST).

When the server boots up, we notice an ll_mdt process running at 100%
CPU, and we end up waiting close to 10-15 minutes. We only have 8
clients. (I assume this is the normal recovery process.) However, if I
manually mount the MDT with recovery aborted, everything is fine:

mount -t lustre /dev/foo -o abort_recov /mnt/lustre

BUT the server crashes again after 18-24 hours.

I am trying to get to the bottom of this crash. I am not sure what is
causing the problem, and hopefully I am just doing something foolish.

There are 2 OSTs connecting to this MDS.

MDS Server Version:
Red Hat 5.1-1.2, running kernel 2.6.18-92.1.17.el5_lustre.1.6.7smp

cat /proc/fs/lustre/version
lustre: 1.6.7
kernel: patchless_client
build:  1.6.7-19691231170000-PRISTINE-.cache.build.BUILD.lustre-kernel-2.6.18.lustre.linux-2.6.18-92.1.17.el5_lustre.1.6.7smp

client# lfs check mds
lfs002-MDT0000-mdc-ffff81102ac40000 active.
lfs002-MDT0000-mdc-ffff810fd264bc00 active.

client# lfs check osts
lfs002-OST0000-osc-ffff81102ac40000 active.
lfs002-OST0001-osc-ffff81102ac40000 active.
lfs002-OST0000-osc-ffff810fd264bc00 active.
lfs002-OST0001-osc-ffff810fd264bc00 active.
lfs002-OST0002-osc-ffff810fd264bc00 active.
lfs002-OST0003-osc-ffff810fd264bc00 active.
lfs002-OST0004-osc-ffff810fd264bc00 active.
lfs002-OST0005-osc-ffff810fd264bc00 active.

mds#  lctl dl
  0 UP mgs MGS MGS 25
  1 UP mgc MGC141.128.90.153@tcp b6d875c0-6b30-5a2d-92d3-600ef3324c50 5
  2 UP mdt MDS MDS_uuid 3
  3 UP lov lfs002-mdtlov lfs002-mdtlov_UUID 4
  4 UP mds lfs002-MDT0000 lfs002-MDT0000_UUID 21
  5 UP osc lfs002-OST0000-osc lfs002-mdtlov_UUID 5
  6 UP osc lfs002-OST0001-osc lfs002-mdtlov_UUID 5
  7 UP ost OSS OSS_uuid 3
  8 UP obdfilter lfs002-OST0001 lfs002-OST0001_UUID 23

The clients are running:
Redhat 5.2
2.6.18-92.1.10.el5

cat /proc/fs/lustre/version
lustre: 1.6.6
kernel: patchless
build:  1.6.6-19691231190000-PRISTINE-.usr.src.linux-2.6.18-92.1.10.el5


Mar 12 10:11:02 protected_host_01 kernel: Pid: 10375, comm: ll_mdt_10
Tainted: G      2.6.18-92.1.17.el5_lustre.1.6.7smp #1
Mar 12 10:11:02 protected_host_01 kernel: RIP:
0010:[<ffffffff888ed8df>]  [<ffffffff888ed8df>]
:ldiskfs:do_split+0x3ef/0x560
Mar 12 10:11:02 protected_host_01 kernel: RSP: 0018:ffff8103d2a5f460
EFLAGS: 00000216
Mar 12 10:11:02 protected_host_01 kernel: RAX: 0000000000000000 RBX:
0000000000000080 RCX: 0000000000000000
Mar 12 10:11:02 protected_host_01 kernel: RDX: 0000000000000080 RSI:
ffff8103cd52177c RDI: ffff8103cd52176c
Mar 12 10:11:02 protected_host_01 kernel: RBP: ffffffff8000b071 R08:
ffff8103cd5216ec R09: 00000000010a0014
Mar 12 10:11:02 protected_host_01 kernel: R10: 00007a6700000008 R11:
00007a672e767363 R12: 000000000064dc69
Mar 12 10:11:02 protected_host_01 kernel: R13: ffffffff80019496 R14:
ffff81040ed0f4c0 R15: 0000000000000000
Mar 12 10:11:02 protected_host_01 kernel: FS:  00002b7545c3b220(0000)
GS:ffff81042fea79c0(0000) knlGS:0000000000000000
Mar 12 10:11:02 protected_host_01 kernel: CS:  0010 DS: 0000 ES: 0000
CR0: 000000008005003b
Mar 12 10:11:02 protected_host_01 kernel: CR2: 0000003d222c5cb0 CR3:
0000000000201000 CR4: 00000000000006e0
Mar 12 10:11:02 protected_host_01 kernel:
Mar 12 10:11:02 protected_host_01 kernel: Call Trace:
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff888ee3b5>]
:ldiskfs:ldiskfs_add_entry+0x4f5/0x980
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff88034f74>]
:jbd:journal_dirty_metadata+0x1b5/0x1e3
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff889a6840>]
:mds:mds_get_parent_child_locked+0x750/0x8e0
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff888eee56>]
:ldiskfs:ldiskfs_add_nondir+0x26/0x90
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff888ef776>]
:ldiskfs:ldiskfs_create+0xf6/0x140
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff8896f412>]
:fsfilt_ldiskfs:fsfilt_ldiskfs_start+0x562/0x630
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff8003a075>]
vfs_create+0xe6/0x158
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff889c7140>]
:mds:mds_open+0x14b0/0x317e
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff8002e15a>]
__wake_up+0x38/0x4f
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff8876c241>]
:ksocklnd:ksocknal_queue_tx_locked+0x4f1/0x550
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff8876d47f>]
:ksocklnd:ksocknal_launch_packet+0x2df/0x3d0
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff889a1f49>]
:mds:mds_reint_rec+0x1d9/0x2b0
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff889cad82>]
:mds:mds_open_unpack+0x312/0x430
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff88994d4a>]
:mds:mds_reint+0x35a/0x420
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff889934db>]
:mds:fixup_handle_for_resent_req+0x25b/0x2c0
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff88998dfc>]
:mds:mds_intent_policy+0x48c/0xc30
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff886ab526>]
:ptlrpc:ldlm_resource_putref+0x1b6/0x3a0
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff886a8d18>]
:ptlrpc:ldlm_lock_enqueue+0x188/0x990
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff886c36ff>]
:ptlrpc:ldlm_export_lock_get+0x6f/0xe0
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff8862c688>]
:obdclass:lustre_hash_add+0x208/0x2d0
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff886cc2a0>]
:ptlrpc:ldlm_server_blocking_ast+0x0/0x833
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff886ca3f9>]
:ptlrpc:ldlm_handle_enqueue+0xc09/0x1200
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff8899d615>]
:mds:mds_handle+0x4075/0x4d30
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff800d40d5>]
cache_flusharray+0x2f/0xa3
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff80143809>]
__next_cpu+0x19/0x28
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff80143809>]
__next_cpu+0x19/0x28
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff800898e3>]
find_busiest_group+0x20d/0x621
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff886e65a5>]
:ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff886eecfa>]
:ptlrpc:ptlrpc_server_request_get+0x6a/0x150
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff886f0b7d>]
:ptlrpc:ptlrpc_check_req+0x1d/0x110
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff886f3103>]
:ptlrpc:ptlrpc_server_handle_request+0xa93/0x1150
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff80062f4b>]
thread_return+0x0/0xdf
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff8006d8a2>]
do_gettimeofday+0x40/0x8f
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff885967c6>]
:libcfs:lcw_update_time+0x16/0x100
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff800891f6>]
__wake_up_common+0x3e/0x68
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff886f65f8>]
:ptlrpc:ptlrpc_main+0x1218/0x13e0
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff8008abb9>]
default_wake_function+0x0/0xe
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff800b4382>]
audit_syscall_exit+0x31b/0x336
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff8005dfb1>]
child_rip+0xa/0x11
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff886f53e0>]
:ptlrpc:ptlrpc_main+0x0/0x13e0
Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff8005dfa7>]
child_rip+0x0/0x11
Mar 12 10:11:02 protected_host_01 kernel:


Mar 12 10:17:06 protected_host_01 kernel: BUG: soft lockup - CPU#6
stuck for 10s! [ll_mdt_10:10375]
Mar 12 10:17:06 protected_host_01 kernel: CPU 6:
Mar 12 10:17:06 protected_host_01 kernel: Modules linked in:
obdfilter(U) ost(U) mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U)
crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U)
ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) mptctl(U)
ipmi_devintf(U) ipmi_si(U) ipmi_msghandler(U) nfsd(U) exportfs(U)
auth_rpcgss(U) nfs(U) lockd(U) fscache(U) nfs_acl(U) autofs4(U)
sunrpc(U) bonding(U) dm_round_robin(U) dm_multipath(U) video(U) sbs(U)
backlight(U) i2c_ec(U) i2c_core(U) button(U) battery(U) asus_acpi(U)
acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) sg(U)
pata_acpi(U) lpfc(U) ide_cd(U) bnx2(U) e1000e(U) cdrom(U) shpchp(U)
scsi_transport_fc(U) hpwdt(U) i5000_edac(U) edac_mc(U) pcspkr(U)
serio_raw(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_mod(U)
usb_storage(U) ata_piix(U) sata_nv(U) libata(U) mptsas(U)
scsi_transport_sas(U) mptspi(U) mptscsih(U) scsi_transport_spi(U)
mptbase(U) cciss(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) ehci_hcd(U)
ohci_hcd(U) uhci_hcd(U)
Mar 12 10:17:06 protected_host_01 kernel: Pid: 10375, comm: ll_mdt_10
Tainted: G      2.6.18-92.1.17.el5_lustre.1.6.7smp #1
Mar 12 10:17:06 protected_host_01 kernel: RIP:
0010:[<ffffffff888ed8f0>]  [<ffffffff888ed8f0>]
:ldiskfs:do_split+0x400/0x560
Mar 12 10:17:06 protected_host_01 kernel: RSP: 0018:ffff8103d2a5f460
EFLAGS: 00000246
Mar 12 10:17:06 protected_host_01 kernel: RAX: 0000000000000000 RBX:
0000000000000080 RCX: 0000000000000000
Mar 12 10:17:06 protected_host_01 kernel: RDX: 0000000000000080 RSI:
ffff8103cd52177c RDI: ffff8103cd52176c
Mar 12 10:17:06 protected_host_01 kernel: RBP: ffffffff8000b071 R08:
ffff8103cd5216ec R09: 00000000010a0014
Mar 12 10:17:06 protected_host_01 kernel: R10: 00007a6700000008 R11:
00007a672e767363 R12: 000000000064dc69
Mar 12 10:17:06 protected_host_01 kernel: R13: ffffffff80019496 R14:
ffff81040ed0f4c0 R15: 0000000000000000
Mar 12 10:17:06 protected_host_01 kernel: FS:  00002b7545c3b220(0000)
GS:ffff81042fea79c0(0000) knlGS:0000000000000000
Mar 12 10:17:06 protected_host_01 kernel: CS:  0010 DS: 0000 ES: 0000
CR0: 000000008005003b
Mar 12 10:17:06 protected_host_01 kernel: CR2: 0000003d222c5cb0 CR3:
0000000000201000 CR4: 00000000000006e0
Mar 12 10:17:06 protected_host_01 kernel:
Mar 12 10:17:06 protected_host_01 kernel: Call Trace:
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff888ee3b5>]
:ldiskfs:ldiskfs_add_entry+0x4f5/0x980
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff88034f74>]
:jbd:journal_dirty_metadata+0x1b5/0x1e3
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff889a6840>]
:mds:mds_get_parent_child_locked+0x750/0x8e0
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff888eee56>]
:ldiskfs:ldiskfs_add_nondir+0x26/0x90
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff888ef776>]
:ldiskfs:ldiskfs_create+0xf6/0x140
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff8896f412>]
:fsfilt_ldiskfs:fsfilt_ldiskfs_start+0x562/0x630
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff8003a075>]
vfs_create+0xe6/0x158
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff889c7140>]
:mds:mds_open+0x14b0/0x317e
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff8002e15a>]
__wake_up+0x38/0x4f
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff8876c241>]
:ksocklnd:ksocknal_queue_tx_locked+0x4f1/0x550
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff8876d47f>]
:ksocklnd:ksocknal_launch_packet+0x2df/0x3d0
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff889a1f49>]
:mds:mds_reint_rec+0x1d9/0x2b0
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff889cad82>]
:mds:mds_open_unpack+0x312/0x430
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff88994d4a>]
:mds:mds_reint+0x35a/0x420
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff889934db>]
:mds:fixup_handle_for_resent_req+0x25b/0x2c0
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff88998dfc>]
:mds:mds_intent_policy+0x48c/0xc30
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff886ab526>]
:ptlrpc:ldlm_resource_putref+0x1b6/0x3a0
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff886a8d18>]
:ptlrpc:ldlm_lock_enqueue+0x188/0x990
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff886c36ff>]
:ptlrpc:ldlm_export_lock_get+0x6f/0xe0
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff8862c688>]
:obdclass:lustre_hash_add+0x208/0x2d0
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff886cc2a0>]
:ptlrpc:ldlm_server_blocking_ast+0x0/0x833
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff886ca3f9>]
:ptlrpc:ldlm_handle_enqueue+0xc09/0x1200
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff8899d615>]
:mds:mds_handle+0x4075/0x4d30
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff800d40d5>]
cache_flusharray+0x2f/0xa3
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff80143809>]
__next_cpu+0x19/0x28
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff80143809>]
__next_cpu+0x19/0x28
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff800898e3>]
find_busiest_group+0x20d/0x621
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff886e65a5>]
:ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff886eecfa>]
:ptlrpc:ptlrpc_server_request_get+0x6a/0x150
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff886f0b7d>]
:ptlrpc:ptlrpc_check_req+0x1d/0x110
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff886f3103>]
:ptlrpc:ptlrpc_server_handle_request+0xa93/0x1150
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff80062f4b>]
thread_return+0x0/0xdf
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff8006d8a2>]
do_gettimeofday+0x40/0x8f
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff885967c6>]
:libcfs:lcw_update_time+0x16/0x100
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff800891f6>]
__wake_up_common+0x3e/0x68
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff886f65f8>]
:ptlrpc:ptlrpc_main+0x1218/0x13e0
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff8008abb9>]
default_wake_function+0x0/0xe
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff800b4382>]
audit_syscall_exit+0x31b/0x336
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff8005dfb1>]
child_rip+0xa/0x11
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff886f53e0>]
:ptlrpc:ptlrpc_main+0x0/0x13e0
Mar 12 10:17:06 protected_host_01 kernel:  [<ffffffff8005dfa7>]
child_rip+0x0/0x11
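
Since the mail wrapped every syslog line, the two traces are hard to compare by eye. The following is only a minimal Python sketch for reducing a pasted trace to its module and function names; it is not something running on these servers. The name extract_frames and the "kernel" label for symbols that carry no module prefix are assumptions made for illustration.

```python
import re

# Matches a symbol such as ":ldiskfs:do_split+0x3ef/0x560" (module-prefixed)
# or "vfs_create+0xe6/0x158" (no module prefix, i.e. a core kernel symbol).
FRAME_RE = re.compile(
    r"(?::(?P<module>\w+):)?(?P<func>[A-Za-z_]\w*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def extract_frames(log_text):
    """Return (module, function) pairs in the order they appear in log_text.

    Symbols without a ":module:" prefix are labelled "kernel"; that label is
    an illustrative choice, not something the kernel itself prints.
    """
    frames = []
    for match in FRAME_RE.finditer(log_text):
        module = match.group("module") or "kernel"
        frames.append((module, match.group("func")))
    return frames


if __name__ == "__main__":
    # Two frames copied from the trace above, wrapped as they are in the mail.
    sample = (
        "Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff888ee3b5>]\n"
        ":ldiskfs:ldiskfs_add_entry+0x4f5/0x980\n"
        "Mar 12 10:11:02 protected_host_01 kernel:  [<ffffffff8003a075>]\n"
        "vfs_create+0xe6/0x158\n"
    )
    for module, func in extract_frames(sample):
        print(module, func)
```

Running the same function over both traces and comparing the resulting lists makes it easy to check whether the thread is stuck in the same call path each time.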


Any thoughts?

TIA


