[Lustre-discuss] MDS crashing after upgrade to 1.6.7

Thomas Wakefield twake at cola.iges.org
Fri Mar 20 07:15:50 PDT 2009


My mds has crashed a couple times after about 48 hours of running on  
1.6.7,  I can't decide if it is a bug or a disk issue.  Prior to the  
upgrade, my uptime was 250+ days, so i think the hardware is stable.

Any help would be very appreciated.

Server is a dell 2950, dual OS drives, dual raid 1 drives for the MDS,  
running LVM for snapshots.


Log from right before the crash:

Mar 20 09:51:59 disk0 kernel: device-mapper: snapshots: Invalidating  
snapshot: Unable to allocate exception.
Mar 20 09:51:59 disk0 kernel: Buffer I/O error on device dm-1, logical  
block 22676124
Mar 20 09:51:59 disk0 kernel: lost page write due to I/O error on dm-1
Mar 20 09:51:59 disk0 kernel: Buffer I/O error on device dm-1, logical  
block 22676125
Mar 20 09:51:59 disk0 kernel: lost page write due to I/O error on dm-1
Mar 20 09:51:59 disk0 kernel: Buffer I/O error on device dm-1, logical  
block 22676127
Mar 20 09:51:59 disk0 kernel: lost page write due to I/O error on dm-1
Mar 20 09:51:59 disk0 kernel: Buffer I/O error on device dm-1, logical  
block 22676374
Mar 20 09:51:59 disk0 kernel: lost page write due to I/O error on dm-1
Mar 20 09:51:59 disk0 kernel: Buffer I/O error on device dm-1, logical  
block 22676375
Mar 20 09:51:59 disk0 kernel: lost page write due to I/O error on dm-1
Mar 20 09:51:59 disk0 kernel: Buffer I/O error on device dm-1, logical  
block 22676376
Mar 20 09:51:59 disk0 kernel: lost page write due to I/O error on dm-1
Mar 20 09:51:59 disk0 kernel: Buffer I/O error on device dm-1, logical  
block 22676377
Mar 20 09:51:59 disk0 kernel: Buffer I/O error on device dm-1, logical  
block 11042873
Mar 20 09:51:59 disk0 kernel: lost page write due to I/O error on dm-1
Mar 20 09:51:59 disk0 kernel: Buffer I/O error on device dm-1, logical  
block 11042919
Mar 20 09:51:59 disk0 kernel: lost page write due to I/O error on dm-1
Mar 20 09:51:59 disk0 kernel: Buffer I/O error on device dm-1, logical  
block 11042920
Mar 20 09:51:59 disk0 kernel: lost page write due to I/O error on dm-1
Mar 20 09:51:59 disk0 kernel: lost page write due to I/O error on dm-1
Mar 20 09:51:59 disk0 kernel: LDISKFS-fs error (device dm-1):  
ldiskfs_get_inode_loc: unable to read inode block - inode=21347250,  
block=21333880
Mar 20 09:58:24 disk0 syslogd 1.4.1: restart.



Then after a reboot here is a DMESG bug that concerned me:

LustreError: 10661:0:(ldlm_lib.c:1643:target_send_reply_msg()) Skipped  
390 previous similar messages
LustreError: 10689:0:(handler.c:1601:mds_handle()) Skipped 390  
previous similar messages
BUG: soft lockup - CPU#3 stuck for 10s! [ll_mdt_19:10680]
CPU 3:
Modules linked in: mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U)  
crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U)  
ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U)  
ipmi_devintf(U) mptctl(U) mptbase(U) ipmi_si(U) ipmi_msghandler(U)  
dell_rbu(U) autofs4(U) hidp(U) rfcomm(U) l2cap(U) bluetooth(U)  
sunrpc(U) rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U)  
ib_ipoib(U) ipoib_helper(U) ib_cm(U) ib_sa(U) ipv6(U) xfrm_nalgo(U)  
crypto_api(U) ib_uverbs(U) ib_umad(U) iw_cxgb3(U) cxgb3(U) ib_ipath(U)  
mlx4_ib(U) mlx4_core(U) dm_multipath(U) video(U) sbs(U) backlight(U)  
i2c_ec(U) i2c_core(U) button(U) battery(U) asus_acpi(U)  
acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) pata_acpi(U)  
joydev(U) ata_piix(U) libata(U) sr_mod(U) ib_mthca(U) ib_mad(U)  
shpchp(U) ib_core(U) sg(U) pcspkr(U) ide_cd(U) bnx2(U) cdrom(U)  
serio_raw(U) i5000_edac(U) edac_mc(U) dm_snapshot(U) dm_zero(U)  
dm_mirror(U) dm_mod(U) usb_storage(U) megaraid_sas(U) sd_mod(U)  
scsi_mod(U) ext3(U) jbd(U) ehci_hcd(U) ohci_hcd(U) uhci_hcd(U)
Pid: 10680, comm: ll_mdt_19 Tainted: G      2.6.18-92.1.17.el5_lustre. 
1.6.7smp #1
RIP: 0010:[<ffffffff8870e3e0>]   
[<ffffffff8870e3e0>] :obdclass:lustre_hash_for_each_empty+0x1f0/0x290
RSP: 0018:ffff81040dc339e0  EFLAGS: 00000206
RAX: ffff81040ab94a00 RBX: 0000000000001f0e RCX: 0000000000007b38
RDX: 000000000000020a RSI: ffffffff802f0d80 RDI: ffffc20010d5e0ec
RBP: ffffffff88788912 R08: ffff81000101f3e0 R09: 0000000000000000
R10: ffff81040ab920c0 R11: 0000000000000150 R12: ffff81040ab94a00
R13: 0000000000000286 R14: 0000000000000286 R15: ffff81040dc33980
FS:  00002aca2e21d220(0000) GS:ffff81010eb9a4c0(0000) knlGS: 
0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000efdee288 CR3: 0000000414759000 CR4: 00000000000006e0

Call Trace:
  [<ffffffff88714bf8>] :obdclass:class_disconnect+0x378/0x400
  [<ffffffff8878a5b0>] :ptlrpc:ldlm_cancel_locks_for_export_cb+0x0/0xc0
  [<ffffffff88aa605b>] :mds:mds_disconnect+0x11b/0x310
  [<ffffffff88674598>] :libcfs:libcfs_ip_addr2str+0x38/0x40
  [<ffffffff88712f44>] :obdclass:class_disconnect_export_list 
+0x5e4/0x780
  [<ffffffff88714f67>] :obdclass:class_disconnect_stale_exports 
+0x137/0x1a0
  [<ffffffff8879262c>] :ptlrpc:target_abort_recovery+0x19c/0x390
  [<ffffffff887c8e15>] :ptlrpc:lustre_msg_get_opc+0x35/0xf0
  [<ffffffff88aa0d97>] :mds:mds_handle+0x7f7/0x4d30
  [<ffffffff80143809>] __next_cpu+0x19/0x28
  [<ffffffff80143809>] __next_cpu+0x19/0x28
  [<ffffffff800898e3>] find_busiest_group+0x20d/0x621
  [<ffffffff887c85a5>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
  [<ffffffff887d0cfa>] :ptlrpc:ptlrpc_server_request_get+0x6a/0x150
  [<ffffffff887d2b7d>] :ptlrpc:ptlrpc_check_req+0x1d/0x110
  [<ffffffff887d5103>] :ptlrpc:ptlrpc_server_handle_request+0xa93/0x1150
  [<ffffffff80062f4b>] thread_return+0x0/0xdf
  [<ffffffff8006d8a2>] do_gettimeofday+0x40/0x8f
  [<ffffffff886787c6>] :libcfs:lcw_update_time+0x16/0x100
  [<ffffffff8003d514>] lock_timer_base+0x1b/0x3c
  [<ffffffff8001c5b0>] __mod_timer+0xb0/0xbe
  [<ffffffff887d85f8>] :ptlrpc:ptlrpc_main+0x1218/0x13e0
  [<ffffffff8008abb9>] default_wake_function+0x0/0xe
  [<ffffffff800b4382>] audit_syscall_exit+0x31b/0x336
  [<ffffffff8005dfb1>] child_rip+0xa/0x11
  [<ffffffff887d73e0>] :ptlrpc:ptlrpc_main+0x0/0x13e0
  [<ffffffff8005dfa7>] child_rip+0x0/0x11

Lustre: 10663:0:(ldlm_lib.c:541:target_handle_reconnect()) data- 
MDT0000: 3b041098-4f33-6c6a-6e5c-3af7b34062e8 reconnecting
LustreError: 10663:0:(handler.c:1601:mds_handle()) operation 101 on  
unconnected MDS from 12345-192.168.64.14 at o2ib
LustreError: 10681:0:(ldlm_lib.c:1643:target_send_reply_msg()) @@@  
processing error (-107)  req at ffff81040962ec00 x52078437/t0 o101-><?>@<? 
 >:0/0 lens 232/0 e 0 to 0 dl 1237557961 ref 1 fl Interpret:/0/0 rc  
-107/0
LustreError: 10681:0:(ldlm_lib.c:1643:target_send_reply_msg()) Skipped  
777 previous similar messages
LustreError: 10663:0:(handler.c:1601:mds_handle()) Skipped 779  
previous similar messages
Lustre: 10676:0:(ldlm_lib.c:541:target_handle_reconnect()) data- 
MDT0000: 2a1f051a-1043-af7c-75c5-65af1bced83b reconnecting
Lustre: 10676:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 1  
previous similar message
Lustre: 10663:0:(ldlm_lib.c:541:target_handle_reconnect()) data- 
MDT0000: 99fbab1c-b702-fab2-a584-6c31ac18a5f4 reconnecting
Lustre: 10663:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 2  
previous similar messages
BUG: soft lockup - CPU#3 stuck for 10s! [ll_mdt_19:10680]
CPU 3:
Modules linked in: mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U)  
crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U)  
ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U)  
ipmi_devintf(U) mptctl(U) mptbase(U) ipmi_si(U) ipmi_msghandler(U)  
dell_rbu(U) autofs4(U) hidp(U) rfcomm(U) l2cap(U) bluetooth(U)  
sunrpc(U) rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U)  
ib_ipoib(U) ipoib_helper(U) ib_cm(U) ib_sa(U) ipv6(U) xfrm_nalgo(U)  
crypto_api(U) ib_uverbs(U) ib_umad(U) iw_cxgb3(U) cxgb3(U) ib_ipath(U)  
mlx4_ib(U) mlx4_core(U) dm_multipath(U) video(U) sbs(U) backlight(U)  
i2c_ec(U) i2c_core(U) button(U) battery(U) asus_acpi(U)  
acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) pata_acpi(U)  
joydev(U) ata_piix(U) libata(U) sr_mod(U) ib_mthca(U) ib_mad(U)  
shpchp(U) ib_core(U) sg(U) pcspkr(U) ide_cd(U) bnx2(U) cdrom(U)  
serio_raw(U) i5000_edac(U) edac_mc(U) dm_snapshot(U) dm_zero(U)  
dm_mirror(U) dm_mod(U) usb_storage(U) megaraid_sas(U) sd_mod(U)  
scsi_mod(U) ext3(U) jbd(U) ehci_hcd(U) ohci_hcd(U) uhci_hcd(U)
Pid: 10680, comm: ll_mdt_19 Tainted: G      2.6.18-92.1.17.el5_lustre. 
1.6.7smp #1
RIP: 0010:[<ffffffff800649e6>]  [<ffffffff800649e6>] _write_lock+0x7/0xf
RSP: 0018:ffff81040dc339d8  EFLAGS: 00000246
RAX: ffff81040826ea00 RBX: 0000000000006227 RCX: 000000000000aee7
RDX: 00000000000002e4 RSI: ffffffff802f0d80 RDI: ffffc20010da127c
RBP: 0000000000000286 R08: ffff81000101f3e0 R09: 0000000000000000
R10: ffff81040826fe00 R11: 0000000000000150 R12: 0000000000000286
R13: ffff81040dc33980 R14: ffff81040826fe58 R15: ffff81040826fe00
FS:  00002aca2e21d220(0000) GS:ffff81010eb9a4c0(0000) knlGS: 
0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000efdee288 CR3: 0000000414759000 CR4: 00000000000006e0

Call Trace:
  [<ffffffff8870e3fe>] :obdclass:lustre_hash_for_each_empty+0x20e/0x290
  [<ffffffff88714bf8>] :obdclass:class_disconnect+0x378/0x400
  [<ffffffff8878a5b0>] :ptlrpc:ldlm_cancel_locks_for_export_cb+0x0/0xc0
  [<ffffffff88aa605b>] :mds:mds_disconnect+0x11b/0x310
  [<ffffffff88674598>] :libcfs:libcfs_ip_addr2str+0x38/0x40
  [<ffffffff88712f44>] :obdclass:class_disconnect_export_list 
+0x5e4/0x780
  [<ffffffff88714f67>] :obdclass:class_disconnect_stale_exports 
+0x137/0x1a0
  [<ffffffff8879262c>] :ptlrpc:target_abort_recovery+0x19c/0x390
  [<ffffffff887c8e15>] :ptlrpc:lustre_msg_get_opc+0x35/0xf0
  [<ffffffff88aa0d97>] :mds:mds_handle+0x7f7/0x4d30
  [<ffffffff80143809>] __next_cpu+0x19/0x28
  [<ffffffff80143809>] __next_cpu+0x19/0x28
  [<ffffffff800898e3>] find_busiest_group+0x20d/0x621
  [<ffffffff887c85a5>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
  [<ffffffff887d0cfa>] :ptlrpc:ptlrpc_server_request_get+0x6a/0x150
  [<ffffffff887d2b7d>] :ptlrpc:ptlrpc_check_req+0x1d/0x110
  [<ffffffff887d5103>] :ptlrpc:ptlrpc_server_handle_request+0xa93/0x1150
  [<ffffffff80062f4b>] thread_return+0x0/0xdf
  [<ffffffff8006d8a2>] do_gettimeofday+0x40/0x8f
  [<ffffffff886787c6>] :libcfs:lcw_update_time+0x16/0x100
  [<ffffffff8003d514>] lock_timer_base+0x1b/0x3c
  [<ffffffff8001c5b0>] __mod_timer+0xb0/0xbe
  [<ffffffff887d85f8>] :ptlrpc:ptlrpc_main+0x1218/0x13e0
  [<ffffffff8008abb9>] default_wake_function+0x0/0xe
  [<ffffffff800b4382>] audit_syscall_exit+0x31b/0x336
  [<ffffffff8005dfb1>] child_rip+0xa/0x11
  [<ffffffff887d73e0>] :ptlrpc:ptlrpc_main+0x0/0x13e0
  [<ffffffff8005dfa7>] child_rip+0x0/0x11

Lustre: data-MDT0000: sending delayed replies to recovered clients
Lustre: data-MDT0000: recovery complete: rc 0
Lustre: MDS data-MDT0000: data-OST0000_UUID now active, resetting  
orphans
Lustre: MDS data-MDT0000: data-OST0001_UUID now active, resetting  
orphans
LustreError: 10616:0:(llog_server.c:471:llog_origin_handle_cancel())  
Cancel 62 of 124 llog-records failed: -2
LustreError: 10617:0:(llog_server.c:471:llog_origin_handle_cancel())  
Cancel 62 of 124 llog-records failed: -2
LustreError: 10668:0:(handler.c:1601:mds_handle()) operation 101 on  
unconnected MDS from 12345-192.168.64.14 at o2ib
LustreError: 10681:0:(ldlm_lib.c:1643:target_send_reply_msg()) @@@  
processing error (-107)  req at ffff81040c50c400 x52079502/t0 o101-><?>@<? 
 >:0/0 lens 232/0 e 0 to 0 dl 1237557977 ref 1 fl Interpret:/0/0 rc  
-107/0
LustreError: 10681:0:(ldlm_lib.c:1643:target_send_reply_msg()) Skipped  
1058 previous similar messages
LustreError: 10668:0:(handler.c:1601:mds_handle()) Skipped 1060  
previous similar messages
LustreError: 10687:0:(handler.c:1601:mds_handle()) operation 101 on  
unconnected MDS from 12345-192.168.64.14 at o2ib
LustreError: 10679:0:(ldlm_lib.c:1643:target_send_reply_msg()) @@@  
processing error (-107)  req at ffff81042dcecc00 x52082030/t0 o101-><?>@<? 
 >:0/0 lens 232/0 e 0 to 0 dl 1237558009 ref 1 fl Interpret:/0/0 rc  
-107/0
LustreError: 10679:0:(ldlm_lib.c:1643:target_send_reply_msg()) Skipped  
2518 previous similar messages
LustreError: 10687:0:(handler.c:1601:mds_handle()) Skipped 2518  
previous similar messages





Thomas Wakefield Jr.
Systems Administrator COLA/IGES
301-902-1268
twake at cola.iges.org

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20090320/060deb76/attachment.htm>


More information about the lustre-discuss mailing list