[Lustre-discuss] IO-Node issue

Young, Damiri Damiri.Young at unt.edu
Mon Jul 18 08:28:32 PDT 2011


Hi All,
On Friday 1 of the 4 I/O nodes in our lustre network began logging the following slow start_page_write errors for some OSTs:
--------------------------------------------------- messages -----------------------------------------------------------------------------------
Jul 15 12:55:32 IO-10 heartbeat: [21053]: info: Checking status of STONITH device [IPMI STONITH device ]
Jul 15 12:55:32 IO-10 heartbeat: [16746]: info: Managed STONITH-stat process 21053 exited with return code 0.
Jul 15 13:02:25 IO-10 kernel: Lustre: 18616:0:(filter_io.c:652:filter_preprw_write()) es1-OST000b: slow start_page_write 44s
Jul 15 13:02:25 IO-10 kernel: Lustre: 18688:0:(filter_io_26.c:779:filter_commitrw_write()) es1-OST000b: slow direct_io 44s
Jul 15 13:02:57 IO-10 kernel: Lustre: 18531:0:(filter_io.c:652:filter_preprw_write()) es1-OST000b: slow start_page_write 38s
Jul 15 13:03:29 IO-10 kernel: Lustre: 18566:0:(filter_io_26.c:779:filter_commitrw_write()) es1-OST000b: slow direct_io 32s
Jul 15 13:03:29 IO-10 kernel:  host2: SRP abort called
Jul 15 13:03:29 IO-10 last message repeated 4 times
Jul 15 13:03:29 IO-10 kernel:  host2: SRP reset_device called
Jul 15 13:03:29 IO-10 kernel:  host2: ib_srp: SRP reset_host called state 0 qp_err 0
Jul 15 13:03:39 IO-10 kernel: sd 2:0:0:12: SCSI error: return code = 0x06000000
Jul 15 13:03:39 IO-10 kernel: end_request: I/O error, dev sdag, sector 15012385240
Jul 15 13:03:39 IO-10 kernel: device-mapper: multipath: Failing path 66:0.
Jul 15 13:03:39 IO-10 kernel: sd 2:0:0:12: SCSI error: return code = 0x06000000
Jul 15 13:03:39 IO-10 kernel: end_request: I/O error, dev sdag, sector 15012386176
Jul 15 13:03:39 IO-10 kernel: sd 2:0:0:12: SCSI error: return code = 0x06000000
Jul 15 13:03:39 IO-10 kernel: end_request: I/O error, dev sdag, sector 15012386696
Jul 15 13:03:39 IO-10 kernel: sd 2:0:0:12: SCSI error: return code = 0x06000000
Jul 15 13:03:39 IO-10 kernel: end_request: I/O error, dev sdag, sector 15012385256
Jul 15 13:03:39 IO-10 kernel: sd 2:0:0:12: SCSI error: return code = 0x06000000
Jul 15 13:03:39 IO-10 kernel: end_request: I/O error, dev sdag, sector 15012385568
Jul 15 13:03:39 IO-10 multipathd: 66:0: mark as failed
Jul 15 13:03:39 IO-10 multipathd: lun_12: remaining active paths: 1
Jul 15 13:04:35 IO-10 kernel: LustreError: 18538:0:(filter_io_26.c:779:filter_commitrw_write()) es1-OST000b: slow direct_io 50s
Jul 15 13:04:35 IO-10 kernel: Lustre: 18447:0:(filter_io_26.c:779:filter_commitrw_write()) es1-OST000b: slow direct_io 31s
Jul 15 13:04:35 IO-10 kernel: Lustre: 18525:0:(filter_io.c:652:filter_preprw_write()) es1-OST000b: slow start_page_write 46s
Jul 15 13:04:35 IO-10 kernel: LustreError: 18598:0:(lustre_fsfilt.h:318:fsfilt_commit_wait()) es1-OST000b: slow journal start 56s
Jul 15 13:04:35 IO-10 kernel: LustreError: 18393:0:(filter_io_26.c:792:filter_commitrw_write()) es1-OST000b: slow commitrw commit 56s
Jul 15 13:04:35 IO-10 kernel: LustreError: 18393:0:(filter_io_26.c:792:filter_commitrw_write()) Skipped 1 previous similar message
Jul 15 13:04:35 IO-10 kernel: Lustre: 18739:0:(lustre_fsfilt.h:318:fsfilt_commit_wait()) es1-OST000b: slow journal start 46s
Jul 15 13:04:35 IO-10 kernel: Lustre: 18739:0:(filter_io_26.c:792:filter_commitrw_write()) es1-OST000b: slow commitrw commit 46s
Jul 15 13:04:35 IO-10 kernel: Lustre: 18422:0:(lustre_fsfilt.h:318:fsfilt_commit_wait()) es1-OST000b: slow journal start 31s
Jul 15 13:04:35 IO-10 kernel: Lustre: 18422:0:(lustre_fsfilt.h:318:fsfilt_commit_wait()) Skipped 5 previous similar messages
Jul 15 13:04:35 IO-10 kernel: Lustre: 18422:0:(filter_io_26.c:792:filter_commitrw_write()) es1-OST000b: slow commitrw commit 31s
Jul 15 13:04:35 IO-10 kernel: Lustre: 18422:0:(filter_io_26.c:792:filter_commitrw_write()) Skipped 5 previous similar messages
Jul 15 13:04:35 IO-10 kernel: LustreError: 18644:0:(lustre_fsfilt.h:318:fsfilt_commit_wait()) es1-OST000b: slow journal start 56s
Jul 15 13:04:35 IO-10 kernel: LustreError: 18644:0:(lustre_fsfilt.h:318:fsfilt_commit_wait()) Skipped 5 previous similar messages
Jul 15 13:04:35 IO-10 kernel: LustreError: 18644:0:(filter_io_26.c:792:filter_commitrw_write()) es1-OST000b: slow commitrw commit 56s
Jul 15 13:04:35 IO-10 kernel: LustreError: 18644:0:(filter_io_26.c:792:filter_commitrw_write()) Skipped 4 previous similar messages
Jul 15 13:04:35 IO-10 kernel: LustreError: 18598:0:(lustre_fsfilt.h:318:fsfilt_commit_wait()) Skipped 4 previous similar messages
Jul 15 13:04:35 IO-10 kernel: LustreError: 18766:0:(filter_io_26.c:779:filter_commitrw_write()) es1-OST000b: slow direct_io 55s
Jul 15 13:04:35 IO-10 kernel: LustreError: 18766:0:(filter_io_26.c:779:filter_commitrw_write()) Skipped 1 previous similar message
Jul 15 13:05:07 IO-10 kernel: Lustre: 18650:0:(filter_io_26.c:779:filter_commitrw_write()) es1-OST000b: slow direct_io 31s
Jul 15 13:05:07 IO-10 kernel: Lustre: 18650:0:(filter_io_26.c:779:filter_commitrw_write()) Skipped 2 previous similar messages
Jul 15 13:05:08 IO-10 kernel:  host1: SRP abort called
Jul 15 13:05:08 IO-10 kernel:  host1: SRP reset_device called
Jul 15 13:05:08 IO-10 kernel:  host1: ib_srp: SRP reset_host called state 0 qp_err 0
Jul 15 13:05:18 IO-10 kernel: sd 1:0:0:12: SCSI error: return code = 0x00000002
Jul 15 13:05:18 IO-10 kernel: end_request: I/O error, dev sdm, sector 861112
Jul 15 13:05:18 IO-10 kernel: device-mapper: multipath: Failing path 8:192.
Jul 15 13:05:18 IO-10 kernel: Buffer I/O error on device dm-11, logical block 107639
Jul 15 13:05:18 IO-10 kernel: lost page write due to I/O error on dm-11
Jul 15 13:05:18 IO-10 kernel: LDISKFS-fs error (device dm-11): kmmpd: Error writing to MMP block
Jul 15 13:05:18 IO-10 kernel: Remounting filesystem read-only
Jul 15 13:05:18 IO-10 kernel: LustreError: 18499:0:(fsfilt-ldiskfs.c:438:fsfilt_ldiskfs_brw_start()) can't get handle for 99 credits: rc = -3
0
Jul 15 13:05:18 IO-10 kernel: Lustre: 18657:0:(filter_io_26.c:779:filter_commitrw_write()) es1-OST000b: slow direct_io 36s
Jul 15 13:05:18 IO-10 multipathd: 8:192: mark as failed
Jul 15 13:05:18 IO-10 kernel: Lustre: 18657:0:(filter_io_26.c:779:filter_commitrw_write()) Skipped 1 previous similar message
Jul 15 13:05:18 IO-10 kernel: Buffer I/O error on device dm-11, logical block 79917
Jul 15 13:05:18 IO-10 multipathd: lun_12: remaining active paths: 0
Jul 15 13:05:18 IO-10 kernel: lost page write due to I/O error on dm-11
Jul 15 13:05:18 IO-10 kernel: Aborting journal on device dm-11.
Jul 15 13:05:18 IO-10 kernel: LustreError: 19103:0:(obd.h:1188:obd_transno_commit_cb()) es1-OST000b: transno 608801934 commit error: 2

------------------------------------------- end messages -------------------------------------------------------------------------


I then tried to fsck -n one of the mentioned devices for inconsistencies at which point I noticed on a kernel  panic. Rebooted the machine which appears to not be able to establish a connection or LNET:
-----------------------------------------------------messages ---------------------------------------------------------------------
Jul 18 09:24:44 IO-10 OpenSM[28708]: /var/log/opensm.log log file opened
Jul 18 09:24:44 IO-10 OpenSM[28708]: OpenSM 3.1.11
Jul 18 09:24:44 IO-10 OpenSM[28708]: Entering STANDBY state
Jul 18 09:28:23 IO-10 run_srp_daemon[28757]: failed srp_daemon: [HCA=mlx4_0] [port=2] [exit status=110]. Will try to restart srp_daemon perio
dically. No more warnings will be issued in the next 7200 seconds if the same problem repeats
Jul 18 09:28:23 IO-10 run_srp_daemon[28763]: failed srp_daemon: [HCA=mlx4_0] [port=1] [exit status=110]. Will try to restart srp_daemon perio
dically. No more warnings will be issued in the next 7200 seconds if the same problem repeats
Jul 18 09:28:24 IO-10 run_srp_daemon[28769]: starting srp_daemon: [HCA=mlx4_0] [port=2]
Jul 18 09:28:25 IO-10 run_srp_daemon[28782]: starting srp_daemon: [HCA=mlx4_0] [port=1]
Jul 18 09:33:10 IO-10 kernel: megasas: 00.00.03.15-RH1 Wed Nov. 21 10:29:45 PST 2007
Jul 18 09:33:10 IO-10 kernel: Lustre: OBD class driver, http://www.lustre.org/
Jul 18 09:33:10 IO-10 kernel:         Lustre Version: 1.6.6
Jul 18 09:33:10 IO-10 kernel:         Build Version: 1.6.6-1.6.6-ddn3.1-20090527173746
Jul 18 09:33:10 IO-10 kernel: Lustre: 30493:0:(o2iblnd_modparams.c:324:kiblnd_tunables_init()) Concurrent sends 7 is lower than message queue
 size: 8, performance may drop slightly.
Jul 18 09:33:10 IO-10 kernel: Lustre: Added LNI 10.1.0.229 at o2ib [8/64]
Jul 18 09:33:11 IO-10 kernel: Lustre: Lustre Client File System; http://www.lustre.org/
Jul 18 09:33:34 IO-10 kernel: kjournald starting.  Commit interval 5 seconds
Jul 18 09:33:34 IO-10 kernel: LDISKFS-fs warning: checktime reached, running e2fsck is recommended
Jul 18 09:33:34 IO-10 kernel: LDISKFS FS on dm-10, internal journal
Jul 18 09:33:34 IO-10 kernel: LDISKFS-fs: recovery complete.
Jul 18 09:33:34 IO-10 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Jul 18 09:33:34 IO-10 multipathd: dm-10: umount map (uevent)
Jul 18 09:33:45 IO-10 kernel: kjournald starting.  Commit interval 5 seconds
Jul 18 09:33:45 IO-10 kernel: LDISKFS-fs warning: checktime reached, running e2fsck is recommended
Jul 18 09:33:45 IO-10 kernel: LDISKFS FS on dm-10, internal journal
Jul 18 09:33:45 IO-10 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Jul 18 09:33:45 IO-10 kernel: LDISKFS-fs: file extents enabled
Jul 18 09:33:45 IO-10 kernel: LDISKFS-fs: mballoc enabled
Jul 18 09:33:45 IO-10 kernel: LustreError: 137-5: UUID 'es1-OST000a_UUID' is not available  for connect (no target)
Jul 18 09:33:45 IO-10 kernel: LustreError: 30789:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19)  req at ffff8101a6992c00
 x5558026/t0 o8-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1310999725 ref 1 fl Interpret:/0/0 rc -19/0
Jul 18 09:33:45 IO-10 kernel: LustreError: 30789:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 2 previous similar messages
Jul 18 09:33:45 IO-10 kernel: fsfilt_ldiskfs: no version for "ldiskfs_free_blocks" found: kernel tainted.
Jul 18 09:33:45 IO-10 kernel: Lustre: Filtering OBD driver; http://www.lustre.org/
Jul 18 09:33:45 IO-10 kernel: Lustre: 31834:0:(filter.c:868:filter_init_server_data()) RECOVERY: service es1-OST000a, 249 recoverable clients
, last_rcvd 469616249
Jul 18 09:33:45 IO-10 kernel: Lustre: OST es1-OST000a now serving dev (es1-OST000a/15fae56a-7dae-ba24-4423-347c0a118367), but will be in reco
very for at least 5:00, or until 249 clients reconnect. During this time new clients will not be allowed to connect. Recovery progress can be
 monitored by watching /proc/fs/lustre/obdfilter/es1-OST000a/recovery_status.
Jul 18 09:33:45 IO-10 kernel: Lustre: es1-OST000a.ost: set parameter quota_type=ug
Jul 18 09:33:45 IO-10 kernel: Lustre: Server es1-OST000a on device /dev/mpath/lun_11 has started
Jul 18 09:33:45 IO-10 kernel: Lustre: 30790:0:(ldlm_lib.c:1226:check_and_start_recovery_timer()) es1-OST000a: starting req at ffff8101a6992c00
 x5558026/t0 o8-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1310999725 ref 1 fl Interpret:/0/0 rc -19/0
Jul 18 09:33:45 IO-10 kernel: LustreError: 30789:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 2 previous similar messages
Jul 18 09:33:45 IO-10 kernel: fsfilt_ldiskfs: no version for "ldiskfs_free_blocks" found: kernel tainted.
Jul 18 09:33:45 IO-10 kernel: Lustre: Filtering OBD driver; http://www.lustre.org/
Jul 18 09:33:45 IO-10 kernel: Lustre: 31834:0:(filter.c:868:filter_init_server_data()) RECOVERY: service es1-OST000a, 249 recoverable clients
, last_rcvd 469616249
Jul 18 09:33:45 IO-10 kernel: Lustre: OST es1-OST000a now serving dev (es1-OST000a/15fae56a-7dae-ba24-4423-347c0a118367), but will be in reco
very for at least 5:00, or until 249 clients reconnect. During this time new clients will not be allowed to connect. Recovery progress can be
 monitored by watching /proc/fs/lustre/obdfilter/es1-OST000a/recovery_status.
Jul 18 09:33:45 IO-10 kernel: Lustre: es1-OST000a.ost: set parameter quota_type=ug
Jul 18 09:33:45 IO-10 kernel: Lustre: Server es1-OST000a on device /dev/mpath/lun_11 has started
Jul 18 09:33:45 IO-10 kernel: Lustre: 30790:0:(ldlm_lib.c:1226:check_and_start_recovery_timer()) es1-OST000a: starting recovery timer
Jul 18 09:33:45 IO-10 kernel: LustreError: 137-5: UUID 'es1-OST000b_UUID' is not available  for connect (no target)
Jul 18 09:33:45 IO-10 kernel: LustreError: Skipped 4 previous similar messages
Jul 18 09:33:45 IO-10 kernel: LustreError: 30792:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19)  req at ffff8103185c9000
 x215076358/t0 o8-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1310999725 ref 1 fl Interpret:/0/0 rc -19/0
Jul 18 09:33:45 IO-10 kernel: Lustre: 30794:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) es1-OST000a: 248 recoverable clients remain
Jul 18 09:33:46 IO-10 kernel: Lustre: 30806:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) es1-OST000a: 247 recoverable clients remain
Jul 18 09:33:46 IO-10 kernel: LustreError: 137-5: UUID 'es1-OST000e_UUID' is not available  for connect (no target)
Jul 18 09:33:46 IO-10 kernel: LustreError: 30815:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19)  req at ffff8102f093d000
 x220133374/t0 o8-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1310999726 ref 1 fl Interpret:/0/0 rc -19/0
Jul 18 09:33:46 IO-10 kernel: LustreError: 30815:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 11 previous similar messages
Jul 18 09:33:46 IO-10 kernel: LustreError: Skipped 14 previous similar messages
Jul 18 09:33:46 IO-10 kernel: Lustre: 30889:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) es1-OST000a: 245 recoverable clients remain
Jul 18 09:33:46 IO-10 kernel: Lustre: 30889:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) Skipped 1 previous similar message
Jul 18 09:33:47 IO-10 kernel: LustreError: 137-5: UUID 'es1-OST000b_UUID' is not available  for connect (no target)
Jul 18 09:33:47 IO-10 kernel: LustreError: 31004:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19)  req at ffff8102fea35450
 x33450807/t0 o8-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1310999727 ref 1 fl Interpret:/0/0 rc -19/0
Jul 18 09:33:47 IO-10 kernel: LustreError: 31004:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 47 previous similar messages
Jul 18 09:33:47 IO-10 kernel: LustreError: Skipped 47 previous similar messages
----------------------------------------------------------------------end messages ------------------------------------------------------------------------------------------

I'm not seeing that Lustre is trying to recover at all. Any thoughts on where to go from here, Lustre (mounted at /scratch on clients) is starting to have problems cluster-wide.



Best Regards,
--
DaMiri Young
System Engineer
High Performance Computing Team
CITC Academic Computing and User Services | UNT
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20110718/72099f3e/attachment.htm>


More information about the lustre-discuss mailing list