<div dir="ltr">Hi Kurt,<div><br></div><div>What does e2fsck -fn against the target look like? Does it find issues?</div><div><br></div><div>Also, there are a few known fixes for issues similar to what you describe above. Unfortunately I don't have the bug number handy; maybe someone from Intel remembers which bug it is.</div><div><br></div><div>-cf</div><div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, May 7, 2015 at 11:15 AM, Kurt Strosahl <span dir="ltr"><<a href="mailto:strosahl@jlab.org" target="_blank">strosahl@jlab.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Nothing is presently wrong with sdc2. It is a partition on a raid6 disk array, so smartctl doesn't see anything (nor does the raid controller report any problems).  The raid array did have a failed drive, but the drive was replaced, and the rebuild started, over an hour before the first time it went to read-only.<br>

<br>

Looking back in the logs I see the below error (which I thought I'd put in my original email).<br>

LDISKFS-fs error (device sdc2): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 56312 corrupted: 4499 blocks free in bitmap, 4585 - in gd<br>
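<br>
As a rough illustration, the gap between the two free-block counts in that message can be worked out with a small shell sketch. The function name bitmap_mismatch is hypothetical, and the sed pattern assumes the message wording exactly as quoted above.<br>

```shell
#!/bin/sh
# Hypothetical helper: pull the two free-block counts out of an
# ldiskfs_mb_check_ondisk_bitmap message and print (gd count - bitmap count).
bitmap_mismatch() {
    # $1 is a single log line; the pattern assumes the wording quoted above.
    counts=$(printf '%s\n' "$1" |
        sed -n 's/.*: \([0-9][0-9]*\) blocks free in bitmap, \([0-9][0-9]*\) - in gd.*/\1 \2/p')
    # No match means this is not a bitmap-mismatch message.
    [ -n "$counts" ] || return 1
    # Split "bitmap gd" into $1 and $2.
    set -- $counts
    echo $(( $2 - $1 ))
}

line='LDISKFS-fs error (device sdc2): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 56312 corrupted: 4499 blocks free in bitmap, 4585 - in gd'
bitmap_mismatch "$line"
```

For the message above this prints 86: the group descriptor counts 86 more free blocks than the on-disk bitmap does.<br>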

<div class="HOEnZb"><div class="h5"><br>

----- Original Message -----<br>

From: "Colin Faber" <<a href="mailto:cfaber@gmail.com">cfaber@gmail.com</a>><br>

To: "Kurt Strosahl" <<a href="mailto:strosahl@jlab.org">strosahl@jlab.org</a>><br>

Cc: <a href="mailto:lustre-discuss@lists.lustre.org">lustre-discuss@lists.lustre.org</a><br>

Sent: Thursday, May 7, 2015 11:59:35 AM<br>

Subject: Re: [lustre-discuss] lustre issue with OST setting to read-only mode as soon as writes are attempted. using Lustre 1.8.8<br>

<br>

Whoops, meant to respond here...<br>

<br>

Anyways, it seems something is wrong with sdc2. What does SMART tell you? Any<br>

notices about it in dmesg?<br>

<br>

On Thu, May 7, 2015 at 8:54 AM, Kurt Strosahl <<a href="mailto:strosahl@jlab.org">strosahl@jlab.org</a>> wrote:<br>

<br>

> Good Morning,<br>

><br>

>      We recently had an ost encounter an issue with what appears to be its<br>

> journal...  The ost is sitting as a partition atop a raid6 array, which was<br>

> rebuilding due to a failed disk.  The ost has a journal on an external<br>

> mirrored disk.  We unmounted the ost, and ran  the following: e2fsck -y -C<br>

> 0 /dev/sdc2 -j /dev/sdd5<br>

><br>

>      After that we remounted the ost, and as soon as the first client<br>

> tried to write to it after recovery, it went back to read-only.  We unmounted<br>

> it again, ran e2fsck again, and again it flipped to read-only the second a<br>

> write tried to go to it (I had set it to read only in the mds, and let it<br>

> sit for a few minutes before setting it back to read/write to make sure<br>

> that it was only on a write that the problem happened).<br>

><br>

> May  7 10:28:48  kernel:<br>

> May  7 10:28:48  kernel: Aborting journal on device sdd5.<br>

> May  7 10:28:48  kernel: LDISKFS-fs (sdc2): Remounting filesystem read-only<br>

> May  7 10:28:48  kernel: LDISKFS-fs error (device sdc2) in<br>

> ldiskfs_mb_free_blocks: IO failure<br>

> May  7 10:28:48  kernel: LDISKFS-fs error (device sdc2) in<br>

> ldiskfs_reserve_inode_write: Journal has aborted<br>

> May  7 10:28:48  kernel: LDISKFS-fs error (device sdc2) in<br>

> ldiskfs_reserve_inode_write: Journal has aborted<br>

> May  7 10:28:48  kernel: LDISKFS-fs error (device sdc2) in<br>

> ldiskfs_ext_remove_space: Journal has aborted<br>

> May  7 10:28:48  kernel: LDISKFS-fs error (device sdc2) in<br>

> ldiskfs_reserve_inode_write: Journal has aborted<br>

> May  7 10:28:48  kernel: LDISKFS-fs error (device sdc2) in<br>

> ldiskfs_orphan_del: Journal has aborted<br>

> May  7 10:28:48  kernel: LDISKFS-fs error (device sdc2) in<br>

> ldiskfs_reserve_inode_write: Journal has aborted<br>

> May  7 10:28:48  kernel: LDISKFS-fs error (device sdc2) in<br>

> ldiskfs_ext_truncate: Journal has aborted<br>

> May  7 10:28:48  kernel: LustreError:<br>

> 2436:0:(filter_log.c:174:filter_recov_log_unlink_cb()) error destroying<br>

> object 2760722: -30<br>

> May  7 10:28:48  kernel: LustreError:<br>

> 2434:0:(llog_cat.c:441:llog_cat_process_thread()) llog_cat_process() failed<br>

> -30<br>

> May  7 10:28:58  kernel: LustreError:<br>

> 8791:0:(fsfilt-ldiskfs.c:501:fsfilt_ldiskfs_brw_start()) can't get handle<br>

> for 47 credits: rc = -30<br>

> May  7 10:28:58  kernel: LustreError:<br>

> 8791:0:(fsfilt-ldiskfs.c:501:fsfilt_ldiskfs_brw_start()) Skipped 54<br>

> previous similar messages<br>

> May  7 10:28:58  kernel: LustreError:<br>

> 8791:0:(filter_io_26.c:705:filter_commitrw_write()) error starting<br>

> transaction: rc = -30<br>

> May  7 10:28:59  kernel: LustreError:<br>

> 5245:0:(fsfilt-ldiskfs.c:367:fsfilt_ldiskfs_start()) error starting handle<br>

> for op 4 (108 credits): rc -30<br>

> May  7 10:28:59  kernel: LustreError:<br>

> 5245:0:(fsfilt-ldiskfs.c:367:fsfilt_ldiskfs_start()) Skipped 18 previous<br>

> similar messages<br>

> May  7 10:29:03  kernel: LustreError:<br>

> 8793:0:(filter_io_26.c:705:filter_commitrw_write()) error starting<br>

> transaction: rc = -30<br>

> May  7 10:29:07  kernel: LustreError:<br>

> 8711:0:(filter_io_26.c:705:filter_commitrw_write()) error starting<br>

> transaction: rc = -30<br>

><br>

> Kurt J. Strosahl<br>

> System Administrator<br>

> Scientific Computing Group, Thomas Jefferson National Accelerator Facility<br>

> _______________________________________________<br>

> lustre-discuss mailing list<br>

> <a href="mailto:lustre-discuss@lists.lustre.org">lustre-discuss@lists.lustre.org</a><br>

> <a href="http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org" target="_blank">http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org</a><br>

><br>

</div></div></blockquote></div><br></div>
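<br>
A side note on the LustreError lines quoted above: the trailing -30 is a negated Linux errno value, and 30 is EROFS (read-only file system), which matches the remount to read-only earlier in the log. The sketch below is only illustrative. The names rc_from_line and errno_name are hypothetical, and the lookup table lists just a few codes.<br>

```shell
#!/bin/sh
# Hypothetical helper: extract a trailing negative return code from a log line,
# e.g. "... error starting transaction: rc = -30" gives "-30".
rc_from_line() {
    printf '%s\n' "$1" | sed -n 's/.*[ :]\(-[0-9][0-9]*\)$/\1/p'
}

# Hypothetical helper: map a negated Linux errno to its name.
# Illustrative subset only; see errno.h for the full list.
errno_name() {
    case "$1" in
        -5)  echo "EIO (I/O error)" ;;
        -28) echo "ENOSPC (No space left on device)" ;;
        -30) echo "EROFS (Read-only file system)" ;;
        *)   echo "unknown" ;;
    esac
}

sample='8791:0:(filter_io_26.c:705:filter_commitrw_write()) error starting transaction: rc = -30'
errno_name "$(rc_from_line "$sample")"
```

Every -30 in the log is therefore a consequence of the filesystem already being read-only, so the error to chase is the earlier one that aborted the journal.<br>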