[lustre-discuss] lustre issue with OST setting to read-only mode as soon as writes are attempted. using Lustre 1.8.8

Colin Faber cfaber at gmail.com
Thu May 7 14:05:06 PDT 2015


Hi Kurt,

What does e2fsck -fn against the target look like? Does it find any issues?
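
For reference, a forced read-only check like that could be assembled roughly as follows. This is only a sketch: the device paths are the ones from your earlier e2fsck command, the helper name build_readonly_fsck is made up for illustration, and passing -j for the external journal is an assumption based on how you ran e2fsck before. The OST should be unmounted first.

```python
import shlex

def build_readonly_fsck(ost_dev, journal_dev=None):
    """Return the argv for a forced, read-only e2fsck pass (hypothetical helper)."""
    # -f: check even if the filesystem is marked clean
    # -n: open read-only and answer "no" to every prompt, so nothing is modified
    argv = ["e2fsck", "-f", "-n"]
    if journal_dev is not None:
        # -j: location of the external journal (assumed to be needed here)
        argv += ["-j", journal_dev]
    argv.append(ost_dev)
    return argv

# Device names taken from the thread; adjust for the actual target.
cmd = build_readonly_fsck("/dev/sdc2", "/dev/sdd5")
print(shlex.join(cmd))
```

The command is only printed here, not executed, so it can be reviewed before running it by hand.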

Also, there are a few known fixes for issues similar to what you
describe. Unfortunately I don't have the bug number handy; maybe
someone from Intel remembers which bug it is.

-cf


On Thu, May 7, 2015 at 11:15 AM, Kurt Strosahl <strosahl at jlab.org> wrote:

> Nothing presently wrong with sdc2, it is a partition on a raid6 disk array
> so smartctl doesn't see anything (nor does the raid controller report any
> problems).  The raid array did have a failed drive, but the drive was
> replaced, and the rebuild started, over an hour before the first time it
> went to read-only.
>
> Looking back in the logs I see the below error (which I thought I'd put in
> my original email).
> LDISKFS-fs error (device sdc2): ldiskfs_mb_check_ondisk_bitmap: on-disk
> bitmap for group 56312corrupted: 4499 blocks free in bitmap, 4585 - in gd
>
> ----- Original Message -----
> From: "Colin Faber" <cfaber at gmail.com>
> To: "Kurt Strosahl" <strosahl at jlab.org>
> Cc: lustre-discuss at lists.lustre.org
> Sent: Thursday, May 7, 2015 11:59:35 AM
> Subject: Re: [lustre-discuss] lustre issue with OST setting to read-only
> mode as soon as writes are attempted. using Lustre 1.8.8
>
> Whoops, meant to respond here...
>
> Anyways, it seems something is wrong with sdc2. What does SMART tell you?
> Any notices about it in dmesg?
>
> On Thu, May 7, 2015 at 8:54 AM, Kurt Strosahl <strosahl at jlab.org> wrote:
>
> > Good Morning,
> >
> >      We recently had an ost encounter an issue with what appears to be
> > its journal...  The ost is sitting as a partition atop a raid6 array,
> > which was rebuilding due to a failed disk.  The ost has a journal on an
> > external mirrored disk.  We unmounted the ost, and ran the following:
> > e2fsck -y -C 0 /dev/sdc2 -j /dev/sdd5
> >
> >      After that we remounted the ost, and as soon as the first client
> > tried to write to it after recovery it went back to read-only.  We
> > unmounted it again, ran e2fsck again, and again it flipped to read-only
> > the second writes tried to go to it (I had set it to read-only in the
> > mds, and let it sit for a few minutes before setting it back to
> > read/write to make sure that it was only on a write that the problem
> > happened).
> >
> > May  7 10:28:48  kernel:
> > May  7 10:28:48  kernel: Aborting journal on device sdd5.
> > May  7 10:28:48  kernel: LDISKFS-fs (sdc2): Remounting filesystem read-only
> > May  7 10:28:48  kernel: LDISKFS-fs error (device sdc2) in ldiskfs_mb_free_blocks: IO failure
> > May  7 10:28:48  kernel: LDISKFS-fs error (device sdc2) in ldiskfs_reserve_inode_write: Journal has aborted
> > May  7 10:28:48  kernel: LDISKFS-fs error (device sdc2) in ldiskfs_reserve_inode_write: Journal has aborted
> > May  7 10:28:48  kernel: LDISKFS-fs error (device sdc2) in ldiskfs_ext_remove_space: Journal has aborted
> > May  7 10:28:48  kernel: LDISKFS-fs error (device sdc2) in ldiskfs_reserve_inode_write: Journal has aborted
> > May  7 10:28:48  kernel: LDISKFS-fs error (device sdc2) in ldiskfs_orphan_del: Journal has aborted
> > May  7 10:28:48  kernel: LDISKFS-fs error (device sdc2) in ldiskfs_reserve_inode_write: Journal has aborted
> > May  7 10:28:48  kernel: LDISKFS-fs error (device sdc2) in ldiskfs_ext_truncate: Journal has aborted
> > May  7 10:28:48  kernel: LustreError: 2436:0:(filter_log.c:174:filter_recov_log_unlink_cb()) error destroying object 2760722: -30
> > May  7 10:28:48  kernel: LustreError: 2434:0:(llog_cat.c:441:llog_cat_process_thread()) llog_cat_process() failed -30
> > May  7 10:28:58  kernel: LustreError: 8791:0:(fsfilt-ldiskfs.c:501:fsfilt_ldiskfs_brw_start()) can't get handle for 47 credits: rc = -30
> > May  7 10:28:58  kernel: LustreError: 8791:0:(fsfilt-ldiskfs.c:501:fsfilt_ldiskfs_brw_start()) Skipped 54 previous similar messages
> > May  7 10:28:58  kernel: LustreError: 8791:0:(filter_io_26.c:705:filter_commitrw_write()) error starting transaction: rc = -30
> > May  7 10:28:59  kernel: LustreError: 5245:0:(fsfilt-ldiskfs.c:367:fsfilt_ldiskfs_start()) error starting handle for op 4 (108 credits): rc -30
> > May  7 10:28:59  kernel: LustreError: 5245:0:(fsfilt-ldiskfs.c:367:fsfilt_ldiskfs_start()) Skipped 18 previous similar messages
> > May  7 10:29:03  kernel: LustreError: 8793:0:(filter_io_26.c:705:filter_commitrw_write()) error starting transaction: rc = -30
> > May  7 10:29:07  kernel: LustreError: 8711:0:(filter_io_26.c:705:filter_commitrw_write()) error starting transaction: rc = -30
> >
> > Kurt J. Strosahl
> > System Administrator
> > Scientific Computing Group, Thomas Jefferson National Accelerator Facility
> >
>
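
As a side note on reading the messages quoted above: the rc = -30 values are negated Linux errno codes, and errno 30 is EROFS (read-only file system), so those LustreError lines are a consequence of the remount rather than a separate fault. The sketch below decodes that code and pulls the free-block counts out of the bitmap error. The regular expression and the helper names are illustrative only, not anything taken from Lustre or e2fsprogs.

```python
import errno
import os
import re

# The bitmap error exactly as quoted in the thread (note the missing space
# before "corrupted", which the pattern below tolerates).
BITMAP_LINE = (
    "LDISKFS-fs error (device sdc2): ldiskfs_mb_check_ondisk_bitmap: on-disk "
    "bitmap for group 56312corrupted: 4499 blocks free in bitmap, 4585 - in gd"
)

# Illustrative pattern: block group number, then the two free-block counts.
PATTERN = re.compile(
    r"group (\d+)\s*corrupted: (\d+) blocks free in bitmap, (\d+) - in gd"
)

def bitmap_mismatch(line):
    """Return (group, gd_free - bitmap_free), or None if the line does not match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    group, in_bitmap, in_gd = (int(x) for x in m.groups())
    return group, in_gd - in_bitmap

def describe_rc(rc):
    """Map a negated errno such as -30 to its symbolic name and message."""
    code = abs(rc)
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)

group, diff = bitmap_mismatch(BITMAP_LINE)
name, text = describe_rc(-30)
print(f"group {group}: group descriptor claims {diff} more free blocks than the bitmap")
print(f"rc -30 is {name}: {text}")
```

For the quoted line this reports a difference of 86 blocks in group 56312, which is the inconsistency the block allocator is tripping over when it aborts the journal.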