[lustre-discuss] lustre issue with OST setting to read-only mode as soon as writes are attempted. using Lustre 1.8.8

Kurt Strosahl strosahl at jlab.org
Mon May 11 06:17:44 PDT 2015


e2fsck 1.42.3.wc3 (15-Aug-2012)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences:  -1845243648 -1845243668 -(1845243713--1845243714) -1845243738 -1845243742 -(1845243751--1845243753) -(1845243756--1845243761) -1845243763 -1845243765 -1845243767 -1845243769 -(1845243776--1845243778) -(1845243781--1845243786) -1845243790 -1845243793 -(1845243816--1845243817) -1845243819 -1845243822 -(1845243824--1845243826) -(1845243829--1845243831) -(1845243890--1845243894) -(1845243899--1845243902) -(1845244225--1845244227) -1845244247 -1845244275 -1845244290 -1845244294 -1845244296 -1845244301 -1845244304 -1845244311 -1845244319 -(1845244322--1845244324) -1845244330 -(1845244348--1845244349) -1845244352 -1845244354 -1845244360 -1845244367 -1845244371 -1845244374 -1845244381 -(1845244385--1845244386) -(1845244395--1845244399) -(1845244409--1845244413)
Fix? no

Free blocks count wrong for group #56312 (4585, counted=4499).
Fix? no

Free blocks count wrong (597852500, counted=597852414).
Fix? no


lustre-OST0060: ********** WARNING: Filesystem still has errors **********

lustre-OST0060: 451137/22888704 files (39.9% non-contiguous), 2331868992/2929721492 blocks
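
As a cross-check on the output above, the following is a minimal sketch (not part of the original message) that counts the blocks listed under "Block bitmap differences" and compares the total with the two free-block discrepancies e2fsck reports. The helper name count_blocks is hypothetical. The group-number check assumes 32768 blocks per group and a first data block of 0, which is the usual layout for a 4 KiB block size but is not stated anywhere in this thread.

```python
import re

# Copied from the "Block bitmap differences" line of the e2fsck output above.
BITMAP_DIFFS = """
-1845243648 -1845243668 -(1845243713--1845243714) -1845243738 -1845243742
-(1845243751--1845243753) -(1845243756--1845243761) -1845243763 -1845243765
-1845243767 -1845243769 -(1845243776--1845243778) -(1845243781--1845243786)
-1845243790 -1845243793 -(1845243816--1845243817) -1845243819 -1845243822
-(1845243824--1845243826) -(1845243829--1845243831) -(1845243890--1845243894)
-(1845243899--1845243902) -(1845244225--1845244227) -1845244247 -1845244275
-1845244290 -1845244294 -1845244296 -1845244301 -1845244304 -1845244311
-1845244319 -(1845244322--1845244324) -1845244330 -(1845244348--1845244349)
-1845244352 -1845244354 -1845244360 -1845244367 -1845244371 -1845244374
-1845244381 -(1845244385--1845244386) -(1845244395--1845244399)
-(1845244409--1845244413)
"""

# One token is either "-N" / "+N" or a range "-(A--B)" / "+(A--B)".
TOKEN = re.compile(r"[+-]\(?(\d+)(?:--(\d+))?\)?")


def count_blocks(diff_text):
    """Return how many individual blocks the difference list covers."""
    total = 0
    for token in diff_text.split():
        match = TOKEN.fullmatch(token)
        if match is None:
            raise ValueError(f"unrecognised token: {token!r}")
        first = int(match.group(1))
        last = int(match.group(2) or first)
        total += last - first + 1  # ranges are inclusive
    return total


listed = count_blocks(BITMAP_DIFFS)
group_delta = 4585 - 4499            # "Free blocks count wrong for group #56312"
fs_delta = 597852500 - 597852414     # "Free blocks count wrong"

# ASSUMPTION: 32768 blocks per group, first data block 0 (not given in the thread).
BLOCKS_PER_GROUP = 32768
first_group = 1845243648 // BLOCKS_PER_GROUP
last_group = 1845244413 // BLOCKS_PER_GROUP

print(listed, group_delta, fs_delta)
print(first_group, last_group)
```

If the assumptions hold, all three counts agree and every listed block falls in group 56312, which suggests the three complaints describe one and the same inconsistency rather than separate problems.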

After some discussion here we are going to run the check again and let e2fsck fix the problems it finds.

w/r,
Kurt


----- Original Message -----
From: "Colin Faber" <cfaber at gmail.com>
To: "Kurt Strosahl" <strosahl at jlab.org>
Cc: lustre-discuss at lists.lustre.org
Sent: Thursday, May 7, 2015 5:05:06 PM
Subject: Re: [lustre-discuss] lustre issue with OST setting to read-only mode as soon as writes are attempted. using Lustre 1.8.8

Hi Kurt,

What does e2fsck -fn against the target look like? Does it find issues?

Also, there are a few known fixes for issues similar to the one you
describe above. Unfortunately I don't have the bug number handy; maybe
someone from Intel remembers which bug it is.
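
For reference, a minimal sketch (not from the thread) of how the read-only check suggested above could be assembled and how its exit status could be interpreted. The function names readonly_check_cmd and decode_exit are hypothetical. The device paths come from the original report further down, and the exit-status meanings follow the e2fsck(8) man page. The sketch only builds the argument list; it does not run e2fsck.

```python
# Exit status bits as documented in e2fsck(8); the status is a bitwise OR.
E2FSCK_EXIT_BITS = {
    1: "file system errors corrected",
    2: "file system errors corrected, system should be rebooted",
    4: "file system errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "e2fsck canceled by user request",
    128: "shared library error",
}


def readonly_check_cmd(device, journal=None):
    """Build an argv list for a forced (-f), no-changes (-n) e2fsck run."""
    cmd = ["e2fsck", "-f", "-n"]
    if journal is not None:
        # -j names the external journal device, as in the original report.
        cmd += ["-j", journal]
    cmd.append(device)
    return cmd


def decode_exit(status):
    """Translate an e2fsck exit status into a list of human-readable meanings."""
    if status == 0:
        return ["no errors"]
    return [text for bit, text in E2FSCK_EXIT_BITS.items() if status & bit]


# Device names taken from the thread; the target must be unmounted first.
print(" ".join(readonly_check_cmd("/dev/sdc2", journal="/dev/sdd5")))
print(decode_exit(4))
```

Because -n answers "no" to every prompt, a run that finds problems would be expected to exit with the "errors left uncorrected" bit set.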

-cf


On Thu, May 7, 2015 at 11:15 AM, Kurt Strosahl <strosahl at jlab.org> wrote:

> Nothing is presently wrong with sdc2. It is a partition on a RAID6 disk
> array, so smartctl doesn't see anything (nor does the RAID controller
> report any problems).  The RAID array did have a failed drive, but the
> drive was replaced and the rebuild started over an hour before the first
> time it went to read-only.
>
> Looking back in the logs I see the below error (which I thought I'd put in
> my original email).
> LDISKFS-fs error (device sdc2): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 56312corrupted: 4499 blocks free in bitmap, 4585 - in gd
>
> ----- Original Message -----
> From: "Colin Faber" <cfaber at gmail.com>
> To: "Kurt Strosahl" <strosahl at jlab.org>
> Cc: lustre-discuss at lists.lustre.org
> Sent: Thursday, May 7, 2015 11:59:35 AM
> Subject: Re: [lustre-discuss] lustre issue with OST setting to read-only
> mode as soon as writes are attempted. using Lustre 1.8.8
>
> Whoops, meant to respond here...
>
> Anyway, it seems something is wrong with sdc2. What does SMART tell you?
> Any notices about it in dmesg?
>
> On Thu, May 7, 2015 at 8:54 AM, Kurt Strosahl <strosahl at jlab.org> wrote:
>
> > Good Morning,
> >
> >      We recently had an OST encounter an issue with what appears to be its
> > journal...  The OST is sitting as a partition atop a RAID6 array, which was
> > rebuilding due to a failed disk.  The OST has a journal on an external
> > mirrored disk.  We unmounted the OST and ran the following:
> > e2fsck -y -C 0 /dev/sdc2 -j /dev/sdd5
> >
> >      After that we remounted the OST, and as soon as the first client
> > tried to write to it after recovery it went back to read-only.  We unmounted
> > it again, ran e2fsck again, and again it flipped to read-only the second
> > writes tried to go to it (I had set it to read-only in the MDS, and let it
> > sit for a few minutes before setting it back to read/write, to make sure
> > that the problem happened only on a write).
> >
> > May  7 10:28:48  kernel:
> > May  7 10:28:48  kernel: Aborting journal on device sdd5.
> > May  7 10:28:48  kernel: LDISKFS-fs (sdc2): Remounting filesystem read-only
> > May  7 10:28:48  kernel: LDISKFS-fs error (device sdc2) in ldiskfs_mb_free_blocks: IO failure
> > May  7 10:28:48  kernel: LDISKFS-fs error (device sdc2) in ldiskfs_reserve_inode_write: Journal has aborted
> > May  7 10:28:48  kernel: LDISKFS-fs error (device sdc2) in ldiskfs_reserve_inode_write: Journal has aborted
> > May  7 10:28:48  kernel: LDISKFS-fs error (device sdc2) in ldiskfs_ext_remove_space: Journal has aborted
> > May  7 10:28:48  kernel: LDISKFS-fs error (device sdc2) in ldiskfs_reserve_inode_write: Journal has aborted
> > May  7 10:28:48  kernel: LDISKFS-fs error (device sdc2) in ldiskfs_orphan_del: Journal has aborted
> > May  7 10:28:48  kernel: LDISKFS-fs error (device sdc2) in ldiskfs_reserve_inode_write: Journal has aborted
> > May  7 10:28:48  kernel: LDISKFS-fs error (device sdc2) in ldiskfs_ext_truncate: Journal has aborted
> > May  7 10:28:48  kernel: LustreError: 2436:0:(filter_log.c:174:filter_recov_log_unlink_cb()) error destroying object 2760722: -30
> > May  7 10:28:48  kernel: LustreError: 2434:0:(llog_cat.c:441:llog_cat_process_thread()) llog_cat_process() failed -30
> > May  7 10:28:58  kernel: LustreError: 8791:0:(fsfilt-ldiskfs.c:501:fsfilt_ldiskfs_brw_start()) can't get handle for 47 credits: rc = -30
> > May  7 10:28:58  kernel: LustreError: 8791:0:(fsfilt-ldiskfs.c:501:fsfilt_ldiskfs_brw_start()) Skipped 54 previous similar messages
> > May  7 10:28:58  kernel: LustreError: 8791:0:(filter_io_26.c:705:filter_commitrw_write()) error starting transaction: rc = -30
> > May  7 10:28:59  kernel: LustreError: 5245:0:(fsfilt-ldiskfs.c:367:fsfilt_ldiskfs_start()) error starting handle for op 4 (108 credits): rc -30
> > May  7 10:28:59  kernel: LustreError: 5245:0:(fsfilt-ldiskfs.c:367:fsfilt_ldiskfs_start()) Skipped 18 previous similar messages
> > May  7 10:29:03  kernel: LustreError: 8793:0:(filter_io_26.c:705:filter_commitrw_write()) error starting transaction: rc = -30
> > May  7 10:29:07  kernel: LustreError: 8711:0:(filter_io_26.c:705:filter_commitrw_write()) error starting transaction: rc = -30
> >
> > Kurt J. Strosahl
> > System Administrator
> > Scientific Computing Group, Thomas Jefferson National Accelerator Facility
> >
>
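
The LustreError lines quoted above all end in a return code of -30. As a minimal sketch (not part of the original thread), the snippet below pulls the rc values out of two of those log lines and maps them to errno names with Python's standard errno module. The helper names extract_rcs and describe_rc are hypothetical, and the mapping assumes the script runs on Linux, where errno 30 is EROFS (read-only file system).

```python
import errno
import os
import re

# Two sample messages copied from the quoted kernel log (timestamps dropped).
LOG_LINES = [
    "LustreError: 8791:0:(filter_io_26.c:705:filter_commitrw_write()) "
    "error starting transaction: rc = -30",
    "LustreError: 5245:0:(fsfilt-ldiskfs.c:367:fsfilt_ldiskfs_start()) "
    "error starting handle for op 4 (108 credits): rc -30",
]

# Matches both "rc = -30" and "rc -30" as they appear in the log.
RC_PATTERN = re.compile(r"\brc\s*=?\s*(-?\d+)")


def extract_rcs(lines):
    """Return every return code found in the given log lines."""
    return [int(m.group(1)) for line in lines for m in RC_PATTERN.finditer(line)]


def describe_rc(rc):
    """Map a negative kernel-style return code to (errno name, message)."""
    code = abs(rc)
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


for rc in extract_rcs(LOG_LINES):
    print(rc, *describe_rc(rc))
```

Read this way, the Lustre errors look like a consequence of ldiskfs remounting the device read-only after the journal abort, not an independent failure.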

