[lustre-discuss] lustre issue with OST setting to read-only mode as soon as writes are attempted. using Lustre 1.8.8

Kurt Strosahl strosahl at jlab.org
Mon May 11 07:21:47 PDT 2015


It took a while, but the e2fsck has now finished.


:~] e2fsck -fy -C 0 /dev/sdc2 -j /dev/sdd5
e2fsck 1.42.3.wc3 (15-Aug-2012)                      
Pass 1: Checking inodes, blocks, and sizes           
Pass 2: Checking directory structure                                           
Pass 3: Checking directory connectivity                                        
Pass 4: Checking reference counts                                              
Pass 5: Checking group summary information                                     
Block bitmap differences:  -1845243648 -1845243668 -(1845243713--1845243714) -1845243738 -1845243742 -(1845243751--1845243753) -(1845243756--1845243761) -1845243763 -1845243765 -1845243767 -1845243769 -(1845243776--1845243778) -(1845243781--1845243786) -1845243790 -1845243793 -(1845243816--1845243817) -1845243819 -1845243822 -(1845243824--1845243826) -(1845243829--1845243831) -(1845243890--1845243894) -(1845243899--1845243902) -(1845244225--1845244227) -1845244247 -1845244275 -1845244290 -1845244294 -1845244296 -1845244301 -1845244304 -1845244311 -1845244319 -(1845244322--1845244324) -1845244330 -(1845244348--1845244349) -1845244352 -1845244354 -1845244360 -1845244367 -1845244371 -1845244374 -1845244381 -(1845244385--1845244386) -(1845244395--1845244399) -(1845244409--1845244413)                                                                                            
Fix? yes                                                                                                                                           

                                                                               
lustre-OST0060: ***** FILE SYSTEM WAS MODIFIED *****                           
lustre-OST0060: 451137/22888704 files (39.9% non-contiguous), 2331868992/2929721492 blocks

I mounted the OST, but I haven't set it back to read-write yet because of the error below...
  Lustre: lustre-OST0060: sending delayed replies to recovered clients
LustreError: 12922:0:(filter_log.c:135:filter_cancel_cookies_cb()) error cancelling log cookies: rc = -19
LustreError: 12922:0:(filter_log.c:135:filter_cancel_cookies_cb()) Skipped 2 previous similar messages

This is an error message it was getting before as well.

As an aside, over the weekend we had a large number of client nodes reboot.  When they came back up they were unable to reach the OST (it showed as inactive).  It wasn't displaying this behaviour before, and clients that hadn't rebooted could still see it.
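
In case it's useful, here is roughly what the checks look like on our end; the device index is only a placeholder, the real one comes from lctl dl:

  # on a client: confirm which OSTs are reachable
  lfs check servers
  # find the OSC device for this OST and whether it shows as active
  lctl dl | grep OST0060
  # on the MDS: deactivate stops new object allocation on this OST, activate re-enables it
  lctl --device <devno> deactivate
  lctl --device <devno> activate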

w/r,
Kurt
----- Original Message -----
From: "Kurt Strosahl" <strosahl at jlab.org>
To: "Colin Faber" <cfaber at gmail.com>
Cc: lustre-discuss at lists.lustre.org
Sent: Monday, May 11, 2015 9:17:44 AM
Subject: Re: [lustre-discuss] lustre issue with OST setting to read-only mode as soon as writes are attempted. using Lustre 1.8.8

e2fsck 1.42.3.wc3 (15-Aug-2012)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences:  -1845243648 -1845243668 -(1845243713--1845243714) -1845243738 -1845243742 -(1845243751--1845243753) -(1845243756--1845243761) -1845243763 -1845243765 -1845243767 -1845243769 -(1845243776--1845243778) -(1845243781--1845243786) -1845243790 -1845243793 -(1845243816--1845243817) -1845243819 -1845243822 -(1845243824--1845243826) -(1845243829--1845243831) -(1845243890--1845243894) -(1845243899--1845243902) -(1845244225--1845244227) -1845244247 -1845244275 -1845244290 -1845244294 -1845244296 -1845244301 -1845244304 -1845244311 -1845244319 -(1845244322--1845244324) -1845244330 -(1845244348--1845244349) -1845244352 -1845244354 -1845244360 -1845244367 -1845244371 -1845244374 -1845244381 -(1845244385--1845244386) -(1845244395--1845244399) -(1845244409--1845244413)
Fix? no

Free blocks count wrong for group #56312 (4585, counted=4499).
Fix? no

Free blocks count wrong (597852500, counted=597852414).
Fix? no


lustre-OST0060: ********** WARNING: Filesystem still has errors **********

lustre-OST0060: 451137/22888704 files (39.9% non-contiguous), 2331868992/2929721492 blocks

After some discussion here, we are going to run the check again and let e2fsck fix the problems it finds.
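
That is, the same invocation as before but with -y so e2fsck actually applies the fixes it proposes, along the lines of:

  e2fsck -fy -C 0 /dev/sdc2 -j /dev/sdd5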

w/r,
Kurt


----- Original Message -----
From: "Colin Faber" <cfaber at gmail.com>
To: "Kurt Strosahl" <strosahl at jlab.org>
Cc: lustre-discuss at lists.lustre.org
Sent: Thursday, May 7, 2015 5:05:06 PM
Subject: Re: [lustre-discuss] lustre issue with OST setting to read-only mode as soon as writes are attempted. using Lustre 1.8.8

Hi Kurt,

What's e2fsck -fn against the target look like? Does it find issues?
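
Presumably the same invocation you ran before, just with -n so nothing on disk is modified, e.g.:

  e2fsck -fn -C 0 /dev/sdc2 -j /dev/sdd5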

Also, there are a few known fixes for issues similar to what you describe
above.  Unfortunately I don't have the bug number handy; maybe someone
from Intel remembers which bug it is.

-cf


On Thu, May 7, 2015 at 11:15 AM, Kurt Strosahl <strosahl at jlab.org> wrote:

> Nothing is presently wrong with sdc2; it is a partition on a RAID6 disk
> array, so smartctl doesn't see anything (nor does the RAID controller
> report any problems).  The array did have a failed drive, but the drive
> was replaced and the rebuild started over an hour before the first time
> it went read-only.
>
> Looking back in the logs I see the error below (which I thought I'd
> included in my original email):
> LDISKFS-fs error (device sdc2): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 56312 corrupted: 4499 blocks free in bitmap, 4585 - in gd
>
> ----- Original Message -----
> From: "Colin Faber" <cfaber at gmail.com>
> To: "Kurt Strosahl" <strosahl at jlab.org>
> Cc: lustre-discuss at lists.lustre.org
> Sent: Thursday, May 7, 2015 11:59:35 AM
> Subject: Re: [lustre-discuss] lustre issue with OST setting to read-only
> mode as soon as writes are attempted. using Lustre 1.8.8
>
> Whoops, meant to respond here...
>
> Anyway, it seems something is wrong with sdc2.  What does SMART tell you?
> Any notices about it in dmesg?
>
> On Thu, May 7, 2015 at 8:54 AM, Kurt Strosahl <strosahl at jlab.org> wrote:
>
> > Good Morning,
> >
> >      We recently had an OST encounter an issue with what appears to be
> > its journal...  The OST is sitting as a partition atop a RAID6 array,
> > which was rebuilding due to a failed disk.  The OST has a journal on an
> > external mirrored disk.  We unmounted the OST and ran the following:
> > e2fsck -y -C 0 /dev/sdc2 -j /dev/sdd5
> >
> >      After that we remounted the OST, and as soon as the first client
> > tried to write to it after recovery it went back to read-only.  We
> > unmounted it again, ran e2fsck again, and again it flipped to read-only
> > the second writes tried to go to it (I had set it to read-only on the
> > MDS and let it sit for a few minutes before setting it back to
> > read/write, to make sure that the problem only happened on a write).
> >
> > May  7 10:28:48  kernel:
> > May  7 10:28:48  kernel: Aborting journal on device sdd5.
> > May  7 10:28:48  kernel: LDISKFS-fs (sdc2): Remounting filesystem read-only
> > May  7 10:28:48  kernel: LDISKFS-fs error (device sdc2) in ldiskfs_mb_free_blocks: IO failure
> > May  7 10:28:48  kernel: LDISKFS-fs error (device sdc2) in ldiskfs_reserve_inode_write: Journal has aborted
> > May  7 10:28:48  kernel: LDISKFS-fs error (device sdc2) in ldiskfs_reserve_inode_write: Journal has aborted
> > May  7 10:28:48  kernel: LDISKFS-fs error (device sdc2) in ldiskfs_ext_remove_space: Journal has aborted
> > May  7 10:28:48  kernel: LDISKFS-fs error (device sdc2) in ldiskfs_reserve_inode_write: Journal has aborted
> > May  7 10:28:48  kernel: LDISKFS-fs error (device sdc2) in ldiskfs_orphan_del: Journal has aborted
> > May  7 10:28:48  kernel: LDISKFS-fs error (device sdc2) in ldiskfs_reserve_inode_write: Journal has aborted
> > May  7 10:28:48  kernel: LDISKFS-fs error (device sdc2) in ldiskfs_ext_truncate: Journal has aborted
> > May  7 10:28:48  kernel: LustreError: 2436:0:(filter_log.c:174:filter_recov_log_unlink_cb()) error destroying object 2760722: -30
> > May  7 10:28:48  kernel: LustreError: 2434:0:(llog_cat.c:441:llog_cat_process_thread()) llog_cat_process() failed -30
> > May  7 10:28:58  kernel: LustreError: 8791:0:(fsfilt-ldiskfs.c:501:fsfilt_ldiskfs_brw_start()) can't get handle for 47 credits: rc = -30
> > May  7 10:28:58  kernel: LustreError: 8791:0:(fsfilt-ldiskfs.c:501:fsfilt_ldiskfs_brw_start()) Skipped 54 previous similar messages
> > May  7 10:28:58  kernel: LustreError: 8791:0:(filter_io_26.c:705:filter_commitrw_write()) error starting transaction: rc = -30
> > May  7 10:28:59  kernel: LustreError: 5245:0:(fsfilt-ldiskfs.c:367:fsfilt_ldiskfs_start()) error starting handle for op 4 (108 credits): rc -30
> > May  7 10:28:59  kernel: LustreError: 5245:0:(fsfilt-ldiskfs.c:367:fsfilt_ldiskfs_start()) Skipped 18 previous similar messages
> > May  7 10:29:03  kernel: LustreError: 8793:0:(filter_io_26.c:705:filter_commitrw_write()) error starting transaction: rc = -30
> > May  7 10:29:07  kernel: LustreError: 8711:0:(filter_io_26.c:705:filter_commitrw_write()) error starting transaction: rc = -30
> >
> > Kurt J. Strosahl
> > System Administrator
> > Scientific Computing Group, Thomas Jefferson National Accelerator Facility
> > _______________________________________________
> > lustre-discuss mailing list
> > lustre-discuss at lists.lustre.org
> > http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> >
>

