[Lustre-discuss] OSS not healthy

Frank Mietke frank.mietke at informatik.tu-chemnitz.de
Thu Mar 13 04:34:00 PDT 2008


Hi,

On Thu, Mar 13, 2008 at 03:29:29AM -0600, Andreas Dilger wrote:
> On Mar 13, 2008  10:15 +0100, Frank Mietke wrote:
> > we're using Lustre-1.6.4.2 and now one of our OSS (comprising two OSTs) shows
> > the status "not healthy". 
> > 
> > dmesg tells the following:
> > ...
> > [3082673.456429] LustreError:
> > 16561:0:(filter_io_26.c:705:filter_commitrw_write()) error starting transaction:
> > rc = -30
> > 
> > I've found that it seems to be the error EROFS. The documentation states that I
> > have to restart Lustre services. Is it enough to umount / mount both OSTs on
> > this OSS or do I have to umount everything (MDS/OSS)? Anything else to care
> > about?
> 
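
As a cross-check on the error code: the negative rc values that Lustre logs are kernel errno numbers, so they can be decoded with Python's standard errno module. This is only a minimal sketch, assuming it is run on a Linux host (errno numbering is platform-specific); decode_rc is a helper name made up for illustration, not part of Lustre.

```python
import errno
import os


def decode_rc(rc):
    """Return (symbolic name, description) for a kernel-style negative return code."""
    code = abs(rc)
    # errno.errorcode maps numeric errno values to their symbolic names.
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


# -30 is the code from the dmesg line above; -5 and -2 are other common codes.
for rc in (-30, -5, -2):
    name, text = decode_rc(rc)
    print(f"rc = {rc}: {name} ({text})")
```

On Linux this should report -30 as EROFS (read-only file system), which matches what the documentation says.
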
> You should investigate in your /var/log/messages why this happened.  It
> is usually a sign of filesystem corruption or disk errors, so you would
> likely also need to run e2fsck before remounting the filesystem.
Okay, I found the following in /var/log/messages just before the bulk of the
messages above appear. It looks like something went wrong with the RAID. Any hints?

Mar 13 05:50:37 chic2e24 kernel: [3067020.190468] LustreError: 4574:0:(ldlm_resource.c:719:ldlm_resource_add()) lvbo_init failed for resource 116733: rc -2
Mar 13 05:50:37 chic2e24 kernel: [3067020.190907] LustreError: 4574:0:(ldlm_resource.c:719:ldlm_resource_add()) Skipped 1 previous similar message
Mar 13 05:50:57 chic2e24 kernel: [3067040.964208] LustreError: 4598:0:(ldlm_resource.c:719:ldlm_resource_add()) lvbo_init failed for resource 10518: rc -2
Mar 13 05:50:57 chic2e24 kernel: [3067040.964652] LustreError: 4598:0:(ldlm_resource.c:719:ldlm_resource_add()) Skipped 2 previous similar messages
Mar 13 06:17:31 chic2e24 kernel: [3068633.701448] attempt to access beyond end of device
Mar 13 06:17:31 chic2e24 kernel: [3068633.701454] sda: rw=1, want=11287722456, limit=7796867072
Mar 13 06:17:31 chic2e24 kernel: [3068633.701555] attempt to access beyond end of device
Mar 13 06:17:31 chic2e24 kernel: [3068633.701558] sda: rw=1, want=25366292592, limit=7796867072
Mar 13 06:17:31 chic2e24 kernel: [3068633.701562] Buffer I/O error on device sda, logical block 3170786573
Mar 13 06:17:31 chic2e24 kernel: [3068633.701785] lost page write due to I/O error on sda
Mar 13 06:17:31 chic2e24 kernel: [3068633.702004] Aborting journal on device sda.
Mar 13 06:17:31 chic2e24 kernel: [3068633.702226] LustreError: 4493:0:(obd.h:1038:obd_transno_commit_cb()) chicfs-OST0010: transno 6510615555435490347 commit error: 2
Mar 13 06:17:31 chic2e24 kernel: [3068633.702933] LDISKFS-fs error (device sda) in ldiskfs_reserve_inode_write: Journal has aborted
Mar 13 06:17:31 chic2e24 kernel: [3068633.703587] Remounting filesystem read-only
Mar 13 06:17:31 chic2e24 kernel: [3068633.704001] journal commit I/O error
Mar 13 06:17:31 chic2e24 kernel: [3068633.704981] LDISKFS-fs error (device sda) in ldiskfs_dirty_inode: Journal has aborted
Mar 13 06:17:31 chic2e24 kernel: [3068633.705034] LustreError: 5887:0:(filter_io_26.c:767:filter_commitrw_write()) Failure to commit OST transaction (-5)?
Mar 13 06:17:31 chic2e24 kernel: [3068633.706134] LustreError: 4662:0:(fsfilt-ldiskfs.c:1318:fsfilt_ldiskfs_write_record()) can't start transaction for 37 blocks (128 bytes)
Mar 13 06:17:31 chic2e24 kernel: [3068633.706718] LustreError: 4662:0:(filter.c:139:filter_finish_transno()) wrote trans 6510615555435490348 for client 67e1aea3-f93a-affd-b39d-eefa306ae345 at #212: err = -30
Mar 13 06:17:31 chic2e24 kernel: [3068633.707570] LustreError: 4662:0:(filter_io_26.c:566:filter_direct_io()) can't close transaction: -30
Mar 13 06:17:31 chic2e24 kernel: [3068633.708153] LustreError: 4662:0:(fsfilt-ldiskfs.c:483:fsfilt_ldiskfs_commit_async()) error while stopping transaction: -30
Mar 13 06:17:31 chic2e24 kernel: [3068633.708735] LustreError: 4662:0:(filter_io_26.c:767:filter_commitrw_write()) Failure to commit OST transaction (-5)?
Mar 13 06:17:31 chic2e24 kernel: [3068633.708875] LustreError: 16324:0:(fsfilt-ldiskfs.c:417:fsfilt_ldiskfs_brw_start()) can't get handle for 530 credits: rc = -30
Mar 13 06:17:31 chic2e24 kernel: [3068633.708881] LustreError: 16324:0:(filter_io_26.c:705:filter_commitrw_write()) error starting transaction: rc = -30
Mar 13 06:17:31 chic2e24 kernel: [3068633.708976] LustreError: 4776:0:(filter_io_26.c:705:filter_commitrw_write()) error starting transaction: rc = -30
Mar 13 06:17:31 chic2e24 kernel: [3068633.709006] LustreError: 4742:0:(filter_io_26.c:705:filter_commitrw_write()) error starting transaction: rc = -30
Mar 13 06:17:31 chic2e24 kernel: [3068633.711072] LustreError: 4493:0:(obd.h:1038:obd_transno_commit_cb()) chicfs-OST0010: transno 6510615555435490348 commit error: 2
Mar 13 06:17:31 chic2e24 kernel: [3068633.711100] LustreError: 16385:0:(fsfilt-ldiskfs.c:417:fsfilt_ldiskfs_brw_start()) can't get handle for 530 credits: rc = -30
Mar 13 06:17:31 chic2e24 kernel: [3068633.711105] LustreError: 16385:0:(fsfilt-ldiskfs.c:417:fsfilt_ldiskfs_brw_start()) Skipped 2 previous similar messages
Mar 13 06:17:31 chic2e24 kernel: [3068633.711110] LustreError: 16385:0:(filter_io_26.c:705:filter_commitrw_write()) error starting transaction: rc = -30
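
A quick sanity check on the "attempt to access beyond end of device" lines above. The sketch below assumes that want and limit are counted in 512-byte sectors (the usual block-layer unit) and that the filesystem block size is 4096 bytes; neither is stated in the log itself. The function names are made up for illustration.

```python
import re

SECTOR_BYTES = 512             # assumption: want/limit are 512-byte sectors
FS_BLOCK_BYTES = 4096          # assumption: 4 KiB filesystem blocks
SECTORS_PER_BLOCK = FS_BLOCK_BYTES // SECTOR_BYTES

# Lines copied from the log above, without the syslog prefix.
LOG = """\
sda: rw=1, want=11287722456, limit=7796867072
sda: rw=1, want=25366292592, limit=7796867072
Buffer I/O error on device sda, logical block 3170786573
"""

ACCESS_RE = re.compile(r"(\w+): rw=\d+, want=(\d+), limit=(\d+)")
BLOCK_RE = re.compile(r"logical block (\d+)")


def out_of_range_accesses(text):
    """Return (device, want, limit) for every request that ends past the device."""
    hits = []
    for m in ACCESS_RE.finditer(text):
        want, limit = int(m.group(2)), int(m.group(3))
        if want > limit:
            hits.append((m.group(1), want, limit))
    return hits


def sectors_to_tib(sectors):
    """Convert a sector count to TiB under the 512-byte sector assumption."""
    return sectors * SECTOR_BYTES / 2**40


def block_end_sector(block):
    """Sector just past the end of the given filesystem block."""
    return (block + 1) * SECTORS_PER_BLOCK


for dev, want, limit in out_of_range_accesses(LOG):
    print(f"{dev}: request ends at {sectors_to_tib(want):.2f} TiB, "
          f"device ends at {sectors_to_tib(limit):.2f} TiB "
          f"({want - limit} sectors past the end)")

block = int(BLOCK_RE.search(LOG).group(1))
print(f"block {block} ends at sector {block_end_sector(block)}")
```

Under these assumptions both writes target offsets far beyond the end of sda, and the end of logical block 3170786573 lands exactly on want=25366292592, so the second request and the buffer I/O error refer to the same block. Block numbers that far outside the device would be consistent with corrupted on-disk metadata, but the log alone does not prove that.
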

Best Regards,
Frank


> 
> Doing the unmount/mount of just the OSTs should be enough
> 
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
> 
> 

-- 
Dipl.-Inf. Frank Mietke     |     Fakultätsrechen- und Informationszentrum
Tel.: 0371 - 531 - 35538    |     Fak. für Informatik
Fax:  0371 - 531 8 35538    |     TU-Chemnitz
Key-ID: 60F59599            |     frank.mietke at informatik.tu-chemnitz.de
