[Lustre-discuss] OST error

Bob Ball ball at umich.edu
Thu Dec 2 13:35:02 PST 2010


It is a Dell PERC6 RAID array.  OMSA monitoring is enabled and is not 
throwing errors.  Hmmmm, mptctl is old though, so maybe that is a 
contributing factor.  I guess I need to update that.  Shoot, 
megaraid_sas is also not up to date.  dkms....

OK, guess I need some driver updates.

Later.
bob

On 12/2/2010 4:05 PM, Colin Faber wrote:
> Hi Bob,
>
> If you're seeing the same errors on the same disk after an e2fsck run,
> and e2fsck isn't catching them, you may be hitting an edge case that
> e2fsck doesn't handle properly. If instead you're seeing different
> errors, and e2fsck did catch the earlier ones, chances are you're
> looking at a hardware failure somewhere.
>
> If this is a single disk and you have SMART monitoring enabled, check
> your error counters. If it's a RAID device, verify the error counters
> on that.
>
> -cf
>
>
> On 12/02/2010 02:00 PM, Bob Ball wrote:
>> We were getting errors thrown by an OST.  /var/log/messages contained a
>> lot of these:
>> 2010-11-28T17:05:34-05:00 umfs06.aglt2.org kernel: [2102640.735927]
>> LDISKFS-fs error (device sdk): ldiskfs_mb_check_ondisk_bitmap: on-disk
>> bitmap for group 639corrupted: 440 blocks free in bitmap, 439 - in gd
>>
>> So, I turned off (most) access to the disk via lctl (we have a LOT of
>> client machines, some were missed) and got problems.  Had to use the
>> alternate superblock to e2fsck the disk.  When back online, I still saw
>> similar messages.  Updated to e2fsprogs 1.41.12 as suggested elsewhere.
>> Repeated e2fsck.
>>
>> Still seeing these.  Users report some files corrupted, coming up with
>> bad md5sum....  Any other thoughts on what to do about this problem?
>>
>> [2440763.879143] LDISKFS-fs error (device sdk):
>> ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 35406corrupted:
>> 1318 blocks free in bitmap, 1317 - in gd
>> [2440763.879796]
>> [2440763.882724] LustreError:
>> 1651027:0:(fsfilt-ldiskfs.c:1333:fsfilt_ldiskfs_write_record()) can't
>> read/create block: -28
>> [2440763.882736] LustreError:
>> 1651027:0:(llog_lvfs.c:116:llog_lvfs_write_blob()) error writing log
>> record: rc -28
>> [2440763.882789] LustreError:
>> 1651002:0:(mgc_request.c:1089:mgc_copy_llog()) Failed to copy remote log
>> umt3-OST0019 (-28)
>>
>> Rebooted to make system clean as a whole, and found the same kind of
>> thing repeating.
>> [  285.834864] LDISKFS-fs (sdk): warning: mounting fs with errors,
>> running e2fsck is recommended
>> [  285.852559] LDISKFS-fs (sdk): mounted filesystem with ordered data
>> mode
>> [  286.079065] LDISKFS-fs (sdk): warning: mounting fs with errors,
>> running e2fsck is recommended
>> [  286.096316] LDISKFS-fs (sdk): mounted filesystem with ordered data
>> mode
>> [  286.940872] LDISKFS-fs error (device sdk):
>> ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 35406corrupted:
>> 1318 blocks free in bitmap, 1317 - in gd
>> [  286.941693]
>> [  286.945224] LustreError:
>> 5790:0:(fsfilt-ldiskfs.c:1333:fsfilt_ldiskfs_write_record()) can't
>> read/create block: -28
>> [  286.945233] LustreError:
>> 5790:0:(llog_lvfs.c:116:llog_lvfs_write_blob()) error writing log
>> record: rc -28
>> [  286.945448] LustreError: 5763:0:(mgc_request.c:1089:mgc_copy_llog())
>> Failed to copy remote log umt3-OST0019 (-28)
>>
>> All help appreciated.
>>
>> bob
>> _______________________________________________
>> Lustre-discuss mailing list
>> Lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>
>
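
Colin's test above (same errors recurring after e2fsck versus different ones each time) can be checked by tallying which block groups the ldiskfs_mb_check_ondisk_bitmap messages name. The following is a minimal sketch, not part of Lustre or e2fsprogs. The function names and the sample text are made up for illustration, and the pattern assumes the message wording shown in this thread, including the missing space before "corrupted".

```python
import re
from collections import Counter

# Pattern for the ldiskfs bitmap complaint quoted in this thread.
# \s* before "corrupted" tolerates the missing space in the kernel message.
BITMAP_RE = re.compile(
    r"LDISKFS-fs error \(device (?P<dev>\w+)\):\s*"
    r"ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group\s+"
    r"(?P<group>\d+)\s*corrupted:\s*(?P<bitmap>\d+) blocks free in bitmap,\s*"
    r"(?P<gd>\d+) - in gd"
)


def bitmap_errors(log_text):
    """Return (device, group, free_in_bitmap, free_in_gd) for each match."""
    # Collapse all whitespace so messages wrapped across lines still match.
    flat = " ".join(log_text.split())
    return [
        (m["dev"], int(m["group"]), int(m["bitmap"]), int(m["gd"]))
        for m in BITMAP_RE.finditer(flat)
    ]


def count_by_group(log_text):
    """Count how often each (device, group) pair is reported."""
    return Counter((dev, group) for dev, group, _, _ in bitmap_errors(log_text))


# Sample built from the messages in this thread (quote markers removed).
SAMPLE = """
[2102640.735927] LDISKFS-fs error (device sdk): ldiskfs_mb_check_ondisk_bitmap: on-disk
bitmap for group 639corrupted: 440 blocks free in bitmap, 439 - in gd
[2440763.879143] LDISKFS-fs error (device sdk):
ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 35406corrupted:
1318 blocks free in bitmap, 1317 - in gd
[  286.940872] LDISKFS-fs error (device sdk):
ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 35406corrupted:
1318 blocks free in bitmap, 1317 - in gd
"""

if __name__ == "__main__":
    for (dev, group), n in count_by_group(SAMPLE).most_common():
        print(f"{dev} group {group}: reported {n} time(s)")
```

Run against the full /var/log/messages, a group that keeps reappearing after an e2fsck pass would fit the "same error, not caught" case, while a changing set of groups would point more toward hardware. Separately, the -28 in the LustreError lines is -ENOSPC (no space left on device) on Linux.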
