[Lustre-discuss] OST error

Colin Faber cfaber at gmail.com
Fri Dec 3 13:48:59 PST 2010


Hi Bob,

Good to hear you've identified and resolved the issue. Sorry to hear 
you'll have to restore from backup though.

-cf


On 12/03/2010 02:41 PM, Bob Ball wrote:
> Just to close out this thread cleanly: the mptctl driver was out of date.
> We also updated megaraid_sas and the PERC6 firmware.  e2fsck then found
> some block bitmap differences (fixed), but the OST mounted cleanly and the
> errors stopped.
>
> Unfortunately, there are now corrupted files in the system that remain
> corrupted, and we'll probably never be able to come up with a complete
> list of them.
>
> bob
>
>
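
On building a list of the corrupted files: a minimal sketch, assuming
known-good MD5 checksums exist for at least some files (for example from a
backup or from users' own records; that source is an assumption, not
something established in this thread). The manifest layout and the names
md5_of and find_mismatches are made up for illustration.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(manifest):
    """Yield (path, reason) for entries that are missing or fail the check.

    `manifest` maps a file path to its expected hex MD5 (hypothetical
    input; where trustworthy checksums come from is site-specific).
    """
    for path, expected in manifest.items():
        if not Path(path).is_file():
            yield path, "missing"
        elif md5_of(path) != expected.lower():
            yield path, "checksum mismatch"
```

This can only flag files that have a recorded checksum, so it narrows the
list rather than completing it.
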
> On 12/2/2010 4:35 PM, Bob Ball wrote:
>> It is a Dell PERC6 RAID array.  OMSA monitoring is enabled and is not
>> throwing errors.  Hmmmm, mptctl is old though, so maybe that is a
>> contributing factor.  I guess I need to update that.  Shoot,
>> megaraid_sas is also not up to date.  dkms....
>>
>> OK, guess I need some driver updates.
>>
>> Later.
>> bob
>>
>> On 12/2/2010 4:05 PM, Colin Faber wrote:
>>> Hi Bob,
>>>
>>> If you're seeing the same errors on the same disk after an e2fsck run,
>>> and e2fsck isn't catching them, you may be hitting an edge case that
>>> e2fsck doesn't handle properly.  However, if you're seeing different
>>> errors and e2fsck did catch them before, chances are you're looking at
>>> a hardware failure somewhere.
>>>
>>> If this is a single disk and you have SMART monitoring enabled, check
>>> your error counters.  If it's a RAID device, verify the error counters
>>> on that.
>>>
>>> -cf
>>>
>>>
>>> On 12/02/2010 02:00 PM, Bob Ball wrote:
>>>> We were getting errors thrown by an OST.  /var/log/messages contained a
>>>> lot of these:
>>>> 2010-11-28T17:05:34-05:00 umfs06.aglt2.org kernel: [2102640.735927]
>>>> LDISKFS-fs error (device sdk): ldiskfs_mb_check_ondisk_bitmap: on-disk
>>>> bitmap for group 639corrupted: 440 blocks free in bitmap, 439 - in gd
>>>>
>>>> So, I turned off (most) access to the disk via lctl (we have a LOT of
>>>> client machines, some were missed) and got problems.  Had to use the
>>>> alternate superblock to e2fsck the disk.  When back online, I still saw
>>>> similar messages.  Updated to e2fsprogs 1.41.12 as suggested elsewhere.
>>>> Repeated e2fsck.
>>>>
>>>> Still seeing these.  Users report some files corrupted, coming up with
>>>> bad md5sum....  Any other thoughts on what to do about this problem?
>>>>
>>>> [2440763.879143] LDISKFS-fs error (device sdk):
>>>> ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 35406corrupted:
>>>> 1318 blocks free in bitmap, 1317 - in gd
>>>> [2440763.879796]
>>>> [2440763.882724] LustreError:
>>>> 1651027:0:(fsfilt-ldiskfs.c:1333:fsfilt_ldiskfs_write_record()) can't
>>>> read/create block: -28
>>>> [2440763.882736] LustreError:
>>>> 1651027:0:(llog_lvfs.c:116:llog_lvfs_write_blob()) error writing log
>>>> record: rc -28
>>>> [2440763.882789] LustreError:
>>>> 1651002:0:(mgc_request.c:1089:mgc_copy_llog()) Failed to copy remote log
>>>> umt3-OST0019 (-28)
>>>>
>>>> Rebooted to make system clean as a whole, and found the same kind of
>>>> thing repeating.
>>>> [  285.834864] LDISKFS-fs (sdk): warning: mounting fs with errors,
>>>> running e2fsck is recommended
>>>> [  285.852559] LDISKFS-fs (sdk): mounted filesystem with ordered data
>>>> mode
>>>> [  286.079065] LDISKFS-fs (sdk): warning: mounting fs with errors,
>>>> running e2fsck is recommended
>>>> [  286.096316] LDISKFS-fs (sdk): mounted filesystem with ordered data
>>>> mode
>>>> [  286.940872] LDISKFS-fs error (device sdk):
>>>> ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 35406corrupted:
>>>> 1318 blocks free in bitmap, 1317 - in gd
>>>> [  286.941693]
>>>> [  286.945224] LustreError:
>>>> 5790:0:(fsfilt-ldiskfs.c:1333:fsfilt_ldiskfs_write_record()) can't
>>>> read/create block: -28
>>>> [  286.945233] LustreError:
>>>> 5790:0:(llog_lvfs.c:116:llog_lvfs_write_blob()) error writing log
>>>> record: rc -28
>>>> [  286.945448] LustreError: 5763:0:(mgc_request.c:1089:mgc_copy_llog())
>>>> Failed to copy remote log umt3-OST0019 (-28)
>>>>
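
A note on the -28 in the LustreError lines above: Linux kernel code
conventionally returns negative errno values, and errno 28 is ENOSPC.
The sketch below decodes such a code; describe_rc is a hypothetical helper
name. Whether the OST was really out of space, or block allocation failed
because of the bitmap inconsistency, is not settled in this thread.

```python
import errno
import os


def describe_rc(rc):
    """Map a kernel-style return code (e.g. -28) to (errno name, message)."""
    code = abs(rc)
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)
```

Calling describe_rc(-28) gives the symbolic name ENOSPC along with the
platform's message text for that error.
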
>>>> All help appreciated.
>>>>
>>>> bob
>>>> _______________________________________________
>>>> Lustre-discuss mailing list
>>>> Lustre-discuss at lists.lustre.org
>>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
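
The ldiskfs_mb_check_ondisk_bitmap messages quoted above have a regular
shape, so the affected block groups can be tallied from a saved log. This
is a rough sketch: the pattern is derived only from the lines quoted in
this thread (including the missing space before "corrupted"), other kernel
versions may print the message differently, and the function names are
made up for illustration.

```python
import re
from collections import Counter

# Pattern based solely on the messages quoted in this thread.
BITMAP_ERROR = re.compile(
    r"LDISKFS-fs error \(device (?P<device>\w+)\): "
    r"ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group "
    r"(?P<group>\d+) ?corrupted: (?P<bitmap_free>\d+) blocks free in bitmap, "
    r"(?P<gd_free>\d+) - in gd"
)


def parse_bitmap_errors(log_text):
    """Return one dict per bitmap/group-descriptor mismatch in log_text."""
    # Collapse all whitespace so messages wrapped across lines still match.
    flat = " ".join(log_text.split())
    return [
        {
            "device": m["device"],
            "group": int(m["group"]),
            "bitmap_free": int(m["bitmap_free"]),
            "gd_free": int(m["gd_free"]),
        }
        for m in BITMAP_ERROR.finditer(flat)
    ]


def count_by_group(errors):
    """Count how often each (device, group) pair was reported."""
    return Counter((e["device"], e["group"]) for e in errors)
```

A group that keeps reappearing after e2fsck (as group 35406 does in the
log above) is the kind of repeat this tally makes easy to spot.
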


