[Lustre-discuss] OSS not healthy

Thu Mar 13 11:11:19 PDT 2008

On Mar 13, 2008  13:44 +0100, Brian J. Murrell wrote:
> On Thu, 2008-03-13 at 12:34 +0100, Frank Mietke wrote:
> > Mar 13 06:17:31 chic2e24 kernel: [3068633.701448] attempt to access beyond end of device
> > Mar 13 06:17:31 chic2e24 kernel: [3068633.701454] sda: rw=1, want=11287722456, limit=7796867072
> 
> This is pretty self-explanatory.  Something tried to access a block beyond
> the end of the disk.  Something has a misunderstanding of how big the disk is.
> Is it possible that the disk format process was misled about the disk
> size during initialization?

Unlikely.

> Andreas, does mkfs do any bounds checking to verify the sanity of the
> mkfs request?  I.e. does it make sure that when you specify a number
> of blocks for a filesystem, that many blocks are actually available?

Yes, mke2fs will zero out the last ~128kB of the device to overwrite any
MD RAID signatures, and also verify that the device is as big as requested.
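
To illustrate what such a size check amounts to, here is a minimal sketch
in Python. It is not the mke2fs code; the function names and the default
4 kB block size are assumptions made for the example.

```python
import os

def device_size_bytes(path):
    """Return the size of a block device or regular file by seeking to its end."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.lseek(fd, 0, os.SEEK_END)
    finally:
        os.close(fd)

def fits_on_device(path, block_count, block_size=4096):
    """True if block_count blocks of block_size bytes fit on the device.

    block_size=4096 is an assumed default, not something read from the device.
    """
    return block_count * block_size <= device_size_bytes(path)
```

A request for more blocks than the device holds would make fits_on_device()
return False, which is the point at which a format tool should refuse.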

Errors of this kind are usually the result of corruption internal to the
filesystem, where some garbage is interpreted as a block number beyond the
end of the device.
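
To see how far out of range the first reported request is, the quoted log
line can be picked apart as in the sketch below. It assumes that want and
limit are counted in 512-byte sectors and that bit 0 of rw marks a write;
the function name is made up for the example.

```python
import re

SECTOR_SIZE = 512  # assumption: want/limit are in 512-byte sectors

# Taken from the log excerpt quoted earlier in this message.
LINE = "sda: rw=1, want=11287722456, limit=7796867072"

def parse_beyond_eod(line):
    """Extract device, direction, and sector numbers from a bounds-error line."""
    m = re.search(r"(\w+): rw=(\d+), want=(\d+), limit=(\d+)", line)
    if m is None:
        return None
    want, limit = int(m.group(3)), int(m.group(4))
    return {
        "dev": m.group(1),
        "write": bool(int(m.group(2)) & 1),  # assumption: bit 0 set means write
        "want": want,
        "limit": limit,
        "overshoot_sectors": want - limit,
    }

info = parse_beyond_eod(LINE)
print(info["dev"], "write" if info["write"] else "read")
print("device size in bytes:", info["limit"] * SECTOR_SIZE)
print("sectors past the end:", info["overshoot_sectors"])
```

A request that lands billions of sectors past the end is much more
consistent with a garbage block number than with a size that is slightly off.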

> > Mar 13 06:17:31 chic2e24 kernel: [3068633.701555] attempt to access beyond end of device
> > Mar 13 06:17:31 chic2e24 kernel: [3068633.701558] sda: rw=1, want=25366292592, limit=7796867072
> > Mar 13 06:17:31 chic2e24 kernel: [3068633.701562] Buffer I/O error on device sda, logical block 3170786573
> > Mar 13 06:17:31 chic2e24 kernel: [3068633.701785] lost page write due to I/O error on sda
> > Mar 13 06:17:31 chic2e24 kernel: [3068633.702004] Aborting journal on device sda.
> 
> This is all just fallout error messages from the attempted access beyond
> the end of the device.

Time to unmount the filesystem and run a full e2fsck, i.e.
"e2fsck -fp /dev/sdaNNN".

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.