[Lustre-discuss] fsck.ext4 for device ... exited with signal 11.

Andreas Dilger andreas.dilger at oracle.com
Thu Dec 2 09:52:27 PST 2010


On 2010-12-02, at 09:24, Craig Prescott wrote:
> But the fsck seems to be going extremely slowly - it ran all 
> night, and is still running.  This is very abnormal, as fscks on the 
> OSTs in this filesystem usually take on the order of 30 minutes.  I'd like 
> to understand better what fsck is doing at this time.
> 
> fsck seems to be spending a lot of time in Pass1D, cloning 
> multiply-claimed blocks.  But there has been no output from fsck for many 
> hours now,

Passes 1b-1d have O(n^2) complexity and require a second pass through all of the metadata, so they can take a long time if there is a large number of duplicate blocks.
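
As a rough illustration of why quadratic behaviour hurts here, the toy Python sketch below counts the steps of a loop that walks all n records once for each of n records. It is not e2fsck's actual algorithm; `quadratic_steps` is a hypothetical name used only to show how the work grows.

```python
def quadratic_steps(n):
    """Count iterations of a toy nested scan: n records, each checked against all n."""
    steps = 0
    for _ in range(n):
        for _ in range(n):
            steps += 1
    return steps


# Ten times as many records costs one hundred times as many steps.
for n in (10, 100, 1000):
    print(n, quadratic_steps(n))
```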

> 1) fsck.ext4 is using 100% of a 2.2GHz core.  The progress of the fsck 
> seems to be CPU bound for a long time (many hours).  We're not used to 
> seeing this.

If only a limited number of files are affected, you can restart e2fsck with the option "-E shared=delete", which will cause the inodes with shared blocks to be deleted.  That data will of course be lost, but e2fsck will be able to complete much more quickly.
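
For reference, a small Python sketch that assembles such a command line so it can be reviewed before anyone runs it. The "-f" and "-y" flags and the device path are carried over from the fsck invocation quoted further down in this thread; `build_e2fsck_cmd` is a hypothetical helper, and nothing here executes e2fsck.

```python
import shlex


def build_e2fsck_cmd(device, shared_mode="delete"):
    """Return the argv list for an e2fsck run that passes -E shared=<mode>."""
    return ["e2fsck", "-f", "-y", "-E", f"shared={shared_mode}", device]


# Device path taken from the fsck command quoted in this thread.
cmd = build_e2fsck_cmd("/dev/arc1-lv2/OST0003")

# Print only: if actually run, inodes with shared blocks would be deleted.
print(shlex.join(cmd))
```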

> 4) Using pstack, I can see fsck.ext4 is in ext2fs_block_iterate2() - it 
> looks like there is a lot of time being spent in ext2fs_new_block().

This is a major contributor to the slowdown - the code in libext2fs for allocating blocks is quite slow, and does not necessarily make very good allocations.

> I'd like to understand what fsck is doing that takes so much CPU.  The 
> OST was pretty full (~90%)... Is it computationally expensive to clone 
> multiply-claimed blocks on a filesystem this full?
> 
> I'm also wondering if I should let this continue or not.
> 
> I appended a bit of the strace output.  From the offset arg to the 
> lseek() calls, it looks like data is being copied from one side of the 
> spindles to the other(?).
> 
> Thanks,
> Craig Prescott
> UF HPC Center
> 
> 
> Sample strace output:
> 
> ...
> read(3, "\313R\354\222\205%\16\227\221,\226\35\317\22\331,0\312\262\330\252\314wI\2\345^\305\222d\273$"..., 4096) = 4096
> lseek(3, 36574076928, SEEK_SET)         = 36574076928
> write(3, "\35z\354 \252\370\24\317\323\236VL]NF;\335\303\16w&\n\312\236F\0\3664RK\366\304"..., 4096) = 4096
> lseek(3, 7424726908928, SEEK_SET)       = 7424726908928
> ...
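
A minimal Python sketch for reading a trace like this one: it pulls the lseek() offsets out of strace text and converts them to block numbers, which makes it easier to see how far apart the reads and writes are. The 4096-byte block size is an assumption taken from the read()/write() sizes in the trace, and `lseek_offsets` is a hypothetical helper name.

```python
import re

BLOCK_SIZE = 4096  # assumption: matches the 4096-byte read()/write() calls in the trace

# Matches e.g. "lseek(3, 36574076928, SEEK_SET)" and captures the byte offset.
LSEEK_RE = re.compile(r"lseek\(\d+,\s*(\d+),\s*SEEK_SET\)")


def lseek_offsets(strace_text):
    """Return the byte offsets of all SEEK_SET lseek() calls found in the text."""
    return [int(m.group(1)) for m in LSEEK_RE.finditer(strace_text)]


sample = """
lseek(3, 36574076928, SEEK_SET)         = 36574076928
lseek(3, 7424726908928, SEEK_SET)       = 7424726908928
"""

offsets = lseek_offsets(sample)
blocks = [off // BLOCK_SIZE for off in offsets]
for off, blk in zip(offsets, blocks):
    print(f"offset {off} -> block {blk}")
```

For the two offsets in the sample this gives blocks 8929218 and 1812677468, so the two positions are roughly 6.7 TiB apart on the device.
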
> 
> 
> 
> 
> 
> Colin Faber wrote:
>> Hi,
>> 
>> Try upgrading to the latest e2fsprogs package, 1.41.12.2.
>> 
>> -cf
>> 
>> 
>> On 12/01/2010 03:20 PM, Craig Prescott wrote:
>>> I forgot to add - our affected OSS is running Lustre 1.8.4, and
>>> e2fsprogs-1.41.10.sun2-0redhat.  `uname -r` gives
>>> 
>>> 2.6.18-194.3.1.0.1.el5_lustre.1.8.4
>>> 
>>> Thanks,
>>> Craig Prescott
>>> UF HPC Center
>>> 
>>> 
>>> Craig Prescott wrote:
>>>> Hi,
>>>> 
>>>> We are trying to fsck an OST that was not unmounted
>>>> cleanly.  But fsck is dying with this error after making some 
>>>> corrections:
>>>> 
>>>> [root at XXXXXX tmp]# fsck -f -y /dev/arc1-lv2/OST0003
>>>> ...
>>>> High 16 bits of extent/index block set
>>>> CLEARED.
>>>> Inode 306602015 has an invalid extent node (blk 512, lblk 641536)
>>>> Clear? yes
>>>> 
>>>> Warning... fsck.ext4 for device /dev/arc1-lv2/OST0003 exited with 
>>>> signal 11.
>>>> 
>>>> It is repeatable.
>>>> 
>>>> So we are stuck.  We need to fsck our OST, but fsck is dying.  Can
>>>> anyone give us some advice on how to proceed?
>>>> 
>>>> Thanks,
>>>> Craig Prescott
>>>> UF HPC Center
>>>> _______________________________________________
>>>> Lustre-discuss mailing list
>>>> Lustre-discuss at lists.lustre.org
>>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
> 


Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.



