[Lustre-discuss] ll_ost thread soft lockup

Tae Young Hong catchrye at gmail.com
Tue Mar 20 07:42:49 PDT 2012


Thank you for your information.
Today I tested our OSS after reading bugzilla 24264. After patching the kernel (http://review.whamcloud.com/#change,1672), I rebuilt the md device in question with one new disk added (we had only 9 disks for RAID6 8+2), reran e2fsck -fn, and finally tried to mount it, but I still saw the ll_ost soft lockup. The call trace messages are the same as before, so I don't think ours is the case you described.

Anyway, yesterday I tried the simplest method, below, to see whether ldiskfs alone is working properly.

mount -t ldiskfs -o ro,extents,mballoc /dev/md17 /mnt/kkk
find /mnt/kkk -type f | while read f; do echo "$f" >&2; cat "$f" > /dev/null; done

I got the following syslog messages while running this find/cat loop; however, the command finished without any other kernel error or soft lockup.

Mar 19 17:57:51 oss19 kernel: LDISKFS-fs error (device md17): htree_dirblock_to_tree: bad entry in directory #341259: rec_len is smaller than minimal - offset=806912, inode=0, rec_len=0, name_len=0
Mar 19 18:31:09 oss19 kernel: LDISKFS-fs error (device md17): htree_dirblock_to_tree: bad entry in directory #341382: rec_len is smaller than minimal - offset=978944, inode=0, rec_len=0, name_len=0
Mar 19 18:31:11 oss19 kernel: LDISKFS-fs error (device md17): htree_dirblock_to_tree: bad entry in directory #341382: rec_len is smaller than minimal - offset=282624, inode=0, rec_len=0, name_len=0
Mar 19 18:31:11 oss19 kernel: LDISKFS-fs error (device md17): htree_dirblock_to_tree: bad entry in directory #341382: rec_len is smaller than minimal - offset=290816, inode=0, rec_len=0, name_len=0
Mar 19 19:01:15 oss19 kernel: LDISKFS-fs error (device md17): htree_dirblock_to_tree: bad entry in directory #341379: rec_len is smaller than minimal - offset=528384, inode=0, rec_len=0, name_len=0
Mar 19 19:18:14 oss19 kernel: LDISKFS-fs error (device md17): htree_dirblock_to_tree: bad entry in directory #341258: rec_len is smaller than minimal - offset=1196032, inode=0, rec_len=0, name_len=0
Mar 19 19:18:14 oss19 kernel: LDISKFS-fs error (device md17): htree_dirblock_to_tree: bad entry in directory #341258: rec_len is smaller than minimal - offset=1187840, inode=0, rec_len=0, name_len=0
...

awk '{print $15}' ldiskfs_syslog_error.20120320 | sort | uniq -c
      4 #341257:
     12 #341258:
      4 #341259:
      4 #341379:
     12 #341380:
      4 #341381:
     12 #341382:
     12 #341383:
      4 #341384:
      4 #341507:
      4 #341510:

Data blocks in 11 directories seem to be corrupted. I don't know what more I can do.
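One possible next step (a sketch on my part, not something I have run on this array): extract the flagged inode numbers from the saved syslog excerpt and map them back to directory pathnames with debugfs, so the damaged directories can be located before deciding whether to let e2fsck repair them.

```shell
# Sketch, assuming the same log format as above. Write two sample lines
# so the pipeline can be tried standalone (the real file already exists
# on the OSS as ldiskfs_syslog_error.20120320):
cat > ldiskfs_syslog_error.20120320 <<'EOF'
Mar 19 17:57:51 oss19 kernel: LDISKFS-fs error (device md17): htree_dirblock_to_tree: bad entry in directory #341259: rec_len is smaller than minimal - offset=806912, inode=0, rec_len=0, name_len=0
Mar 19 18:31:09 oss19 kernel: LDISKFS-fs error (device md17): htree_dirblock_to_tree: bad entry in directory #341382: rec_len is smaller than minimal - offset=978944, inode=0, rec_len=0, name_len=0
EOF

# Pull out the unique "#NNNNNN" directory inode numbers:
awk -F'directory #' '{ split($2, a, ":"); print a[1] }' \
    ldiskfs_syslog_error.20120320 | sort -un

# debugfs can then map those inodes to pathnames without writing to the
# device (-c opens it read-only in "catastrophic" mode), e.g.:
#   debugfs -c -R 'ncheck 341259 341382' /dev/md17
```

The debugfs line is illustrative only; the inode numbers shown are the two from the sample lines.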

regards,
Taeyoung Hong


On Mar 19, 2012, at 11:27 PM, Robin Humble wrote:

> On Mon, Mar 19, 2012 at 07:28:22AM -0600, Kevin Van Maren wrote:
>> You are running 1.8.5, which does not have the fix for the known MD raid5/6 rebuild corruption bug.  That fix was released in the Oracle Lustre 1.8.7 kernel patches.  Unless you already applied that patch, you might want to run a check of your raid arrays and consider an upgrade (at least patch your kernel with that fix).
>> 
>> md-avoid-corrupted-ldiskfs-after-rebuild.patch in the 2.6-rhel5.series (note that this bug is NOT specific to rhel5).  This fix does NOT appear to have been picked up by whamcloud.
> 
> as you say, the md rebuild bug is in all kernels < 2.6.32
>  http://marc.info/?l=linux-raid&m=130192650924540&w=2
> 
> the Whamcloud fix is LU-824 which landed in git a tad after 1.8.7-wc1.
> 
> I also asked RedHat nicely, and they added the same patch to RHEL5.8
> kernels, which IMHO is the correct place for a fundamental md fix.
> 
> so once Lustre supports RHEL5.8 servers, then the patch in Lustre
> isn't needed any more.
> 
> cheers,
> robin
> --
> Dr Robin Humble, HPC Systems Analyst, NCI National Facility

