[Lustre-discuss] ll_ost thread soft lockup
Tae Young Hong
catchrye at gmail.com
Tue Mar 20 07:42:49 PDT 2012
Thank you for the information.
Today I tested our OSS after reading bugzilla 24264. After patching the kernel (http://review.whamcloud.com/#change,1672), I rebuilt the md array in question with one new disk added (because we had only 9 disks for RAID6 8+2), reran e2fsck -fn, and finally tried to mount it, but I still saw the ll_ost soft lockup. The call trace messages are the same as before, so I think ours is not the case you described.
Anyway, yesterday I tried the simplest test below, to see whether ldiskfs works properly on its own.
mount -t ldiskfs -o ro,extents,mballoc /dev/md17 /mnt/kkk
find /mnt/kkk -type f | while read -r f; do echo "$f" >&2 ; cat "$f" > /dev/null ; done
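As an aside, a line-by-line `read` still breaks on filenames that contain newlines; a null-delimited variant of the same full-read scan (the /mnt/kkk mount point is the one from the commands above, and the optional argument is just for convenience):

```shell
#!/bin/bash
# Null-delimited full-read scan of an ldiskfs mount; survives
# spaces and newlines in file names. Reads every file to force
# I/O through the filesystem, discarding the data.
dir=${1:-/mnt/kkk}
find "$dir" -type f -print0 |
while IFS= read -r -d '' f; do
    echo "$f" >&2
    cat -- "$f" > /dev/null
done
```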
I got the following syslog messages while running this "find/cat" loop; however, the command finished without any other kernel error or soft lockup.
Mar 19 17:57:51 oss19 kernel: LDISKFS-fs error (device md17): htree_dirblock_to_tree: bad entry in directory #341259: rec_len is smaller than minimal - offset=806912, inode=0, rec_len=0, name_len=0
Mar 19 18:31:09 oss19 kernel: LDISKFS-fs error (device md17): htree_dirblock_to_tree: bad entry in directory #341382: rec_len is smaller than minimal - offset=978944, inode=0, rec_len=0, name_len=0
Mar 19 18:31:11 oss19 kernel: LDISKFS-fs error (device md17): htree_dirblock_to_tree: bad entry in directory #341382: rec_len is smaller than minimal - offset=282624, inode=0, rec_len=0, name_len=0
Mar 19 18:31:11 oss19 kernel: LDISKFS-fs error (device md17): htree_dirblock_to_tree: bad entry in directory #341382: rec_len is smaller than minimal - offset=290816, inode=0, rec_len=0, name_len=0
Mar 19 19:01:15 oss19 kernel: LDISKFS-fs error (device md17): htree_dirblock_to_tree: bad entry in directory #341379: rec_len is smaller than minimal - offset=528384, inode=0, rec_len=0, name_len=0
Mar 19 19:18:14 oss19 kernel: LDISKFS-fs error (device md17): htree_dirblock_to_tree: bad entry in directory #341258: rec_len is smaller than minimal - offset=1196032, inode=0, rec_len=0, name_len=0
Mar 19 19:18:14 oss19 kernel: LDISKFS-fs error (device md17): htree_dirblock_to_tree: bad entry in directory #341258: rec_len is smaller than minimal - offset=1187840, inode=0, rec_len=0, name_len=0
...
cat ldiskfs_syslog_error.20120320 | awk '{print $15}' | sort | uniq -c
4 #341257:
12 #341258:
4 #341259:
4 #341379:
12 #341380:
4 #341381:
12 #341382:
12 #341383:
4 #341384:
4 #341507:
4 #341510:
The data blocks of 11 directories seem to be corrupted. I don't know what more I can do.
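For reference, the kernel check that fires in the messages above (ldiskfs inherits the ext3/ext4 directory-entry validation) rejects any entry whose rec_len is below the minimum 12-byte on-disk entry size. All-zero fields (inode=0, rec_len=0, name_len=0) are what you would see if the block had been overwritten with zeros, which is consistent with the md rebuild corruption pattern. A rough sketch of that check (my own approximation, not the kernel source):

```python
EXT4_DIR_PAD = 4  # directory entries are padded to 4-byte boundaries

def dir_rec_len(name_len):
    # On-disk size of a directory entry: 8-byte header + name,
    # rounded up to a 4-byte boundary (as in ext3/ext4).
    return (8 + name_len + EXT4_DIR_PAD - 1) & ~(EXT4_DIR_PAD - 1)

def check_entry(rec_len, name_len):
    # Approximates the first tests in the kernel's directory-entry
    # sanity check; returns the error string, or None if the entry
    # looks sane.
    if rec_len < dir_rec_len(1):
        return "rec_len is smaller than minimal"
    if rec_len % 4 != 0:
        return "rec_len % 4 != 0"
    if rec_len < dir_rec_len(name_len):
        return "rec_len is too small for name_len"
    return None

# The case in the syslog above: a zero-filled entry.
print(check_entry(0, 0))  # → rec_len is smaller than minimal
```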
regards,
Taeyoung Hong
On Mar 19, 2012, at 11:27 PM, Robin Humble wrote:
> On Mon, Mar 19, 2012 at 07:28:22AM -0600, Kevin Van Maren wrote:
>> You are running 1.8.5, which does not have the fix for the known MD raid5/6 rebuild corruption bug. That fix was released in the Oracle Lustre 1.8.7 kernel patches. Unless you already applied that patch, you might want to run a check of your raid arrays and consider an upgrade (at least patch your kernel with that fix).
>>
>> md-avoid-corrupted-ldiskfs-after-rebuild.patch in the 2.6-rhel5.series (note that this bug is NOT specific to rhel5). This fix does NOT appear to have been picked up by whamcloud.
>
> as you say, the md rebuild bug is in all kernels < 2.6.32
> http://marc.info/?l=linux-raid&m=130192650924540&w=2
>
> the Whamcloud fix is LU-824 which landed in git a tad after 1.8.7-wc1.
>
> I also asked RedHat nicely, and they added the same patch to RHEL5.8
> kernels, which IMHO is the correct place for a fundamental md fix.
>
> so once Lustre supports RHEL5.8 servers, then the patch in Lustre
> isn't needed any more.
>
> cheers,
> robin
> --
> Dr Robin Humble, HPC Systems Analyst, NCI National Facility