[Lustre-discuss] Hung software raid in 2.6.18-92.1.26 + lustre 1.6.7.2
Tim Burgess
ozburgess+lustre at gmail.com
Mon Jun 22 22:23:00 PDT 2009
Hi All,
I was wondering if anyone might be able to shed some light on more
problems we've been seeing since our 1.6.7.2 upgrade over the
weekend...
We've upgraded all the OSSes and the MDS to the SDLC
2.6.18-92.1.26.el5_lustre.1.6.7.2smp, and now it appears that
something is causing the software raid layer on the OSSes to freeze
completely.
Even:
[root at oss006 md]# dd if=/dev/md2 of=/dev/null bs=1024k count=1
hangs forever.
/dev/md2 is the OST volume (one OST per OSS), but we see the same
effect on /dev/md0 (and presumably /dev/md1).
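In case it helps with diagnosis, here's roughly how we can pull kernel
stack traces for the blocked tasks (a sketch; it needs root, and sysrq
support must be enabled in the kernel):

```shell
# Ask the kernel to log a stack trace for every task in uninterruptible
# (D-state) sleep; the traces land in the kernel log (dmesg) and should
# show exactly where in the md/raid code the stuck threads are sleeping.
if [ -w /proc/sysrq-trigger ]; then
    echo w > /proc/sysrq-trigger   # SysRq 'w': dump blocked tasks
    dmesg | tail -n 200
else
    echo "need root (and sysrq enabled) to trigger a blocked-task dump"
fi
```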
This of course causes all the Lustre I/O threads to go into D state one
by one and never return:
[root at oss006 ~]# ps -elf | grep D
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
1 D root 250 27 0 75 0 - 0 get_ac Jun21 ?
00:00:02 [pdflush]
1 D root 251 27 0 70 -5 - 0 get_ac Jun21 ?
00:00:01 [kswapd0]
1 D root 3232 1 0 75 0 - 0 log_wa Jun21 ?
00:00:00 [obd_zombid]
1 D root 3288 27 0 70 -5 - 0 sync_b Jun21 ?
00:00:37 [kjournald]
1 D root 3303 1 0 75 0 - 0 get_ac Jun21 ?
00:00:00 [ldlm_cn_00]
1 D root 3305 1 0 75 0 - 0 - Jun21 ?
00:00:00 [ldlm_cn_01]
1 D root 3306 1 0 75 0 - 0 - Jun21 ?
00:00:00 [ldlm_cn_02]
1 D root 3307 1 0 75 0 - 0 get_ac Jun21 ?
00:00:00 [ldlm_cn_03]
1 D root 3308 1 0 75 0 - 0 - Jun21 ?
00:00:00 [ldlm_cn_04]
1 D root 3309 1 0 75 0 - 0 - Jun21 ?
00:00:00 [ldlm_cn_05]
1 D root 3310 1 0 75 0 - 0 - Jun21 ?
00:00:00 [ldlm_cn_06]
....
1 D root 3455 1 0 75 0 - 0 - Jun21 ?
00:00:00 [ll_evictor]
1 D root 5996 1 0 75 0 - 0 get_ac 12:16 ?
00:00:00 [ldlm_cn_08]
1 D root 5997 1 0 75 0 - 0 - 12:18 ?
00:00:00 [ldlm_cn_09]
1 D root 6020 1 0 75 0 - 0 get_ac 12:41 ?
00:00:00 [ldlm_cn_10]
4 D root 6107 1 0 77 0 - 16819 get_ac 12:53 ?
00:00:00 dd if /dev/md2 of /dev/null bs 4096k count 10 skip 10000
4 D root 6138 1 0 78 0 - 16818 get_ac 12:55 ?
00:00:01 dd if /dev/md0 of /dev/null bs 4096k count 10 skip 10000
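A quicker way to pick out just the D-state tasks and their wait
channels than grepping the full ps -elf output (a sketch; the wchan
field width is an arbitrary choice):

```shell
# List pid, state, kernel wait channel and command name for every task
# currently in uninterruptible sleep.  The wchan column is the kernel
# function the task is blocked in; the truncated "get_ac" above is most
# likely get_active_stripe in the md/raid5 code.
ps -eo pid,stat,wchan:30,comm | awk 'NR == 1 || $2 ~ /^D/'
```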
If it's relevant - we haven't _yet_ seen this on our newer OSSes,
which are 7+1 RAID5s. We are only seeing it on the older 5+1s for
now.
Any help would be greatly appreciated!
Thanks again,
Tim