[Lustre-discuss] High Load and high system CPU for mds
huangql
huangql at ihep.ac.cn
Sun Feb 28 18:31:01 PST 2010
Hi,
We have a problem: the MDS shows a high load average, and system CPU usage goes up to 60% when a chown command is run on a client. Strangely, once the load and system CPU usage became high, they did not drop back to normal levels. We could not even do anything on the clients or the OSS. The output of the top command is as follows:
[root@mainmds ~]# top
top - 10:19:02 up 1:03, 3 users, load average: 28.73, 27.10, 23.88
Tasks: 515 total, 44 running, 471 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.0%us, 84.1%sy, 0.0%ni, 15.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 0.0%us,100.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 0.0%us, 72.5%sy, 0.0%ni, 27.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 0.0%us, 83.5%sy, 0.0%ni, 16.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 0.0%us, 78.4%sy, 0.0%ni, 21.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 0.0%us, 82.9%sy, 0.0%ni, 17.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 0.0%us, 69.2%sy, 0.0%ni, 30.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 : 0.0%us, 79.6%sy, 0.0%ni, 20.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu8 : 0.0%us, 77.2%sy, 0.0%ni, 22.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu9 : 0.0%us, 58.9%sy, 0.0%ni, 41.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu10 : 0.0%us, 84.4%sy, 0.0%ni, 15.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu11 : 0.0%us, 97.6%sy, 0.0%ni, 2.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu12 : 0.0%us, 81.4%sy, 0.0%ni, 18.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu13 : 0.0%us, 85.0%sy, 0.0%ni, 15.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu14 : 0.0%us, 88.0%sy, 0.0%ni, 12.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu15 : 0.0%us, 36.3%sy, 0.0%ni, 63.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 24682716k total, 2985412k used, 21697304k free, 268360k buffers
Swap: 24579440k total, 0k used, 24579440k free, 368904k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5449 root 16 0 0 0 0 R 100.2 0.0 52:46.12 ptlrpcd
5434 root 16 0 0 0 0 R 89.0 0.0 34:15.77 socknal_sd07
5432 root 16 0 0 0 0 R 88.3 0.0 32:43.12 socknal_sd05
5430 root 16 0 0 0 0 R 79.1 0.0 30:37.78 socknal_sd03
5436 root 16 0 0 0 0 R 61.2 0.0 29:08.47 socknal_sd09
5440 root 16 0 0 0 0 S 59.5 0.0 33:31.32 socknal_sd13
5433 root 16 0 0 0 0 R 49.0 0.0 23:20.61 socknal_sd06
5431 root 15 0 0 0 0 R 45.0 0.0 26:04.43 socknal_sd04
5427 root 15 0 0 0 0 S 44.7 0.0 23:31.11 socknal_sd00
5435 root 15 0 0 0 0 S 44.3 0.0 24:50.30 socknal_sd08
5439 root 15 0 0 0 0 R 43.7 0.0 24:23.79 socknal_sd12
5437 root 15 0 0 0 0 R 39.7 0.0 27:11.58 socknal_sd10
5438 root 16 0 0 0 0 S 37.4 0.0 40:50.69 socknal_sd11
5441 root 15 0 0 0 0 S 35.4 0.0 26:35.59 socknal_sd14
According to the top output, the ptlrpcd process is using 100% CPU, which is not normal for this system; it looks as if ptlrpcd has become stuck. For now we have to reboot the MDS to resolve the problem. We do not understand this behavior. Has anyone seen this problem, or does anyone have an idea about it? I would appreciate any help.
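To put a number on how the CPU time is split between ptlrpcd and the socknal_sd threads, the per-thread rows from top can be totaled by thread family. This is only a rough sketch in Python, not a Lustre tool: the function name cpu_by_thread_family is hypothetical, the sample rows are copied from the top output above, and it assumes the default top column order (%CPU in the 9th column, COMMAND in the 12th).

```python
import re
from collections import defaultdict

# A few rows copied from the top output above.
# Assumed column order: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
SAMPLE = """\
5449 root 16 0 0 0 0 R 100.2 0.0 52:46.12 ptlrpcd
5434 root 16 0 0 0 0 R 89.0 0.0 34:15.77 socknal_sd07
5432 root 16 0 0 0 0 R 88.3 0.0 32:43.12 socknal_sd05
"""


def cpu_by_thread_family(top_rows):
    """Sum %CPU per thread family (hypothetical helper).

    A "family" is the command name with any trailing digits removed,
    so socknal_sd05 and socknal_sd07 are both counted as socknal_sd.
    """
    totals = defaultdict(float)
    for line in top_rows.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a process row.
        if len(fields) < 12 or not fields[0].isdigit():
            continue
        cpu = float(fields[8])                      # %CPU column
        family = re.sub(r"\d+$", "", fields[11])    # strip thread index
        totals[family] += cpu
    return dict(totals)


if __name__ == "__main__":
    for family, cpu in sorted(cpu_by_thread_family(SAMPLE).items()):
        print(f"{family}: {cpu:.1f}% CPU")
```

Feeding it the process rows saved from top in batch mode (top -b -n 1) would give one total per thread family, which makes it easier to compare snapshots taken before and after the chown starts.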
In addition, we use Lustre 1.8.1.1 on the MDS and OSS, and Lustre 1.6.5 on the clients.
Thanks in advance.
Cheers
Qiulan Huang
--------------------------------------------------------------
Computing Center IHEP Office: Computing Center,123
19B Yuquan Road Tel: (+86) 10 88236012-607
P.O. Box 918-7 Fax: (+86) 10 8823 6839
Beijing 100049,China Email: huangql at ihep.ac.cn
--------------------------------------------------------------
2010-03-01
huangql