[Lustre-discuss] High Load and high system CPU for mds

Sun Feb 28 18:31:01 PST 2010

Hi,

We have a problem: the MDS shows a high load average, and system CPU usage goes up to 60% when we run the chown command on a client. Strangely, once the load and system CPU get high, they never drop back to a normal level. We cannot even do anything on the clients or the OSS. The output of the top command is as follows:
[root@mainmds ~]# top
top - 10:19:02 up  1:03,  3 users,  load average: 28.73, 27.10, 23.88
Tasks: 515 total,  44 running, 471 sleeping,   0 stopped,   0 zombie
Cpu0  :  0.0%us, 84.1%sy,  0.0%ni, 15.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.0%us,100.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.0%us, 72.5%sy,  0.0%ni, 27.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.0%us, 83.5%sy,  0.0%ni, 16.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :  0.0%us, 78.4%sy,  0.0%ni, 21.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  :  0.0%us, 82.9%sy,  0.0%ni, 17.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :  0.0%us, 69.2%sy,  0.0%ni, 30.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  :  0.0%us, 79.6%sy,  0.0%ni, 20.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu8  :  0.0%us, 77.2%sy,  0.0%ni, 22.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu9  :  0.0%us, 58.9%sy,  0.0%ni, 41.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu10 :  0.0%us, 84.4%sy,  0.0%ni, 15.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu11 :  0.0%us, 97.6%sy,  0.0%ni,  2.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu12 :  0.0%us, 81.4%sy,  0.0%ni, 18.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu13 :  0.0%us, 85.0%sy,  0.0%ni, 15.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu14 :  0.0%us, 88.0%sy,  0.0%ni, 12.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu15 :  0.0%us, 36.3%sy,  0.0%ni, 63.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  24682716k total,  2985412k used, 21697304k free,   268360k buffers
Swap: 24579440k total,        0k used, 24579440k free,   368904k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                   
 5449 root      16   0     0    0    0 R 100.2  0.0  52:46.12 ptlrpcd                                                  
 5434 root      16   0     0    0    0 R 89.0  0.0  34:15.77 socknal_sd07                                              
 5432 root      16   0     0    0    0 R 88.3  0.0  32:43.12 socknal_sd05                                              
 5430 root      16   0     0    0    0 R 79.1  0.0  30:37.78 socknal_sd03                                              
 5436 root      16   0     0    0    0 R 61.2  0.0  29:08.47 socknal_sd09                                              
 5440 root      16   0     0    0    0 S 59.5  0.0  33:31.32 socknal_sd13                                              
 5433 root      16   0     0    0    0 R 49.0  0.0  23:20.61 socknal_sd06                                              
 5431 root      15   0     0    0    0 R 45.0  0.0  26:04.43 socknal_sd04                                              
 5427 root      15   0     0    0    0 S 44.7  0.0  23:31.11 socknal_sd00                                              
 5435 root      15   0     0    0    0 S 44.3  0.0  24:50.30 socknal_sd08                                              
 5439 root      15   0     0    0    0 R 43.7  0.0  24:23.79 socknal_sd12                                              
 5437 root      15   0     0    0    0 R 39.7  0.0  27:11.58 socknal_sd10                                              
 5438 root      16   0     0    0    0 S 37.4  0.0  40:50.69 socknal_sd11                                              
 5441 root      15   0     0    0    0 S 35.4  0.0  26:35.59 socknal_sd14      

According to the top output, the ptlrpcd process is using 100% CPU. This is not normal for this system; it looks as if ptlrpcd has become stuck. For now we have to reboot the MDS to clear the problem. We do not understand this behavior. Has anyone seen this problem, or does anyone have an idea about it? I would appreciate any help.
In addition, we run Lustre 1.8.1.1 on the MDS and OSS, and Lustre 1.6.5 on the clients.
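
In case it helps anyone compare their own MDS against ours, here is a minimal sketch in Python for summarizing top output like the paste above. It assumes the text has the same layout as our paste (per-CPU "CpuN :" lines and the default 12-column process table); the names parse_top, mean_system_cpu and busy_threads are made up for illustration and are not part of Lustre or top.

```python
import re

# Matches per-CPU lines such as "Cpu0  :  0.0%us, 84.1%sy, ..."
# and captures the CPU index and the system (%sy) percentage.
CPU_LINE = re.compile(r"^Cpu(\d+)\s*:\s*[\d.]+%us,\s*([\d.]+)%sy")


def parse_top(text):
    """Return ({cpu_index: system_pct}, [(command, cpu_pct), ...])."""
    sys_pct = {}
    procs = []
    for line in text.splitlines():
        line = line.strip()
        m = CPU_LINE.match(line)
        if m:
            sys_pct[int(m.group(1))] = float(m.group(2))
            continue
        fields = line.split()
        # Process rows: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
        if len(fields) >= 12 and fields[0].isdigit():
            procs.append((fields[11], float(fields[8])))
    return sys_pct, procs


def mean_system_cpu(sys_pct):
    """Average %sy over all CPUs found (0.0 if none were parsed)."""
    if not sys_pct:
        return 0.0
    return sum(sys_pct.values()) / len(sys_pct)


def busy_threads(procs, threshold=80.0):
    """Commands whose %CPU is at or above the (arbitrary) threshold."""
    return [name for name, cpu in procs if cpu >= threshold]
```

Called on a saved copy of the top output, parse_top gives the per-CPU system percentages and the process rows; busy_threads then lists the threads (ptlrpcd, socknal_sd*) that are at or above the chosen threshold.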

Thanks in advance.

Cheers
Qiulan Huang
--------------------------------------------------------------   
Computing Center IHEP         Office: Computing Center,123 
19B Yuquan Road                 Tel: (+86) 10 88236012-607
P.O. Box 918-7                    Fax: (+86) 10 8823 6839
Beijing 100049,China             Email: huangql at ihep.ac.cn 
--------------------------------------------------------------    
2010-03-01 

huangql 