[Lustre-discuss] Frequent OSS Crashes with heavy load

Wang lu wanglu at ihep.ac.cn
Mon Nov 10 08:18:30 PST 2008


I am also unclear about the top result:
top - 00:16:19 up 1 day,  3:58,  1 user,  load average: 22.71, 23.27, 23.74
Tasks: 851 total,   2 running, 849 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0% us,  7.0% sy,  0.0% ni, 86.7% id,  0.2% wa,  0.2% hi,  5.9% si
Mem:   8307364k total,   894940k used,  7412424k free,   240912k buffers
Swap: 16386292k total,        0k used, 16386292k free,    78108k cached


The CPU and memory are both mostly idle, yet the load average is quite high. Is it possible for Lustre to cache more data?
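One likely explanation (a sketch, not a diagnosis of this particular OSS): on Linux the load average counts not only runnable tasks but also tasks in uninterruptible sleep (D state), so OST service threads blocked on disk I/O can drive the load up even while the CPU sits idle. A quick way to check on the OSS:

```shell
# Count threads in uninterruptible sleep (D state); these contribute
# to the load average even though they use no CPU. A large count here
# with an idle CPU points at I/O wait, not a CPU bottleneck.
ps -eo state,comm | awk '$1 == "D"' | wc -l
```

If the count is consistently close to the load average, the load is coming from threads waiting on the disks rather than from computation.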

Brian J. Murrell wrote:

> On Mon, 2008-11-10 at 15:58 +0000, Wang lu wrote:
>> Thanks, but I am still unclear about: 
>> 
>> 1. How do I limit the number of OST threads after I find an optimum number?
> 
> It's a module option to the oss module.  It should be documented in the
> manual.
> 
>> 2. What is the meaning of /proc/sys/lnet/peers and /proc/sys/lnet/nis?
> 
> The meaning of many of the variables in /proc are also documented in the
> manual.  If you find any that are not, you can file a ticket in our bz
> requesting they be added.
> 
>> For example
>> [root at boss01 ~]# cat /proc/sys/lnet/peers
>> nid                      refs state   max   rtr   min    tx   min queue
>> 192.168.52.39 at tcp           6  ~rtr     8     8     8     3   -19 1458536
>> 
>> [root at boss01 ~]# cat /proc/sys/lnet/nis
>> nid                      refs peer   max    tx   min
>> 0 at lo                        2    0     0     0     0
>> 192.168.50.33 at tcp         137    8   256   256  -424
> 
> I don't know the details of either of these off-hand.  Probably one of
> our LNET experts might be able to provide more information.
> 
> b.
> 
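For reference, the thread-count option Brian mentions is set as a module parameter at load time. A minimal sketch, assuming the `oss_num_threads` option on the `ost` module as described in the Lustre manual (verify the option name against your Lustre version):

```
# /etc/modprobe.conf on the OSS (sketch; value 256 is an example,
# not a recommendation -- tune it from your own measurements):
options ost oss_num_threads=256
```

The setting takes effect the next time the module is loaded, so it requires unmounting the OSTs and reloading the Lustre modules (or a reboot).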


