[Lustre-discuss] Errors in output from sgpdd-survey (sgp_dd.c Cannot allocate memory)

Tue Dec 14 12:27:09 PST 2010

Jim Shankland wrote:
>> ... 
>> total_size 100663296K rsz 1024 crg   384 thr   768 write  388.20 MB/s   384 x   1.01 =  388.18 MB/s read  387.16 MB/s   384 x   1.01 =  388.18 MB/s
>> total_size 100663296K rsz 1024 crg   384 thr  1536 write 1 failed read 385.72 MB/s   384 x   1.01 =  388.18 MB/s
>> total_size 100663296K rsz 1024 crg   384 thr  3072 write 140 failed read 121 failed
>> total_size 100663296K rsz 1024 crg   384 thr  6144 ENOMEM
>>     
>
> You just don't have enough RAM to do these particular runs.
> If you look at the line ending in ENOMEM above: sgpdd-survey
> is proposing to launch 384 separate sgp_dd processes for each
> of 12 different devices, with each process launching 16
> threads (6144 / 384), and each thread allocating at least one 1
> MiB write buffer.  That adds up to 72 GiB of RAM for write
> buffers.  The ENOMEM line means that the sgpdd-survey script
> looked at the amount of physical RAM you have, and estimated it
> wasn't enough to do this run.
>   
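
The 72 GiB figure in the quoted estimate can be reproduced with a few lines of Python. This is only a sketch of that arithmetic: it assumes rsz is in KiB, that each thread holds exactly one record-sized buffer, and that crg counts processes per device (the quoted reading of the output). The function name is made up for illustration and is not part of sgpdd-survey.

```python
# Reproduces the arithmetic of the quoted estimate only.
# Assumptions: rsz is in KiB, one rsz-sized buffer per thread, and "crg"
# is a per-device process count that must be multiplied by the device count.
GIB = 1024 ** 3


def quoted_estimate_bytes(crg, thr, devices, rsz_kib):
    """Total write-buffer bytes under the quoted (per-device) reading."""
    threads_per_process = thr // crg   # 6144 / 384 = 16
    processes = crg * devices          # 384 * 12 = 4608
    return processes * threads_per_process * rsz_kib * 1024


estimate = quoted_estimate_bytes(crg=384, thr=6144, devices=12, rsz_kib=1024)
print(estimate / GIB)  # 72.0
```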

The ENOMEM at 6144 total threads is not the only problem; there are 
also the "write X failed" and "read X failed" results at the _lower_ 
thread counts.

From memory, the "crg" and "thr" numbers are already multiplied by 12 
(the number of devices being tested), so "thr" should reflect the total 
number of buffers required.  For this test, it looks like crg=32 per 
device (384 / 12) and SG_MAX_QUEUE is the default 16.  So the memory 
consumption _should not_ be an issue, but sgp_dd is still having 
problems allocating buffers.
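
As a rough check of that reading, the sketch below treats "thr" as the total thread count across all 12 devices and assumes one rsz-sized buffer per thread with rsz in KiB. Both are assumptions about how sgp_dd allocates, not something confirmed from its source, and the helper name is hypothetical.

```python
# Buffer memory per row if "thr" is already the total across all devices.
# Assumptions: rsz is in KiB and each thread needs exactly one rsz-sized buffer.
DEVICES = 12
RSZ_KIB = 1024
ROWS = [(384, 768), (384, 1536), (384, 3072), (384, 6144)]  # (crg, thr) from the output


def total_buffer_gib(thr, rsz_kib=RSZ_KIB):
    """Total buffer memory in GiB for thr threads with one buffer each."""
    return thr * rsz_kib / (1024 * 1024)


for crg, thr in ROWS:
    per_device_crg = crg // DEVICES    # 32 regions per device
    threads_per_region = thr // crg    # reaches 16 in the last row
    print(thr, per_device_crg, threads_per_region, total_buffer_gib(thr))
```

On these assumptions the rows that report failures (thr 1536 and 3072) would need only 1.5 GiB and 3 GiB of buffers, and even the ENOMEM row only 6 GiB.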

Again, I've seen this even when I clearly had free memory on the node, 
so I think there is something else at work here.

Kevin