[Lustre-discuss] Errors in output from sgpdd-survey (sgp_dd.c Cannot allocate memory)

Mon Feb 21 15:52:04 PST 2011

On 14/12/10 20:27, Kevin Van Maren wrote:
> Jim Shankland wrote:
>>> ...
>>> total_size 100663296K rsz 1024 crg   384 thr   768 write  388.20 MB/s   384 x   1.01 =  388.18 MB/s read  387.16 MB/s   384 x   1.01 =  388.18 MB/s
>>> total_size 100663296K rsz 1024 crg   384 thr  1536 write 1 failed read 385.72 MB/s   384 x   1.01 =  388.18 MB/s
>>> total_size 100663296K rsz 1024 crg   384 thr  3072 write 140 failed read 121 failed
>>> total_size 100663296K rsz 1024 crg   384 thr  6144 ENOMEM
>>>
>>
>> You just don't have enough RAM to do these particular runs.
>> If you look at the line ending in ENOMEM above: sgpdd-survey
>> is proposing to launch 384 separate sgp_dd processes for each
>> of 12 different devices, with each process launching 16
>> threads (6144 / 384), and each thread allocating at least one
>> 1 MiB write buffer.  That adds up to 72 GiB of RAM for write
>> buffers.  The ENOMEM line means that the sgpdd-survey script
>> looked at the amount of physical RAM you have, and estimated it
>> wasn't enough to do this run.
>>
>
> It's not just the ENOMEM at 6144 total threads that is the problem, it
> is the "write X failed", etc, at the _lower_ thread counts.
>
> From memory, the "crg" and "thr" numbers are already multiplied by 12
> (the number of devices being tested), so "thr" should reflect the total
> number of buffers required.  For this test, it looks like crg=32 and
> SG_MAX_QUEUE is the default 16.  So the memory consumption _should not_
> be an issue, but sgp_dd is still having problems allocating buffers.
>
> Again, I've seen this even when I clearly had free memory on the node,
> so I think there is something else at work here.
>
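The two readings of the ENOMEM line quoted above give very different totals. The following is only a rough sketch of that arithmetic, assuming one 1 MiB write buffer per thread (rsz 1024 taken as KiB); the variable names are mine and are not taken from sgpdd-survey:

```python
# Rough arithmetic for the "crg 384 thr 6144" line.
# Assumption: each thread holds exactly one 1 MiB write buffer.
MIB = 1024 ** 2
GIB = 1024 ** 3

devices = 12          # number of devices under test
crg_reported = 384    # "crg" column
thr_reported = 6144   # "thr" column

threads_per_process = thr_reported // crg_reported   # 16
crg_per_device = crg_reported // devices             # 32, if crg is a total

# Reading 1 (Jim): crg is per device, so multiply by 12 again.
reading1_bytes = crg_reported * devices * threads_per_process * MIB

# Reading 2 (Kevin): crg and thr are already totals across all devices.
reading2_bytes = thr_reported * MIB

print("reading 1: %.1f GiB" % (reading1_bytes / GIB))
print("reading 2: %.1f GiB" % (reading2_bytes / GIB))
```

Under these assumptions the first reading comes to 72 GiB and the second to 6 GiB.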

I've run into this problem (on a Scientific Linux 5.5 machine).

If I use /dev/sg1, I get the following:

[root@sn86 lustre]# sgp_dd if=/dev/zero of=/dev/sg1 seek=1024 thr=1 count=1677721 bs=512 bpt=2048 time=1
sg starting out command at "sgp_dd.c":872: Cannot allocate memory

whereas if I use /dev/sdb, I get:

[root@sn86 lustre]# sgp_dd if=/dev/zero of=/dev/sdb seek=1024 thr=1 count=1677721 bs=512 bpt=2048 time=1
time to transfer data was 0.485030 secs, 1771.01 MB/sec
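
As a sanity check on the numbers in that command, here is a small sketch of the arithmetic, assuming bs is the block size in bytes, bpt the blocks per transfer, and that the reported MB/sec uses decimal megabytes (10^6 bytes); the variable names are mine:

```python
# Sizes implied by: count=1677721 bs=512 bpt=2048, elapsed 0.485030 s
bs = 512            # bytes per block
bpt = 2048          # blocks per transfer
count = 1677721     # total blocks
elapsed = 0.485030  # seconds, as reported by sgp_dd

transfer_bytes = bs * bpt        # size of each full transfer (1 MiB)
total_bytes = bs * count         # total data moved
full_transfers = count // bpt    # number of full-size transfers
rate_mb_s = total_bytes / elapsed / 1e6

print("per transfer: %d bytes" % transfer_bytes)
print("total: %d bytes in %d full transfers" % (total_bytes, full_transfers))
print("rate: %.2f MB/sec" % rate_mb_s)
```

So each request in both runs is 1 MiB, and the computed rate agrees with the 1771.01 MB/sec that sgp_dd printed.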

They correspond to the same disk:

[root@sn86 lustre]# sg_map | grep sdb
/dev/sg1  /dev/sdb

Have I just defeated the point of using sgp_dd? Is the fact that this
is really a SATA disk (behind a Dell H700 controller) the problem?

Chris