[Lustre-discuss] Errors in output from sgpdd-survey (sgp_dd.c Cannot allocate memory)
Christopher J. Walker
C.J.Walker at qmul.ac.uk
Mon Feb 21 15:52:04 PST 2011
On 14/12/10 20:27, Kevin Van Maren wrote:
> Jim Shankland wrote:
>>> ...
>>> total_size 100663296K rsz 1024 crg 384 thr 768 write 388.20 MB/s 384 x 1.01 = 388.18 MB/s read 387.16 MB/s 384 x 1.01 = 388.18 MB/s
>>> total_size 100663296K rsz 1024 crg 384 thr 1536 write 1 failed read 385.72 MB/s 384 x 1.01 = 388.18 MB/s
>>> total_size 100663296K rsz 1024 crg 384 thr 3072 write 140 failed read 121 failed
>>> total_size 100663296K rsz 1024 crg 384 thr 6144 ENOMEM
>>>
>>
>> You just don't have enough RAM to do these particular runs.
>> If you look at the line ending in ENOMEM above: sgpdd-survey
>> is proposing to launch 384 separate sgp_dd processes for each
>> of 12 different devices, with each process launching 16
>> threads (6144 / 384), and each thread allocating at least one 1
>> MiB write buffer. That adds up to 72 GiB of RAM for write
>> buffers. The ENOMEM line means that the sgpdd-survey script
>> looked at the amount of physical RAM you have, and estimated it
>> wasn't enough to do this run.
>>
>
> It's not just the ENOMEM at 6144 total threads that is the problem; it
> is the "write X failed", etc., at the _lower_ thread counts.
>
> From memory, the "crg" and "thr" numbers are already multiplied by 12
> (the number of devices being tested), so "thr" should reflect the total
> number of buffers required. For this test, it looks like crg=32 and
> SG_MAX_QUEUE is the default 16. So the memory consumption _should not_
> be an issue, but sgp_dd is still having problems allocating buffers.
>
> Again, I've seen this even when I clearly had free memory on the node,
> so I think there is something else at work here.
>
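For reference, the two readings of the thr=6144 line quoted above give very different buffer totals. A minimal Python sketch of the arithmetic, assuming one buffer of rsz KiB (rsz 1024, i.e. 1 MiB) per thread; the function and variable names are illustrative only and are not part of sgpdd-survey:

```python
# Rough write-buffer estimate for the "thr 6144" line of the survey output.
# Assumption: each thread holds exactly one buffer of rsz KiB (rsz 1024 = 1 MiB).

KIB_PER_GIB = 1024 * 1024

def buffer_gib(total_threads, rsz_kib=1024):
    """Return total buffer memory in GiB for the given thread count."""
    return total_threads * rsz_kib / KIB_PER_GIB

devices = 12

# Reading 1 (Jim): 384 processes per device, 16 threads each, on 12 devices.
per_device_reading = buffer_gib(384 * 16 * devices)

# Reading 2 (Kevin): crg and thr already include the factor of 12,
# so thr=6144 is the total number of threads (and buffers).
total_reading = buffer_gib(6144)

print(f"per-device reading: {per_device_reading:.0f} GiB")  # 72 GiB
print(f"total reading:      {total_reading:.0f} GiB")       # 6 GiB
```

Under the second reading, the runs that reported failed writes (thr 1536 and thr 3072) would need only about 1.5 GiB and 3 GiB of buffers, which fits the observation that the failures are not a simple shortage of RAM.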
I've run into this problem (on a Scientific Linux 5.5 machine).
If I use /dev/sg1, I get the following:
[root@sn86 lustre]# sgp_dd if=/dev/zero of=/dev/sg1 seek=1024 thr=1 count=1677721 bs=512 bpt=2048 time=1
sg starting out command at "sgp_dd.c":872: Cannot allocate memory
whereas if I use /dev/sdb, I get:
[root@sn86 lustre]# sgp_dd if=/dev/zero of=/dev/sdb seek=1024 thr=1 count=1677721 bs=512 bpt=2048 time=1
time to transfer data was 0.485030 secs, 1771.01 MB/sec
They correspond to the same disk:
[root@sn86 lustre]# sg_map | grep sdb
/dev/sg1 /dev/sdb
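As a sanity check on the /dev/sdb figure, the reported rate can be recomputed from the command-line parameters. A short Python sketch, assuming sgp_dd reports MB as 10^6 bytes (that assumption reproduces the printed figure); the variable names are illustrative:

```python
# Recompute the transfer size and rate from the sgp_dd parameters above.
bs = 512            # block size in bytes (bs=512)
count = 1677721     # number of blocks to copy (count=1677721)
bpt = 2048          # blocks per transfer (bpt=2048)
elapsed = 0.485030  # seconds, taken from the sgp_dd output

bytes_per_transfer = bs * bpt            # size of each individual transfer
total_bytes = bs * count                 # total amount of data written
rate_mb_s = total_bytes / elapsed / 1e6  # assumption: MB means 10^6 bytes

print(bytes_per_transfer)       # 1048576 (1 MiB)
print(total_bytes)              # 858993152
print(f"{rate_mb_s:.2f} MB/s")  # 1771.01 MB/s
```

Each transfer is therefore 1 MiB, matching the rsz 1024 used in the survey. A rate of about 1.7 GB/s is well above what a single SATA disk would normally sustain, which suggests the /dev/sdb run is at least partly measuring caching rather than the disk itself.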
Have I just defeated the point of using sgp_dd? Is the fact that this
is really a SATA disk (behind a Dell H700 controller) the problem?
Chris