[Lustre-discuss] Errors in output from sgpdd-survey (sgp_dd.c Cannot allocate memory)

Tue Dec 14 10:20:25 PST 2010

Heald, Nathan T. wrote:
> Hi everyone,
> I have been running sgpdd-survey on some DDN 9550's and am getting some
> errors. I'm using what I believe to be the latest version of the I/O Kit
> (lustre-iokit-1.2-200709210921). I've got 4 OSSes attached and run
> sgpdd-survey against all the disk from each host one at a time. Each host is
> getting these errors, but not identically. I've found several threads on the
> mailing list with people reporting this same error but there are no
> resolutions posted. One post suggested a modification to the flags for
> "sg_readcap" in the script could resolve these errors, but making the
> changes did not seem to fix the issue. It looks like sgp_dd is having
> intermittent problems:
> 
> 16384+0 records out
> sg starting in command at "sgp_dd.c":827: Cannot allocate memory

[snip]

> 
> Output from sgpdd-survey:
> 
> Wed Dec  1 10:55:55 EST 2010 sgpdd-survey on /dev/sdp /dev/sdo /dev/sdn
> /dev/sdw /dev/sdv /dev/sdu /dev/sdt /dev/sds /dev/sdy /dev/sdr /dev/sdx
> /dev/sdq  from oss1
> ... 
> total_size 100663296K rsz 1024 crg   384 thr   768 write  388.20 MB/s   384
> x   1.01 =  388.18 MB/s read  387.16 MB/s   384 x   1.01 =  388.18 MB/s
> total_size 100663296K rsz 1024 crg   384 thr  1536 write 1 failed read
> 385.72 MB/s   384 x   1.01 =  388.18 MB/s
> total_size 100663296K rsz 1024 crg   384 thr  3072 write 140 failed read 121
> failed 
> total_size 100663296K rsz 1024 crg   384 thr  6144 ENOMEM

You just don't have enough RAM to do these particular runs.
If you look at the line ending in ENOMEM above: sgpdd-survey
is proposing to launch 384 separate sgp_dd processes for each
of 12 different devices, with each process launching 16
threads (6144 / 384), and each thread allocating at least one
1 MiB write buffer.  That adds up to 72 GiB of RAM for write
buffers.  The ENOMEM line means that the sgpdd-survey script
looked at the amount of physical RAM you have, and estimated it
wasn't enough to do this run.

You could try running sgpdd-survey against each block device
one at a time, which will reduce the needed RAM by a factor of
12 (in your case), but of course isn't quite equivalent, since
the devices are then no longer exercised concurrently.
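
For anyone who wants to check the arithmetic, here is a rough
sketch in Python.  It just follows the counting described above
(processes per device x devices x threads per process x buffer
size); it is not the logic of the sgpdd-survey script itself, and
the function name and parameters are made up for illustration.

```python
# Rough estimate of write-buffer RAM for one sgpdd-survey run,
# following the counting in this message (an assumption, not the
# script's own calculation).

KIB = 1024
GIB = 1024 ** 3


def write_buffer_bytes(procs_per_device, devices, threads_per_proc, rsz_kib):
    """Return total bytes of write buffers, one buffer of rsz_kib per thread."""
    total_threads = procs_per_device * devices * threads_per_proc
    return total_threads * rsz_kib * KIB


# Values from the ENOMEM line: rsz 1024 (KiB), crg 384, thr 6144.
threads_per_proc = 6144 // 384  # 16 threads per sgp_dd process

all_devices = write_buffer_bytes(384, 12, threads_per_proc, 1024)
one_device = write_buffer_bytes(384, 1, threads_per_proc, 1024)

print("12 devices at once: %.1f GiB" % (all_devices / GIB))
print("one device at a time: %.1f GiB" % (one_device / GIB))
```

With the numbers from that line this comes to 72 GiB for all 12
devices together and 6 GiB for a single device.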

sg_readcap is used to determine the physical sector size and
capacity (sector count) of each block device.  I wouldn't
think changing the flags on it would help anything.

Jim Shankland
Whamcloud, Inc.