[Lustre-discuss] Errors in output from sgpdd-survey (sgp_dd.c Cannot allocate memory)
Jim Shankland
jas at whamcloud.com
Tue Dec 14 10:20:25 PST 2010
Heald, Nathan T. wrote:
> Hi everyone,
> I have been running sgpdd-survey on some DDN 9550's and am getting some
> errors. I'm using what I believe to be the latest version of the I/O Kit
> (lustre-iokit-1.2-200709210921). I've got 4 OSSes attached and run
> sgpdd-survey against all the disk from each host one at a time. Each host is
> getting these errors, but not identically. I've found several threads on the
> mailing list with people reporting this same error but there are no
> resolutions posted. One post suggested a modification to the flags for
> "sg_readcap" in the script could resolve these errors, but making the
> changes did not seem to fix the issue. It looks like sgp_dd is having
> intermittent problems:
>
> 16384+0 records out
> sg starting in command at "sgp_dd.c":827: Cannot allocate memory
[snip]
>
> Output from sgpdd-survey:
>
> Wed Dec 1 10:55:55 EST 2010 sgpdd-survey on /dev/sdp /dev/sdo /dev/sdn
> /dev/sdw /dev/sdv /dev/sdu /dev/sdt /dev/sds /dev/sdy /dev/sdr /dev/sdx
> /dev/sdq from oss1
> ...
> total_size 100663296K rsz 1024 crg 384 thr 768 write 388.20 MB/s 384
> x 1.01 = 388.18 MB/s read 387.16 MB/s 384 x 1.01 = 388.18 MB/s
> total_size 100663296K rsz 1024 crg 384 thr 1536 write 1 failed read
> 385.72 MB/s 384 x 1.01 = 388.18 MB/s
> total_size 100663296K rsz 1024 crg 384 thr 3072 write 140 failed read 121
> failed
> total_size 100663296K rsz 1024 crg 384 thr 6144 ENOMEM
You just don't have enough RAM to do these particular runs.
If you look at the line ending in ENOMEM above: sgpdd-survey
is proposing to launch 384 separate sgp_dd processes for each
of 12 different devices, with each process launching 16
threads (6144 / 384), and each thread allocating at least one 1
MiB write buffer. That adds up to 72 GiB of RAM for write
buffers. The ENOMEM line means that the sgpdd-survey script
looked at the amount of physical RAM you have, and estimated it
wasn't enough to do this run.
You could try running sgpdd-survey against each block device
one at a time, which will reduce the needed RAM by a factor of
12 (in your case), but of course isn't quite equivalent.
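
For reference, the arithmetic behind the 72 GiB figure, and the per-device figure you would get from one-device-at-a-time runs, can be sketched as follows. This is only a back-of-the-envelope estimate using the numbers quoted above, not the check the sgpdd-survey script itself performs. The function and parameter names are hypothetical, and it assumes that "rsz 1024" means a 1024 KiB (1 MiB) buffer per thread.

```python
# Rough estimate of sgp_dd write-buffer memory for one sgpdd-survey run.
# Assumption: each thread allocates exactly one buffer of rsz KiB.

GIB = 1024 ** 3


def estimate_buffer_bytes(n_devices, procs_per_device, threads_per_proc, rsz_kib):
    """Return total write-buffer bytes across all devices, processes, and threads."""
    total_threads = n_devices * procs_per_device * threads_per_proc
    return total_threads * rsz_kib * 1024


# Figures from the ENOMEM line: crg 384, thr 6144, rsz 1024, 12 devices.
threads_per_proc = 6144 // 384  # 16 threads per sgp_dd process

all_devices = estimate_buffer_bytes(12, 384, threads_per_proc, 1024)
one_device = estimate_buffer_bytes(1, 384, threads_per_proc, 1024)

print(f"all 12 devices: {all_devices / GIB:.0f} GiB")
print(f"single device:  {one_device / GIB:.0f} GiB")
```

Under these assumptions, a single-device run at the same settings still needs about 6 GiB for write buffers alone, so it is worth comparing that against the RAM actually free on each OSS.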
sg_readcap is used to determine the physical sector size and
capacity (sector count) of each block device. I wouldn't
think changing the flags on it would help anything.
Jim Shankland
Whamcloud, Inc.