[Lustre-devel] Hangs with cgroup memory controller
Andreas Dilger
adilger at whamcloud.com
Wed Jul 27 12:16:28 PDT 2011
On 2011-07-27, at 12:57 PM, Mark Hills wrote:
> On Wed, 27 Jul 2011, Andreas Dilger wrote:
> [...]
>> Possibly you can correlate reproducer cases with Lustre errors on the
>> console?
>
> I've managed to catch the bad state, on a clean client too. There are no
> errors reported from Lustre in dmesg.
>
> Here's the information reported by the cgroup. It seems that there's a
> discrepancy of two pages (the 'cache' field, and pgpgin versus pgpgout).
To dump Lustre pagecache pages use "lctl get_param llite.*.dump_page_cache",
which will print the inode, page index, read/write access, and page flags.
It wouldn't hurt to dump the kernel debug log, but it is unlikely to hold
anything useful.
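
Since the machine will only stay in this state until tomorrow, it may be worth saving the dump to a file. A minimal sketch, assuming Python is available on the client and that lctl expands the wildcard itself (so no shell is needed); the helper name `capture` and the output file name are illustrative only:

```python
import subprocess

# Command quoted from the message above. The wildcard is passed through
# unchanged on the assumption that lctl expands it.
DUMP_CMD = ["lctl", "get_param", "llite.*.dump_page_cache"]


def capture(cmd, out_path):
    """Run cmd, save its stdout to out_path, return the number of lines saved."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    with open(out_path, "w") as f:
        f.write(result.stdout)
    return len(result.stdout.splitlines())
```

Running `capture(DUMP_CMD, "dump_page_cache.txt")` as root on the affected client would then keep a copy of the page list for later comparison with the cgroup counters.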
> The process which was in the group terminated a long time ago.
>
> I can leave the machine in this state until tomorrow, so any suggestions
> for data to capture that could help trace this bug would be welcomed.
> Thanks.
>
> # cd /cgroup/p25321
>
> # echo 1 > memory.force_empty
> <hangs: the bug>
>
> # cat tasks
> <none>
>
> # cat memory.max_usage_in_bytes
> 1281351680
>
> # cat memory.usage_in_bytes
> 8192
>
> # cat memory.stat
> cache 8192 <--- two pages
> rss 0
> mapped_file 0
> pgpgin 396369 <--- two pages higher than pgpgout
> pgpgout 396367
> swap 0
> inactive_anon 0
> active_anon 0
> inactive_file 0
> active_file 0
> unevictable 0
> hierarchical_memory_limit 8388608000
> hierarchical_memsw_limit 10485760000
> total_cache 8192
> total_rss 0
> total_mapped_file 0
> total_pgpgin 396369
> total_pgpgout 396367
> total_swap 0
> total_inactive_anon 0
> total_active_anon 0
> total_inactive_file 0
> total_active_file 0
> total_unevictable 0
>
> # echo 1 > /proc/sys/vm/drop_caches
> <success>
>
> # echo 2 > /proc/sys/vm/drop_caches
> <success>
>
> # cat memory.stat
> <same as above>
>
> --
> Mark
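
The arithmetic in the quoted memory.stat output can be checked mechanically. This is only a sketch: it assumes 4 KiB pages and assumes that pgpgin minus pgpgout approximates the number of pages still charged to the group. The function names are illustrative, and the sample text is copied from the report above.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages on this client


def parse_memory_stat(text):
    """Parse 'name value' lines from a cgroup memory.stat file into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def charged_pages(stats):
    """Pages charged in but never uncharged, going by the event counters."""
    return stats["pgpgin"] - stats["pgpgout"]


# Values copied from the memory.stat output quoted above.
sample = """cache 8192
rss 0
mapped_file 0
pgpgin 396369
pgpgout 396367
"""

stats = parse_memory_stat(sample)
leaked = charged_pages(stats)
# The counter difference should account for the bytes still charged.
consistent = leaked * PAGE_SIZE == stats["cache"] + stats["rss"]
print(leaked, consistent)
```

With the numbers from the report, the difference is two pages, which matches the 8192 bytes still shown in the 'cache' field.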
Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.