[Lustre-devel] Hangs with cgroup memory controller
Mark Hills
Mark.Hills at framestore.com
Wed Jul 27 10:33:10 PDT 2011
On Wed, 27 Jul 2011, Andreas Dilger wrote:
> Two ideas come to mind. One is that the reason you are having difficulty
> reproducing the problem is that it only happens after some fault
> condition. Possibly you need the client to do recovery to an OST and
> resend a bulk RPC, or resend due to a checksum error?
Is there an easy way to trigger some error cases like this?
> It might also be due to application IO types (e.g. mmap, direct IO,
> pwrite, splice, etc).
Yes, of course. Although I didn't gather any statistics, there wasn't a
clear standout application which was more affected than others.
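If it helps to narrow this down, here is a rough sketch of how each I/O
type could be exercised separately, so that a hang can be tied to one
code path. It is only an illustration: the function names and the
"iotest-" file prefix are made up for this example, direct I/O and splice
are left out, and it would need to be run from inside the memory cgroup
against a directory on the Lustre mount.

```python
import mmap
import os


def write_buffered(path, data):
    """Plain buffered write() through the page cache."""
    with open(path, "wb") as f:
        f.write(data)


def write_pwrite(path, data):
    """Positional writes with pwrite(), looping over short writes."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        offset = 0
        while offset < len(data):
            offset += os.pwrite(fd, data[offset:], offset)
    finally:
        os.close(fd)


def write_mmap(path, data):
    """Write through a shared, writable memory mapping (data must be non-empty)."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.ftruncate(fd, len(data))  # the file must be large enough to map
        with mmap.mmap(fd, len(data)) as m:
            m[:] = data
            m.flush()
    finally:
        os.close(fd)


# Hypothetical labels for the I/O types under test.
IO_TYPES = {
    "buffered": write_buffered,
    "pwrite": write_pwrite,
    "mmap": write_mmap,
}


def exercise(directory, size=1 << 20):
    """Write the same random data via each I/O type and verify it reads back."""
    data = os.urandom(size)
    results = {}
    for name, writer in IO_TYPES.items():
        path = os.path.join(directory, "iotest-" + name)
        writer(path, data)
        with open(path, "rb") as f:
            results[name] = f.read() == data
    return results
```

Calling exercise() on a test directory returns one True/False entry per
I/O type. Running a single type in a loop per cgroup would show whether
any one of them leaves pages behind.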
> Possibly you can correlate reproducer cases with Lustre errors on the
> console?
Back when I tried this last year on the production system, I wasn't able
to see corresponding errors. But I don't have any of this data around any
more.
I'd need to do some tests on the production system to capture one case.
> Lustre also has memory debugging that can be enabled, but without a
> reasonably concise reproducer it would be difficult to log/analyze so
> much data for hours of runtime.
If I am able to capture a case, is there a way to, for example, dump a
list of Lustre pages still held by the client? And correlate these with
the files in question?
What I am thinking is that I could stop the running processes and attempt
to drain all the pages, and this could hopefully leave a small number of
'bad' ones -- with the files in question I could at least help to identify
the I/O type.
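As a rough sketch of the drain step, something like the following could
report how much page cache is still charged to a group after the
processes are stopped and the caches dropped. It assumes the cgroup v1
memory controller is mounted at /sys/fs/cgroup/memory (the mount point
differs between systems) and uses the generic kernel drop_caches
interface rather than anything Lustre-specific. The function names are
invented for illustration, and drop_caches() needs root.

```python
import os

# Assumption: cgroup v1 memory controller mount point; adjust for the system.
CGROUP_ROOT = "/sys/fs/cgroup/memory"


def parse_memory_stat(text):
    """Parse the 'name value' lines of a memory.stat file into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def read_memory_stat(group):
    """Read memory.stat for the named group under CGROUP_ROOT."""
    with open(os.path.join(CGROUP_ROOT, group, "memory.stat")) as f:
        return parse_memory_stat(f.read())


def drop_caches():
    """Flush dirty data, then ask the kernel to drop clean caches (root only)."""
    os.sync()
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")


def residual_cache(group):
    """Bytes of page cache still charged to the group after a drain attempt."""
    drop_caches()
    return read_memory_stat(group).get("cache", 0)
```

If residual_cache() stays above zero for a group with no running
processes, those are the pages of interest. This only gives a byte count,
so mapping the pages back to files would still need something from the
Lustre side.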
Thanks for your reply
--
Mark