[Lustre-devel] Hangs with cgroup memory controller

Wed Jul 27 10:33:10 PDT 2011

On Wed, 27 Jul 2011, Andreas Dilger wrote:

> Two ideas come to mind. One is that the reason you are having difficulty 
> reproducing the problem is that it only happens after some fault 
> condition. Possibly you need the client to do recovery to an OST and 
> resend a bulk RPC, or resend due to a checksum error?

Is there an easy way to trigger some error cases like this?

> It might also be due to application IO types (e.g. mmap, direct IO, 
> pwrite, splice, etc).

Yes, of course. Although I didn't gather any statistics, there wasn't a 
clear standout application which was more affected than others.

> Possibly you can correlate reproducer cases with Lustre errors on the 
> console?

Back when I tried this last year on the production system, I wasn't able 
to see corresponding errors. But I no longer have any of that data.

I'd need to do some tests on the production system to capture one case.

> Lustre also has memory debugging that can be enabled, but without a 
> reasonably concise reproducer it would be difficult to log/analyze so 
> much data for hours of runtime.

If I am able to capture a case, is there a way to, for example, dump a 
list of Lustre pages still held by the client? And correlate these with 
the files in question?

What I am thinking is that I could stop the running processes and attempt 
to drain all the pages, and this could hopefully leave a small number of 
'bad' ones -- with the files in question I could at least help to identify 
the I/O type.
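
To make the "drain and see what is left" check concrete, here is a rough 
sketch of the counting side only. It assumes the cgroup v1 memory 
controller's memory.stat format ("key value" per line, with "cache" 
reported in bytes); the function names, the 4096-byte page size and the 
sample numbers are made up for illustration. It says nothing about which 
files the pages belong to, which is the part I am asking about.

```python
import sys


def parse_memory_stat(text):
    """Parse 'key value' lines, as found in a cgroup v1 memory.stat file."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip anything that is not exactly "<name> <non-negative integer>".
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def residual_cache_pages(stats, page_size=4096):
    """Number of page-cache pages still charged to the cgroup.

    page_size is an assumption; check the real value on the client.
    """
    return stats.get("cache", 0) // page_size


# Hypothetical sample contents, not real data from the affected system.
SAMPLE = """cache 40960
rss 0
mapped_file 8192
"""

if __name__ == "__main__":
    # Pass the path to a memory.stat file, or run with no argument
    # to use the made-up sample above.
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as f:
            text = f.read()
    else:
        text = SAMPLE
    stats = parse_memory_stat(text)
    print("cache pages still charged:", residual_cache_pages(stats))
```

The idea would be to take one snapshot after stopping the processes and 
another after attempting to drain the cache, and see whether the count 
settles at a small non-zero number.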

Thanks for your reply

-- 
Mark