[Lustre-discuss] Hung Lustre filesystem until a remount

Andreas Dilger adilger at sun.com
Fri Jan 23 12:17:15 PST 2009


On Jan 22, 2009  14:05 -0600, Jeremy Mann wrote:
> We have been running Lustre for a few years now and today was the first
> time I came upon something I haven't seen before. The lustre partition was
> mounted and I could access files within it, however the minute I started
> opening the large files, it became unstable and hung. The system load shot
> up to 33 (on the headnode client) and Lustre was using approximately 6 GB
> of memory.  I stopped all of our services that write into the Lustre
> partition and unmounted /lustre. Tailing the logs during this process, I
> saw:
> 
> LustreError: 8620:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Got rc -108
> from cancel RPC: canceling anyway
> LustreError: 8620:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Skipped
> 308135 previous similar messages
> LustreError: 8620:0:(ldlm_request.c:1575:ldlm_cli_cancel_list())
> ldlm_cli_cancel_list: -108
> LustreError: 8620:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) Skipped
> 308135 previous similar messages
> LustreError: 8620:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Got rc -108
> from cancel RPC: canceling anyway
> LustreError: 8620:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Skipped
> 710099 previous similar messages
> LustreError: 8620:0:(ldlm_request.c:1575:ldlm_cli_cancel_list())
> ldlm_cli_cancel_list: -108
> LustreError: 8620:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) Skipped
> 710099 previous similar messages

With so many skipped messages, it appears this node is in a tight loop for
some reason.  Is this client mounted on the same node as the MDS perhaps?
That isn't an excuse for hitting such a problem, but might explain why
it was in such a tight loop that it was DOS-ing your filesystem.
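For what it's worth, the "Got rc -108" in those cancel RPC messages is a negated standard Linux errno, which can be decoded on any Linux box (this assumes a host with python3; nothing Lustre-specific):

```shell
# rc -108 is -ESHUTDOWN on Linux ("Cannot send after transport endpoint
# shutdown"), i.e. the client kept trying to cancel locks against
# connections that were already being torn down during the unmount.
python3 -c 'import errno, os; print(errno.ESHUTDOWN, os.strerror(errno.ESHUTDOWN))'
# prints: 108 Cannot send after transport endpoint shutdown
```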

> Over and over again. A few minutes later, Lustre unmounted and freed up
> the 6GB of memory it was using. I didn't see anything wrong with our OSTs
> and remounted the Lustre partition on the headnode and now everything is
> back to normal. I'm wondering what could have caused this in the first
> place?
> 
> Rocks 5 (RHEL5), Lustre 1.6.5.1, Kernel 2.6.18-53.1.14.el5_lustre.1.6.5.1smp

If it is 1.6.5.1 it might be the statahead bug.  Please check the list
archives for the many discussions of workarounds.  There was also a recent
patch (not yet in any release) to make the dynamic lock LRU sizing code use
less CPU, which may have contributed to this problem.
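The workaround discussed in those archive threads was to disable statahead on the affected clients.  A sketch, assuming the 1.6.x llite tunable name statahead_max (verify the tunable exists on your version before applying; this is a config fragment, not runnable without a mounted Lustre client):

```shell
# Disable directory statahead on this client (Lustre 1.6.x tunable name
# assumed; the setting lasts only until remount unless made persistent).
lctl set_param llite.*.statahead_max=0

# Equivalent via /proc on 1.6.x clients:
for f in /proc/fs/lustre/llite/*/statahead_max; do echo 0 > "$f"; done
```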

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



