[Lustre-discuss] LustreError: acquire timeout exceeded
Thomas Roth
t.roth at gsi.de
Tue Jul 29 09:51:14 PDT 2008
Hi all,
I've encountered a LustreError that might have triggered an unwanted
failover of an MGS/MDS HA pair of servers. I'm not sure about the
latter, but at least I have not found any trace of this error via Google,
so it might be worth reporting.
It occurred in this form only on the two occasions when the heartbeat
monitoring failed shortly afterwards:
kern.log.1:Jul 20 06:47:19 kernel: LustreError:
27696:0:(upcall_cache.c:326:upcall_cache_get_entry()) acquire timeout
exceeded for key 0
kern.log.1:Jul 20 06:47:41 kernel: LustreError:
27713:0:(upcall_cache.c:326:upcall_cache_get_entry()) acquire timeout
exceeded for key 0
There was no Lustre log activity the day before; the last entry
before that was the eviction of a client at Jul 25 19:31:09.
The system is running Lustre 1.6.3, kernel 2.6.22, Debian Etch.
There are some more 'acquire timeout' messages dating from Jul 24 and 25,
however not for key 0 but for keys 4209, 4409, ..., whatever those may
mean. No "fatal" consequences then.
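As far as I can tell, upcall_cache.c belongs to the MDS group upcall
(the cache filled by l_getgroups), and the key appears to be the UID,
so key 0 would be root. For anyone wanting to poke at the same thing,
this is roughly how the relevant tunables can be inspected on a 1.6
MDS (the MDT name below is hypothetical; adjust it to your setup):

```shell
# Path layout assumed from a stock Lustre 1.6 /proc tree; the MDT
# name "lustre-MDT0000" is only an example
MDT=/proc/fs/lustre/mds/lustre-MDT0000

cat $MDT/group_upcall          # upcall binary, usually /usr/sbin/l_getgroups
cat $MDT/group_acquire_expire  # the acquire timeout (seconds) from the error
cat $MDT/group_expire          # how long cached entries are kept

# Writing a UID to group_flush should drop its cache entry,
# e.g. for key 0 (root):
echo 0 > $MDT/group_flush
```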
On Jul 27, the same thing happened again:
kern.log:Jul 27 06:47:17 kernel: LustreError:
24327:0:(upcall_cache.c:326:upcall_cache_get_entry()) acquire timeout
exceeded for key 0
kern.log:Jul 27 06:47:37 kernel: LustreError:
22381:0:(upcall_cache.c:326:upcall_cache_get_entry()) acquire timeout
exceeded for key 0
This time it took heartbeat only seconds to lose its IP:
lrmd[10373]: 2008/07/27_06:47:31 WARN: IPaddr:monitor process (PID
26903) timed out (try 1). Killing with signal SIGTERM (15).
On another system running Lustre 1.6.5, without any heartbeat errors,
the errors were:
kern.log:Jul 27 06:47:20 kernel: LustreError:
4627:0:(upcall_cache.c:325:upcall_cache_get_entry()) acquire timeout
exceeded for key 0
kern.log:Jul 27 06:47:37 kernel: LustreError:
3581:0:(upcall_cache.c:325:upcall_cache_get_entry()) acquire timeout
exceeded for key 0
Of course these temporal coincidences look verrrrry suspicious. So far
I have no idea what kind of weird script might be running at these times
and causing all the trouble, but I'm already looking forward to next
Sunday ;-)
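For what it's worth, 06:47 on a Sunday matches the minute at which
Debian's stock /etc/crontab kicks off the cron.weekly jobs, so that
would be my first place to look (a sketch, assuming the Etch default
cron layout):

```shell
# The stock Debian crontab runs /etc/cron.weekly on Sundays at 06:47;
# check whether that line matches the error timestamps
grep 'cron\.weekly' /etc/crontab

# ...and see which weekly jobs would actually run then
# (candidates like updatedb can generate heavy metadata load)
ls /etc/cron.weekly/
```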
But it would be nice if somebody could explain these Lustre errors, and
perhaps assure me that they are entirely harmless and cannot possibly
affect the stability of the system.
Thanks,
Thomas