[Lustre-discuss] LustreError: acquire timeout exceeded
Thomas Roth
t.roth at gsi.de
Tue Jul 29 09:51:14 PDT 2008
Hi all,
I've encountered a LustreError that might have triggered an unwanted
failover of an MGS/MDS HA pair of servers. I'm not sure about the
latter, but at least I have not found any trace of this error via Google,
so it might be worth reporting.
It occurred in this form only on the two occasions when the heartbeat
monitoring failed shortly afterwards:
kern.log.1:Jul 20 06:47:19 kernel: LustreError:
27696:0:(upcall_cache.c:326:upcall_cache_get_entry()) acquire timeout
exceeded for key 0
kern.log.1:Jul 20 06:47:41 kernel: LustreError:
27713:0:(upcall_cache.c:326:upcall_cache_get_entry()) acquire timeout
exceeded for key 0
There was no Lustre log activity the day before; the last entry
before that was the eviction of a client at Jul 25 19:31:09.
The system is running Lustre 1.6.3, kernel 2.6.22, Debian Etch.
There are some more 'acquire timeout' messages dating from Jul 24 and 25,
however not for key 0 but for keys 4209, 4409, ..., whatever those may
mean. No "fatal" consequences then.
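As far as I can tell, upcall_cache.c belongs to the MDS group upcall
(the cache filled by l_getgroups), and the key appears to be the UID,
so key 0 would be root. For anyone wanting to poke at the same thing,
this is roughly how the relevant tunables can be inspected on a 1.6
MDS (the MDT name below is hypothetical; adjust it to your setup):

```shell
# Path layout assumed from a stock Lustre 1.6 /proc tree; the MDT
# name "lustre-MDT0000" is only an example
MDT=/proc/fs/lustre/mds/lustre-MDT0000

cat $MDT/group_upcall          # upcall binary, usually /usr/sbin/l_getgroups
cat $MDT/group_acquire_expire  # the acquire timeout (seconds) from the error
cat $MDT/group_expire          # how long cached entries are kept

# Writing a UID to group_flush should drop its cache entry,
# e.g. for key 0 (root):
echo 0 > $MDT/group_flush
```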
On Jul 27, the same thing happened again:
kern.log:Jul 27 06:47:17 kernel: LustreError:
24327:0:(upcall_cache.c:326:upcall_cache_get_entry()) acquire timeout
exceeded for key 0
kern.log:Jul 27 06:47:37 kernel: LustreError:
22381:0:(upcall_cache.c:326:upcall_cache_get_entry()) acquire timeout
exceeded for key 0
This time it took heartbeat only seconds to lose its IP:
lrmd[10373]: 2008/07/27_06:47:31 WARN: IPaddr:monitor process (PID
26903) timed out (try 1). Killing with signal SIGTERM (15).
On another system running Lustre 1.6.5, without any heartbeat errors,
the errors were:
kern.log:Jul 27 06:47:20 kernel: LustreError:
4627:0:(upcall_cache.c:325:upcall_cache_get_entry()) acquire timeout
exceeded for key 0
kern.log:Jul 27 06:47:37 kernel: LustreError:
3581:0:(upcall_cache.c:325:upcall_cache_get_entry()) acquire timeout
exceeded for key 0
Of course these temporal coincidences look verrrrry suspicious. So far
I have no idea what kind of weird script might be running at these times
and causing all the trouble, but I'm already looking forward to next
Sunday ;-)
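For what it's worth, 06:47 on a Sunday matches the minute at which
Debian's stock /etc/crontab kicks off the cron.weekly jobs, so that
would be my first place to look (a sketch, assuming the Etch default
cron layout):

```shell
# The stock Debian crontab runs /etc/cron.weekly on Sundays at 06:47;
# check whether that line matches the error timestamps
grep 'cron\.weekly' /etc/crontab

# ...and see which weekly jobs would actually run then
# (candidates like updatedb can generate heavy metadata load)
ls /etc/cron.weekly/
```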
But it would be nice if somebody could explain these Lustre errors, and
perhaps assure me that they are entirely harmless and cannot possibly
affect the stability of the system.
Thanks,
Thomas