[Lustre-discuss] LustreError: lock callback timer expired after

Oleg Drokin Oleg.Drokin at Sun.COM
Mon Mar 30 08:44:26 PDT 2009


Hello!

On Mar 30, 2009, at 7:06 AM, Simon Latapie wrote:

> I currently have a Lustre system with 1 MDS, 2 OSSes with 2 OSTs each, and
> 37 Lustre clients (1 login and 36 compute nodes), all using InfiniBand
> as the Lustre network (o2ib). All nodes run the 1.6.5.1 patched kernel.
> For the past two months, several times a month, the login node seems to
> be permanently evicted from the OSTs. The OSTs show a "lock callback
> timer expired after ..." error, then the login node tries to reconnect and
> fails. As the Lustre mount is the home directory of the cluster, users
> can't access it and can't log in anymore. The only way I found to
> stop this is to reboot the login node (umount -f hangs). After the
> reboot, the login node simply reconnects to the OSTs, and everything is okay
> until the next "lock callback timer" issue.
> Compute nodes don't seem to be affected by this problem. Only the login
> node is.
> There is no memory problem (no swap, no memory leaks) on either the OSTs
> or the login node.
> There is no network error (no packet loss detected) on IB or IPoIB.
> The expiration time varies widely: from about 300s to 9000s.

Several known bugs could lead to this.
One that comes to mind is bug 15716.
The recommended fix is to upgrade to the latest Lustre release, of course.
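
In the meantime, it may help to collect the timer values from the OSS logs to
see how the evictions are distributed. The following is only a rough sketch,
not an official tool: it assumes the log line contains the phrase quoted above
followed by a number of seconds (for example "expired after 300s"), and the
function name and the log path passed on the command line are hypothetical.

```python
import re
import sys

# Assumption: each event line contains the phrase quoted in this thread,
# followed by an integer number of seconds and the letter "s".
PATTERN = re.compile(r"lock callback timer expired after (\d+)s")


def expiry_seconds(lines):
    """Return the timer values (in seconds) found in an iterable of log lines."""
    return [int(m.group(1)) for line in lines for m in PATTERN.finditer(line)]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python expiry.py /path/to/oss/syslog
    with open(sys.argv[1], errors="replace") as log:
        values = expiry_seconds(log)
    if values:
        print(f"{len(values)} events, min {min(values)}s, max {max(values)}s")
    else:
        print("no matching events")
```

If the real message wording differs on your version, only the regular
expression should need adjusting.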

Bye,
     Oleg


