[Lustre-discuss] Lustre MDS Errors 1-7 and operation 101
Cliff.White at Sun.COM
Wed Jan 14 20:05:29 PST 2009
Thomas Roth wrote:
> Hi all,
> on our production cluster we have for a surprisingly long time (> 1 day)
> only the following two error messages (and no visible problems),
> although the system is under heavy load right now:
> Jan 14 10:44:33 server1 kernel: LustreError:
> 5118:0:(ldlm_lib.c:1536:target_send_reply_msg()) @@@ processing error
> (-107) req@ffff8107fd6c4c50 x2077599/t0 o101-><?>@<?>:0/0 lens 232/0 e
> 0 to 0 dl 1231927273 ref 1 fl Interpret:/0/0 rc -107/0
> Jan 14 10:46:42 server1 kernel: LustreError:
> 6766:0:(mgs_handler.c:557:mgs_handle()) lustre_mgs: operation 101 on
> unconnected MGS
> error (-107) is /* Transport endpoint is not connected */ - I have
> seen this before on clients which had lost the connection to the
> cluster. But this is on the MGS/MDS - one server with one partition for
> the MGS and one for the MDT.
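For anyone who wants to check what a negative return code in these messages stands for, here is a minimal sketch. It assumes the value is a negated errno, as the quoted comment for -107 suggests. `describe_rc` is a made-up helper name, not part of Lustre, and the numeric values are platform-specific (107 is ENOTCONN on Linux).

```python
import errno
import os

def describe_rc(rc):
    """Map a negative kernel-style return code to (symbolic name, message)."""
    code = abs(rc)
    # errno.errorcode maps numeric codes to names for the local platform.
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)

# -107 is the code that appears in both log lines quoted above.
name, message = describe_rc(-107)
print(name, "-", message)
```

Run this on a Linux node so the numbering matches the kernel that produced the log.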
Remember, this is a distributed client/server system. When any node
needs to connect to a service, there will be a client process.
So, an OSS (which needs to talk to the MDS) will have a metadata client
(mdc) running on it.
> The second error suggests of course that the MGS is actually not
> connected - but how can a Lustre system run when its MGS isn't there?
> Makes no sense, does it?
Ah, that's the beauty of Lustre. The MGS is needed for two things:
- New clients get their mount configuration from the MGS
- Configuration changes are propagated from the MGS.
So, if you are not actively mounting clients, and not changing the
configuration, in fact Lustre can run just fine without the MGS.
Filesystem users will not even notice it's gone, unless they are
attempting a mount.
Likewise, the MDS is used for metadata transactions. If a client is not
actively touching metadata (for example, a client that already has an open
file and is doing IO only), you can fail the MDS without the clients
noticing.
Those two errors are quite harmless in this case. 'operation x on
unconnected MGS' means a client was evicted: the client is attempting to
replay an RPC, but the server has destroyed the import (due to the
eviction) and it has not been re-established.
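If you want to count how often these messages show up, or which operation numbers they carry, something like the following rough sketch could pull the fields out of a syslog line. The regular expression and the name `parse_lustre_error` are my own assumptions, and the field meanings (`o<N>` as the operation number, `rc` as the return code) are read off the message quoted in this thread rather than taken from Lustre documentation.

```python
import re

# Sample line: the first message from this thread, re-joined onto one line.
LINE = ("LustreError: 5118:0:(ldlm_lib.c:1536:target_send_reply_msg()) @@@ "
        "processing error (-107) req at ffff8107fd6c4c50 x2077599/t0 "
        "o101-><?>@<?>:0/0 lens 232/0 e 0 to 0 dl 1231927273 ref 1 "
        "fl Interpret:/0/0 rc -107/0")

# Captures: source file, source line, function, operation number, return code.
PATTERN = re.compile(
    r"\(([\w.]+):(\d+):(\w+)\(\)\).*?\bo(\d+)->.*?\brc (-?\d+)/"
)

def parse_lustre_error(line):
    """Return a dict of fields, or None if the line has a different shape."""
    m = PATTERN.search(line)
    if m is None:
        return None
    source, lineno, func, opcode, rc = m.groups()
    return {
        "source": source,
        "line": int(lineno),
        "function": func,
        "opcode": int(opcode),
        "rc": int(rc),
    }

print(parse_lustre_error(LINE))
```

Lines of a different shape, such as the mgs_handle() message, simply return None here and would need their own pattern.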
> O.k., the cluster is running Debian Etch 64bit, Kernel 2.6.22, Lustre
> 126.96.36.199. The "operation 101" thing is supposed to have been solved in
> the 1.6.4 -> 1.6.5 upgrade, according to the change logs. Either it
> hasn't, or I have a real problem where this error message really applies.
> It is also remarkable that nobody seems to know about the
> meaning of "operation X on unconnected MGS" - via Google one will find
> many questions but no answers - at least that's my impression (and I
> didn't search Bugzilla).
> Many thanks,