[Lustre-discuss] MGS/MDS error: operation 101 on unconnected MGS

Ms. Megan Larko dobsonunit at gmail.com
Mon Sep 29 08:34:42 PDT 2008


Greetings,

It's Monday (sigh).  I lost one of the two dual-core Opteron 275 CPUs in my
OSS box over the weekend.  The /var/log/messages file contained many "bus
error on processor" messages, so this morning I rebooted the OSS with only
the one remaining dual-core CPU.  The box came up just fine and I mounted
the three Lustre OST disks I have on that box.  (CentOS 5
2.6.18-53.1.13.el5_lustre.1.6.4.3smp #1 SMP Sun Feb 17 08:38:44 EST
2008 x86_64 x86_64 x86_64 GNU/Linux)
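
For reference, mounting the OSTs was just the usual "mount -t lustre" of each
OST block device on the OSS; the device names and mount points here are
placeholders rather than my exact ones:

  mount -t lustre /dev/sdb /mnt/crew3-ost0    # OST block device -> local mount point
  mount -t lustre /dev/sdc /mnt/crew3-ost1
  mount -t lustre /dev/sdd /mnt/crew3-ost2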

The problem is that my MGS/MDS box now cannot access or use the OSTs on that
OSS.  The single MDT volume mounts without error, but I see the following
messages in the MGS/MDS /var/log/messages file:

Sep 29 10:02:40 mds1 kernel: Lustre: MDT crew3-MDT0000 now serving dev
(crew3-MDT0000/be7a58cd-e259-823f-486b-e974551d7ad6) with recovery
enabled
Sep 29 10:02:40 mds1 kernel: Lustre: Server crew3-MDT0000 on device
/dev/md0 has started
Sep 29 10:02:40 mds1 kernel: Lustre: MDS crew3-MDT0000: crew3d1_UUID
now active, resetting orphans
Sep 29 10:02:40 mds1 kernel: Lustre: Skipped 2 previous similar messages
Sep 29 10:03:29 mds1 kernel: LustreError:
26914:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on
unconnected MGS
Sep 29 10:03:29 mds1 kernel: LustreError:
26914:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error
(-107)  req at ffff81005986d450 x17040407/t0 o101-><?>@<?>:-1 lens 232/0
ref 0 fl Interpret:/0/0 rc -107/0
Sep 29 10:03:29 mds1 kernel: LustreError:
26915:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 501 on
unconnected MGS
Sep 29 10:03:29 mds1 kernel: LustreError:
26915:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error
(-107)  req at ffff810055ffac50 x17040408/t0 o501-><?>@<?>:-1 lens 200/0
ref 0 fl Interpret:/0/0 rc -107/0

The messages do not repeat.

On the MGS/MDS I also have:
[root at mds1 ~]# cat /proc/fs/lustre/mds/crew3-MDT0000/recovery_status
status: INACTIVE

lctl can successfully ping the OSS, and the OSTs appear correctly in lctl dl.
The peer list on the MGS/MDS (which is still successfully serving other
disks) appears normal:
[root at mds1 ~]# cat /proc/sys/lnet/peers
nid                      refs state   max   rtr   min    tx   min queue
0 at lo                        1  ~rtr     0     0     0     0     0 0
172.16.0.15 at o2ib            1  ~rtr     8     8     8     8     7 0
172.18.1.1 at o2ib             1  ~rtr     8     8     8     8  -527 0
172.18.1.2 at o2ib             1  ~rtr     8     8     8     8  -261 0
172.18.0.9 at o2ib             1  ~rtr     8     8     8     8     7 0
172.18.0.10 at o2ib            1  ~rtr     8     8     8     8     6 0
172.18.0.11 at o2ib            1  ~rtr     8     8     8     8  -239 0
172.18.0.12 at o2ib            1  ~rtr     8     8     8     8    -2 0
172.18.0.13 at o2ib            1  ~rtr     8     8     8     8    -4 0
172.18.0.14 at o2ib            1  ~rtr     8     8     8     8    -4 0
172.18.0.15 at o2ib            1  ~rtr     8     8     8     8   -42 0
172.18.0.16 at o2ib            1  ~rtr     8     8     8     8     7 0
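
For completeness, those checks were just the usual ones run from the MGS/MDS;
the NID below is simply one of the peers listed above, used as an example:

  lctl ping 172.18.0.9@o2ib     # LNET-level ping of a peer NID
  lctl dl                       # list local obd devices and their status
  cat /proc/sys/lnet/peers      # the peer table shown above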

With this information I searched Google for the error and found
http://lustre.sev.net.ua/changeset/119/trunk/lustre.
The changeset page is timestamped 3/12/08 by author shadow, with the info below:

trunk/lustre/ChangeLog  (r100 -> r119)

Severity   : major
Frequency  : frequent on X2 node
Bugzilla   : 15010
Description: mdc_set_open_replay_data LBUG
Details    : Set replay data for requests that are eligible for replay.

Severity   : normal
Bugzilla   : 14321
Description: lustre_mgs: operation 101 on unconnected MGS
Details    : When MGC is disconnected from MGS long enough, MGS will evict
             the MGC, and later on MGC cannot successfully connect to MGS,
             producing a lot of error messages complaining that MGS is not
             connected.

Severity   : major
Frequency  : on start mds
Bugzilla   : 14884


Okay.  I am still running 2.6.18-53.1.13.el5_lustre.1.6.4.3smp.  Is there a
way to get the MGS/MDS to once again access the OSTs associated with the
MDT?
The OSS box looks perfectly fine (minus one CPU); all of the errors appear
on the MGS/MDS box.  The Lustre file system will not mount on any of my
clients.  The only message is "mount.lustre: mount ic-mds1 at o2ib:/crew3 at
/crew3 failed: Transport endpoint is not connected".
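
The client mount I am attempting is just the normal one (the mount point
already exists on the client):

  mount -t lustre ic-mds1@o2ib:/crew3 /crew3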

Suggestions and advice are greatly appreciated.  Do I just have to wait a
long time for the disk to "find itself"?  Using lctl device xx and then
activate did not help.
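
What I tried there, in lctl's interactive mode, looked roughly like this (the
device number is whatever "lctl dl" reports for the inactive OST device, not
an actual value from my system):

  lctl
  lctl > device 7
  lctl > activate
  lctl > quit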


megan


