[Lustre-discuss] MGS/MDS error: operation 101 on unconnected MGS
Wojciech Turek
wjt27 at cam.ac.uk
Tue Sep 30 10:07:38 PDT 2008
Hi,
Can you please run the following command on the client (crew01) and
paste the output here?
lctl ping ic-mds1 at o2ib
Cheers,
Wojciech
megan wrote:
> More information:
>
> I tried rebooting my MGS/MDS to see if that would solve my inability
> to mount this disk on a client.
> No, it did not.
>
> However, I did try the following that might offer some insight to a
> wise Lustre person. On the MGS/MDS I did the following:
> [root at mds1 tmp.BKUP]# mount -vv -t lustre ic-mds1 at o2ib:/crew3 /tmp.BKUP/crew3
> arg[0] = /sbin/mount.lustre
> arg[1] = -v
> arg[2] = -o
> arg[3] = rw
> arg[4] = ic-mds1 at o2ib:/crew3
> arg[5] = /tmp.BKUP/crew3
> source = ic-mds1 at o2ib:/crew3 (172.18.0.10 at o2ib:/crew3), target = /tmp.BKUP/crew3
> options = rw
> mounting device 172.18.0.10 at o2ib:/crew3 at /tmp.BKUP/crew3, flags=0
> options=device=172.18.0.10 at o2ib:/crew3
> warning: 172.18.0.10 at o2ib:/crew3: cannot resolve: No such file or directory
> /sbin/mount.lustre: unable to set tunables for 172.18.0.10 at o2ib:/crew3 (may cause reduced IO performance)
> [root at mds1 tmp.BKUP]# df -h
> Filesystem             Size  Used Avail Use% Mounted on
> /dev/sda2               29G   20G  7.5G  73% /
> /dev/sda1              145M   17M  121M  13% /boot
> tmpfs                 1006M     0 1006M   0% /dev/shm
> /dev/sda5               42G  6.2G   34G  16% /srv/lustre_admin
> /dev/METADATA1/LV1     204G  5.3G  190G   3% /srv/lustre/mds/crew2-MDT0000
> /dev/sdf                58G  2.5G   52G   5% /srv/lustre/mds/crew8-MDT0000
> /dev/md0               204G  4.4G  188G   3% /srv/lustre/mds/crew3-MDT0000
> ic-mds1 at o2ib:/crew3   19T   17T  1.8T  91% /tmp.BKUP/crew3
> [root at mds1 tmp.BKUP]# ls /crew3
> ls: /crew3: No such file or directory
> [root at mds1 tmp.BKUP]# cd /tmp.BKUP/crew3
> [root at mds1 crew3]# ls
> data users
> [root at mds1 crew3]# ls users
> asahoo hongbo kristi larkoc qiaox roshan tugrul yluo
> [root at mds1 crew3]# cat /proc/fs/lustre/mds/crew3-MDT0000/recovery_status
> status: INACTIVE
>
> Whoa! The files are there; the disk is there. The recovery_status
> is still "INACTIVE". Can this be correct? The disk seems usable on
> the MGS/MDS (which has no user access *at all*).
>
> So on a client I did very similar:
> [root at crew01 ~]# mount -vv -t lustre ic-mds1 at o2ib:/crew3 /crew3
> arg[0] = /sbin/mount.lustre
> arg[1] = -v
> arg[2] = -o
> arg[3] = rw
> arg[4] = ic-mds1 at o2ib:/crew3
> arg[5] = /crew3
> source = ic-mds1 at o2ib:/crew3 (172.18.0.10 at o2ib:/crew3), target = /crew3
> options = rw
> mounting device 172.18.0.10 at o2ib:/crew3 at /crew3, flags=0
> options=device=172.18.0.10 at o2ib:/crew3
> warning: 172.18.0.10 at o2ib:/crew3: cannot resolve: No such file or directory
> /sbin/mount.lustre: unable to set tunables for 172.18.0.10 at o2ib:/crew3 (may cause reduced IO performance)
> mount.lustre: mount ic-mds1 at o2ib:/crew3 at /crew3 failed: Transport endpoint is not connected
>
> Still the same error message "Transport endpoint is not connected".
>
> How can other disks on the same MGS/MDS and same IB switch work on the
> client and this one particular disk mount only on the MGS/MDS and not
> on any client?
>
> What do I need to do to fix it?
>
> Thank you,
> megan
>
>
> On Sep 29, 11:34 am, "Ms. Megan Larko" <dobsonu... at gmail.com> wrote:
>
>> Greetings,
>>
>> It's Monday (sigh). I lost one dual-core Opteron 275 of two on my OSS
>> box over the weekend. The /var/log/messages contained many "bus error
>> on processor" messages. So Monday I rebooted the OSS with only one
>> dual core CPU. The box came up just fine and I mounted the three
>> lustre OST disks I have on that box. (CentOS 5
>> 2.6.18-53.1.13.el5_lustre.1.6.4.3smp #1 SMP Sun Feb 17 08:38:44 EST
>> 2008 x86_64 x86_64 x86_64 GNU/Linux)
>>
>> The problem is now my MGS/MDS box cannot access/use the disks on that
>> box. The single MDT volume mounts without error but I see the
>> following messages in the MGS/MDS /var/log/messages file:
>>
>> Sep 29 10:02:40 mds1 kernel: Lustre: MDT crew3-MDT0000 now serving dev
>> (crew3-MDT0000/be7a58cd-e259-823f-486b-e974551d7ad6) with recovery
>> enabled
>> Sep 29 10:02:40 mds1 kernel: Lustre: Server crew3-MDT0000 on device
>> /dev/md0 has started
>> Sep 29 10:02:40 mds1 kernel: Lustre: MDS crew3-MDT0000: crew3d1_UUID
>> now active, resetting orphans
>> Sep 29 10:02:40 mds1 kernel: Lustre: Skipped 2 previous similar messages
>> Sep 29 10:03:29 mds1 kernel: LustreError:
>> 26914:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on
>> unconnected MGS
>> Sep 29 10:03:29 mds1 kernel: LustreError:
>> 26914:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error
>> (-107) req at ffff81005986d450 x17040407/t0 o101-><?>@<?>:-1 lens 232/0
>> ref 0 fl Interpret:/0/0 rc -107/0
>> Sep 29 10:03:29 mds1 kernel: LustreError:
>> 26915:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 501 on
>> unconnected MGS
>> Sep 29 10:03:29 mds1 kernel: LustreError:
>> 26915:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error
>> (-107) req at ffff810055ffac50 x17040408/t0 o501-><?>@<?>:-1 lens 200/0
>> ref 0 fl Interpret:/0/0 rc -107/0
>>
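The "rc -107" in the log lines above and the client's "Transport endpoint is not connected" look like the same error seen from two sides. A minimal Python sketch for decoding such a return code, assuming it is a negated Linux errno (the helper name decode_rc is made up for illustration and is not part of any Lustre tool):

```python
import errno
import os


def decode_rc(rc):
    """Map a negative kernel-style return code to (errno name, message).

    Assumption: rc is a negated errno value, as in "rc -107/0" above.
    """
    code = abs(rc)
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # On Linux, 107 is ENOTCONN; other platforms number errno differently.
    print(decode_rc(-107))
```

Run on a Linux host, this should name the same condition that mount.lustre reports on the client.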
>> The messages do not repeat.
>>
>> On MGS/MDS I also have
>> [root at mds1 ~]# cat /proc/fs/lustre/mds/crew3-MDT0000/recovery_status
>> status: INACTIVE
>>
>> lctl can successfully ping the OSS, and the OSTs appear correctly in
>> lctl dl. The peers on the MGS/MDS (which is still successfully
>> serving other disks) appear normal.
>> [root at mds1 ~]# cat /proc/sys/lnet/peers
>> nid                  refs state  max  rtr  min   tx  min queue
>> 0 at lo                 1  ~rtr    0    0    0    0    0     0
>> 172.16.0.15 at o2ib     1  ~rtr    8    8    8    8    7     0
>> 172.18.1.1 at o2ib      1  ~rtr    8    8    8    8 -527     0
>> 172.18.1.2 at o2ib      1  ~rtr    8    8    8    8 -261     0
>> 172.18.0.9 at o2ib      1  ~rtr    8    8    8    8    7     0
>> 172.18.0.10 at o2ib     1  ~rtr    8    8    8    8    6     0
>> 172.18.0.11 at o2ib     1  ~rtr    8    8    8    8 -239     0
>> 172.18.0.12 at o2ib     1  ~rtr    8    8    8    8   -2     0
>> 172.18.0.13 at o2ib     1  ~rtr    8    8    8    8   -4     0
>> 172.18.0.14 at o2ib     1  ~rtr    8    8    8    8   -4     0
>> 172.18.0.15 at o2ib     1  ~rtr    8    8    8    8  -42     0
>> 172.18.0.16 at o2ib     1  ~rtr    8    8    8    8    7     0
>>
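Several peers in the table above show a negative value in the second "min" column. A rough Python sketch for picking those out, assuming (as I read the LNet documentation) that this column is the lowest tx credit count seen for the peer and that a negative value means sends had to queue at some point. The function names and the SAMPLE text are illustrative only; the parser also undoes the list archive's habit of rewriting "@" as " at ":

```python
def parse_peers(text):
    """Parse /proc/sys/lnet/peers-style text into a list of dicts."""
    names = ["refs", "max", "rtr", "rtr_min", "tx", "tx_min", "queue"]
    peers = []
    for raw in text.splitlines():
        # Drop mail quoting and undo the archive's "@" -> " at " rewriting.
        fields = raw.lstrip("> ").replace(" at ", "@").split()
        if len(fields) != 9 or fields[0] == "nid":
            continue  # header, blank, or unrecognised line
        nid, refs, state = fields[0], fields[1], fields[2]
        values = [int(refs)] + [int(v) for v in fields[3:]]
        peer = dict(zip(names, values))
        peer["nid"] = nid
        peer["state"] = state
        peers.append(peer)
    return peers


def queued_peers(peers):
    """Return NIDs whose tx credit low-water mark went negative."""
    return [p["nid"] for p in peers if p["tx_min"] < 0]


# Three rows copied from the table above, plus its header.
SAMPLE = """\
nid refs state max rtr min tx min queue
172.18.0.10 at o2ib 1 ~rtr 8 8 8 8 6 0
172.18.1.1 at o2ib 1 ~rtr 8 8 8 8 -527 0
172.18.0.12 at o2ib 1 ~rtr 8 8 8 8 -2 0
"""

if __name__ == "__main__":
    print(queued_peers(parse_peers(SAMPLE)))
```

A negative low-water mark is a historical figure, so on its own it does not show that a peer is unreachable now.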
>> With this information I Google searched on the error and I found
>> http://lustre.sev.net.ua/changeset/119/trunk/lustre.
>> The page was timestamped 3/12/08 by Author shadow with the info below:
>>
>> trunk/lustre/ChangeLog
>> r100 r119
>>
>> Severity   : major
>> Frequency  : frequent on X2 node
>> Bugzilla   : 15010
>> Description: mdc_set_open_replay_data LBUG
>> Details    : Set replay data for requests that are eligible for replay.
>>
>> Severity   : normal
>> Bugzilla   : 14321
>> Description: lustre_mgs: operation 101 on unconnected MGS
>> Details    : When MGC is disconnected from MGS long enough, MGS will evict the
>>              MGC, and late on MGC cannot successfully connect to MGS and a lot
>>              of the error messages complaining that MGS is not connected.
>>
>> Severity   : major
>> Frequency  : on start mds
>> Bugzilla   : 14884
>>
>> Okay. I am still running 2.6.18-53.1.13.el5_lustre.1.6.4.3smp, is
>> there a way in which to get the MGS/MDS to once again access the OSTs
>> associated with the MDT?
>> The OSS box looks perfectly fine (minus one CPU). All the errors
>> appear on the MGS/MDS box. The lustre disk will not mount on any of
>> my clients. The message "mount.lustre: mount ic-mds1 at o2ib:/crew3 at
>> /crew3 failed: Transport endpoint is not connected" is all that
>> occurs.
>>
>> Suggestions and advice greatly appreciated. Do I just have to wait a
>> long time to let the disk "find itself"? Using lctl device xx and
>> activate did not help.
>>
>> megan
>> _______________________________________________
>> Lustre-discuss mailing list
>> Lustre-disc... at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>
--
Wojciech Turek
Assistant System Manager
High Performance Computing Service
University of Cambridge
Email: wjt27 at cam.ac.uk
Tel: (+)44 1223 763517