[Lustre-discuss] lustre_mgs: operation ... on unconnected MGS

Reto Gantenbein reto.gantenbein at id.unibe.ch
Tue Apr 29 12:06:09 PDT 2008


Dear Lustre users,

I have set up a Lustre file system with 7 OSTs (Fibre Channel RAIDs)
and a combined MGS/MDT, served by two nodes. One node has the MGS/MDT
and 3 OSTs mounted; the other has the remaining 4 OSTs. The servers run
a Lustre-patched vanilla 2.6.18 kernel. The clients are patchless and
run a Gentoo 2.6.22 kernel. Lustre 1.6.4.3 was compiled from source
under Gentoo Linux.

The two nodes are called lustre01 and lustre02.

I formatted the combined MGS/MDT on lustre01 with:
mkfs.lustre --mgs --mdt --fsname=homefs --failnode=lustre02@tcp --reformat /dev/sdb

Then I mounted it and formatted the OSTs, also on lustre01, with:
mkfs.lustre --ost --mgsnode=lustre01@tcp --mgsnode=lustre02@tcp --fsname=homefs --failnode=lustre02@tcp --index=1 /dev/sdc

and so on...

Is there already a fundamental mistake in this setup?

The OSTs are distributed across both servers to increase bandwidth and
to allow failover. All OSTs and the MGS/MDT are attached to both
servers, but each is mounted on only one of them.

Now to my problem:
When I mount the file system from a client with IP address 10.1.1.65,
the following messages appear in the system log:

 lustre01 LustreError: 13533:0:(handler.c:148:mds_sendpage()) @@@ bulk failed: timeout 0(4096), evicting 87fb775c-8f64-5d85-2a95-8fb595e62892@NET_0x200000a010141_UUID
 lustre01 req@ffff81011dc72e00 x2483/t0 o37->87fb775c-8f64-5d85-2a95-8fb595e62892@NET_0x200000a010141_UUID:-1 lens 296/296 ref 0 fl Interpret:/0/0 rc 0/0

 lustre01 LustreError: 13469:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107)  req@ffff81011d704a00 x2479/t0 o400-><?>@<?>:-1 lens 128/0 ref 0 fl Interpret:/0/0 rc -107/0

 lustre01 LustreError: 13469:0:(handler.c:1499:mds_handle()) operation 400 on unconnected MDS from 12345-10.1.1.65@tcp

 lustre01 LustreError: 13535:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS

 lustre01 LustreError: 13535:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 501 on unconnected MGS

I have already searched the net for answers, without much success: I
cannot find out what these messages mean or what causes them.
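The numeric fields in these messages can at least be unpacked
mechanically. Below is a small Python sketch that does this. The NID
bit layout (high 32 bits: network type and number, low 32 bits: address),
the LND number for tcp, and the opcode and errno values are my reading of
the Lustre 1.6 headers (lnet/types.h, lustre_idl.h) and Linux errno
numbering, so treat them as assumptions to verify against your own
source tree. `decode_nid` and `explain` are hypothetical helper names.

```python
"""Decode the numeric fields in the LustreError lines above.

Assumptions (verify against your own Lustre 1.6 source tree): NID bit
layout, LND type numbers, RPC opcode values and Linux errno numbers.
decode_nid and explain are hypothetical helper names.
"""
import re

# LND type numbers (assumed); only a few common ones are listed.
LND_NAMES = {2: "tcp", 5: "o2ib", 9: "lo"}

# RPC opcodes that appear in the log above (assumed from lustre_idl.h).
OPCODES = {
    37: "MDS_READPAGE",
    101: "LDLM_ENQUEUE",
    400: "OBD_PING",
    501: "LLOG_ORIGIN_HANDLE_CREATE",
}

# Linux errno values reported as negative rc codes (assumed Linux numbering).
ERRNOS = {107: "ENOTCONN"}


def decode_nid(hex_nid: str) -> str:
    """Turn the hex number from a NET_0x..._UUID string into addr@net form."""
    nid = int(hex_nid, 16)
    addr = nid & 0xFFFFFFFF         # low 32 bits: the IPv4 address for tcp
    net = (nid >> 32) & 0xFFFFFFFF  # high 32 bits: LND type and net number
    lnd, num = net >> 16, net & 0xFFFF
    ip = ".".join(str((addr >> shift) & 0xFF) for shift in (24, 16, 8, 0))
    name = LND_NAMES.get(lnd, f"lnd{lnd}")
    # Network number 0 is conventionally printed without a suffix ("tcp").
    return f"{ip}@{name}" if num == 0 else f"{ip}@{name}{num}"


def explain(line: str) -> list[str]:
    """Return readable notes for the NIDs, opcodes and error codes in a line."""
    notes = []
    for m in re.finditer(r"NET_(0x[0-9a-fA-F]+)_UUID", line):
        notes.append(f"NID {m.group(1)} -> {decode_nid(m.group(1))}")
    # Opcodes appear either as "o37->" (request dumps) or "operation 101".
    for m in re.finditer(r"\bo(\d+)->|operation (\d+)", line):
        op = int(m.group(1) or m.group(2))
        notes.append(f"opcode {op} -> {OPCODES.get(op, 'unknown')}")
    # Negative return codes appear as "(-107)" or "rc -107/0"; deduplicate.
    codes = {int(a or b) for a, b in re.findall(r"\((-\d+)\)|rc (-\d+)", line)}
    for rc in sorted(codes):
        notes.append(f"rc {rc} -> {ERRNOS.get(-rc, 'unknown')}")
    return notes


if __name__ == "__main__":
    line = ("@@@ processing error (-107)  req@ffff81011d704a00 x2479/t0 "
            "o400-><?>@<?>:-1 lens 128/0 ref 0 fl Interpret:/0/0 rc -107/0")
    for note in explain(line):
        print(note)
```

If this decoding is right, the evicted client NET_0x200000a010141 is
10.1.1.65@tcp, i.e. the same client that mounted the file system.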

It may also help to show you my device lists:

lustre01:
lctl > device_list
  0 UP mgs MGS MGS 11
  1 UP mgc MGC10.1.140.2@tcp 89b4c0f0-c602-0857-c22e-ed232d8ad7aa 5
  2 UP mdt MDS MDS_uuid 3
  3 UP lov homefs-mdtlov homefs-mdtlov_UUID 4
  4 UP mds homefs-MDT0000 homefs-MDT0000_UUID 5
  5 UP osc homefs-OST0001-osc homefs-mdtlov_UUID 5
  6 UP osc homefs-OST0004-osc homefs-mdtlov_UUID 5
  7 UP osc homefs-OST0005-osc homefs-mdtlov_UUID 5
  8 UP osc homefs-OST0002-osc homefs-mdtlov_UUID 5
  9 UP osc homefs-OST0003-osc homefs-mdtlov_UUID 5
 10 UP osc homefs-OST0006-osc homefs-mdtlov_UUID 5
 11 UP osc homefs-OST0007-osc homefs-mdtlov_UUID 5
 12 UP mgc MGC10.1.140.1@tcp c8ad2ab0-9eef-b334-37af-85734b53ac94 5
 13 UP ost OSS OSS_uuid 3
 14 UP obdfilter homefs-OST0001 homefs-OST0001_UUID 7
 15 UP obdfilter homefs-OST0004 homefs-OST0004_UUID 7
 16 UP obdfilter homefs-OST0005 homefs-OST0005_UUID 7

lustre02:
lctl > device_list
  0 UP mgc MGC10.1.140.1@tcp 6154baf3-e830-81d9-ff6c-451d107650c1 5
  1 UP ost OSS OSS_uuid 3
  2 UP obdfilter homefs-OST0002 homefs-OST0002_UUID 7
  3 UP obdfilter homefs-OST0003 homefs-OST0003_UUID 7
  4 UP obdfilter homefs-OST0006 homefs-OST0006_UUID 7
  5 UP obdfilter homefs-OST0007 homefs-OST0007_UUID 7
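Checking the two listings by eye is error-prone, so here is a minimal
Python sketch that cross-checks them: every `osc` device on the MDS node
should have a matching `obdfilter` on one of the servers, and every
device should be UP. It assumes the column layout shown above (index,
status, type, name, uuid, refcount) and the "<fsname>-OSTxxxx-osc"
naming seen in the lustre01 list; `parse_device_list` and
`check_cluster` are hypothetical names.

```python
"""Cross-check `lctl device_list` output from several Lustre servers.

A sketch, assuming the column layout index, status, type, name, uuid,
refcount. parse_device_list and check_cluster are hypothetical names.
"""


def parse_device_list(text: str) -> list[tuple[str, str, str]]:
    """Return (status, type, name) for each device line."""
    devices = []
    for line in text.splitlines():
        fields = line.split()
        # Device lines start with a numeric index; skip prompts and blanks.
        if len(fields) >= 6 and fields[0].isdigit():
            devices.append((fields[1], fields[2], fields[3]))
    return devices


def check_cluster(mds_text: str, *oss_texts: str) -> dict[str, list[str]]:
    """Report devices that are not UP and OSTs with no serving obdfilter."""
    parsed = [dev for text in (mds_text, *oss_texts)
              for dev in parse_device_list(text)]
    not_up = sorted(name for status, _, name in parsed if status != "UP")
    # Assumed: the MDS node holds one osc per OST, named "<OST name>-osc".
    expected = {name.removesuffix("-osc")
                for _, typ, name in parse_device_list(mds_text)
                if typ == "osc"}
    served = {name for _, typ, name in parsed if typ == "obdfilter"}
    return {"not_up": not_up, "unserved_osts": sorted(expected - served)}


if __name__ == "__main__":
    # An excerpt of the listings above.
    lustre01 = """\
  5 UP osc homefs-OST0001-osc homefs-mdtlov_UUID 5
  8 UP osc homefs-OST0002-osc homefs-mdtlov_UUID 5
 14 UP obdfilter homefs-OST0001 homefs-OST0001_UUID 7
"""
    lustre02 = """\
  2 UP obdfilter homefs-OST0002 homefs-OST0002_UUID 7
"""
    print(check_cluster(lustre01, lustre02))
```

Applied to the full listings above it should report nothing missing:
OST0001, OST0004 and OST0005 are served by lustre01, and OST0002,
OST0003, OST0006 and OST0007 by lustre02.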


Can someone give me some hints? What is going wrong here?

Kind regards,
Reto Gantenbein