[Lustre-discuss] Problem with write_conf

Tue Aug 3 14:00:31 PDT 2010

Nathan,

I started out with IP addresses of 10.2.9.1 (MDS), 10.2.9.2 (standby
MDS), 10.2.9.3 (OSS), and 10.2.9.4 (peer OSS).  I created a single MDT
and a single OST, using the following commands:

MDS#  mkfs.lustre --reformat --fsname hss2 --device-size=10000 --mgs
--mdt --mkfsoptions=' -O extents,dir_index,uninit_groups'
--mgsnode=10.2.9.1@o2ib0 /dev/mapper/map0

OSS#  mkfs.lustre --reformat --ost --index=0 --mkfsoptions=' -O
extents,dir_index,uninit_groups ' --fsname hss2 --device-size=100000
--mgsnode=10.2.9.1@o2ib0 /dev/mapper/map0

I mounted the servers, mounted a client, and created a few files.  Then I
unmounted the client, unmounted the servers, and rebooted the clients and
servers.

Once the servers were back up, I ran the following on the MDS and OSS,
respectively:

MDS#  tunefs.lustre --erase-param --mgsnode=10.2.9.201@o2ib0
--failnode=10.2.9.202@o2ib0 /dev/mapper/map0

OSS#  tunefs.lustre --erase-param --failnode=10.2.9.204@o2ib0
--mgsnode=10.2.9.201@o2ib0 --mgsnode=10.2.9.202@o2ib0 /dev/mapper/map0
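
One way to double-check what the tunefs.lustre commands above left on disk
is to print the target's stored parameters and pull out the NID-related
ones.  This is only a sketch: the helper name stored_nid_params is
hypothetical, and it assumes that "tunefs.lustre --print" reports the
parameters on a line beginning with "Parameters:" as space-separated
key=value pairs.

```shell
# Hypothetical helper: read "tunefs.lustre --print" output on stdin and list
# the mgsnode= and failover.node= entries found on any "Parameters:" line.
# Assumes the parameters are space-separated key=value pairs.
stored_nid_params() {
    sed -n 's/^Parameters: *//p' \
        | tr ' ' '\n' \
        | grep -E '^(mgsnode|failover\.node)=' \
        | sort -u
}

# Possible usage on a server, with the target unmounted:
#   tunefs.lustre --print /dev/mapper/map0 | stored_nid_params
```

If an old address still shows up in that list, the on-disk parameters were
not replaced as intended.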

Then, I removed last_rcvd from the MDT and OST.

Then, I changed the IP addresses to 10.2.9.201 (MDS), 10.2.9.202 (standby
MDS), 10.2.9.203 (OSS), 10.2.9.204 (peer OSS).

I mounted the MDT and OST.  After a short while, I got the following
errors on the MDS:

Lustre: 4567:0:(client.c:1464:ptlrpc_expire_one_request()) @@@ Request
x1343087831941136 sent from hss2-OST0000-osc to NID 10.2.9.204@o2ib 0s
ago has failed due to network error (5s prior to deadline).

  req@ffff810213b5e400 x1343087831941136/t0
o8->hss2-OST0000_UUID@10.2.9.204@o2ib:28/4 lens 368/584 e 0 to 1 dl
1280868405 ref 1 fl Rpc:N/0/0 rc 0/0

Lustre: 4568:0:(import.c:517:import_select_connection())
hss2-OST0000-osc: tried all connections, increasing latency to 1s

Lustre: 4567:0:(client.c:1464:ptlrpc_expire_one_request()) @@@ Request
x1343087831941137 sent from hss2-OST0000-osc to NID 10.2.9.3@o2ib 6s ago
has timed out (6s prior to deadline).

  req@ffff810213b5e400 x1343087831941137/t0
o8->hss2-OST0000_UUID@10.2.9.3@o2ib:28/4 lens 368/584 e 0 to 1 dl
1280868412 ref 2 fl Rpc:N/0/0 rc 0/0

LustreError: 4567:0:(lib-move.c:2441:LNetPut()) Error sending PUT to
12345-10.2.9.204@o2ib: -113

Note that the old IP address of the OSS (10.2.9.3) is still listed as
a connection for the OST.  How can I change that?
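
To see at a glance which addresses a node is still trying to reach, the
NIDs can be pulled out of messages like the ones above.  A minimal sketch,
assuming the console messages carry the "to NID <address>@<network>" text
shown in these logs; the function name list_target_nids is hypothetical.

```shell
# Hypothetical helper: read Lustre console messages on stdin and print each
# distinct NID that follows the words "to NID".
list_target_nids() {
    grep -oE 'to NID [0-9.]+@[a-z0-9]+' \
        | awk '{print $3}' \
        | sort -u
}

# Possible usage:
#   dmesg | list_target_nids
```

Any address in the output that belongs to the old numbering is one that the
configuration still refers to.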

The client is also seeing old IP addresses, this time the MDS's
10.2.9.1:

Lustre: Request x55 sent from hss2-MDT0000-mdc-ffff81007981d800 to NID
10.2.9.1@o2ib 5s ago has timed out (limit 5s).

Lustre: Skipped 9 previous similar messages

Lustre: 6433:0:(import.c:507:import_select_connection())
hss2-MDT0000-mdc-ffff81007981d800: tried all connections, increasing
latency to 50s

Lustre: 6433:0:(import.c:507:import_select_connection()) Skipped 4
previous similar messages

Any help is appreciated.

Thanks.

-Roger

________________________________

From: Roger Spellman 
Sent: Tuesday, August 03, 2010 4:22 PM
To: 'Nathan Rutman'
Cc: lustre-discuss at lists.lustre.org
Subject: RE: [Lustre-discuss] Problem with write_conf

Nathan,

Thanks.  That works great.

Are there any tricks involved in also making a non-redundant system
redundant at the same time?  E.g. Can I just do:

MDS#  tunefs.lustre --erase-param --mgsnode=10.2.9.201@o2ib0
--failnode=10.2.9.202@o2ib0 /dev/mapper/map0

OSS#  tunefs.lustre --erase-param --failnode=10.2.9.204@o2ib0
--mgsnode=10.2.9.201@o2ib0 --mgsnode=10.2.9.202@o2ib0 /dev/mapper/map0

Is the OSS's NID stored anywhere on the OST?

-Roger

________________________________

From: Nathan Rutman [mailto:nathan.rutman at oracle.com] 
Sent: Tuesday, August 03, 2010 4:05 PM
To: Roger Spellman
Cc: lustre-discuss at lists.lustre.org
Subject: Re: [Lustre-discuss] Problem with write_conf

On Aug 3, 2010, at 12:49 PM, Roger Spellman wrote:

If I change the NIDs, and if I don't remove /mnt/mdt/CONFIGS/*-client,
then I get the following when I try mounting a client (note that
10.2.9.1 is the OLD address):

mount.lustre: mount 10.2.9.1@o2ib:/hss2 at /mnt/lustre-hss2 failed:
Cannot send after transport endpoint shutdown

Don't mount with the old address :)

This is not contained in the config log; this is the MGS address the
client needs to talk to to GET the config log.  It needs to point to the
current IP of the MGS.  Maybe you've stuck this in /etc/fstab or perhaps
your DNS name resolution of the MGS's common name hasn't been updated. 
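
As an illustration of the point above, the client's mount source can be
built from the current MGS addresses rather than the old one.  This is a
sketch only: the helper name mgs_spec is hypothetical, and the addresses
in the usage comments are the new MDS and standby MDS addresses mentioned
elsewhere in this thread, assumed here to be the primary and failover MGS.

```shell
# Hypothetical helper: join one or more MGS NIDs with ':' and append the
# file system name, giving the source string that a Lustre client mounts.
mgs_spec() {
    fsname=$1
    shift
    nids=$(IFS=:; printf '%s' "$*")
    printf '%s:/%s\n' "$nids" "$fsname"
}

# Possible usage on a client:
#   mount -t lustre "$(mgs_spec hss2 10.2.9.201@o2ib 10.2.9.202@o2ib)" /mnt/lustre-hss2
# To look for a stale address left in fstab:
#   grep -n '10\.2\.9\.1@' /etc/fstab
```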
