[Lustre-discuss] Critical Situation -- trying to remove a badly configured OST

Roger Spellman roger at terascala.com
Thu Jun 4 14:29:56 PDT 2009


I wonder if anyone can help me.  I'm in a bit of a rush, because my file
system is down, and I can't seem to get it back up.  I have a lot of
users who need their data ASAP.

 

As of yesterday, I had a 5-node system.  Node 1 was an MGS and MDT.
Nodes 2-5 were OSTs.  The network is both IB and tcp.

 

Today, I added another OST.  Unfortunately, I messed up the --mgsnode
option, and I only set it up for IB.  We noticed this when the IB
clients were OK, but the TCP clients were not.  Then, rather than try to
change that (as I should have), I just reformatted that OST.  That left
me in a bad spot:  OST-0004 was missing.  The device list from the MDT
looked as follows:

 

>   3 UP lov lstr-ter-mdtlov lstr-ter-mdtlov_UUID 4
>   4 UP mds lstr-ter-MDT0000 lstr-ter-MDT0000_UUID 437
>   5 UP osc lstr-ter-OST0000-osc lstr-ter-mdtlov_UUID 5
>   6 UP osc lstr-ter-OST0001-osc lstr-ter-mdtlov_UUID 5
>   7 UP osc lstr-ter-OST0002-osc lstr-ter-mdtlov_UUID 5
>   8 UP osc lstr-ter-OST0003-osc lstr-ter-mdtlov_UUID 5
>   9 UP osc lstr-ter-OST0004-osc lstr-ter-mdtlov_UUID 5
>  10 UP osc lstr-ter-OST0005-osc lstr-ter-OST0004-osc-mdtlov_UUID 4
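
To make the odd entry easier to spot, here is a minimal Python sketch that parses a device list like the one above and reports any osc device whose UUID differs from the UUID shared by the others. It assumes the six-column layout shown here (index, status, type, name, UUID, reference count); the function names are hypothetical helpers, not part of any Lustre tool.

```python
# Device list as printed on the MDT (quote markers removed).
DEVICE_LIST = """\
  3 UP lov lstr-ter-mdtlov lstr-ter-mdtlov_UUID 4
  4 UP mds lstr-ter-MDT0000 lstr-ter-MDT0000_UUID 437
  5 UP osc lstr-ter-OST0000-osc lstr-ter-mdtlov_UUID 5
  6 UP osc lstr-ter-OST0001-osc lstr-ter-mdtlov_UUID 5
  7 UP osc lstr-ter-OST0002-osc lstr-ter-mdtlov_UUID 5
  8 UP osc lstr-ter-OST0003-osc lstr-ter-mdtlov_UUID 5
  9 UP osc lstr-ter-OST0004-osc lstr-ter-mdtlov_UUID 5
 10 UP osc lstr-ter-OST0005-osc lstr-ter-OST0004-osc-mdtlov_UUID 4
"""


def parse_devices(text):
    """Parse device-list lines into dictionaries (assumes six columns)."""
    devices = []
    for line in text.splitlines():
        # Drop any leading mail quote marker and padding before splitting.
        fields = line.lstrip("> ").split()
        if len(fields) != 6:
            continue
        index, status, dev_type, name, uuid, refs = fields
        devices.append({
            "index": int(index),
            "status": status,
            "type": dev_type,
            "name": name,
            "uuid": uuid,
            "refs": int(refs),
        })
    return devices


def odd_osc_names(devices, expected_uuid):
    """Return names of osc devices whose UUID is not the expected one."""
    return [d["name"] for d in devices
            if d["type"] == "osc" and d["uuid"] != expected_uuid]


if __name__ == "__main__":
    print(odd_osc_names(parse_devices(DEVICE_LIST), "lstr-ter-mdtlov_UUID"))
```

On the list above this reports only lstr-ter-OST0005-osc.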

 

 

OST-0004 was the "bad" one, and OST0005 was its replacement.  Why is the
UUID so messed up?

In any case, I just deactivated OST-0004 using:

lctl conf_param lstr-ter-0ST0004.osc.active=0

 

That did not solve anything.
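
One thing that is easy to get wrong in a conf_param argument is the target name itself: a digit 0 in place of the letter O, as in the command as typed above, would not name an existing target. Below is a minimal Python sketch of a sanity check. It assumes targets are named <fsname>-OSTxxxx or <fsname>-MDTxxxx with a four-hex-digit index, and the helper names are hypothetical.

```python
import re

# Assumed naming pattern: <fsname>-OSTxxxx or <fsname>-MDTxxxx,
# where xxxx is a four-digit hexadecimal index.
TARGET_RE = re.compile(r"^[A-Za-z0-9_-]+-(OST|MDT)[0-9a-fA-F]{4}$")


def target_of(conf_param_arg):
    """Return the target part of '<target>.<subsystem>.<param>=<value>'."""
    return conf_param_arg.split(".", 1)[0]


def looks_like_target(conf_param_arg):
    """True if the target part matches the assumed naming pattern."""
    return TARGET_RE.match(target_of(conf_param_arg)) is not None


if __name__ == "__main__":
    for arg in ("lstr-ter-OST0004.osc.active=0",
                "lstr-ter-0ST0004.osc.active=0"):
        print(arg, looks_like_target(arg))
```

This only checks the shape of the name; it cannot tell whether the target is actually registered with the MGS.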

 

I even deactivated OST-0005, trying to get back to where I was
yesterday.

 

I've rebooted my whole system, but can't even mount the MDT.  When I
try, I get the following errors:

 

Jun  4 14:28:54 ts-nrel-01 sshd(pam_unix)[6717]: session opened for user root by (uid=0)
Jun  4 14:28:55 ts-nrel-01 kernel: LustreError: 137-5: UUID 'lstr-ter-MDT0000_UUID' is not available for connect (stopping)
Jun  4 14:28:56 ts-nrel-01 kernel: Lustre: Request x19 sent from lstr-ter-OST0000-osc to NID 172.16.103.22@tcp 5s ago has timed out (limit 5s).
Jun  4 14:28:56 ts-nrel-01 kernel: Lustre: Request x20 sent from lstr-ter-OST0001-osc to NID 172.16.103.23@tcp 5s ago has timed out (limit 5s).
Jun  4 14:28:56 ts-nrel-01 kernel: Lustre: Failing over lstr-ter-OST0000-osc
Jun  4 14:29:03 ts-nrel-01 kernel: LustreError: 137-5: UUID 'lstr-ter-MDT0000_UUID' is not available for connect (stopping)
Jun  4 14:29:13 ts-nrel-01 last message repeated 3 times
Jun  4 14:29:13 ts-nrel-01 kernel: LustreError: Skipped 1 previous similar message
Jun  4 14:29:16 ts-nrel-01 kernel: Lustre: Request x23 sent from lstr-ter-OST0004-osc to NID 172.17.103.27@o2ib 25s ago has timed out (limit 25s).
Jun  4 14:29:16 ts-nrel-01 kernel: Lustre: Skipped 2 previous similar messages
Jun  4 14:29:16 ts-nrel-01 kernel: Lustre: Failing over lstr-ter-OST0004-osc
Jun  4 14:29:16 ts-nrel-01 kernel: Lustre: Skipped 3 previous similar messages
Jun  4 14:29:16 ts-nrel-01 kernel: Lustre: lstr-ter-MDT0000: shutting down for failover; client state will be preserved.
Jun  4 14:29:16 ts-nrel-01 kernel: Lustre: MDT lstr-ter-MDT0000 has stopped.
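
To see at a glance which targets the MDT could not reach, the timed-out requests can be pulled out of the log. This is a minimal Python sketch; the regular expression is an assumption based only on the message wording in the lines above, and the function name is hypothetical.

```python
import re

# Pattern based on the "Request ... has timed out" messages shown above.
TIMEOUT_RE = re.compile(
    r"Request x(\d+) sent from (\S+) to NID (\S+) (\d+)s ago has timed out"
)

SAMPLE_LOG = """\
Jun  4 14:28:56 ts-nrel-01 kernel: Lustre: Request x19 sent from lstr-ter-OST0000-osc to NID 172.16.103.22@tcp 5s ago has timed out (limit 5s).
Jun  4 14:28:56 ts-nrel-01 kernel: Lustre: Request x20 sent from lstr-ter-OST0001-osc to NID 172.16.103.23@tcp 5s ago has timed out (limit 5s).
Jun  4 14:28:56 ts-nrel-01 kernel: Lustre: Failing over lstr-ter-OST0000-osc
Jun  4 14:29:16 ts-nrel-01 kernel: Lustre: Request x23 sent from lstr-ter-OST0004-osc to NID 172.17.103.27@o2ib 25s ago has timed out (limit 25s).
"""


def timed_out_targets(log_text):
    """Return (osc device, NID) pairs for every timed-out request."""
    return [(m.group(2), m.group(3)) for m in TIMEOUT_RE.finditer(log_text)]


if __name__ == "__main__":
    for device, nid in timed_out_targets(SAMPLE_LOG):
        print(device, nid)
```

On the lines shown here it lists OST0000 and OST0001 over tcp and OST0004 over o2ib.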

 

 

If it helps to see the output of tunefs.lustre, here it is for the MGS,
the MDT, and the first OST:

 

checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

   Read previous values:
Target:     MGS
Index:      unassigned
Lustre FS:  lstr-ter
Mount type: ldiskfs
Flags:      0x174
              (MGS needs_index first_time update writeconf )
Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
Parameters:

   Permanent disk data:
Target:     MGS
Index:      unassigned
Lustre FS:  lstr-ter
Mount type: ldiskfs
Flags:      0x174
              (MGS needs_index first_time update writeconf )
Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
Parameters:

Writing CONFIGS/mountdata

 

 

checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

   Read previous values:
Target:     lstr-ter-MDT0000
Index:      0
Lustre FS:  lstr-ter
Mount type: ldiskfs
Flags:      0x1
              (MDT )
Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
Parameters: mgsnode=172.16.103.21@tcp,172.17.103.21@o2ib

   Permanent disk data:
Target:     lstr-ter-MDT0000
Index:      0
Lustre FS:  lstr-ter
Mount type: ldiskfs
Flags:      0x1
              (MDT )
Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
Parameters: mgsnode=172.16.103.21@tcp,172.17.103.21@o2ib

 

 

checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

   Read previous values:
Target:     lstr-ter-OST0000
Index:      0
Lustre FS:  lstr-ter
Mount type: ldiskfs
Flags:      0x2
              (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=172.16.103.21@tcp,172.17.103.21@o2ib

   Permanent disk data:
Target:     lstr-ter-OST0000
Index:      0
Lustre FS:  lstr-ter
Mount type: ldiskfs
Flags:      0x2
              (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=172.16.103.21@tcp,172.17.103.21@o2ib

Writing CONFIGS/mountdata
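
Since the original mistake was an mgsnode setting that covered only IB, a check of the Parameters line for each target may be useful. The following is a minimal Python sketch, assuming the mgsnode value is a comma-separated list of NIDs of the form address@network, as in the output above. The function names and the IB-only example line are hypothetical; the output of the misconfigured OST is not shown here.

```python
def mgs_networks(parameters_line):
    """Return the set of network names found in mgsnode NIDs."""
    networks = set()
    # Drop the leading "Parameters:" label, then scan key=value entries.
    _, _, params = parameters_line.partition(":")
    for param in params.split():
        key, _, value = param.partition("=")
        if key != "mgsnode":
            continue
        for nid in value.split(","):
            _address, _, network = nid.partition("@")
            if network:
                networks.add(network)
    return networks


def missing_networks(parameters_line, required=("tcp", "o2ib")):
    """Return the required networks that have no mgsnode NID."""
    present = mgs_networks(parameters_line)
    return [net for net in required if net not in present]


if __name__ == "__main__":
    good = "Parameters: mgsnode=172.16.103.21@tcp,172.17.103.21@o2ib"
    # Hypothetical line for a target formatted with an IB-only mgsnode.
    ib_only = "Parameters: mgsnode=172.17.103.21@o2ib"
    print(missing_networks(good))
    print(missing_networks(ib_only))
```

The required networks default to tcp and o2ib because those are the two in use on this system; other setups would pass their own list.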

 

 

My goal for right now is to get something (even without the new OST) up
and running ASAP, as my users are putting great pressure on me.  If you
can help, I'd greatly appreciate it.

 

 
