[Lustre-discuss] [EXTERNAL] how do I deactivate a very wonky OST

Andrus, Brian Contractor bdandrus at nps.edu
Fri Jan 30 14:10:09 PST 2015


Joe,

So I gave that a try. I disabled ib0 so the node would not be on the LNET network.
It tried to mount, threw some errors about being unable to connect to the MGS, and then kernel panicked. Here is what came out of dmesg just before it did that:

LDISKFS-fs (dm-27): mounted filesystem with ordered data mode. quota=on. Opts:
LNetError: 3243:0:(o2iblnd_cb.c:1267:kiblnd_resolve_addr()) Failed to bind to a free privileged port
LNetError: 3243:0:(o2iblnd_cb.c:1321:kiblnd_connect_peer()) Can't resolve addr for 10.100.1.11@o2ib: -99
Lustre: 3243:0:(client.c:1926:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1422655534/real 1422655534]  req@ffff880fc93dcc00 x1491748356358156/t0(0) o250->MGC10.100.1.11@o2ib@10.100.1.11@o2ib:26/25 lens 400/544 e 0 to 1 dl 1422655539 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
LustreError: 20160:0:(client.c:1083:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff880fc9dfbc00 x1491748356358160/t0(0) o253->MGC10.100.1.11@o2ib@10.100.1.11@o2ib:26/25 lens 4768/4768 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
LustreError: 20160:0:(obd_mount_server.c:1165:server_register_target()) WORK-OST0005: error registering with the MGS: rc = -5 (not fatal)
LustreError: 20160:0:(client.c:1083:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff880fc9dfbc00 x1491748356358164/t0(0) o101->MGC10.100.1.11@o2ib@10.100.1.11@o2ib:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
LustreError: 20160:0:(client.c:1083:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff880fc9dfbc00 x1491748356358168/t0(0) o101->MGC10.100.1.11@o2ib@10.100.1.11@o2ib:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
LNetError: 3243:0:(o2iblnd_cb.c:1267:kiblnd_resolve_addr()) Failed to bind to a free privileged port
LNetError: 3243:0:(o2iblnd_cb.c:1321:kiblnd_connect_peer()) Can't resolve addr for 10.100.1.10@o2ib: -99
Lustre: 3243:0:(client.c:1926:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1422655559/real 1422655559]  req@ffff8804100edc00 x1491748356358176/t0(0) o250->MGC10.100.1.11@o2ib@10.100.1.10@o2ib:26/25 lens 400/544 e 0 to 1 dl 1422655564 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
LustreError: 20160:0:(client.c:1083:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff880fc9dfbc00 x1491748356358172/t0(0) o101->MGC10.100.1.11@o2ib@10.100.1.10@o2ib:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
LNetError: 3243:0:(o2iblnd_cb.c:1267:kiblnd_resolve_addr()) Failed to bind to a free privileged port
LNetError: 3243:0:(o2iblnd_cb.c:1321:kiblnd_connect_peer()) Can't resolve addr for 10.100.1.11@o2ib: -99
Lustre: 3243:0:(client.c:1926:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1422655568/real 1422655568]  req@ffff8808156a2c00 x1491748356358184/t0(0) o38->WORK-MDT0000-lwp-OST0005@10.100.1.11@o2ib:12/10 lens 400/544 e 0 to 1 dl 1422655573 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
LustreError: 20160:0:(client.c:1083:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff880fc9dfb000 x1491748356358188/t0(0) o101->MGC10.100.1.11@o2ib@10.100.1.10@o2ib:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
LustreError: 20160:0:(client.c:1083:ptlrpc_import_delay_req()) Skipped 1 previous similar message
[root@nas-spare ~]#
Message from syslogd@nas-spare at Jan 30 14:06:28 ...
kernel:Kernel panic - not syncing: Fatal exception


So it is still causing a kernel panic anytime I mount it....
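
For reading the log above: the trailing numbers look like negated Linux errno values. On Linux, 99 is EADDRNOTAVAIL and 5 is EIO, which would fit a node whose ib0 interface is down rather than say anything about the OST itself. The sketch below is only an illustration; the function names errno_name and extract_codes are made up here, and only the two codes that appear in this log are mapped.

```shell
# Hypothetical helper: translate the two return codes seen in this log
# into Linux errno names. Any other value is reported as unknown.
errno_name() {
    case "${1#-}" in
        5)  echo "EIO (I/O error)" ;;
        99) echo "EADDRNOTAVAIL (cannot assign requested address)" ;;
        *)  echo "unknown" ;;
    esac
}

# Hypothetical helper: read dmesg-style text on stdin and print the distinct
# negative codes that follow "rc = " or "o2ib: ".
extract_codes() {
    grep -oE '(rc = |o2ib: )-[0-9]+' | grep -oE -e '-[0-9]+' | sort -u
}

# Possible use on the affected node (not run here):
#   dmesg | extract_codes | while read -r rc; do echo "$rc: $(errno_name "$rc")"; done
```

Note that the "rc 0/-1" fields at the end of the request lines are deliberately not matched, since they are not preceded by "rc = ".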


Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238



From: Mervini, Joseph A [mailto:jamervi at sandia.gov]
Sent: Thursday, January 29, 2015 2:12 PM
To: Andrus, Brian Contractor
Subject: Re: [EXTERNAL] [Lustre-discuss] how do I deactivate a very wonky OST

Brian,

Have you tried mounting the OST outside of the file system? By that I mean mounting it as if it were not connected to the LNET network. It should mount and just complain about not being able to communicate with the MGS/MDS. Also, if you have quotas turned off on the device and it mounts, I would think that you should be able to re-register it with the overall file system.

Just a thought...

====

Joe Mervini
Sandia National Laboratories
High Performance Computing
505.844.6770
jamervi at sandia.gov



On Jan 23, 2015, at 11:40 AM, Andrus, Brian Contractor <bdandrus at nps.edu> wrote:


Joe,

We are still having the issue. I did run the commands you suggested; that is what got me to the point where I could get MOST of the system back up.

Current state: the 'bad' OST (OST5) is unregistered. When I tried to start an lfsck, it kernel panicked all of the OSSes, and whenever I tried to bring up OST5, that would kernel panic the OSS it was on. So I did a writeconf on everything and then brought everything back up except OST5, to keep it from registering.
Doing the lfs find does not work directly; it throws an error for every file that has anything on the bad OST, so I was able to capture the STDERR to get the info for those files.
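
The stderr capture mentioned here can be sketched roughly as follows. This is a generic shell pattern, not the exact command that was run; the function name capture_stderr, the output file, and the mount point in the comment are hypothetical.

```shell
# Hypothetical helper: run a command, throw away its normal output,
# and keep only what it writes to stderr in the named file.
capture_stderr() {
    out_file=$1
    shift
    "$@" 2>"$out_file" >/dev/null
}

# Possible use on a client (not run here; /mnt/work is an assumed mount point):
#   capture_stderr /tmp/ost5-errors.txt lfs find /mnt/work
```

The error lines would still need to be post-processed to pull out file names, and their exact format is not shown in this thread.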

I haven't given up hope on the OST since it passes e2fsck with no issues. I can mount it as ldiskfs and see everything too. But, as long as it is unavailable, I cannot even delete or unlink files that are on it.
Right now, we have the filesystem up, it is working except any file that has data on the bad OST is inaccessible and cannot be removed. It would be nice to figure out what is wrong with the OST that makes the OSS panic if it gets mounted as part of the filesystem.


Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238




From: Mervini, Joseph A [mailto:jamervi at sandia.gov]
Sent: Tuesday, January 20, 2015 11:46 AM
To: Andrus, Brian Contractor
Subject: Re: [EXTERNAL] [Lustre-discuss] how do I deactivate a very wonky OST

Hey Brian,

Are you still fighting the issue?

I had something similar come up with ldiskfs and quotas on an MDT on 2.5.3 that was also kernel panicking the system. I was able to get past it by shutting off quotas on the MDT and then re-enabling them. (I didn't know that lustre was using Linux quota now...)

Anyway, the commands that I used to turn off quota on the device were:

dumpe2fs -h <device> (To see what was in the superblock.)


tune2fs -O ^quota <device>

I was then able to mount the file system with no problem.

I then re-enabled it with:

tune2fs -O quota <device>

And everything works fine now.
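
Put together, the sequence above might look like the sketch below. It only prints the commands for review instead of running them, because they modify the superblock of the backing device. The function name quota_cycle_plan is hypothetical, and the device path is whatever you pass in.

```shell
# Hypothetical helper: print (do not execute) the quota off/on sequence
# described in this message for a single backing device.
quota_cycle_plan() {
    dev=$1
    printf 'dumpe2fs -h %s\n' "$dev"          # inspect the superblock first
    printf 'tune2fs -O ^quota %s\n' "$dev"    # clear the quota feature
    printf '# mount the target here and check that it comes up\n'
    printf 'tune2fs -O quota %s\n' "$dev"     # turn the quota feature back on
}

# Possible use (the device path is an example only):
#   quota_cycle_plan /dev/mapper/ost5
```

Printing first makes it easy to double-check the device name before anything is changed.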

Hope this helps.

Joe

====

Joe Mervini
Sandia National Laboratories
High Performance Computing
505.844.6770
jamervi at sandia.gov



On Jan 16, 2015, at 10:06 AM, Andrus, Brian Contractor <bdandrus at nps.edu> wrote:



Joe,

We are using OSTs that are on an LVM connected via Infiniband. I have 17 OSTs and all the others are fine. I have also tried mounting this 'bad' one on other hardware with the same result.


Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238



From: Mervini, Joseph A [mailto:jamervi at sandia.gov]
Sent: Thursday, January 15, 2015 10:09 AM
To: Andrus, Brian Contractor
Subject: Re: [EXTERNAL] Re: [Lustre-discuss] how do I deactivate a very wonky OST

Brian,

Oh, sorry. I didn't realize you were using 2.6 - don't know anything about that version yet.

It doesn't make a lot of sense that if you can fsck the file system it would panic the system on mount. What is the hardware that you are using (i.e., is the storage direct attached or external, is it a JBOD or RAID device, etc.)?

I am happy to make suggestions that may help.

Regards,
Joe
====

Joe Mervini
Sandia National Laboratories
High Performance Computing
505.844.6770
jamervi at sandia.gov



On Jan 15, 2015, at 10:06 AM, Andrus, Brian Contractor <bdandrus at nps.edu> wrote:




Joe,

Is that something that can be done on an offline, unavailable OST under lustre 2.6? I thought quotas were now built-in.
It is not a persistent mount parameter on the OST when I use tunefs.

I have run tune2fs and removed/rebuilt the quota info on the backing filesystem. That removes the quota errors from e2fsck, but it still kernel panics the OSS when it mounts.

I cannot even seem to unlink any files that use the missing OST either. It seems that if you completely lose an OST that is not recoverable in lustre 2.6, you cannot do much with the files that were on it....

I haven't given up hope on the OST yet, though...

Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238



From: Mervini, Joseph A [mailto:jamervi at sandia.gov]
Sent: Wednesday, January 14, 2015 3:34 PM
To: Andrus, Brian Contractor
Subject: Re: [EXTERNAL] Re: [Lustre-discuss] how do I deactivate a very wonky OST

Brian,

I have one thought: You did say that you could run fsck against the file system but it was complaining about the quota file?  You might try removing the quota parameter from the OST configuration and see if you can mount it then. If it does, then you can reinsert the parameter. The only downside to that will be that you'll have to do a quota check again, but that would be considerably better than trying to hack the file system back into working order.

Hope this helps/works.
====

Joe Mervini
Sandia National Laboratories
High Performance Computing
505.844.6770
jamervi at sandia.gov



On Jan 14, 2015, at 4:11 PM, Andrus, Brian Contractor <bdandrus at nps.edu> wrote:





Thanks Sean,

Right now neither helps me, as I had to bring the entire system up from scratch and NOT mount the bad OST.
So, now OST5 is not listed anywhere. The only knows it is missing.
Doing 'lctl dl' only shows the OSTs that have been brought up.
If I try to bring it up, it registers, the MDS becomes aware, the OSS kernel panics and the MDS starts making everyone wait for it to come back.

Part of 'lfs df':
OST0005             : Resource temporarily unavailable

Hassle is I cannot really do an 'lfs find' for the files on the bad OST because the OST is not registered... stuck in a loop here...

If I could find a way to tag it as offline even though the MDS doesn't see it yet, that may help.


Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238




From: Sean Brisbane [mailto:s.brisbane1 at physics.ox.ac.uk]
Sent: Wednesday, January 14, 2015 3:04 PM
To: Andrus, Brian Contractor; lustre-discuss at lists.lustre.org
Subject: RE: how do I deactivate a very wonky OST

This caught me out in a recent upgrade:

cat /proc/fs/lustre/lov/{yourmdt}/target_obd

rather than

"lctl dl"

Shows the state of the OST.
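
A small filter over that file can pick out the targets that are not active. This is a sketch that assumes each line of target_obd has the form "<index>: <uuid> <STATE>", with ACTIVE as the healthy state; the function name inactive_osts is hypothetical.

```shell
# Hypothetical helper: read target_obd-style lines on stdin and print
# index, UUID and state for every target whose state is not ACTIVE.
inactive_osts() {
    awk 'NF >= 3 && $3 != "ACTIVE" { sub(/:$/, "", $1); print $1, $2, $3 }'
}

# Possible use on the MDS (not run here; {yourmdt} is the placeholder from above):
#   cat /proc/fs/lustre/lov/{yourmdt}/target_obd | inactive_osts
```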

Cheers,
Sean
________________________________
From: lustre-discuss-bounces at lists.lustre.org [lustre-discuss-bounces at lists.lustre.org] on behalf of Andrus, Brian Contractor [bdandrus at nps.edu]
Sent: 13 January 2015 17:28
To: lustre-discuss at lists.lustre.org
Subject: [Lustre-discuss] how do I deactivate a very wonky OST

All,

We are still trying to move forward getting our filesystem at least partially up with a failed OST.

Currently the OST will kernel panic any device that mounts it. That seems to be a constant.

So, the plan is to bring the system up without that OST and find what data will be lost.
Now, I am trying to deactivate the OST on the MGS, but it seems to have no effect.
Running 'lctl --device 14 deactivate' does not change anything. The OST still shows 'UP'.

Is there a way to force lustre to deactivate an OST altogether when it is showing 'UP' and the OST is not going to be happily mounted?
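
Device numbers such as 14 are, as far as I know, local to the node where 'lctl dl' was run, so the number passed to 'lctl --device' has to be looked up on that same node. A minimal sketch of that lookup, assuming the usual 'lctl dl' column layout (device number, state, type, name, UUID, refcount); the function name osc_devno and the sample line in the comment are hypothetical:

```shell
# Hypothetical helper: read 'lctl dl' output on stdin and print the device
# number of every osc device whose name contains the given OST label.
osc_devno() {
    awk -v ost="$1" '$3 == "osc" && index($4, ost) > 0 { print $1 }'
}

# Assumed shape of a matching line:
#   14 UP osc WORK-OST0005-osc-MDT0000 WORK-MDT0000-mdtlov_UUID 5
#
# Possible use on the node that holds the osc device (not run here):
#   lctl dl | osc_devno OST0005
```

If this prints nothing, the node has no osc device for that OST, and a deactivate issued there would have nothing to act on.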

I can mount the filesystem, but many actions hang (ls, df, etc).

Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238

_______________________________________________
Lustre-discuss mailing list
Lustre-discuss at lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
