[Lustre-discuss] Another server question.

Robert Minvielle robert at lite3d.com
Wed Feb 4 08:43:44 PST 2009


I mounted the OST rw. How long should I wait? It has been more
than 24 hours, and this drive is only 400 GB. The kernel only
reports...

LDISKFS FS on sda4, internal journal
LDISKFS-fs: mounted filesystem with ordered data mode.
LDISKFS-fs: file extents enabled
LDISKFS-fs: mballoc enabled
Lustre: OST datafs-OST0001 now serving dev (datafs-OST0001/a4ca1e1f-5fb6-98bb-4001-179eef95f576) with recovery enabled
Lustre: Server datafs-OST0001 on device /dev/sda4 has started
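
(For what it's worth, I have been polling the recovery status file
to see if anything is moving; assuming the 1.6-style proc layout,
on the OSS that is:

cat /proc/fs/lustre/obdfilter/datafs-OST0001/recovery_status

which is the same file the find command below picks up, and it
never reports anything but INACTIVE.)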

Step by step for the problem...

Go to a server at random.

shutdown -h now

wait 10 mins

restart server

remount lustre (mount -t lustre -o rw /dev/sda4 /mnt/data/ost5)

check it...

cd /proc/fs/lustre; find . -name "*recov*" -exec cat {} \;

status: INACTIVE
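
(One thing I have not tried yet: the manual mentions an abort_recov
mount option to skip recovery entirely, something like

mount -t lustre -o abort_recov /dev/sda4 /mnt/data/ost5

though I would rather understand why recovery never finishes.)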

A cat of /proc/fs/lustre/devices shows:

  0 UP mgc MGC10.1.15.6@tcp 6d8c5b4e-d22d-e17c-030b-0bf2a01defca 5
  1 UP ost OSS OSS_uuid 3
  2 UP obdfilter datafs-OST0001 datafs-OST0001_UUID 3

That seems correct; 10.1.15.6 is l1storage1, the MGS/MDT server.
Check to see if the MGS sees it...

[root@l1storage1 ~]# cat /proc/fs/lustre/lov/datafs-mdtlov/target_obd
0: datafs-OST0000_UUID ACTIVE
1: datafs-OST0001_UUID ACTIVE
4: datafs-OST0004_UUID ACTIVE
5: datafs-OST0005_UUID ACTIVE
6: datafs-OST0006_UUID ACTIVE

again...

[root@l1storage1 ~]# cat /proc/fs/lustre/devices
  0 UP mgs MGS MGS 13
  1 UP mgc MGC10.1.15.6@tcp efa6505e-238d-7107-7a7c-c64208640f9f 5
  2 UP mdt MDS MDS_uuid 3
  3 UP lov datafs-mdtlov datafs-mdtlov_UUID 4
  4 UP mds datafs-MDT0000 datafs-MDT0000_UUID 5
  5 UP ost OSS OSS_uuid 3
  6 UP obdfilter datafs-OST0000 datafs-OST0000_UUID 5
  7 UP osc datafs-OST0000-osc datafs-mdtlov_UUID 5
  8 UP osc datafs-OST0006-osc datafs-mdtlov_UUID 5
  9 UP osc datafs-OST0005-osc datafs-mdtlov_UUID 5
 10 UP osc datafs-OST0004-osc datafs-mdtlov_UUID 5
 11 UP osc datafs-OST0001-osc datafs-mdtlov_UUID 5
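
(To check the connection state from the MDS side, I believe the OSC
import state can be read directly as well; assuming the 1.6 proc
paths:

cat /proc/fs/lustre/osc/datafs-OST0001-osc/ost_server_uuid

which, if I understand the docs, prints the target UUID plus a
connection state such as FULL or DISCONN.)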

Now, just to make sure it should be OK, I go to a client,
restart the client, and make sure it sees the mount, then test.
The mount command shows it is mounted:

l1storage1@tcp0:/datafs on /datafs type lustre (rw)

ls -l of /datafs shows my test data...

drwxr-xr-x 2 root root       4096 Jan 30 08:47 t
drwxr-xr-x 2 root root     221184 Feb  2 12:36 test
drwxr-xr-x 2 root root     221184 Feb  2 12:35 test2

As I previously noted, I can create and delete files, but
ls -lR hangs, df hangs, and so on.
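
(From the client, assuming lfs behaves as documented, I can also run:

lfs check servers
lfs df

to narrow down which OST the client is actually failing to reach;
lfs df should stall on the bad target rather than on all of them.)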
 
"You are trying way too hard." -- I do not think that
is possibe... 
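
(A side note on the lctl failure quoted below: if I am reading the
manual right, conf_param is run on the MGS and does not take a
--device argument, i.e.

lctl conf_param datafs-OST0001.osc.active=1

while the per-device form, run on the MDS, would be

lctl --device <devno> activate

where <devno> is the left-hand number from /proc/fs/lustre/devices,
e.g. 11 for datafs-OST0001-osc in the listing above. That may
explain the errors I was getting.)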

----- "Brian J. Murrell" <Brian.Murrell at Sun.COM> wrote:

> On Wed, 2009-02-04 at 09:39 -0600, Robert Minvielle wrote:
> > I still cannot seem to get this OST to come online. The clients
> > are still exhibiting the same behaviour as before. Is there any
> > way to get the OST to go into active by force? I ran an ext3 check
> > on it using the Sun-modified e2fsprogs and it returns
> > 
> > e2fsck 1.40.11.sun1 (17-June-2008)
> > datafs-OST0001: recovering journal
> > datafs-OST0001: clean, 472/25608192 files, 1862944/102410358 blocks
> > 
> > Yet, I still get:
> > 
> > cd /proc/fs/lustre; find . -name "*recov*" -exec cat {} \;
> > status: INACTIVE
> > 
> > On the MGS, it seems to show as active...
> > 
> > [root@l1storage1 ~]# cat /proc/fs/lustre/lov/datafs-mdtlov/target_obd
> > 0: datafs-OST0000_UUID ACTIVE
> > 1: datafs-OST0001_UUID ACTIVE
> > 4: datafs-OST0004_UUID ACTIVE
> > 5: datafs-OST0005_UUID ACTIVE
> > 6: datafs-OST0006_UUID ACTIVE
> > 
> > I cannot seem to find how to bring up the OST in the FAQ/manual,
> > other than sections 4.2.1 and 4.2.2, which do not seem to work on
> > this OST (when I do a lctl --device <devno> conf_param
> > datafs-OST0001.osc.active=1, it fails; no matter what I put in
> > for <devno> it gives me an error).
> > 
> > Any help would be much appreciated. 
> 
> You are trying way too hard.  The process is simply to mount the
> OST and wait for recovery to complete.  If that is not working,
> then that needs to be debugged.  All of these other things you are
> attempting are likely just confusing things more than helping.
> 
> So after you mount the OST, you should get a bunch of messages in
> your "kernel log".  What are they?
> 
> Also, can you explain, exactly, step by step what you are doing to
> invoke this failure and recovery.
> 
> b.