[Lustre-discuss] File corrupted after fail-overing

Andreas Dilger adilger at sun.com
Thu Jun 18 01:45:06 PDT 2009


On Jun 18, 2009  15:03 +0700, Đàm Thanh Tùng wrote:
> I'm new to Lustre, and I'm sorry if my question is too basic or has already
> been answered elsewhere.
> I have a problem with Lustre OST failover.
> I have 2 OSSs configured to fail over to each other. Each OSS has its own
> OST (I didn't use a shared disk for my 2 OSSs), and they use the same OST
> index.

You are misunderstanding how Lustre failover works.  You MUST have shared
disks between the two OSS nodes.
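
As a rough sketch of what a shared-disk setup looks like (not a tested
recipe): the OST is formatted once, on a device that both OSS nodes can
reach, and is mounted on only one node at a time. The NIDs below are taken
from the commands quoted in this thread; the device path /dev/mapper/ost0,
the mount point /mnt/ost0 and the numeric index 0 are assumptions for
illustration. The script only prints the commands, since mkfs.lustre is
destructive.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands instead of running them.
MGS_NID="192.168.1.200@tcp0"     # MGS/MDS node, from the thread
OSS2_NID="192.168.1.202@tcp0"    # failover partner of OSS1, from the thread
SHARED_DEV="/dev/mapper/ost0"    # hypothetical device visible to BOTH OSS nodes
OST_MNT="/mnt/ost0"              # hypothetical mount point

# Format ONCE, from OSS1 only, naming OSS2 as the failover node.
format_cmd="mkfs.lustre --ost --mgsnode=$MGS_NID --failnode=$OSS2_NID --index=0 $SHARED_DEV"

# Normal operation: the OST is mounted on OSS1. If OSS1 fails, the SAME
# device is mounted on OSS2 with the same command. Never mount it on both.
mount_cmd="mount -t lustre $SHARED_DEV $OST_MNT"

echo "on OSS1 (once):    $format_cmd"
echo "on the active OSS: $mount_cmd"
```

The point of the sketch is that there is a single formatted OST; the second
OSS is only an alternate path to the same blocks, not a second copy.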

> This is everything I've done:
> 
> - On my MDS: mkfs.lustre --verbose --mdt --mgs /dev/sdb
>              mount -t lustre /dev/sdb /mnt/lustre
> - On my OSSs:
> 
> OSS1: mkfs.lustre --ost --mgsnode=192.168.1.200@tcp0
> --failover=192.168.1.202@tcp0 --index=lustre-OST0000 /dev/sdb
> 
> mount -t lustre /dev/sdb /mnt/lustre
> 
> OSS2: mkfs.lustre --ost --mgsnode=192.168.1.200@tcp0
> --failover=192.168.1.201@tcp0 --index=lustre-OST0000 /dev/sdb
> 
> mount -t lustre /dev/sdb /mnt/lustre
> 
> Everything worked well.
> 
> I ran my own test:
> - I copied a large file to the Lustre filesystem mounted on my client. While
> it was still being written, I unmounted the OST on one of my OSSs (the one
> receiving data; I verified this by looking at df -h output on each OSS and
> lfs getstripe on the client).
> - The failover worked well, at least according to everything shown in the
> OSS logs and my MDS log. The copy stalled at that moment; after recovery,
> once the MDS had switched its connection to the active OSS, it continued
> and finished without any error.
> 
> But the problem is: when I used the md5sum command to verify the file I had
> just copied, its checksum did not match the original file's. I repeated the
> test many times and got almost the same result each time.

That is because the data is only being written to one of the OSTs.  That
is just how Lustre works today - it is doing RAID-0 striping of files
over OSTs.  There is not yet any RAID-1 layer for it.
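
To make the RAID-0 point concrete, here is a minimal sketch of how striping
maps a byte offset in a file to exactly one stripe, with no second copy
anywhere. The 1 MiB stripe size and the stripe count of 2 are assumed values
for illustration, and stripe_for_offset is a hypothetical helper, not a
Lustre command.

```shell
#!/bin/sh
# RAID-0 style layout: each stripe-sized chunk lives on exactly one OST.
stripe_size=$((1024 * 1024))   # assumed 1 MiB stripe size
stripe_count=2                 # assumed number of OSTs the file is striped over

# Print which stripe (0 .. stripe_count-1) holds the given byte offset.
stripe_for_offset() {
    echo $(( ($1 / stripe_size) % stripe_count ))
}

for offset in 0 1048576 2097152 3145728; do
    echo "offset $offset -> stripe $(stripe_for_offset "$offset")"
done
```

Every offset resolves to a single location. In the test described above,
the two OSS nodes each had their own disk claiming to be the same OST, so
presumably the chunks written before the unmount sit on one disk and the
chunks written afterwards on the other; neither disk holds the whole object,
which would explain the md5sum mismatch.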

> Is there any way to overcome this problem?

Implement RAID-1 support :-)

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



