[Lustre-discuss] File Content change without Error log

Lu Wang wanglu at ihep.ac.cn
Tue Mar 31 09:41:40 PDT 2009


Yes, you are right.
The problem is caused by a misconfiguration of one disk array: two partitions of this array are mapped to the same LUN.
That is to say, when I created OST1 on /dev/sda and OST2 on /dev/sdb, the two OSTs were actually written to the same disk partition on the disk array.
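A check along the following lines might catch this kind of double mapping before the OSTs are formatted. It is only a sketch: it assumes the kernel exposes a per-disk identifier at /sys/block/<dev>/device/wwid (older kernels may not), and the helper names read_wwid and find_shared_luns are made up for illustration.

```python
from collections import defaultdict
from pathlib import Path


def read_wwid(dev_name, sysfs_root="/sys/block"):
    """Return the identifier the kernel reports for a disk, or None.

    Assumption: /sys/block/<dev>/device/wwid exists on this system. If it
    does not, the LUN identity has to come from some other source.
    """
    path = Path(sysfs_root) / dev_name / "device" / "wwid"
    try:
        return path.read_text().strip()
    except OSError:
        return None


def find_shared_luns(dev_to_id):
    """Group device names by identifier; keep groups with more than one device."""
    groups = defaultdict(list)
    for dev, ident in dev_to_id.items():
        if ident:  # skip devices whose identifier could not be read
            groups[ident].append(dev)
    return {ident: sorted(devs) for ident, devs in groups.items() if len(devs) > 1}


if __name__ == "__main__":
    # sda and sdb are the two devices named in this message.
    ids = {dev: read_wwid(dev) for dev in ("sda", "sdb")}
    for ident, devs in find_shared_luns(ids).items():
        print(f"WARNING: {', '.join(devs)} share identifier {ident}")
```

If two device names come back under the same identifier, they very likely point at the same storage and should not both be formatted as OSTs.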

(It is quite strange that there were no errors when I created two OSTs on the "same" partition.) When the content grew to nearly 40% of each OST, some of the files were damaged (because the same disk space was used twice).
I found this problem because all the damaged files are on the same OST. After e2fsck, I lost one OST, and the other OST became double its size. There are a lot of "red" files in the directory "O" when I mount the OST as ldiskfs.
I used
lfs getstripe --obd ****_UUID /dir
to generate the list of damaged files.
Is it possible to recover the lost OST using the "red" files? I am not sure I have explained this clearly ...
------------------				 
Lu Wang
2009-04-01

-------------------------------------------------------------
From: Brian J. Murrell
Sent: 2009-03-31 23:23:04
To: lustre-discuss
Cc:
Subject: Re: [Lustre-discuss] File Content change without Error log

On Tue, 2009-03-31 at 12:15 +0800, Lu Wang wrote:
> Dear all,
>      More than 100 files were damaged recently without any error logs on the OSS. The damaged files have the same size as their original copies in our backup system. However, the checksum changed. For example,
> #ll run_0008126_All_file015_SFO-1.raw.353645 
> -rw-r--r--  1 chyd u07 2108082156 Mar 31 10:07 run_0008126_All_file015_SFO-1.raw.353645 
> # ll demaged 
> -rw-r--r--  1 root root 2108082156 Mar 31 11:19 demaged

I'm assuming run_0008126_All_file015_SFO-1.raw.353645 is from your
backup and demaged is the "corrupt" file, is that correct?  I will base
my statements on that...

> # cmp run_0008126_All_file015_SFO-1.raw.353645 demaged 
> run_0008126_All_file015_SFO-1.raw.353645 demaged differ: byte 16777217, line 118663
> 
> # adler32 run_0008126_All_file015_SFO-1.raw.353645 
> adler32(run_0008126_All_file015_SFO-1.raw.353645) = 3653083401, 0xd9bda109
> #adler32 demaged 
> adler32(demaged) = 195426776, 0xba5f9d8
> PS:
> 1. The modification times of these damaged files are the same as the times they were copied to Lustre.

Why is the modification time of run_0008126_All_file015_SFO-1.raw.353645
and demaged different?  Could that difference, and the relative
newness of run_0008126_All_file015_SFO-1.raw.353645, explain what
happened (i.e. it was written to, legitimately)?

> 2. There are no abnormal signals in the OSS logs.

There wouldn't be in normal situations, such as the file being written to
after the backup was made.  The modification times give no assurance
that that was not the case, as "demaged" was written after
run_0008126_All_file015_SFO-1.raw.353645.

Also, silent disk corruption (i.e. in the hardware) could be a cause as
could any kind of silent failure below the Lustre stack.
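
To narrow down where two copies diverge, in the spirit of the cmp and adler32 commands quoted above, something like the following could be used. It is a rough sketch, not a definitive tool: the function names adler32_of and first_difference are made up for illustration, and only Python's standard zlib module is assumed.

```python
import zlib


def adler32_of(path, chunk_size=1 << 20):
    """Stream a file through zlib.adler32 and return the unsigned checksum."""
    value = 1  # Adler-32 starting value
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            value = zlib.adler32(chunk, value)
    return value & 0xFFFFFFFF


def first_difference(path_a, path_b, chunk_size=1 << 20):
    """Return the 0-based offset of the first differing byte, or None if identical.

    If one file is a prefix of the other, the offset is the length of the
    shorter file.
    """
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a != b:
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        return offset + i
                return offset + min(len(a), len(b))
            if not a:  # both reads hit end of file with no difference
                return None
            offset += len(a)
```

Note that cmp counts bytes from 1, so the "byte 16777217" it reported corresponds to a 0-based offset of 16777216, which is exactly 16 x 1024 x 1024. A first difference on such a round boundary may point more toward a misplaced block or extent than toward a random bit flip, though that alone proves nothing.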

Also, with regard to the backup file that you are comparing to, is it
truly the actual file on the backup medium that you are using in the
comparison or is it a copy (i.e. restored to a disk from the backup
medium)?

If it's a copy of the backup file, how do you know that the copy from
the backup is not actually corrupt and that the copy on disk is in fact
the true copy?  Or how do you know that the copy that's on the backup
medium is not corrupted (i.e. faulty backup medium)?  What's your point
of reference that assures that (the copy from) the backup is the true
image and not damaged?
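
One way to have such a point of reference is to record digests at the source before the files are ever copied, and to check every later copy against that record. The sketch below assumes SHA-256 from Python's hashlib and a JSON manifest keyed by file name; the function names and the manifest layout are illustrative choices, not anything Lustre or the backup system provides.

```python
import hashlib
import json
import os


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def record(base_dir, names, manifest_path):
    """Write a manifest mapping each file name under base_dir to its digest."""
    manifest = {n: sha256_of(os.path.join(base_dir, n)) for n in names}
    with open(manifest_path, "w") as f:
        json.dump(manifest, f, indent=2, sort_keys=True)
    return manifest


def verify(base_dir, manifest_path):
    """Return the names under base_dir whose digest no longer matches the manifest."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    return sorted(
        n for n, digest in manifest.items()
        if sha256_of(os.path.join(base_dir, n)) != digest
    )
```

Running verify against the Lustre copy and against a restore from the backup medium would then show which of the two, if either, still matches what was originally written.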

Just some things to consider.

b.


_______________________________________________
Lustre-discuss mailing list
Lustre-discuss at lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


