[Lustre-discuss] File Content change without Error log

Brian J. Murrell Brian.Murrell at Sun.COM
Tue Mar 31 08:25:37 PDT 2009


On Tue, 2009-03-31 at 12:15 +0800, Lu Wang wrote:
> Dear all,
>      More than 100 files have been damaged recently without any error logs on the OSS. The damaged files have the same size as their original copies in our backup system. However, the checksum has changed. For example,
> #ll run_0008126_All_file015_SFO-1.raw.353645 
> -rw-r--r--  1 chyd u07 2108082156 Mar 31 10:07 run_0008126_All_file015_SFO-1.raw.353645 
> # ll demaged 
> -rw-r--r--  1 root root 2108082156 Mar 31 11:19 demaged

I'm assuming run_0008126_All_file015_SFO-1.raw.353645 is from your
backup and demaged is the "corrupt" file, is that correct?  I will base
my statements on that...

> # cmp run_0008126_All_file015_SFO-1.raw.353645 demaged 
> run_0008126_All_file015_SFO-1.raw.353645 demaged differ: byte 16777217, line 118663
> 
> # adler32 run_0008126_All_file015_SFO-1.raw.353645 
> adler32(run_0008126_All_file015_SFO-1.raw.353645) = 3653083401, 0xd9bda109
> #adler32 demaged 
> adler32(demaged) = 195426776, 0xba5f9d8
> PS:
> 1. The modification times of these damaged files are the same as the times they were copied to Lustre.

Why are the modification times of run_0008126_All_file015_SFO-1.raw.353645
and demaged different?  Could that difference, and the relative
newness of run_0008126_All_file015_SFO-1.raw.353645, explain what
happened (i.e. it was written to, legitimately)?

> 2. There are no abnormal signals in the OSS logs.

There wouldn't be in a normal situation, such as the file being written
to after the backup was made.  The modification times give no assurance
that that was not the case, as "demaged" was written after
run_0008126_All_file015_SFO-1.raw.353645.

Also, silent disk corruption (i.e. in the hardware) could be a cause as
could any kind of silent failure below the Lustre stack.
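
Since cmp stops at the first difference, it may help to see how much of
the file actually differs before guessing at a cause.  A minimal Python
sketch, assuming both files have the same size as in your listing;
diff_ranges and its chunk_size parameter are hypothetical names for
illustration, not part of any Lustre tool:

```python
def diff_ranges(path_a, path_b, chunk_size=1 << 20):
    """Return (start, end) byte ranges where the two files differ.

    Offsets are 0-based and end is exclusive. Hypothetical helper,
    written on the assumption that both files have the same size.
    """
    ranges = []
    start = None  # start offset of the differing run currently open
    offset = 0    # file offset of the chunk being compared
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if not a and not b:
                break
            if a == b:
                # Identical chunk: close any run left open by the previous one.
                if start is not None:
                    ranges.append((start, offset))
                    start = None
            else:
                common = min(len(a), len(b))
                for i in range(common):
                    if a[i] != b[i]:
                        if start is None:
                            start = offset + i
                    elif start is not None:
                        ranges.append((start, offset + i))
                        start = None
                # If one file is longer, count the extra tail as differing.
                if len(a) != len(b) and start is None:
                    start = offset + common
            offset += max(len(a), len(b))
    if start is not None:
        ranges.append((start, offset))
    return ranges
```

Calling diff_ranges("run_0008126_All_file015_SFO-1.raw.353645", "demaged")
would list every differing range.  Note that cmp numbers bytes from 1,
so the byte 16777217 it reported would appear here as offset 16777216.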

Also, with regard to the backup file that you are comparing to, is it
truly the actual file on the backup medium that you are using in the
comparison or is it a copy (i.e. restored to a disk from the backup
medium)?

If it's a copy of the backup file, how do you know that the copy from
the backup is not actually corrupt and that the copy on disk is in fact
the true copy?  Or how do you know that the copy that's on the backup
medium is not corrupted (i.e. faulty backup medium)?  What's your point
of reference that assures that (the copy from) the backup is the true
image and not damaged?
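
One way to get closer to such a point of reference is to checksum every
copy you can reach (the file on Lustre, the restored file and, if
possible, the file read straight from the backup medium) and see which
ones agree.  A rough Python sketch using zlib.adler32 from the standard
library; adler32_file and group_by_checksum are hypothetical helper
names, and I'm assuming your adler32 tool computes the standard
Adler-32:

```python
import zlib


def adler32_file(path, chunk_size=1 << 20):
    """Compute the Adler-32 of a file without loading it all into memory."""
    value = 1  # standard Adler-32 starting value
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            # Feed the running value back in so chunks are chained.
            value = zlib.adler32(chunk, value)
    return value & 0xFFFFFFFF


def group_by_checksum(paths):
    """Map each checksum to the list of paths that produced it."""
    groups = {}
    for path in paths:
        groups.setdefault(adler32_file(path), []).append(path)
    return groups
```

Copies that land in the same group agree with each other, which
suggests but does not prove that they are the true image; a checksum
recorded at the time the data was first written would be a stronger
reference.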

Just some things to consider.

b.
