[Lustre-discuss] WARNING: data corruption issue found in 1.8.x releases

Mervini, Joseph A jamervi at sandia.gov
Wed Sep 9 11:23:09 PDT 2009


I'm not really sure why writethrough_cache_enable is being disabled, but the method we have used to disable the read cache is "echo 0 > /proc/fs/lustre/obdfilter/<ost name>/read_cache_enable", and it has not caused us any issues.
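
For an OSS that serves several OSTs, the same echo can be applied to each OST directory in turn. The following is only a minimal sketch: the function name disable_read_cache and its optional directory argument are hypothetical (the argument exists so the loop can be tried against a scratch directory first), and the default path is the one quoted above. Writing under /proc requires root, and per the original announcement the setting has to be reapplied each time an OST is restarted.

```shell
#!/bin/sh
# Sketch only: write 0 to read_cache_enable for every OST directory found.
# "disable_read_cache" is a hypothetical helper name, not a Lustre tool.
disable_read_cache() {
    # Default to the /proc path quoted in the message; an alternate
    # directory may be passed as $1 (an assumption, for dry runs).
    root="${1:-/proc/fs/lustre/obdfilter}"
    status=0
    for f in "$root"/*/read_cache_enable; do
        # If the glob matched nothing, the literal pattern is left over; skip it.
        [ -e "$f" ] || continue
        if echo 0 > "$f"; then
            echo "disabled: $f"
        else
            echo "failed: $f" >&2
            status=1
        fi
    done
    return "$status"
}
```

Calling disable_read_cache with no argument targets the /proc path above; the return status is non-zero if any write failed.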

-----Original Message-----
From: lustre-discuss-bounces at lists.lustre.org [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of Charles A. Taylor
Sent: Wednesday, September 09, 2009 12:07 PM
To: Johann Lombardi
Cc: lustre-discuss at lists.lustre.org discuss
Subject: Re: [Lustre-discuss] WARNING: data corruption issue found in 1.8.x releases

Just for the record, we've been running 1.8.1 for several weeks now
with no problems. Well, truthfully, "no problems" is an exaggeration,
but it is mostly working. We see lots of log messages we are not used
to regarding client and server csum differences.

Anyway, your email concerned us, so we issued the recommended commands
on our OSSs to disable the caching. That promptly crashed two of our
OSSs. We got the servers back up and after fsck'ing (fsck.ext4) all
the OSTs and remounting Lustre, one of the two OSSs promptly crashed
again.

We're still working through it but we weren't having any problems - or
at least none we were aware of - until we disabled the caching.   Maybe
we were already doomed - I don't know. 

Right now I'm kind of wishing we had moved to 1.6.7.2 rather than
1.8.0.1/1.8.1.  I think we got overconfident after running 1.6.4.2 for
so long with so few problems.

Charlie Taylor
UF HPC Center

On Wed, 2009-09-09 at 17:00 +0200, Johann Lombardi wrote:
> A bug has been identified in the 1.8 releases (1.8.0, 1.8.0.1 & 1.8.1 are
> impacted) that can cause data corruption on the OSTs. This problem is
> related to the OSS read cache feature that has been introduced in 1.8.0.
> This can happen when a bulk read or write request is aborted due to the
> client being evicted or because the data transfer over the network has
> timed out. More details are available in bug 20560:
> https://bugzilla.lustre.org/show_bug.cgi?id=20560
> 
> A patch is under testing and will be included in 1.8.1.1.
> Until 1.8.1.1 is available, we recommend disabling the OSS read cache
> feature. This feature can be disabled by running the following two
> commands on the OSSs:
> # lctl set_param obdfilter.*.writethrough_cache_enable=0
> # lctl set_param obdfilter.*.read_cache_enable=0
> 
> This has to be done each time an OST is restarted.
> 
> Best regards,
> Johann, for the Lustre team
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss





