[Lustre-discuss] HLRN lustre breakdown

Troy Benjegerdes hozer at hozed.org
Thu Aug 21 07:22:33 PDT 2008


This is a big nasty issue, particularly for HPC applications where
performance is critical.

How does one even begin to benchmark the performance overhead of a
parallel filesystem with checksumming? I am having nightmares over the
ways vendors will try to play games with performance numbers.
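
As a crude starting point, one could at least measure the raw CPU cost
of checksumming on a single node. The following is a minimal sketch, not
a real filesystem benchmark: it assumes zlib's CRC32 as a stand-in for
whatever checksum a filesystem would actually use, and the function name
and parameters are hypothetical.

```python
import time
import zlib


def checksum_throughput(buf_size=1 << 20, rounds=64):
    """Return (copy_mb_s, copy_plus_crc_mb_s) for an in-memory buffer.

    zlib.crc32 is only a stand-in; a real filesystem may use a different
    checksum and may compute it elsewhere (client, server, or hardware).
    """
    data = bytes(buf_size)  # zero-filled buffer; contents don't affect CRC cost
    total_mb = buf_size * rounds / 1e6

    # Baseline: just copy the buffer, as a rough proxy for moving the data.
    start = time.perf_counter()
    for _ in range(rounds):
        bytearray(data)
    copy_s = max(time.perf_counter() - start, 1e-9)

    # Same copy, plus a CRC32 over every buffer.
    start = time.perf_counter()
    crc = 0
    for _ in range(rounds):
        crc = zlib.crc32(bytearray(data), crc)
    crc_s = max(time.perf_counter() - start, 1e-9)

    return total_mb / copy_s, total_mb / crc_s


if __name__ == "__main__":
    copy_rate, crc_rate = checksum_throughput()
    print(f"copy only:  {copy_rate:8.0f} MB/s")
    print(f"copy + crc: {crc_rate:8.0f} MB/s")
```

This says nothing about network, disk, or striping effects, which is
exactly where the room for playing games with the numbers comes in.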

My suspicion is that whenever a parallel filesystem with checksumming is
available and works, the end-users will just turn it off anyway
because the applications will run twice as fast without it, regardless
of what the benchmarks say, leaving us back at the same problem.

On Wed, Aug 20, 2008 at 07:12:10PM +0200, Bernd Schubert wrote:
> Oh damn, I'm always afraid of silent data corruption due to bad hard disks. We
> have had this issue too; fortunately, we found the disk before taking the
> system into production.
> 
> Will lustre-2.0 use the ZFS checksum feature?
> 
> 
> Thanks,
> Bernd
> 
> On Wednesday 20 August 2008 19:08:34 Peter Jones wrote:
> > Hi there
> >
> > I got the following background information from Juergen Kreuels at SGI
> >
> > "It turned out that a bad disk ( which did NOT report itself as being
> > bad ) killed the lustre leading to data corruption due to inode areas on
> > that disk.
> > It was finally decided to remake the whole FS and only during that
> > action we finally ( after nearly 48 h ) found that bad drive.
> >
> > It had nothing to do with the lustre FS itself. Lustre had been the
> > victim of a HW failure on a Raid6 lun."
> >
> > I hope that this helps
> >
> > PJones
> >
> > Heiko Schroeter wrote:
> > > Hello list,
> > >
> > > does anyone have more background information on what happened there?
> > >
> > > Regards
> > > Heiko
> > >
> > >
> > >
> > >
> > > HLRN News
> > > ---------
> > >
> > >
> > > Since Mon Aug 18, 2008 12:00 HLRN-II complex Berlin is open for users,
> > > again.
> > >
> > > During the maintenance it turned out that the Lustre file system holding
> > > the users $WORK and $TMPDIR was damaged completely.
> > > The file system had to be reconstructed from scratch. All user data in
> > > $WORK are lost.
> > >
> > > We hope that this event remains an exception. SGI apologizes for this
> > > event.
> > >
> > > /Bka
> > >
> > > ========================================================================
> > > This is an announcement for all HLRN Users
> > > _______________________________________________
> > > Lustre-discuss mailing list
> > > Lustre-discuss at lists.lustre.org
> > > http://lists.lustre.org/mailman/listinfo/lustre-discuss
> >
> 
> 
> 
> -- 
> Bernd Schubert
> Q-Leap Networks GmbH

-- 
--------------------------------------------------------------------------
Troy Benjegerdes                'da hozer'                hozer at hozed.org  

Someone asked me why I work on this free (http://www.gnu.org/philosophy/)
software stuff and not get a real job. Charles Schulz had the best answer:

"Why do musicians compose symphonies and poets write poems? They do it
because life wouldn't have any meaning for them if they didn't. That's why
I draw cartoons. It's my life." -- Charles Schulz
