[Lustre-discuss] HLRN lustre breakdown

Brock Palen brockp at umich.edu
Thu Aug 21 07:55:11 PDT 2008


On Aug 21, 2008, at 10:22 AM, Troy Benjegerdes wrote:
> This is a big nasty issue, particularly for HPC applications where
> performance is a big issue.
>
> How does one even begin to benchmark the performance overhead of a
> parallel filesystem with checksumming? I am having nightmares over the
> ways vendors will try to play games with performance numbers.

True

>
> My suspicion is that whenever a parallel filesystem with checksumming
> is available and works, that all the end-users will just turn it off
> anyway because the applications will run twice as fast without it,
> regardless of what the benchmarks say.. leaving us back at the same
> problem.

I don't think this will be a problem. On current systems a checksummed
filesystem may well become CPU bound, but I think the OSTs will be
bailed out by CPU speeds going up faster than disk speeds. You just
need to limit the number of OSTs per OSS.
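
One rough way to put numbers on this is to measure how fast a single
core can checksum data in memory and compare that with what one OST's
disks can stream. The sketch below is only an illustration: crc32 and
adler32 from Python's zlib are stand-ins for whatever checksum the
filesystem would really use, the function names are made up, and the
300 MB/s per-OST figure in the usage note is a hypothetical number, not
a measurement.

```python
import time
import zlib


def checksum_throughput(checksum, buf_size=1 << 20, rounds=64):
    """Return the approximate MB/s one core achieves with `checksum`.

    `checksum` must take (data, running_value), like zlib.crc32.
    """
    buf = bytes(buf_size)  # zero-filled buffer held in memory, no disk IO
    value = 0
    start = time.perf_counter()
    for _ in range(rounds):
        value = checksum(buf, value)
    elapsed = max(time.perf_counter() - start, 1e-9)  # avoid divide by zero
    return buf_size * rounds / elapsed / 1e6


def max_osts_per_core(checksum_mb_s, ost_mb_s):
    """How many OSTs one core could checksum at full disk speed."""
    return int(checksum_mb_s // ost_mb_s)


if __name__ == "__main__":
    for name, fn in (("crc32", zlib.crc32), ("adler32", zlib.adler32)):
        print(name, round(checksum_throughput(fn)), "MB/s per core")
```

Dividing the measured rate by an assumed per-OST bandwidth, for example
max_osts_per_core(rate, 300.0), gives a first guess at how many OSTs a
core can keep up with. It ignores network and memory-bandwidth costs, so
treat it as an upper bound.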

Where I could see it being a problem is on the client side. That
assumes that writes and reads are competing with the application for
cycles. So far on our clusters I see applications do either compute
or IO on a thread/rank, not both, which frees up the allocated CPUs
for IO. Then again, maybe I should ask our users why they don't do any
async IO.
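
For illustration, here is a minimal sketch of what overlapping compute
and IO on one rank could look like: a background thread drains a queue
to disk while the main thread keeps computing. This is not how any
particular application here does it; the function name, the queue depth
of 4, and the `compute` callback are all hypothetical. In this pattern
the IO (and any client-side checksumming) would compete with the
application for cycles.

```python
import queue
import threading


def compute_with_write_behind(path, steps, compute):
    """Run `compute(i)` for each step and write results in the background.

    Returns the total number of bytes handed to the writer thread.
    """
    pending = queue.Queue(maxsize=4)  # bounded, so compute cannot run far ahead

    def writer():
        with open(path, "wb") as f:
            while True:
                chunk = pending.get()
                if chunk is None:  # sentinel: no more data
                    break
                f.write(chunk)

    thread = threading.Thread(target=writer)
    thread.start()
    total = 0
    for i in range(steps):
        chunk = compute(i)   # main thread computes the next result...
        pending.put(chunk)   # ...while the writer flushes earlier ones
        total += len(chunk)
    pending.put(None)
    thread.join()
    return total
```

A real code would also need error handling for a failed open or write,
which this sketch leaves out.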

It probably depends on the application.
My 2 cents.

>
> On Wed, Aug 20, 2008 at 07:12:10PM +0200, Bernd Schubert wrote:
>> Oh damn, I'm always afraid of silent data corruptions due to bad
>> harddisks. We also already had this issue, fortunately we found this
>> disk before taking the system into production.
>>
>> Will lustre-2.0 use the ZFS checksum feature?
>>
>>
>> Thanks,
>> Bernd
>>
>> On Wednesday 20 August 2008 19:08:34 Peter Jones wrote:
>>> Hi there
>>>
>>> I got the following background information from Juergen Kreuels at SGI
>>>
>>> "It turned out that a bad disk ( which did NOT report itself as being
>>> bad ) killed the lustre leading to data corruption due to inode areas
>>> on that disk.
>>> It was finally decided to remake the whole FS and only during that
>>> action we finally ( after nearly 48 h ) found that bad drive.
>>>
>>> It had nothing to do with the lustre FS itself. Lustre had been the
>>> victim of a HW failure on a Raid6 lun."
>>>
>>> I hope that this helps
>>>
>>> PJones
>>>
>>> Heiko Schroeter wrote:
>>>> Hello list,
>>>>
>>>> does anyone have more background info on what happened there?
>>>>
>>>> Regards
>>>> Heiko
>>>>
>>>>
>>>>
>>>>
>>>> HLRN News
>>>> ---------
>>>>
>>>>
>>>> Since Mon Aug 18, 2008 12:00 HLRN-II complex Berlin is open for
>>>> users, again.
>>>>
>>>> During the maintenance it turned out that the Lustre file system
>>>> holding the users $WORK and $TMPDIR was damaged completely.
>>>> The file system had to be reconstructed from scratch. All user data
>>>> in $WORK are lost.
>>>>
>>>> We hope that this event remains an exception. SGI apologizes for
>>>> this event.
>>>>
>>>> /Bka
>>>>
>>>> ========================================================================
>>>> This is an announcement for all HLRN Users
>>>> _______________________________________________
>>>> Lustre-discuss mailing list
>>>> Lustre-discuss at lists.lustre.org
>>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>>
>>
>>
>>
>> -- 
>> Bernd Schubert
>> Q-Leap Networks GmbH
>
> -- 
> --------------------------------------------------------------------------
> Troy Benjegerdes                'da hozer'                 hozer at hozed.org
>
> Someone asked me why I work on this free (http://www.gnu.org/philosophy/)
> software stuff and not get a real job. Charles Schulz had the best answer:
>
> "Why do musicians compose symphonies and poets write poems? They do it
> because life wouldn't have any meaning for them if they didn't. That's why
> I draw cartoons. It's my life." -- Charles Schulz
>
>



