[lustre-discuss] zfs -- mds/mdt -- ssd model / type recommendation

Andrew Wagner andrew.wagner at ssec.wisc.edu
Wed May 6 06:50:43 PDT 2015


Marc,

Have you backed up your ZFS MDT? Our SSD RAID10 with 4 disks and ~200GB
of metadata can take days to back up a snapshot.
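
A snapshot-based backup of the MDT dataset is, in rough outline, something
like this (the pool/dataset name mdtpool/mdt0 and the host backuphost are
hypothetical):

    # take a consistent point-in-time snapshot of the MDT dataset
    zfs snapshot mdtpool/mdt0@backup-20150506
    # stream it to a file on backup storage...
    zfs send mdtpool/mdt0@backup-20150506 | gzip > /backup/mdt0-20150506.zfs.gz
    # ...or replicate it to another ZFS host
    zfs send mdtpool/mdt0@backup-20150506 | ssh backuphost zfs recv backuppool/mdt0-copy

Incremental sends (zfs send -i) between successive snapshots can shorten the
later runs.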

Andrew Wagner
Research Systems Administrator
Space Science and Engineering
University of Wisconsin
andrew.wagner at ssec.wisc.edu | 608-261-1360

On 05/05/2015 03:07 PM, Stearman, Marc wrote:
> Most of our production MDS nodes have a 2.7TB zpool.  They vary in how full they are, but one file system has 1 billion files and is 63% full.  I plan on adding a bit more storage to try to get the % full down to about 50%.  This is another nice attribute of ZFS: I can increase the size of the pool online without having to reformat the file system, thereby adding more inodes.
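
Growing the pool that way is a single online operation; a minimal sketch,
with hypothetical pool and device names:

    # add another mirrored pair to the MDT pool; capacity (and with it the
    # number of dnodes ZFS can allocate) grows immediately, no remount needed
    zpool add mdtpool mirror /dev/disk/by-id/new-ssd-0 /dev/disk/by-id/new-ssd-1
    zpool list mdtpool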
>
> Also, remember that by default, ZFS has redundant_metadata=all defined.  ZFS is storing an extra copy of all the metadata for the pool.  And, ZFS stores a checksum of all blocks on the file system, so yes there is more overhead, but you do not need to do offline fsck, and it's checking not just metadata, but actual data as well.
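
Those settings and the online check are easy to verify; the pool/dataset
names here are hypothetical:

    zfs get redundant_metadata,checksum mdtpool/mdt0
    # a scrub walks every block and verifies checksums while the MDT stays mounted
    zpool scrub mdtpool
    zpool status mdtpool    # shows scrub progress and any checksum errors found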
>
> Disks today are relatively inexpensive, and I feel the benefits of ZFS (online fsck, data integrity, etc) are well worth the cost (slightly slower perf., need a few extra drives)
>
> -Marc
>
> ----
> D. Marc Stearman
> Lustre Operations Lead
> stearman2 at llnl.gov
> 925.423.9670
>
>
>
>
> On May 5, 2015, at 10:16 AM, Alexander I Kulyavtsev <aik at fnal.gov> wrote:
>
>> How much space is used per inode on the MDT in a production installation?
>> What is the recommended size of an MDT?
>>
>> I'm presently at about 10 KB/inode which seems too high compared with ldiskfs.
>>
>> I ran out of inodes on the ZFS MDT in my tests and ZFS got "locked": the MDT zpool had used all of its space.
>>
>> We have the zpool created as a stripe of mirrors (mirror s0 s1 mirror s2 s3). Total size is ~940 GB; it got stuck at about 97 million files.
>> ZFS v0.6.4.1, default 128 KB recordsize. Fragmentation went to 83% when things got locked at 98% capacity; now I'm at 62% fragmentation after removing some files (down to 97% space capacity).
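
For anyone comparing numbers, those figures can be pulled with something like
this (pool/dataset and mount point names are hypothetical); bytes per inode is
roughly allocated space divided by inodes used:

    zpool list -o name,size,allocated,capacity,fragmentation mdtpool
    zfs list -o name,used,avail mdtpool/mdt0
    lfs df -i /mnt/lustre    # on a client: inodes used/free per MDT and OST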
>>
>> Shall we use a smaller ZFS record size on the MDT, say 8KB or 16KB? If an inode is ~10KB and the ZFS record is 128KB, we are dropping caches and reading data we do not need.
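
If a smaller record size is worth trying, it is a per-dataset property and
only applies to blocks written after the change (dataset name hypothetical):

    zfs get recordsize mdtpool/mdt0
    # try 16K (or 8K); existing blocks keep the size they were written with
    zfs set recordsize=16K mdtpool/mdt0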
>>
>> Alex.
>>
>> On May 5, 2015, at 10:43 AM, Stearman, Marc <stearman2 at llnl.gov> wrote:
>>
>>> We are using the HGST S842 line of 2.5" SSDs.  We have them configured in a RAID10 setup in ZFS.  We started with SAS drives and found them to be too slow, and were bottlenecked on the drives, so we upgraded to SSDs.  The nice thing with ZFS is that it's not just a two-device mirror.  You can do an n-way mirror, so we added the SSDs to each of the vdevs with the SAS drives, let them resilver online, and then removed the SAS drives.  Users did not have to experience any downtime.
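
That migration is an attach/resilver/detach cycle run against each mirror
vdev in turn; a minimal sketch with hypothetical pool and device names:

    # grow the two-way mirror into a three-way mirror by attaching the SSD
    zpool attach mdtpool sas-disk-0 ssd-disk-0
    zpool status mdtpool     # wait here until the resilver completes
    # then drop the old SAS member, leaving an all-SSD mirror
    zpool detach mdtpool sas-disk-0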
>>>
>>> We have about 100PB of Lustre spread over 10 file systems.  All of them are using SSDs.  We have a couple using OCZ SSDs, but I'm not a fan of their RMA policies.  That has changed since they were bought by Toshiba, but I still prefer the HGST drives.
>>>
>>> We configure them as 10 mirror pairs (20 drives total), spread across two JBODs so we can lose an entire JBOD and still have the pool up.
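
A cut-down sketch of that layout (two of the ten pairs shown, device names
hypothetical), with each mirror taking one drive from each JBOD:

    zpool create mdtpool \
        mirror jbod0-slot0 jbod1-slot0 \
        mirror jbod0-slot1 jbod1-slot1
    # ...and so on for the remaining eight pairs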
>>>
>>> -Marc
>>>
>>> ----
>>> D. Marc Stearman
>>> Lustre Operations Lead
>>> stearman2 at llnl.gov
>>> 925.423.9670
>>>
>>>
>>>
>>>
>>> On May 4, 2015, at 11:18 AM, Kevin Abbey <kevin.abbey at rutgers.edu> wrote:
>>>
>>>> Hi,
>>>>
>>>> For a single-node OSS I'm planning to use a combined MGS/MDS.  Can anyone recommend an enterprise SSD designed for this workload?  I'd like to create a RAID10 with 4x SSDs using ZFS as the backing fs.
>>>>
>>>> Are there any published/documented systems using zfs in raid 10 using ssd?
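
For what it's worth, a 4-SSD RAID10-style MGS/MDT on ZFS can be created in
one step with mkfs.lustre, which will build the zpool itself; a minimal
sketch with hypothetical fsname, pool, and device names:

    mkfs.lustre --fsname=testfs --mgs --mdt --index=0 --backfstype=zfs \
        mdtpool/mdt0 \
        mirror /dev/disk/by-id/ssd0 /dev/disk/by-id/ssd1 \
        mirror /dev/disk/by-id/ssd2 /dev/disk/by-id/ssd3
    mount -t lustre mdtpool/mdt0 /mnt/mdt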
>>>>
>>>> Thanks,
>>>> Kevin
>>>>
>>>>
>>>> -- 
>>>> Kevin Abbey
>>>> Systems Administrator
>>>> Rutgers University
>>>>
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


