[lustre-discuss] zfs -- mds/mdt -- ssd model / type recommendation

Scott Nolin scott.nolin at ssec.wisc.edu
Wed May 6 12:13:09 PDT 2015


Regarding ZFS MDT backups -

While dealing with the metadata performance issues we performed many ZFS 
MDT backup/recovery cycles and used snapshots to test things. It did 
work, but it was slow. It was *agonizing* to wait a day for each testing 
iteration.

So now in production we just keep a full backup with incrementals on 
top, which helps with the total time to snapshot. But I'd still rather 
run frequent full backups too.
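
For reference, here is a rough sketch of the kind of full-plus-incremental 
send I mean (pool, dataset, host, and snapshot names are made up for 
illustration, not our actual setup):

   # full backup: snapshot the MDT dataset and stream it to another host
   zfs snapshot mdtpool/mdt0@full-20150506
   zfs send mdtpool/mdt0@full-20150506 | ssh backuphost zfs receive -F backup/mdt0

   # incremental on top: send only what changed since the last snapshot
   zfs snapshot mdtpool/mdt0@incr-20150507
   zfs send -i @full-20150506 mdtpool/mdt0@incr-20150507 | ssh backuphost zfs receive backup/mdt0

The incrementals keep each send small; the pain is mostly in how long a 
fresh full send of the MDT takes.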

We use Lustre for more than just scratch, so backing up the MDT matters 
to us. And really, if it weren't hard to do, wouldn't you do it for 
scratch filesystems too?

Of course the critical stuff gets backed up in home or wherever it is 
really unique, but there's still plenty of data whose loss would be 
painful if the filesystem dies. This has been a bigger decision point 
for us than performance when choosing the MDT backing filesystem. For 
our next two Lustre file systems we are leaning more towards ldiskfs on 
the MDT because of it. It will be tough to give up the other ZFS 
features over this, but it's important.

One interesting thing we tested with ZFS is using it to mirror MDTs 
between two servers via InfiniBand SRP. Notes are here: 
http://wiki.lustre.org/MDT_Mirroring_with_ZFS_and_SRP - This would give 
you a true mirror of your data (separate servers, separate disks) at 
what looked like little or no performance penalty in my testing. It's 
not a backup, but it's nice. We were only able to test it for one week, 
so we could not put it into production.
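
For the curious, the core of that setup is just an ordinary ZFS mirror 
whose second leg is the partner server's disk exported over SRP. Very 
roughly, with hypothetical device names (see the wiki page above for the 
real details):

   # /dev/sdb is a local disk; /dev/mapper/srp-remote is the partner server's
   # disk as it appears locally over InfiniBand SRP (name is illustrative only)
   zpool create mdtpool mirror /dev/sdb /dev/mapper/srp-remote
   zpool status mdtpool   # both legs of the mirror should show ONLINE

ZFS then keeps both legs in sync on every write, which is why it behaves 
like a true mirror rather than a backup.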

Scott



On 5/6/2015 10:40 AM, Stearman, Marc wrote:
> No, all of our Lustre file systems are scratch space.  They are not backed up.  We have an HPSS archive to store files forever, and we use NetApp filers for NFS home space, which is backed up.
>
> I do recall we tried to do a migration years ago under ldiskfs to reformat with more inodes, and the backup of the MDT was taking forever (more than a week), so we decided that for future migrations we would just build a new file system alongside it and ask the users to move the most important files they needed.
>
> -Marc
>
> ----
> D. Marc Stearman
> Lustre Operations Lead
> stearman2 at llnl.gov
> 925.423.9670
>
>
>
>
> On May 6, 2015, at 6:50 AM, Andrew Wagner <andrew.wagner at ssec.wisc.edu>
>   wrote:
>
>> Marc,
>>
>> Have you backed up your ZFS MDT? Our SSD RAID10 with 4 disks and ~200GB of metadata can take days to back up a snapshot.
>>
>> Andrew Wagner
>> Research Systems Administrator
>> Space Science and Engineering
>> University of Wisconsin
>> andrew.wagner at ssec.wisc.edu | 608-261-1360
>>
>> On 05/05/2015 03:07 PM, Stearman, Marc wrote:
>>> Most of our production MDS nodes have a 2.7TB zpool.  They vary in amount full, but one file system has 1 billion files and is 63% full.  I plan on adding a bit more storage to try and get % full down to about 50%.  This is another nice attribute of ZFS.  I can increase the size of the pool online without having to reformat the file system, thereby adding more inodes.
>>>
>>> Also, remember that by default, ZFS has redundant_metadata=all defined.  ZFS is storing an extra copy of all the metadata for the pool.  And, ZFS stores a checksum of all blocks on the file system, so yes there is more overhead, but you do not need to do offline fsck, and it's checking not just metadata, but actual data as well.
>>>
>>> Disks today are relatively inexpensive, and I feel the benefits of ZFS (online fsck, data integrity, etc) are well worth the cost (slightly slower perf., need a few extra drives)
>>>
>>> -Marc
>>>
>>> ----
>>> D. Marc Stearman
>>> Lustre Operations Lead
>>> stearman2 at llnl.gov
>>> 925.423.9670
>>>
>>>
>>>
>>>
>>> On May 5, 2015, at 10:16 AM, Alexander I Kulyavtsev <aik at fnal.gov>
>>>   wrote:
>>>
>>>> How much space is used per inode on the MDT in a production installation?
>>>> What is the recommended size of the MDT?
>>>>
>>>> I'm presently at about 10 KB/inode, which seems too high compared with ldiskfs.
>>>>
>>>> I ran out of inodes on the ZFS MDT in my tests and ZFS got "locked": the MDT zpool had used all of its space.
>>>>
>>>> We have the zpool created as a stripe of mirrors (mirror s0 s1 mirror s2 s3). Total size is ~940 GB, and we got stuck at about 97 million files.
>>>> zfs v0.6.4.1, default 128 KB recordsize. Fragmentation went to 83% when things got locked at 98% capacity; now I'm at 62% fragmentation after removing some files (down to 97% space capacity).
>>>>
>>>> Shall we use a smaller ZFS record size on the MDT, say 8 KB or 16 KB? If an inode is ~10 KB and the ZFS record is 128 KB, we are dropping caches and reading data we do not need.
>>>>
>>>> Alex.
>>>>
>>>> On May 5, 2015, at 10:43 AM, Stearman, Marc <stearman2 at llnl.gov> wrote:
>>>>
>>>>> We are using the HGST S842 line of 2.5" SSDs.  We have them configured as a RAID10 setup in ZFS.  We started with SAS drives and found them to be too slow (we were bottlenecked on the drives), so we upgraded to SSDs.  The nice thing with ZFS is that it's not limited to a two-device mirror.  You can do an n-way mirror, so we added the SSDs to each of the vdevs with the SAS drives, let them resilver online, and then removed the SAS drives.  Users did not experience any downtime.
>>>>>
>>>>> We have about 100PB of Lustre spread over 10 file systems.  All of them are using SSDs.  We have a couple using OCZ SSDs, but I'm not a fan of their RMA policies.  That has changed since they were bought by Toshiba, but I still prefer the HGST drives.
>>>>>
>>>>> We configure them as 10 mirror pairs (20 drives total), spread across two JBODs so we can lose an entire JBOD and still have the pool up.
>>>>>
>>>>> -Marc
>>>>>
>>>>> ----
>>>>> D. Marc Stearman
>>>>> Lustre Operations Lead
>>>>> stearman2 at llnl.gov
>>>>> 925.423.9670
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On May 4, 2015, at 11:18 AM, Kevin Abbey <kevin.abbey at rutgers.edu> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> For a single-node OSS I'm planning to use a combined MGS/MDS.  Can anyone recommend an enterprise SSD designed for this workload?  I'd like to create a RAID10 with 4x SSDs using ZFS as the backing fs.
>>>>>>
>>>>>> Are there any published/documented systems using ZFS in RAID10 on SSDs?
>>>>>>
>>>>>> Thanks,
>>>>>> Kevin
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Kevin Abbey
>>>>>> Systems Administrator
>>>>>> Rutgers University
>>>>>>
>>
>
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>

