[lustre-discuss] zfs -- mds/mdt -- ssd model / type recommendation

Kevin Abbey kevin.abbey at rutgers.edu
Wed May 6 09:44:16 PDT 2015


Dear Marc and list,

All of the replies are very helpful.

Could you share the method (command lines) you use to expand the ZFS pool?
I created my initial setup last summer with mkfs.lustre and a ZFS backing
filesystem.  Are you expanding the zpool with the zpool/zfs commands
directly?  I have not modified an existing pool holding live data and just
want to be sure I understand the procedure correctly.
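
To make sure I am picturing it right, is it essentially the following,
i.e. growing a pool online with zpool add, and swapping devices with
zpool attach/detach as you described for the SAS-to-SSD move?  (The
device names below are placeholders, not my actual layout.)
------------------------------------------------------------------------------------------------
# grow a pool online by adding another vdev; the new space/inodes
# appear immediately
zpool add lustre-mdt0 mirror sdy sdz
zpool add lustre-ost0 raidz2 sdq sdr sds sdt sdu sdv sdw sdx sdaa sdab

# replace a slow mirror member with an ssd: attach it (making an n-way
# mirror), wait for the resilver to finish, then detach the old disk
zpool attach lustre-mdt0 sda sdy
zpool detach lustre-mdt0 sda
------------------------------------------------------------------------------------------------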


Regarding file system statistics, can you or others share the CLI methods
to obtain:

- number of files on the file system
- number of inodes
- total space used by inodes
- average inode size  (total inode space used / number of inodes?)
- fragmentation %
- distribution of file sizes on the system?
- frequency of file access?
- random vs streaming IO?


Perhaps a link to a reference is sufficient.
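
For what it's worth, these are the commands I have been looking at so
far; I am not sure they are the right or complete way to get the numbers
above, which is partly why I ask:
------------------------------------------------------------------------------------------------
# inode totals and usage per MDT/OST, from any client
lfs df -i /lustre
lfs df -h /lustre
# (rough bytes-per-inode on the MDT: MDT bytes used / inodes used)

# pool capacity and fragmentation (fragmentation needs zfs >= 0.6.4)
zpool list -o name,size,allocated,free,fragmentation,capacity
zpool get fragmentation lustre-mdt0

# on the oss nodes: i/o size histogram, one way to judge random vs
# streaming access
lctl get_param obdfilter.*.brw_stats

# file size distribution, e.g. count files above a threshold
lfs find /lustre -type f -size +1G | wc -l
------------------------------------------------------------------------------------------------
For access frequency I have not found anything better than scanning
atimes or running something like the Robinhood policy engine, so
pointers there would also be welcome.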


This is the layout of the system:
------------------------------------------------------------------------------------------------
zpool list
------------------------------------------------------------------------------------------------
NAME          SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH ALTROOT
lustre-mdt0  7.25T  1.01G  7.25T     0%  1.00x  ONLINE  -
lustre-ost0  72.5T  28.1T  44.4T    38%  1.00x  ONLINE  -
lustre-ost1  72.5T  33.1T  39.4T    45%  1.00x  ONLINE  -
lustre-ost2  72.5T  27.2T  45.3T    37%  1.00x  ONLINE  -
lustre-ost3  72.5T  31.2T  41.3T    43%  1.00x  ONLINE  -
------------------------------------------------------------------------------------------------
df -h

lustre-mdt0/mdt0  7.1T  1.1G  7.1T   1% /mnt/lustre/local/lustre-MDT0000
lustre-ost0/ost0   57T   23T   35T  40% /mnt/lustre/local/lustre-OST0000
lustre-ost1/ost1   57T   27T   31T  47% /mnt/lustre/local/lustre-OST0001
lustre-ost2/ost2   57T   22T   35T  39% /mnt/lustre/local/lustre-OST0002
lustre-ost3/ost3   57T   25T   32T  45% /mnt/lustre/local/lustre-OST0003
------------------------------------------------------------------------------------------------
192.168.0.36 at tcp1:/lustre
                       226T   96T  131T  43% /lustre
------------------------------------------------------------------------------------------------
Disks in use:

lustre-mdt0/mdt0   raid10, each disk a 4TB enterprise SATA (ST4000NM0033-9Z)
lustre-ost0..ost3  stripe across 2x raidz2, each raidz2 of 10x 4TB enterprise SATA (ST4000NM0033-9Z)
------------------------------------------------------------------------------------------------




The zfs benefits you described are why I am using it.


My current MDS/MDT is a ZFS raid10 of 4TB enterprise SATA drives.  I
haven't measured its performance specifically, but I assume this is a
good place to gain performance by moving to the proper type of SSD.
I'll be doubling the number of OSTs within ~60 days.  I may build a new
Lustre file system with 2.7, migrate the data, and then fold the
existing JBODs into the new file system.  One issue to resolve is that
the existing setup was not created with o2ib as an LNet option, and I
read that adding it after creation is not guaranteed to proceed without
failure; that is the reason for starting with a new MDS/MDT.  The system
currently uses tcp over IPoIB.  We have only 16 IB clients and 26 TCP
clients.  Most file access is to large files for genomics/computational
biology or MD simulations, with sizes ranging from a few GB to
100-500 GB.
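
For context, the procedure I had assumed for bringing o2ib into play
(on the new MDS, or when retrofitting the old one) is roughly the
writeconf sequence below; if this is the step that is not guaranteed to
succeed on an existing file system, that would be useful to know.  (The
interface names are placeholders.)
------------------------------------------------------------------------------------------------
# /etc/modprobe.d/lustre.conf on every server, listing both networks
options lnet networks="o2ib0(ib0),tcp1(eth0)"

# with the whole file system stopped, regenerate the configuration logs
# so clients pick up the new nids -- mgs/mdt first, then each ost
tunefs.lustre --writeconf lustre-mdt0/mdt0
tunefs.lustre --writeconf lustre-ost0/ost0
tunefs.lustre --writeconf lustre-ost1/ost1
------------------------------------------------------------------------------------------------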


The ZIL is another place for performance improvement.  I've read that,
since the ZIL is small, the ZILs of several pools can be placed on
partitions of mirrored disks, so the pools share one mirrored pair of
SSDs.  Is this incompatible with Lustre?  It has been a while since I
read about it, and I only found examples for standalone ZFS, not for a
Lustre setup.  I also read that there is a plan for ZIL support in
Lustre.  Is there a link where I can read more about this and the
implementation schedule?  It would be good to know whether I can deploy
a system now and turn on ZIL support when it becomes available.
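
Concretely, the layout I had pictured is a pair of mirrored SSDs carved
into small partitions, with each pool getting its own mirrored slog,
something like the sketch below (device names are placeholders).  My
understanding, which may be out of date, is that a separate log device
only helps synchronous writes, and that osd-zfs does not issue those
today, which is why the planned ZIL support matters here.
------------------------------------------------------------------------------------------------
# each pool gets a mirrored log vdev on small partitions of two shared
# ssds
zpool add lustre-mdt0 log mirror sdx1 sdy1
zpool add lustre-ost0 log mirror sdx2 sdy2
zpool add lustre-ost1 log mirror sdx3 sdy3
------------------------------------------------------------------------------------------------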


Thank you for any comments or assistance,
Kevin



On 05/05/2015 04:07 PM, Stearman, Marc wrote:
> Most of our production MDS nodes have a 2.7TB zpool.  They vary in amount full, but one file system has 1 billion files and is 63% full.  I plan on adding a bit more storage to try and get % full down to about 50%.  This is another nice attribute of ZFS.  I can increase the size of the pool online without having to reformat the file system, thereby adding more inodes.
>
> Also, remember that by default, ZFS has redundant_metadata=all defined.  ZFS is storing an extra copy of all the metadata for the pool.  And, ZFS stores a checksum of all blocks on the file system, so yes there is more overhead, but you do not need to do offline fsck, and it's checking not just metadata, but actual data as well.
>
> Disks today are relatively inexpensive, and I feel the benefits of ZFS (online fsck, data integrity, etc) are well worth the cost (slightly slower perf., need a few extra drives)
>
> -Marc
>
> ----
> D. Marc Stearman
> Lustre Operations Lead
> stearman2 at llnl.gov
> 925.423.9670
>
>
>
>
> On May 5, 2015, at 10:16 AM, Alexander I Kulyavtsev <aik at fnal.gov> wrote:
>
>> How much space is used per inode on the MDT in a production installation?
>> What is the recommended size of an MDT?
>>
>> I'm presently at about 10 KB/inode which seems too high compared with ldiskfs.
>>
>> I ran out of inodes on a ZFS MDT in my tests and ZFS got "locked": the MDT zpool had used all of its space.
>>
>> We have a zpool created as a stripe of mirrors (mirror s0 s1 mirror s2 s3).  Total size ~940 GB; it got stuck at about 97 million files.
>> zfs v0.6.4.1, default 128 KB record.  Fragmentation went to 83% when things got locked at 98% capacity; now I'm at 62% fragmentation after removing some files (down to 97% space capacity).
>>
>> Shall we use a smaller ZFS record size on the MDT, say 8KB or 16KB?  If an inode is ~10KB and the zfs record is 128KB, we are dropping caches and reading data we do not need.
>>
>> Alex.
>>
>> On May 5, 2015, at 10:43 AM, Stearman, Marc <stearman2 at llnl.gov> wrote:
>>
>>> We are using the HGST S842 line of 2.5" SSDs.  We have them configured as a raid10 setup in ZFS.  We started with SAS drives and found them to be too slow, and were bottlenecked on the drives, so we upgraded to SSDs.  The nice thing with ZFS is that it's not just a two-device mirror.  You can do an n-way mirror, so we added the SSDs to each of the vdevs with the SAS drives, let them resilver online, and then removed the SAS drives.  Users did not have to experience any downtime.
>>>
>>> We have about 100PB of Lustre spread over 10 file systems.  All of them are using SSDs.  We have a couple using OCZ SSDs, but I'm not a fan of their RMA policies.  That has changed since they were bought by Toshiba, but I still prefer the HGST drives.
>>>
>>> We configure them as 10 mirror pairs (20 drives total), spread across two JBODs so we can lose an entire JBOD and still have the pool up.
>>>
>>> -Marc
>>>
>>> ----
>>> D. Marc Stearman
>>> Lustre Operations Lead
>>> stearman2 at llnl.gov
>>> 925.423.9670
>>>
>>>
>>>
>>>
>>> On May 4, 2015, at 11:18 AM, Kevin Abbey <kevin.abbey at rutgers.edu> wrote:
>>>
>>>> Hi,
>>>>
>>>> For a single-node OSS I'm planning to use a combined MGS/MDS.  Can anyone recommend an enterprise SSD designed for this workload?  I'd like to create a raid10 with 4x SSD using ZFS as the backing fs.
>>>>
>>>> Are there any published/documented systems using ZFS raid10 on SSDs?
>>>>
>>>> Thanks,
>>>> Kevin
>>>>
>>>>
>>>> -- 
>>>> Kevin Abbey
>>>> Systems Administrator
>>>> Rutgers University
>>>>

-- 
Kevin Abbey
Systems Administrator
Center for Computational and Integrative Biology (CCIB)
http://ccib.camden.rutgers.edu/
  
Rutgers University - Science Building
315 Penn St.
Camden, NJ 08102
Telephone: (856) 225-6770
Fax:(856) 225-6312
Email: kevin.abbey at rutgers.edu


