[lustre-discuss] zfs -- mds/mdt -- ssd model / type recommendation

Stearman, Marc stearman2 at llnl.gov
Wed May 6 10:19:11 PDT 2015



On May 6, 2015, at 9:44 AM, Kevin Abbey <kevin.abbey at rutgers.edu>
 wrote:

> Dear Marc and list,
> 
> All of the replies are very helpful.
> 
> Could you share your method (command lines) to expand the zfs pool? I created my initial setup last summer and used the mkfs.lustre with zfs backing.  Are you expanding the zpool using the zfs commands directly on the zpool ?  I have not done editing of an existing pool with live data and just want to be sure I understand the methods correctly.

Sure.  Given a pool like this:

# stout-mds1 /dev/disk/by-vdev > zpool status
  pool: stout-mds1
 state: ONLINE
  scan: 
config:
        NAME        STATE     READ WRITE CKSUM
        stout-mds1  ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            A0      ONLINE       0     0     0
            B0      ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            A1      ONLINE       0     0     0
            B1      ONLINE       0     0     0
          mirror-2  ONLINE       0     0     0
            A2      ONLINE       0     0     0
            B2      ONLINE       0     0     0
…

We have 10 mirror pairs defined, but the commands are all the same.

To add a device (this is all in the zpool manpage btw) you could run:  "zpool add <pool-name> mirror <dev1> <dev2>"  This would add another mirror pair as a vdev to the pool.  If you want to do what we did, and replace SAS drives with SSDs, the procedure is a bit different.

	• Run "zpool attach -o ashift=9 <pool> <dev1> <dev2>" for each SSD being added.
		• ashift=9 aligns the SSDs to the same sector-size boundary as the existing devices; in this case the spinning disks use 512B (2^9) sectors.
		• <pool> is the zpool.  At LC, this is the hostname of the MDS (i.e. stout-mds1)
		• <dev1> is the first device in the existing mirror set (i.e. A0, A1, A2, etc.)
		• <dev2> is the device name of the SSD you are adding (i.e. A10, B10, A11, etc.)
		• See "man zpool" for more detailed information

So to add an SSD to the above pool, the command would be: "zpool attach -o ashift=9 stout-mds1 A0 A10".  This adds the new device as a third side of the mirror-0 vdev; once it has resilvered, you can "zpool detach" the old drive.
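If you're repeating that swap across all the mirror pairs, it can be worth printing the whole command list for review before running anything.  A minimal sketch, assuming the by-vdev naming used above (A0..A9 for the existing side, A10..A19 for the new SSDs -- check /dev/disk/by-vdev on your own system):

```shell
#!/bin/sh
# Dry run: print one attach command per mirror pair instead of
# executing it, so the device pairings can be eyeballed first.
pool=stout-mds1
i=0
while [ "$i" -lt 10 ]; do
    echo "zpool attach -o ashift=9 $pool A$i A1$i"
    i=$((i + 1))
done
```

Pipe the output to "sh" (or run each line by hand) once it looks right, waiting for each resilver to finish before detaching the old drive.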

> Regarding the statistics of the file system, can you or others share the cli methods to obtain:
> 
> - number of files on the file system
> - number of inodes
> - total space used by inodes
> - size of inodes  (?  inodes total used space / inodes )
> - fragmentation %
> - distribution of file sizes on the system?
> - frequency of file access?
> - random vs streaming IO?

You can use the "-i" flag to df to show inodes.  Run that on the MDS and you can see how many total files you can support.  With a 7TB MDT, I suspect you can support roughly 4 billion files, but you want to keep that volume around 50% full, so 2 billion is a safer planning number.
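As a back-of-the-envelope check on those figures (the ~2 KiB-per-file cost is my assumption, not a measured number -- "df" and "df -i" on your own MDS will give you the real ratio):

```shell
#!/bin/sh
# Rough MDT capacity estimate: bytes available divided by an
# assumed per-file metadata cost of 2 KiB.
bytes=$((7 * 1024 * 1024 * 1024 * 1024))   # 7 TiB MDT
per_inode=2048                             # assumed bytes per file
total=$((bytes / per_inode))
echo "max files: $total"        # ~3.8 billion
echo "at 50%:    $((total / 2))"  # ~1.9 billion
```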

You can run "lfs df" and "lfs df -i" to see how your OSTs are balanced for the distribution of objects within the file system.  If you add an OST later, Lustre will give preference to the new OSTs to get them in balance.  This may impact performance a bit.

The newest version of ZFS (0.6.3)  has stats you can use to look at fragmentation of the pool via files in /proc  (we haven't done the pool upgrade yet, so I don't recall the path).

Typically, we run IOR and mdtest to benchmark the file system before we give it to users.  I will often run small IORs and log the data to splunk so that I can trend over time to see if changes impacted performance.  FIO is another good benchmarking tool to test your I/O.

> 
> Perhaps a link to a reference is sufficient.
> 
> 
> This is the layout of the system:
> ------------------------------------------------------------------------------------------------
> zpool list
> ------------------------------------------------------------------------------------------------
> NAME          SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH ALTROOT
> lustre-mdt0  7.25T  1.01G  7.25T     0%  1.00x  ONLINE  -
> lustre-ost0  72.5T  28.1T  44.4T    38%  1.00x  ONLINE  -
> lustre-ost1  72.5T  33.1T  39.4T    45%  1.00x  ONLINE  -
> lustre-ost2  72.5T  27.2T  45.3T    37%  1.00x  ONLINE  -
> lustre-ost3  72.5T  31.2T  41.3T    43%  1.00x  ONLINE  -
> ------------------------------------------------------------------------------------------------
> (using) df -h
> 
> lustre-mdt0/mdt0  7.1T  1.1G  7.1T   1% /mnt/lustre/local/lustre-MDT0000
> lustre-ost0/ost0   57T   23T   35T  40% /mnt/lustre/local/lustre-OST0000
> lustre-ost1/ost1   57T   27T   31T  47% /mnt/lustre/local/lustre-OST0001
> lustre-ost2/ost2   57T   22T   35T  39% /mnt/lustre/local/lustre-OST0002
> lustre-ost3/ost3   57T   25T   32T  45% /mnt/lustre/local/lustre-OST0003
> ------------------------------------------------------------------------------------------------
> 192.168.0.36 at tcp1:/lustre
>                      226T   96T  131T  43% /lustre
> ------------------------------------------------------------------------------------------------
> Disks in use.
> 
> lustre-mdt0/mdt0 raid10, each disk is a 4TB Enterprise SATA                      -- ST4000NM0033-9Z
> lustre-ost stripe across 2xraidz2, each raidz is 10x 4TB Enterprise SATA--  ST4000NM0033-9Z
> ------------------------------------------------------------------------------------------------
> 
> 
> 
> 
> The zfs benefits you described are why I am using it.
> 
> 
> The current mds/mdt I have consists of a zfs raid 10 using 4TB enterprise sata drives.  I haven't measured its performance specifically, but I assume this is a good place to make a performance improvement by using the proper type of SSD drive.  I'll be doubling the number of OSTs within ~60 days.  I may implement a new lustre with 2.7, then migrate data, then incorporate the existing jbods in the new lustre.  One issue to resolve is that the existing setup did not have o2ib added as an option.  I read that adding this after creation is not guaranteed to proceed without failure; thus the reason for starting with a new mds/mdt.  It is currently using tcp and IPoIB.  We only have 16 IB clients and 26 tcp clients.  Most file accesses are large files for genomic/computational biology or MD simulations, with file sizes ranging from a few GB to 100-500GB.

You should be able to change NIDs without reformatting or migrating.  You just need to do a write_conf on all the servers and restart them (clients unmounted of course).  This is all described in the Lustre Manual.  We've done it a few times here and there and it works.
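For reference, the writeconf step itself is one command per backing target.  A dry-run sketch using the dataset names from your "zpool list" output (the full procedure -- unmount all clients, stop all targets, then remount MDT before OSTs -- is in the Lustre Manual under "Regenerating Lustre Configuration Logs"):

```shell
#!/bin/sh
# Print the writeconf command for each target, MDT first, rather
# than running it, so the list can be checked before the outage.
for tgt in lustre-mdt0/mdt0 lustre-ost0/ost0 lustre-ost1/ost1 \
           lustre-ost2/ost2 lustre-ost3/ost3; do
    echo "tunefs.lustre --writeconf $tgt"
done
```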

> 
> The zil is another place for performance improvement.  I've read that since the zil is small, the zil from multiple pools could be located on partitions of mirrored disks, thus sharing mirrored ssds.  Is this incompatible with lustre?  It has been a while since I read about this and did not find any example usage with a lustre setup, only zfs alone setup.  I also read that there is a zil support plan for lustre. Is there a link to where I can read more about this and the schedule for implementation.  It will be interesting to learn if I can deploy a system now and turn on the zil support when it becomes available.

The ZIL is not supported by Lustre.  There are plans to add that to Lustre, but I don't know all the details.  Adding a cache device does work with the pool and can help if you are running on spinning disks.

I will second what everyone is saying about maxing out your memory in the MDS.  We are at 128GB today.  I would prefer 256GB.  Also, if you have a large number of clients, you should check /proc/slabinfo (or run slabtop) and see what is using the memory on your MDS.  We found that ldlm_locks and ldlm_resources were consuming a great deal of memory on our MDS nodes, and have taken steps to limit the clients to avoid OOM situations.  Also, the more memory you have for the ZFS ARC, the better.  I think the memory is better used in the ARC than in ldlm_locks.  Obviously you want to be reasonable with your limits, but I doubt the MDS needs to hold onto 50+GB of RAM for locks.

-Marc

----
D. Marc Stearman
Lustre Operations Lead
stearman2 at llnl.gov
925.423.9670


> 
> On 05/05/2015 04:07 PM, Stearman, Marc wrote:
>> Most of our production MDS nodes have a 2.7TB zpool.  They vary in amount full, but one file system has 1 billion files and is 63% full.  I plan on adding a bit more storage to try and get % full down to about 50%.  This is another nice attribute of ZFS.  I can increase the size of the pool online without having to reformat the file system, thereby adding more inodes.
>> 
>> Also, remember that by default, ZFS has redundant_metadata=all defined.  ZFS is storing an extra copy of all the metadata for the pool.  And, ZFS stores a checksum of all blocks on the file system, so yes there is more overhead, but you do not need to do offline fsck, and it's checking not just metadata, but actual data as well.
>> 
>> Disks today are relatively inexpensive, and I feel the benefits of ZFS (online fsck, data integrity, etc) are well worth the cost (slightly slower perf., need a few extra drives)
>> 
>> -Marc
>> 
>> ----
>> D. Marc Stearman
>> Lustre Operations Lead
>> stearman2 at llnl.gov
>> 925.423.9670
>> 
>> 
>> 
>> 
>> On May 5, 2015, at 10:16 AM, Alexander I Kulyavtsev <aik at fnal.gov>
>>  wrote:
>> 
>>> How much space is used per i-node on MDT in production installation.
>>> What is recommended size of MDT?
>>> 
>>> I'm presently at about 10 KB/inode which seems too high compared with ldiskfs.
>>> 
>>> I ran out of inodes on zfs mdt in my tests and zfs got "locked". MDT zpool got all space used.
>>> 
>>> We have zpool created as a stripe of mirrors (mirror s0 s1 mirror s2 s3). Total size ~940 GB, got stuck at about 97 million files.
>>> zfs v0.6.4.1, default 128 KB record. Fragmentation went to 83% when things got locked at 98% capacity; now I'm at 62% fragmentation after I removed some files (down to 97% space capacity).
>>> 
>>> Shall we use smaller ZFS record size on MDT, say 8KB or 16KB? If inode is ~10KB and zfs record 128KB, we are dropping caches and read data we do not need.
>>> 
>>> Alex.
>>> 
>>> On May 5, 2015, at 10:43 AM, Stearman, Marc <stearman2 at llnl.gov> wrote:
>>> 
>>>> We are using the HGST S842 line of 2.5" SSDs.  We have them configured as a raid10 setup in ZFS.  We started with SAS drives and found them to be too slow, and were bottlenecked on the drives, so we upgraded to SSDs.  The nice thing with ZFS is that it's not just a two device mirror.  You can do an n-way mirror, so we added the SSDs to each of the vdevs with the SAS drives, let them resilver online, and then removed the SAS drives.  Users did not have to experience any downtime.
>>>> 
>>>> We have about 100PB of Lustre spread over 10 file systems.  All of them are using SSDs.  We have a couple using OCZ SSDs, but I'm not a fan of their RMA policies.  That has changed since they were bought by Toshiba, but I still prefer the HGST drives.
>>>> 
>>>> We configure them as 10 mirror pairs (20 drives total), spread across two JBODs so we can lose an entire JBOD and still have the pool up.
>>>> 
>>>> -Marc
>>>> 
>>>> ----
>>>> D. Marc Stearman
>>>> Lustre Operations Lead
>>>> stearman2 at llnl.gov
>>>> 925.423.9670
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On May 4, 2015, at 11:18 AM, Kevin Abbey <kevin.abbey at rutgers.edu> wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> For a single node OSS I'm planning to use a combined MGS/MDS.  Can anyone recommend an enterprise ssd designed for this workload?  I'd like to create a raid10  with 4x ssd using zfs as the backing fs.
>>>>> 
>>>>> Are there any published/documented systems using zfs in raid 10 using ssd?
>>>>> 
>>>>> Thanks,
>>>>> Kevin
>>>>> 
>>>>> 
>>>>> -- 
>>>>> Kevin Abbey
>>>>> Systems Administrator
>>>>> Rutgers University
>>>>> 
>>>>> _______________________________________________
>>>>> lustre-discuss mailing list
>>>>> lustre-discuss at lists.lustre.org
>>>>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> 
> -- 
> Kevin Abbey
> Systems Administrator
> Center for Computational and Integrative Biology (CCIB)
> http://ccib.camden.rutgers.edu/
> Rutgers University - Science Building
> 315 Penn St.
> Camden, NJ 08102
> Telephone: (856) 225-6770
> Fax:(856) 225-6312
> Email: kevin.abbey at rutgers.edu
> 


