[lustre-discuss] zfs -- mds/mdt -- ssd model / type recommendation
Kevin Abbey
kevin.abbey at rutgers.edu
Wed May 6 09:44:16 PDT 2015
Dear Marc and list,
All of the replies are very helpful.
Could you share your method (command lines) for expanding the ZFS pool? I
created my initial setup last summer with mkfs.lustre using a ZFS
backing filesystem. Are you expanding the zpool using the zpool commands
directly on it? I have not edited an existing pool holding live data
and just want to be sure I understand the method correctly.
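My guess from the ZFS docs is that it would be something like the
following, run directly on the MDS; the device names below are just
placeholders:

```shell
# Guess: add another mirror vdev to the existing MDT pool,
# expanding it online (no reformat, more inodes available).
zpool add lustre-mdt0 mirror \
    /dev/disk/by-id/ata-NEW_DISK_1 \
    /dev/disk/by-id/ata-NEW_DISK_2

# Verify the new capacity and pool layout.
zpool list lustre-mdt0
zpool status lustre-mdt0
```

Is that all there is to it, or are there Lustre-side steps as well?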
Regarding the statistics of the file system, can you or others share the
cli methods to obtain:
- number of files on the file system
- number of inodes
- total space used by inodes
- average inode size (total space used by inodes / number of inodes)
- fragmentation %
- distribution of file sizes on the system?
- frequency of file access?
- random vs streaming IO?
Perhaps a link to a reference is sufficient.
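For some of these, I think the following would work, though I'd
appreciate confirmation (paths match my setup below; the fragmentation
property needs a recent zfs, 0.6.4 or later I believe):

```shell
# Inode (file) counts per MDT/OST, from a Lustre client:
lfs df -i /lustre

# Space and inode usage on the MDS itself:
df -i /mnt/lustre/local/lustre-MDT0000
zfs list lustre-mdt0/mdt0

# Pool fragmentation percentage:
zpool get fragmentation lustre-mdt0

# Rough file-size distribution, e.g. count of files over 1 GiB:
lfs find /lustre -type f -size +1G | wc -l
```

I have not found built-in counters for access frequency or
random-vs-streaming IO, so pointers there would be especially welcome.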
This is the layout of the system:
------------------------------------------------------------------------------------------------
zpool list
------------------------------------------------------------------------------------------------
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
lustre-mdt0 7.25T 1.01G 7.25T 0% 1.00x ONLINE -
lustre-ost0 72.5T 28.1T 44.4T 38% 1.00x ONLINE -
lustre-ost1 72.5T 33.1T 39.4T 45% 1.00x ONLINE -
lustre-ost2 72.5T 27.2T 45.3T 37% 1.00x ONLINE -
lustre-ost3 72.5T 31.2T 41.3T 43% 1.00x ONLINE -
------------------------------------------------------------------------------------------------
df -h
lustre-mdt0/mdt0 7.1T 1.1G 7.1T 1% /mnt/lustre/local/lustre-MDT0000
lustre-ost0/ost0 57T 23T 35T 40% /mnt/lustre/local/lustre-OST0000
lustre-ost1/ost1 57T 27T 31T 47% /mnt/lustre/local/lustre-OST0001
lustre-ost2/ost2 57T 22T 35T 39% /mnt/lustre/local/lustre-OST0002
lustre-ost3/ost3 57T 25T 32T 45% /mnt/lustre/local/lustre-OST0003
------------------------------------------------------------------------------------------------
192.168.0.36 at tcp1:/lustre
226T 96T 131T 43% /lustre
------------------------------------------------------------------------------------------------
Disks in use:
lustre-mdt0/mdt0: raid10; each disk is a 4TB enterprise SATA
(ST4000NM0033-9Z)
lustre-ost*: stripe across 2x raidz2; each raidz2 vdev is 10x 4TB
enterprise SATA (ST4000NM0033-9Z)
------------------------------------------------------------------------------------------------
The ZFS benefits you described are why I am using it.
The current MDS/MDT consists of a ZFS raid10 of 4TB enterprise SATA
drives. I haven't measured its performance specifically, but I assume
this is a good place to gain performance by using the proper type of
SSD. I'll be doubling the number of OSTs within ~60 days. I may
deploy a new Lustre with 2.7, migrate the data, then incorporate the
existing JBODs into the new Lustre. One issue to resolve is that the
existing setup did not have o2ib added as an option. I read that adding
this after creation is not guaranteed to proceed without failure; hence
the plan to start with a new MDS/MDT. The system currently uses tcp and
IPoIB. We have only 16 IB clients and 26 tcp clients. Most file
accesses are large files for genomic/computational biology or MD
simulations, with sizes ranging from a few GB to 100-500GB.
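For the record, my understanding is that on the new servers I would
declare both networks in the lnet module options, something like this
(the interface names are guesses for our hardware; our tcp network is
tcp1 as shown above):

```shell
# Sketch of /etc/modprobe.d/lustre.conf on the new MDS/OSS nodes,
# enabling both InfiniBand (o2ib) and ethernet (tcp1) LNet networks.
# ib0 and eth0 are placeholder interface names.
options lnet networks="o2ib0(ib0),tcp1(eth0)"
```

Corrections welcome if declaring both networks up front needs more than
this.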
The ZIL is another place for performance improvement. I've read that
since the ZIL is small, the ZILs for multiple pools could be located on
partitions of mirrored disks, thus sharing a pair of mirrored SSDs. Is
this incompatible with Lustre? It has been a while since I read about
this, and I did not find any example usage with a Lustre setup, only
standalone ZFS setups. I also read that there is a plan for ZIL support
in Lustre. Is there a link where I can read more about this and the
implementation schedule? It would be interesting to learn whether I can
deploy a system now and turn on ZIL support when it becomes available.
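What I had in mind is something like the sketch below, where two pools
each get a mirrored log carved from partitions of the same SSD pair
(device names are placeholders):

```shell
# Sketch: share one mirrored SSD pair as separate log (ZIL) devices
# for two pools, by partitioning the SSDs.
# sda/sdb are the two SSDs; partition 1 goes to the MDT pool,
# partition 2 to an OST pool. Device names are placeholders.
zpool add lustre-mdt0 log mirror /dev/sda1 /dev/sdb1
zpool add lustre-ost0 log mirror /dev/sda2 /dev/sdb2

# Confirm the log vdevs were added.
zpool status lustre-mdt0 lustre-ost0
```

Is there any reason this layout would misbehave under Lustre?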
Thank you for any comments/assistance possible,
Kevin
On 05/05/2015 04:07 PM, Stearman, Marc wrote:
> Most of our production MDS nodes have a 2.7TB zpool. They vary in amount full, but one file system has 1 billion files and is 63% full. I plan on adding a bit more storage to try and get % full down to about 50%. This is another nice attribute of ZFS. I can increase the size of the pool online without having to reformat the file system, thereby adding more inodes.
>
> Also, remember that by default, ZFS has redundant_metadata=all defined. ZFS is storing an extra copy of all the metadata for the pool. And, ZFS stores a checksum of all blocks on the file system, so yes there is more overhead, but you do not need to do offline fsck, and it's checking not just metadata, but actual data as well.
>
> Disks today are relatively inexpensive, and I feel the benefits of ZFS (online fsck, data integrity, etc) are well worth the cost (slightly slower perf., need a few extra drives)
>
> -Marc
>
> ----
> D. Marc Stearman
> Lustre Operations Lead
> stearman2 at llnl.gov
> 925.423.9670
>
>
>
>
> On May 5, 2015, at 10:16 AM, Alexander I Kulyavtsev <aik at fnal.gov>
> wrote:
>
>> How much space is used per inode on the MDT in a production installation?
>> What is the recommended size of the MDT?
>>
>> I'm presently at about 10 KB/inode which seems too high compared with ldiskfs.
>>
>> I ran out of inodes on the ZFS MDT in my tests and ZFS got "locked": the MDT zpool had used all of its space.
>>
>> We have the zpool created as a stripe of mirrors (mirror s0 s1 mirror s2 s3). Total size ~940 GB; we got stuck at about 97 million files.
>> zfs v0.6.4.1, default 128 KB recordsize. Fragmentation went to 83% when things got locked at 98% capacity; now I'm at 62% fragmentation after removing some files (down to 97% space capacity).
>>
>> Shall we use smaller ZFS record size on MDT, say 8KB or 16KB? If inode is ~10KB and zfs record 128KB, we are dropping caches and read data we do not need.
>>
>> Alex.
>>
>> On May 5, 2015, at 10:43 AM, Stearman, Marc <stearman2 at llnl.gov> wrote:
>>
>>> We are using the HGST S842 line of 2.5" SSDs. We have them configured as a raid10 setup in ZFS. We started with SAS drives and found them to be too slow, and were bottlenecked on the drives, so we upgraded to SSDs. The nice thing with ZFS is that it's not just a two device mirror. You can do an n-way mirror, so we added the SSDs to each of the vdevs with the SAS drives, let them resilver online, and then removed the SAS drives. Users did not have to experience any downtime.
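[If I follow the n-way mirror trick described above, the sequence per
vdev would be roughly the following; device names are placeholders:]

```shell
# For each mirror vdev: attach an SSD alongside the existing SAS
# drive, making the mirror 3-way; wait for the resilver to finish;
# then detach the SAS drive. All online, no downtime.
zpool attach pool sas-disk-0 ssd-disk-0
zpool status pool            # wait until resilver completes
zpool detach pool sas-disk-0
```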
>>>
>>> We have about 100PB of Lustre spread over 10 file systems. All of them are using SSDs. We have a couple using OCZ SSDs, but I'm not a fan of their RMA policies. That has changed since they were bought by Toshiba, but I still prefer the HGST drives.
>>>
>>> We configure them as 10 mirror pairs (20 drives total), spread across two JBODs so we can lose an entire JBOD and still have the pool up.
>>>
>>> -Marc
>>>
>>> ----
>>> D. Marc Stearman
>>> Lustre Operations Lead
>>> stearman2 at llnl.gov
>>> 925.423.9670
>>>
>>>
>>>
>>>
>>> On May 4, 2015, at 11:18 AM, Kevin Abbey <kevin.abbey at rutgers.edu> wrote:
>>>
>>>> Hi,
>>>>
>>>> For a single node OSS I'm planning to use a combined MGS/MDS. Can anyone recommend an enterprise ssd designed for this workload? I'd like to create a raid10 with 4x ssd using zfs as the backing fs.
>>>>
>>>> Are there any published/documented systems using zfs in raid 10 using ssd?
>>>>
>>>> Thanks,
>>>> Kevin
>>>>
>>>>
>>>> --
>>>> Kevin Abbey
>>>> Systems Administrator
>>>> Rutgers University
>>>>
>>>> _______________________________________________
>>>> lustre-discuss mailing list
>>>> lustre-discuss at lists.lustre.org
>>>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
--
Kevin Abbey
Systems Administrator
Center for Computational and Integrative Biology (CCIB)
http://ccib.camden.rutgers.edu/
Rutgers University - Science Building
315 Penn St.
Camden, NJ 08102
Telephone: (856) 225-6770
Fax:(856) 225-6312
Email: kevin.abbey at rutgers.edu