[lustre-discuss] free space on ldiskfs vs. zfs

Alexander I Kulyavtsev aik at fnal.gov
Mon Aug 24 20:54:26 PDT 2015


Hmm,
I was assuming the question was about total space, as I struggled for some time to understand why I had 99 TB of total available space per OSS after installing Lustre on ZFS, while the ldiskfs OSTs have 120 TB on the same hardware. The 20% difference was partially (10%) accounted for by the different RAID-6 / raidz2 configurations, but I was not able to explain the other 10%.

For the question in the original post, I cannot get 24 TB from the "available" field of the df output:
about 207.7e9 KiB "available" on his ZFS Lustre vs. about 198.1e9 KiB on ldiskfs Lustre.
At the same time, the difference of the total space is
233548424256 - 207693153280 = 25855270976 KiB ≈ 24.08 TiB.
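
A quick sanity check of those figures (a minimal Python sketch; all numbers are copied from the lfs df outputs quoted further down):

# 1K-block totals and "Available" from the two "filesystem summary" lines
ldiskfs_total, ldiskfs_avail = 233548424256, 198082192080   # fs18 (ldiskfs)
zfs_total,     zfs_avail     = 207693153280, 207693094400   # fs19 (ZFS)

KIB_PER_TIB = 1024**3
print((ldiskfs_total - zfs_total) / KIB_PER_TIB)   # ~24.08 -> the "missing" ~24 TB shows up in the totals
print((zfs_avail - ldiskfs_avail) / KIB_PER_TIB)   # ~8.95  -> ZFS shows MORE available, since fs18 already holds data
print((58387106064 - 51923288320) / KIB_PER_TIB)   # ~6.02  -> per-OST difference; x4 OSTs is again ~24 TiB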

Götz, could you please tell us what you meant by "available"?

Also,
in my case the output of Linux df on the OSS for the ZFS pool looks strange:
the pool's root dataset is reported as 25T in size (why?), while the formatted OST dataset, which takes up all the space in this pool, shows 33T:

[root at lfs1 ~]# df -h  /zpla-0000  /mnt/OST0000
Filesystem         Size  Used Avail Use% Mounted on
zpla-0000           25T  256K   25T   1% /zpla-0000
zpla-0000/OST0000   33T  8.3T   25T  26% /mnt/OST0000
[root at lfs1 ~]# 

in bytes:

[root at lfs1 ~]# df --block-size=1  /zpla-0000  /mnt/OST0000
Filesystem             1B-blocks          Used      Available Use% Mounted on
zpla-0000         26769344561152        262144 26769344299008   1% /zpla-0000
zpla-0000/OST0000 35582552834048 9093386076160 26489164660736  26% /mnt/OST0000
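
As a side note, and only a guess from the numbers themselves: the byte-level output above is consistent with df reporting Size = Used + Available for each ZFS dataset, which would account for the 25T vs. 33T oddity (the pool's root dataset is nearly empty, while the OST dataset already holds 8.3T). A quick check in Python:

# zpla-0000 (root dataset): Used + Available equals Size exactly
print(262144 + 26769344299008 == 26769344561152)            # True

# zpla-0000/OST0000: Used + Available comes within ~2 MiB of Size
print(35582552834048 - (9093386076160 + 26489164660736))    # 2097152 bytes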

The same OST as reported by Lustre:
[root at lfsa scripts]# lfs df 
UUID                   1K-blocks        Used   Available Use% Mounted on
lfs-MDT0000_UUID       974961920      275328   974684544   0% /mnt/lfsa[MDT:0]
lfs-OST0000_UUID     34748586752  8880259840 25868324736  26% /mnt/lfsa[OST:0]
...

Compare:

[root at lfs1 ~]# zpool list
NAME        SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
zpla-0000  43.5T  10.9T  32.6T         -    16%    24%  1.00x  ONLINE  -
zpla-0001  43.5T  11.0T  32.5T         -    17%    25%  1.00x  ONLINE  -
zpla-0002  43.5T  10.8T  32.7T         -    17%    24%  1.00x  ONLINE  -
I realize that zpool reports raw disk space, including parity blocks (48 TB ≈ 43.7 TiB, roughly the 43.5T shown above) and everything else (like metadata and space for xattr inodes).

I cannot explain the difference between the 40 TB (decimal) of data space (10 × 4 TB drives) and the 35,582,552,834,048 bytes shown by df for the OST.
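
Spelling that gap out (a back-of-the-envelope Python sketch; the 10/12 data fraction assumes the raidz2 10+2 layout described below):

TB, TIB = 1000**4, 1024**4

raw  = 12 * 4 * TB                 # 12 x 4 TB drives, raw
data = 10 * 4 * TB                 # data portion of a raidz2 10+2 vdev
print(raw / TIB)                   # ~43.66 -> roughly the 43.5T shown by zpool list
print(data / TIB)                  # ~36.38 TiB of expected data space
print(35582552834048 / TIB)        # ~32.36 TiB -> what df reports for the OST dataset
print(1 - 35582552834048 / data)   # ~0.11  -> the ~10% I cannot account for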

Best regards, Alex.

On Aug 24, 2015, at 7:52 PM, Christopher J. Morrone <morrone2 at llnl.gov> wrote:

> I could be wrong, but I don't think that the original poster was asking 
> why the SIZE field of zpool list was wrong, but rather why the AVAIL 
> space in zfs list was lower than he expected.
> 
> I would find it easier to answer the question if I knew his drive count 
> and drive size.
> 
> Chris
> 
> On 08/24/2015 02:12 PM, Alexander I Kulyavtsev wrote:
>> Same question here.
>> 
>> 6 TB out of 65 TB is about 9%. In our case about the same fraction was "missing."
>> 
>> My speculation was that it could happen if, at some point between zpool and Linux, a value reported in TB is interpreted as TiB and then converted back to TB, or an unneeded MB-to-MiB conversion is done twice, etc.
>> 
>> Here are my numbers:
>> We have 12 × 4 TB drives per pool, which is 48 TB (decimal).
>> The zpool was created as raidz2 10+2.
>> zpool reports 43.5T.
>> The pool size should be either 48T = 4T × 12 or 40T = 4T × 10, depending on whether zpool shows the space before or after parity.
>> From the Oracle ZFS documentation, "zpool list" returns the total space without overheads, so zpool should report 48 TB instead of 43.5T.
>> 
>> In my case, it looked like a conversion/interpretation error between TB and TiB:
>> 
>> 48*1000*1000*1000*1000/1024/1024/1024/1024 = 43.65574568510055541992
>> 
>> 
>> At disk level:
>> 
>> ~/sas2ircu 0 display
>> 
>> Device is a Hard disk
>>   Enclosure #                             : 2
>>   Slot #                                  : 12
>>   SAS Address                             : 5003048-0-015a-a918
>>   State                                   : Ready (RDY)
>>   Size (in MB)/(in sectors)               : 3815447/7814037167
>>   Manufacturer                            : ATA
>>   Model Number                            : HGST HUS724040AL
>>   Firmware Revision                       : AA70
>>   Serial No                               : PN2334PBJPW14T
>>   GUID                                    : 5000cca23de6204b
>>   Protocol                                : SATA
>>   Drive Type                              : SATA_HDD
>> 
>> One disk size is about 4 TB (decimal):
>> 
>> 3815447*1024*1024 = 4000786153472
>> 7814037167*512  = 4000787029504
>> 
>> The vdev presents the whole disk to the zpool. There is some overhead; some space is left in the sdq9 partition.
>> 
>> [root at lfs1 scripts]# head -4 /etc/zfs/vdev_id.conf
>> alias s0  /dev/disk/by-path/pci-0000:03:00.0-sas-0x50030480015aa90c-lun-0
>> alias s1  /dev/disk/by-path/pci-0000:03:00.0-sas-0x50030480015aa90d-lun-0
>> alias s2  /dev/disk/by-path/pci-0000:03:00.0-sas-0x50030480015aa90e-lun-0
>> alias s3  /dev/disk/by-path/pci-0000:03:00.0-sas-0x50030480015aa90f-lun-0
>> ...
>> alias s12  /dev/disk/by-path/pci-0000:03:00.0-sas-0x50030480015aa918-lun-0
>> ...
>> 
>> [root at lfs1 scripts]# ls -l  /dev/disk/by-path/
>> ...
>> lrwxrwxrwx 1 root root  9 Jul 23 16:27 pci-0000:03:00.0-sas-0x50030480015aa918-lun-0 -> ../../sdq
>> lrwxrwxrwx 1 root root 10 Jul 23 16:27 pci-0000:03:00.0-sas-0x50030480015aa918-lun-0-part1 -> ../../sdq1
>> lrwxrwxrwx 1 root root 10 Jul 23 16:27 pci-0000:03:00.0-sas-0x50030480015aa918-lun-0-part9 -> ../../sdq9
>> 
>> Pool report:
>> 
>> [root at lfs1 scripts]# zpool list
>> NAME        SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
>> zpla-0000  43.5T  10.9T  32.6T         -    16%    24%  1.00x  ONLINE  -
>> zpla-0001  43.5T  11.0T  32.5T         -    17%    25%  1.00x  ONLINE  -
>> zpla-0002  43.5T  10.8T  32.7T         -    17%    24%  1.00x  ONLINE  -
>> [root at lfs1 scripts]#
>> 
>> [root at lfs1 ~]# zpool list -v zpla-0001
>> NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
>> zpla-0001  43.5T  11.0T  32.5T         -    17%    25%  1.00x  ONLINE  -
>>   raidz2  43.5T  11.0T  32.5T         -    17%    25%
>>     s12      -      -      -         -      -      -
>>     s13      -      -      -         -      -      -
>>     s14      -      -      -         -      -      -
>>     s15      -      -      -         -      -      -
>>     s16      -      -      -         -      -      -
>>     s17      -      -      -         -      -      -
>>     s18      -      -      -         -      -      -
>>     s19      -      -      -         -      -      -
>>     s20      -      -      -         -      -      -
>>     s21      -      -      -         -      -      -
>>     s22      -      -      -         -      -      -
>>     s23      -      -      -         -      -      -
>> [root at lfs1 ~]#
>> 
>> [root at lfs1 ~]# zpool get all zpla-0001
>> NAME       PROPERTY                    VALUE                       SOURCE
>> zpla-0001  size                        43.5T                       -
>> zpla-0001  capacity                    25%                         -
>> zpla-0001  altroot                     -                           default
>> zpla-0001  health                      ONLINE                      -
>> zpla-0001  guid                        5472902975201420000         default
>> zpla-0001  version                     -                           default
>> zpla-0001  bootfs                      -                           default
>> zpla-0001  delegation                  on                          default
>> zpla-0001  autoreplace                 off                         default
>> zpla-0001  cachefile                   -                           default
>> zpla-0001  failmode                    wait                        default
>> zpla-0001  listsnapshots               off                         default
>> zpla-0001  autoexpand                  off                         default
>> zpla-0001  dedupditto                  0                           default
>> zpla-0001  dedupratio                  1.00x                       -
>> zpla-0001  free                        32.5T                       -
>> zpla-0001  allocated                   11.0T                       -
>> zpla-0001  readonly                    off                         -
>> zpla-0001  ashift                      12                          local
>> zpla-0001  comment                     -                           default
>> zpla-0001  expandsize                  -                           -
>> zpla-0001  freeing                     0                           default
>> zpla-0001  fragmentation               17%                         -
>> zpla-0001  leaked                      0                           default
>> zpla-0001  feature at async_destroy       enabled                     local
>> zpla-0001  feature at empty_bpobj         active                      local
>> zpla-0001  feature at lz4_compress        active                      local
>> zpla-0001  feature at spacemap_histogram  active                      local
>> zpla-0001  feature at enabled_txg         active                      local
>> zpla-0001  feature at hole_birth          active                      local
>> zpla-0001  feature at extensible_dataset  enabled                     local
>> zpla-0001  feature at embedded_data       active                      local
>> zpla-0001  feature at bookmarks           enabled                     local
>> 
>> Alex.
>> 
>> On Aug 19, 2015, at 8:18 AM, Götz Waschk <goetz.waschk at gmail.com> wrote:
>> 
>>> Dear Lustre experts,
>>> 
>>> I have configured two different Lustre instances, both using Lustre
>>> 2.5.3, one with ldiskfs on hardware RAID-6 and one using ZFS with
>>> RAID-Z2, on the same type of hardware. I was wondering why I have 24 TB
>>> less space available, when I should have the same amount of parity
>>> used:
>>> 
>>> # lfs df
>>> UUID                   1K-blocks        Used   Available Use% Mounted on
>>> fs19-MDT0000_UUID       50322916      472696    46494784   1%
>>> /testlustre/fs19[MDT:0]
>>> fs19-OST0000_UUID    51923288320       12672 51923273600   0%
>>> /testlustre/fs19[OST:0]
>>> fs19-OST0001_UUID    51923288320       12672 51923273600   0%
>>> /testlustre/fs19[OST:1]
>>> fs19-OST0002_UUID    51923288320       12672 51923273600   0%
>>> /testlustre/fs19[OST:2]
>>> fs19-OST0003_UUID    51923288320       12672 51923273600   0%
>>> /testlustre/fs19[OST:3]
>>> filesystem summary:  207693153280       50688 207693094400   0% /testlustre/fs19
>>> UUID                   1K-blocks        Used   Available Use% Mounted on
>>> fs18-MDT0000_UUID       47177700      482152    43550028   1%
>>> /lustre/fs18[MDT:0]
>>> fs18-OST0000_UUID    58387106064  6014088200 49452733560  11%
>>> /lustre/fs18[OST:0]
>>> fs18-OST0001_UUID    58387106064  5919753028 49547068928  11%
>>> /lustre/fs18[OST:1]
>>> fs18-OST0002_UUID    58387106064  5944542316 49522279640  11%
>>> /lustre/fs18[OST:2]
>>> fs18-OST0003_UUID    58387106064  5906712004 49560109952  11%
>>> /lustre/fs18[OST:3]
>>> filesystem summary:  233548424256 23785095548 198082192080  11% /lustre/fs18
>>> 
>>> fs18 is using ldiskfs, while fs19 is ZFS:
>>> # zpool list
>>> NAME          SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
>>> lustre-ost1    65T  18,1M  65,0T     0%  1.00x  ONLINE  -
>>> # zfs list
>>> NAME               USED  AVAIL  REFER  MOUNTPOINT
>>> lustre-ost1       13,6M  48,7T   311K  /lustre-ost1
>>> lustre-ost1/ost1  12,4M  48,7T  12,4M  /lustre-ost1/ost1
>>> 
>>> 
>>> Any idea on where my 6 TB per OST went?
>>> 
>>> Regards, Götz Waschk


