[lustre-discuss] Recommendation on partitioning OSS with ZFS backend

Tung-Han Hsieh thhsieh at twcp1.phys.ntu.edu.tw
Fri Oct 23 02:08:48 PDT 2020

Dear All,

I have a question about partitioning an OSS with a ZFS backend, where
the OSS has a very large storage array attached.

We have a Lustre file system with two OSSs. Each OSS has storage attached:

$ ssh fs2 df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/root        59G  8.4G   48G  16% /
tmpfs            13G  1.3M   13G   1% /run
tmpfs           5.0M     0  5.0M   0% /run/lock
devtmpfs         10M     0   10M   0% /dev
/dev/shm         63G     0   63G   0% /dev/shm
/dev/sda3       800G  197M  759G   1% /data1
chome_ost/ost   113T   41T   72T  37% /cfs/chome_ost1

$ ssh fs1 df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/root        14G  6.7G  6.5G  51% /
tmpfs           1.6G  672K  1.6G   1% /run
tmpfs           5.0M     0  5.0M   0% /run/lock
devtmpfs         10M     0   10M   0% /dev
/dev/shm        7.9G     0  7.9G   0% /dev/shm
/dev/sda3       113G  188M  107G   1% /data1
chome_ost1/ost  8.6T  121G  8.5T   2% /cfs/chome_ost1
chome_ost2/ost  8.6T  195G  8.5T   3% /cfs/chome_ost2
chome_ost3/ost  8.6T  187G  8.5T   3% /cfs/chome_ost3
chome_ost4/ost  8.6T  175G  8.5T   2% /cfs/chome_ost4

Here the OSS "fs2" was installed about 6 months earlier than "fs1".
Hence fs2 already holds a lot of data, while "fs1" has just started
to receive data.

The differences between the two OSSs are:

1. fs2: has a single OST of size 113T, formatted as ZFS.

2. fs1: the storage was partitioned equally into 4 partitions, and
        each was formatted as a separate ZFS OST.

3. The host server of fs1 is about 8 years older than that of fs2.
   fs2: Xeon Silver 4214 2.2GHz (24 cores in total) + 128GB RAM
   fs1: Xeon E5530 2.4GHz (8 cores in total) + 16GB RAM

Now we have noticed that the average load on fs2 is much heavier than
that on fs1: on fs2 the load is usually around 30.0, while on fs1 it
is usually around 1.0. The heavy load on fs2 often leads to the
following error message in dmesg:

LNet: Service thread pid 19566 completed after 275.03s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
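In case it is relevant to the overload message, we believe the number of OSS I/O service threads can be inspected and capped with commands like the following; the value 128 is just an example, not something we have tuned yet:

```shell
# Show how many ost_io service threads are currently running,
# and the configured maximum, on the OSS node
lctl get_param ost.OSS.ost_io.threads_started
lctl get_param ost.OSS.ost_io.threads_max

# Tentatively cap the maximum number of ost_io service threads
# (128 is an illustrative value only)
lctl set_param ost.OSS.ost_io.threads_max=128
```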

Now we are wondering whether this is normal. If we want to lower the
load on fs2, would it help to re-partition the storage of fs2 into,
say, 4 partitions and set up 4 OSTs on fs2, just like the setup of fs1?
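To clarify what we mean, a re-partitioned fs2 would presumably be set up with something like the following for each of the 4 partitions; the pool name, device name, OST index, and MGS NID below are all illustrative, not our actual configuration:

```shell
# Create a ZFS pool on one of the four partitions
# (device name is illustrative)
zpool create chome_ost5 /dev/sdb1

# Format a dataset in that pool as a Lustre OST
# (--index and --mgsnode values are examples only)
mkfs.lustre --ost --backfstype=zfs --fsname=chome \
    --index=5 --mgsnode=mgs@tcp chome_ost5/ost

# Mount the new OST so it joins the file system
mount -t lustre chome_ost5/ost /cfs/chome_ost5
```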

Our Lustre version is 2.10.7. The whole Lustre file system serves a
computing cluster with more than 400 CPU cores; typically about 70%
of the cores are busy with I/O-heavy computation.

Any suggestions would be very much appreciated.

Best Regards,
