[Lustre-discuss] Problem with striping on lustre 1.8

Lex lexluthor87 at gmail.com
Wed Aug 4 01:06:18 PDT 2010


Hi list

I have a small Lustre storage system with 12 OSTs. After about a year of use, the free space on each target is as follows:

UUID                       bytes        Used   Available   Use%  Mounted on
lustre-MDT0000_UUID       189.4G        9.8G      168.8G     5%  /mnt/lustre[MDT:0]
lustre-OST0001_UUID         6.3T        4.6T        1.3T    73%  /mnt/lustre[OST:1]
lustre-OST0003_UUID         4.0T        3.8T       22.0M    94%  /mnt/lustre[OST:3]
lustre-OST0004_UUID         5.4T        4.9T      163.2G    91%  /mnt/lustre[OST:4]
lustre-OST0005_UUID         5.4T        4.7T      423.6G    87%  /mnt/lustre[OST:5]
lustre-OST0006_UUID         4.0T        3.8T      356.3M    94%  /mnt/lustre[OST:6]
lustre-OST0008_UUID         5.4T        5.0T       99.2G    93%  /mnt/lustre[OST:8]
lustre-OST0009_UUID         5.4T        5.0T      124.2G    92%  /mnt/lustre[OST:9]
lustre-OST000a_UUID         5.4T        4.6T      540.9G    85%  /mnt/lustre[OST:10]
lustre-OST000b_UUID         5.4T        4.5T      557.9G    84%  /mnt/lustre[OST:11]
lustre-OST000c_UUID         6.7T        1.6T        4.7T    24%  /mnt/lustre[OST:12]
lustre-OST000d_UUID         6.7T      478.3G        5.9T     6%  /mnt/lustre[OST:13]
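(For completeness, that listing comes from lfs df on a client; I believe the exact invocation was roughly:

lfs df -h /mnt/lustre
)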
As you can see, the free space across our OSTs is unbalanced. I tried to work around it by setting up a pool, like this:
root@MDS1:~# lctl pool_list lustre.para
Pool: lustre.para
lustre-OST0004_UUID
lustre-OST0005_UUID
lustre-OST000a_UUID
lustre-OST000b_UUID
lustre-OST0001_UUID
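For reference, the pool was created on the MGS/MDS node and its membership is adjusted with commands roughly like these (a sketch from memory; the OST names and the directory below are only examples, not necessarily the exact ones we used):

# create the pool, then add/remove OSTs as their free space changes
lctl pool_new lustre.para
lctl pool_add lustre.para lustre-OST0004_UUID
lctl pool_add lustre.para lustre-OST0005_UUID
lctl pool_remove lustre.para lustre-OST0003_UUID

# point a directory's default layout at the pool so new files
# only allocate objects on the current pool members
lfs setstripe --pool para /mnt/lustre/HD-OST1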

We have been controlling write placement in our directories manually by adding OSTs to the pool or removing them, based on their free space. Everything worked quite well for about two months, but when one of our OSTs ran out of free space, many messages like these appeared in the MDS log:

MDS1 kernel: Lustre: 12012:0:(lov_qos.c:460:qos_shrink_lsm()) using fewer stripes for object 28452745: old 5 new 4
MDS1 kernel: Lustre: 12012:0:(lov_qos.c:460:qos_shrink_lsm()) Skipped 4 previous similar messages
MDS1 kernel: Lustre: 12032:0:(lov_qos.c:460:qos_shrink_lsm()) using fewer stripes for object 28453405: old 5 new 4
MDS1 kernel: Lustre: 12032:0:(lov_qos.c:460:qos_shrink_lsm()) Skipped 39 previous similar messages

And the problem now is: nothing is being written to OST1 (lustre-OST0001_UUID). Its free space has stayed at 1.3T for many days, while the free space on the other OSTs keeps dropping quite fast.


I also tested by creating a brand new directory in our storage system and setting its stripe count to 1 and stripe index to 1 (OST0001), like this:

mkdir /mnt/lustre/HD-OST1/mv
lfs setstripe -c 1 -i 1 /mnt/lustre/HD-OST1/mv

then touched one file in it:

touch test

and the result is:

lfs getstripe /mnt/lustre/HD-OST1/mv/test
OBDS:
1: lustre-OST0001_UUID ACTIVE
3: lustre-OST0003_UUID ACTIVE
4: lustre-OST0004_UUID ACTIVE
5: lustre-OST0005_UUID ACTIVE
6: lustre-OST0006_UUID ACTIVE
8: lustre-OST0008_UUID ACTIVE
9: lustre-OST0009_UUID ACTIVE
10: lustre-OST000a_UUID ACTIVE
11: lustre-OST000b_UUID ACTIVE
12: lustre-OST000c_UUID ACTIVE
13: lustre-OST000d_UUID ACTIVE
/mnt/lustre/HD-OST1/mv/test
        obdidx           objid          objid            group
             4         6759925       0x6725f5                0

The obdidx was 4! I also tried changing the stripe index to the other values in our OST list (3 through 13), and each time getstripe showed the expected obdidx. It only goes wrong with OST1.
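In case it helps, I plan to also double-check how the MDS itself sees OST0001, roughly like this (a sketch; I am not completely sure these are the exact paths/params on 1.8):

# on the MDS: list configured devices and their state
lctl dl

# show which OSTs the MDS-side LOV considers ACTIVE/INACTIVE
cat /proc/fs/lustre/lov/*/target_obd

# confirm the current pool membership
lctl pool_list lustre.para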


Could you please explain this to me, or show me what is wrong with my commands?

Many thanks