[Lustre-discuss] Problem with striping on lustre 1.8
Lex
lexluthor87 at gmail.com
Sun Aug 8 21:43:58 PDT 2010
Hi list,

After digging deeper into our MDS and OST logs, I found that I may have a
problem with *LAST_ID and lov_objid*. I checked following the instructions in
the Lustre 1.8 manual (821-0035 v1.3, section 23.3.9) and saw that the LAST_ID
on each OST matches the objects actually present there, which means the
problem is an *incorrect lov_objid* file on the MDS.
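
For reference, this is roughly how I did the check on each OSS, following
the manual (here /dev/sdY just stands for the OST block device, so adjust
it to your own layout):

  # read the on-disk LAST_ID without mounting the OST as ldiskfs
  debugfs -c -R 'dump /O/0/LAST_ID /tmp/LAST_ID' /dev/sdY
  od -Ax -td8 /tmp/LAST_ID
  # compare the value with the highest-numbered object under /O/0/d*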
The manual tells me to simply delete the lov_objid file and it will be
re-created from the LAST_ID values on the OSTs. I have had problems before
when changing parameters on the MDS (in the mounted ldiskfs file system),
and I have to be very careful when playing with it. So, could anyone here
help me confirm that *deleting the lov_objid file is harmless and that the
file will be re-created and work properly*?
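
For reference, the sequence I have in mind on the MDS is roughly the
following (here /dev/sdX and /mnt/mdt are just placeholders for our MDT
device and a temporary mount point):

  umount /mnt/mdt                             # stop Lustre on the MDT first
  mount -t ldiskfs /dev/sdX /mnt/mdt          # mount the MDT as plain ldiskfs
  cp /mnt/mdt/lov_objid /root/lov_objid.bak   # keep a backup copy, just in case
  rm /mnt/mdt/lov_objid                       # should be re-created from the OST LAST_IDs
  umount /mnt/mdt
  mount -t lustre /dev/sdX /mnt/mdt           # remount as Lustre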
Any help would be highly appreciated!
On Wed, Aug 4, 2010 at 3:06 PM, Lex <lexluthor87 at gmail.com> wrote:
> Hi list
>
> I have a small Lustre storage system with 12 OSTs. After about one year of
> use, the free space on each OST is as follows:
>
> UUID                  bytes     Used  Available  Use%  Mounted on
> lustre-MDT0000_UUID  189.4G     9.8G     168.8G    5%  /mnt/lustre[MDT:0]
> lustre-OST0001_UUID    6.3T     4.6T       1.3T   73%  /mnt/lustre[OST:1]
> lustre-OST0003_UUID    4.0T     3.8T      22.0M   94%  /mnt/lustre[OST:3]
> lustre-OST0004_UUID    5.4T     4.9T     163.2G   91%  /mnt/lustre[OST:4]
> lustre-OST0005_UUID    5.4T     4.7T     423.6G   87%  /mnt/lustre[OST:5]
> lustre-OST0006_UUID    4.0T     3.8T     356.3M   94%  /mnt/lustre[OST:6]
> lustre-OST0008_UUID    5.4T     5.0T      99.2G   93%  /mnt/lustre[OST:8]
> lustre-OST0009_UUID    5.4T     5.0T     124.2G   92%  /mnt/lustre[OST:9]
> lustre-OST000a_UUID    5.4T     4.6T     540.9G   85%  /mnt/lustre[OST:10]
> lustre-OST000b_UUID    5.4T     4.5T     557.9G   84%  /mnt/lustre[OST:11]
> lustre-OST000c_UUID    6.7T     1.6T       4.7T   24%  /mnt/lustre[OST:12]
> lustre-OST000d_UUID    6.7T   478.3G       5.9T    6%  /mnt/lustre[OST:13]
> As you can see, free space is unbalanced across our OSTs. I tried to
> overcome this by setting up a pool (the creation commands are sketched
> after the listing below):
>
> root at MDS1: ~ # lctl pool_list lustre.para
> Pool: lustre.para
> lustre-OST0004_UUID
> lustre-OST0005_UUID
> lustre-OST000a_UUID
> lustre-OST000b_UUID
> lustre-OST0001_UUID
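>
> For completeness, the pool was created roughly like this (I am writing
> these commands from memory, so take them as a sketch rather than an
> exact transcript):
>
>   lctl pool_new lustre.para
>   lctl pool_add lustre.para lustre-OST0001_UUID
>   lctl pool_add lustre.para lustre-OST0004_UUID lustre-OST0005_UUID
>   lctl pool_add lustre.para lustre-OST000a_UUID lustre-OST000b_UUID
>   lfs setstripe --pool para /mnt/lustre/HD-OST1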
>
> We controlled where writes went in our directories manually, by adding or
> removing pool members based on their free space. Everything worked quite
> well for about two months. But when one of our OSTs ran out of free space,
> many messages like these appeared in the MDS log:
>
> MDS1 kernel: Lustre: 12012:0:(lov_qos.c:460:qos_shrink_lsm()) using
> fewer stripes for object 28452745: old 5 new 4
> MDS1 kernel: Lustre: 12012:0:(lov_qos.c:460:qos_shrink_lsm()) Skipped 4
> previous similar messages
> MDS1 kernel: Lustre: 12032:0:(lov_qos.c:460:qos_shrink_lsm()) using fewer
> stripes for object 28453405: old 5 new 4
> MDS1 kernel: Lustre: 12032:0:(lov_qos.c:460:qos_shrink_lsm()) Skipped 39
> previous similar messages
>
> And the problem now is:
>
> Nothing is being written to OST1 (lustre-OST0001_UUID); its free space has
> stayed at 1.3T for many days, while the other OSTs fill up quite fast.
>
>
> I also tested by making a brand-new directory in our storage system and
> setting the stripe count to 1 with starting index 1 (i.e. OST1), like this:
>
> mkdir /mnt/lustre/HD-OST1/mv
> lfs setstripe -c 1 -i 1 /mnt/lustre/HD-OST1/mv
>
> Then I touched one file there (touch test), and the result is:
>
> lfs getstripe /mnt/lustre/HD-OST1/mv/test
> OBDS:
> 1: lustre-OST0001_UUID ACTIVE
> 3: lustre-OST0003_UUID ACTIVE
> 4: lustre-OST0004_UUID ACTIVE
> 5: lustre-OST0005_UUID ACTIVE
> 6: lustre-OST0006_UUID ACTIVE
> 8: lustre-OST0008_UUID ACTIVE
> 9: lustre-OST0009_UUID ACTIVE
> 10: lustre-OST000a_UUID ACTIVE
> 11: lustre-OST000b_UUID ACTIVE
> 12: lustre-OST000c_UUID ACTIVE
> 13: lustre-OST000d_UUID ACTIVE
> /mnt/lustre/HD-OST1/mv/test
>         obdidx     objid     objid    group
>              4   6759925  0x6725f5        0
>
> Its obdidx was 4! I also tried changing the index to each of the other
> values (3-13, matching our OST list), and in every case getstripe showed
> the correct obdidx. It only goes wrong with OST1!
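>
> In case it helps with diagnosis, I can also compare the precreate
> counters the MDS keeps for OST1 with the OST's own LAST_ID, along these
> lines (the exact device names are a guess based on our setup):
>
>   # on the MDS
>   lctl get_param osc.lustre-OST0001-osc.prealloc_next_id
>   lctl get_param osc.lustre-OST0001-osc.prealloc_last_id
>   # on the OSS serving OST0001
>   lctl get_param obdfilter.lustre-OST0001.last_id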
>
>
> Could you please explain this, or show me what is wrong with my command?
>
> Many thanks
>