Hi list,

I have a small Lustre storage system with 12 OSTs. After about a year of use, the free space on each one is as follows:

    UUID                   bytes    Used  Available  Use%  Mounted on
    lustre-MDT0000_UUID   189.4G    9.8G     168.8G    5%  /mnt/lustre[MDT:0]
    lustre-OST0001_UUID     6.3T    4.6T       1.3T   73%  /mnt/lustre[OST:1]
    lustre-OST0003_UUID     4.0T    3.8T      22.0M   94%  /mnt/lustre[OST:3]
    lustre-OST0004_UUID     5.4T    4.9T     163.2G   91%  /mnt/lustre[OST:4]
    lustre-OST0005_UUID     5.4T    4.7T     423.6G   87%  /mnt/lustre[OST:5]
    lustre-OST0006_UUID     4.0T    3.8T     356.3M   94%  /mnt/lustre[OST:6]
    lustre-OST0008_UUID     5.4T    5.0T      99.2G   93%  /mnt/lustre[OST:8]
    lustre-OST0009_UUID     5.4T    5.0T     124.2G   92%  /mnt/lustre[OST:9]
    lustre-OST000a_UUID     5.4T    4.6T     540.9G   85%  /mnt/lustre[OST:10]
    lustre-OST000b_UUID     5.4T    4.5T     557.9G   84%  /mnt/lustre[OST:11]
    lustre-OST000c_UUID     6.7T    1.6T       4.7T   24%  /mnt/lustre[OST:12]
    lustre-OST000d_UUID     6.7T  478.3G       5.9T    6%  /mnt/lustre[OST:13]

As you can see, the free space across our OSTs is unbalanced. I tried to overcome this by setting up a pool:
    root@MDS1: ~ # lctl pool_list lustre.para
    Pool: lustre.para
    lustre-OST0004_UUID
    lustre-OST0005_UUID
    lustre-OST000a_UUID
    lustre-OST000b_UUID
    lustre-OST0001_UUID

We controlled where writes in our directories went manually, by adding OSTs to the pool or removing them from it based on their free space.
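To give the full picture, the pool membership was adjusted by hand roughly like this (just a sketch of what we ran; the specific OST names and the directory below are only examples):

    # on the MDS: add or remove OSTs from the pool depending on their free space
    lctl pool_add lustre.para lustre-OST000c
    lctl pool_remove lustre.para lustre-OST0003

    # on a client: point a directory's new files at the pool
    lfs setstripe --pool para /mnt/lustre/HD-OST1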
Everything worked quite well for about two months, but when one of our OSTs ran out of free space, many messages like these appeared in the MDS log:

    MDS1 kernel: Lustre: 12012:0:(lov_qos.c:460:qos_shrink_lsm()) using fewer stripes for object 28452745: old 5 new 4
    MDS1 kernel: Lustre: 12012:0:(lov_qos.c:460:qos_shrink_lsm()) Skipped 4 previous similar messages
    MDS1 kernel: Lustre: 12032:0:(lov_qos.c:460:qos_shrink_lsm()) using fewer stripes for object 28453405: old 5 new 4
    MDS1 kernel: Lustre: 12032:0:(lov_qos.c:460:qos_shrink_lsm()) Skipped 39 previous similar messages
And the problem now is: nothing gets written to OST1 (lustre-OST0001_UUID). Its free space has stayed at 1.3T for many days, while the other OSTs keep filling up quickly.

I also tested by creating a brand-new directory in our storage system and setting its stripe index to 1:
    mkdir /mnt/lustre/HD-OST1/mv
    lfs setstripe -c 1 -i 1 /mnt/lustre/HD-OST1/mv

Then I touched one file in it (touch test), and the result is:

    lfs getstripe /mnt/lustre/HD-OST1/mv/test
    OBDS:
    1: lustre-OST0001_UUID ACTIVE
    3: lustre-OST0003_UUID ACTIVE
    4: lustre-OST0004_UUID ACTIVE
    5: lustre-OST0005_UUID ACTIVE
    6: lustre-OST0006_UUID ACTIVE
    8: lustre-OST0008_UUID ACTIVE
    9: lustre-OST0009_UUID ACTIVE
    10: lustre-OST000a_UUID ACTIVE
    11: lustre-OST000b_UUID ACTIVE
    12: lustre-OST000c_UUID ACTIVE
    13: lustre-OST000d_UUID ACTIVE
    /mnt/lustre/HD-OST1/mv/test
         obdidx       objid        objid        group
              4     6759925     0x6725f5            0

The obdidx was 4! I also tried changing the index to other values (3 through 13 in our OST list), and in each case the file was created on the OST with the requested obdidx. It only goes wrong with OST1.
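In case it helps, this is roughly how I have been double-checking the directory's default striping and OST1's state from a client (a minimal sketch; the exact parameter paths may vary between Lustre versions):

    # show the directory's default stripe count/size/offset, to confirm -i 1 was stored
    lfs getstripe -d /mnt/lustre/HD-OST1/mv

    # show free space and free inodes per OST, to see whether OST0001 is actually usable
    lfs df -h /mnt/lustre
    lfs df -i /mnt/lustre

    # check whether the client-side OSC for OST0001 is still active
    lctl get_param osc.*OST0001*.active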
Could you please explain this to me, or show me what is wrong with my commands?

Many thanks