[Lustre-discuss] What's the human translation for: ost_write operation failed with -28

Rappleye, Jason (ARC-TN)[Computer Sciences Corporation] jason.rappleye at nasa.gov
Mon Dec 5 23:31:24 PST 2011


Hi Thomas,

On Dec 5, 2011, at 10:31 PM, Thomas Guthmann wrote:

> Hi Jason,
> 
>> $ lctl get_param obdfilter.*.tot_granted
>> Units are in bytes.
> Thanks. I wasn't aware of this "grant". I googled for it and found some
> information about it, but it's still unclear. Should I understand that the
> values in obdfilter.*.tot_granted are actually 'reserved' space allocated
> by clients but not used ?

In a sense, yes. My understanding is that grant space exists to ensure that client applications can perform asynchronous writes without dirtying more pages than the available space on an OST. Otherwise, writes would have to be synchronous to ensure that clients didn't use more space than is available.

> So REAL_FREESPACE = DF_FREESPACE - TOT_GRANTED ? Correct ?

That's more or less how our monitoring tools interpret it; a knowledgeable Lustre engineer might chime in and say otherwise :-)

> FYI, I have the following values on the OSS it couldn't connect/write to :
> 
> obdfilter.foobar-OST0003.tot_granted=17429659648
> obdfilter.foobar-OST0004.tot_granted=13648875520
> obdfilter.foobar-OST0005.tot_granted=18136141824
> 
> and : lfs df (seen from the client)
> 
> foobar-OST0003_UUID   2113787824 1986169192  20244388   93% /lustre/foobar[OST:3]
> foobar-OST0004_UUID   2113787824 1986170884  20242696   93% /lustre/foobar[OST:4]
> foobar-OST0005_UUID   2113787824 1988667844  17745736   94% /lustre/foobar[OST:5]
> 
> So, for instance for OST5 I have 17745736 - (18136141824/1024) = ... 
>                                 17745736 - 17711076           = 34660 KB left
> 
> Am I right ? 

Yes, though by the standards of our system, which has ~12,000 clients, those values of tot_granted are obscenely low. A better baseline for comparison would be tot_granted on a freshly mounted OST on your filesystem.
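
As a rough sketch, the arithmetic above can be scripted. This assumes that REAL_FREESPACE = DF_FREESPACE - TOT_GRANTED is a fair approximation (which, as noted, a Lustre engineer might dispute), that `lfs df` reports available space in KB, and that tot_granted is in bytes. The function and variable names are made up for illustration; the input values are the ones quoted above.

```python
def parse_tot_granted(lines):
    """Parse 'obdfilter.<target>.tot_granted=<bytes>' lines into {target: bytes}."""
    granted = {}
    for line in lines:
        name, _, value = line.strip().partition("=")
        # name looks like 'obdfilter.foobar-OST0005.tot_granted'
        target = name.split(".")[1]
        granted[target] = int(value)
    return granted


def estimated_free_kb(df_available_kb, tot_granted_bytes):
    """Estimate usable space in KB: df 'Available' minus outstanding grant."""
    return df_available_kb - tot_granted_bytes // 1024


# Output of 'lctl get_param obdfilter.*.tot_granted' on the OSS (bytes)
lctl_output = [
    "obdfilter.foobar-OST0003.tot_granted=17429659648",
    "obdfilter.foobar-OST0004.tot_granted=13648875520",
    "obdfilter.foobar-OST0005.tot_granted=18136141824",
]

# 'Available' column of 'lfs df' as seen from the client (KB)
df_available_kb = {
    "foobar-OST0003": 20244388,
    "foobar-OST0004": 20242696,
    "foobar-OST0005": 17745736,
}

granted = parse_tot_granted(lctl_output)
estimates = {
    target: estimated_free_kb(df_available_kb[target], granted[target])
    for target in granted
}

for target, free_kb in sorted(estimates.items()):
    print(f"{target}: ~{free_kb} KB not covered by grant")
```

For OST5 this reproduces the 34660 KB figure worked out by hand above.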

>> One grant-related BZ that bit us hard is 22755; in particular the
>> part that caused grant to grow when user code continued trying to write
>> even after write(2) started returning EDQUOT :-(
> That's interesting information. I also found the same via [1] and apparently
> it may not be fixed overall. Which may explain why I may have hit it with Lustre
> 1.8.5. 
> 
> But, again, my application was writing into sparse files so the space was 
> already allocated... and the sparse files haven't grown. 

Your specific problem may not be due to a bug. That last bit of the filesystem may not be easily usable due to the grant mechanism. I'll let someone with more knowledge about grants chime in here.

Also, as Heiko alluded to, running with an OST so full is going to increase the chance of exposure to problems described in LU-15.

Jason

> [1]: http://www.mail-archive.com/lustre-discuss@lists.lustre.org/msg07565.html
> 
> Thomas
> 
> 
>> 
>> On 12/5/11 5:05 PM, "Thomas Guthmann"<tguthmann at iseek.com.au>  wrote:
>> 
>>> Hi,
>>> 
>>>> # grep 28 /usr/include/asm-generic/errno-base.h
>>>> #define ENOSPC 28 /* No space left on device */
>>> Great. So it's really what's happening. But I have free space/inodes...
>>> I cannot remember anything in the documentation talking about 'reserved
>>> free space'.
>>> 
>>> So based on the following output, is it normal to have no space left on
>>> storage ?
>>> 
>>> # lfs df -h
>>> [..]
>>> UUID                       bytes        Used   Available Use% Mounted on
>>> foobar-MDT0000_UUID          4.1G      197.8M        3.7G   4% /lustre/foobar[MDT:0]
>>> foobar-OST0000_UUID          2.0T        1.8T       21.1G  93% /lustre/foobar[OST:0]
>>> foobar-OST0001_UUID          2.0T        1.8T       23.2G  93% /lustre/foobar[OST:1]
>>> foobar-OST0002_UUID          2.0T        1.8T       21.4G  93% /lustre/foobar[OST:2]
>>> foobar-OST0003_UUID          2.0T        1.8T       19.3G  93% /lustre/foobar[OST:3]
>>> foobar-OST0004_UUID          2.0T        1.8T       19.3G  93% /lustre/foobar[OST:4]
>>> foobar-OST0005_UUID          2.0T        1.9T       16.9G  94% /lustre/foobar[OST:5]
>>> 
>>> # lfs df -i
>>> [..]
>>> UUID                      Inodes       IUsed       IFree IUse% Mounted on
>>> foobar-MDT0000_UUID       1019403          64     1019339   0% /lustre/foobar[MDT:0]
>>> foobar-OST0000_UUID      32363906         102    32363804   0% /lustre/foobar[OST:0]
>>> foobar-OST0001_UUID      32920407          99    32920308   0% /lustre/foobar[OST:1]
>>> foobar-OST0002_UUID      32453038         100    32452938   0% /lustre/foobar[OST:2]
>>> foobar-OST0003_UUID      31904762         104    31904658   0% /lustre/foobar[OST:3]
>>> foobar-OST0004_UUID      31904338         103    31904235   0% /lustre/foobar[OST:4]
>>> foobar-OST0005_UUID      31280099         104    31279995   0% /lustre/foobar[OST:5]
>>> 
>>> As for my dmesg on the OSS, Heiko pointed out (in a private email) that I
>>> may have hit one of the following bottlenecks :
>>> - Too little space left on file system
>>> - Performance of ext3/4 on large disks (Note: I am using
>>> ext4/lustre1.8.5/centos5)
>>> ==>  http://jira.whamcloud.com/browse/LU-15.
>>> 
>>> But it still does not explain why I couldn't write anymore.
>>> 
>>> Cheers
>>> Thomas
>>> 
>>> Any ide
>>> _______________________________________________
>>> Lustre-discuss mailing list
>>> Lustre-discuss at lists.lustre.org
>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>> 
> 