[Lustre-discuss] OSTs reporting full

Malcolm Cowe malcolm.cowe at oracle.com
Sat Sep 11 17:14:52 PDT 2010


  On 11/09/2010 19:27, Robin Humble wrote:
> Hey Dr Stu,
>
> On Sat, Sep 11, 2010 at 04:27:43PM +0800, Stuart Midgley wrote:
>> We are getting jobs that fail due to "no space left on device".
>> BUT none of our Lustre servers are full (as reported by lfs df -h on a client and by df -h on the OSSes).
>> They are all close to being full, but are not actually full (still have ~300 GB of space left).
> sounds like a grant problem.
>
>> I've tried playing around with tune2fs -m {0,1,2,3} and tune2fs -r 1024 etc. and nothing appears to help.
>> Anyone have a similar problem?  We are running 1.8.3.
> there are a couple of grant leaks that are fixed in 1.8.4, e.g.
>    https://bugzilla.lustre.org/show_bug.cgi?id=22755
> or see the 1.8.4 release notes.
>
> however the overall grant revoking problem is still unresolved AFAICT
>    https://bugzilla.lustre.org/show_bug.cgi?id=12069
> and you'll hit that issue more frequently with many clients and small
> OSTs, or when any OST starts getting full.
>
> in your case 300 GB per OST should be enough headroom unless you have
> ~4k clients now (assuming 32-64 MB of grant per client), so it's
> probably grant leaks. there's a recipe in bz 22755 for adding up
> client grants and comparing them to server grants to see if they've
> gone wrong (see the sketch below).
>
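For scale: ~4k clients holding 64 MB of grant each works out to ~256 GB,
which is why ~300 GB of headroom per OST is marginal. The bz 22755 recipe
boils down to comparing the grant each client believes it holds against
what each OSS thinks it has handed out. A rough sketch, assuming the 1.8
proc names (osc.*.cur_grant_bytes on the clients, obdfilter.*.tot_granted
on the servers; check the bug for the exact procedure):

    # On each client: sum the grant held across all OSCs
    lctl get_param -n osc.*.cur_grant_bytes | awk '{s += $1} END {print s}'

    # On each OSS: the total grant each OST believes it has handed out
    lctl get_param obdfilter.*.tot_granted

If the server-side totals are much larger than the sum across all
clients, grant has leaked.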
Per BZ 22755, comment #96
(https://bugzilla.lustre.org/show_bug.cgi?id=22755#c96), you can arrest
the grant leak by changing the "grant shrink interval" to a large value
(if you want to reset the server-side grant reservation, you will have
to remount the OSTs). We have applied this workaround to our system
with good results. We have been monitoring our file systems with Nagios
and have not encountered a repeat of this problem.
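For anyone wanting to apply the same workaround, something along these
lines; the exact tunable name is an assumption based on the 1.8 proc
layout (see comment #96 for the authoritative steps), and the device
and mount point below are illustrative:

    # On every client: push the grant shrink interval out so clients
    # effectively stop shrinking their grant (value in seconds)
    lctl set_param osc.*.grant_shrink_interval=86400

    # On each OSS: remount each OST to reset the server-side grant
    # reservation
    umount /mnt/lustre/ost0000
    mount -t lustre /dev/sdb1 /mnt/lustre/ost0000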

Malcolm.



