[lustre-discuss] Lustre/ZFS space accounting

Dilger, Andreas andreas.dilger at intel.com
Fri Jun 9 09:06:28 PDT 2017


The error 28 on close may also mean out of space (28 == ENOSPC). 
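
For example (a sketch, not a definitive procedure; the mount point
/mnt/lustre is an assumption, adjust for your system):

    # errno 28 on Linux is ENOSPC:
    grep -w 28 /usr/include/asm-generic/errno-base.h
    # writes to a file with objects on a full OST fail with ENOSPC even
    # when the filesystem as a whole has free space; check per-OST usage:
    lfs df -h /mnt/lustre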

How many clients are there on your system?

I would recommend using find or 'lfs find' to locate some of the larger files on OST0002 and lfs_migrate them to other OSTs. 
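
Something along these lines (a sketch only; the mount point, <fsname>,
and the 4G size cutoff are placeholders to adjust for your system):

    # list files larger than 4 GiB with objects on OST0002, then
    # restripe them onto other OSTs; lfs_migrate reads file names from
    # stdin and -y skips the per-file confirmation prompt
    lfs find /mnt/lustre --obd <fsname>-OST0002_UUID --size +4G | lfs_migrate -y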

Cheers, Andreas

> On Jun 9, 2017, at 01:27, Hans Henrik Happe <happe at nbi.ku.dk> wrote:
> 
> Hi,
> 
> We have ruled that out by monitoring usage. It is happening during
> checkpointing, a continuing process where old checkpoints get deleted
> after new ones are made. There are many checkpoints before the old
> ones are deleted.
> 
> I messed things up in my first mail, so it wasn't clear why I talked
> about space. Sometimes they just get this (the first number is the MPI
> rank):
> 
> 222: forrtl: Input/output error
> 222: forrtl: severe (28): CLOSE error, unit 10, file "Unknown"
> 
> Sometimes they get:
> 
> 33: forrtl: No space left on device
> 14: forrtl: No space left on device
> 08: forrtl: Input/output error
> 08: forrtl: severe (28): CLOSE error, unit 10, file "Unknown"
> 
> Info: We have a ZFS snapshot of the OSTs and MDT. It's ZFS 0.6.5.7.
> 
> Cheers,
> Hans Henrik
> 
>> On 09-06-2017 08:41, Thomas Roth wrote:
>> Hi,
>> 
>> I don't know about the error messages. But are you sure that the
>> imbalance in the OST filling isn't due to some extremely large files
>> written overnight or so? (With default striping, one file goes to one
>> OST.) Our users manage to do that without realizing it.
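>> 
>> (A sketch of how to check for that; the paths are placeholders:
>> 
>>     lfs getstripe /path/to/bigfile    # stripe count and the OSTs holding objects
>>     lfs setstripe -c -1 /path/to/dir  # new files in dir stripe across all OSTs
>> )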
>> 
>> Regards,
>> Thomas
>> 
>>> On 08.06.2017 10:11, Hans Henrik Happe wrote:
>>> Hi,
>>> 
>>> We are on Lustre 2.8 with ZFS.
>>> 
>>> Our users have seen some unexplainable errors:
>>> 
>>> 062: forrtl: Input/output error
>>> 
>>> Or
>>> 
>>> 062: forrtl: severe (28): CLOSE error, unit 10, file "Unknown"
>>> 
>>> 
>>> From the attached 'lfs df -h' you can see that the OSTs are
>>> unbalanced, but OST0001 is far from being full. We are using the
>>> default allocation settings, so we should be in weighted mode.
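>>> 
>>> (One way to verify the allocator mode, assuming 2.8-era parameter
>>> names; the MDS switches from round-robin to weighted allocation once
>>> the OST free-space imbalance exceeds qos_threshold_rr, 17% by default:
>>> 
>>>     lctl get_param lod.*.qos_threshold_rr lod.*.qos_prio_free
>>> )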
>>> 
>>> I've tried to find an LU matching this, but no luck. Also, the logs
>>> on the affected nodes and on the servers are empty.
>>> 
>>> Any suggestions about how to debug this?
>>> 
>>> Cheers,
>>> Hans Henrik