[lustre-discuss] Lustre/ZFS space accounting

Hans Henrik Happe happe at nbi.ku.dk
Wed Jun 14 23:43:28 PDT 2017


On 09-06-2017 18:06, Dilger, Andreas wrote:
> The error 28 on close may also be out of space (28 == ENOSPC). 
> 
> How many clients on your system?

~240 clients.

> I would recommend using find/lfs find to locate some of the larger files on OST0002 and lfs_migrate them to other OSTs.

Off list, it was suggested to look at LU-2049 (thanks). Could that be it?
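
In the meantime I also want to rule out the snapshots themselves. Since we
keep ZFS snapshots of the OSTs and MDT, they may be pinning space on the
pools even after old checkpoints are deleted. A quick check on each OSS
could be something like this (the pool name 'ostpool' below is just a
placeholder for our actual pool names):

  # show how much space is held by snapshots vs. live data, per dataset
  zfs list -r -o name,used,available,usedbysnapshots ostpool

If 'usedbysnapshots' is large, the space freed by deleting old checkpoints
is still referenced by the snapshots, and destroying them would give it
back to the OSTs.
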
Either way, I guess freeing up more space would help.
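
For the migration Andreas suggested, I plan to run something along these
lines (assuming a filesystem named 'lustre' mounted under /lustre; the
10 GiB threshold is just an example):

  # find files with objects on OST0002 larger than 10 GiB and restripe
  # them onto less-full OSTs; -y answers yes to lfs_migrate's prompt
  lfs find /lustre --ost lustre-OST0002_UUID --size +10G | lfs_migrate -y

Since lfs_migrate copies each file and swaps it into place, I'll only run
it on files that are not being written to (i.e. not the live checkpoints).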

Cheers,
Hans Henrik

>> On Jun 9, 2017, at 01:27, Hans Henrik Happe <happe at nbi.ku.dk> wrote:
>>
>> Hi,
>>
>> We have ruled that out by monitoring usage. It is happening during
>> checkpointing, a continuing process where old checkpoints get deleted
>> after new ones are made. There are many checkpoints before
>>
>> I messed things up in my first mail, so it wasn't clear why I talked
>> about space. Sometimes they just get this (the first number is the MPI rank):
>>
>> 222: forrtl: Input/output error
>> 222: forrtl: severe (28): CLOSE error, unit 10, file "Unknown"
>>
>> Sometimes they get:
>>
>> 33: forrtl: No space left on device
>> 14: forrtl: No space left on device
>> 08: forrtl: Input/output error
>> 08: forrtl: severe (28): CLOSE error, unit 10, file "Unknown"
>>
>> Info: We have a ZFS snapshot of the OSTs and MDT. It's ZFS 0.6.5.7.
>>
>> Cheers,
>> Hans Henrik
>>
>>> On 09-06-2017 08:41, Thomas Roth wrote:
>>> Hi,
>>>
>>> I don't know about the error messages, but are you sure that the
>>> imbalance in OST usage isn't due to some extremely large files being
>>> written overnight or so (with default striping, one file goes to one OST)?
>>> Our users manage to do that without realizing it.
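>>>
>>> For reference, which OST such a file ended up on can be checked with
>>> 'lfs getstripe', and large files can be spread over several OSTs with
>>> 'lfs setstripe' on the target directory (the paths and the stripe count
>>> of 4 below are only examples):
>>>
>>>   lfs getstripe /lustre/path/to/large-file
>>>   lfs setstripe -c 4 /lustre/path/to/output-dir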
>>>
>>> Regards,
>>> Thomas
>>>
>>>> On 08.06.2017 10:11, Hans Henrik Happe wrote:
>>>> Hi,
>>>>
>>>> We are on Lustre 2.8 with ZFS.
>>>>
>>>> Our users have seen some unexplainable errors:
>>>>
>>>> 062: forrtl: Input/output error
>>>>
>>>> Or
>>>>
>>>> 062: forrtl: severe (28): CLOSE error, unit 10, file "Unknown"
>>>>
>>>>
>>>> From the attached 'lfs df -h' output you can see that the OSTs are
>>>> unbalanced, but OST0001 is far from being full. We are using the default
>>>> allocation settings, so we should be in weighted mode.
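>>>>
>>>> For reference, the allocator thresholds can be checked on the MDS with
>>>> something like the following (the parameters live under 'lov' or 'lod'
>>>> depending on the Lustre version; the weighted allocator should kick in
>>>> once the imbalance exceeds qos_threshold_rr, 17% by default):
>>>>
>>>>   lctl get_param lov.*.qos_threshold_rr lov.*.qos_prio_free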
>>>>
>>>> I've tried to find an LU matching this, but no luck. Also, the logs on the
>>>> affected nodes and on the servers are empty.
>>>>
>>>> Any suggestions about how to debug this?
>>>>
>>>> Cheers,
>>>> Hans Henrik
>>>>
>>>>
>>>>
>>>


