[lustre-discuss] lost files on ZFS

Sun Nov 6 20:48:02 PST 2016

Hello!

On Oct 30, 2016, at 8:33 AM, Thomas Roth wrote:

> Hi all,
> 
> we have a larger amount of files that give ??? on 'ls' and the error "Cannot allocate memory"
> The corresponding error on the OSS is
> "lvbo_init failed for resource ... rc = -2"
> 
> This seems similar to LU-5457 (although the OSTs do not go into disconn state).
> Our filesystem is on Lustre 2.5.3, zfs 0.6.3, from the start. So per Oleg's explanation,
> "this could be fallout from earlier sync failures where OST announced it created some objects, failed to sync that to disk and then after dying and restarting the objects that were handed out by MDTs out of this pool are no longer there"
> 
> The affected OSTs are evenly distributed, however.
> Finding the creation time of those files is difficult at best, but I am not aware of any series of crashes of so many OSSes in the recent months.
> And how can this happen with ZFS-OSTs? Should this be possible so easily?

   First of all, Lustre 2.5.3 is fairly old.

   The error itself means that the file exists on the MDS, but its corresponding
   OST objects do not. The explanation in LU-5457 is just one possible scenario;
   there might be others that cause the objects to be deleted.
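
   For reference, the two error codes in the report can be decoded with a
   short Python sketch. It assumes a Linux host (errno 2 is ENOENT, errno 12
   is ENOMEM); the variable names are only illustrative.

```python
import errno
import os

# Value taken from the OSS message "lvbo_init failed for resource ... rc = -2".
oss_rc = -2

# Kernel code returns negated errno values, so flip the sign before the lookup.
oss_name = errno.errorcode[-oss_rc]

# "Cannot allocate memory" is the strerror text for ENOMEM on Linux.
client_name = errno.errorcode[errno.ENOMEM]

print(oss_name, "-", os.strerror(-oss_rc))
print(client_name, "-", os.strerror(errno.ENOMEM))
```

   The OSS-side code is ENOENT ("no such file or directory"), which is
   consistent with the objects being missing on the OST.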

   Is there a pattern to the files? I.e., were all such files created at
   around the same time? (If you cannot tell just by the filename/location, you
   might use debugfs, or whatever the ZFS equivalent is, to look at the inode
   modification time.)

   If they are spread out in time across different OSTs, but clustered in time
   for each individual OST, it might be a good idea to check the OST logs from
   that period.
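
   One way to look for such a pattern is sketched below, as an illustration
   only. It assumes you have already collected, for each affected file, its
   path, the index of the OST holding its object, and a modification time in
   epoch seconds (for example from `lfs getstripe` on the client and whatever
   timestamp source is available). The function name `summarize_by_ost`, the
   record layout, and the one-day window are all hypothetical choices.

```python
from collections import defaultdict
from datetime import datetime, timezone


def summarize_by_ost(records, window_seconds=86400):
    """Group affected files by OST and report the time span on each one.

    records: iterable of (path, ost_index, mtime_epoch) tuples (assumed input).
    window_seconds: span below which an OST's files count as "localised".
    """
    by_ost = defaultdict(list)
    for _path, ost_index, mtime in records:
        by_ost[ost_index].append(mtime)

    summary = {}
    for ost_index, mtimes in sorted(by_ost.items()):
        first, last = min(mtimes), max(mtimes)
        summary[ost_index] = {
            "count": len(mtimes),
            "first": datetime.fromtimestamp(first, tz=timezone.utc).isoformat(),
            "last": datetime.fromtimestamp(last, tz=timezone.utc).isoformat(),
            # True when all affected files on this OST fall within the window.
            "localised": (last - first) <= window_seconds,
        }
    return summary


if __name__ == "__main__":
    # Made-up sample data, only to show the call.
    sample = [
        ("/lustre/a", 0, 1000),
        ("/lustre/b", 0, 2000),
        ("/lustre/c", 1, 1000),
        ("/lustre/d", 1, 1000 + 10 * 86400),
    ]
    for ost, info in summarize_by_ost(sample).items():
        print(ost, info)
```

   An OST flagged as "localised" gives a narrow time range in which to read
   the corresponding OSS logs.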

Bye,
    Oleg