[lustre-discuss] OST went back in time: no(?) hardware issue

Andreas Dilger adilger at whamcloud.com
Wed Oct 4 17:30:25 PDT 2023


On Oct 3, 2023, at 16:22, Thomas Roth via lustre-discuss <lustre-discuss at lists.lustre.org> wrote:
> 
> Hi all,
> 
> In our Lustre 2.12.5 system, we are seeing "OST went back in time" messages after an OST hardware replacement:
> - hardware had reached EOL
> - set `max_create_count=0` for these OSTs, then located and migrated off all files with objects on them
> - formatted the new OSTs with `--replace` and the old indices
> - all OSTs are on ZFS
> - set the OSTs `active=0` on our 3 MDTs
> - moved in the new hardware, reused the old NIDs and OST indices, and mounted the OSTs
> - set the OSTs `active=1`
> - ran `lfsck` on all servers
> - set `max_create_count=200` for these OSTs
> 
> Now the "OST went back in time" messages appeard in the MDS logs.
> 
> This doesn't quite fit the description in the manual: there were no crashes or power losses, and I don't see which cache could have been lost.
> The transaction numbers quoted in the error are both large, eg. `transno 55841088879 was previously committed, server now claims 4294992012`
> 
> What should we do? Give `lfsck` another try?
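
For reference, the decommission-and-replace sequence described above corresponds roughly to the commands below. This is only a sketch: the fsname `testfs`, OST index `001a`/26, ZFS pool and dataset names, NIDs, and mount points are placeholders, not values from the original report.

    # on each MDS: stop new object creation on the OST being retired
    mds# lctl set_param osp.testfs-OST001a-osc-MDT*.max_create_count=0

    # on a client: migrate files that still have objects on that OST
    client# lfs find --ost testfs-OST001a_UUID /mnt/testfs | lfs_migrate -y

    # format the replacement (ZFS) OST, reusing the old index
    oss# mkfs.lustre --ost --backfstype=zfs --fsname=testfs --index=26 \
             --replace --mgsnode=mgs@tcp ostpool/ost001a

    # on each MDS: deactivate while the hardware is swapped ...
    mds# lctl set_param osp.testfs-OST001a-osc-MDT*.active=0

    # ... mount the new OST on the OSS, then reactivate
    oss# mount -t lustre ostpool/ost001a /mnt/lustre/ost001a
    mds# lctl set_param osp.testfs-OST001a-osc-MDT*.active=1

    # run LFSCK across all targets, then re-enable object creation
    mds# lctl lfsck_start -M testfs-MDT0000 -A
    mds# lctl set_param osp.testfs-OST001a-osc-MDT*.max_create_count=200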

Nothing really to see here, I think.

Did you delete LAST_RCVD during the replacement, so the OST no longer knew which transno had been assigned to the last RPCs it processed?  The still-mounted clients have a record of this transno and are surprised that it was reset.  If you unmount and remount the clients, the error should go away.
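
A minimal remount on each client would look like the following; the MGS NID and mount point are placeholders:

    client# umount /mnt/testfs
    client# mount -t lustre mgs@tcp:/testfs /mnt/testfs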

I'm not sure if the clients might try to preserve the next 55B RPCs in memory until the committed transno on the OST catches up, or if they just accept the new transno and get on with life?
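
If you want to see where the two sides stand, the transnos can be inspected directly.  Target names below are placeholders, and the exact fields may vary between versions, but on 2.12 the OST's recovery_status should report its last committed transno, and the client-side import file shows what the client believes the peer has committed:

    # on the OSS: the last transno the OST has recorded
    oss# lctl get_param obdfilter.testfs-OST001a.recovery_status

    # on a client: the import state for that OST, including the
    # peer-committed transno
    client# lctl get_param osc.testfs-OST001a-osc-*.import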

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud

