[Lustre-discuss] crash recovery insight
Andrus, Brian Contractor
bdandrus at nps.edu
Thu Jun 21 12:00:32 PDT 2012
We had an OST become corrupted yesterday. It seems to have been caused by lost connectivity on the IB network (we use SRP to connect to a set of DDN disks).
When connectivity was restored, Lustre left the filesystem read-only and logged some complaints about that particular OST, so I took the OST offline and ran e2fsck on it. There were a number of errors about shared inodes, which were cloned.
The filesystem came back online and was read/write, BUT it seems that all data newer than about 6 months is gone???
We are running Lustre 1.8.7 (using the wc1 RPMs).
How could something like that happen? I could possibly see a loss of arbitrary data, but losing everything newer than a specific date/time seems very odd.
I am pretty much resigned to the fact that the data is gone and that we will have to work with the users to get back up to speed, but an understandable explanation would be quite helpful.
Any insight is appreciated,