[Lustre-discuss] lost data after MDS failover

Gregory Matthews greg.matthews at diamond.ac.uk
Wed Feb 17 04:15:40 PST 2010


We had an LBUG on our MDS (on 15th Feb) and so attempted a failover to 
the second MGS/MDS server. The MGT mounted fine there, but the MDT mount 
hung (for longer than 5 minutes).

To resolve the problem I unmounted the MGT and the MDT on a freshly 
booted MDS/MGS and mounted the MDT as ldiskfs. I then moved aside the 
CATALOGS, OBJECTS and last_rcvd files/dirs, unmounted, and restarted 
Lustre (mount -t lustre ....).
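
The steps above can be sketched roughly as the dry run below, which only 
prints the commands instead of running them. The device path, the mount 
points and the ".saved" suffix are placeholders assumed for illustration 
(the post only says the files were renamed), and the MGT remount is left 
out.

```shell
#!/bin/sh
# Dry-run sketch of the MDT steps described above: prints commands only.
# MDT_DEV, LDISKFS_MNT, LUSTRE_MNT and the ".saved" suffix are hypothetical
# placeholders, not values taken from the original post.
MDT_DEV="${MDT_DEV:-/dev/mdt_device}"
LDISKFS_MNT="${LDISKFS_MNT:-/mnt/mdt_ldiskfs}"
LUSTRE_MNT="${LUSTRE_MNT:-/mnt/mdt}"

mdt_reset_plan() {
    # 1. Mount the MDT backing device as plain ldiskfs.
    echo "mount -t ldiskfs $MDT_DEV $LDISKFS_MNT"
    # 2. Rename (do not delete) the three entries so they can be restored later.
    for f in CATALOGS OBJECTS last_rcvd; do
        echo "mv $LDISKFS_MNT/$f $LDISKFS_MNT/$f.saved"
    done
    # 3. Unmount ldiskfs and bring the MDT back up as Lustre.
    echo "umount $LDISKFS_MNT"
    echo "mount -t lustre $MDT_DEV $LUSTRE_MNT"
}

mdt_reset_plan
```

Renaming rather than deleting keeps the original CATALOGS, OBJECTS and 
last_rcvd entries available for later inspection.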

This brought the file system back ok but one of our scientists appears 
to have lost an entire directory of data from the time the file system 
was taken down. The MDS was initially taken out at 1400 (16 Feb) and the 
file system was fully back around 1500. The scientist has files in the 
directory from 1400 onwards.

Approximately 4000 small files dating from the start of January are 
missing. We are running Lustre 1.6.6 with a patched kernel 
2.6.18-92.1.10.el5 on the servers; the client is running an unreleased 
patchless RH kernel 2.6.18-171.el5 and 1.6.7.2 Lustre modules.

We should have good backups of our metadata, and we also have access to 
the removed ldiskfs files, which were simply renamed. The missing files 
have fairly predictable names, which might help in tracking down the 
content.

Is there any hope of recovering the missing files/directory?

GREG

-- 
Greg Matthews            01235 778658
Senior Computer Systems Administrator
Diamond Light Source, Oxfordshire, UK
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: messages.tmp
URL: <http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20100217/098aa30d/attachment.txt>
