[Lustre-discuss] disappeared data from OST
Herbert Fruchtl
herbert.fruchtl at st-andrews.ac.uk
Mon Feb 15 13:08:01 PST 2010
Hi there,
We have 27TB of data on 6 OSTs distributed over 3 OSSes, running Lustre version
1.6.7.2 on CentOS 4.6.
After a power spike this weekend that crashed several machines (though not the
OSSes), and/or possibly after hitting 100% disk usage on one of them (we have
been dangerously close for a while), the file system hung this morning. After
restarting, it showed many files as missing. I decided to unmount everything and
run an fsck.
I unmounted the file system from the MDS, logged in to the OSSes and started
unmounting the OSTs. This went OK on two of the three, but on the third one the
umount command hangs with an error message that has something with _BUG in it (I
can look it up tomorrow, if I still have it on the screen; I'm at home now).
Worryingly, if I run "df" on that machine, I get 3% disk usage:
[root@oss1 ~]# df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda5            236062880   5911252 218160312   3% /
/dev/sda1               101086     10993     84874  12% /boot
none                   1803084         0   1803084   0% /dev/shm
/dev/sdb             236062880   5911252 218160312   3% /mnt/oss1-ost1
/dev/sdc             236062880   5911252 218160312   3% /mnt/oss1-ost2
It should be 98% or thereabouts! Now I am afraid that if I carry on (probably by
just cycling the power, since "reboot" also hangs), it will come back in the
same state, i.e. with 95% of the data gone. Is the data already irreparably
lost, or am I just being paranoid?
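
One thing that can be checked from the df listing alone is whether the OST rows
carry figures of their own. The Python sketch below is only an illustration
built on the numbers quoted above; the function name group_identical_rows is
made up for this example. It groups mount points by their (total, used,
available) block counts and compares the reported size with the roughly 4.5TB
per OST that 27TB over 6 OSTs would imply. Identical figures on several mount
points would be consistent with df reporting the same underlying file system
for each of them (for example, if the mount table still lists an OST that is no
longer mounted), but that is an assumption that would have to be verified on
the machine, e.g. against /proc/mounts.

```python
from collections import defaultdict

# df rows copied from the listing in this message (header line omitted).
DF_OUTPUT = """\
/dev/sda5 236062880 5911252 218160312 3% /
/dev/sda1 101086 10993 84874 12% /boot
none 1803084 0 1803084 0% /dev/shm
/dev/sdb 236062880 5911252 218160312 3% /mnt/oss1-ost1
/dev/sdc 236062880 5911252 218160312 3% /mnt/oss1-ost2
"""


def group_identical_rows(df_text):
    """Return {(total, used, avail): [mount points]} for figures seen more than once."""
    groups = defaultdict(list)
    for line in df_text.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue  # skip blank or wrapped lines
        _device, total, used, avail, _pct, mount = fields
        groups[(int(total), int(used), int(avail))].append(mount)
    return {key: mounts for key, mounts in groups.items() if len(mounts) > 1}


duplicates = group_identical_rows(DF_OUTPUT)
for (total, used, avail), mounts in duplicates.items():
    print(f"{mounts} all report {total} 1K-blocks, {used} used, {avail} available")

# Rough size check: 27TB spread over 6 OSTs is about 4.5TB per OST,
# while the OST rows above report the size of a much smaller file system.
expected_per_ost_tb = 27 / 6
reported_tb = 236062880 * 1024 / 1e12
print(f"expected per OST: ~{expected_per_ost_tb:.1f} TB, reported: ~{reported_tb:.2f} TB")
```

If the OST rows turn out to share their figures with "/", the 3% would describe
the root file system rather than the OST contents, which is a different
situation from the data having been deleted.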
Any suggestions would be appreciated (in other words: HELP!!!!).
Before this, I had tried an "lfsck -c -l -f" on the mounted file system, but the
sudden drop in disk usage on oss1 definitely only happened after I killed the
lfsck and tried to unmount by hand.
Cheers,
Herbert
--
Herbert Fruchtl
Senior Scientific Computing Officer
School of Chemistry, School of Mathematics and Statistics
University of St Andrews
--
The University of St Andrews is a charity registered in Scotland:
No SC013532