[lustre-discuss] File missing with "Invalid argument" error
Tung-Han Hsieh
thhsieh at twcp1.phys.ntu.edu.tw
Tue Apr 16 00:30:49 PDT 2019
Dear All,
We finally figured out the problem.
After running:
tunefs.lustre --writeconf /dev/sdXX
for all the MDT and OST partitions to clear out the logs in
/proc/fs/lustre/osc/ on the MDS server, and then remounting the
Lustre file system, the logs of all OSTs should presumably have
been regenerated automatically. But somehow the logs of two of
the OSTs were not regenerated, which caused the many missing
files shown in my previous mail.
So we unmounted the Lustre file system (clients, OSTs, MDT, and
MGS) and mounted everything again. This time the missing OST logs
were regenerated, and the problem was solved.
Now I have another question: if we want to remove a bad OST
(including its logs in MDT:/proc/fs/lustre/osc/), the only way
we know is to run
tunefs.lustre --writeconf /dev/sdXX
on all the MDT, OST, and MGS partitions. This removes the logs of
every OST, good and bad alike, which seems quite unnecessary. Is
there a way to remove only the logs of the bad OST if I know its ID?
Thanks in advance.
T.H.Hsieh
On Mon, Apr 15, 2019 at 04:32:57PM +0800, Tung-Han Hsieh wrote:
> Dear All,
>
> We are facing a serious problem after making a mistake during Lustre
> (1.8.8) maintenance.
>
> We had a bad OST and wanted to remove it, so we went to the MDS and ran
>
> lctl conf_param foo-OSTXXXX.osc.active=0
>
> After doing this, there were still logs for it in the /proc/fs/lustre/osc/
> directory on the MDS. We wanted to clean those up too, so we unmounted the
> whole Lustre file system (including clients and OSTs) and ran
>
> tunefs.lustre --writeconf /dev/sdXX
>
> for each OST, MDT, and MGS device. But we made a serious mistake:
> we forgot to unmount the MDT and MGS first. We saw that the logs had been
> cleaned before the MDT and MGS were unmounted, and the system hung when we
> tried to unmount them. After rebooting the MDS and remounting
> the whole Lustre file system, we saw that the logs were regenerated. But
> after mounting the clients, we saw that a lot of files were missing, e.g.,
>
> ls -l /path/to/rsync_tf16_twcp1.bat
>
> /path/to/file: Invalid argument
> -????????? ? ? ? ? ? ? ? /path/to/rsync_tf16_twcp1.bat
>
> Now we have unmounted the whole Lustre file system and mounted the MDT as
> ldiskfs. We see that the file ROOT/to/rsync_tf16_twcp1.bat exists.
> Running "getfattr" on it still extracts the following code:
>
> # file: rsync_tf16_twcp1.bat
> trusted.lov=0s0AvRCwEAAAB9UG8JAAAAAAAAAAAAAAAAAAAQAAEAAADEjjUAAAAAAAAAAAAAAAAAAAAAAAQAAAA=
>
> My question is: is it possible to work out which OSTs hold this file
> from this code? If its objects really still exist on our current OSTs,
> could we re-establish the connection and get the file back?
>
> Sorry, this problem is quite serious and affects our research work
> terribly, since we have lost a lot of files due to this mistake. Any
> suggestion would be much appreciated.
>
>
> Best Regards,
>
> T.H.Hsieh
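On the question in the quoted message about reading the OST location out of
the trusted.lov value: below is a minimal sketch in Python, not a definitive
tool. It assumes the value uses the Lustre 1.8 "lov_mds_md_v1" layout stored
little-endian (a 32-byte header followed by one 24-byte entry per stripe), and
that the "0s" prefix printed by getfattr marks base64. The names decode_lov,
HEADER and OST_ENTRY are made up for illustration, and the object path
O/<group>/d<id mod 32>/<id> on an ldiskfs-mounted OST is an assumption that
should be checked against a known-good file first.

```python
import base64
import struct

# Assumed Lustre 1.8 on-disk layout (lov_mds_md_v1), little-endian.
LOV_MAGIC_V1 = 0x0BD10BD0
# magic, pattern, object_id, object_gr, stripe_size, stripe_count
HEADER = struct.Struct("<IIQQII")
# l_object_id, l_object_gr, l_ost_gen, l_ost_idx
OST_ENTRY = struct.Struct("<QQII")


def decode_lov(value):
    """Decode a getfattr-style trusted.lov value into a dict (hypothetical helper)."""
    if value.startswith("0s"):  # getfattr prefixes base64-encoded values with "0s"
        value = value[2:]
    raw = base64.b64decode(value)
    magic, pattern, obj_id, obj_gr, stripe_size, stripe_count = HEADER.unpack_from(raw, 0)
    if magic != LOV_MAGIC_V1:
        raise ValueError("unexpected LOV magic 0x%08X" % magic)

    # Read only as many stripe entries as the data really contains.
    available = (len(raw) - HEADER.size) // OST_ENTRY.size
    stripes = []
    for i in range(min(stripe_count, available)):
        oid, ogr, gen, idx = OST_ENTRY.unpack_from(raw, HEADER.size + i * OST_ENTRY.size)
        stripes.append({
            "ost_index": idx,                    # numeric OST index
            "ost_suffix": "OST%04x" % idx,       # e.g. the XXXX in foo-OSTXXXX
            "object_id": oid,
            "group": ogr,
            # Assumed location of the object on the ldiskfs-mounted OST.
            "ldiskfs_path": "O/%d/d%d/%d" % (ogr, oid % 32, oid),
        })
    return {
        "pattern": pattern,
        "mdt_object_id": obj_id,
        "stripe_size": stripe_size,
        "stripe_count": stripe_count,
        "stripes": stripes,
    }


if __name__ == "__main__":
    # Value copied from the getfattr output quoted above.
    lov = ("0s0AvRCwEAAAB9UG8JAAAAAAAAAAAAAAAAAAAQAAEAAADEjjUAAAAAAAAAAAAAAAAA"
           "AAAAAAQAAAA=")
    info = decode_lov(lov)
    print("stripe size:", info["stripe_size"], "stripe count:", info["stripe_count"])
    for s in info["stripes"]:
        print(s["ost_suffix"], "object", s["object_id"], "->", s["ldiskfs_path"])
```

If the printed path exists on the matching OST when that OST is mounted as
ldiskfs, the file data is still there, and only the connection between the
MDS and that OST is missing.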