[Lustre-discuss] No space left on device for just one file

Andreas Dilger adilger at sun.com
Mon Jan 11 18:24:46 PST 2010


On 2010-01-11, at 15:59, Michael Robbert wrote:
> The filename is not very unique. I can create a file with the same  
> name in another directory or on another Lustre filesystem. It is  
> just this exact path on this filesystem. The full path is:
> /lustre/scratch/smoqbel/Cenval/CLM/Met.Forcing/18X11/NLDAS.APCP.007100.pfb.00164
> The mount point for this filesystem is /lustre/scratch/

Michael,
does the same problem happen on multiple client nodes, or is it only
happening on a single client?  Are there any messages on the MDS
and/or the OSSes when this problem is happening?  This problem is
somewhat unusual, since I'm not aware of any places outside the disk
filesystem code that would cause ENOSPC when creating a file.

Can you please do a bit of debugging on the system:

     {client}# cd /lustre/scratch/smoqbel/Cenval/CLM/Met.Forcing/18X11
{mds,client}# echo -1 > /proc/sys/lnet/debug         # enable full debug
{mds,client}# lctl clear                             # clear debug logs
     {client}# touch NLDAS.APCP.007100.pfb.00164
{mds,client}# lctl dk > /tmp/debug.{mds,client}      # dump debug logs

For now, please extract just the ENOSPC (-28) errors from the logs. That
will be much shorter, may be enough to identify where the problem is
located, and will be a lot friendlier to the list:

{mds,client}# grep -- "-28" /tmp/debug.{mds,client} > /tmp/debug-28.{mds,client}

Please send that along with the "lfs df" and "lfs df -i" output.
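A plain grep for "-28" also matches unrelated numbers such as -280, so
a slightly stricter filter can keep the extract short. This is only a
sketch: filter_enospc is a hypothetical helper name, and it assumes
only that the error shows up in the log as the number -28 (ENOSPC is
errno 28 on Linux), not any particular log line format.

```shell
# Hypothetical helper: print only lines in which -28 appears as a
# standalone number (not preceded or followed by another digit).
# Reads the named files, or standard input if no file is given.
filter_enospc() {
    grep -E -- '(^|[^0-9])-28([^0-9]|$)' "$@"
}
```

It could be used as "filter_enospc /tmp/debug.client > /tmp/debug-28.client"
on the client, and likewise with the MDS log on the MDS.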

If this is only on a single client, just dropping the locks on the  
client might be enough to resolve the problem:

for L in /proc/fs/lustre/ldlm/namespaces/*; do
     echo clear > $L/lru_size
done
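A slightly more defensive version of the lock-dropping loop might look
like the sketch below. It assumes the per-namespace control file is
lru_size, which is where a client accepts "clear"; drop_ldlm_locks and
its optional directory argument are hypothetical names added for
illustration.

```shell
# Hypothetical helper: ask every LDLM namespace to drop its unused locks.
# $1 (optional) is the namespaces directory; defaults to the usual /proc path.
drop_ldlm_locks() {
    base=${1:-/proc/fs/lustre/ldlm/namespaces}
    for ns in "$base"/*; do
        # Skip entries that have no writable lru_size file.
        if [ -w "$ns/lru_size" ]; then
            echo clear > "$ns/lru_size" || echo "failed: $ns" >&2
        fi
    done
}
```

Run "drop_ldlm_locks" as root on the affected client. On releases that
provide "lctl set_param", the command
"lctl set_param ldlm.namespaces.*.lru_size=clear" should have the same
effect.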

If, on the other hand, this same problem is happening on all clients  
then the problem is likely on the MDS.

>> On Fri, Jan 8, 2010 at 1:36 PM, Michael Robbert  
>> <mrobbert at mines.edu> wrote:
>>> I have a user that reported a problem creating a file on our  
>>> Lustre filesystem. When I investigated I found that the problem  
>>> appears to be unique to just one filename in one directory. I have  
>>> tried numerous ways of creating the file including echo, touch,  
>>> and "lfs setstripe"; all return "No space left on device". I have  
>>> checked the filesystem with df and "lfs df"; both show that the  
>>> filesystem and all OSTs are far from being full for both blocks  
>>> and inodes. Slight changes in the filename are created fine. We  
>>> had a kernel panic on the MDS yesterday and it was quite possible  
>>> that the user had a compute job working in this directory at the  
>>> time of that problem. I am guessing we have some kind of  
>>> corruption with the directory. This directory has around 1 million  
>>> files so moving the data around may not be a quick operation, but  
>>> we're willing to do it. I just want to know the best way, short of  
>>> taking the filesystem offline, to fix this problem.
>>>
>>> Any ideas? Thanks in advance,
>>> Mike Robbert
>>> _______________________________________________
>>> Lustre-discuss mailing list
>>> Lustre-discuss at lists.lustre.org
>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>>


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



