[Lustre-discuss] No space left on device for just one file

Bernd Schubert bernd.schubert at fastmail.fm
Tue Jan 12 11:41:20 PST 2010


Hmm, it seems there is no solution available other than turning off dir-indexing.
I have never looked into the dir-index code so far, but wouldn't it make sense to
skip the index just for this directory?

https://bugzilla.lustre.org/show_bug.cgi?id=10129

http://kerneltrap.org/mailarchive/linux-kernel/2008/5/18/1861604
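
For reference, a rough, untested sketch of what "turning off dir-indexing" would look
like on an ldiskfs MDT, assuming the MDT really is /dev/dm-1 and can be unmounted first
(please check with your support/vendor before touching a production MDT):

    tune2fs -O ^dir_index /dev/dm-1    # clear the dir_index feature flag
    e2fsck -f /dev/dm-1                # full check; will likely offer to clear leftover
                                       # HTREE flags on existing directories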


Thanks,
Bernd

On Tuesday 12 January 2010, Bernd Schubert wrote:
> Hello Mike,
> 
> you really should file a ticket with us (DDN). I think your problem comes from
> these MDS messages:
> 
> 
> LDISKFS-fs warning (device dm-1): ldiskfs_dx_add_entry: Directory index full!
> LDISKFS-fs warning (device dm-1): ldiskfs_dx_add_entry: Directory index full!
> 
> 
> And /dev/dm-1 is also the scratch MDT.
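> 
> As an additional data point, it may be worth checking how large that directory has
> actually grown; a sketch, run from any client (the second command can take a while
> with ~1 million entries):
> 
>     ls -ld /lustre/scratch/smoqbel/Cenval/CLM/Met.Forcing/18X11           # directory inode size
>     ls -f /lustre/scratch/smoqbel/Cenval/CLM/Met.Forcing/18X11 | wc -l    # rough entry count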
> 
> 
> Cheers,
> Bernd
> 
> On Tuesday 12 January 2010, Michael Robbert wrote:
> > Andreas,
> > Here are the results of my debugging. This problem does show up on
> > multiple (presumably all) clients. I followed your instructions, changing
> > lustre to lnet in step 2, and got debug output on both machines, but the
> > -28 text only showed up on the client.
> >
> > [root@ra 18X11]# grep -- "-28" /tmp/debug.client
> > 00000100:00000200:5:1263315233.100525:0:22069:0:(client.c:841:ptlrpc_check_reply()) @@@ rc = 1 for  req@00000103a5820800 x200609397/t0 o36->scratch-MDT0000_UUID@172.16.34.1@o2ib:12/10 lens 376/424 e 0 to 1 dl 1263315433 ref 1 fl Rpc:R/0/0 rc 0/-28
> >
> > 00000100:00000200:5:1263315233.100538:0:22069:0:(events.c:95:reply_in_callback()) @@@ type 5, status 0  req@00000103a5820800 x200609397/t0 o36->scratch-MDT0000_UUID@172.16.34.1@o2ib:12/10 lens 376/424 e 0 to 1 dl 1263315433 ref 1 fl Rpc:R/0/0 rc 0/-28
> >
> > 00000100:00100000:5:1263315233.100543:0:22069:0:(events.c:115:reply_in_callback()) @@@ unlink  req@00000103a5820800 x200609397/t0 o36->scratch-MDT0000_UUID@172.16.34.1@o2ib:12/10 lens 376/424 e 0 to 1 dl 1263315433 ref 1 fl Rpc:R/0/0 rc 0/-28
> >
> > 00000100:00000040:5:1263315233.100565:0:22069:0:(client.c:863:ptlrpc_check_status()) @@@ status is -28  req@00000103a5820800 x200609397/t0 o36->scratch-MDT0000_UUID@172.16.34.1@o2ib:12/10 lens 376/424 e 0 to 1 dl 1263315433 ref 1 fl Rpc:R/0/0 rc 0/-28
> >
> > 00000100:00000001:5:1263315233.100570:0:22069:0:(client.c:869:ptlrpc_check_status()) Process leaving (rc=18446744073709551588 : -28 : ffffffffffffffe4)
> >
> > 00000100:00000001:5:1263315233.100578:0:22069:0:(client.c:955:after_reply()) Process leaving (rc=18446744073709551588 : -28 : ffffffffffffffe4)
> >
> > 00000100:00100000:5:1263315233.100581:0:22069:0:(lustre_net.h:984:ptlrpc_rqphase_move()) @@@ move req "Rpc" -> "Interpret"  req@00000103a5820800 x200609397/t0 o36->scratch-MDT0000_UUID@172.16.34.1@o2ib:12/10 lens 376/424 e 0 to 1 dl 1263315433 ref 1 fl Rpc:R/0/0 rc 0/-28
> >
> > 00000100:00000001:5:1263315233.100586:0:22069:0:(client.c:2094:ptlrpc_queue_wait()) Process leaving (rc=18446744073709551588 : -28 : ffffffffffffffe4)
> >
> > 00000002:00000040:5:1263315233.100590:0:22069:0:(mdc_reint.c:67:mdc_reint()) error in handling -28
> >
> > 00000002:00000001:5:1263315233.100593:0:22069:0:(mdc_reint.c:227:mdc_create()) Process leaving (rc=18446744073709551588 : -28 : ffffffffffffffe4)
> >
> > 00000080:00000001:5:1263315233.100596:0:22069:0:(namei.c:881:ll_new_node()) Process leaving via err_exit (rc=18446744073709551588 : -28 : ffffffffffffffe4)
> >
> > 00000100:00000040:5:1263315233.100600:0:22069:0:(client.c:1629:__ptlrpc_req_finished()) @@@ refcount now 0  req@00000103a5820800 x200609397/t0 o36->scratch-MDT0000_UUID@172.16.34.1@o2ib:12/10 lens 376/424 e 0 to 1 dl 1263315433 ref 1 fl Interpret:R/0/0 rc 0/-28
> >
> > 00000080:00000001:5:1263315233.100620:0:22069:0:(namei.c:930:ll_mknod_generic()) Process leaving (rc=18446744073709551588 : -28 : ffffffffffffffe4)
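> >
> > (Side note: the rc 0/-28 above is errno 28, ENOSPC / "No space left on device". To
> > double-check the mapping, assuming standard kernel headers are installed -- the header
> > path may vary by distro:
> >
> >     grep -w ENOSPC /usr/include/asm-generic/errno-base.h    # should show: #define ENOSPC 28
> > )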
> >
> > Finally here is the lfs df output:
> >
> > [root@ra 18X11]# lfs df
> > UUID                  1K-blocks        Used     Available  Use% Mounted on
> > home-MDT0000_UUID    5127574032     2034740    4832512272    0% /lustre/home[MDT:0]
> > home-OST0000_UUID    5768577552  1392861480    4082688968   24% /lustre/home[OST:0]
> > home-OST0001_UUID    5768577552  1206861808    4268688824   20% /lustre/home[OST:1]
> > home-OST0002_UUID    5768577552  1500109508    3975439928   26% /lustre/home[OST:2]
> > home-OST0003_UUID    5768577552  1233475740    4242074712   21% /lustre/home[OST:3]
> > home-OST0004_UUID    5768577552  1197398768    4278150628   20% /lustre/home[OST:4]
> > home-OST0005_UUID    5768577552  1186058976    4289491656   20% /lustre/home[OST:5]
> >
> > filesystem summary: 34611465312  7716766280   25136534716   22% /lustre/home
> >
> > UUID                  1K-blocks         Used    Available  Use% Mounted on
> > scratch-MDT0000_UUID 5127569936      9913156   4824629964    0% /lustre/scratch[MDT:0]
> > scratch-OST0000_UUID 5768577552   4446029104   1029519960   77% /lustre/scratch[OST:0]
> > scratch-OST0001_UUID 5768577552   3914730392   1560819220   67% /lustre/scratch[OST:1]
> > scratch-OST0002_UUID 5768577552   4268932844   1206616396   74% /lustre/scratch[OST:2]
> > scratch-OST0003_UUID 5768577552   4307085048   1168464192   74% /lustre/scratch[OST:3]
> > scratch-OST0004_UUID 5768577552   3920023888   1555525724   67% /lustre/scratch[OST:4]
> > scratch-OST0005_UUID 5768577552   3590710852   1884838760   62% /lustre/scratch[OST:5]
> > scratch-OST0006_UUID 5768577552   4649048836    826500028   80% /lustre/scratch[OST:6]
> > scratch-OST0007_UUID 5768577552   4089658692   1385890920   70% /lustre/scratch[OST:7]
> > scratch-OST0008_UUID 5768577552   4151458292   1324090948   71% /lustre/scratch[OST:8]
> > scratch-OST0009_UUID 5768577552   4116646240   1358902348   71% /lustre/scratch[OST:9]
> > scratch-OST000a_UUID 5768577552   3750259568   1725290032   65% /lustre/scratch[OST:10]
> > scratch-OST000b_UUID 5768577552   4346406836   1129141752   75% /lustre/scratch[OST:11]
> > scratch-OST000c_UUID 5768577552   4376152100   1099396768   75% /lustre/scratch[OST:12]
> > scratch-OST000d_UUID 5768577552   4312773056   1162776184   74% /lustre/scratch[OST:13]
> > scratch-OST000e_UUID 5768577552   4900307080    575242532   84% /lustre/scratch[OST:14]
> > scratch-OST000f_UUID 5768577552   4044304276   1431243940   70% /lustre/scratch[OST:15]
> > scratch-OST0010_UUID 5768577552   3827521672   1648026552   66% /lustre/scratch[OST:16]
> > scratch-OST0011_UUID 5768577552   3789120072   1686427400   65% /lustre/scratch[OST:17]
> > scratch-OST0012_UUID 5768577552   4023497048   1452052192   69% /lustre/scratch[OST:18]
> > scratch-OST0013_UUID 5768577552   4133682544   1341866324   71% /lustre/scratch[OST:19]
> > scratch-OST0014_UUID 5768577552   3690021408   1785527832   63% /lustre/scratch[OST:20]
> > scratch-OST0015_UUID 5768577552   3891559096   1583990144   67% /lustre/scratch[OST:21]
> > scratch-OST0016_UUID 5768577552   4404600712   1070948896   76% /lustre/scratch[OST:22]
> > scratch-OST0017_UUID 5768577552   4792223084    683326528   83% /lustre/scratch[OST:23]
> > scratch-OST0018_UUID 5768577552   4486070024    989478844   77% /lustre/scratch[OST:24]
> > scratch-OST0019_UUID 5768577552   4471754448   1003795164   77% /lustre/scratch[OST:25]
> > scratch-OST001a_UUID 5768577552   4517349052    958199536   78% /lustre/scratch[OST:26]
> > scratch-OST001b_UUID 5768577552   3989325372   1486223000   69% /lustre/scratch[OST:27]
> > scratch-OST001c_UUID 5768577552   4024754964   1450793904   69% /lustre/scratch[OST:28]
> > scratch-OST001d_UUID 5768577552   3883873220   1591676392   67% /lustre/scratch[OST:29]
> > scratch-OST001e_UUID 5768577552   4928383088    547166152   85% /lustre/scratch[OST:30]
> > scratch-OST001f_UUID 5768577552   4291418836   1184130776   74% /lustre/scratch[OST:31]
> >
> > filesystem summary: 184594481664 134329681744 40887889340   72% /lustre/scratch
> >
> > [root@ra 18X11]# lfs df -i
> > UUID                     Inodes     IUsed      IFree IUse% Mounted on
> > home-MDT0000_UUID    1287101228   5716405 1281384823    0% /lustre/home[MDT:0]
> > home-OST0000_UUID     366288896    871143  365417753    0% /lustre/home[OST:0]
> > home-OST0001_UUID     366288896    900011  365388885    0% /lustre/home[OST:1]
> > home-OST0002_UUID     366288896    804892  365484004    0% /lustre/home[OST:2]
> > home-OST0003_UUID     366288896    836213  365452683    0% /lustre/home[OST:3]
> > home-OST0004_UUID     366288896    836852  365452044    0% /lustre/home[OST:4]
> > home-OST0005_UUID     366288896    850446  365438450    0% /lustre/home[OST:5]
> >
> > filesystem summary:  1287101228   5716405 1281384823    0% /lustre/home
> >
> > UUID                     Inodes     IUsed      IFree IUse% Mounted on
> > scratch-MDT0000_UUID 1453492963 174078773 1279414190   11% /lustre/scratch[MDT:0]
> > scratch-OST0000_UUID  337257280   6621404  330635876    1% /lustre/scratch[OST:0]
> > scratch-OST0001_UUID  366288896   6697629  359591267    1% /lustre/scratch[OST:1]
> > scratch-OST0002_UUID  366288896   5272904  361015992    1% /lustre/scratch[OST:2]
> > scratch-OST0003_UUID  366288896   5161903  361126993    1% /lustre/scratch[OST:3]
> > scratch-OST0004_UUID  366288896   5327683  360961213    1% /lustre/scratch[OST:4]
> > scratch-OST0005_UUID  366288896   5582579  360706317    1% /lustre/scratch[OST:5]
> > scratch-OST0006_UUID  285040431   5158974  279881457    1% /lustre/scratch[OST:6]
> > scratch-OST0007_UUID  366288896   5307157  360981739    1% /lustre/scratch[OST:7]
> > scratch-OST0008_UUID  366288896   5387313  360901583    1% /lustre/scratch[OST:8]
> > scratch-OST0009_UUID  366288896   5426523  360862373    1% /lustre/scratch[OST:9]
> > scratch-OST000a_UUID  366288896   5424803  360864093    1% /lustre/scratch[OST:10]
> > scratch-OST000b_UUID  360664073   5122378  355541695    1% /lustre/scratch[OST:11]
> > scratch-OST000c_UUID  353235316   5129413  348105903    1% /lustre/scratch[OST:12]
> > scratch-OST000d_UUID  366288896   5053936  361234960    1% /lustre/scratch[OST:13]
> > scratch-OST000e_UUID  222189585   5122229  217067356    2% /lustre/scratch[OST:14]
> > scratch-OST000f_UUID  366288896   5281196  361007700    1% /lustre/scratch[OST:15]
> > scratch-OST0010_UUID  366288896   5274738  361014158    1% /lustre/scratch[OST:16]
> > scratch-OST0011_UUID  366288896   5409560  360879336    1% /lustre/scratch[OST:17]
> > scratch-OST0012_UUID  366288896   5369406  360919490    1% /lustre/scratch[OST:18]
> > scratch-OST0013_UUID  366288896   5502974  360785922    1% /lustre/scratch[OST:19]
> > scratch-OST0014_UUID  366288896   5521406  360767490    1% /lustre/scratch[OST:20]
> > scratch-OST0015_UUID  366288896   5550606  360738290    1% /lustre/scratch[OST:21]
> > scratch-OST0016_UUID  345993048   4999552  340993496    1% /lustre/scratch[OST:22]
> > scratch-OST0017_UUID  249051056   4963064  244087992    1% /lustre/scratch[OST:23]
> > scratch-OST0018_UUID  325734426   5108454  320625972    1% /lustre/scratch[OST:24]
> > scratch-OST0019_UUID  329427010   5222114  324204896    1% /lustre/scratch[OST:25]
> > scratch-OST001a_UUID  317921820   5115591  312806229    1% /lustre/scratch[OST:26]
> > scratch-OST001b_UUID  366288896   5353229  360935667    1% /lustre/scratch[OST:27]
> > scratch-OST001c_UUID  366288896   5383473  360905423    1% /lustre/scratch[OST:28]
> > scratch-OST001d_UUID  366288896   5411890  360877006    1% /lustre/scratch[OST:29]
> > scratch-OST001e_UUID  216236615   6188887  210047728    2% /lustre/scratch[OST:30]
> > scratch-OST001f_UUID  366288896   6465049  359823847    1% /lustre/scratch[OST:31]
> >
> > filesystem summary:  1453492963 174078773 1279414190   11% /lustre/scratch
> >
> >
> > Thanks,
> > Mike Robbert
> >
> > On Jan 11, 2010, at 7:24 PM, Andreas Dilger wrote:
> > > On 2010-01-11, at 15:59, Michael Robbert wrote:
> > >> The filename is not very unique. I can create a file with the same
> > >> name in another directory or on another Lustre filesystem. It is
> > >> just this exact path on this filesystem. The full path is:
> > >> /lustre/scratch/smoqbel/Cenval/CLM/Met.Forcing/18X11/NLDAS.APCP.007100.pfb.00164
> > >> The mount point for this filesystem is /lustre/scratch/
> > >
> > > Robert,
> > > does the same problem happen on multiple client nodes, or is it only
> > > happening on a single client?  Are there any messages on the MDS and/
> > > or the OSSes when this problem is happening?  This problem is somewhat
> > > unusual, since I'm not aware of any places outside the disk filesystem
> > > code that would cause ENOSPC when creating a file.
> > >
> > > Can you please do a bit of debugging on the system:
> > >
> > >     {client}# cd /lustre/scratch/smoqbel/Cenval/CLM/Met.Forcing/18X11
> > > {mds,client}# echo -1 > /proc/sys/lustre/debug       # enable full debug
> > > {mds,client}# lctl clear                             # clear debug logs
> > >     {client}# touch NLDAS.APCP.007100.pfb.00164
> > > {mds,client}# lctl dk > /tmp/debug.{mds,client}      # dump debug logs
> > >
> > > For now, please just extract the ENOSPC errors from the logs; that will be much
> > > shorter, may be enough to identify where the problem is located, and will be a
> > > lot friendlier to the list:
> > >
> > > grep -- "-28" /tmp/debug.{mds,client} > /tmp/debug-28.{mds,client}
> > >
> > > along with the "lfs df" and "lfs df -i" output.
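> > >
> > > (The {mds,client} notation above is just shorthand for running the command once
> > > per host; spelled out, something like:
> > >
> > >     {mds}#    grep -- "-28" /tmp/debug.mds    > /tmp/debug-28.mds
> > >     {client}# grep -- "-28" /tmp/debug.client > /tmp/debug-28.client
> > > )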
> > >
> > > If this is only on a single client, just dropping the locks on the
> > > client might be enough to resolve the problem:
> > >
> > > for L in /proc/fs/lustre/ldlm/namespaces/*; do
> > >     echo clear > $L/lru_size
> > > done
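> > >
> > > (If the client's lctl is new enough to have set_param, the equivalent one-liner
> > > would presumably be:
> > >
> > >     lctl set_param ldlm.namespaces.*.lru_size=clear
> > > )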
> > >
> > > If, on the other hand, this same problem is happening on all clients
> > > then the problem is likely on the MDS.
> > >
> > >>> On Fri, Jan 8, 2010 at 1:36 PM, Michael Robbert <mrobbert at mines.edu> wrote:
> > >>>> I have a user who reported a problem creating a file on our Lustre
> > >>>> filesystem. When I investigated, I found that the problem appears to be
> > >>>> unique to just one filename in one directory. I have tried numerous ways
> > >>>> of creating the file, including echo, touch, and "lfs setstripe"; all of
> > >>>> them return "No space left on device". I have checked the filesystem with
> > >>>> df and "lfs df"; both show that the filesystem and all OSTs are far from
> > >>>> full, for both blocks and inodes. Files with slight changes to the
> > >>>> filename are created fine. We had a kernel panic on the MDS yesterday,
> > >>>> and it is quite possible that the user had a compute job working in this
> > >>>> directory at the time of that problem. I am guessing we have some kind of
> > >>>> corruption in the directory. This directory has around 1 million files,
> > >>>> so moving the data around may not be a quick operation, but we're willing
> > >>>> to do it. I just want to know the best way, short of taking the
> > >>>> filesystem offline, to fix this problem.
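> > >>>>
> > >>>> (One more data point that might help, assuming strace is available on a
> > >>>> client: tracing the failing create shows exactly which system call returns
> > >>>> ENOSPC, e.g.
> > >>>>
> > >>>>     strace -f -e trace=open,creat touch NLDAS.APCP.007100.pfb.00164
> > >>>> )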
> > >>>>
> > >>>> Any ideas? Thanks in advance,
> > >>>> Mike Robbert
> > >
> > > Cheers, Andreas
> > > --
> > > Andreas Dilger
> > > Sr. Staff Engineer, Lustre Group
> > > Sun Microsystems of Canada, Inc.
> >
> 
