[Lustre-discuss] un-even distribution of data over OSTs
Andreas Dilger
adilger at whamcloud.com
Wed Mar 7 12:43:17 PST 2012
On 2012-03-08, at 0:33, Roland Laifer <roland.laifer at kit.edu> wrote:
> I recently had a similar problem: Very bad OST space occupation but I could not
> find a corresponding very large file.
>
> Finally I found a process with parent 1 that was still appending data to a
> file which was already deleted by the user, i.e. this was the reason why
> I could not find a corresponding large file with "find".
It is possible to find open-unlinked files on the clients by using "lsof | grep deleted", since deleted files get " (deleted)" added at the end.
Sometimes this is normal, e.g. for temp files that are meant to disappear when the process exits, but usually it is not.
These files can still be accessed via /proc/{PID}/fd/{fileno}, though I've never checked whether "lfs getstripe" would work there or not.
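
For a client where lsof is slow or not installed, the same check can be sketched directly against /proc. This is only a minimal illustration, assuming a Linux client where the open descriptors appear as symlinks under /proc/{PID}/fd and the kernel appends " (deleted)" to the link target of an unlinked file. The function name find_open_unlinked and its path_prefix parameter are made up for this example and are not part of any Lustre tool. It must run as root to see other users' processes, and a file whose real name happens to end in " (deleted)" would be reported falsely.

```python
import os

# Suffix the Linux kernel appends to the link target of an unlinked file.
DELETED_SUFFIX = " (deleted)"


def find_open_unlinked(proc_root="/proc", path_prefix=""):
    """Return sorted (pid, fd, path) tuples for open files that are unlinked.

    proc_root   -- where procfs is mounted (parameterised only for testing)
    path_prefix -- hypothetical filter, e.g. the Lustre mount point
    """
    hits = []
    for pid in os.listdir(proc_root):
        if not pid.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        fd_dir = os.path.join(proc_root, pid, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor closed while we were looking
            if not target.endswith(DELETED_SUFFIX):
                continue
            path = target[: -len(DELETED_SUFFIX)]
            if path.startswith(path_prefix):
                hits.append((int(pid), int(fd), path))
    return sorted(hits)
```

Calling find_open_unlinked(path_prefix="/lustre") (with whatever your mount point is) lists the PID and descriptor number of each candidate, which is the information needed to decide whether the owning process should be stopped.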
> I found that process because the corresponding client was reporting
> "The ost_write operation failed with -28" LustreError messages and because
> I was lucky that only few user processes were running on that client.
> The owner of that process had Lustre quotas of 1.5 TB but "du -hs" on his
> home directory only showed 80 GB. After killing the process Lustre quotas
> went down and "lfs df" showed that OST usage was going down, too.
>
> Regards,
> Roland
>
>
> On Wed, Mar 07, 2012 at 07:41:28AM -0800, Grigory Shamov wrote:
>> Dear Lustre-Users,
>>
>> Recently we had an issue with file data distribution over our Lustre OSTs. We have a Lustre storage cluster here, with two OSS servers in active-active failover mode. The version of Lustre is 1.8, possibly with DDN patches.
>>
>> The cluster has 12 OSTs, 7.3 TB each. Normally they are filled to about 60% (4.5 TB or so), but recently one of them filled up completely (99%), with two others close behind (80%). The rest of the OSTs stayed at the usual 60%.
>>
>> Why would that happen? Shouldn't Lustre try to distribute the data evenly? I have checked the full OSTs for large files; there were no files large enough to explain the difference (i.e. of the order of the gap between 99% and 60% occupancy, 2-3 TB). Some users did have large directories, but their files were about 5-10 GB each.
>>
>> I have checked our Lustre parameters: qos_prio_free seems to be at the default of 90%, qos_threshold_rr is 16%, and the stripe count is 1.
>>
>> Could you please suggest what might have caused this behavior of Lustre, and whether there are any tunables or better threshold values that would help avoid such imbalances?
>>
>> Thank you very much in advance!
>>
>> --
>> Grigory Shamov
>> HPC Analyst,
>> University of Manitoba
>> Winnipeg MB Canada
>>
>> _______________________________________________
>> Lustre-discuss mailing list
>> Lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>
>
> --
> Karlsruhe Institute of Technology (KIT)
> Steinbuch Centre for Computing (SCC)
>
> Roland Laifer
> Scientific Computing and Simulation (SCS)
>
> Zirkel 2, Building 20.21, Room 209
> 76131 Karlsruhe, Germany
> Phone: +49 721 608 44861
> Fax: +49 721 32550
> Email: roland.laifer at kit.edu
> Web: http://www.scc.kit.edu
>
> KIT – University of the State of Baden-Wuerttemberg and
> National Laboratory of the Helmholtz Association