[lustre-discuss] File locking errors.

Patrick Farrell paf at cray.com
Tue Feb 20 05:58:33 PST 2018


There is almost NO overhead to this locking unless you’re using it to keep threads away from each other across multiple nodes, in which case the time is spent doing the waiting your app is asking for.

Lustre is doing implicit metadata and data locking constantly throughout normal operation, on all active clients at all times; this is just a little more locking, of an explicit kind.  It should be almost impossible to measure any cost difference between flock and localflock unless you’re really going wild with your file locking, in which case it would be worth your while to modify your job to use a better kind of concurrency control, because even localflock is slow compared to MPI communication.
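
As a rough illustration, here is a minimal sketch of the kind of explicit advisory lock the flock/localflock mount options govern (the /mnt/lustre path below is just a placeholder, not anything from this thread).  With "-o flock" the lock is coherent across every client node; with "-o localflock" it is only honored by processes on the same node.

/* Minimal sketch: take an exclusive advisory lock on a shared file.
 * Under "-o flock" this serializes writers on all client nodes;
 * under "-o localflock" it only serializes writers on this node.
 * The path is a placeholder.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/file.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/lustre/shared.dat", O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return EXIT_FAILURE;
    }

    /* Block until we hold an exclusive lock on the whole file. */
    if (flock(fd, LOCK_EX) != 0) {
        perror("flock");
        close(fd);
        return EXIT_FAILURE;
    }

    /* ... read/modify/write the shared region here ... */

    flock(fd, LOCK_UN);   /* release the lock */
    close(fd);
    return EXIT_SUCCESS;
}

The open/flock pair above is cheap next to the implicit locking Lustre is already doing for the data and metadata on every active client.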


________________________________________
From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on behalf of Michael Di Domenico <mdidomenico4 at gmail.com>
Sent: Tuesday, February 20, 2018 6:47:16 AM
To: Prentice Bisbal
Cc: lustre-discuss
Subject: Re: [lustre-discuss] File locking errors.

On Fri, Feb 16, 2018 at 11:52 AM, Prentice Bisbal <pbisbal at pppl.gov> wrote:
> On 02/15/2018 06:30 PM, Patrick Farrell wrote:
>> Localflock will only provide flock between threads on the same node.  I
>> would describe it as “likely to result in data corruption unless used with
>> extreme care”.
>
> I can't agree with this enough. Someone runs a single node job and thinks
> "file locking works just fine!", then runs a large multinode job, and then
> wonders why the output files are all messed up. I think enabling file locking
> must be an all or nothing thing.

There are a few instances where this proves false.  Think of an MPI job
that spawns OpenMP threads but needs a local scratch file for
out-of-core work (a quick sketch of that case follows below).  Also keep
in mind that locking across a large number of clients imparts some
performance penalty on the file system.
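
To make that concrete, here is a minimal sketch of the assumed scenario (placeholder scratch path, built with OpenMP support, e.g. gcc -fopenmp): OpenMP threads on one node serialize access to a node-local scratch file, so localflock is enough because every contender lives on the same client.

/* Sketch: threads on a single node serialize out-of-core writes to a
 * scratch file with flock().  No cross-node lock coherence is needed,
 * which is the case where localflock suffices.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <unistd.h>

int main(void)
{
    #pragma omp parallel
    {
        /* Each thread opens its own descriptor, so the flock() calls
         * contend with one another (flock locks belong to the open
         * file description, not the process). */
        int fd = open("/mnt/lustre/scratch/oocore.tmp",
                      O_RDWR | O_CREAT, 0644);
        if (fd >= 0) {
            flock(fd, LOCK_EX);   /* waits for other local threads */
            /* ... write this thread's out-of-core block ... */
            flock(fd, LOCK_UN);
            close(fd);
        } else {
            perror("open");
        }
    }
    return 0;
}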

On a side note, in a previous email you stated your Lustre version.
Just a word from the wise: don't fall behind unless it's a managed
vendor solution.  It's painful to update the servers, but falling
behind and then trying to update them is much worse.  I did it
once and have the scars to prove it... :)
_______________________________________________
lustre-discuss mailing list
lustre-discuss at lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

