[lustre-discuss] Lustre writes wrong data under PostgreSQL benchmark tool test (concurrent access to the same data file, with primary node writing and standby node reading)

Michael Nishimoto michael at kmesh.io
Fri Oct 5 17:30:20 PDT 2018


Hi Andreas,

I didn't see a reply, but I have a follow-up question.  Also, my
Postgres knowledge is pretty much zero.

I understand your concern about using direct I/O and how application
buffers can cause problems.
My question is about a statement in the original email: the reference to
shutting down Postgres on the standby node and running pg_xlogdump, which
I assume is a standalone command.

Shouldn't a standalone command running on the standby node see a consistent
copy of the data?

I assume that active and standby nodes are Lustre clients in the same
cluster.
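
To make sure I'm thinking about this correctly, here is a minimal sketch of
what I'd expect such a standalone reader to do (just an illustration of
buffered vs. direct reads, assuming Linux open(2)/pread(2) semantics; the
path and block size are placeholders, not taken from the original report):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    /* Placeholder path and size -- not from the original report. */
    const char *path = "/lustre/pg_xlog/000000010000000000000001";
    const size_t len = 8192;              /* PostgreSQL-sized block */
    char buf[8192];

    /* Buffered read: goes through the client page cache, which Lustre
     * keeps coherent across clients.  A freshly started reader has no
     * stale userspace copy of the block to be misled by. */
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open (buffered)"); return 1; }
    if (pread(fd, buf, len, 0) < 0) perror("pread (buffered)");
    close(fd);

    /* O_DIRECT read: bypasses the client page cache entirely; the
     * buffer and I/O size must be suitably aligned (typically 4 KiB). */
    void *dbuf;
    if (posix_memalign(&dbuf, 4096, len) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }
    fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open (O_DIRECT)"); free(dbuf); return 1; }
    if (pread(fd, dbuf, len, 0) < 0) perror("pread (O_DIRECT)");
    close(fd);
    free(dbuf);

    return 0;
}

Either way, such a reader pulls the block fresh from Lustre rather than
from a long-lived application buffer, which is why I would expect it to
see whatever the primary last wrote.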

Thanks,

Michael

On Sat, Sep 29, 2018 at 2:54 PM Andreas Dilger <adilger at whamcloud.com>
wrote:

> Is PG using O_DIRECT or buffered read/write?  Is it caching the pages in
> userspace?
>
> Lustre will definitely keep pages consistent between clients, but if the
> application is caching the pages in userspace, and does not have any
> protocol between the nodes to invalidate cached pages when they are
> modified on disk, then the data will become inconsistent when one node
> modifies it.
>
> That is the same reason it isn't possible to mount a single ext4
> filesystem r/w on one node and r/o on another node with shared storage,
> because the filesystem doesn't expect data to be changing underneath it,
> and will cache pages in RAM and not re-read them if they are modified on
> the other node.
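>
> As a minimal illustration of that failure mode (a sketch only, not
> PostgreSQL's actual code; the path below is a placeholder):
>
> /* An application that caches a block in its own userspace buffer and
>  * never re-reads it.  Lustre keeps the kernel page cache coherent
>  * across clients, but nothing invalidates this private copy when the
>  * file is rewritten from another node. */
> #define _GNU_SOURCE
> #include <fcntl.h>
> #include <stdio.h>
> #include <unistd.h>
>
> static char cached_block[8192];          /* application-level cache */
>
> int main(void)
> {
>     int fd = open("/lustre/some_data_file", O_RDONLY);  /* placeholder */
>     if (fd < 0) { perror("open"); return 1; }
>     if (pread(fd, cached_block, sizeof(cached_block), 0) < 0)
>         perror("pread");
>
>     for (;;) {
>         /* ... keep serving requests from cached_block ...
>          * If another client rewrites this block on disk, this process
>          * never notices: it never calls pread() again and has no
>          * cross-node protocol to invalidate its copy. */
>         sleep(1);
>     }
> }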
>
> Cheers, Andreas
>
> On Sep 29, 2018, at 07:57, 王亮 <wanziforever at gmail.com> wrote:
>
> Hello, Lustre development team
>
>
> Background: we have two PostgreSQL instances running as a primary and a
> standby, and they share the same xlog file and data files (we changed the
> PG code to achieve this), which are located in the mounted Lustre file
> system. We wanted to give Lustre a try; we used GFS2 before, and we expect
> Lustre to show much better performance, but ...
>
>
>
> We have run into a concurrent read/write access problem related to Lustre.
> Could you give us some suggestions? Any advice is appreciated, and thank
> you in advance : )
>
> Note: we are very sure the standby instance does not write any data to
> disk. (To confirm this, we also shut down the standby instance and used
> the pg_xlogdump tool to read the xlog file; the problem still happened,
> and pg_xlogdump is only a query tool without any write operations.)
>
>
>
> Scenario Description:
>
> There are 4 nodes (CentOS Linux release 7.4) connected with an InfiniBand
> network (driven by MLNX_OFED_LINUX-4.4):
>
> 10.0.0.106 acts as the MDS, with a local PCI-E 800 GB SSD used as the MDT.
>
> 10.0.0.101 acts as the OSS, with an identical local PCI-E 800 GB SSD used
> as the OST.
>
> 10.0.0.104 and 10.0.0.105 act as Lustre clients and mount the Lustre file
> system at the directory “/lustre”.
>
> The Lustre-related packages are compiled from the official
> lustre-2.10.5-1.src.rpm.
>
>
>
> The simplest verification (i.e. a dd command) passed without errors.
>
>
>
> Error:
>
> Then we start our customized PostgreSQL service on 104 and 105. 104 runs
> as the primary PostgreSQL server, and 105 runs as the secondary PostgreSQL
> server. Both PostgreSQL nodes read/write the shared directory “/lustre”
> provided by Lustre. The primary PostgreSQL server opens files in **RW**
> mode and writes data into them; at the **same time** the secondary
> PostgreSQL server opens the same files in **R** mode and reads the data
> written by the primary PostgreSQL server, and it gets the **wrong** data
> (the data flushed to disk by the primary is also wrong, i.e. wrong data is
> written to disk). This happens when we run a PostgreSQL benchmark tool.
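>
> To help isolate whether stale data can be read back between the two
> clients outside of PostgreSQL, a probe along the following lines could be
> run (just a sketch; the file path and block size are placeholders, not
> part of our setup).  Compile it with cc, then run "./probe write" on the
> primary client and "./probe" on the standby client:
>
> /* Writer: stamps an ever-increasing counter into one block and fsyncs.
>  * Reader: keeps re-reading the block and reports if the counter ever
>  * appears to go backwards, i.e. a stale block was returned.  This is
>  * only a rough probe, not an exact test. */
> #define _GNU_SOURCE
> #include <fcntl.h>
> #include <stdint.h>
> #include <stdio.h>
> #include <string.h>
> #include <unistd.h>
>
> #define PROBE_PATH "/lustre/consistency_probe"  /* placeholder path */
> #define BLK 8192                                /* PostgreSQL-sized block */
>
> int main(int argc, char **argv)
> {
>     char block[BLK];
>     uint64_t stamp = 0, last = 0;
>
>     if (argc > 1 && strcmp(argv[1], "write") == 0) {
>         int fd = open(PROBE_PATH, O_RDWR | O_CREAT, 0644);
>         if (fd < 0) { perror("open"); return 1; }
>         for (;;) {
>             memset(block, 0, BLK);
>             stamp++;
>             memcpy(block, &stamp, sizeof(stamp));
>             if (pwrite(fd, block, BLK, 0) != BLK) { perror("pwrite"); return 1; }
>             fsync(fd);
>         }
>     }
>
>     int fd = open(PROBE_PATH, O_RDONLY);     /* reader mode */
>     if (fd < 0) { perror("open"); return 1; }
>     for (;;) {
>         if (pread(fd, block, BLK, 0) == BLK) {
>             memcpy(&stamp, block, sizeof(stamp));
>             if (stamp < last)
>                 fprintf(stderr, "stale read: saw %llu after %llu\n",
>                         (unsigned long long)stamp, (unsigned long long)last);
>             else
>                 last = stamp;
>         }
>         usleep(1000);
>     }
> }
>
> If the reader never reports a backwards counter while the PostgreSQL
> benchmark still produces wrong data, that would point more toward the
> application side (e.g. cached buffers) than toward Lustre itself.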
>
>
>
>
>
> PS 1.
>
> We tried different options to mount the Lustre file system:
>
> mount -t lustre -o flock 10.0.0.106@o2ib0:/birdfs /lustre
>
> mount -t lustre -o flock -o ro 10.0.0.106@o2ib0:/birdfs /lustre
>
> but the error always occurs.
>
>
>
> PS 2.
>
> The initial setup information is attached below, in case it is helpful.
>
> [root@106 ~]# mkfs.lustre --fsname=birdfs --mgs --mdt --index=0
> --reformat /dev/nvme0n1
>
>
>
>    Permanent disk data:
>
> Target:     birdfs:MDT0000
>
> Index:      0
>
> Lustre FS:  birdfs
>
> Mount type: ldiskfs
>
> Flags:      0x65
>
>               (MDT MGS first_time update )
>
> Persistent mount opts: user_xattr,errors=remount-ro
>
> Parameters:
>
>
>
> device size = 763097MB
>
> formatting backing filesystem ldiskfs on /dev/nvme0n1
>
>        target name   birdfs:MDT0000
>
>        4k blocks     195353046
>
>        options        -J size=4096 -I 1024 -i 2560 -q -O
> dirdata,uninit_bg,^extents,dir_nlink,quota,huge_file,flex_bg -E
> lazy_journal_init -F
>
> mkfs_cmd = mke2fs -j -b 4096 -L birdfs:MDT0000  -J size=4096 -I 1024 -i
> 2560 -q -O dirdata,uninit_bg,^extents,dir_nlink,quota,huge_file,flex_bg -E
> lazy_journal_init -F /dev/nvme0n1 195353046
>
>
>
> Writing CONFIGS/mountdata
>
>
>
> [root@101 ~]# mkfs.lustre --fsname=birdfs --ost --reformat --index=0
> --mgsnode=10.0.0.106@o2ib0 /dev/nvme0n1
>
>
>
>    Permanent disk data:
>
> Target:     birdfs:OST0000
>
> Index:      0
>
> Lustre FS:  birdfs
>
> Mount type: ldiskfs
>
> Flags:      0x62
>
>               (OST first_time update )
>
> Persistent mount opts: ,errors=remount-ro
>
> Parameters: mgsnode=10.0.0.106@o2ib
>
>
>
> device size = 763097MB
>
> formatting backing filesystem ldiskfs on /dev/nvme0n1
>
>        target name   birdfs:OST0000
>
>        4k blocks     195353046
>
>        options        -J size=400 -I 512 -i 69905 -q -O
> extents,uninit_bg,dir_nlink,quota,huge_file,flex_bg -G 256 -E
> resize="4290772992",lazy_journal_init -F
>
> mkfs_cmd = mke2fs -j -b 4096 -L birdfs:OST0000  -J size=400 -I 512 -i
> 69905 -q -O extents,uninit_bg,dir_nlink,quota,huge_file,flex_bg -G 256 -E
> resize="4290772992",lazy_journal_init -F /dev/nvme0n1 195353046
>
> Writing CONFIGS/mountdata
>
>
>
>
>
> Looking forward to any replies.
>
>
>
> Regards,
>
> Bird
>
>
> --
> regards
> denny
>
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>


-- 
----------------
Michael Nishimoto
cell: 408-410-9277

