wal2json's People

Contributors

benjie, davidfetter, dpirotte, dvarrazzo, eulerto, mijoharas, naros, olirice, rcoup, rkrage, robert-pang, shubhamdhama, xrmx


wal2json's Issues

Sequence of decoding

Hello,
This is more of a question than an issue.
Assuming I have three concurrent transactions A, B and C, all started at the same time x.
B happens to commit before A, and A commits before C.
So the commit order is something along the lines of:

B, A, C.

Given that they commit at different times, do they all get decoded in the same order in which they committed? Or, in an environment with high concurrency, is it possible that decoding does not strictly follow the order/time in which each transaction was committed?

If transactions can be decoded regardless of their commit time, then it follows that the commit timestamp is not a reliable way to track consumed messages and resume replaying from where one left off.
Is it then possible to have the LSN for both the begin and the commit of a transaction?

I know that in the current wal2json output the begin header, the transaction changes and the commit footer are all combined into a single message, making it impossible to track the commit LSN.
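
For context, from the SQL interface the per-transaction LSN is already visible as the lsn column of the slot functions, and that is the value I would like to be able to track from the streaming output as well (the slot name here is a placeholder):

SELECT lsn, xid, data
FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL);
-- lsn and xid identify each decoded transaction; logical decoding
-- returns transactions in commit order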

Could you please shed some light on this topic ?

regards,

Numeric/Decimals should be strings?

NUMERICOID is encoded as a number in the JSON output, but decimal/numeric values are typically encoded as strings to preserve precision and scale without floating-point ambiguity. My gut feeling is that wal2json should do the same thing.
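
To illustrate the concern: a consumer that parses the JSON number into a double loses the exactness that numeric has on the server. A quick sketch:

SELECT 0.1::numeric + 0.2::numeric;  -- exact: 0.3
SELECT 0.1::float8  + 0.2::float8;   -- 0.30000000000000004

Emitting numerics as strings would let consumers decide how to parse them.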

Thoughts?

bytea data not properly decoded

Hello,
I recently found out that bytea columns are not properly decoded. The bytea data comes out in some strange encoding that does not resemble the source data in any way.

Could this be some kind of bug?

cheers.

rugging24

(maybe) wrong memory context?

More a question than a bug report, as I don't know many details of the Postgres allocator:

pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
				 Relation relation, ReorderBufferChange *change)

	/* ... */

	old = MemoryContextSwitchTo(data->context);

	/* ... */

	switch (change->action)
	{
		case REORDER_BUFFER_CHANGE_INSERT:
			if (change->data.tp.newtuple == NULL)
			{
				elog(WARNING, "no tuple data for INSERT in table \"%s\"", NameStr(class_form->relname));
				return;
			}
			break;

	/* ... */

	MemoryContextSwitchTo(old);
	MemoryContextReset(data->context);
}

If the return happens early, isn't the memory context left in an inconsistent state?

Thank you.

postpone write parameter

Currently, the plugin outputs the JSON in pieces (data is sent as soon as it is produced). The receiver has to maintain state to know whether a complete JSON object has already arrived before it can process it.

At least from the SQL interface POV, it would be a good idea to support buffering the JSON object until its end. It would avoid maintaining that state in the receiver. The drawback is that long transactions would consume a lot of server memory.

PS> Unfortunately, pg_recvlogical can't take advantage of this option because you can only provide parameters at startup. Of course, you could choose a convenient default for all your streaming.

Size unhandled in pg_decode_message

Hello again,

While testing some code paths, I found that content_size isn't used in every code path of pg_decode_message(), which can lead to erroneous output. For instance:

rjuju=# SELECT pg_logical_emit_message(false, 'wal2json', 'a very long message');
 pg_logical_emit_message 
-------------------------
 0/629A5358
(1 row)

rjuju=# SELECT pg_logical_emit_message(false, 'wal2json', 'meh');
 pg_logical_emit_message 
-------------------------
 0/629A5398
(1 row)

rjuju=# select * from pg_logical_slot_peek_changes('wal2json', null, null, 'pretty-print', '1');
    lsn     | xid |                           data                           
------------+-----+----------------------------------------------------------
 0/629A5358 |   0 | {                                                       +
            |     |         "change": [                                     +
            |     |                 {                                       +
            |     |                         "kind": "message",              +
            |     |                         "transactional": false,         +
            |     |                         "prefix": "wal2json",           +
            |     |                         "content": "a very long message"+
            |     |                 }                                       +
            |     |         ]                                               +
            |     | }
 0/629A5398 |   0 | {                                                       +
            |     |         "change": [                                     +
            |     |                 {                                       +
            |     |                         "kind": "message",              +
            |     |                         "transactional": false,         +
            |     |                         "prefix": "wal2json",           +
            |     |                         "content": "meh"                +
            |     |                 }                                       +
            |     |         ]                                               +
            |     | }
(2 rows)

rjuju=# select * from pg_logical_slot_peek_changes('wal2json', null, null, 'pretty-print', '0');
    lsn     | xid |                                                   data                                                    
------------+-----+-----------------------------------------------------------------------------------------------------------
 0/629A5358 |   0 | {"change":[{"kind":"message","transactional":false,"prefix":"wal2json","content":"a very long message"}]}
 0/629A5398 |   0 | {"change":[{"kind":"message","transactional":false,"prefix":"wal2json","content":"mehery long message"}]}
(2 rows)

(see the content of the last two messages).

I didn't read the upstream infrastructure, but I assume that for performance reasons the content isn't zeroed on every call, so you can't rely on it being NULL-terminated.

I wanted to provide a fix for this, but I have some questions.

I'm not sure why, when pretty-print is requested, you do an appendBinaryStringInfo() and otherwise call quote_escape_json(). I assume that quote_escape_json() should be called in both cases, or is there something I missed?

If quote_escape_json() has to be called for both, I think we can add a Size parameter and iterate over the content as an array instead of the current pointer loop. We could also manually call enlargeStringInfo() with the given size, instead of relying on it doubling its size automatically. That probably won't be a huge win, but it might still save some cycles.

What do you think?

Compile issue using USE_PGXS=1 make && make install

Hi, I've been trying to use wal2json with Kafka Connect. However, I am not able to compile it because the directory/file containing pgxs.mk cannot be found.

I'm using Ubuntu 10.04 and Postgres is installed under /usr/lib, but it doesn't have the .mk file.

I also tried installing PostgreSQL from source; that still didn't work. Any help would be greatly appreciated :)

Create an official release or tag

Hi, the Debezium project is interested in using this plug-in; would it be possible to cut an official release and/or tag so we have a stable version we can refer to? I'm a bit reluctant to pull in just the HEAD of master as it's inherently a moving target. So we currently refer to a specific commit id, but having a named tag would be better of course. Thanks a lot for considering this request!

Question on the format of datetime

I use wal2json and psycopg2 to get PostgreSQL data changes. I have a question about the format of datetime values. In psql I have a column of type timestamp with time zone.

In psql console:

postgres=# select create_time from test limit 1;
          create_time
-------------------------------
 2018-05-18 03:06:22.446039+00

In a graphical IDE like DataGrip (after I click on the column):

create_time
2018-05-18 03:06:22.446039 +00:00

wal2json returns '2018-05-18 03:06:22.446039+00'. But looking at https://docs.python.org/3/library/datetime.html, the %z directive expects '+0000', i.e. four digits.
Am I misunderstanding the standard?

Options defaults are not very sensible

I don't think the current defaults are particularly sensible:

  • include-xid = true: xids are only useful if you know what to do with them, as well as LSN. Otherwise they are just a source of noise to be filtered out. See test suite.

  • write-in-chunks = true: with this value the records emitted are not valid JSON objects and a streaming parser is required. I don't even know what it is useful for

piro=# SELECT data::json FROM pg_logical_slot_peek_changes('jslot', NULL, NULL, 'write-in-chunks', 'f');
     data      
---------------
 {"change":[]}
(1 row)

piro=# SELECT data::json FROM pg_logical_slot_peek_changes('jslot', NULL, NULL, 'write-in-chunks', 't');
ERROR:  invalid input syntax for type json
DETAIL:  The input string ended unexpectedly.
CONTEXT:  JSON data, line 1: {"change":[

I don't know what the current policy on backwards compatibility is, but I think these values should default to false.

Json without terminating line

Hi,
I noticed that sometimes the JSON is not terminated with the final "]}" line. Right after the final change array element I get a line similar to this:
'{"xid":000000,"change":['
and then the other changes, and eventually the terminating JSON line.
This makes JSON validation more difficult.

"no known snapshots" error when db rows are large

We are using wal2json with Debezium to replicate some db tables containing large rows. While streaming changes, we have encountered the following error:

2018-07-27T19:54:09.616547179Z ERROR:  no known snapshots
--
2018-07-27T19:54:09.616587252Z CONTEXT:  slot "debezium", output plugin "wal2json", in the change callback, associated LSN 0/16D3850

We think this is a bug in wal2json's handling of TOAST columns, since this behavior only started happening after commit ce82d73.

I have a more detailed analysis, including steps to reproduce the issue, up on Debezium's bugtracker:
https://issues.jboss.org/browse/DBZ-842

Expose scale and precision for NUMERIC columns

Hi, for columns such as NUMERIC(10,3), we'll only see NUMERIC as a column type in corresponding messages, i.e. precision and scale are gone. It'd be great to have them, though, in order to derive corresponding schemas on the consumer side.

I'm not sure whether this can (easily) be done in wal2json, so I'm opening this to get a discussion started. It is certainly something we would benefit from very much in Debezium.
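
For reference, the precision and scale are recorded in the catalogs via the type modifier, so in principle the full type name could be emitted; something along these lines (the table name is a placeholder) shows what is available:

SELECT a.attname,
       format_type(a.atttypid, a.atttypmod) AS column_type
FROM pg_attribute a
WHERE a.attrelid = 'public.my_table'::regclass
  AND a.attnum > 0
  AND NOT a.attisdropped;
-- yields e.g. "numeric(10,3)" instead of just "numeric"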

Why is the LSN key called 'nextlsn'?

This is more of a query than an issue.
The LSN for each change comes with the key 'nextlsn', even though the value is the LSN of the change itself. This gives the impression that the value wal2json reports under 'nextlsn' belongs to the next change rather than the current one, which threw me off.

The code says that it is the value pointing to the next commit record.

Output using JDBC and setting the slot option include-lsn to true:

{
	"xid": 740,
	"nextlsn": "0/18154A8",
	"timestamp": "2018-02-16 12:46:40.877369+05:30",
	"change": [
		{
			"kind": "insert",
			"schema": "public",
			"table": "test_table",
			"columnnames": ["pk", "name"],
			"columntypes": ["integer", "text"],
			"columnvalues": [160, "val1"]
		}
	]
}
-----
{
	"xid": 741,
	"nextlsn": "0/1815558",
	"timestamp": "2018-02-16 12:46:43.485518+05:30",
	"change": [
		{
			"kind": "insert",
			"schema": "public",
			"table": "test_table",
			"columnnames": ["pk", "name"],
			"columntypes": ["integer", "text"],
			"columnvalues": [161, "val2"]
		}
	]
}
-----
{
	"xid": 742,
	"nextlsn": "0/1815608",
	"timestamp": "2018-02-16 12:46:46.437739+05:30",
	"change": [
		{
			"kind": "insert",
			"schema": "public",
			"table": "test_table",
			"columnnames": ["pk", "name"],
			"columntypes": ["integer", "text"],
			"columnvalues": [162, "val3"]
		}
	]
}

Output from psql:

SELECT * FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL);
    lsn    | xid |                                                                              data                                                                              
-----------+-----+----------------------------------------------------------------------------------------------------------------------------------------------------------------
 0/18154A8 | 740 | {"change":[{"kind":"insert","schema":"public","table":"test_table","columnnames":["pk","name"],"columntypes":["integer","text"],"columnvalues":[160,"val1"]}]}
 0/1815558 | 741 | {"change":[{"kind":"insert","schema":"public","table":"test_table","columnnames":["pk","name"],"columntypes":["integer","text"],"columnvalues":[161,"val2"]}]}
 0/1815608 | 742 | {"change":[{"kind":"insert","schema":"public","table":"test_table","columnnames":["pk","name"],"columntypes":["integer","text"],"columnvalues":[162,"val3"]}]}

As you can see, the nextlsn from wal2json maps to the lsn column given by Postgres.

Can someone clear up the confusion?

You need to install postgresql-server-dev-X.Y for building a server-side extension or libpq-dev for building a client-side application

Hello,

I have Postgres 9.5 installed on Ubuntu and am trying to install wal2json.
During execution of USE_PGXS=1 make or USE_PGXS=1 make install, I get the error below.
"/opt/wal2json# USE_PGXS=1 make
You need to install postgresql-server-dev-X.Y for building a server-side extension or libpq-dev for building a client-side application.
make: *** No targets. Stop.
:/opt/wal2json# USE_PGXS=1 make install
You need to install postgresql-server-dev-X.Y for building a server-side extension or libpq-dev for building a client-side application.
make: *** No rule to make target 'install'. Stop."

Please suggest.

Provide an option to output type OIDs instead of type names

Some connectors, namely Python psycopg2 in my case, use type OIDs directly for internal type conversion methods. In these cases it is far simpler to provide these OIDs directly and not rely on the type names.

Since JSON output is a better API than the text-based example plugin provided by Postgres, we should accommodate software that uses existing connectors and frameworks. I hope this feature can be useful to a lot of people out there.
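
For illustration, the OIDs a consumer would map against are the stable ones from pg_type, e.g.:

SELECT oid, typname FROM pg_type WHERE typname IN ('int4', 'text', 'numeric');
-- 23 = int4, 25 = text, 1700 = numeric

A driver such as psycopg2 can feed these OIDs straight into its existing typecaster machinery instead of re-parsing type names.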

I'm providing a pull request for this feature, and am at your disposal for corrections or improvements.

Control-chars should be escaped in strings

The JSON specification requires that control characters be properly escaped. If a string value contains control characters, it is sent as-is without conversion to an escape sequence. This causes JSON parsers to fail.
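
A minimal way to reproduce this (the slot and table names are placeholders):

INSERT INTO t (v) VALUES (E'line1\nline2');
SELECT data::json FROM pg_logical_slot_get_changes('test_slot', NULL, NULL);
-- the embedded newline is emitted unescaped inside the JSON string,
-- so the cast to json (like any strict JSON parser) rejects the output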

the removal of 'x' in "\x" corrupts data

I have data in a table that holds valid hex \x sequences (e.g. "\x1b" for an encoded [ESC] character). The following code in quote_escape_json seems to be the culprit:

/* XXX suppress \x in bytea field? */
if (ch == '\\' && *(valptr + 1) == 'x')
{
	valptr++;
	continue;
}

With this, my "\x1b" value is emitted from this plugin as "\1b".

Even if this is the right thing to do for bytea values (I'm not convinced it is), it is not correct for other textual data.
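
A minimal reproduction of what I'm seeing (the slot name is a placeholder):

CREATE TABLE esc_test (id serial PRIMARY KEY, v text);
INSERT INTO esc_test (v) VALUES (E'\\x1b[0m');  -- the literal text \x1b[0m
SELECT data FROM pg_logical_slot_get_changes('test_slot', NULL, NULL);
-- the column value comes back with the 'x' stripped, i.e. \1b[0m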

Error when run "USE_PGXS=1 make" to build wal2json on SmartOS

Hi
There is an error when building wal2json on SmartOS (based on Solaris). I have researched it on Google for a few hours but haven't found any useful information.

Any help would be greatly appreciated :)

my env info:

[root@8cd4bd6e-ee63-667f-f105-85303b90d559 ~/wal2json]# gcc --version
gcc (GCC) 4.9.4
[root@8cd4bd6e-ee63-667f-f105-85303b90d559 ~/wal2json]# make --version
GNU Make 4.1
[root@8cd4bd6e-ee63-667f-f105-85303b90d559 ~/wal2json]# ld --version
GNU ld (GNU Binutils) 2.26.1
postgres=# select version();
                                     version
---------------------------------------------------------------------------------
 PostgreSQL 9.5.9 on x86_64-sun-solaris2.11, compiled by gcc (GCC) 4.9.4, 64-bit

and the error:

[root@8cd4bd6e-ee63-667f-f105-85303b90d559 ~/wal2json]# USE_PGXS=1 make
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -pipe -O2 -pipe -O2 -I/opt/local/include -I/opt/local/include/ncurses -DLDAP_DEPRECATED -I/usr/include -fPIC -L/opt/local/lib -L/opt/local/gcc49/lib/gcc/x86_64-sun-solaris2.11/4.9.4 -Wl,-R/opt/local/gcc49/lib/gcc/x86_64-sun-solaris2.11/4.9.4 -L/opt/local/lib -Wl,-R/opt/local/lib -L/usr/lib/amd64 -Wl,-R/usr/lib/amd64 -L/opt/local/lib -L/opt/local/lib -L/opt/local/lib -L/opt/local/lib  -Wl,--as-needed -Wl,-R'/opt/local/lib'  -shared -o wal2json.so wal2json.o
/opt/local/x86_64-sun-solaris2.11/bin/ld:/opt/local/gcc49/lib/gcc/x86_64-sun-solaris2.11/4.9.4/../../../libgcc-unwind.map: file format not recognized; treating as linker script
/opt/local/x86_64-sun-solaris2.11/bin/ld:/opt/local/gcc49/lib/gcc/x86_64-sun-solaris2.11/4.9.4/../../../libgcc-unwind.map:1: syntax error
collect2: error: ld returned 1 exit status
/opt/local/lib/postgresql/pgxs/src/makefiles/../../src/Makefile.port:22: recipe for target 'wal2json.so' failed
make: *** [wal2json.so] Error 1

Previous value for update statement

Is it possible to get the previous value of a row in the diff produced for an update statement? For example, the following queries

create table test_table(id NUMERIC PRIMARY KEY, value VARCHAR(200));
insert into test_table(id, value) values(1, 'original');
update test_table set value = 'modified' where id = 1;

generate the following diff:

{
    "xid": 8273,
    "change": [
        {
            "kind": "insert",
            "schema": "public",
            "table": "test_table",
            "columnnames": ["id", "value"],
            "columntypes": ["numeric", "varchar"],
            "columnvalues": [1, "original"]
        }
    ]
}
{
    "xid": 8274,
    "change": [
        {
            "kind": "update",
            "schema": "public",
            "table": "test_table",
            "columnnames": ["id", "value"],
            "columntypes": ["numeric", "varchar"],
            "columnvalues": [1, "modified"],
            "oldkeys": {
                "keynames": ["id"],
                "keytypes": ["numeric"],
                "keyvalues": [1]
            }
        }
    ]
}

This does not allow propagating changes to an external system. For example, several tables can be combined into one document and sent to Elasticsearch. After updating some reference field we need to invalidate the previous value (it can be a back-reference).
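
A possible workaround, assuming the extra WAL volume is acceptable, seems to be declaring a full replica identity on the table so the whole old row is logged and the old values should then show up under oldkeys:

ALTER TABLE test_table REPLICA IDENTITY FULL;
-- subsequent UPDATE/DELETE records carry the complete old row,
-- at the cost of writing more WAL for this table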

Amazon RDS

Hi,
I'm using version 9.6.3 and cannot load the wal2json extension. I cannot see it when I list the available extensions either.
I can, however, work with 'auto_complete', and apparently they were both added together.

Option for a "column":"value" data structure

Currently, the output data structure is:

{"change":[{"kind":"update","table":"mytable2",
    "columnnames":["id","name"],
    "columnvalues":[400,"Update 1"],
    "oldkeys":{"keynames":["id"],"keyvalues":[400]}}]
}

For postprocessing purposes, it would be more appropriate to have this format:

{"change":[{"kind":"update","table":"mytable2",
    "changes":{"id":400, "name":"Update 1"},
    "oldkeys":{"id":400}}]
}

I suggest this could be an option:

changes-data-structure = two-arrays | key-value-hash
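
As a stopgap on the consumer side, the two arrays can be zipped back into an object with json_object; a rough sketch (note that all values come out as JSON strings):

SELECT json_object(ARRAY['id', 'name'], ARRAY['400', 'Update 1']);
-- roughly: {"id" : "400", "name" : "Update 1"}

A native key-value output option would of course avoid that extra step and keep the original value types.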

Thanks for considering.

Options Unknown

When I try to select/get changes with the following parameters:
'include_xids': '1',
'include_timestamp': '1',
'include_schemas': '1',
'include_types': '1',
'pretty_print' : '1'

there is always a warning:
2017-04-04 22:32:44 CEST [27933-25] postgres@test_db WARNING: option include_xids = 1 is unknown
2017-04-04 22:32:44 CEST [27933-26] postgres@test_db CONTEXT: slot "test", output plugin "wal2json", in the startup callback
2017-04-04 22:32:44 CEST [27933-27] postgres@test_db WARNING: option include_timestamp = 1 is unknown
2017-04-04 22:32:44 CEST [27933-28] postgres@test_db CONTEXT: slot "test", output plugin "wal2json", in the startup callback
2017-04-04 22:32:44 CEST [27933-29] postgres@test_db WARNING: option include_schemas = 1 is unknown

Only the write-in-chunks option works without any warning.
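
For comparison, here is the same call with hyphenated option names, which is how options such as pretty-print and write-in-chunks are spelled (assuming the plugin expects hyphens rather than underscores):

SELECT data FROM pg_logical_slot_peek_changes('test', NULL, NULL,
    'include-xids', '1',
    'include-timestamp', '1',
    'include-schemas', '1',
    'include-types', '1',
    'pretty-print', '1');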

Are these options supported?

regards

Tests fail _sometimes_

Hi,

I noticed that various tests sometimes fail. I didn't explore the issue deeply, but it looks like just a flaw in the tests, not a bug in the extension itself.

Example 1:
http://afiskon.ru/s/fc/53ae61ede5_regression.diffs.txt
http://afiskon.ru/s/d4/bd9a065311_regression.out.txt

Example 2:
http://afiskon.ru/s/79/11a4ea2430_regression.diffs.txt
http://afiskon.ru/s/35/1523b25c8b_regression.out.txt

The environment is Arch Linux x64, GCC 7.2.0, Core i7 3632QM 2200 Mhz (4 cores, 8 with HT).

Deferred PK are not recognized

Hi

I have an issue with a table which has a primary key defined as DEFERRABLE INITIALLY DEFERRED.

Here is the Postgres setup:

wal_level = logical
max_wal_senders = 3
max_replication_slots = 3

The PK is ignored, and any data processed as an UPDATE or DELETE operation is not written by the wal2json plugin.
Looking at the code, I found:

/* Make sure rd_replidindex is set */
RelationGetIndexList(relation);
...
if (!OidIsValid(relation->rd_replidindex) && relation->rd_rel->relreplident != REPLICA_IDENTITY_FULL)

RelationGetIndexList does not seem to set the right PK.

I created a small test case to show this behaviour (I updated the given sample).

$ cat /tmp/deferredPK.sql 
CREATE TABLE table_with_deferred_pk (a SERIAL, b VARCHAR(30), c TIMESTAMP NOT NULL, PRIMARY KEY(a, c) DEFERRABLE INITIALLY DEFERRED);

BEGIN;
INSERT INTO table_with_deferred_pk (b, c) VALUES('Backup and Restore', now());
INSERT INTO table_with_deferred_pk (b, c) VALUES('Tuning', now());
INSERT INTO table_with_deferred_pk (b, c) VALUES('Replication', now());
UPDATE table_with_deferred_pk SET b = 'Tuning - update' where b = 'Tuning';
DELETE FROM table_with_deferred_pk WHERE a < 3;
COMMIT;

$ psql -At -f /tmp/deferredPK.sql  postgres
CREATE TABLE
BEGIN
INSERT 0 1
INSERT 0 1
INSERT 0 1
UPDATE 1
DELETE 2
COMMIT

We can see that only the INSERT operations are written by wal2json:

$ pg_recvlogical -d postgres --slot test_slot --start -o pretty-print=1 -f -
{
	"change": [
	]
}
WARNING:  table "table_with_deferred_pk" without primary key or replica identity is nothing
WARNING:  table "table_with_deferred_pk" without primary key or replica identity is nothing
WARNING:  table "table_with_deferred_pk" without primary key or replica identity is nothing
{
	"change": [
		{
			"kind": "insert",
			"schema": "public",
			"table": "table_with_deferred_pk",
			"columnnames": ["a", "b", "c"],
			"columntypes": ["integer", "character varying(30)", "timestamp without time zone"],
			"columnvalues": [1, "Backup and Restore", "2018-04-19 07:19:02.867699"]
		}
		,{
			"kind": "insert",
			"schema": "public",
			"table": "table_with_deferred_pk",
			"columnnames": ["a", "b", "c"],
			"columntypes": ["integer", "character varying(30)", "timestamp without time zone"],
			"columnvalues": [2, "Tuning", "2018-04-19 07:19:02.867699"]
		}
		,{
			"kind": "insert",
			"schema": "public",
			"table": "table_with_deferred_pk",
			"columnnames": ["a", "b", "c"],
			"columntypes": ["integer", "character varying(30)", "timestamp without time zone"],
			"columnvalues": [3, "Replication", "2018-04-19 07:19:02.867699"]
		}
	]
}

Note:

  • Changing the PK to DEFERRABLE INITIALLY IMMEDIATE produces the same behaviour

Provide an optional flag for columns

When one is building an Avro schema for the events, it might be important to distinguish between optional and non-optional fields for downstream clients. Would it be possible to add a flag to columns indicating whether they are optional?

Thanks!

pretty print option

Currently, pretty print JSON is the only option. It would be desirable to add an option to print a compact format.

Are there any plans to create release versions?

I find tracking how far behind the tip of the master branch we are is much easier with version numbers than commit hashes, and it would be good to know which commits you think warrant minor and major version number increases.

Cheers.

Add parameter to filter on table name

I see that one of the use cases is to replicate a single table with a known data schema to a Kafka topic.
What do you think about adding a table parameter to be used as a filter on the relation name?
Maybe there is some other way?

UPD: I forgot to add that the schema should also be filtered.
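
For reference, the add-tables slot option mentioned in other issues here looks like it covers this; a sketch (untested, slot and table names are placeholders):

SELECT data FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL,
    'add-tables', 'public.my_table');
-- only changes for public.my_table are emitted; several tables can be
-- passed as a comma-separated schema.table list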

Alpine Linux PostgreSQL Docker image: wal2json.so: __snprintf_chk: symbol not found

I have compiled wal2json against PostgreSQL 9.5, in Docker container of postgres:9.5.
Then I am trying to load the plugin in PostgreSQL 9.5, again, Docker container, postgres:9.5-alpine.
But here is what I get after SELECT * FROM pg_create_logical_replication_slot('export5lot', 'wal2json');:

ERROR: could not load library "/usr/local/lib/postgresql/wal2json.so": Error relocating /usr/local/lib/postgresql/wal2json.so: __snprintf_chk: symbol not found

This is due to Alpine Linux using musl instead of glibc.

I'm not quite sure what to do about it. It's been some time since my last C project.
Could you please see if there's a way to get around that? Perhaps some flag to use something other than __snprintf_chk when building for Alpine?

Add per-transaction unique number to each change

Hi, we use wal2json to transfer incremental changes to an OLAP server (SnowFlake).
It's done in a batch after some period (nightly).
During the day, one table row may change several times.
During the SnowFlake import the changes are treated as a set, not an ordered list, so the duplicates may cause an earlier change to override a later one. For instance:

START TRANSACTION;
UPDATE myTable2 SET name = 'Update AA' WHERE id = 401;
UPDATE myTable2 SET name = 'Update BB' WHERE id = 401;
COMMIT;

Leads to:

{"xid":1074,"timestamp":"2018-07-18 17:49:54.719475+02","change": [
  {"kind":"update","table":"mytable2","columnnames":["id","name","age"],"columnvalues":[401,"Update AA",20],"oldkeys":{"keynames":["id"],"keyvalues":[401]}},
  {"kind":"update","table":"mytable2","columnnames":["id","name","age"],"columnvalues":[401,"Update BB",20],"oldkeys":{"keynames":["id"],"keyvalues":[401]}}
]}

It would be great if wal2json could add a sequence number to each change:

{"xid":1074,"timestamp":"2018-07-18 17:49:54.719475+02","change": [
  {"sn":1, "kind":"update","table":"mytable2","columnnames":["id","name","age"],"columnvalues":[401,"Update AA",20],"oldkeys":{"keynames":["id"],"keyvalues":[401]}},
  {"sn":12 "kind":"update","table":"mytable2","columnnames":["id","name","age"],"columnvalues":[401,"Update BB",20],"oldkeys":{"keynames":["id"],"keyvalues":[401]}}
]}

Thanks for considering.

Provide a primary key flag for columns

When the event is processed downstream, it is important to know which columns form the primary key. Could you please provide a flag for columns saying that a column is part of the primary key?

Warning with update on table without primary key and data are lost

Hi!

While executing the tests from README.md I hit this case:

[email protected]:moses_dev_db> SELECT data FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'pretty-print', '1', 'write-in-chunks', '0');
WARNING:  table "table_without_pk" without primary key or replica identity is nothing

+--------+
| data   |
|--------|
| {
        "change": [
        ]
}        |
+--------+
SELECT 1
Time: 0.095s

But using test_decoding within another slot:

[email protected]:moses_dev_db> SELECT data FROM pg_logical_slot_get_changes('test_slot2', NULL, NULL);
+-------------------------------------------------------------------------------------+
| data                                                                                |
|-------------------------------------------------------------------------------------|
| BEGIN 1890                                                                          |
| table public.table_without_pk: UPDATE: a[integer]:1 b[numeric]:2.34 c[text]:'anta2' |
| COMMIT 1890                                                                         |
+-------------------------------------------------------------------------------------+
SELECT 3
Time: 0.084s

Looking at the source code I see other possible points where WARNINGs will be printed and data discarded. How can I bypass this so I don't lose data? For my case, is using test_decoding and implementing a data parser the better option?

Thanks a lot!

"One change per row" output mode

We use wal2json to do daily batches of incremental updates of an OLAP db (SnowFlake).
The JSON is consumed by loading the structure, via a SQL function, into a table with a single column and one table row per text line; it is then accessed using JSON-parsing SQL functions.

Having multiple changes in a single text line makes processing it quite difficult - it becomes necessary to iterate over the "change": [...] array, which is awkward (if not impossible) in that setup.

{"xid":1074,"timestamp":"2018-07-18 17:49:54.719475+02","change": [
  {"kind":"update","table":"mytable2","columnnames":["id","name","age"],"columnvalues":[401,"Update AA",20],"oldkeys":{"keynames":["id"],"keyvalues":[401]}},
  {"kind":"update","table":"mytable2","columnnames":["id","name","age"],"columnvalues":[401,"Update BB",20],"oldkeys":{"keynames":["id"],"keyvalues":[401]}}
]}

It would save a lot of work if the wal2json plugin could optionally print a single change per JSON entry, like this: one-change-per-entry = true would lead to:

{"xid":1074,"timestamp":"2018-07-18 17:49:54.719475+02","change": 
  {"kind":"update","table":"mytable2","columnnames":["id","name","age"],"columnvalues":[401,"Update AA",20],"oldkeys":{"keynames":["id"],"keyvalues":[401]}}
}

{"xid":1074,"timestamp":"2018-07-18 17:49:54.719475+02","change": 
  {"kind":"update","table":"mytable2","columnnames":["id","name","age"],"columnvalues":[401,"Update BB",20],"oldkeys":{"keynames":["id"],"keyvalues":[401]}}
}

Along with #70, this would still keep each entry uniquely identifiable:

{"xid":1074, "sn":1, ... }
{"xid":1074, "sn":2, ... }

This way, feeding the data to an OLAP system like SnowFlake would need no pre-processing, which would be a big improvement.

Thanks for considering.

Possible to ignore tables w/o PK?

Hi,

We are testing this plugin and got tons of log warnings (> 1M lines) from the output plugin, all referring to the same table, due to a table without a PK. The table in question isn't one whose changes we are interested in, and it is not part of the list we pass in the "add-tables" parameter when we connect.

Is it possible to ignore changes we're not interested in?

Thanks for any thoughts on this.

Connection request:

PGReplicationStream stream =
replConnection.getReplicationAPI()
.replicationStream()
.logical()
.withSlotName("app_decoder_slot")
.withSlotOption("add-tables", subscribedTables.join(','))
.withSlotOption("include-unchanged-toast", false)
.withStatusInterval(20, TimeUnit.SECONDS)
.start();

Sample from log output

... ] 82 WARNING: table "domains_290906" without primary key or replica identity is nothing
... ] 83 CONTEXT: slot "cache_decoder_slot", output plugin "wal2json", in the change callback, associated LSN 3D13/A20D2B60
... ] 84 WARNING: table "domains_290906" without primary key or replica identity is nothing
... ] 85 CONTEXT: slot "cache_decoder_slot", output plugin "wal2json", in the change callback, associated LSN 3D13/A20D2C30
... ] 86 WARNING: table "domains_290906" without primary key or replica identity is nothing
... ] 87 CONTEXT: slot "cache_decoder_slot", output plugin "wal2json", in the change callback, associated LSN 3D13/A20D2D00
... ] 88 WARNING: table "domains_290906" without primary key or replica identity is nothing
... ] 89 CONTEXT: slot "cache_decoder_slot", output plugin "wal2json", in the change callback, associated LSN 3D13/A20D2DC8
... ] 90 WARNING: table "domains_290906" without primary key or replica identity is nothing

Table whitelist

Hi, thanks for the filter-tables feature. I tried it today and it works fine, despite the fact that it works like a blacklist. I have a ton of tables in my database, but I only want a couple of them to be streamed. Is there a way to somehow "invert" that filter and make it work like a whitelist?

Transaction id and timestamp options

Why can I not use -o include_xids=1 or -o include_timestamp=1 as options?

When I try to use these options when running --start, I get the following error:

pg_recvlogical: unexpected termination of replication stream: ERROR:  option "include_xids" = "1" is unknown
CONTEXT:  slot "test_slot", output plugin "wal2json", in the startup callback

What am I doing wrong, and how do I show transaction ids and timestamps?

ERROR: invalid memory alloc request size 1073741824

Hello,
I recently ran into an interesting situation with the plugin.

- Scenario:
I made a DDL change to a relatively large table that involves adding a new column with a default value and a type, all in a single transaction. The table has a trigger added to it, but it was disabled during the operation.

--
What I found was that the wal2json plugin, while decoding that part of the WAL, output a lot of warning messages:

WARNING: column "idl_old_id_column_nanme" has an unchanged TOAST :

And eventually errored out with the following error message:

ERROR: invalid memory alloc request size 1073741824

--
Beyond this point, no decoding was possible, even after I manually provided LSNs and purged all the data just before the complaining LSN value.

The above error message shows that it is trying to allocate more memory than can be provided, and some digging around pointed to gradually reducing work_mem to prevent this; that didn't help either.

-- My questions

  1. As this could lead to some serious problems, since data loss is imminent when no decoding can be done,
    what is the way to avoid this in the future?
  2. What's the best way to fix decoding after it abruptly stops after a huge transaction and will not continue even after a restart?
  3. Is it necessary for all tables that are to be decoded to have at least a minimal replica identity, even when the decoded data is being discarded simply because it is not needed?
  4. Is there a plan to eventually implement table/schema filtering while decoding the WAL?

cheers.

Toast Columns

Hi,

is there any contraindication to forcing a read of "TOAST columns"?

Maybe there are performance problems?

		/* XXX Unchanged TOAST Datum does not need to be output */
		if (!isnull && typisvarlena && VARATT_IS_EXTERNAL_ONDISK(origval))
		{
			elog(WARNING, "column \"%s\" has an unchanged TOAST", NameStr(attr->attname));
			continue;
		}

Thanks!

ciao
p
