qovery / replibyte Goto Github PK

View Code? Open in Web Editor NEW

4.0K 24.0 129.0 7.73 MB

Seed your development database with real data ⚡️

Home Page: https://www.replibyte.com

License: GNU General Public License v3.0

Rust 94.49% Dockerfile 0.44% Shell 0.51% JavaScript 1.07% TypeScript 0.49% CSS 0.40% MDX 2.60%

database cloud cloudnative rust rust-lang postgresql s3 backup aws postgres

replibyte's Introduction

replibyte's People

Contributors

Stargazers

Watchers

Forkers

pepoviola benjaminch simonsan tbmreza fabriceclementz vagelis-prokopiou posilva benny-n pascalgrimaud louisjj mdzahedhossain franckpachot y-yagi annihilatorrrr techwookie jhetchan cxz sondrelg forkkit doytsujin logantect corikachu pascal-h-kim spread0x nickfantasy wtait1-ff anborg rheehot arafat-al-mahmud duzhanyuan huangweiboy2 superd22 tgmerritt legolasan mumumumu technificentconsulting vemilyus iprince19 best-defense markrechler maathor agorapulse suryatmodulus ribafs vic92548 warifp 2kable jamesloosli xteen16 fusiongalaxy richezelee ronzyfonzy fabienfoerster forkmeplease abdusco pushpen cloudbox-network otmosina xuwupeng2000 traviscooper cyberflamego kokizzu nanderoo bbenzikry jomoespe arswysocki karpa4o4 lumiqai anietieakpan bryant-finney dut3062796s auaan timkrins ikrestov fcoury kontinuity adminsharmasecureservicescausa cluas rainbowdashy evanrichter skyline93 mayhemheroes spike008t hionnode c-goosen cnlubo markmagnus artart788 luis-sousa-pinto tee8z digksskawk01 gugacavalieri jrcribb tribe-health ewaldhorn eoy bastiankistner gg-big-org michalkutrzeba-odrabiamy alice-financial

replibyte's Issues

Set custom backup name

As a user of RepliByte, I might want to choose the name of a backup with a better name than the one auto-generated. The option could be added during the run of backup run command.

replibyte -c conf.yaml backup run --name=my-backup-name

WDYT?

Improvement: check options from CLI

It will be useful to better check the parameters combination that is used for backup run and restore to avoid unexpected behavior. E.g:

replibyte backup run -s postgres -i -> -s <database_type> and -i work together.

While replibyte backup run alone does not need -i and -s parameters.

Improve documentation for Google Cloud Storage

Hi team, I'm evaluating using replibyte for our GCP environments. This tool looks great! And about a week ago I had read a table of supported datastore providers in the README that included Google Cloud Storage, as GCS is also able to "speak" in the S3 wire protocol.

However it seems all the documentation from the README has been moved to https://www.replibyte.com, but that table of supported datastores was not moved over (having a dedicated docs page on Datastores would be helpful).

Also, the datastore config seems to be pretty AWS-centric with keys like access_key_id. This would probably be a separate issue but it would also be nice to abstract the datastore configuration to be more provider-agnostic. Either that or, have an additional config layer to namespace the provider-specific keys like

datastore:
  aws:
    bucket: my-replibyte-dumps
    region: us-east-2
    access_key_id: $ACCESS_KEY_ID
    secret_access_key: $AWS_SECRET_ACCESS_KEY

P.S.

I filed another issue about creating issue / PR templates for the repo to aid future users / contributors in authoring requests.

MySQL parsing expression grammar

We use dump-parser subcrate to parse and edit database dump. For the long term, if and when maintaining hand-crafted parsers for different databases is unwieldy, the plan is to use a parser generator such as https://github.com/pest-parser/pest. This issue could be a first step towards that, in addition to the immediate purpose of supporting MySQL.

Enhancement: make RepliByte installable from Homebrew

Everything is in the title :) Today, it's tedious to install RepliByte from MacOSX. It will be great to make it installable from Homebrew for MacOSX Intel and M1 processors. Anyone can help here? :)

Create ISSUE_TEMPLATE and PULL_REQUEST_TEMPLATE for the repo

It would be great to add templates that guide users in creating issues and PR's for this repo

Support COPY query for PostgreSQL

Today, the PostgreSQL source only supports INSERT INTO ... query. For small dump it's ok, but for bigger one it will take ages to restore data. Supporting COPY ... query is necessary here. Feel free to pick this enhancement :)

Feature: implement database subseting for MySQL

Implement database subsetting for MySQL as we did for PostgreSQL.

Database Subsetting: Scale down a production database to a more reasonable size

What is Subsetting

From Tonic.ai

Subsetting data is the process of taking a representative sample of your data in a manner that preserves the integrity of your database, e.g., give me 5% of my users. If you do this naively, e.g., just grab 5% of all the tables in your database, most likely, your database will break foreign key constraints. At best, you’ll end up with a statistically non-representative data sample.

One common use case is to scale down a production database to a more reasonable size so that it can be used in staging, test, and development environments. This can be done to save costs and, when used in tandem with PII removal, can be quite powerful as a productivity enhancer. Another example is copying specific rows from one database and placing them into another while maintaining referential integrity.

As discussed on Discord, database subsetting will be super valuable to restore a subset of a production database for a development purpose. E.g. developers from growing companies are interested in using RepliByte for their database with TBs of data 😮 Subsetting a database is needed for very large DBs. It even does not make any sense to try to re-import a DB with TB of data for development purposes.

In this issue, I propose that we work together in designing the "Database Subsetting" feature.

References

Here are some must-read references about database subsetting:

I recommend reading them. They are full of information.

Design references

I am going to take some time digging into Condenser (OSS Tonic.ai subsetting python tool) to suggest a starting implementation. I keep you posted.

Design proposal

sequenceDiagram
participant RepliByte
participant PostgreSQL (Source)
participant AWS S3 (Bridge)
PostgreSQL (Source)->>RepliByte: Dump data
loop
    RepliByte->>RepliByte: a. Get database schema and tables relationships
    RepliByte->>RepliByte: b. Support virtual relationships
    RepliByte->>RepliByte: c. Take x% rows of the ref table
end
loop
    RepliByte->>RepliByte: Hide/fake sensitive data
    RepliByte->>RepliByte: Compress data
    RepliByte->>RepliByte: Encrypt data
end
RepliByte->>AWS S3 (Bridge): Upload data
RepliByte->>AWS S3 (Bridge): 6. Write index file

Restore command doesn't keep column name quoted

The current implementation of the restore command doesn't seem to keep the column name quoted:

Given a Order table with a customerId column, the pg_dump will correctly keep the column name quoted
But when running the restore command it will forward the sql command with column unquoted, leading pgsql, to look for the customerid column

EG:
The raw dump will contain:

INSERT INTO public.orders (id, "customerId") VALUES ....

While the restore command will send

INSERT INTO public.orders (id, customerId) VALUES ....

leading to the following error

ERROR:  column "customerId" of relation "orders" does not exist

Allow all supported ways for AWS authentication

Hi there!

Why are you enforcing the use of the IAM access key ID and secret access key options? This is preventing us from using this tool effectively in an environment where you don't get those credentials by default (like EC2 with an instance profile).

The AWS SDK for Rust already supports all the various methods of authentication, so I don't see why access key id/secret access key pairs would be absolutely necessary.

Replibyte/replibyte/src/datastore/s3.rs

Line 53 in 67c84c3

("AWS_ACCESS_KEY_ID", access_key_id.as_str()),

From our understanding it is AWS best practice to offload the actual handling of credentials to the SDK and to use AWS configuration profiles if access to other accounts is required.

All the best,
Alex

Make password in connection uri optionnal

missing <password> property from connection uri error is triggered for a local database without password.

It should support mysql://root:@mysql:3306/my_db uri.

Warnings when launching build

There are a lot of warning when launching cargo build --release
Should it be clean or is it intended to keep these variables/functions for futur features?

19:02:43 in projects/qovery/replibyte on  readme-typo-postgresql via 𝗥 v1.59.0 
➜ cargo build --release                    
warning: function is never used: `parse_quoted_ident`
   --> dump-parser/src/postgres/mod.rs:614:4
    |
614 | fn parse_quoted_ident(chars: &mut Peekable<Chars<'_>>, quote_end: char) -> (String, Option<char>) {
    |    ^^^^^^^^^^^^^^^^^^
    |
    = note: `#[warn(dead_code)]` on by default

warning: constant is never used: `LINE_SEPARATOR`
 --> dump-parser/src/utils.rs:7:1
  |
7 | const LINE_SEPARATOR: char = ';';
  | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

warning: `dump-parser` (lib) generated 2 warnings
warning: unused imports: `stdin`, `stdout`
 --> replibyte/src/destination/postgres.rs:1:15
  |
1 | use std::io::{stdin, stdout, Error, ErrorKind, Write};
  |               ^^^^^  ^^^^^^
  |
  = note: `#[warn(unused_imports)]` on by default

warning: unused import: `ErrorKind`
 --> replibyte/src/destination/postgres_stdout.rs:1:30
  |
1 | use std::io::{stdout, Error, ErrorKind, Write};
  |                              ^^^^^^^^^

warning: unused imports: `Command`, `Stdio`
 --> replibyte/src/destination/postgres_stdout.rs:2:20
  |
2 | use std::process::{Command, Stdio};
  |                    ^^^^^^^  ^^^^^

warning: unused import: `tokio::io::AsyncWriteExt`
 --> replibyte/src/destination/postgres.rs:3:5
  |
3 | use tokio::io::AsyncWriteExt;
  |     ^^^^^^^^^^^^^^^^^^^^^^^^

warning: unused variable: `host`
   --> replibyte/src/main.rs:176:50
    |
176 | ...                   ConnectionUri::Mysql(host, port, username, password, database) => {
    |                                            ^^^^ help: if this is intentional, prefix it with an underscore: `_host`
    |
    = note: `#[warn(unused_variables)]` on by default

warning: unused variable: `port`
   --> replibyte/src/main.rs:176:56
    |
176 | ...                   ConnectionUri::Mysql(host, port, username, password, database) => {
    |                                                  ^^^^ help: if this is intentional, prefix it with an underscore: `_port`

warning: unused variable: `username`
   --> replibyte/src/main.rs:176:62
    |
176 | ...                   ConnectionUri::Mysql(host, port, username, password, database) => {
    |                                                        ^^^^^^^^ help: if this is intentional, prefix it with an underscore: `_username`

warning: unused variable: `password`
   --> replibyte/src/main.rs:176:72
    |
176 | ...                   ConnectionUri::Mysql(host, port, username, password, database) => {
    |                                                                  ^^^^^^^^ help: if this is intentional, prefix it with an underscore: `_password`

warning: unused variable: `database`
   --> replibyte/src/main.rs:176:82
    |
176 | ...                   ConnectionUri::Mysql(host, port, username, password, database) => {
    |                                                                            ^^^^^^^^ help: if this is intentional, prefix it with an underscore: `_database`

warning: unused variable: `host`
   --> replibyte/src/main.rs:241:42
    |
241 |                     ConnectionUri::Mysql(host, port, username, password, database) => {
    |                                          ^^^^ help: if this is intentional, prefix it with an underscore: `_host`

warning: unused variable: `port`
   --> replibyte/src/main.rs:241:48
    |
241 |                     ConnectionUri::Mysql(host, port, username, password, database) => {
    |                                                ^^^^ help: if this is intentional, prefix it with an underscore: `_port`

warning: unused variable: `username`
   --> replibyte/src/main.rs:241:54
    |
241 |                     ConnectionUri::Mysql(host, port, username, password, database) => {
    |                                                      ^^^^^^^^ help: if this is intentional, prefix it with an underscore: `_username`

warning: unused variable: `password`
   --> replibyte/src/main.rs:241:64
    |
241 |                     ConnectionUri::Mysql(host, port, username, password, database) => {
    |                                                                ^^^^^^^^ help: if this is intentional, prefix it with an underscore: `_password`

warning: unused variable: `database`
   --> replibyte/src/main.rs:241:74
    |
241 |                     ConnectionUri::Mysql(host, port, username, password, database) => {
    |                                                                          ^^^^^^^^ help: if this is intentional, prefix it with an underscore: `_database`

warning: unused variable: `err`
  --> replibyte/src/bridge/s3.rs:61:21
   |
61 |                 Err(err) => s3_config_builder.build(),
   |                     ^^^ help: if this is intentional, prefix it with an underscore: `_err`

warning: unused variable: `original_query`
  --> replibyte/src/tasks/full_backup.rs:91:39
   |
91 |             .read(self.transformers, |original_query, query| {
   |                                       ^^^^^^^^^^^^^^ help: if this is intentional, prefix it with an underscore: `_original_query`

warning: variable does not need to be mutable
  --> replibyte/src/main.rs:78:9
   |
78 |     let mut style_is_progress_bar = false;
   |         ----^^^^^^^^^^^^^^^^^^^^^
   |         |
   |         help: remove this `mut`
   |
   = note: `#[warn(unused_mut)]` on by default

warning: variable does not need to be mutable
   --> replibyte/src/main.rs:185:37
    |
185 | ...                   let mut reader = BufReader::new(dump_file);
    |                           ----^^^^^^
    |                           |
    |                           help: remove this `mut`

warning: variant is never constructed: `FailedToDeleteBucket`
   --> replibyte/src/bridge/s3.rs:187:5
    |
187 |     FailedToDeleteBucket { bucket: &'a str },
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |
    = note: `#[warn(dead_code)]` on by default

warning: variant is never constructed: `FailedToDeleteObject`
   --> replibyte/src/bridge/s3.rs:192:5
    |
192 |     FailedToDeleteObject { bucket: &'a str, key: &'a str },
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

warning: function is never used: `delete_bucket`
   --> replibyte/src/bridge/s3.rs:279:4
    |
279 | fn delete_bucket<'a>(client: &Client, bucket: &'a str, force: bool) -> Result<(), S3Error<'a>> {
    |    ^^^^^^^^^^^^^

warning: function is never used: `delete_object`
   --> replibyte/src/bridge/s3.rs:371:4
    |
371 | fn delete_object<'a>(client: &Client, bucket: &'a str, key: &'a str) -> Result<(), S3Error<'a>> {
    |    ^^^^^^^^^^^^^

warning: enum is never used: `ConnectorConfig`
  --> replibyte/src/config.rs:20:10
   |
20 | pub enum ConnectorConfig<'a> {
   |          ^^^^^^^^^^^^^^^

warning: associated function is never used: `connector`
  --> replibyte/src/config.rs:26:12
   |
26 |     pub fn connector(&self) -> Result<ConnectorConfig, Error> {
   |            ^^^^^^^^^

warning: variant is never constructed: `Mysql`
   --> replibyte/src/config.rs:183:5
    |
183 |     Mysql(Host, Port, Username, Password, Database),
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

warning: associated function is never used: `new`
  --> replibyte/src/destination/postgres_stdout.rs:12:12
   |
12 |     pub fn new() -> Self {
   |            ^^^

warning: field is never read: `database`
  --> replibyte/src/source/postgres.rs:22:5
   |
22 |     database: &'a str,
   |     ^^^^^^^^^^^^^^^^^

warning: associated function is never used: `new`
  --> replibyte/src/source/postgres_stdin.rs:13:12
   |
13 |     pub fn new() -> Self {
   |            ^^^

warning: associated function is never used: `new`
  --> replibyte/src/transformer/transient.rs:22:12
   |
22 |     pub fn new<S>(database_name: S, table_name: S, column_name: S) -> Self
   |            ^^^

warning: enum is never used: `Transformers`
 --> replibyte/src/transformer/mod.rs:9:10
  |
9 | pub enum Transformers {
  |          ^^^^^^^^^^^^

warning: associated function is never used: `name`
  --> replibyte/src/types.rs:42:12
   |
42 |     pub fn name(&self) -> &str {
   |            ^^^^

warning: associated function is never used: `number_value`
  --> replibyte/src/types.rs:52:12
   |
52 |     pub fn number_value(&self) -> Option<&i128> {
   |            ^^^^^^^^^^^^

warning: associated function is never used: `string_value`
  --> replibyte/src/types.rs:59:12
   |
59 |     pub fn string_value(&self) -> Option<&str> {
   |            ^^^^^^^^^^^^

warning: associated function is never used: `float_number_value`
  --> replibyte/src/types.rs:66:12
   |
66 |     pub fn float_number_value(&self) -> Option<&f64> {
   |            ^^^^^^^^^^^^^^^^^^

warning: associated function is never used: `char_value`
  --> replibyte/src/types.rs:73:12
   |
73 |     pub fn char_value(&self) -> Option<&char> {
   |            ^^^^^^^^^^

warning: `replibyte` (bin "replibyte") generated 35 warnings
    Finished release [optimized] target(s) in 0.09s

Use pg_dump instead of pg_dumpall command

pg_dumpall is too aggressive/stressful for a production database
needs superuser permissions which is not possible sometimes (e.g AWS RDS)

Can't connect to local postgres with "restore local" command

To connect to your Postgres database, use the following connection string:
> postgres://root:password@localhost:5432/root
Waiting for Ctrl-C to stop the container

and when trying to connect to the local db

PGPASSWORD=password psql -h localhost -p 5432 -U root -d root
psql: error: connection to server at "localhost" (::1), port 5432 failed: FATAL:  password authentication failed for user "root"

implement non Docker builds

Sorry, this was a typo.

Writing a complete guide on how to use Replibyte

I think we have reached a good milestone (big thanks to @fabriceclementz and @benny-n). It's the perfect time to show how to use RepliByte. Writing an article covering the basics (backup, restore, transformer, subset) and some more advanced use cases (E.g. using wasm transformer) of Replibyte would be greatly appreciated by any user that wants to get started.

Here is an example of a guide structure that can be used:

Install Replibyte (MacOSX, Linux, Windows)
Create an account on AWS S3 (or equivalent)
Create a dev dataset
3.a Create a dev dataset from the production database
3.b Create a dev dataset from a dump
3.c Hide sensitive data
3.d Subset (optional)
Run RepliByte in a Docker container
Deploy RepliByte
5.a Explain how to set Replibyte with a cron to get a fresh dev dataset every day / week / month...
Use a dev dataset
6.a Restore a dev dataset in a local container
6.b Restore in a remote database
Create a custom transformer

Source: Support MySQL

Implement and support MySQL source connector.

TBD

Release action fails for target `x86_64-apple-darwin`

Currently, the release action fails with this error:

= note: Undefined symbols for architecture x86_64:
"_clock_gettime", referenced from:
wasmer_wasi::syscalls::clock_time_get::haf582eb828a0fd26 in libwasmer_wasi-fc2597b33c106c49.rlib(wasmer_wasi-fc2597b33c106c49.wasmer_wasi.a26eec5f-cgu.0.rcgu.o)
"_clock_getres", referenced from:
wasmer_wasi::syscalls::clock_res_get::h24af0486cc9fc758 in libwasmer_wasi-fc2597b33c106c49.rlib(wasmer_wasi-fc2597b33c106c49.wasmer_wasi.a26eec5f-cgu.0.rcgu.o)
ld: symbol(s) not found for architecture x86_64
clang-12: error: linker command failed with exit code 1 (use -v to see invocation)

As mentioned here, it looks like clock_gettime was only added in OSX 10.12.
Which means compiling for OSX 10.11 and earlier will fail.

The action we are currently using for our releases indeed uses an earlier version (10.10) for compilation.

What we can do next:

We can find a different github action for binary releases to use.
We can open an issue for https://github.com/rust-build/rust-build.action, letting them know about the error and maybe ask them to update their build script.
We can open an issue for https://github.com/wasmerio/wasmer, and if we're lucky they might come up with a clever solution which will remove the dependency of this API and make the error irrelevant.

@evoxmusic How would you like to procceed?

Nomenclature / naming

I have the feeling that "bridge" does not mean anything and it might be more accurate to rename it into something like "store". The idea is a place where to store the created dataset from the source database. A "bridge" does not really reflect this concept. What do you think?

Support for Microsoft SQL

We'd love to use this with MSSQL databases as well - can you please add support?

Cannot create S3 bucket in us-east-1 region

I've noticed I'm not able to initialize an S3 bucket in the us-east-1 region.

My replibyte.yaml:

datastore:
  aws:
    bucket: $S3_BUCKET
    region: $S3_REGION
    access_key_id: $S3_ACCESS_KEY_ID
    secret_access_key: $S3_SECRET_ACCESS_KEY

Running replibyte -c replibyte.yaml dump list with S3_BUCKET=us-east-1 results in this error:

[2022-05-14T03:13:11Z ERROR replibyte::datastore::s3] Error { code: "InvalidLocationConstraint", message: "The specified location-constraint is not valid", request_id: "...", s3_extended_request_id: "..." }
failed to create bucket 'replibyte'

But the command works fine with S3_BUCKET=us-east-2 and creates the bucket as expected.

I've attempted a workaround (mumumumu@c6eff63), and it works, but I'm not sure if there's a better way to handle this (I'm not really a rust dev 😅).

Found this in the AWS docs:

If you don't specify a Region, the bucket is created in the US East (N. Virginia) Region (us-east-1)

It seems like you must not specify a region in order to create a bucket in us-east-1, at least with the Rust SDK.

Looking for a PostgreSQL and MongoDB dump between 1TB and 30TB of data?

Hi, I am looking for a PostgreSQL and MongoDB data dump between 1TB and 30TB. The idea would be to build a pipeline test of RepliByte with a very large dump file. It will help a lot to optimize RepliByte. Anyone?

Feature: skip table synchronization

Hi, a user asked me if it was possible to skip one or more tables from the synchronization. The use case was: some tables are pretty heavy (> 100GB) and do not need to be synchronized for a development environment.

To skip the public.logs table, the configuration YAML could be like this:

source:
  connection_uri: postgres://root:password@localhost:5432/root
  skip:
    - database: public
      table: logs
  transformers:
    - database: public
      table: employees
      columns:
        - name: fist_name
          transformer: first-name
        - name: last_name
          transformer: random
bridge:
  bucket: replibyte-test
  region: us-east-2
  access_key_id: $AWS_ACCESS_KEY_ID
  secret_access_key: $AWS_SECRET_ACCESS_KEY

Feel free to pick this issue.

Rename backup to dump?

As we rename the cli command from backup to dump, I think we could rename it to Dump in the datastore directory.
WDYT?

Example:

// datastore/mod.rs
pub struct Backup {
    pub directory_name: String,
    pub size: usize,
    pub created_at: u128,
    pub compressed: bool,
    pub encrypted: bool,
}

pub struct Dump {
    pub directory_name: String,
    pub size: usize,
    pub created_at: u128,
    pub compressed: bool,
    pub encrypted: bool,
}

// datastore/mod.rs
pub struct IndexFile {
    pub backups: Vec<Backup>,
}

pub struct IndexFile {
    pub dumps: Vec<Dump>,
}

This update will lead to a renaming of the key backups to dumps inside the metadata.json file.

`postgresql://` is a valid connection string prefix for Postgres

When using a connection_uri that starts with postgresql://, I get the error: 'postgresql' not supported. Changing the connection_uri to start with postgres:// fixes the issue, but postgresql:// is also valid:

https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-CONNSTRING

The URI scheme designator can be either postgresql:// or postgres://

Feature: restore in a local container for dev purpose

Hey everyone 👋🏽 I have an idea that can make developer life so much easier while developing locally. What do you think of creating a command like this one 👇

$ replibyte restore -v latest --local
Restore complete!

To connect to your Postgres database, use the following connection string:
DATABASE_URL=postgres://root:password@localhost:5432/root

This command would start a local docker of the appropriate database (E.g Postgres) and restore the latest dump into it.

Main benefits

Work locally with safe production data.

Happy to have your feedback.

Backup is encrypted while no encryption is set

Everything is in the title :)

~/I/q/replibyte (main|…) $ replibyte -c qovery-replibyte.yaml backup list
⠤
 name                 | size  | when           | compressed | encrypted
----------------------+-------+----------------+------------+-----------
 backup-1650796765697 | 87 kB | 42 minutes ago | true       | false
 backup-1650794871640 | 35 kB | 1 hour ago     | true       | false
 backup-1650789943393 | 44 kB | 3 hours ago    | true       | false
 backup-1650789414794 | 3 kB  | 3 hours ago    | true       | false
 backup-1650788239683 | 33 kB | 3 hours ago    | true       | false

the conf file

source:
  connection_uri: postgres://root:password@localhost:5432/root
bridge:
  bucket: replibyte-qovery-7
  region: us-east-2
  access_key_id: $AWS_ACCESS_KEY_ID
  secret_access_key: $AWS_SECRET_ACCESS_KEY

here is a screenshot of the encrypted data

This is probably a bug in the CLI arg parsing

BucketAlreadyOwnedByYou error when running `replibyte bucket list` multiple times

When running replibyte bucket list, I get the following error if the bridge bucket already exists.

[2022-05-05T20:40:28Z ERROR replibyte::bridge::s3] BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
failed to create bucket 'revere-staging-sync-service-data'

panic: Unterminated string literal in SQL instruction

Hello, at Flexhire we are trying out this tool to seed our staging DB with production data.

We are using replibyte v0.6 for Linux x86 and the dump is of a PostgreSQL 11 DB hosted on AWS RDS.

When generating a dump from our production DB we are encountering the following crash. I included a stack backtrace generated with RUST_BACKTRACE=full

We have text columns containing user entered text in the markdown format that might have characters such as '. The application uses rails so these characters should be escaped. Perhaps there is some issue with the escaping done by Replibyte? It looks like INSERT instructions are the ones causing the problem.

thread 'main' panicked at 'TokenizerError { message: "Unterminated string literal", line: 1, col: 788 }', dump-parser/src/postgres/mod.rs:747:13
stack backtrace:
   0:     0x7f147c81bc5d - std::backtrace_rs::backtrace::libunwind::trace::h081201764674ef17
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
   1:     0x7f147c81bc5d - std::backtrace_rs::backtrace::trace_unsynchronized::hebab37398c391bd7
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:     0x7f147c81bc5d - std::sys_common::backtrace::_print_fmt::h301516df68ed24f9
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/sys_common/backtrace.rs:66:5
   3:     0x7f147c81bc5d - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h8f5170f4f03a12c0
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/sys_common/backtrace.rs:45:22
   4:     0x7f147c867fac - core::fmt::write::h5dc5601e8d9f6367
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/core/src/fmt/mod.rs:1190:17
   5:     0x7f147c8138c8 - std::io::Write::write_fmt::h5b19302eb99d9acf
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/io/mod.rs:1657:15
   6:     0x7f147c81e2e7 - std::sys_common::backtrace::_print::hd81cf53a75c8ae6a
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/sys_common/backtrace.rs:48:5
   7:     0x7f147c81e2e7 - std::sys_common::backtrace::print::hb5aa882e87c2a0dc
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/sys_common/backtrace.rs:35:9
   8:     0x7f147c81e2e7 - std::panicking::default_hook::{{closure}}::had913369af61b326
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/panicking.rs:295:22
   9:     0x7f147c81dfb0 - std::panicking::default_hook::h37b06af9ee965447
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/panicking.rs:314:9
  10:     0x7f147c81ea39 - std::panicking::rust_panic_with_hook::hf2019958d21362cc
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/panicking.rs:698:17
  11:     0x7f147c81e727 - std::panicking::begin_panic_handler::{{closure}}::he9c06fdd592f8785
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/panicking.rs:588:13
  12:     0x7f147c81c124 - std::sys_common::backtrace::__rust_end_short_backtrace::ha521b96560789310
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/sys_common/backtrace.rs:138:18
  13:     0x7f147c81e439 - rust_begin_unwind
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/panicking.rs:584:5
  14:     0x7f147b94ba23 - core::panicking::panic_fmt::h28f1697d4e9394b4
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/core/src/panicking.rs:143:14
  15:     0x7f147c267afe - dump_parser::postgres::get_tokens_from_query_str::h5c129d0e7d926086
  16:     0x7f147b982287 - replibyte::source::postgres::read_and_transform::{{closure}}::h38d9592830b06c69
  17:     0x7f147b97b9d6 - dump_parser::utils::list_sql_queries_from_dump_reader::h540f884dd1afeb0d
  18:     0x7f147b9dad0f - <replibyte::tasks::full_dump::FullDumpTask<S> as replibyte::tasks::Task>::run::ha2213c1574b9739f
  19:     0x7f147ba97a55 - replibyte::commands::dump::run::h86af3c9d6a23f4c9
  20:     0x7f147ba59b30 - replibyte::main::h880123c1e7bd6b08
  21:     0x7f147ba8b9b3 - std::sys_common::backtrace::__rust_begin_short_backtrace::h9dd32b852f85f8a3
  22:     0x7f147b9f1e89 - std::rt::lang_start::{{closure}}::h434a6ad0c3bb7757
  23:     0x7f147c81b3b4 - core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once::hd127f27863548251
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/core/src/ops/function.rs:259:13
  24:     0x7f147c81b3b4 - std::panicking::try::do_call::h926290883a1d024e
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/panicking.rs:492:40
  25:     0x7f147c81b3b4 - std::panicking::try::hc74a3d1f4a4b6e5f
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/panicking.rs:456:19
  26:     0x7f147c81b3b4 - std::panic::catch_unwind::h5eb7ded2df1a4d5f
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/panic.rs:137:14
  27:     0x7f147c81b3b4 - std::rt::lang_start_internal::{{closure}}::h0736f9682f7c55ea
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/rt.rs:128:48
  28:     0x7f147c81b3b4 - std::panicking::try::do_call::h2772c479b1c89ef7
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/panicking.rs:492:40
  29:     0x7f147c81b3b4 - std::panicking::try::h967ebbc371287391
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/panicking.rs:456:19
  30:     0x7f147c81b3b4 - std::panic::catch_unwind::h41bcc02b28316856
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/panic.rs:137:14
  31:     0x7f147c81b3b4 - std::rt::lang_start_internal::haf46799f55774d07
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/rt.rs:128:20
  32:     0x7f147ba5db02 - main

Transformer: keep first character

Create a transformer to keep only the first character of the column.

Examples

Example 1

Input: Romaric
Output: R

Example 2

Input: 123
Output: 1

Example 3

Input: R
Output: R

Example 4

Input: Null
Output: Null

Example 5

Input: "" (empty)
Output: "" (empty)

Create official replibyte container image

The new documentation for how to deploy / run replibyte as a container workload is great! But right now there are a bunch of steps on how to build a custom container image. It would be ideal to have some CI that builds and releases a standard replibyte container image that users can just download from some well-known, public registry.

Seeing as replibyte already uses Github Actions, it seems reasonable to create another GA workflow that builds and pushes an image to Github Container Registry. All the binary releases are currently triggered when a Github release is created, so we can use the same trigger for the image build + push.

Catch error external command errors exit != 0

Verify that we correctly catch potential external command exec error (status != 0) that can lead to potential bugs.

Using local datastore

Can I use the datastore on the local disk?rather than the cloud as we host our services onPerm and we don't have S3 complaint service.

Improve Postgres subsetting performance

I've tried to subset a Postgres DB of 2GB of data and RepliByte took 38 minutes to complete. I suspect the function subset.postgres.filter_insert_into_rows(..) to be the bottleneck since it is called multiple times and scan the entire file (even if there is a small index).

Something that can be done to drastically reduce the time would be to split the dump into multiple table files. Then scan will be limited to the table.

Provide hash function transformer/option

I think it will be a good idea to provide same input same output in order to preserve consistency across a database.

You could imagine having a parameter :

- name: email
  transformer_name: email
  hash: true

With that parameter and that input [email protected] this will produce something like ddaf35a193617abacc417349ae20413112e6fa4e89a97ea20a9eeee64b55d39a2192992a274fc1a836ba3c23a3feebbd454d4423643ce80e2a9ac94fa54ca49f@gmail.com with a sha512 hash function (using md5 will be faster but less secure)

Feature: implement database subseting for MongoDB

Implement database subsetting for MongoDB as we did for PostgreSQL..

However, MongoDB is not a relational database and we need to support "Virtual Foreign Key". Meaning, as a user I want to indicate that column collection_a.post_id is linked to column collection_b.id and then keep the consistency across the collections.

Unable to process local dump file

Good Afternoon,

I am trying to process a local dump file but unable to bypass the requirement for a valid connection_uri.

cat production.sql | replibyte -c conf.yaml dump create -i

Please see below for my configuration
`source:
connection_uri: "postgres://root:password@localhost:5453/root"
transformers:
- database: public
table: table2
columns:
- name: city
transformer_name: random
- name: postal_code
transformer_name: random
destination:

it's different to the source

connection_uri: postgres://root:password@localhost:5453/root
datastore:
gcp:
bucket: xxx-db-xxx
region: xxx
access_key: xxxx
secret: xxxx`

Setting Dummy placeholder
command error: pg_dump: error: connection to server at "local.google.com" (172.217.165.14), port 5432 failed: Operation timed out Is the server running on that host and accepting TCP/IP connections?
Not Setting value
thread 'main' panicked at 'bad config file format: Message("missing fieldconnection_uri", Some(Pos { marker: Marker { index: 22, line: 2, col: 14 }, path: "source" }))', replibyte/src/main.rs:83:56

Auto-detect sensitive fields and automatically hide potential sensitive data

One feature that could be very useful and that will prevent any potential unexpected data leak would be to automatically detect sensitive fields to apply a transformer on it. It could be an option in the conf.yaml that will enable it:

source:
  auto_hide_sensitive_data: 
    enable: true
    fallback_transformers: 
    -  field_type: string
        transformer: random
...

Why

They are many reasons why this feature would be useful:

Suppose you use Replibyte and you defined your conf.yaml with a certain version of your database schema. Then someone adds a field in your database that you are not aware of; if the conf.yaml is not updated, then we will leak this new field.
Specifying every field from the database that we need to hide is tedious and even almost impossible with a large database.

Happy to have your feedback:

Does this feature request make sense?
How we can design it?

Failing tests for pull requests from forked repositories

Hi @evoxmusic

I think I've found why tests are failing on pull requests, it's because secrets are not passed to the runner for security concerns.

See https://docs.github.com/en/actions/security-guides/encrypted-secrets#using-encrypted-secrets-in-a-workflow

There is this note :

Note: With the exception of GITHUB_TOKEN, secrets are not passed to the runner when a workflow is triggered from a forked repository.

I propose to replace ${{ secrets.AWS_ACCESS_KEY_ID }} and ${{ secrets.AWS_SECRET_ACCESS_KEY }} directly by the MinIO credentials.

I don't know any other solution, what do you think ?

Provide a delete backup command

Hi @evoxmusic

Over the time, more and more backups adds up on the bridge so I think that would be good to have a command to delete a backup from the bridge. This is the signature I propose :

replibyte -c "config.yml" backup delete "backup-name-here"

In the future we could add options to delete backups like --older-than=14d or so.

WDYT?

Feature: allow to add custom transformer

Replibyte is great and really fast. Thanks for the good work.

I have a jsonb columns in my database with a user-defined format. And this column can hold sensitive information.
So I need to use custom logic to properly anonymize it.

Is it possible to add a transformer that would allow applying any custom transformation?
This transformer could call a user-defined bash command to transform the value.

Support MongoDB import from stdin

As for MySQL and PostgreSQL, we want to add the possibility to import an existing mongodump.

Ex :

cat mongo_dump/* | replibyte -c ./conf.yaml dump create  -i -s mongodb

Missing MongoDB and MySQL clients in Dockerfile

Hi, I noticed that the MongoDB and MySQL clients are missing in the Dockerfile -> https://github.com/Qovery/replibyte/blob/main/Dockerfile

BucketAlreadyOwnedByYou s3

There seems to be something doing on the with s3 part of the tool. After I initially create a dump, which looks like it creates the s3 bucket as well. Every replibyte command after replibyte -c conf.yaml dump create results in [2022-05-10T02:06:58Z ERROR replibyte::datastore::s3] BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it. My conf.yaml is attached

I am using minio: https://min.io/product/s3-compatibility as my data store.

Thank you in advance for your guidance.

conf.yaml:

        source:
          connection_uri: postgres://postgres:*****@dev-pg.rqdev.net:5432/mydatabase
          transformers:
            - database: mydatabase
              table: pending_ticket_messages
              columns:
                - name: message
                  transformer_name: random
        datastore:
          aws:
            bucket: rigs-database-state
            region: us-east-1
            access_key_id: greymatter
            secret_access_key: greymatter
            endpoint:
              custom: 'http://localhost:9000'
        destination:
          connection_uri: postgres://greymatter:******@localhost:5432/redarrow

MySQL dump is not working and also not throwing any error ,

I am using MySQL 5.7 and its running on docket container in my local. I am trying to access the same from my Windows 64 system. I am able to access the DB from one of mysql client.

Find attached sample conf file I am using

Postgres boolean column type not supported

I've noticed that postgres boolean type is not supported. I did a backup from a table with 2 booleans, and those boolean were NULL.

Source: Support MongoDB

Implement and support MongoDB source connector.

TBD

New subset strategy: create SELECT with WHERE clause strategy

As a user, I want to select a subset of a database based on a SELECT and WHERE clause.

source:
  database_subset:
    database: public
    table: orders
    strategy_name: query
    strategy_options:
      query: "SELECT * FROM public.table WHERE customer_id = 'abcdef' AND total_cost_usd > 50"

It means that we need to create a small SQL parser to build a conditional tree and filter on those conditions.

Examples of queries we can support:

SELECT * FROM public.table WHERE customer_id = 'abcdef';
SELECT * FROM public.table WHERE customer_id = 'abcdef' AND total_cost_usd > 50;
SELECT * FROM public.table WHERE customer_id = 'abcdef' AND total_cost_usd < 50;
SELECT * FROM public.table WHERE customer_id = 'abcdef' AND total_cost_usd <= 50;
SELECT * FROM public.table WHERE customer_id = 'abcdef' AND total_cost_usd >= 50;
SELECT * FROM public.table WHERE customer_id = 'abcdef' AND total_cost_usd != 50;
SELECT * FROM public.table WHERE customer_id = 'abcdef' OR total_cost_usd > 50;
SELECT * FROM public.table WHERE customer_id = 'abcdef' OR total_cost_usd < 50;
SELECT * FROM public.table WHERE customer_id = 'abcdef' OR total_cost_usd <= 50;
SELECT * FROM public.table WHERE customer_id = 'abcdef' OR total_cost_usd >= 50;
SELECT * FROM public.table WHERE customer_id = 'abcdef' OR total_cost_usd != 50;

As an improvement, we can put database_subset.database and database_subset.table parameters into database_subset.strategy_options. Then, they will be optional depending on the subset strategy taken.

Create a Transformer to generate a first name

Create a transformer to generate a "first name" from a list.

Examples

Example 1

Input: Romaric
Output: Julie

Example 2

Input: Lucas
Output: Georges

Example 3

Input: Etienne
Output: Julienne

Example 4

Input: Null
Output: Null

Example 5

Input: "" (empty)
Output: "" (empty)

Example 6

Input: 123
Output: 123

qovery / replibyte Goto Github PK

replibyte's Introduction

Seed Your Development Database With Real Data ⚡️

Prerequisites

Usage

Features

Getting Started

Demo

Contributing

Thanks

replibyte's People

Contributors

Stargazers

Watchers

Forkers

replibyte's Issues

What is Subsetting

References

Design references

Design proposal

Main benefits

Examples

Example 1

Example 2

Example 3

Example 4

Example 5

it's different to the source

Why

Examples

Example 1

Example 2

Example 3

Example 4

Example 5

Example 6

Recommend Projects

Recommend Topics

Recommend Org