Comments (6)
After some further consideration we've decided that it's actually not feasible to retire our Postgres database. We plan to continue using it for the Meltano system database (and for some other metadata capture, e.g. the Argo Workflows archive).
So, from my perspective this issue could be closed.
Hey Matt, thanks for filing!
Can you say more about how you're using the `runs` table? Is it something that you glance at occasionally, or are some of your workflows dependent on it?
I'm trying to think what this would look like. State backends are essentially key-value stores so it's easy to use object storage for that, but the runs table is more transactional.
See for example how the job/run model is used within a transaction to keep a heartbeat (`meltano/src/meltano/core/job/job.py`, lines 249 to 280 at `bcbe3eb`) and to control concurrency (`meltano/src/meltano/core/block/extract_load.py`, lines 477 to 499 at `bcbe3eb`).
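To make the transactional angle concrete, here is a minimal, hypothetical sketch (stdlib `sqlite3` and a toy schema, not Meltano's actual job model) of why a heartbeat and a concurrency check want real transactional semantics rather than a key-value store:

```python
import sqlite3
from datetime import datetime, timezone

# Toy "runs" table for illustration only; column names are assumptions,
# not Meltano's real schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE runs (
        job_name TEXT,
        state TEXT,               -- e.g. RUNNING, SUCCESS, FAIL
        last_heartbeat_at TEXT    -- refreshed periodically while RUNNING
    )"""
)
conn.execute("INSERT INTO runs VALUES ('gitlab-to-jsonl', 'RUNNING', NULL)")
conn.commit()

def heartbeat(conn: sqlite3.Connection, job_name: str) -> None:
    """Refresh the heartbeat timestamp inside a transaction."""
    with conn:  # BEGIN ... COMMIT, rolled back on error
        conn.execute(
            "UPDATE runs SET last_heartbeat_at = ? "
            "WHERE job_name = ? AND state = 'RUNNING'",
            (datetime.now(timezone.utc).isoformat(), job_name),
        )

def is_another_run_active(conn: sqlite3.Connection, job_name: str) -> bool:
    """Concurrency control: is a run with this name already in flight?"""
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM runs WHERE job_name = ? AND state = 'RUNNING'",
        (job_name,),
    ).fetchone()
    return count > 0

heartbeat(conn, "gitlab-to-jsonl")
print(is_another_run_active(conn, "gitlab-to-jsonl"))  # True
```

Both operations read and write the same rows, so they need the database's isolation guarantees; plain object storage gives you neither atomic updates nor a consistent count.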
I'm happy to discuss spec and implementation proposals, and even review PRs, but this is something that we probably won't prioritize ourselves.
That said, one option that may be available today is to rely on the default SQLite system db and use something like Litestream¹ to sync the database with S3.
Footnotes

1. Litestream - Streaming SQLite Replication: https://litestream.io
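For reference, a Litestream setup for this might look roughly like the following (the bucket name and database path are hypothetical; check the Litestream docs for your deployment):

```yaml
# litestream.yml (illustrative): continuously replicate the default
# SQLite system db to S3.
dbs:
  - path: /project/.meltano/meltano.db
    replicas:
      - url: s3://my-bucket/meltano-systemdb
```

You would then run Meltano under `litestream replicate`, and use `litestream restore` to rehydrate the database on a fresh node.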
Another idea that just came to mind is to search for or implement a SQLAlchemy dialect that's SQLite + S3, so it could be used like the example in #7143 (comment).
The individual components seem to be out there:
- https://github.com/uktrade/sqlite-s3vfs, although this is concerning:

  > No locking is performed, so client code must ensure that writes do not overlap with other writes or reads. If multiple writes happen at the same time, the database will probably become corrupt and data be lost.

- https://github.com/edgarrmondragon/sqlalchemy-sqlean/blob/206f86df0a325a9d853d5afb066d26cbcddd41d7/src/sqlean_driver/__init__.py (example of a SQLite driver implementation)
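To show where such a driver would plug in, here is a hedged sketch: SQLAlchemy's `create_engine` accepts a `creator` callable that supplies the raw DBAPI connection, which is the seam an S3-backed SQLite driver could occupy. This example just uses the stdlib `sqlite3` module, not sqlite-s3vfs:

```python
import sqlite3
from sqlalchemy import create_engine, text

def make_connection():
    # In a real S3-backed setup, this is where you would open the database
    # through a custom VFS (as sqlite-s3vfs does for apsw) instead of the
    # local filesystem. Plain in-memory SQLite stands in for it here.
    return sqlite3.connect(":memory:")

# With `creator`, SQLAlchemy delegates connection creation entirely to us.
engine = create_engine("sqlite://", creator=make_connection)
with engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).scalar())  # 1
```

A proper dialect would go further (handling pooling, URL parsing, and the locking caveat quoted above), but the `creator` hook is enough to prototype the idea.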
> Can you say more about how you're using the `runs` table? Is it something that you glance at occasionally, or are some of your workflows dependent on it?
We don't have anything today that depends on it specifically. We query it manually, occasionally, for debugging purposes.
As we build out our "second-generation" Meltano platform, though, we would like to better implement "reporting and analytics on our ELT workflows" - ideally we'd be able to e.g. surface in a dashboard which ELT jobs have run recently, succeeded, failed, etc.
We run Meltano in Kubernetes via Argo Workflows and we have the Argo Workflows workflow archive set up, so all Argo Workflows executions are recorded in a database today. A Meltano run corresponds to exactly one Argo Workflows run, so we still have good information available on what Meltano jobs ran when, succeeded, failed, etc.
The part I'm thinking about specifically as a potential limitation is not having the `payload` field from the Meltano `runs` table available. It seems like it would be useful to have that explicitly persisted, since it appears to record the replication-key value for each stream in the run, as of the start of the run.
The other thought I had is that we're moving from Postgres into Snowflake for our warehouse. System database support for Snowflake would accomplish the same goal for us (continue to leverage a persistent state backend without running a Postgres database).
> We don't have anything today that depends on it specifically. We query it manually, occasionally, for debugging purposes.
>
> As we build out our "second-generation" Meltano platform, though, we would like to better implement "reporting and analytics on our ELT workflows" - ideally we'd be able to e.g. surface in a dashboard which ELT jobs have run recently, succeeded, failed, etc.
>
> We run Meltano in Kubernetes via Argo Workflows and we have the Argo Workflows workflow archive set up, so all Argo Workflows executions are recorded in a database today. A Meltano run corresponds to exactly one Argo Workflows run, so we still have good information available on what Meltano jobs ran when, succeeded, failed, etc.
>
> The part I'm thinking about specifically as a potential limitation is not having the "payload" field from the meltano runs table available. It seems like it'd be useful to have that explicitly persisted - it seems to provide the value of the replication key for each stream in the run, at the start of the run.
Thanks for adding context! That makes sense. The `payload` field can indeed give some insight into the "state evolution" of a tap and its streams, which can be valuable.
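As a sketch of the kind of reporting this enables, here is a hypothetical query over a toy `runs` table. The column names and the payload shape are assumptions for illustration, not the real Meltano schema:

```python
import json
import sqlite3

# Toy table approximating the idea of a runs table with a JSON payload.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (job_name TEXT, started_at TEXT, payload TEXT)")
conn.executemany(
    "INSERT INTO runs VALUES (?, ?, ?)",
    [
        (
            "gitlab-to-jsonl",
            "2023-01-01T00:00:00",
            json.dumps({"singer_state": {"bookmarks": {"commits": {"created_at": "2022-12-31"}}}}),
        ),
        (
            "gitlab-to-jsonl",
            "2023-01-02T00:00:00",
            json.dumps({"singer_state": {"bookmarks": {"commits": {"created_at": "2023-01-01"}}}}),
        ),
    ],
)

# How did the `commits` stream's replication-key bookmark evolve run over run?
rows = conn.execute(
    """SELECT started_at,
              json_extract(payload, '$.singer_state.bookmarks.commits.created_at')
       FROM runs WHERE job_name = ? ORDER BY started_at""",
    ("gitlab-to-jsonl",),
).fetchall()
for started_at, bookmark in rows:
    print(started_at, bookmark)
```

The same `json_extract`-style query would work against a warehouse table, which is what makes archiving the payload attractive for dashboards.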
> The other thought I had is that we're moving from Postgres into Snowflake for our warehouse. System database support for Snowflake would accomplish the same goal for us (continue to leverage a persistent state backend without running a Postgres database).
Yeah, that's been asked in Slack before. The `database_uri` is a SQLAlchemy URL, so in theory you could point it at a Snowflake instance by setting it to `snowflake://<user_login_name>:<password>@<account_name>/<database_name>/<schema_name>?warehouse=<warehouse_name>&role=<role_name>`.

Now, I recall that doesn't work because at least one of the migration scripts is not compatible with Snowflake's SQL, so changes would be required there (see #6529 and #6167). Do log an issue for it if Snowflake support for the system db would make this transition easier for you, and of course PRs would be welcome 😄.
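If you go down that road, SQLAlchemy's `URL.create` helper builds that URL safely (escaping special characters in credentials). The placeholder values below are hypothetical, and no connection is attempted:

```python
from sqlalchemy.engine import URL

# Sketch only: assemble the Snowflake-style URL quoted above from parts.
url = URL.create(
    drivername="snowflake",
    username="my_user",
    password="my_password",
    host="my_account",
    database="my_database/my_schema",
    query={"warehouse": "my_warehouse", "role": "my_role"},
)
# str(url) masks the password; use render_as_string(hide_password=False)
# when you actually need the full URI for a setting.
print(str(url).startswith("snowflake://my_user:"))  # True
```

This avoids hand-assembling the query string and keeps credentials out of format strings.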