
Comments (6)

menzenski commented on May 27, 2024

After some further consideration we've decided that it's actually not feasible to retire our Postgres database. We plan to continue using it for the Meltano system database (and for some other metadata capture, e.g. the Argo Workflows archive).

So, from my perspective this issue could be closed.


edgarrmondragon commented on May 27, 2024

Hey Matt, thanks for filing!

Can you say more about how you're using the runs table? Is it something that you glance at occasionally, or are some of your workflows dependent on it?


I'm trying to think through what this would look like. State backends are essentially key-value stores, so it's easy to use object storage for them, but the runs table is more transactional.

See, for example, how the job/run model is used within a transaction to keep a heartbeat:

@asynccontextmanager
async def run(self, session):
    """Run wrapped code in context of a job.

    Transitions state to RUNNING and SUCCESS/FAIL as appropriate and
    records heartbeat every second.

    Args:
        session: the session to use for writing to the db

    Raises:
        BaseException: re-raises an exception occurring in the job running
            in this context
    """  # noqa: DAR301
    try:
        self.start()
        self.save(session)
        with self._handling_sigterm(session):
            async with self._heartbeating(session):
                yield
        self.success()
        self.save(session)
    except BaseException as err:  # noqa: WPS424
        if not self.is_running():
            raise
        self.fail(error=self._error_message(err))
        self.save(session)
        raise

and to control concurrency:

async def run_with_job(self) -> None:
    """Run the ELT task within the context of a job.

    Raises:
        RunnerError: if failures are encountered during execution or if the
            underlying pipeline/job is already running.
    """
    job = self.context.job
    fail_stale_jobs(self.context.session, job.job_name)
    if not self.context.force and (
        existing := JobFinder(job.job_name).latest_running(
            self.context.session,
        )
    ):
        raise RunnerError(
            f"Another '{job.job_name}' pipeline is already running "
            f"which started at {existing.started_at}. To ignore this "
            "check use the '--force' option.",
        )
    with closing(self.context.session) as session:
        async with job.run(session):
            await self.execute()

I'm happy to discuss spec and implementation proposals, and even review PRs, but this is something that we probably won't prioritize ourselves.


That said, one option that may be available today is to rely on the default SQLite system db and use something like Litestream [1] to sync the database with S3.

Footnotes

  1. https://litestream.io (Litestream: Streaming SQLite Replication)


edgarrmondragon commented on May 27, 2024

Another idea that just came to mind is to search for or implement a SQLAlchemy dialect that's SQLite + S3, so it could be used like the example in #7143 (comment).

The individual components seem to be out there:
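As a purely hypothetical sketch (separate from whatever existing components are referenced above), the glue could look roughly like this: pull the SQLite file from object storage before opening it, push it back afterwards. The bucket, key, and helper names are placeholders, and this is not an existing Meltano feature; it assumes boto3 and SQLAlchemy are installed.

from contextlib import contextmanager
from pathlib import Path

import boto3
from botocore.exceptions import ClientError
from sqlalchemy import create_engine

BUCKET = "my-meltano-bucket"    # placeholder
KEY = "system-db/meltano.db"    # placeholder
LOCAL_PATH = Path("/tmp/meltano.db")


@contextmanager
def s3_backed_sqlite_engine():
    """Yield a SQLAlchemy engine for a SQLite file mirrored to S3 (sketch only)."""
    s3 = boto3.client("s3")
    try:
        # Start from the most recent copy in object storage, if one exists.
        s3.download_file(BUCKET, KEY, str(LOCAL_PATH))
    except ClientError:
        pass  # first run: nothing to download yet
    engine = create_engine(f"sqlite:///{LOCAL_PATH}")
    try:
        yield engine
    finally:
        engine.dispose()
        # Push the (possibly modified) file back to object storage.
        s3.upload_file(str(LOCAL_PATH), BUCKET, KEY)

Note there's no locking here, so two concurrent pipelines would silently clobber each other's writes, which is exactly why the transactional runs table is the harder part of this idea.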


menzenski commented on May 27, 2024

> Can you say more about how you're using the runs table? Is it something that you glance at occasionally, or are some of your workflows dependent on it?

We don't have anything today that depends on it specifically. We query it manually, occasionally, for debugging purposes.

As we build out our "second-generation" Meltano platform, though, we would like to better implement "reporting and analytics on our ELT workflows" - ideally we'd be able to e.g. surface in a dashboard which ELT jobs have run recently, succeeded, failed, etc.

We run Meltano in Kubernetes via Argo Workflows and we have the Argo Workflows workflow archive set up, so all Argo Workflows executions are recorded in a database today. A Meltano run corresponds to exactly one Argo Workflows run, so we still have good information available on what Meltano jobs ran when, succeeded, failed, etc.

The part I'm thinking about specifically as a potential limitation is not having the "payload" field from the meltano runs table available. It seems like it'd be useful to have that explicitly persisted, since it appears to capture the value of the replication key for each stream at the start of the run.


menzenski commented on May 27, 2024

The other thought I had is that we're moving from Postgres into Snowflake for our warehouse. System database support for Snowflake would accomplish the same goal for us (continue to leverage a persistent state backend without running a Postgres database).


edgarrmondragon commented on May 27, 2024

> We don't have anything today that depends on it specifically. We query it manually, occasionally, for debugging purposes.

> As we build out our "second-generation" Meltano platform, though, we would like to better implement "reporting and analytics on our ELT workflows" - ideally we'd be able to e.g. surface in a dashboard which ELT jobs have run recently, succeeded, failed, etc.

> We run Meltano in Kubernetes via Argo Workflows and we have the Argo Workflows workflow archive set up, so all Argo Workflows executions are recorded in a database today. A Meltano run corresponds to exactly one Argo Workflows run, so we still have good information available on what Meltano jobs ran when, succeeded, failed, etc.

> The part I'm thinking about specifically as a potential limitation is not having the "payload" field from the meltano runs table available. It seems like it'd be useful to have that explicitly persisted, since it appears to capture the value of the replication key for each stream at the start of the run.

Thanks for adding context! That makes sense. The payload field can indeed give some insight into "state evolution" of a tap and its streams, which can be valuable.
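For a rough idea of what that kind of reporting could look like against the system db directly, here's a hypothetical, read-only sketch; the connection URI is a placeholder and the column names should be double-checked against your Meltano version's schema.

import json

from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:pass@host:5432/meltano")  # placeholder URI

recent_runs = text(
    """
    SELECT job_name, state, started_at, ended_at, payload
    FROM runs
    ORDER BY started_at DESC
    LIMIT 20
    """
)

with engine.connect() as conn:
    for row in conn.execute(recent_runs):
        # Per the discussion above, `payload` appears to capture e.g. the
        # replication key values for each stream at the start of the run.
        payload = row.payload if isinstance(row.payload, dict) else json.loads(row.payload or "{}")
        print(row.job_name, row.state, row.started_at, row.ended_at, payload)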

> The other thought I had is that we're moving from Postgres into Snowflake for our warehouse. System database support for Snowflake would accomplish the same goal for us (continue to leverage a persistent state backend without running a Postgres database).

Yeah, that's been asked in Slack before. The database_uri is a SQLAlchemy URL, so in theory you could point it at a Snowflake instance by setting it to 'snowflake://<user_login_name>:<password>@<account_name>/<database_name>/<schema_name>?warehouse=<warehouse_name>&role=<role_name>' [1].
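As a sketch of what that URL shape means in practice (placeholders throughout, and assuming the snowflake-sqlalchemy package is installed so the dialect is registered); whether Meltano's migrations would actually run against it is a separate question, as noted below:

from sqlalchemy import create_engine, text

uri = (
    "snowflake://<user_login_name>:<password>@<account_name>/"
    "<database_name>/<schema_name>"
    "?warehouse=<warehouse_name>&role=<role_name>"
)

# Meltano builds a SQLAlchemy engine from `database_uri` in essentially this way.
engine = create_engine(uri)
with engine.connect() as conn:
    print(conn.execute(text("select current_version()")).scalar())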

Now, I recall that doesn't work because at least one of the migration scripts is not compatible with Snowflake's SQL, so changes would be required there (see #6529 and #6167). Do log an issue if Snowflake support for the system db would make this transition easier for you, and of course PRs would be welcome 😄.

Footnotes

  1. https://github.com/snowflakedb/snowflake-sqlalchemy/?tab=readme-ov-file#connection-parameters
