
Comments (6)

menzenski commented on May 27, 2024

After some further consideration we've decided that it's actually not feasible to retire our Postgres database. We plan to continue using it for the Meltano system database (and for some other metadata capture, e.g. the Argo Workflows archive).

So, from my perspective this issue could be closed.


edgarrmondragon commented on May 27, 2024

Hey Matt, thanks for filing!

Can you say more about how you're using the runs table? Is it something that you glance at occasionally, or are some of your workflows dependent on it?


I'm trying to think through what this would look like. State backends are essentially key-value stores, so it's easy to use object storage for them, but the runs table is more transactional.

See, for example, how the job/run model is used within a transaction to keep a heartbeat:

@asynccontextmanager
async def run(self, session):
    """Run wrapped code in context of a job.

    Transitions state to RUNNING and SUCCESS/FAIL as appropriate and
    records heartbeat every second.

    Args:
        session: the session to use for writing to the db

    Raises:
        BaseException: re-raises an exception occurring in the job running
            in this context
    """  # noqa: DAR301
    try:
        self.start()
        self.save(session)
        with self._handling_sigterm(session):
            async with self._heartbeating(session):
                yield
        self.success()
        self.save(session)
    except BaseException as err:  # noqa: WPS424
        if not self.is_running():
            raise
        self.fail(error=self._error_message(err))
        self.save(session)
        raise

and to control concurrency:

async def run_with_job(self) -> None:
    """Run the ELT task within the context of a job.

    Raises:
        RunnerError: if failures are encountered during execution or if the
            underlying pipeline/job is already running.
    """
    job = self.context.job
    fail_stale_jobs(self.context.session, job.job_name)
    if not self.context.force and (
        existing := JobFinder(job.job_name).latest_running(
            self.context.session,
        )
    ):
        raise RunnerError(
            f"Another '{job.job_name}' pipeline is already running "
            f"which started at {existing.started_at}. To ignore this "
            "check use the '--force' option.",
        )
    with closing(self.context.session) as session:
        async with job.run(session):
            await self.execute()

I'm happy to discuss spec and implementation proposals, and even review PRs, but this is something that we probably won't prioritize ourselves.


That said, one option that may be available today is to rely on the default SQLite system db and use something like Litestream [1] to sync the database with S3.

Footnotes

  1. https://litestream.io (Litestream: Streaming SQLite Replication)


edgarrmondragon commented on May 27, 2024

Another idea that just came to mind is to search for or implement a SQLAlchemy dialect that's SQLite + S3, so it could be used like the example in #7143 (comment).

The individual components seem to be out there:
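As a purely hypothetical sketch (separate from whatever existing components are referenced above), the glue could look roughly like this: pull the SQLite file from object storage before opening it, push it back afterwards. The bucket, key, and helper names are placeholders, and this is not an existing Meltano feature; it assumes boto3 and SQLAlchemy are installed.

from contextlib import contextmanager
from pathlib import Path

import boto3
from botocore.exceptions import ClientError
from sqlalchemy import create_engine

BUCKET = "my-meltano-bucket"    # placeholder
KEY = "system-db/meltano.db"    # placeholder
LOCAL_PATH = Path("/tmp/meltano.db")


@contextmanager
def s3_backed_sqlite_engine():
    """Yield a SQLAlchemy engine for a SQLite file mirrored to S3 (sketch only)."""
    s3 = boto3.client("s3")
    try:
        # Start from the most recent copy in object storage, if one exists.
        s3.download_file(BUCKET, KEY, str(LOCAL_PATH))
    except ClientError:
        pass  # first run: nothing to download yet
    engine = create_engine(f"sqlite:///{LOCAL_PATH}")
    try:
        yield engine
    finally:
        engine.dispose()
        # Push the (possibly modified) file back to object storage.
        s3.upload_file(str(LOCAL_PATH), BUCKET, KEY)

Note there's no locking here, so two concurrent pipelines would silently clobber each other's writes, which is exactly why the transactional runs table is the harder part of this idea.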


menzenski commented on May 27, 2024

> Can you say more about how you're using the runs table? Is it something that you glance at occasionally, or are some of your workflows dependent on it?

We don't have anything today that depends on it specifically. We query it manually, occasionally, for debugging purposes.

As we build out our "second-generation" Meltano platform, though, we would like to better implement "reporting and analytics on our ELT workflows" - ideally we'd be able to e.g. surface in a dashboard which ELT jobs have run recently, succeeded, failed, etc.

We run Meltano in Kubernetes via Argo Workflows and we have the Argo Workflows workflow archive set up, so all Argo Workflows executions are recorded in a database today. A Meltano run corresponds to exactly one Argo Workflows run, so we still have good information available on what Meltano jobs ran when, succeeded, failed, etc.

The part I'm thinking about specifically as a potential limitation is not having the "payload" field from the meltano runs table available. It seems like it'd be useful to have that explicitly persisted, since it appears to capture the value of the replication key for each stream at the start of the run.


menzenski commented on May 27, 2024

The other thought I had is that we're moving from Postgres into Snowflake for our warehouse. System database support for Snowflake would accomplish the same goal for us (continue to leverage a persistent state backend without running a Postgres database).


edgarrmondragon commented on May 27, 2024

> We don't have anything today that depends on it specifically. We query it manually, occasionally, for debugging purposes.

> As we build out our "second-generation" Meltano platform, though, we would like to better implement "reporting and analytics on our ELT workflows" - ideally we'd be able to e.g. surface in a dashboard which ELT jobs have run recently, succeeded, failed, etc.

> We run Meltano in Kubernetes via Argo Workflows and we have the Argo Workflows workflow archive set up, so all Argo Workflows executions are recorded in a database today. A Meltano run corresponds to exactly one Argo Workflows run, so we still have good information available on what Meltano jobs ran when, succeeded, failed, etc.

> The part I'm thinking about specifically as a potential limitation is not having the "payload" field from the meltano runs table available. It seems like it'd be useful to have that explicitly persisted, since it appears to capture the value of the replication key for each stream at the start of the run.

Thanks for adding context! That makes sense. The payload field can indeed give some insight into "state evolution" of a tap and its streams, which can be valuable.
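For a rough idea of what that kind of reporting could look like against the system db directly, here's a hypothetical, read-only sketch; the connection URI is a placeholder and the column names should be double-checked against your Meltano version's schema.

import json

from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:pass@host:5432/meltano")  # placeholder URI

recent_runs = text(
    """
    SELECT job_name, state, started_at, ended_at, payload
    FROM runs
    ORDER BY started_at DESC
    LIMIT 20
    """
)

with engine.connect() as conn:
    for row in conn.execute(recent_runs):
        # Per the discussion above, `payload` appears to capture e.g. the
        # replication key values for each stream at the start of the run.
        payload = row.payload if isinstance(row.payload, dict) else json.loads(row.payload or "{}")
        print(row.job_name, row.state, row.started_at, row.ended_at, payload)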

> The other thought I had is that we're moving from Postgres into Snowflake for our warehouse. System database support for Snowflake would accomplish the same goal for us (continue to leverage a persistent state backend without running a Postgres database).

Yeah, that's been asked in Slack before. The database_uri is a SQLAlchemy URL, so in theory you could point it at a Snowflake instance by setting it to 'snowflake://<user_login_name>:<password>@<account_name>/<database_name>/<schema_name>?warehouse=<warehouse_name>&role=<role_name>' [1].
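As a sketch of what that URL shape means in practice (placeholders throughout, and assuming the snowflake-sqlalchemy package is installed so the dialect is registered); whether Meltano's migrations would actually run against it is a separate question, as noted below:

from sqlalchemy import create_engine, text

uri = (
    "snowflake://<user_login_name>:<password>@<account_name>/"
    "<database_name>/<schema_name>"
    "?warehouse=<warehouse_name>&role=<role_name>"
)

# Meltano builds a SQLAlchemy engine from `database_uri` in essentially this way.
engine = create_engine(uri)
with engine.connect() as conn:
    print(conn.execute(text("select current_version()")).scalar())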

Now, I recall that doesn't work because at least one of the migration scripts is not compatible with Snowflake's SQL, so changes would be required there (see #6529 and #6167). Do log an issue if Snowflake support for the system db would make this transition easier for you, and of course PRs would be welcome 😄.

Footnotes

  1. https://github.com/snowflakedb/snowflake-sqlalchemy/?tab=readme-ov-file#connection-parameters
