
Ingest Framework

Description

A framework for developing data ingestion pipelines and deploying them in a serverless fashion.

Why build a new framework?

At Development Seed, we build a lot of data pipelines, and we've settled on a reference implementation for the majority of cases. We could make use of an existing ETL tool to build out our applications, but we have a fairly limited set of requirements, as well as some odd constraints (e.g. limited permissions in specific client environments), so it makes sense to build a lightweight framework that specifically targets our needs. We have the talent and capacity to extend it, as needed, and we can govern it in such a way that it stays focused on our specific needs. This doesn't need to be a one-size-fits-all solution that eventually collapses under the weight of 1000 edge cases. If it works well for 80% of our pipeline projects, that's good enough.

Goals

Having worked on data ingestion for the majority of my time here, I've experienced first-hand some of the pain points in developing, testing and monitoring a pipeline. With that in mind, here are my goals for this project:

Decouple business logic from infrastructure

Just because the code ends up deployed in a Lambda doesn't mean it needs to look like a Lambda handler. We should be able to write our business logic in a way that allows it to be run anywhere, if only because it makes testing easier.
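As a minimal sketch of what this could look like (the names here are illustrative, not the framework's actual API): the step is a plain function operating on plain data, and the Lambda handler is a one-line adapter around it.

```python
from dataclasses import dataclass


@dataclass
class DiscoverInput:
    collection: str
    date: str


def discover(event: DiscoverInput) -> list[str]:
    # Plain business logic: no Lambda types anywhere, so it runs the same
    # way in a unit test, a local script, or any other execution context.
    return [f"s3://my-bucket/{event.collection}/{event.date}/scene.json"]


def handler(raw_event: dict, context) -> list[str]:
    # The only Lambda-aware code: a thin adapter that parses the raw
    # event and delegates to the plain function above.
    return discover(DiscoverInput(**raw_event))
```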

Provide an easy way to reason about all steps in a pipeline and how they interact

I should be able to look at the definition of a pipeline and reason, at a glance, about how data moves through it. I shouldn't need to jump between CDK constructs and application code. We should also be able to enforce a contract between each step in a pipeline and the subsequent step. We shouldn't have to deploy the pipeline to discover that one step is passing the wrong data type to the next step.
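For example (an illustrative sketch, assuming steps are typed Python callables), the framework could compare each step's return annotation with the next step's input annotation when the pipeline is defined, so a mismatch fails at import time rather than after deployment:

```python
import inspect
from typing import get_type_hints


def validate_pipeline(steps):
    # Compare each step's return annotation with the next step's first
    # parameter annotation; raise at definition time on any mismatch.
    for current, nxt in zip(steps, steps[1:]):
        out_type = get_type_hints(current).get("return")
        first_param = next(iter(inspect.signature(nxt).parameters))
        in_type = get_type_hints(nxt).get(first_param)
        if out_type != in_type:
            raise TypeError(
                f"{current.__name__} returns {out_type}, "
                f"but {nxt.__name__} expects {in_type}"
            )


def discover(date: str) -> list[str]:
    return []


def download(urls: list[str]) -> list[bytes]:
    return []


validate_pipeline([discover, download])  # OK: list[str] feeds list[str]
```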

Provide the ability to run and test an entire pipeline locally

I shouldn't need to deploy my pipeline and all of its dependencies (VPC, database, etc.) in AWS in order to run an end-to-end test. I should be able to spin up any required services locally and run my entire pipeline from my dev machine.
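A sketch of the idea (again hypothetical, not the framework's actual interface): because steps are plain callables, an in-process runner can chain them end-to-end, with connection details for locally running services (e.g. Postgres in Docker) injected via configuration.

```python
def run_local(steps, payload):
    # Execute the pipeline in-process, in order; no AWS involved.
    for step in steps:
        payload = step(payload)
    return payload


# e.g. in a pytest test, pointed at services from `docker compose up`:
# result = run_local([discover, download, load], {"date": "2024-01-01"})
```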

Provide an out-of-the-box monitoring interface for every pipeline

Monitoring is a pain to add after the fact, and it's especially painful to add to an ad hoc pipeline. We should bake it into the framework, so it's available for free with every deployment. The monitoring tool should represent the logical pipelines defined in your application, decoupled from the infrastructure that runs them. So, even if a pipeline consists of two or more Step Functions behind the scenes, it appears as a single pipeline in the monitoring interface.

Codify best practices in code

We now have a reference design for this sort of ingestion pipeline. But the only thing better than codifying best practices in a document is codifying them in, well, code. For example, we shouldn't have to consult the reference doc every time we want to implement a bulk processing step. We should be able to simply annotate a step in our pipeline as accumulating data from other pipeline runs and let the framework handle provisioning all the required resources to make that happen.
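A sketch of what that annotation might look like (the `step` decorator and `Batch` options here are hypothetical): the step declares its accumulation policy, and the deployment layer would read that metadata to provision the queue and batching trigger behind the scenes.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Batch:
    size: int            # flush after this many accumulated items...
    window_seconds: int  # ...or after this long, whichever comes first


def step(accumulate: Optional[Batch] = None):
    # Record the accumulation policy on the function itself; a deployment
    # layer could read this metadata and provision a queue + batch trigger.
    def wrap(fn: Callable) -> Callable:
        fn.accumulate = accumulate
        return fn
    return wrap


@step(accumulate=Batch(size=1000, window_seconds=300))
def load_items(items: list[dict]) -> None:
    # Invoked with up to 1000 items accumulated across pipeline runs.
    ...
```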

Portability between cloud providers

This is a future goal, but worth keeping in mind. All infrastructure in this project is currently being spun up in AWS via CDK, but references to CDK have been mostly kept out of the core library. Eventually, we may want to add support for deploying to Azure or GCP via an appropriate IaC tool, and we should be able to do that without too much refactoring.

Example

https://github.com/edkeeble/ingest-example

Missing

A non-comprehensive list of things that are missing right now.

  • Defining networking requirements (we may be able to infer these from a step's required access to, e.g., a database in a private subnet)
  • Support a requirements file per step rather than per app
  • Additional trigger types (SQS, HTTP request?). Define a plugin interface for triggers, so that other devs can provide their own.
  • Support injecting named secrets into a step
  • Allow triggering another pipeline from within a step (support parallelization)
  • Monitoring:
    • tracing each pipeline run through each step in the pipeline
    • an interface for monitoring pipeline runs
    • the ability to retry a run
  • Update the current approach of using class types for pipeline steps to the more standard pattern of class instances, with parameters passed to the constructor (see the sketch after this list)
    • This requires re-instantiating the classes with the same parameters within each Lambda function
  • A non-toy working example
  • A multi-pipeline example
  • Move the examples into this repo
  • A better name for this project
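On the class-instance point above, here is a minimal sketch of the target pattern and the re-instantiation it implies (all names hypothetical):

```python
from dataclasses import asdict, dataclass


@dataclass
class DownloadStep:
    bucket: str
    prefix: str

    def run(self, keys: list[str]) -> list[str]:
        return [f"s3://{self.bucket}/{self.prefix}/{key}" for key in keys]


# At definition time: configure the step as a normal instance.
step = DownloadStep(bucket="ingest-data", prefix="raw")

# At deploy time: record the constructor parameters...
step_config = {"class": "DownloadStep", "params": asdict(step)}

# ...so each Lambda can re-instantiate the same step at runtime.
rebuilt = DownloadStep(**step_config["params"])
assert rebuilt == step
```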
