awesome-etl

A curated list of notable ETL (extract, transform, load) frameworks, libraries and software.

ETL Tools (GUI)

  • Pentaho Kettle - A widely used open-source graphical ETL tool, also known as Pentaho Data Integration.
  • Talend - "an open source application for data integration job design with a graphical development environment"
  • Informatica PowerCenter - "a toolset for establishing and maintaining enterprise-wide data warehouses. It has a customer base of over 5,000 companies."
  • Microsoft SSIS - "a component of the Microsoft SQL Server database software that can be used to perform a broad range of data migration tasks."
  • Apache NiFi - "a rich, web-based interface for designing, controlling, and monitoring a dataflow."
  • Jitterbit - "commercial software integration product that facilitates transport between legacy, enterprise, and on-demand computing applications."

Workflow Management/Engines

  • Luigi - "a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in."
  • TaskFlow - "allows the creation of lightweight task objects and/or functions that are combined together into flows (aka: workflows) in a declarative manner. It includes engines for running these flows in a manner that can be stopped, resumed, and safely reverted."
  • Airflow - "Use airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed."
  • Pinball - "a scalable workflow management platform developed at Pinterest. It is built based on layered approach."
  • Azkaban - "a batch workflow job scheduler created at LinkedIn to run Hadoop jobs. Azkaban resolves the ordering through job dependencies and provides an easy to use web user interface to maintain and track your workflows."
  • Dray.it - "Docker workflow engine. Allows users to separate a workflow into discrete steps each to be handled by a single container."
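
As a rough illustration of the dependency resolution several of these engines describe (tasks form a directed acyclic graph, and each task runs only after its upstream tasks), here is a minimal sketch using only Python's standard library. The task names and the `run_workflow` helper are hypothetical, and nothing here reflects the actual API of Luigi, Airflow, or any other tool listed above.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(tasks, dependencies):
    """Run each task after all of its upstream tasks; return the run order.

    tasks: dict mapping task name -> zero-argument callable (hypothetical).
    dependencies: dict mapping task name -> set of upstream task names.
    """
    graph = {name: dependencies.get(name, set()) for name in tasks}
    order = []
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    for name in TopologicalSorter(graph).static_order():
        tasks[name]()
        order.append(name)
    return order


log = []
tasks = {
    "extract": lambda: log.append("extracted"),
    "transform": lambda: log.append("transformed"),
    "load": lambda: log.append("loaded"),
}
dependencies = {"transform": {"extract"}, "load": {"transform"}}

print(run_workflow(tasks, dependencies))
```

Real workflow engines add scheduling, retries, persistence, and a UI on top of this ordering step; the sketch only shows why declaring dependencies is enough to derive an execution order.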

Job Scheduling

  • Chronos - "a distributed and fault-tolerant scheduler that runs on top of Apache Mesos that can be used for job orchestration."
  • Dagobah - "a simple dependency-based job scheduler written in Python. Dagobah allows you to schedule periodic jobs using Cron syntax. Each job then kicks off a series of tasks (subprocesses) in an order defined by a dependency graph you can easily draw with click-and-drag in the web interface."

Python

Libraries

  • Pandas - Implements dataframes in Python for easier data processing and includes a number of tools that make it easier to extract data from multiple file formats.
  • Bubbles - "a Python ETL Framework and set of tools. It can be used for processing, auditing and inspecting data. Focus is on understandability and transparency of the process."
  • SQLAlchemy - "the Python SQL toolkit and Object Relational Mapper that gives application developers the full power and flexibility of SQL."
  • dataset - A wrapper around SQLAlchemy that simplifies database operations (including upserting).
  • Dask - Parallel, out-of-core computation with a Pandas-like dataframe API, for processing data that won't fit into memory.
  • Blaze - "translates a subset of modified NumPy and Pandas-like syntax to databases and other computing systems."
  • Odo - Moves data between containers (SQL, CSV, MongoDB, Pandas, etc.). Claims to be the easiest and fastest way to load a CSV into your database.
  • xmltodict - Makes working with XML as easy as working with JSON. Also allows streaming so you don't run out of memory on large XML files.
  • Celery - "an asynchronous task queue/job queue based on distributed message passing. It is focused on real-time operation, but supports scheduling as well."
  • MrJob - "lets you write MapReduce jobs in Python 2.6+ and run them on several platforms. The easiest route to writing Python programs that run on Hadoop."
  • Joblib - "a set of tools to provide lightweight pipelining in Python."
  • Orange - "data visualization and analysis for novice and experts. Data mining through visual programming or Python scripting. Components for machine learning. Add-ons for bioinformatics and text mining. Packed with features for data analytics."
  • BeautifulSoup - Popular library used to extract data from web pages.
  • PyQuery - Extracts data from web pages with a jQuery-like syntax.
  • PETL - "a general purpose Python package for extracting, transforming and loading tables of data." Slower than Pandas and less suited to large datasets, but simpler.
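
To make the extract, transform, load steps concrete, here is a minimal sketch that reads CSV text, cleans it, and upserts it into a database. It deliberately uses only the standard library (`csv`, `sqlite3`) rather than any of the libraries listed above, so it does not show their APIs. The sample data, the `payments` table, and the three function names are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical input; in practice this would come from a file or an API.
RAW_CSV = """id,name,amount
1, alice ,10.5
2,bob,7
1,alice,12.0
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: strip whitespace and convert columns to proper types."""
    return [
        (int(r["id"]), r["name"].strip().title(), float(r["amount"]))
        for r in rows
    ]


def load(conn, records):
    """Load: insert records, replacing any existing row with the same id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments "
        "(id INTEGER PRIMARY KEY, name TEXT, amount REAL)"
    )
    # INSERT OR REPLACE is SQLite's simplest form of upsert.
    conn.executemany(
        "INSERT OR REPLACE INTO payments (id, name, amount) VALUES (?, ?, ?)",
        records,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
rows = conn.execute("SELECT id, name, amount FROM payments ORDER BY id").fetchall()
print(rows)
```

Because the third input row reuses id 1, it overwrites the first row instead of creating a duplicate, which is the behavior "upserting" refers to in the dataset entry above.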

Talks/Articles

Ruby

Go

  • Crunch - "A fast to develop, fast to run, Go based toolkit for ETL and feature extraction on Hadoop."

Talks/Articles

Cloud Services

  • Google Dataflow - "Google Cloud Dataflow provides a simple, powerful model for building both batch and streaming parallel data processing pipelines."
  • Amazon Data Pipeline - "a web service that helps you reliably process and move data between different AWS compute and storage services, as well as on-premise data sources, at specified intervals."
  • Amazon SWF - "helps developers build, run, and scale background jobs that have parallel or sequential steps. You can think of Amazon SWF as a fully-managed state tracker and task coordinator in the Cloud."
  • Snaplogic - "a self-upgrading, elastic execution grid that streams data between applications, databases, files, social and big data sources."

Big Data (Hadoop Stack)

  • Spark - "a fast and general-purpose cluster computing system. It provides high-level APIs in Scala, Java, and Python that make parallel jobs easy to write, and an optimized engine that supports general computation graphs. It also supports a rich set of higher-level tools including Shark (Hive on Spark), MLlib for machine learning, GraphX for graph processing, and Spark Streaming."
  • Pig - "a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs."
