andreveit / fed-by-tweets-batch-ingestion

End-to-end ML system - Batch Ingestion

Makefile 1.68% Python 69.42% Dockerfile 0.30% Shell 11.02% HCL 17.59%
data-engineering data-science twitter twitter-sentiment-analysis

Fed By Tweets - Batch Ingestion

The Fed By Tweets Project

This repository is part of the fedbytweets project.

The aim of the project is to set up an end-to-end ML system on AWS infrastructure to ingest, process and extract insights from Twitter data. An NLP model is trained for sentiment analysis and then used to classify the tweets. The final results should be displayed on a dashboard behind a public gateway.

The architecture is designed to be covered primarily by the AWS Free Tier while remaining scalable to some degree. Data Engineering and MLOps best practices are applied to build the system.


*This is an ongoing project



Batch Ingestion

This repository is intended to contain the code needed to set up tweet ingestion and processing, all the way to the silver layer.


Table of Contents

  1. Architecture
  2. Data-Lake
  3. Workflow

Architecture

The data is pulled from Twitter's Recent API using the av-tweet-ingestion package, running on a Lambda Function.

At first, the idea was to orchestrate the jobs with Airflow, but AWS Step Functions turned out to be better suited to the problem: it is cheaper and easier to set up.

For the data processing itself, different tools were evaluated, such as AWS EMR and AWS Glue Jobs. Lambda Functions ended up being the way to go, as the costs are low and the expected workload is not large. The processing can also be parallelized if necessary.

The table metadata is kept in the AWS Glue Data Catalog, which makes it possible to query the data and run analytics with AWS Athena.
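
As an illustration of querying the cataloged tables, here is a minimal sketch that submits a query through an Athena client. It assumes a boto3 Athena client is passed in; the table name `tweets`, the column `created_at`, the database name and the S3 output location are hypothetical and not taken from this repository.

```python
# Hypothetical query: the table and column names are assumptions.
TWEETS_PER_DAY_SQL = """
SELECT date_trunc('day', created_at) AS day, count(*) AS tweets
FROM tweets
GROUP BY 1
ORDER BY 1
"""


def start_athena_query(athena_client, sql, database, output_location):
    """Submit a query to Athena and return its execution id.

    athena_client is expected to behave like boto3.client("athena").
    The query runs asynchronously; results are written to output_location.
    """
    response = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


# Usage (requires AWS credentials; names below are placeholders):
#   import boto3
#   query_id = start_athena_query(
#       boto3.client("athena"), TWEETS_PER_DAY_SQL,
#       "my_silver_database", "s3://my-athena-results/",
#   )
```

Passing the client in as a parameter keeps the function easy to exercise in unit tests with a stub.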

The job runs at 7h, 14h and 21h (Brazilian time) and is scheduled through an AWS EventBridge rule.
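
EventBridge rule schedules are evaluated in UTC, so the local hours have to be shifted before they go into the cron expression. The sketch below shows that arithmetic. It assumes "Brazilian time" means Brasilia time (UTC-3) and that the job fires on the hour; the function name is illustrative and this is not necessarily how the repository's Terraform defines the rule.

```python
def eventbridge_cron(local_hours, utc_offset_hours=-3):
    """Build an EventBridge cron expression firing daily at the given local hours.

    utc_offset_hours=-3 is an assumption (Brasilia time).
    """
    # Convert each local hour to UTC and wrap around midnight.
    utc_hours = sorted({(hour - utc_offset_hours) % 24 for hour in local_hours})
    hours_field = ",".join(str(hour) for hour in utc_hours)
    # Fields: minutes hours day-of-month month day-of-week year
    return f"cron(0 {hours_field} * * ? *)"


print(eventbridge_cron([7, 14, 21]))  # cron(0 0,10,17 * * ? *)
```

Note that 21h local time becomes 0h UTC on the following day, which makes no difference for a schedule that runs every day.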

The code that runs in the Lambda Functions is containerized with Docker, which makes it easy to run unit tests and integration tests and to set up CI/CD workflows. The CI/CD pipeline was built with GitHub Actions and Terraform.

Two deployment environments were created, staging and production, because the jobs have been running since June 2022 and updates to the code and the infrastructure were needed.


Ingestion architecture


Data-Lake

The Data Lake is built with three layers: BRONZE, SILVER and GOLD. Ingested data lands in the BRONZE layer, in a partition reserved for raw files (JSON in this case). It is then processed just enough to be read in a tabular format (Parquet). The data is modeled as three tables: the TWEETS fact table and two dimension tables, USERS and PLACES. At this point the data is still kept in the BRONZE layer, even though it is already tabular. At each ingestion, new tweet data is appended to these tables.
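
The raw-to-tabular step can be pictured roughly as follows. This is a minimal sketch, not the repository's code: it assumes the raw JSON follows the Twitter API v2 response shape (`data` plus `includes.users` and `includes.places`), and every output column name is hypothetical. Writing the rows out as Parquet is left out.

```python
def split_payload(payload):
    """Split one raw API response into rows for the three bronze tables.

    Assumes the Twitter API v2 response layout; output column names
    are illustrative only.
    """
    includes = payload.get("includes", {})
    tweets = [
        {
            "tweet_id": tweet["id"],
            "author_id": tweet.get("author_id"),  # links to the USERS table
            "place_id": tweet.get("geo", {}).get("place_id"),  # links to PLACES
            "text": tweet.get("text"),
            "created_at": tweet.get("created_at"),
        }
        for tweet in payload.get("data", [])
    ]
    users = [
        {"user_id": user["id"], "username": user.get("username")}
        for user in includes.get("users", [])
    ]
    places = [
        {"place_id": place["id"], "full_name": place.get("full_name")}
        for place in includes.get("places", [])
    ]
    return {"tweets": tweets, "users": users, "places": places}


# Made-up example payload, for illustration only.
example = {
    "data": [{"id": "1", "author_id": "10", "text": "hello",
              "created_at": "2022-06-01T12:00:00.000Z",
              "geo": {"place_id": "p1"}}],
    "includes": {
        "users": [{"id": "10", "username": "someone"}],
        "places": [{"id": "p1", "full_name": "Sao Paulo, Brazil"}],
    },
}
tables = split_payload(example)
```

Keeping the fact table narrow, with only the keys `author_id` and `place_id` pointing to the dimension tables, is what lets user and place attributes be stored once rather than on every tweet.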

A second ETL process takes the data to the SILVER layer. It performs data cleansing: adjusting data types, removing duplicate records and ensuring that the data in the next layer can be trusted.
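
A bronze-to-silver cleaning pass along these lines might look like the sketch below. It is written in plain Python for clarity and rests on several assumptions: the key column `tweet_id`, the `created_at` timestamp column, and the "latest record wins" deduplication policy are all illustrative choices, not confirmed by the repository.

```python
from datetime import datetime, timezone


def to_silver(rows, key="tweet_id"):
    """Clean bronze rows: drop rows without a key, cast types, deduplicate."""
    cleaned = {}
    for row in rows:
        if not row.get(key):
            continue  # a row without its key cannot be trusted downstream
        out = dict(row)
        out[key] = str(out[key])
        created = out.get("created_at")
        if isinstance(created, str):
            # ISO 8601 timestamps may end in "Z" for UTC; normalize it so
            # fromisoformat also accepts it on older Python versions.
            out["created_at"] = datetime.fromisoformat(created.replace("Z", "+00:00"))
        # Assumed policy: a later record replaces an earlier duplicate.
        cleaned[out[key]] = out
    return list(cleaned.values())


bronze = [
    {"tweet_id": 1, "text": "first", "created_at": "2022-06-01T12:00:00.000Z"},
    {"tweet_id": "1", "text": "first (re-ingested)", "created_at": "2022-06-01T12:00:00.000Z"},
    {"tweet_id": None, "text": "broken row"},
    {"tweet_id": "2", "text": "second", "created_at": "2022-06-01T13:30:00.000Z"},
]
silver = to_silver(bronze)
```

Because the bronze tables are append-only, the same tweet can show up in more than one ingestion run, which is why deduplication belongs in this step.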


The schema of each table across the data lake layers is shown below.


Batch workflow - Step Functions



Workflow

AWS Step Functions is used to orchestrate the data processing through 5 Lambda Functions.

Get-Tweets:

Calls the Twitter API and performs the data ingestion.

Processing-Raw:

ETL from raw JSON files to tabular format.

Tweets-to-Silver:

ETL Tweets tables from bronze to silver.

Users-to-Silver:

ETL Users tables from bronze to silver.

Places-to-Silver:

ETL Places tables from bronze to silver.
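
To make the orchestration concrete, here is a minimal sketch of a state machine definition in Amazon States Language, built as a Python dictionary. It is not the repository's actual definition. The source does not say whether the three silver jobs run in sequence or in parallel, so the `Parallel` state is an assumption, as are the state name `To-Silver`, the function name `build_state_machine` and the placeholder ARN prefix.

```python
import json

SILVER_JOBS = ["Tweets-to-Silver", "Users-to-Silver", "Places-to-Silver"]


def build_state_machine(arn_prefix):
    """Return an Amazon States Language definition chaining the five Lambdas.

    arn_prefix is a placeholder such as
    "arn:aws:lambda:<region>:<account-id>:function".
    """

    def task(name, **transition):
        # A Task state that invokes the Lambda Function with the same name.
        return {"Type": "Task", "Resource": f"{arn_prefix}:{name}", **transition}

    return {
        "Comment": "Batch ingestion: raw tweets to the silver layer",
        "StartAt": "Get-Tweets",
        "States": {
            "Get-Tweets": task("Get-Tweets", Next="Processing-Raw"),
            "Processing-Raw": task("Processing-Raw", Next="To-Silver"),
            # Assumption: the three silver ETLs are independent of each
            # other, so they are modeled as parallel branches.
            "To-Silver": {
                "Type": "Parallel",
                "Branches": [
                    {"StartAt": name, "States": {name: task(name, End=True)}}
                    for name in SILVER_JOBS
                ],
                "End": True,
            },
        },
    }


definition = build_state_machine("arn:aws:lambda:us-east-1:123456789012:function")
definition_json = json.dumps(definition, indent=2)
```

The resulting JSON string is the kind of document that a Step Functions state machine takes as its definition, for example through Terraform.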


Batch workflow - Step Functions

