The purpose of this project is to efficiently collect, process, and store YouTube data using a combination of Apache Airflow, Apache Spark, and MongoDB.
Before setting up and running the YouTube Data Pipeline project, make sure you have the following prerequisites in place:
Environment Setup:
- Install and configure Apache Airflow.
- Install and configure Apache Spark on the target machine or cluster.
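For a local pip-based setup, the installs might look like the following. The Airflow version and Python version in the constraints URL are assumptions; pick the ones matching your environment, as recommended in the Airflow installation docs.

```shell
# Install Apache Airflow pinned against its official constraints file
# (version numbers here are examples -- match them to your environment)
pip install "apache-airflow==2.8.1" \
  --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.8.1/constraints-3.10.txt"

# Install PySpark, the Python interface to Apache Spark
pip install pyspark
```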
Youtube Developer Account:
- Obtain YouTube API credentials.
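Once you have an API key, requests to the YouTube Data API v3 are plain HTTPS calls with the key passed as a query parameter. A minimal sketch of building a `videos.list` request URL (the function name and the placeholder key are illustrative, not part of this repository):

```python
from urllib.parse import urlencode

API_BASE = "https://www.googleapis.com/youtube/v3"

def build_videos_url(api_key, video_ids, parts=("snippet", "statistics")):
    """Build a YouTube Data API v3 `videos.list` request URL."""
    query = urlencode({
        "key": api_key,                 # your API credential
        "id": ",".join(video_ids),      # one or more video ids, comma-separated
        "part": ",".join(parts),        # which resource parts to return
    })
    return f"{API_BASE}/videos?{query}"

# Example (the key is a placeholder, not a real credential):
url = build_videos_url("YOUR_API_KEY", ["dQw4w9WgXcQ"])
print(url)
```

The resulting URL can be fetched with any HTTP client; the response is JSON.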
MongoDB:
- Install MongoDB.
Access and Permissions:
- Grant necessary permissions for YouTube API access and AWS S3 resources.
Data Schema Understanding:
- Familiarize yourself with the structure of YouTube data returned by the API.
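As an orientation, a `videos.list` response item nests metadata under `snippet` and counters under `statistics` (where the counts arrive as strings). A hedged sketch of flattening one item into a record suitable for MongoDB; the field selection and function name are illustrative:

```python
def flatten_video(item):
    """Flatten one YouTube `videos.list` response item into a flat record."""
    snippet = item.get("snippet", {})
    stats = item.get("statistics", {})
    return {
        "video_id": item.get("id"),
        "title": snippet.get("title"),
        "channel": snippet.get("channelTitle"),
        "published_at": snippet.get("publishedAt"),
        # statistics values arrive as strings in the API response
        "view_count": int(stats.get("viewCount", 0)),
        "like_count": int(stats.get("likeCount", 0)),
    }

# A trimmed-down sample item mirroring the API's structure:
sample = {
    "id": "abc123",
    "snippet": {"title": "Demo", "channelTitle": "Chan",
                "publishedAt": "2024-01-01T00:00:00Z"},
    "statistics": {"viewCount": "42", "likeCount": "7"},
}
record = flatten_video(sample)
print(record)
```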
Apache Airflow Plugins:
- Identify and install required Airflow plugins based on project needs.
Spark Job Configuration:
- Develop Spark jobs and ensure the correct setup of dependencies and configurations.
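As one example of the kind of transformation a Spark job here might apply, the API's `contentDetails.duration` field uses ISO 8601 durations (e.g. `"PT1H2M10S"`). Shown below as plain Python rather than a registered Spark UDF, since the exact job code is project-specific:

```python
import re

# Matches ISO 8601 durations of the form PT#H#M#S, each component optional
_DURATION_RE = re.compile(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?")

def duration_seconds(iso):
    """Convert a YouTube ISO 8601 duration string to total seconds."""
    m = _DURATION_RE.fullmatch(iso or "")
    if not m:
        return 0
    hours, minutes, seconds = (int(g) if g else 0 for g in m.groups())
    return hours * 3600 + minutes * 60 + seconds

print(duration_seconds("PT1H2M10S"))  # 1*3600 + 2*60 + 10 = 3730
```

In a Spark job this function could be wrapped with `pyspark.sql.functions.udf` and applied column-wise.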
Follow these steps to set up and run the YouTube Data Pipeline:
- Clone the repository.
- Install dependencies using the provided `requirements.txt`:

```shell
pip install -r requirements.txt
```
- Run the Airflow DAG to initiate the data pipeline.
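Assuming the DAG file is in Airflow's `dags/` folder, running the pipeline from the CLI might look like this. The DAG id `youtube_data_pipeline` is a placeholder; use the id defined in this repository's DAG file.

```shell
# Start Airflow locally (webserver + scheduler) for development
airflow standalone

# In another terminal, trigger the pipeline DAG by its id
airflow dags trigger youtube_data_pipeline
```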