The purpose of this project is to efficiently collect, process, and store YouTube data using a combination of Apache Airflow, Apache Spark, and MongoDB.
Before setting up and running the YouTube Data Pipeline project, make sure you have the following prerequisites in place:
Environment Setup:
- Install and configure Apache Airflow.
- Install and configure Apache Spark on the target machine or cluster.
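For a local pip-based setup, the installs might look like the following. The Airflow version and Python version in the constraints URL are assumptions; pick the ones matching your environment, as recommended in the Airflow installation docs.

```shell
# Install Apache Airflow pinned against its official constraints file
# (version numbers here are examples -- match them to your environment)
pip install "apache-airflow==2.8.1" \
  --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.8.1/constraints-3.10.txt"

# Install PySpark, the Python interface to Apache Spark
pip install pyspark
```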
Youtube Developer Account:
- Obtain YouTube API credentials.
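Once you have an API key, requests to the YouTube Data API v3 are plain HTTPS calls with the key passed as a query parameter. A minimal sketch of building a `videos.list` request URL (the function name and the placeholder key are illustrative, not part of this repository):

```python
from urllib.parse import urlencode

API_BASE = "https://www.googleapis.com/youtube/v3"

def build_videos_url(api_key, video_ids, parts=("snippet", "statistics")):
    """Build a YouTube Data API v3 `videos.list` request URL."""
    query = urlencode({
        "key": api_key,                 # your API credential
        "id": ",".join(video_ids),      # one or more video ids, comma-separated
        "part": ",".join(parts),        # which resource parts to return
    })
    return f"{API_BASE}/videos?{query}"

# Example (the key is a placeholder, not a real credential):
url = build_videos_url("YOUR_API_KEY", ["dQw4w9WgXcQ"])
print(url)
```

The resulting URL can be fetched with any HTTP client; the response is JSON.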
MongoDB:
- Install MongoDB.
Access and Permissions:
- Grant necessary permissions for YouTube API access and AWS S3 resources.
Data Schema Understanding:
- Familiarize yourself with the structure of YouTube data returned by the API.
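As an orientation, a `videos.list` response item nests metadata under `snippet` and counters under `statistics` (where the counts arrive as strings). A hedged sketch of flattening one item into a record suitable for MongoDB; the field selection and function name are illustrative:

```python
def flatten_video(item):
    """Flatten one YouTube `videos.list` response item into a flat record."""
    snippet = item.get("snippet", {})
    stats = item.get("statistics", {})
    return {
        "video_id": item.get("id"),
        "title": snippet.get("title"),
        "channel": snippet.get("channelTitle"),
        "published_at": snippet.get("publishedAt"),
        # statistics values arrive as strings in the API response
        "view_count": int(stats.get("viewCount", 0)),
        "like_count": int(stats.get("likeCount", 0)),
    }

# A trimmed-down sample item mirroring the API's structure:
sample = {
    "id": "abc123",
    "snippet": {"title": "Demo", "channelTitle": "Chan",
                "publishedAt": "2024-01-01T00:00:00Z"},
    "statistics": {"viewCount": "42", "likeCount": "7"},
}
record = flatten_video(sample)
print(record)
```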
Apache Airflow Plugins:
- Identify and install required Airflow plugins based on project needs.
Spark Job Configuration:
- Develop Spark jobs and ensure the correct setup of dependencies and configurations.
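As one example of the kind of transformation a Spark job here might apply, the API's `contentDetails.duration` field uses ISO 8601 durations (e.g. `"PT1H2M10S"`). Shown below as plain Python rather than a registered Spark UDF, since the exact job code is project-specific:

```python
import re

# Matches ISO 8601 durations of the form PT#H#M#S, each component optional
_DURATION_RE = re.compile(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?")

def duration_seconds(iso):
    """Convert a YouTube ISO 8601 duration string to total seconds."""
    m = _DURATION_RE.fullmatch(iso or "")
    if not m:
        return 0
    hours, minutes, seconds = (int(g) if g else 0 for g in m.groups())
    return hours * 3600 + minutes * 60 + seconds

print(duration_seconds("PT1H2M10S"))  # 1*3600 + 2*60 + 10 = 3730
```

In a Spark job this function could be wrapped with `pyspark.sql.functions.udf` and applied column-wise.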
Follow these steps to set up and run the YouTube Data Pipeline:
- Clone the repository.
- Install dependencies using the provided `requirements.txt`:

```shell
pip install -r requirements.txt
```
- Run the Airflow DAG to initiate the data pipeline.
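Assuming the DAG file is in Airflow's `dags/` folder, running the pipeline from the CLI might look like this. The DAG id `youtube_data_pipeline` is a placeholder; use the id defined in this repository's DAG file.

```shell
# Start Airflow locally (webserver + scheduler) for development
airflow standalone

# In another terminal, trigger the pipeline DAG by its id
airflow dags trigger youtube_data_pipeline
```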