In this project, I will create data pipeline for collecting top 35 open source company profile and their repositories, commits, members, ... I update them and do analysis hourly.
Github APIs (or Github ReST APIs) are the APIs that you can use to interact with GitHub. They allow you to create and manage repositories, branches, issues, pull requests, and many more. For fetching publicly available information (like public repositories, user profiles, etc.), you can call the API. For other actions, you need to provide an authenticated token.
- Company : https://api.github.com/orgs/<Company Name>
- User : https://api.github.com/users/<User Name>
- Company Member : https://api.github.com/orgs/<Company Name>/members
- Repository : https://api.github.com/repos/<Company Name>/<Repository Name>
- Repository Commit : https://api.github.com/repos/<Company name>/<Repository Name>/commits
- Repository Language : https://api.github.com/repos/<Company Name>/<Repository Name>/languages
- Repository Tag : https://api.github.com/repos/<Company Name>/<Repository Name>/tags
- License : https://api.github.com/licenses/<License Key>
- Repository Activity : https://github.com/orgs/<Company Name>/repositories
With Repository Activity, I used BeautifulSoup, a Python package for parsing HTML and XML documents (including having malformed markup, i.e. non-closed tags, so named after tag soup). It can creates a parse tree for parsed pages that can be used to extract data from HTML, which is useful for web scraping.
- Cloud - Supabase Cloud
- Containerization - Docker, Docker Compose
- Batch Processing - Spark
- Orchestration - Airflow
- Transformation - dbt
- Data Lake - Supabase Storage
- Data Warehouse - Postgres
- Data Visualization - Power BI
- Language - Python
- LICENSE : Open source software need to be licensed (to provide rights to modify, distribute other than creator) so the repository in which the software dwells should include the license file.
- USER : Include Company/Organization and normal User
- REPOSITORY : A central location where the user's data is stored.
- COMMIT : A command that is used to save your changes to the local repository.
- REPOSITORY TAG : A commit that is marked at a point in your repository history
- REPOSITORY CONTRIBUTOR : A User who has at least 1 commit for a specific repository
- REPOSITORY LANGUAGE : Programming languages that a repository used
-
Install Docker and Docker Compose
-
Create your personal access token on Github. If you don't do this, you can only extract 60 times/hour instead of 5000 times/hour using the GitHub rest api.
-
Remember to use use your own key and version in Docker and Docker Compose file.
-
Docker - Airflow - Spark
a) Service
When you check the Docker and Docker Compose file in airflow directory, you can see main services that I used:
-
Postgres - Airflow
- Version 13
- Port 5432
-
Airflow Webserver
- Version Apache Airflow 2.3.1
- Port 8080
-
Spark Master
- Version Bitnami Spark 3.2.1
- Port 8181
-
PG Admin
- Version 4
- Port 5050
b) Running
cd airflow Docker-compose build
You will need to wait 10-15 minutes for the first time.
To run the VM:
Docker-compose up
You also will be asked to wait 10-15 minutes for the first time. After running successfully, you can open another terminal and run:
Docker ps
Now you can check whether it runs or not by using below service urls.
- Airflow : http://localhost:8080/
- PG Admin : http://localhost:5050/
- Spark : http://localhost:8181/
Next you need to add connection on Airflow for
If you can go there, you have successfully setup Airflow. Now you can run all the tasks except dbt_analysis_daily (I will instruct you to run this in the next section).
c) Result
-
-
DBT
As you can see in the previous section, If you run DAG dbt_analysis_daily, system will throw error. Now let's setup nescessary things to run it sucessfullly.
Open new terminal and run:
docker ps
Let's find the first Airflow container's id and copy it: Then run:
docker exec -it bdd46fa14ba4 bash cd /opt/dbt dbt build
After running these commands, you will see:
Now you can go back to Airflow and run DAB dbt_analysis_daily