
github-data-pipeline's Introduction

Open Source Companies' Data Pipeline using GitHub

In this project, I build a data pipeline that collects the profiles of the top 35 open source companies together with their repositories, commits, members, and related data. The pipeline updates this data and runs the analysis hourly.

Data source

GitHub APIs (or GitHub REST APIs) are the APIs that you can use to interact with GitHub. They allow you to create and manage repositories, branches, issues, pull requests, and more. To fetch publicly available information (such as public repositories and user profiles), you can call the API without authentication. For other actions, you need to provide an authentication token.

- Company : https://api.github.com/orgs/<Company Name>

- User : https://api.github.com/users/<User Name>

- Company Member : https://api.github.com/orgs/<Company Name>/members

- Repository : https://api.github.com/repos/<Company Name>/<Repository Name>

- Repository Commit : https://api.github.com/repos/<Company Name>/<Repository Name>/commits

- Repository Language : https://api.github.com/repos/<Company Name>/<Repository Name>/languages

- Repository Tag : https://api.github.com/repos/<Company Name>/<Repository Name>/tags

- License : https://api.github.com/licenses/<License Key>

- Repository Activity : https://github.com/orgs/<Company Name>/repositories
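As a minimal sketch of how the API endpoints listed above could be called from Python, the helper below builds each URL from a template and fetches it as JSON using only the standard library. The names `ENDPOINTS`, `build_url`, and `fetch_json` are hypothetical and are not taken from this project's code; passing a token is optional for public data.

```python
import json
import urllib.request
from urllib.parse import quote

API_ROOT = "https://api.github.com"

# Path templates for the REST endpoints listed in this section.
# The dictionary keys are illustrative names, not names used by the project.
ENDPOINTS = {
    "company": "/orgs/{org}",
    "user": "/users/{user}",
    "company_members": "/orgs/{org}/members",
    "repository": "/repos/{org}/{repo}",
    "repository_commits": "/repos/{org}/{repo}/commits",
    "repository_languages": "/repos/{org}/{repo}/languages",
    "repository_tags": "/repos/{org}/{repo}/tags",
    "license": "/licenses/{license_key}",
}


def build_url(kind, **parts):
    """Fill the template for `kind`, URL-encoding every path segment."""
    safe_parts = {key: quote(str(value), safe="") for key, value in parts.items()}
    return API_ROOT + ENDPOINTS[kind].format(**safe_parts)


def fetch_json(url, token=None):
    """GET `url` and decode the JSON body; `token` is optional for public data."""
    headers = {
        "Accept": "application/vnd.github+json",
        "User-Agent": "github-data-pipeline-example",
    }
    if token:
        headers["Authorization"] = f"Bearer {token}"
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

For example, `fetch_json(build_url("company", org="apache"))` would request the public profile of the `apache` organization. Unauthenticated calls are rate-limited by GitHub, so an hourly pipeline would likely pass a token.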

For Repository Activity, I used BeautifulSoup, a Python package for parsing HTML and XML documents, including documents with malformed markup such as unclosed tags (the name comes from "tag soup"). It creates a parse tree for each parsed page that can be used to extract data from HTML, which is useful for web scraping.
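The following sketch shows the general shape of such a BeautifulSoup extraction. The HTML sample, its class names, and the function `parse_repository_activity` are hypothetical: the real markup of the GitHub repositories page is not described in this README and changes over time, so the selectors would need to be adapted. The last list item is deliberately left unclosed to show that the parser still recovers it.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a repository listing page.
# The last <li> and its <a> are intentionally not closed.
SAMPLE_HTML = """
<ul>
  <li class="repo">
    <a class="repo-link" href="/example-org/alpha">alpha</a>
    <relative-time datetime="2022-06-01T10:00:00Z">Jun 1, 2022</relative-time>
  </li>
  <li class="repo">
    <a class="repo-link" href="/example-org/beta">beta</a>
  </li>
  <li class="repo">
    <a class="repo-link" href="/example-org/gamma">gamma
</ul>
"""


def parse_repository_activity(html):
    """Return one dict per repository entry found in `html`."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.find_all("li", class_="repo"):
        link = item.find("a", class_="repo-link")
        if link is None:
            continue  # skip entries without a repository link
        stamp = item.find("relative-time")
        rows.append(
            {
                "name": link.get_text(strip=True),
                "path": link["href"],
                "updated_at": stamp["datetime"] if stamp is not None else None,
            }
        )
    return rows
```

Calling `parse_repository_activity(SAMPLE_HTML)` yields three entries, one per list item, with `updated_at` set to `None` where no timestamp element is present.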

Architecture

Use Case

Tools & Technologies

Database Design

Use Case

  • LICENSE : Open source software needs to be licensed (to grant people other than the creator the right to modify and distribute it), so the repository that hosts the software should include a license file.
  • USER : Includes both Companies/Organizations and normal Users
  • REPOSITORY : A central location where the user's data is stored.
  • COMMIT : A saved set of changes in the repository history, created with the commit command.
  • REPOSITORY TAG : A marker on a specific commit in the repository history
  • REPOSITORY CONTRIBUTOR : A User who has at least 1 commit in a specific repository
  • REPOSITORY LANGUAGE : Programming languages that a repository uses
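To make the REPOSITORY CONTRIBUTOR definition concrete, here is a small sketch that derives contributors from a list of commits: any user with at least one commit in a repository counts as a contributor of that repository. The `Commit` fields and the function name are assumptions made for illustration and do not reflect the actual table columns.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    # Hypothetical fields; the real COMMIT table may use different columns.
    sha: str
    repository: str    # for example "example-org/alpha"
    author_login: str


def contributors_by_repository(commits):
    """Map each repository to {user login: commit count}.

    A user appears under a repository only if they have at least
    one commit there, which matches the definition given above.
    """
    counts = Counter((c.repository, c.author_login) for c in commits)
    result = {}
    for (repository, login), commit_count in counts.items():
        result.setdefault(repository, {})[login] = commit_count
    return result
```

For instance, two commits by `alice` and one by `bob` in the same repository produce `{"alice": 2, "bob": 1}` for that repository.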

Final Result

Setup

Prerequisite

Running

  1. Docker - Airflow - Spark

    a) Service

    When you check the Dockerfile and Docker Compose file in the airflow directory, you can see the main services that I used:

    • Postgres - Airflow

      • Version 13
      • Port 5432
    • Airflow Webserver

      • Version Apache Airflow 2.3.1
      • Port 8080
    • Spark Master

      • Version Bitnami Spark 3.2.1
      • Port 8181
    • PG Admin

      • Version 4
      • Port 5050

    b) Running

    cd airflow
    docker-compose build

    The first build takes about 10-15 minutes.

    To start the containers:

    docker-compose up

    The first start can also take 10-15 minutes. Once the services are up, open another terminal and run:

    docker ps

    If the setup succeeded, the output lists the running containers.

    Now you can check that each service is reachable on the port listed above.

    Next you need to add connection on Airflow for

    If you can go there, you have successfully set up Airflow. Now you can run all the DAGs except dbt_analysis_daily (I explain how to run it in the next section).

    c) Result

    • DAG create_postgres_database (Create tables)

    • DAG spark-postgres (Load historical data)

    • DAG update_github_repo_hourly (Load data hourly)

  2. DBT

    As noted in the previous section, running the DAG dbt_analysis_daily at this point throws an error. Let's set up what is necessary to run it successfully.

    Open a new terminal and run:

    docker ps

    Find the first Airflow container's id and copy it. Then run the following, replacing bdd46fa14ba4 with your own container id:

    docker exec -it bdd46fa14ba4 bash
    cd /opt/dbt
    dbt build

    After these commands complete, the dbt setup is done.

    Now you can go back to Airflow and run the DAG dbt_analysis_daily.
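As a quick sanity check for the services listed under Docker - Airflow - Spark, the sketch below probes each service port with a plain TCP connection. It assumes the ports are published on `localhost` with the same numbers shown in the service list, which this README does not state explicitly; the function names are hypothetical.

```python
import socket

# Ports taken from the service list in this README.
# Assumption: each one is published on the Docker host under the same number.
SERVICES = {
    "Postgres": 5432,
    "Airflow Webserver": 8080,
    "Spark Master": 8181,
    "PG Admin": 5050,
}


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host="localhost", services=None):
    """Return {service name: reachable?} for every configured port."""
    services = SERVICES if services is None else services
    return {name: is_port_open(host, port) for name, port in services.items()}
```

Calling `check_services()` after `docker-compose up` returns a dictionary such as `{"Postgres": True, ...}`. An open port only shows that something is listening, so it does not replace opening the Airflow web page itself.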

github-data-pipeline's People

Contributors

uts58
