OSS-Datastore

The goal of this package is to gather data from GitHub via API calls, from both v3 and v4 APIs, and then store said data for further analysis in a data warehouse.

Goals

GitHub doesn't keep information around indefinitely; much of it is retained for only a short time, usually around 14 days. Because of this, it is difficult to track information about your GitHub organizations (orgs) and repositories (repos) and to plan the best places to focus your time and resources. This is exacerbated by the fact that requests to the GitHub APIs are finicky when asking for multiple large sets of data and can get your IP address blocked from making API requests for a period of time.

To get around this, OSS-Datastore will establish a pipeline for ingesting data from the GitHub APIs, run ETL (extract/transform/load) operations, store the data in Redshift using the data vault modeling technique, and establish a base set of information marts (infomarts) for end-user analysis and consumption. In this modeling technique, data moves from multiple sources into a staging area, ETL actions load that data into the data vault, and the data is then spun out into infomarts, which are where end users actually interact with the data. Below is a rough diagram showing, at a high level, the flow of data through the system.

overview

Items that will be tracked include:

  • Stars/forks/watchers
  • Repo traffic/stats
  • Issue metrics
  • Pull request activity
    • New pull requests
    • Replies or updates required
  • Tracking user (member and external collaborator) access levels
  • Security issues in a repo and its dependencies

This isn't a definitive list of the information that will be tracked; it will expand as the package grows. The tentative data model can be found in overview
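The stars/forks/watchers metrics above come from the GitHub v4 (GraphQL) API. As a minimal sketch, a helper could build the request body for one such query; the helper name is illustrative, not the package's actual code, though `stargazerCount` is a real field on the GraphQL `Repository` type:

```python
import json


def build_star_count_payload(owner: str, repo: str) -> str:
    """Build the JSON body for a GitHub v4 (GraphQL) request that
    asks for a repository's stargazer count."""
    query = """
    query($owner: String!, $name: String!) {
      repository(owner: $owner, name: $name) {
        stargazerCount
      }
    }
    """
    return json.dumps(
        {"query": query, "variables": {"owner": owner, "name": repo}}
    )
```

The resulting body would be POSTed to https://api.github.com/graphql with an `Authorization: bearer <token>` header.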

Setup for CLI

You need to have pipenv installed locally

pip install --user pipenv

To activate shell:

pipenv shell

To install new runtime dependencies

pipenv install <package name>

To install new dev dependencies

pipenv install <package name> --dev

To run, copy sample.env to .env and fill in the config info, then either source the file or export the environment variables when running from the command line, and run

pipenv run python datastore.py
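Since the configuration arrives through the process environment, datastore.py can read it at startup. A minimal sketch of that pattern, failing fast when a variable is missing (the variable name GITHUB_TOKEN is an assumption; check sample.env for the actual names):

```python
import os


def load_config() -> dict:
    """Read required settings from the environment, failing fast
    when one is missing (e.g. because .env was never sourced)."""
    required = ["GITHUB_TOKEN"]  # assumed name; see sample.env
    missing = [name for name in required if name not in os.environ]
    if missing:
        raise RuntimeError(f"Missing environment variables: {missing}")
    return {name: os.environ[name] for name in required}
```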

Additional setup for AWS

In order to get things set up for running in AWS, you will need to source your completed .env file and run the AWS CDK bootstrap.

source .env

cdk bootstrap

For more information see the AWS CDK docs.

Deploying to AWS

Due to limitations in the AWS CDK, you will first need to follow AWS Secrets Manager's Create a Basic Secret guide and set the secret name and key to 'OSS-Datastore-GitHub-Token'. After that, all you have left to do is run the following from the root folder:

./deploy_lambda.sh

This will create and launch a CloudFormation template to build the resources we need in AWS. **** THIS WILL COST YOU MONEY ****
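At runtime, the deployed code can read the token back out of Secrets Manager. A sketch using the real boto3 `get_secret_value` call, with the client passed in so it can be stubbed in tests; the secret name matches the one created above, everything else is illustrative:

```python
def get_github_token(secrets_client) -> str:
    """Fetch the GitHub token stored under the secret name the
    deploy step expects. `secrets_client` is a boto3 Secrets
    Manager client, passed in so tests can substitute a stub."""
    response = secrets_client.get_secret_value(
        SecretId="OSS-Datastore-GitHub-Token"
    )
    return response["SecretString"]
```

In the Lambda itself this would be called with a real client, e.g. `get_github_token(boto3.client("secretsmanager"))`.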

Notes on running in AWS

This will charge YOUR account so you should keep that in mind when running this package.

The cron scheduler is set up to run once every 24 hours, kicking off at 00:00 GMT; this time was selected to ensure the pulls don't interfere with PST working hours. The schedule can be altered in oss_datastore_lambda_stack.py.

Note on the Dead Letter Queue

In order to ensure that our data is being pulled from the GitHub API and stored, I track and store failed requests in a Dead Letter Queue (DLQ) for manual action. My intention is to attach monitoring to the DLQ and alarm/alert when a new item is added so an admin can investigate why the request failed. Since a request can fail for many different reasons, I track information about which API version failed, the query/request that failed, and the response body/headers/status code for analysis. From there it is on the admin to take action to ensure the request is retried and the data is successfully stored.
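The failure record described above can be sketched as a small structure; the field names here are assumptions for illustration, not the package's actual schema:

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class FailedRequest:
    """One entry in the Dead Letter Queue: enough context for an
    admin to diagnose and retry a failed GitHub API call."""
    api_version: str   # "v3" (REST) or "v4" (GraphQL)
    query: str         # the query/request that failed
    status_code: int   # response status code
    headers: dict      # response headers
    body: str          # response body

    def to_message(self) -> str:
        """Serialize the record for the queue."""
        return json.dumps(asdict(self))
```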

Development tracker

  • Staging area work
    • request, locally store, and push files to S3 for the following
    • Cron job to kick-off new data requests
  • Refine data vault model in docs/images/GH_Data_Vault_Layout.jpeg
  • Data warehouse construction via AWS Cloudformation
    • Define and build the hubs
    • Define and build the links
    • Define and build the satellites
    • Define and build the metrics vault
    • Define and build the business vault
  • ETL functionality
    • Staging to data vault
    • Staging to long term storage in Amazon S3
  • AWS automation via CDK
    • Staging creation/configuration
    • Datamart creation/configuration
  • Data exploration via Amazon QuickSight

License

This library is licensed under the Apache 2.0 License.
