
PISA Project

The final project of the Makers Data Engineering specialist track.

Project Overview

The goal of the project was to analyse the PISA 2018 dataset and develop a functioning dashboard that the client (GEI) can use to easily visualise and interpret the data.

The project was carried out in a distributed environment in the Cloud to allow for efficient data processing and collaboration.

It was a team project, which I completed with two colleagues. We used GitHub throughout the project for version control.

About Global Education Insights (GEI)

Global Education Insights (GEI) is a non-profit organisation dedicated to improving education systems worldwide. They work with governments, educational institutions, and stakeholders to provide data-driven insights and recommendations for educational policy and practice.

About PISA

PISA is the OECD's "Programme for International Student Assessment", which assesses educational practices around the world and uses its findings to influence policy.

The Dataset

The dataset used in this project is a real-life dataset, collected in 2018 by PISA. It contains responses from hundreds of thousands of students, teachers and educational facilitators from 79 countries, providing information about their backgrounds and experiences in education.

For more information about the 2018 dataset, please follow this link.

For the purposes of this project, we took a sample of 20 countries. Responses were submitted gradually over time, to simulate the real-life scenario of collecting data.

The dataset began at around 100 responses, but grew to over 100,000 in little over a week's time.

Extract from the raw data: [image: raw_data_extract]

Structure

  • 20 source databases, each holding data from a different country
  • 1,120 columns in each database
  • Varying amounts of rows in each database, but all constantly growing over time
  • Column names are coded; the PISA index is required to decode the data (see the sketch below)
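
To illustrate the decoding step, here is a minimal sketch, assuming a pandas-based transform and a small excerpt of the codebook (the mapping shown is illustrative, not the full 1,120-column index):

```python
import pandas as pd

# Illustrative excerpt of the PISA codebook: coded column name -> readable name.
# The real index covers all 1,120 columns.
CODEBOOK = {
    "CNT": "country_code",
    "ST004D01T": "student_gender",
    "PV1MATH": "maths_score_plausible_value_1",
}

def decode_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Rename coded survey columns to human-readable names where known."""
    return df.rename(columns=CODEBOOK)

# Example usage with a tiny frame of coded responses
raw = pd.DataFrame({"CNT": ["GBR"], "ST004D01T": [1], "PV1MATH": [512.3]})
print(decode_columns(raw).columns.tolist())
# ['country_code', 'student_gender', 'maths_score_plausible_value_1']
```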

Project Objectives

The main goal was to turn this unreadable, coded mass of data into something salient and insightful.

The project had three levels of challenge:

  • Level 1: develop dashboard charts displaying correct summary data that is no more than an hour old.
  • Level 2: the data should be no more than a minute old.
  • Level 3: the data should be up-to-the-second.

Solution Overview

For our solution we implemented the following:

  • Airflow to routinely extract data from 20 source databases and load into an analytical database held on AWS RDS
  • Flask app to transform and serve data to our dashboard app
  • Dashboard app to poll the endpoint each second, constantly updating data visualisations as responses are submitted
  • All running in the Cloud, using AWS EC2 and Render (PaaS provider)

Data Pipeline - Planning Stage: [image: pipeline_plan]

Data Pipeline - Implemented: [image: pipeline_implemented]

Stack

The tech stack used in this project: [image: project_stack]

Apache Airflow

Workflow orchestration tool used to pull data from 20 source databases and load them into an analytical database held on AWS RDS.

Tasks and their order of execution were defined in a Directed Acyclic Graph (DAG):

[image: overview_of_dag_flow (PISA DAG)]

The DAG was scheduled to run every 30 seconds, continuously and automatically pulling any new submissions and pooling them into the central database on which our dashboard was based.

The DAG was hosted inside an AWS EC2 instance to keep it running continuously, independent of local machines.
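
As a rough sketch of how such a DAG might be wired up, assuming Airflow 2.4+, PythonOperator-based tasks, and illustrative task and helper names (the real DAG fanned out over all 20 source databases):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

COUNTRIES = ["GBR", "FRA", "DEU"]  # illustrative; the real project used 20 country databases

def extract_new_responses(country: str, **context):
    """Pull only responses newer than the last successful run for one country."""
    # ... query the country's source database here ...

def load_to_analytical_db(country: str, **context):
    """Upsert the extracted rows into the central analytical database on RDS."""
    # ... write to the analytical database here ...

with DAG(
    dag_id="pisa_etl",
    start_date=datetime(2023, 1, 1),
    schedule=timedelta(seconds=30),  # run every 30 seconds
    catchup=False,
    max_active_runs=1,
) as dag:
    for country in COUNTRIES:
        extract = PythonOperator(
            task_id=f"extract_{country}",
            python_callable=extract_new_responses,
            op_kwargs={"country": country},
        )
        load = PythonOperator(
            task_id=f"load_{country}",
            python_callable=load_to_analytical_db,
            op_kwargs={"country": country},
        )
        extract >> load
```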

Benefits of approach:

  • Reusable - the code accepts a list of countries; more can easily be added to increase the scope of the project.
  • Scalable - designed to extract only new entries to reduce latency; runtime remained consistent despite the growing dataset.
  • Data Integrity - duplicates were checked for and conflicting rows updated with the newest values, keeping the data consistent and fresh (see the upsert sketch below).
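
As a sketch of that duplicate-and-conflict handling, assuming a PostgreSQL analytical database accessed with psycopg2 and an illustrative responses table keyed on response_id:

```python
import psycopg2

# Illustrative upsert: insert new responses; on a duplicate key,
# overwrite the existing row with the newer values.
UPSERT_SQL = """
INSERT INTO responses (response_id, country_code, submitted_at, payload)
VALUES (%s, %s, %s, %s)
ON CONFLICT (response_id)
DO UPDATE SET
    country_code = EXCLUDED.country_code,
    submitted_at = EXCLUDED.submitted_at,
    payload      = EXCLUDED.payload;
"""

def load_rows(dsn: str, rows: list[tuple]) -> None:
    """Upsert a batch of extracted rows into the analytical database."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.executemany(UPSERT_SQL, rows)
```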

Limitations:

  • The minimum runtime of the DAG was 15 seconds, meaning the analytical database would always be slightly out of date rather than precisely real-time.
  • This could be improved by reconfiguring concurrency / parallelism settings, or by redesigning the project around a data streaming service such as Kafka.

Flask App

We designed the Flask app to extract data from the analytical database and then perform transformations on it, converting it to a usable format for the metrics dashboard.

The app was then hosted on Render, a Cloud platform service. Our web service supported Continuous Integration/Continuous Deployment (CI/CD) practices, and would immediately redeploy each time the app was updated on GitHub.

[image: web_server_on_render]

Each time a GET request was made to the web service, the app would query the analytical database, providing the most up-to-date data and allowing our dashboard to give real-time insights.
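
A minimal sketch of what such an endpoint might look like (the route, metric, table, and ANALYTICAL_DB_URL environment variable are assumptions for illustration; the full app lives in the 'render' directory):

```python
import os

import psycopg2
from flask import Flask, jsonify

app = Flask(__name__)

def get_connection():
    """Connect to the analytical database (credentials assumed to come from an env var)."""
    return psycopg2.connect(os.environ["ANALYTICAL_DB_URL"])

@app.route("/metrics")
def metrics():
    """Query the analytical database on each request and return dashboard-ready JSON."""
    with get_connection() as conn:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT country_code, AVG(maths_score) "
                "FROM responses GROUP BY country_code;"
            )
            rows = cur.fetchall()
    # Shape the result for the dashboard (illustrative metric)
    return jsonify({"average_maths_score_by_country": {c: float(avg) for c, avg in rows}})

if __name__ == "__main__":
    app.run()
```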

The full code for this app (with comments) can be found in the 'render' directory of this repository.

Dashboard

Forage

The dashboard was supplied to us by Makers. It was a custom React app, named "Forage", designed to accept data in JSON format.

We reverse engineered the dashboard app to ascertain the format our data needed to be in for the app to function correctly.

The dashboard was configured to poll a URL every second, so that the visualisations would constantly update with new data, providing real-time insights.

Overview of the completed dashboard: [image: dashboard_overview]

For more information on the metrics used and on how they were extracted, see the Dashboard Metrics notebook in this repository.

Demo Day

As a final part of the project, we presented and demonstrated our application to stakeholders both internal and external to Makers Academy, to an audience with a mix of technical and non-technical backgrounds.

Conclusion

In one week, the team and I developed an ETL pipeline that batch-processed raw data from 20 different sources, transformed it into valuable insights, and presented them in an attractive and intuitive metrics dashboard.

The entire application was hosted in the Cloud and automated, allowing resources to be fully accessible and for our dashboard to be continuously refreshed with the most current data.

Our application provided near real-time insights which could be used to guide policy and influence decision-makers across the world.


This project is licensed under MIT.

