
Python 3.6 Spark Kafka Druid license

iGrid

Making the power grid smarter. An Insight 2020 May Data Engineering Project by Kevin Yi-Wei Lin.

Table of Contents

  1. Background
  2. Demo
  3. Slide Deck
  4. Data Pipeline
  5. Repository Structure
  6. Instructions

Background

Power outages cause economic loss. With the rise of working from home, the aging power grid can struggle to adjust to new residential demand patterns. To reduce the chance of blackouts, utility companies need a way to turn off or throttle non-essential appliances when regional power demand and supply fall out of balance. Rather than overhauling grid infrastructure, utilities can exploit the spread of smart appliances and smart plugs, controlling them from the cloud through home assistants when needed, guided by both long-term and short-term metrics.

Demo

iGrid demo

Slide Deck

Link

Data Pipeline

The pipeline was designed to separate fine-grained analysis (Spark) from coarse-grained analysis. The latter needs only a Python script, thanks to Druid's roll-up on ingestion and its powerful query engine. The current setup handles at least 100k messages/s from 10k appliances, a message velocity on the level of a single power distribution station.
pipeline

Repository Structure

├── batch               Batch processing Python script with Airflow
├── data                Usage of GREEND and REDD data sets
├── database            Druid config files
├── example config      Example configuration file 
├── frontend            Imply Pivot config files
├── ingestion           Kafka producer scripts
└── stream_processing   Spark Structured Streaming script

Instructions

Data sets

  • The Reference Energy Disaggregation Data Set (REDD) [1]:
    The low frequency data set was used.
  • GREEND: Energy Metering Data Set [2]:
    Version GREEND_0-2_300615.zip was used. Please refer to the instructions in /data.

Cluster setup
(I strongly recommend that future fellows use AWS managed clusters.)

  • Kafka v2.2.1: AWS MSK three m5.large nodes
  • Spark v2.4.5: AWS EMR v5.30.1 three m5.large nodes (1 master and 2 workers) with bootstrap action script: stream_processing/init_emr.sh
  • Druid v0.18.1: single server "small" using i3.2xlarge
  • Kafka producers: four t2.xlarge
  • Batch with Airflow: t2.small
  • Imply Pivot: t2.medium
    pip3 requirement files are in individual folders

Create Kafka Topics
Create the topics powerraw, history, and dutycycle, each with 6 partitions and a replication factor of 2.
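A minimal sketch of building the corresponding kafka-topics.sh invocations; the bootstrap address shown is a placeholder for the MSK bootstrap string, and the flags are those of Kafka 2.2's CLI:

```python
# Topic names, partition count, and replication factor from the README.
TOPICS = ["powerraw", "history", "dutycycle"]

def topic_cmd(name, bootstrap, partitions=6, replication=2):
    """Return the kafka-topics.sh invocation that creates one topic."""
    return (
        f"kafka-topics.sh --create --topic {name} "
        f"--bootstrap-server {bootstrap} "
        f"--partitions {partitions} --replication-factor {replication}"
    )

if __name__ == "__main__":
    # Placeholder broker address; substitute your MSK bootstrap string.
    for t in TOPICS:
        print(topic_cmd(t, "b-1.example.kafka.us-east-1.amazonaws.com:9092"))
```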

Initiate Druid Datasources
Change the relevant addresses in the specifications and import them into datasources.
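Kafka ingestion specs can be submitted to Druid's supervisor API over HTTP; a stdlib-only sketch (the overlord address is a placeholder, and the spec file name in the comment is hypothetical — use the files in database/):

```python
import urllib.request

def supervisor_request(spec_path, overlord="http://localhost:8081"):
    """Build a POST submitting a Kafka ingestion spec JSON file to
    Druid's supervisor endpoint. The overlord address is a placeholder."""
    with open(spec_path, "rb") as f:
        body = f.read()
    return urllib.request.Request(
        f"{overlord}/druid/indexer/v1/supervisor",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# To actually submit (requires a running Druid overlord):
#   urllib.request.urlopen(supervisor_request("database/<spec>.json"))
```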

Start Kafka Producers

  1. Change relevant parameters in config.ini
  2. Place the Python and Bash scripts, config.ini and schema.avsc under the same directory
  3. ./run_GREEND.sh [starting day shift] [ending day shift] or
    ./run_REDD.sh [starting day shift] [ending day shift]
    This will replay the whole data set once for each day shift given in the arguments. Do not run more than 20 playbacks on a single machine.
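The day-shift arguments presumably offset the archived timestamps so that each replay of the data set appears as fresh traffic; a sketch of that idea (an assumption about the replay mechanism, not taken from the producer scripts):

```python
from datetime import datetime, timedelta

def shift_reading(ts, day_shift):
    """Move an archived reading's timestamp forward by day_shift days,
    so replayed data looks live. Illustrative assumption only."""
    return ts + timedelta(days=day_shift)
```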

Submit Spark Structured Streaming Job

  1. Change relevant parameters in config.ini
  2. Place duty_cycle_avro.py, config.ini and schema.avsc under the same directory
  3. spark-submit --master yarn --deploy-mode client --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5,org.apache.spark:spark-avro_2.11:2.4.5 duty_cycle_avro.py
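As the script name suggests, the streaming job produces a per-appliance duty cycle. A plain-Python sketch of one plausible definition of that metric; the ON threshold and window semantics are illustrative assumptions, not taken from duty_cycle_avro.py:

```python
def duty_cycle(watts, threshold=5.0):
    """Fraction of samples in a window where power draw exceeds
    `threshold` watts. Threshold value is an assumption."""
    if not watts:
        return 0.0
    return sum(1 for w in watts if w > threshold) / len(watts)
```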

Start Batch Historical Processing

  1. Put druid_batch.py and config.ini under /home/ubuntu, or other path specified in the DAG file.
  2. Change relevant parameters in config.ini
  3. Put the DAG script in the dags folder and turn it on in Airflow
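The batch script's coarse-grained queries go through Druid, which exposes an HTTP SQL API; a stdlib-only sketch of building such a request (the broker address and example query are placeholders):

```python
import json
import urllib.request

def druid_sql_request(query, broker="http://localhost:8082"):
    """Build a POST against Druid's SQL endpoint; the broker address
    is a placeholder for your cluster's."""
    body = json.dumps({"query": query}).encode()
    return urllib.request.Request(
        f"{broker}/druid/v2/sql",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# To run (requires a live Druid broker; table name is hypothetical):
#   urllib.request.urlopen(druid_sql_request(
#       "SELECT COUNT(*) FROM powerraw"))
```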

Dashboard

  1. Connect Pivot to Druid Datasources
  2. Import dashboard config file frontend/dashboard-iGridDemo.json

References

[1] J. Zico Kolter and Matthew J. Johnson, "REDD: A public data set for energy disaggregation research," in Proceedings of the SustKDD Workshop on Data Mining Applications in Sustainability, 2011.
[2] S. D'Alessandro, A. M. Tonello, A. Monacchi, and W. Elmenreich, "GREEND: An Energy Consumption Dataset of Households in Italy and Austria," in Proceedings of IEEE SmartGridComm, Venice, Italy, November 3-6, 2014.

License

This project is licensed under the MIT License; see the LICENSE file for details.
