Coder Social home page Coder Social logo

database-exploration's Introduction

Project: Database Exploration / Pipeline

Python application

This is my final project at Spiced Academy. I wanted to compare different databases and use some tools, which are not part of the Spiced curriculum. The idea came from the book "Seven Databases in Seven Weeks" by Luc Perkins. The other part of my project is building a pipeline and connecting it to different database. I wanted to use Apache Spark to show how pandas breaks with big data.


Goal:

  • Setting up a Data Collector
  • Connecting to different databases:
    • MySQL
    • Postgres
    • MongoDB
    • Redis
    • (HBase)
  • Using Apache Spark for analyzing the data
  • Connecting everything via Docker
  • Orchestrated with Airflow

image

Learnings from working with different databases


MySQL Postgres MongoDB Redis
MySQL Workbench very useful, easy to maintain, import/export data, visualization of data easier to use than MySQL, free GUI Tools like Postico (not that much functionality like Workbench), but quick check for content easy set up, free Monitoring with unique URL not really a Database, Key-Value-Storage, Caches other DB's

Learnings from HBase

  • no official Docker Image
  • for really(!) Big Data
  • need to know Ruby or Java
  • data insertion not the best
  • indexes with timestamps by itself
  • rows are similar to Python dictionaries

Performance

  • MySQl performance:
    • %time pd.to_sql: CPU times: user 1min 33s, sys: 24.2 s, total: 1min 57s Wall time: 5min 14s
    • SELECT * FROM table1: 2.047.078 row(s) returned 0.024 sec (Duration)/ 38.171 sec (Fetch time)
  • Postgres Performance:
    • %time pd.to_sql: interrupted the kernel after over an hour, table is created but no data is inserted (even with echo=True and INSERT is executed)

Security

  • use .env instead of config files (.env should be by default in gitignore)
  • you can add under settings -> secrets your passwords for e.g. database or AWS connection
  • have no problems with CI using Github Actions, because Github can access everything via secrets

  • Apache Spark maybe a bit overkill, clusters also quite complex instead using PySpark

database-exploration's People

Contributors

fe-bern avatar dependabot[bot] avatar

Stargazers

 avatar Ivan Trusov avatar Paul Wlodkowski avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.