huunhat1703tkbn / data-engineering-on-azure-spark-cluster-japan-visa-analysis


This project sets up a Spark master-worker architecture in a Docker container on Azure, then performs end-to-end data processing and visualization of visa numbers in Japan using PySpark and Plotly. It shows how to clean, transform, and interactively visualize the data to gain insight into visa trends in Japan.

Japan Visa Analysis: Azure End to End Data Engineering

This project provides end-to-end data processing and visualization of visa numbers in Japan using PySpark and Plotly. The Spark cluster is set up within a Docker container on Azure.

๐Ÿ“ Table of Contents

System Architecture

System Architecture

Setup & Requirements

  1. Azure Account: Ensure you have an active Azure account.
  2. Docker: The Spark master-worker architecture is set up in a Docker container on Azure.
  3. Python Libraries: Install the required Python libraries:
    • PySpark
    • Plotly Express
    • pycountry
    • pycountry_convert
    • fuzzywuzzy
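Before running the pipeline, it can help to confirm that the libraries listed above are importable. The following is a minimal sketch, not part of the original project: the import names in `REQUIRED_MODULES` are assumptions (Plotly Express ships inside the `plotly` package), and `missing_modules` is a hypothetical helper.

```python
from importlib.util import find_spec

# Assumed import names for the libraries listed above.
# Plotly Express is imported as `plotly.express`, so checking `plotly` is enough.
REQUIRED_MODULES = ["pyspark", "plotly", "pycountry", "pycountry_convert", "fuzzywuzzy"]


def missing_modules(modules=REQUIRED_MODULES):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing libraries:", ", ".join(missing))
    else:
        print("All required libraries are importable.")
```

Any names reported as missing can then be installed with your usual package manager.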

Usage

  1. Data Input: Place your CSV file named visa_number_in_japan.csv in the input directory.
  2. Run the Script: Execute the provided Python script.
  3. Visualizations: After execution, you'll find the visualizations saved as HTML files in the output directory.
  4. Cleaned Data: The cleaned data will also be saved as a CSV file in the output directory.
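The input and output steps above could look roughly like the sketch below. It is an illustration under stated assumptions, not the project's actual script: the helper names, the `input`/`output` paths relative to the working directory, and the cleaned-data folder name are all hypothetical. Only `spark.read.csv` and `DataFrame.write.csv` are standard PySpark calls.

```python
from pathlib import Path

# Paths taken from the usage steps; treating them as relative to the
# working directory is an assumption.
INPUT_CSV = Path("input") / "visa_number_in_japan.csv"
OUTPUT_DIR = Path("output")


def load_visa_data(spark, csv_path=INPUT_CSV):
    """Read the visa CSV into a DataFrame, using the first row as the header."""
    return spark.read.csv(str(csv_path), header=True, inferSchema=True)


def save_cleaned_data(df, output_dir=OUTPUT_DIR):
    """Write the cleaned DataFrame as CSV and return the target path.

    Spark writes a directory of part files rather than a single file.
    The folder name below is hypothetical.
    """
    target = Path(output_dir) / "visa_number_in_japan_cleaned"
    df.write.csv(str(target), header=True, mode="overwrite")
    return target


# Example usage (requires PySpark and a reachable Spark cluster or local mode):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("japan-visa-analysis").getOrCreate()
#   df = load_visa_data(spark)
#   save_cleaned_data(df)
#   spark.stop()
```

Passing the `SparkSession` in as an argument keeps the helpers independent of how the session is created, for example locally or against the Spark master running on Azure.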

Features

  • System Architecture: The Spark master-worker architecture is set up in a Docker container on Azure.
  • Data Ingestion: The script ingests the CSV file containing the visa numbers in Japan.
  • Data Cleaning: The script standardizes column names, drops null columns, and corrects country names using fuzzy matching.
  • Data Transformation: The data is further enriched by adding continent information for each country.
  • Data Visualization: The cleaned and transformed data is visualized using Plotly Express to provide insights into visa trends in Japan.
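The data-cleaning step described above can be sketched in plain Python. This is a hedged illustration rather than the project's code: `standardize_column` and `correct_country` are hypothetical helpers, and `difflib` from the standard library stands in for fuzzywuzzy so that the example runs without extra dependencies. The similarity cutoff of 0.8 is also an assumption.

```python
import re
from difflib import get_close_matches


def standardize_column(name: str) -> str:
    """Lower-case a column name and collapse non-alphanumeric runs into '_'."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


def correct_country(name, known_countries, cutoff=0.8):
    """Return the closest known country name, or the input unchanged
    when no candidate is similar enough.

    difflib is used here as a stand-in for fuzzywuzzy.
    """
    lookup = {country.lower(): country for country in known_countries}
    matches = get_close_matches(name.strip().lower(), list(lookup), n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else name


# With a PySpark DataFrame `df`, the renaming could be applied as:
#   df = df.toDF(*[standardize_column(c) for c in df.columns])
```

For example, `correct_country("Phillipines", ["Philippines", "Japan"])` resolves the misspelling to `"Philippines"`, while a name with no close candidate is returned unchanged so it can be handled by a manual mapping.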

๐Ÿ“ Notes

  • Ensure that your Azure and Docker setups are configured so that the Spark master and worker nodes can communicate with each other.
  • The country name corrections and continent mapping are based on the pycountry and pycountry_convert libraries. Keep these libraries up to date to get accurate results.
  • You can adjust the manual mappings in the country_mapping dictionary in the main.py file to correct any country names that are not correctly matched.
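The interaction between manual overrides and the continent lookup might look like the sketch below. The dictionary entry is purely illustrative (the real entries live in `country_mapping` in main.py), and `apply_manual_mapping` and `continent_of` are hypothetical helpers. The three `pycountry_convert` functions are real library calls. Catching `KeyError` for unresolvable names is an assumption about how the library reports failures.

```python
# Illustrative override only; the actual entries are defined in main.py.
country_mapping = {
    "Korea": "South Korea",
}


def apply_manual_mapping(name, mapping=country_mapping):
    """Return the manually corrected name if one exists, else the name unchanged."""
    return mapping.get(name, name)


def continent_of(country_name):
    """Look up the continent name for a country, or None if it cannot be resolved.

    Requires pycountry_convert; it is imported lazily so the mapping helper
    above works without it.
    """
    import pycountry_convert as pc

    try:
        alpha2 = pc.country_name_to_country_alpha2(country_name)
        continent_code = pc.country_alpha2_to_continent_code(alpha2)
        return pc.convert_continent_code_to_continent_name(continent_code)
    except KeyError:
        # Assumed failure mode for names or codes the library does not know.
        return None
```

Applying the manual mapping before the continent lookup means that names the fuzzy matcher gets wrong can still be corrected by adding one dictionary entry.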
