
Vietnam stock data analyzer

Objectives

The goal of this project is to collect and visualize the stock prices of all tickers in Vietnam. Since API access is quite limited for a single business user, this project scrapes data from websites, then cleans, extracts, and loads it into a data warehouse. The final product is a maintainable, reliable data pipeline with an analytics dashboard hosted in the cloud, which authorized end users can access 24/7 with daily updated data.

Data are collected for almost 10 years, starting from 2010, to support later analytics steps (such as backtesting or trading strategies for a particular ticker). Within the scope of this project, stock prices are visualized as time series for monitoring purposes only.

The report is live here: Link To Report Dashboard

Proposed Architecture

To prove the concept at a reasonable cost, the architecture below is proposed.

The solution is built from 4 main components:

  • Data collection: The project draws on 2 data sources hosted by VNDirect, where all Vietnam tickers are officially registered: a ticker URL and a price URL.

    Ticker URL: https://www.vndirect.com.vn/portal/thong-tin-co-phieu/nhap-ma-chung-khoan.shtml?request_locale=en_GB

    Price URL: https://www.vndirect.com.vn/portal/thong-ke-thi-truong-chung-khoan/lich-su-gia.shtml?request_locale=en

  • ETL/Orchestration: Airflow is chosen as the orchestrator to run and schedule the ETL jobs. It is dockerized and deployed on the DigitalOcean cloud service; the Docker setup also facilitates CI/CD tasks later on. Refer here: Airflow Docker Repository

  • Data storage: To date, 1,710 tickers are registered and ~2.4M records are stored in a Postgres database managed by DigitalOcean. Since few users access the system at the moment and the data size is modest, this choice is cost effective without hurting performance.

  • Data Visualization/Analytics: Due to time constraints, only visualization of the data collected in the Postgres database is implemented in this project. A future feature will be data analytics with Spark, leveraging the 10 years of available data to run backtests, for example.

    The Superset visualization tool is also dockerized and hosted on an EC2 t2.micro instance, which is free for up to 1 year. This option is cost-optimal since visualization is not a hardware-heavy task. Refer here: Superset Docker Repository

Data sources

As mentioned above, there are 2 main data sources: ticker and price. The data come in HTML format, which requires the ETL job to apply some special techniques to extract the needed information.

Data model

The data model is simple, as described below.
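
Since the diagram does not carry the column details here, below is a minimal sketch of what the two tables could look like, written as DDL strings in the style of sql_queries.py. All table and column names are assumptions for illustration, not the actual schema.

    # Hypothetical DDL -- table and column names are assumptions, not the real schema.
    CREATE_TICKER_TABLE = """
    CREATE TABLE IF NOT EXISTS ticker (
        code     VARCHAR(10) PRIMARY KEY,   -- e.g. 'VNM'
        company  TEXT,
        exchange VARCHAR(10)                -- HOSE / HNX / UPCOM
    );
    """

    CREATE_PRICE_TABLE = """
    CREATE TABLE IF NOT EXISTS price (
        code       VARCHAR(10) REFERENCES ticker (code),
        trade_date DATE,
        open       NUMERIC,
        high       NUMERIC,
        low        NUMERIC,
        close      NUMERIC,
        volume     BIGINT,
        PRIMARY KEY (code, trade_date)
    );
    """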

ETL Job

Although the data model looks simple, the heavy part of the ETL job is extracting the needed data from HTML. In this project, the Selenium and BeautifulSoup libraries are used to automate the crawling of stock data, as sketched below.
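
The sketch below shows the general Selenium + BeautifulSoup pattern only. The element IDs are placeholders, not the real ones on the VNDirect pages; the actual logic lives in etl_vndirect_ticker.py and etl_vndirect_price.py.

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    PRICE_URL = ("https://www.vndirect.com.vn/portal/"
                 "thong-ke-thi-truong-chung-khoan/lich-su-gia.shtml?request_locale=en")

    def fetch_price_page(ticker):
        """Render the price-history page for one ticker and parse it with BeautifulSoup."""
        driver = webdriver.Chrome()
        try:
            driver.get(PRICE_URL)
            # The element IDs below are placeholders -- inspect the page for the real ones.
            driver.find_element(By.ID, "symbolInput").send_keys(ticker)
            driver.find_element(By.ID, "submitButton").click()
            return BeautifulSoup(driver.page_source, "html.parser")
        finally:
            driver.quit()

    # Usage: soup = fetch_price_page("VNM"), then walk the result table via soup.select(...).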

The pipeline includes 2 main steps: ticker ingestion and price ingestion. The price ingestion step is divided into separate flows by exchange to enable multithreaded processing for better performance. Within each thread, a data quality check runs before every insert command; a sketch of this pattern follows.
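
A minimal sketch of the per-exchange threading with a pre-insert quality check. fetch_prices and insert_price are placeholder stubs standing in for the real ETL helpers, and the validity rule is an assumed example, not the project's actual check.

    from concurrent.futures import ThreadPoolExecutor

    EXCHANGES = ["HOSE", "HNX", "UPCOM"]

    def fetch_prices(exchange):
        """Placeholder for the real crawler; returns scraped rows for one exchange."""
        return []

    def insert_price(row):
        """Placeholder for the real Postgres insert command."""

    def is_valid(row):
        """Quality check run before every insert: reject rows with a missing
        ticker code or a non-positive closing price (an assumed example rule)."""
        return bool(row.get("code")) and (row.get("close") or 0) > 0

    def ingest_exchange(exchange):
        for row in fetch_prices(exchange):
            if is_valid(row):
                insert_price(row)

    # One thread per exchange, mirroring the flow split described above.
    with ThreadPoolExecutor(max_workers=len(EXCHANGES)) as pool:
        pool.map(ingest_exchange, EXCHANGES)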

How the project is organized and set up

  • Files in repository: Each project sits in its own folder inside the main local DAG directory (defined by AIRFLOW_HOME) to keep the setup scalable. With this file structure, when another stock data source comes along, simply create another project folder beside vn_stock.

Within the vn_stock project, dag.py defines the pipeline and the scheduler settings. It calls the ticker and price ingestion tasks via PythonOperator; a minimal sketch of such a DAG follows the setup steps below.

.
└── vn_stock
    ├── dag.py
    ├── README.md
    └── tasks
        ├── config.py
        ├── data
        ├── etl_vndirect_price.py
        ├── etl_vndirect_ticker.py
        ├── requirements.txt
        ├── setup_database.py
        ├── sql_queries.py
        └── utils.py
  • Build and run docker-airflow with the LocalExecutor setup to enable parallel task instances locally:

    docker build --rm --build-arg AIRFLOW_DEPS="datadog,dask" --build-arg PYTHON_DEPS="flask_oauthlib>=0.9" -t puckel/docker-airflow .

    docker-compose -f docker-compose-LocalExecutor.yml up -d

  • Set up the database for the first run (from inside the Airflow container):

    docker exec -it -u 0 docker-airflow_webserver_1 bash

    python vn_stock/tasks/setup_database.py
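
As referenced above, here is a minimal sketch of what dag.py could look like. The callable names, import paths, and schedule are assumptions; only the PythonOperator wiring and the ticker-then-price ordering come from the project description.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    # Hypothetical callables standing in for the real ingestion entry points
    # in etl_vndirect_ticker.py and etl_vndirect_price.py.
    from tasks.etl_vndirect_ticker import ingest_tickers
    from tasks.etl_vndirect_price import ingest_prices

    with DAG(
        dag_id="vn_stock",
        start_date=datetime(2020, 1, 1),
        schedule_interval="@daily",  # daily refresh, per the objectives above
        catchup=False,
    ) as dag:
        ticker_task = PythonOperator(task_id="ingest_tickers", python_callable=ingest_tickers)
        price_task = PythonOperator(task_id="ingest_prices", python_callable=ingest_prices)
        ticker_task >> price_task  # price ingestion depends on the ticker list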

Future improvements

  • When the data grows by 100x: move data storage to AWS Redshift, whose columnar store is designed for better performance at that scale.
  • When the database needs to be accessed by 100+ people: as user access grows, it becomes necessary to set up roles and groups for particular purposes (view/modify) to strengthen security management and keep performance effective. Refer here: Managing PostgreSQL users
  • Integrate Spark for ticker analytics.
  • Crawl more data sources to enrich the data features.
