investment_data's Introduction

Chinese README: ch

Chinese blog post about this project: 量化系列2 - 众包数据集 (Quant Series 2: Crowdsourced Dataset)

How to use it

  1. Download the tarball from the latest release page on GitHub
  2. Extract the tar file to the default qlib data directory
wget https://github.com/chenditc/investment_data/releases/download/2023-04-20/qlib_bin.tar.gz
tar -zxvf qlib_bin.tar.gz -C ~/.qlib/qlib_data/cn_data --strip-components=2
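
For environments without a `tar` binary, the extraction step above can be approximated in Python with the standard library. This is a minimal sketch, not part of the project: the function name `extract_strip_components` is hypothetical, and it assumes the archive nests its files under two leading directories, as the `--strip-components=2` flag implies. A leading `./` in a member name is not counted as a component here, which may differ from how `tar` counts.

```python
import shutil
import tarfile
from pathlib import Path, PurePosixPath


def extract_strip_components(tar_path, dest_dir, strip=2):
    """Extract tar_path into dest_dir, dropping the first `strip` path components.

    Hypothetical helper that mirrors the intent of
    `tar -zxvf ... -C dest_dir --strip-components=2`.
    """
    dest = Path(dest_dir).expanduser()
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with tarfile.open(tar_path, "r:*") as tar:
        for member in tar:
            parts = PurePosixPath(member.name).parts
            # Entries at or above the stripped depth have nothing left to extract.
            if len(parts) <= strip:
                continue
            relative = Path(*parts[strip:])
            # Only plain files and directories are handled; links and paths
            # that try to climb out of dest are skipped.
            if ".." in relative.parts or not (member.isfile() or member.isdir()):
                continue
            target = dest / relative
            if member.isdir():
                target.mkdir(parents=True, exist_ok=True)
                continue
            target.parent.mkdir(parents=True, exist_ok=True)
            with tar.extractfile(member) as src, open(target, "wb") as out:
                shutil.copyfileobj(src, out)
            extracted.append(target)
    return extracted
```

Calling `extract_strip_components("qlib_bin.tar.gz", "~/.qlib/qlib_data/cn_data")` after the download would place the files in the same directory the `tar` command targets.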

Development Setup

If you want to contribute to the set of scripts or the data, here is what you should do to set up a dev environment.

Install Dolt

Follow the installation instructions at https://github.com/dolthub/dolt

Clone data

The raw data is hosted on DoltHub: https://www.dolthub.com/repositories/chenditc/investment_data

To download it as a Dolt database:

dolt clone chenditc/investment_data

Export to qlib format

docker run -v /<some output directory>:/output -it --rm chenditc/investment_data bash -c "bash dump_qlib_bin.sh && cp ./qlib_bin.tar.gz /output/"

Run Daily Update

You need a Tushare token to use the Tushare API. Get one from https://tushare.pro/

export TUSHARE=<Token>
bash daily_update.sh
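
How the update scripts consume the `TUSHARE` variable is not shown here. As a hedged illustration, a Python script that depends on the token could fail fast with a clear message when it is missing. The function name `get_tushare_token` is hypothetical and not part of the project.

```python
import os


def get_tushare_token(env=None):
    """Return the token from the TUSHARE variable, or raise a clear error.

    `env` defaults to os.environ; passing a dict makes the function easy to test.
    """
    env = os.environ if env is None else env
    token = env.get("TUSHARE", "").strip()
    if not token:
        raise RuntimeError(
            "TUSHARE is not set; get a token from https://tushare.pro/ and export it first"
        )
    return token
```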

Daily update and output

docker run -e TUSHARE -v /<some output directory>:/output -it --rm chenditc/investment_data bash -c "bash daily_update.sh && bash dump_qlib_bin.sh && cp ./qlib_bin.tar.gz /output/"

Extract tar file to qlib directory

tar -zxvf qlib_bin.tar.gz -C ~/.qlib/qlib_data/cn_data --strip-components=2

Initiative

  1. Try to fill in missing data by combining data from multiple data sources, for example data for delisted companies.
  2. Try to correct data by cross-validating it against multiple data sources.
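
The two goals above can be sketched in a few lines of Python. This is an illustration only, not the project's actual validation logic: the function names, the dictionary layout (keys such as `(symbol, date)` mapping to a price) and the tolerance are all assumptions.

```python
from math import isclose


def fill_missing(primary, secondary):
    """Return primary's values, falling back to secondary where primary has none.

    A key counts as missing in primary if it is absent or maps to None.
    """
    merged = dict(secondary)
    merged.update({key: value for key, value in primary.items() if value is not None})
    return merged


def find_mismatches(source_a, source_b, rel_tol=1e-4):
    """List keys present in both sources whose values differ by more than rel_tol."""
    shared = source_a.keys() & source_b.keys()
    return sorted(
        key
        for key in shared
        if source_a[key] is not None
        and source_b[key] is not None
        and not isclose(source_a[key], source_b[key], rel_tol=rel_tol)
    )
```

`fill_missing` covers the first goal: rows that one source lacks, such as those of a delisted company, are taken from the other. `find_mismatches` covers the second: it returns the keys that need a closer look because the sources disagree.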

Project Detail

Data Source

Each database table on DoltHub is named with a prefix that identifies its data source, for example ts_a_stock_eod_price. The meaning of the prefix:

Initial loading and Validation logic for each table

Contribution Guide

Add more stock indexes

To add a new stock index, make the following changes:

  1. Add an index weight download script. Change the tushare/dump_index_eod_price.py script to dump the index info. If the index is not available in Tushare, write a new script and add it to the daily_update.sh script. Example commit
  2. Add a price download script. Change tushare/dump_index_eod_price.py to add the index price. Example commit
  3. Modify the export script. Change the qlib dump script qlib/dump_index_weight.py#L13 so that the index is dumped and renamed to a txt file for use. Example commit

Add more data sources or fields

Please raise an issue to discuss the plan first. Example issue: chenditc#11

The issue should include:

  1. Why do we want this data?
  2. How do we run regular updates?
    • Which data source would we use?
    • When should we trigger an update?
    • How do we validate that a regular update completed correctly?
  3. Which data source should we get historical data from?
  4. How do we plan to validate the historical data?
    • Is the data source complete? How did we verify this?
    • Is the data source accurate? How did we verify this?
    • If we see errors in validation, how will we deal with them?
  5. Are we changing an existing table or adding a new table?

If the data is not clean, we may work hard to extract insights from it and end up with incorrect ones. So we want high-quality data, not just more data.
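
One way to answer the question of whether a regular update completed correctly is a completeness check against a trading calendar. The sketch below is an assumption about what such a check could look like, not the project's implementation: `check_update`, its arguments and the `min_rows` threshold are hypothetical.

```python
def check_update(calendar, rows_by_date, min_rows=1):
    """Compare loaded data against the expected trading calendar.

    calendar:     iterable of expected trading dates (for example ISO date strings)
    rows_by_date: mapping of date -> number of rows loaded for that date
    Returns (missing, sparse): dates with no rows at all, and dates with
    fewer than min_rows rows.
    """
    missing = sorted(day for day in calendar if day not in rows_by_date)
    sparse = sorted(
        day for day in calendar if day in rows_by_date and rows_by_date[day] < min_rows
    )
    return missing, sparse
```

An update would count as complete only when both returned lists are empty. A suitable `min_rows` depends on how many instruments the table is expected to cover.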

Contributors

chenditc, xu-li, zhuoju36