
Application of machine learning to the Coinbase (GDAX) orderbook

License: BSD 3-Clause "New" or "Revised" License

Language: Jupyter Notebook

Topics: gdax, keras, tensorflow, machine-learning, bitcoin, lstm, gru, orderbook, trading, candlesticks


gdax-orderbook-ml

Application of machine learning to the Coinbase (GDAX) orderbook, using a stacked bidirectional LSTM/GRU model to predict new support and resistance on a 15-minute basis. Currently under heavy development.

Model structure (visual): [diagram]

General project API/data structure: [diagram]

General Project Requirements

  • Anaconda environment strongly recommended
    • see requirements.txt for pip, or environment.yml for Anaconda/conda
      • Jupyter Notebook
      • Python, Pandas, Matplotlib, MongoDB, PyMongo, Git LFS
      • Scipy, Numpy, Feather
      • Keras, Tensorflow, Scikit-Learn
  • Python client for the Coinbase Pro API: coinbasepro-python (https://github.com/danpaquin/coinbasepro-python)
  • CUDA/CUDNN-compatible GPU highly recommended for model training, testing, and predicting

Tensorflow/Keras local GPU backend configuration (Nvidia CUDA/cuDNN)

A local GPU is used to greatly accelerate prototyping, construction, and training of the ML model(s) for this project, given the size of the dataset and the complexity of the model.

  • Requirements to run tensorflow with GPU support
  • Nvidia GPU compatible with CUDA Compute Capability 3.0 or higher
    • Nvidia CUDA 9.0
    • Nvidia cuDNN 7.0 (v7.1.2)
      • Install the cuDNN .dll files into the CUDA install directory
      • Edit environment variables so that PATH includes:
        • C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0
    • pip uninstall tensorflow && pip install --ignore-installed --upgrade tensorflow-gpu
      • The default tensorflow install is CPU-only: install the CUDA and cuDNN requirements first, then uninstall tensorflow and reinstall tensorflow-gpu with the command above.
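After switching to the GPU build, it is worth confirming that TensorFlow actually sees the GPU. A minimal sketch (the helper name is hypothetical; it uses the TF 1.x-era `device_lib` API and degrades gracefully when TensorFlow is not installed):

```python
def gpu_available():
    """Return True if TensorFlow is installed and reports at least one GPU.

    Hypothetical helper, not part of this project; returns False when
    TensorFlow (or the GPU build) is absent, so it is safe on CPU-only
    machines.
    """
    try:
        from tensorflow.python.client import device_lib
    except ImportError:
        return False
    # Each local device has a device_type of "CPU" or "GPU".
    return any(d.device_type == "GPU" for d in device_lib.list_local_devices())
```

If this returns False after installing tensorflow-gpu, the usual culprits are a CUDA/cuDNN version mismatch or the CUDA directory missing from PATH.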

Project/File Structure

Latest notebook file(s) with project code:

9_data_pipeline_development.ipynb:

  • Development of data pipelines and optimization of data from MongoDB instance to ML model pretraining
  • Removal of deprecated packages + base package version upgrades (e.g. Pandas)
  • Groundwork for an automation pipeline: automated hourly data scraping, cycling, and model training via a segregated instance or a live online model
  • Usage of in-line markdown cells in-notebook for readability and consistency
  • Even further refinement to program structure
  • Function scope and structure & function creation for common operations
  • Parsing of raw data into four consecutive 15-minute l2update segments

9a_model_restructure.ipynb:

  • Notebook used for further development of the model in different formats and for testing reduced-complexity models
    • Keras Sequential() Model
    • Keras Functional API
    • Raw Tensorflow

8_program_structure_improvement.ipynb:

  • Previous notebook with proof-of-concept output results
    • Several function calls via the API, and multiple required packages, are deprecated
    • Use as reference for updated development files/notebooks

6_raw_dataset_update.ipynb:

  • Notebook file used to scrape/update raw_data in both MongoDB and CSV format: 1 hour of websocket data from GDAX
    • L2 snapshot + L2 updates, without the overhead of the Match data response (the test data does include Match data, which adds significant I/O overhead)

Folder/Repository Structure

  • 'gdax-python' and 'gdax-ohlc-import' are repositories imported as Git Submodules:
    • After cloning the main project repository, the following command is required to ensure that the submodule repository contents are pulled/present: git submodule update --init --recursive
    • The .gitmodules file holds the submodule parameters
  • 'model_saved' folder:
    • Contains .json and .h5 files for current and previous Tensorflow/Keras models (trained model and model weight export/import)
  • 'documentation' folder:
    • 'rds_ml_yu.revised.pptx' is a PowerPoint presentation summarizing the key technical components, scope, and limitations of this project.
    • 'design_mockup' folder:
      • Contains diagrams, drawings, and notes used in the process of model and project design during prototyping, testing, and expansion.
    • 'design_explanation' folder:
      • Contains 8 pages of detailed explanations and diagrams regarding both project/model structure and design.
    • 'previous_revisions' folder:
      • Contains previous/outdated versions of the readme documentation and PowerPoint presentations documenting this project
  • 'saved_charts' folder:
    • Output of generate_chart() for candlestick chart with visualized autogenerated support and resistance from autoSR()
    • Screenshot of model layer structure in text format
    • Graphviz output of model layer structure
  • 'test_data' folder:
    • Only has 10 minutes of scraped data for testing, development, and model input prototyping (snapshot + l2 response updates)
  • 'raw_data' folder:
    • 1 hour of scraped data (snapshot + l2 response updates)
      • l2update_15min_1-4: 1 hour of l2 updates split into four 15-minute increments
      • mongo_raw.json: 1 hour of scraped data from the gdax-python API websocket in raw mongoDB format
  • 'raw_data_10h' folder:
    • 10 hours of scraped data:
      • l2update_10h, request_log_10h, and snapshot_asks/bids_10h
      • 10 hours of scraped data in raw mongoDB export (JSON): mongo_raw_10h.json
    • Data in .msg (MessagePack) format is currently experimental, being tested as an alternative to .csv for I/O operations
  • 'raw_data_pipeline' folder:
    • Contains data in .feather format as part of data pipeline(s) implementation and development
  • 'archived_ipynb' folder:
    • Contains previous Jupyter Notebook files used in the construction, design, and prototyping of components of this project.
      • Jupyter Notebook (.ipynb) notebook files 1-5 & 7
      • Each successive notebook was used to test whether, at each stage, a project of this scope would be technically feasible.
    • Successive numbered notebooks are iterative in nature, generally improving on the previous notebook files for this project.
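For reference, the rolling-extrema style of support/resistance generation behind autoSR() (whose output is saved under the 'saved_charts' folder) can be sketched as follows. This is a naive simplification for illustration, not the project's actual adaptation of nakulnayyar/SupResGenerator:

```python
def auto_sr(closes, window=2):
    """Return (support, resistance) levels from a list of closing prices.

    A close is a support level if it is the minimum of its +/- `window`
    bar neighborhood, and a resistance level if it is the maximum.
    Naive sketch only.
    """
    support, resistance = [], []
    for i in range(window, len(closes) - window):
        neighborhood = closes[i - window:i + window + 1]
        if closes[i] == min(neighborhood):
            support.append(closes[i])
        if closes[i] == max(neighborhood):
            resistance.append(closes[i])
    return support, resistance
```

Levels found this way can then be drawn as horizontal lines over a candlestick chart, which matches the kind of generate_chart() output described above.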

Misc. Technical Reference

Publications and Journals referenced for model structure and design

License

- gdax-orderbook-ml: BSD-3 Licensed, Copyright (c) 2018 Timothy Yu
- coinbasepro-python: MIT Licensed, Copyright (c) 2017 Daniel Paquin 
- autoSR() function adapted from nakulnayyar/SupResGenerator, Copyright (c) 2016 Nakul Nayyar (https://github.com/nakulnayyar/SupResGenerator)


gdax-orderbook-ml's Issues

scrape 10 hours/1 day of data

Reconstruct/alter the #6 Jupyter notebook scrape file for 10 hours of scraping.

issues:

  • RAM limitations
  • MongoDB raw scrape size
  • CSV limitations as a file format

new complementary tool

I want to offer a new point of view, and my collaboration.

Why this stock prediction project?

Things this project offers that I did not find in other free projects:

  • Testing with ~30 models: multiple feature combinations and multiple model selections (TensorFlow, XGBoost, and scikit-learn)
  • Threshold and model-quality evaluation
  • Uses ~1k technical indicators
  • A method for selecting the best features (technical indicators)
  • A categorical target (buy, sell, do nothing), simple and dynamic, instead of a continuous target variable
  • A powerful open-market real-time evaluation system
  • Versatile integration with Twitter, Telegram, and mail
  • Training the machine learning model with fresh same-day stock data

https://github.com/Leci37/stocks-prediction-Machine-learning-RealTime-telegram/tree/develop

fix scrape_start()

  • boolean flag for scrape_running() status

  • save to raw_data_pipeline folder

  • timezone basis/reference config (timezone of scrape)

  • outline of function def for new hour of data (mock function)

  • outline of error handling if scrape interrupted
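The fixes listed above (status flag, interruption handling, output folder) might be outlined as below; every name except the raw_data_pipeline folder is hypothetical, and the websocket loop is stubbed out:

```python
SCRAPE_RUNNING = False          # boolean status flag
OUTPUT_DIR = "raw_data_pipeline"  # save target for scraped data

def scrape_running():
    """Report whether a scrape is currently in progress."""
    return SCRAPE_RUNNING

def scrape_start(collect_one_hour=lambda: []):
    """Run one hour of scraping, clearing the flag even on interruption.

    `collect_one_hour` is a hypothetical stand-in for the websocket
    collection loop; results would be persisted under OUTPUT_DIR.
    """
    global SCRAPE_RUNNING
    SCRAPE_RUNNING = True
    try:
        return collect_one_hour()
    finally:
        # try/finally guarantees the flag is reset if the scrape is
        # interrupted or raises, addressing the error-handling outline.
        SCRAPE_RUNNING = False
```

The try/finally is the key piece: an interrupted scrape can no longer leave the status flag stuck at True.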

new 1 hour data scrape w/ OHLC data save

  1. Scrape a new set of 1-hour test data, with OHLC candlestick data additionally saved to msgpack/csv.

  2. related to "ch15m_req_time() not respecting time format" issue + autosr() results save:
    #9 & #33
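Deriving the OHLC candles to save alongside the raw scrape could look like the standard-library sketch below; the `(timestamp, price)` tick format and function name are assumptions, not the project's actual save path:

```python
from datetime import timedelta

def to_ohlc(ticks, start, minutes=15):
    """Aggregate (timestamp, price) ticks into OHLC candles of `minutes`
    width, keyed by candle index counted from `start`.

    Ticks before `start` are ignored. Illustrative sketch only.
    """
    step = timedelta(minutes=minutes)
    candles = {}
    for ts, price in sorted(ticks):  # time order matters for open/close
        if ts < start:
            continue
        idx = (ts - start) // step
        if idx not in candles:
            candles[idx] = {"open": price, "high": price,
                            "low": price, "close": price}
        else:
            c = candles[idx]
            c["high"] = max(c["high"], price)
            c["low"] = min(c["low"], price)
            c["close"] = price  # last tick seen in the bucket
    return candles
```

Each resulting dict row maps directly onto a candlestick, so the same structure can feed both the msgpack/csv save and generate_chart().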
