
marketgan's Introduction

MarketGAN

Implementing a Generative Adversarial Network (GAN) on the stock market through a pipeline on Google Colab. The data comes from 500 companies in the S&P 500, downloaded via Alpha Vantage, and the model is trained using a 3-layer dense network as the generator and a 3-layer convolutional neural network as the discriminator.

Follow the instructions below and you can get an easy-to-use Google Colab notebook up and training. Then you can modify it and use it as a playground as you'd like.

Update 05/09/2021: Updated the notebook to remove deprecated functions, removed some code clutter, added instructions, updated the stock data to sort correctly from newest to oldest to avoid biased training, and updated this readme with extra instructions below. The notebook runs from start to finish without issues as of today. TensorFlow is an evolving library and things may become deprecated and fail in the future, but I'll try to keep this working. Let me know if it isn't.

Abstract

Neural networks have been advancing rapidly in capability in recent years. One of the newest techniques is the Generative Adversarial Network (GAN). In this architecture, two neural networks are pitted against each other: one tries to fool the other with noise, while the other trains on real data and responds with information on how to make that noise more realistic. After many iterations, you would ideally be able to generate data that the other network cannot tell apart from real data. We aim to apply this powerful method to modeling time series data, with our current medium being the stock market. A GAN that works well with time series data, especially chaotic series such as the market, would be useful in many other areas. One is finance, where you could better estimate the risk of an investment; another application might be effectively anonymizing private, sensitive data. A lot of data today is not shared because of confidentiality, so being able to generate accurate synthetic versions without loss of information would be valuable.

Setup

I built a pipeline on Google Colab (which offers a free K80 GPU for 12-hour sessions). It can be tedious to set up, but it works like a charm afterwards, since I could access it anywhere, change some parameters, and train a model. Some variables I played with are the list of companies, the number of epochs, the number of historical days to predict with, the number of days to predict ahead, and the percentage-change threshold to predict on. I built on top of code found on GitHub and added many modifications, including methods to stop training and view confusion matrices, a streamlined process for deploying files for predictions, adaptations to make the code work with Google Colab, and support for quick parameter changes.

Results

Using 30 stocks with the highest market cap, as I initially planned, turned out to be completely unhelpful, so I moved on to the S&P 500 because it accounts for about 80% of the movement in the market. Shown below are the GAN results for the S&P 500 companies with 20 historical days, 5-days-ahead predictions, 50k epochs, and various percentage thresholds. For predicting simply up or down (0% threshold), the GAN gets decent results, though a CNN I trained alongside it (not shown below) still came out slightly better. At a 10% threshold the GAN does quite badly: it predicts most of the down movements correctly but almost none of the up movements. Loosening the threshold to 1%, there is a significant change in up predictions compared to the 10% threshold: we now predict 14% of the true ups rather than just 1% of them, while losing very little accuracy on down predictions.

(Figure: Results)
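For concreteness, here is a minimal sketch of how the up/down labels behave at these thresholds. It mirrors the labeling rule that appears in the notebook code quoted later on this page (label = 1 when the close rises by more than pct_change percent over `days` rows); the price series and parameter values here are made up:

    import pandas as pd

    # Hypothetical closing prices; days/pct_change values are illustrative.
    close = pd.Series([100.0, 101.0, 99.0, 103.0, 104.0, 106.0, 102.0, 108.0, 110.0, 109.0])
    days, pct_change = 5, 1  # predict 5 rows ahead, 1% threshold

    # Same rule as the notebook: pct_change(days) compares each row to the row
    # `days` positions earlier; the label is 1 when that change beats the threshold.
    labels = close.pct_change(days).map(lambda x: int(x > pct_change / 100.0))
    print(labels.tolist())  # the first `days` rows map to 0 because NaN > x is False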


Update: usage notes

There are step-by-step instructions below that explain how to use the notebook. The first cell of the notebook also has markdown covering these steps, and the cell comments provide extra details.

This Google Colab notebook will take several hours to run (even with Google's provided GPU). Downloading the stock data for the 500 companies in the S&P 500 can take 2-3 hours by itself.

Then, depending on the parameters you use for training the models, training can take several more hours.

What results do I get from this?

This project was mostly to help me get familiar with GANs. I modified the code so that it runs on Google Colab in a very streamlined way with the parameters I'd like. I think this is a great way to get familiar with GANs and the training process. The code outputs a confusion matrix showing the overall prediction accuracy rates, and at the end there is code that uses the model to output predictions for each stock.

An example of this output (from a model trained with the provided notebook for 1% changes; it predicts a great many stocks increasing in price by 1%): (Figure: Specific Stock Predictions)

You can see the confusion matrix for this model in the notebook provided: (Figure: Confusion Matrix of New Notebook)

How does it work?

For our solution we use a convolutional neural network as the discriminator and a 3-layer dense network as the generator. We use a technique called adversarial feature learning, in which you additionally attempt to encode the features learned by the discriminator into the generating distribution. What we implemented doesn't follow this exactly, though: we train a boosted-tree classifier (using XGBoost) on the flattened features extracted from the discriminator's convolutional layers. Then, to predict from a window of historical days, we feed that data into the discriminator, take the flattened features, and run them through the classifier to get a prediction of the price's movement.
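The following is a minimal, self-contained sketch of that idea, not the notebook's actual classes: a stand-in convolutional "discriminator" produces flattened features from price windows, and XGBoost is trained on those features. The layer sizes, shapes, and synthetic data are all illustrative.

    import numpy as np
    import tensorflow.compat.v1 as tf
    import xgboost as xgb
    tf.disable_v2_behavior()

    num_historical_days, num_features = 20, 5
    X = tf.placeholder(tf.float32, [None, num_historical_days, num_features])
    conv = tf.layers.conv1d(X, filters=8, kernel_size=3, activation=tf.nn.relu)
    features = tf.layers.flatten(conv)  # the flattened discriminator features

    # Synthetic windows and up/down labels, just to make the sketch runnable.
    windows = np.random.randn(64, num_historical_days, num_features).astype(np.float32)
    labels = np.random.randint(0, 2, size=64)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        feats = sess.run(features, feed_dict={X: windows})

    # Fit the boosted-tree classifier on the extracted features.
    clf = xgb.train({'objective': 'multi:softprob', 'num_class': 2, 'eta': 0.1},
                    xgb.DMatrix(feats, label=labels), num_boost_round=50)
    print(clf.predict(xgb.DMatrix(feats))[:3])  # per-class probabilities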

During each training run we save the different trained models: the discriminator models (at multiple epochs), the generator model (at multiple epochs), the benchmark CNN (at multiple epochs), and the XGBoost model. XGBoost is an open-source library providing a gradient-boosting framework, and it has performed very well in machine learning competitions such as those on Kaggle. As mentioned above, XGBoost is trained on the discriminator's flattened features after the GAN itself has been trained. The assumption (which has been seen to work in other applications as well) is that the GAN learns the feature space of the data, so the discriminator's internal representations carry predictive information about how stock prices will behave. Going forward, when we want to predict on unseen data, we send it through the discriminator of the GAN, take the flattened features, and feed them to the XGBoost model we trained, which then gives us a prediction.
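The deployment step at prediction time then looks roughly like this; a minimal sketch, assuming the XGBoost model was saved to deployed_model/xgb (as in the notebook) and that we already have discriminator features for one window. The feature width of 144 is just illustrative.

    import joblib
    import numpy as np
    import xgboost as xgb

    # Load the persisted XGBoost classifier (the notebook saves it with joblib.dump).
    clf = joblib.load('deployed_model/xgb')

    # Stand-in for the discriminator's flattened features for one 20-day window.
    unseen_features = np.random.randn(1, 144)

    # multi:softprob yields per-class probabilities; index 1 is the "Up" class.
    prob_up = clf.predict(xgb.DMatrix(unseen_features))[0][1]
    print('Up' if prob_up > 0.5 else 'Down')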

Note: I use the term "we" here because the training code isn't originally mine. See the source at the bottom of this README.

Instructions

The comments explain the steps you need to take. With some minor changes you can run this Colab notebook so that it trains a model for you and makes some predictions.

  1. First, copy my notebook to Google Colab or locally. There are advantages to using Google Colab (such as access to their GPU and being able to make modifications from any computer that can access the notebook), and that's the method I used for training these models. However, you can also set this up locally for your own GPU; in that case, leave the googlepath variable blank or point it to the folder you want for your local Jupyter notebook.

To copy the required Jupyter notebook, do one of the following, as you prefer:

  • Fork this repo, then go to Google Drive (New -> More -> Google Colaboratory), then File -> Open Notebook -> GitHub -> pass in the URL of the notebook in your own repo.
  • Clone this repo or your fork of it, then open a new Google Colab notebook in Google Drive (New -> More -> Google Colaboratory). Then upload the notebook via File -> Open Notebook -> Upload -> select the Jupyter notebook you cloned from this repo.

Instructions for getting the notebook running:

  1. Run the first two cells and follow the instructions in the second code cell. This mounts your Google Drive so you can save files to and load them from your Google Drive folder, as sketched below.
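The drive-mounting step is the standard Colab call; running it prompts you for authorization:

    from google.colab import drive
    drive.mount('/content/drive')  # your files then appear under /content/drive/My Drive/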

  2. In the 3rd cell, select a googlepath that you will use to store all your folders and files. The folder structure you will end up with is shown at the end of these instructions, but for now just choose a googlepath to use (e.g. /drive/My Drive/Colab Notebook/MarketGAN).

  3. Modify the training parameters in that same cell according to your wishes; illustrative values are sketched just below. These are the only modifications you need to make to train the models. It's very simple to change these parameters to train different models, and you can also change the company list to try different stocks (remember to delete old stock data within the stock_data folder).
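For reference, a hypothetical parameter cell might look like this; the variable names match those used in the code quoted later on this page, but the values are only examples:

    googlepath = '/drive/My Drive/Colab Notebook/MarketGAN/'  # trailing slash matters in the path joins below
    HISTORICAL_DAYS_AMOUNT = 20  # days of history per training window
    DAYS_AHEAD = 5               # how far ahead to predict
    PCT_CHANGE_AMOUNT = 1        # percentage-change threshold for an "up" label
    TRAINING_AMOUNT = 10000      # checkpoint/save interval in training steps (see the note below)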

  4. Get an Alpha Vantage API key at https://www.alphavantage.co/support/#api-key. This allows you to download the stock data using their API; a download sketch follows the note below.

Note: Alpha Vantage's free service is limited to 500 requests per day and 6 per minute.
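A minimal download sketch under these limits might look like the following. The endpoint is Alpha Vantage's documented TIME_SERIES_DAILY CSV endpoint; the ticker list and output folder here are illustrative, and the notebook instead iterates over every symbol in companylist.csv:

    import time
    import pandas as pd

    API_KEY = 'YOUR_ALPHAVANTAGE_KEY'
    URL = ('https://www.alphavantage.co/query?function=TIME_SERIES_DAILY'
           '&symbol={sym}&outputsize=full&datatype=csv&apikey=' + API_KEY)

    for sym in ['AAPL', 'MSFT']:  # the notebook reads tickers from companylist.csv
        df = pd.read_csv(URL.format(sym=sym))  # columns: timestamp, open, high, low, close, volume
        df.to_csv(f'stock_data/{sym}.csv', index=False)  # assumes stock_data/ exists
        time.sleep(12)  # spread requests out to stay under the per-minute limit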

  5. Run the code until the section called "Change the names of the files in the deployed_model folder". Select the models you want to use from XGB, CNN, and GAN and place them into the deployed_model folder. Run this cell to rename them and prepare them to be used for predictions.

Note: When training the GAN or CNN models, training picks up from where it left off. The model saves every 10,000 steps by default (you can modify this by changing the TRAINING_AMOUNT variable). For this to work, you only need to leave the model in the models folder; the latest-step model is selected and training continues from there, roughly as sketched below. I therefore recommend copying (rather than moving) trained models into the deployed_model folder. The script under the heading "Change the names of the files in the deployed_model folder" will rename the model for you once it's in the deployed_model folder (whether copied or not).
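A minimal sketch of that resume behavior with tf.compat.v1; the variable and paths here are stand-ins, not the notebook's actual training loop:

    import tensorflow.compat.v1 as tf
    tf.disable_v2_behavior()

    step = tf.Variable(0, name='global_step')  # stand-in for the model's variables
    saver = tf.train.Saver()

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        latest = tf.train.latest_checkpoint('models/')  # e.g. models/gan.ckpt-40000
        if latest:
            saver.restore(sess, latest)  # pick up training where it left off
        # ... training loop; periodically:
        # saver.save(sess, 'models/gan.ckpt', global_step=step_value)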

  6. Run the remaining cells and wait for the output.

Expected File Directory Layout in Folder at the End

  • cnn_models (where trained CNN models get saved)

  • deployed_model (the folder you must MANUALLY put trained models into once you are satisfied with the number of training steps. XGB, CNN, and GAN models go in here. A script further below takes the files in deployed_model, renames them if need be, and then uses them for making predictions.)

  • logs (logs are written here, with subdirectories test and train)

  • models (trained models are saved into this directory)

  • stock_data (all downloaded stock data is placed into this folder)

  • companylist.csv (a company list that you must provide, or use the one provided on GitHub; the data is downloaded for these tickers)

  • stockmarketGan.ipynb (this Google Colab notebook)

An image of how it will look: (Figure: Folder structure)

MISC Notes

Google Colab provides a K80 GPU (the K80 board carries 24 GB of video RAM in total, though Colab exposes roughly half of it). Training must be limited to under 12-hour sessions or Google Colab shuts down the session. However, the training does not take more than a couple of hours, and the models are saved continuously, so nothing is lost.


Future Work

I've barely scratched the surface of what is possible with GANs; this has mostly been about setting up the framework and data pipeline. There is a lot of room for improvement in the types and depth of the layers used. I could look at different indicators to include in training (instead of just the open, close, and volume of each stock), different parameters to train against, and different selections of stocks. If I continue to apply GANs in this field, my main goal next semester is to build the model using recurrent neural networks for both the generator and the discriminator. I also want to look further into adversarial feature learning, which is what's being used currently, and see if I can find a different way to make predictions.

Sources:

Modified and reused code from https://github.com/MiloMallo/StockMarketGAN which was sourced from https://github.com/nmharmon8/StockMarketGAN.


marketgan's Issues

error when trying to predict

Hello, after training I have issues with opening the deployed models and predicting: when opening them, it says "permission denied".

New complementary real time tool

My name is Luis. I'm a big-data machine-learning developer, I'm a fan of your work, and I usually check your updates.

I was afraid that my savings would be eaten by inflation, so I created a powerful tool based on past technical patterns (volatility, moving averages, statistics, trends, candlesticks, support and resistance, stock index indicators): all the ones you know (RSI, MACD, STOCH, Bollinger Bands, SMA, DEMARK, Japanese candlesticks, Ichimoku, Fibonacci, Williams %R, balance of power, Murrey math, etc.) and more than 200 others.

The tool creates prediction models of correct trading points (buy and sell signals, so every stock is traded at the right time and in the right direction). For data collection and calculation I used big-data tools like Python pandas and stock technical-pattern libraries such as tablib, TAcharts, and pandas_ta, along with powerful machine-learning libraries such as sklearn RandomForest, sklearn GradientBoosting, XGBoost, Google TensorFlow, and Google TensorFlow LSTM.

With the models trained on a selection of the best technical indicators, the tool is able to predict trading points (where to buy, where to sell) and send real-time alerts to Telegram or mail. The points are calculated based on learning the correct trading points of the last 2 years (including the change to a bear market after the rate hike).

I think it could be useful to you. I would like to share it with you, and if you are interested in collaborating to improve it, who knows, we could make beautiful things together.

Thank you for your time.
I'm sorry to contact you here, via the issues; I don't know how else to reach you.
Mail: [email protected] or https://github.com/Leci37/stocks-Machine-learning-RealTime-telegram/discussions

Lookahead bias?

You're using a negative shift in your data: shift(-num_historical_days). Doesn't this incorporate future data into your training set?
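A quick pandas illustration of why a negative shift raises this concern (a toy series, not the project's data):

    import pandas as pd

    s = pd.Series([1, 2, 3, 4, 5])
    print(s.shift(-2).tolist())
    # [3.0, 4.0, 5.0, nan, nan]: row 0 now holds the value from row 2,
    # so statistics aligned this way describe a window that lies in the future.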

Possible collaboration?

I am working with someone on a similar project based on classification using a sigmoid function.

We are generating about 1,000 technical indicators and testing roughly 30 models per stock, applying a threshold selection based on score percentiles. We would also like to implement a GAN and some auto-learning models.

I would be happy to have an exchange of ideas as well.

My Telegram is: @sbongown

Using the CNN benchmark to predict rather than the GAN/XGB hybrid?

Could you walk me through how to modify the code in the predictions cell? I can't seem to untangle the GAN and XGB models. At first I tried simply replacing the GAN model with the CNN, but the CNN class doesn't expose the features the XGB needs to read. I'm also not very familiar with TF v1 code. Any help would be much appreciated.

This is my code for the XGB training cell:

    class TrainXGBBoost:
        def __init__(self, num_historical_days, days=10, pct_change=0):
            self.data = []
            self.labels = []
            self.test_data = []
            self.test_labels = []

            assert os.path.exists(f"{googlepath}cnn_models/checkpoint")
            cnn = CNN(num_features=5, num_historical_days=num_historical_days, is_train=False)
            with tf.Session() as sess:
                sess.run(tf.global_variables_initializer())
                saver = tf.train.Saver()
                if os.path.exists(f'{googlepath}cnn_models/checkpoint'):
                    # Restore the latest CNN checkpoint named in the checkpoint file
                    with open(f'{googlepath}cnn_models/checkpoint', 'rb') as f:
                        model_name = next(f).split('"'.encode())[1]
                    filename = "{}cnn_models/{}".format(googlepath, model_name.decode())
                    currentStep = filename.split("-")[1]
                    new_saver = tf.train.import_meta_graph('{}.meta'.format(filename))
                    new_saver.restore(sess, "{}".format(filename))
                files = [os.path.join(f'{googlepath}stock_data', f) for f in os.listdir(f'{googlepath}stock_data')]

                for file in files:
                    print(file)
                    # Read in the file -- note that parse_dates will be needed later
                    df = pd.read_csv(file, index_col='timestamp', parse_dates=True)

                    if len(df) > 12:
                        df = df[['open', 'high', 'low', 'close', 'volume']]
                        df = df.fillna(0)

                        # Label by percentage change over `days`, then normalize with
                        # a rolling window of size num_historical_days
                        labels = df.close.pct_change(days).map(lambda x: int(x > pct_change / 100.0))
                        df = ((df - df.rolling(num_historical_days).mean().shift(-num_historical_days))
                              / (df.rolling(num_historical_days).max().shift(-num_historical_days)
                                 - df.rolling(num_historical_days).min().shift(-num_historical_days)))

                        df['labels'] = labels
                        df = df.apply(pd.to_numeric, downcast='float')
                        df = df.apply(pd.to_numeric, downcast='integer')
                        df = df.dropna()

                        # Hold out the testing data
                        test_df = df[:500]
                        df = df[500:]

                        data = df[['open', 'high', 'low', 'close', 'volume']].values
                        labels = df['labels'].values
                        for i in range(num_historical_days, len(df), num_historical_days):
                            features = sess.run(cnn.features, feed_dict={cnn.X: [data[i - num_historical_days:i]]})
                            self.data.append(features[0])
                            self.labels.append(labels[i - 1])

                        data = test_df[['open', 'high', 'low', 'close', 'volume']].values
                        labels = test_df['labels'].values
                        for i in range(num_historical_days, len(test_df), 1):
                            features = sess.run(cnn.features, feed_dict={cnn.X: [data[i - num_historical_days:i]]})
                            self.test_data.append(features[0])
                            self.test_labels.append(labels[i - 1])

        def train(self):
            params = {}
            params['objective'] = 'multi:softprob'
            params['eta'] = 0.01
            params['num_class'] = 2
            params['max_depth'] = 20
            params['subsample'] = 0.05
            params['colsample_bytree'] = 0.05
            params['eval_metric'] = 'mlogloss'

            train = xgb.DMatrix(self.data, self.labels)
            test = xgb.DMatrix(self.test_data, self.test_labels)

            watchlist = [(train, 'train'), (test, 'test')]
            clf = xgb.train(params, train, 2000, evals=watchlist, early_stopping_rounds=100)
            joblib.dump(clf, f'{googlepath}models/clf.pkl')
            cm = confusion_matrix(self.test_labels, list(map(lambda x: int(x[1] > .5), clf.predict(test))))
            print(cm)
            plot_confusion_matrix(cm, ['Down', 'Up'], normalize=True, title="Confusion Matrix")

    tf.compat.v1.reset_default_graph()
    boost_model = TrainXGBBoost(num_historical_days=HISTORICAL_DAYS_AMOUNT, days=DAYS_AHEAD, pct_change=PCT_CHANGE_AMOUNT)
    boost_model.train()

This is the error I get when trying to train the XGB on the CNN:

    AttributeError                            Traceback (most recent call last)

    <ipython-input-7-b77e4ceeeb9a> in <cell line: 118>()
        116
        117 tf.compat.v1.reset_default_graph()
    --> 118 boost_model = TrainXGBBoost(num_historical_days=HISTORICAL_DAYS_AMOUNT, days=DAYS_AHEAD, pct_change=PCT_CHANGE_AMOUNT)
        119 boost_model.train()

    <ipython-input-7-b77e4ceeeb9a> in __init__(self, num_historical_days, days, pct_change)
         82               labels = df['labels'].values
         83               for i in range(num_historical_days, len(df), num_historical_days):
    ---> 84                   features = sess.run(cnn.features, feed_dict={cnn.X:[data[i-num_historical_days:i]]})
         85                   self.data.append(features[0])
         86                   # print(features[0])

    AttributeError: 'CNN' object has no attribute 'features'

Here's what I have for my Predictions cell:

    import os
    import pandas as pd
    import random
    import tensorflow.compat.v1 as tf
    tf.disable_v2_behavior()
    import xgboost as xgb
    from sklearn.externals import joblib  # on newer scikit-learn versions, use `import joblib` instead

    class Predict:

        def __init__(self, num_historical_days=20, days=10, pct_change=0,
                     cnn_model=f'{googlepath}deployed_model/cnn',
                     xgb_model=f'{googlepath}deployed_model/xgb'):
            self.data = []
            self.num_historical_days = num_historical_days
            self.cnn_model = cnn_model
            self.xgb_model = xgb_model

            files = [os.path.join(f'{googlepath}stock_data', f) for f in os.listdir(f'{googlepath}stock_data')]
            for file in files:
                print(file)
                df = pd.read_csv(file, index_col='timestamp', parse_dates=True)
                df = df.sort_index(ascending=False)
                df = df[['open', 'high', 'low', 'close', 'volume']]
                df = ((df - df.rolling(num_historical_days).mean().shift(-num_historical_days)) /
                      (df.rolling(num_historical_days).max().shift(-num_historical_days) -
                       df.rolling(num_historical_days).min().shift(-num_historical_days)))
                df = df.dropna()

                # file.split('/')[-1] is the symbol of the current file. Append a
                # tuple of that symbol, df.index[0] (the timestamp), and the window
                # of data from 200 to 200 + num_historical_days (open, high, low,
                # close, volume). For each symbol, we predict based on
                # df[200:200+num_historical_days].values.
                self.data.append((file.split('/')[-1], df.index[0], df[200:200 + num_historical_days].values))

        def cnn_predict(self):
            # Clear the default graph stack and reset the global default graph.
            tf.compat.v1.reset_default_graph()
            cnn = CNN(num_features=5, num_historical_days=self.num_historical_days, is_train=False)
            # A Session encapsulates the environment in which Operations execute and
            # Tensors are evaluated. It may own resources (tf.Variable, tf.QueueBase,
            # tf.ReaderBase), so release them when no longer required by calling
            # tf.Session.close or by using the session as a context manager, as here.
            with tf.Session() as sess:
                sess.run(tf.global_variables_initializer())
                saver = tf.train.Saver()
                saver.restore(sess, self.cnn_model)
                # Reconstruct the classifier persisted with joblib.dump
                clf = joblib.load(self.xgb_model)
                for sym, date, data in self.data:
                    # sess.run(fetches, feed_dict): feed_dict maps Tensor objects to
                    # feed values. Here the fetch is cnn.features and the feed_dict
                    # points cnn.X at one window of data. The value returned by run()
                    # has the same shape as the fetches argument, with the leaves
                    # replaced by the values returned by TensorFlow.
                    try:
                        features = sess.run(cnn.features, feed_dict={cnn.X: [data]})

                        # xgb.DMatrix: XGBoost's internal data structure, optimized
                        # for memory efficiency and training speed.
                        features = xgb.DMatrix(features)

                        # clf is the XGBoost classifier applied to the features (the
                        # flattened last convolutional layer of the discriminator).
                        # The network runs on the past 20 days to produce features,
                        # and the classifier turns them into an Up/Down prediction.
                        print('{} {} {}'.format(str(date).split(' ')[0], sym, clf.predict(features)[0][1] > 0.5))
                    except Exception as e:
                        print(e)  # was print(Exception), which hides the actual error

    p = Predict(num_historical_days=HISTORICAL_DAYS_AMOUNT, days=DAYS_AHEAD, pct_change=PCT_CHANGE_AMOUNT)
    p.cnn_predict()

If you could help me modify the rest of my predictions cell to use only the CNN model for predictions, I would be forever in your debt. I think that would be easier than trying to feed the CNN features into the XGB training, though on a tangent, I wonder if one could extract more accuracy that way.

Data Leakage in Training the CNN

Your implementation demonstrates a brilliant and ingenious approach that truly stands out. However, during my examination of the code, I noticed a potential issue that I believe requires your attention.
It appears that there is a case of data leakage in your CNN classifier. Specifically, the classifier seems to be utilizing information from the same day to predict the outcome for that day. Data leakage can lead to inflated performance metrics during testing but result in poor performance when applied to real-world scenarios.

There is a data leakage issue in the training CNN section of the STOCK_Market_GAN:

# start at num_historical_days and iterate the full length of the training
# data at intervals of num_historical_days
for i in range(num_historical_days, len(df), num_historical_days):
    # split the df into arrays of length num_historical_days and append
    # to data, i.e. array of df[curr - num_days : curr] -> a batch of values
    self.data.append(data[i-num_historical_days:i])

    # appending if price went up or down in curr day of "i" we are looking
    # at
    self.labels.append(labels[i-1])

# do same for test data
data = test_df[['open','high','low','close','volume']].values

You should replace self.labels.append(labels[i-1]) with self.labels.append(labels[i]).
