
robeespi / deploying-on-aws-a-dockerized-api-to-predict-the-daily-average-sentiment-of-financial-news-articles


Dockerfile 0.01% Python 0.21% Jupyter Notebook 99.79%
flask python docker aws-ec2 postgresql time-series arima-models sentiment


Deploying a dockerized API to predict the daily average sentiment of financial news articles

1. Objectives

1.1. Build and host a predictive model on AWS with Python

1.2. Using a dataset of news articles, train a model to predict the average sentiment of the next day.

1.3. Host the model within a free-tier instance on AWS.

1.4. Build an endpoint that accepts parameters from a user and returns a time series of average sentiment values with the final value as a prediction from the model.

2. Solution

2.1. PostgreSQL stores the articles themselves, their summaries, their sentiment, and their categorization by several topics.

2.2. The endpoint accepts parameters from the user in a request like the following. These parameters are the inputs to an ARIMA model.

{"hold out samples": 20, "lag observations": 3, "degree of differencing": 0, "moving average window": 0}

The user can change any of these parameters, but be aware that some combinations are computationally expensive and the API runs on a free-tier EC2 instance.
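As a rough illustration of how these four fields could map onto an ARIMA order, here is a minimal sketch. It assumes the usual (p, d, q) convention, where "lag observations" is p, "degree of differencing" is d and "moving average window" is q. The function name `parse_params` and the validation rules are hypothetical; this is not the code in api.py.

```python
def parse_params(payload):
    """Map the request body to (hold_out, (p, d, q)); raise ValueError on bad input."""
    keys = ("hold out samples", "lag observations",
            "degree of differencing", "moving average window")
    values = []
    for key in keys:
        value = payload.get(key)
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            raise ValueError(f"{key!r} must be a non-negative integer")
        values.append(value)
    hold_out, p, d, q = values
    if hold_out < 1:
        raise ValueError("'hold out samples' must be at least 1")
    return hold_out, (p, d, q)


request_body = {"hold out samples": 20, "lag observations": 3,
                "degree of differencing": 0, "moving average window": 0}
hold_out, order = parse_params(request_body)
print(hold_out, order)  # 20 (3, 0, 0)
```

Rejecting bad values before fitting anything is cheap, which matters on a small instance where a single expensive fit can take a while.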

2.3. After accepting the inputs, the API script (api.py) performs a rolling forecast: it re-creates the ARIMA model each time a new observation is received, so the model adapts easily to new data.

2.4. This walk-forward validation runs over the hold-out samples, and the model then predicts the average sentiment of the articles for the next day.
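The rolling forecast described in 2.3 and 2.4 can be sketched roughly as follows. To keep the example dependency-free, the forecaster here is a stand-in (the mean of the last few observations), not ARIMA. In the real script this step would fit an ARIMA model on `history` and forecast one step ahead. The names `walk_forward` and `mean_of_last` and the sample values are made up for illustration.

```python
def mean_of_last(history, lags=3):
    """Stand-in forecaster (NOT ARIMA): average of the last `lags` observations."""
    window = history[-lags:]
    return sum(window) / len(window)


def walk_forward(series, hold_out, forecaster):
    """Forecast each hold-out point from the past only, then forecast the next day."""
    if not 0 < hold_out < len(series):
        raise ValueError("hold_out must be between 1 and len(series) - 1")
    history = list(series[:-hold_out])
    predictions = []
    for actual in series[-hold_out:]:
        predictions.append(forecaster(history))  # the model sees only past values
        history.append(actual)                   # then the true value arrives
    next_day = forecaster(history)               # one step beyond the data
    return predictions, next_day


# Illustrative daily averages, not real data from the project.
daily_sentiment = [0.10, 0.12, 0.08, 0.11, 0.15, 0.13, 0.09, 0.14]
predictions, next_day = walk_forward(daily_sentiment, hold_out=3,
                                     forecaster=mean_of_last)
print(len(predictions), round(next_day, 4))
```

Because the forecaster is called once per hold-out sample plus once more for the final prediction, a larger "hold out samples" value means proportionally more model fits per request.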


3. Results and Pipeline

You can test the endpoint with Postman at:

http://ec2-54-79-143-239.ap-southeast-2.compute.amazonaws.com/API/PREDICT_AVG_SENTIMENT

The endpoint accepts parameters from the user in a request like the following.

{"hold out samples": 20, "lag observations": 3, "degree of differencing": 0, "moving average window": 0}
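If you prefer a script to Postman, a request like the one above could be built with the Python standard library, as in the sketch below. The HTTP method (POST) and the JSON content type are assumptions, since they are not stated here, and the shape of the response is not assumed at all. The helper `build_request` is hypothetical.

```python
import json
import urllib.request

URL = ("http://ec2-54-79-143-239.ap-southeast-2.compute.amazonaws.com"
       "/API/PREDICT_AVG_SENTIMENT")


def build_request(payload, url=URL):
    """Build (but do not send) a JSON request for the prediction endpoint."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",  # assumption: the HTTP method is not documented
    )


request = build_request({"hold out samples": 20, "lag observations": 3,
                         "degree of differencing": 0,
                         "moving average window": 0})

# To actually call the API (only works while the EC2 instance is running):
# with urllib.request.urlopen(request, timeout=60) as response:
#     print(response.read().decode("utf-8"))
```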

3.1. Why ARIMA?

All models were evaluated on a hold-out sample (33% of the dataset).

Even though regularized regression such as Ridge performed slightly better than the ARIMA models, I picked ARIMA because it adapts easily to new data by incorporating each new observation into the model. Among the ARIMA configurations, autoregressive orders (>1,0,0) worked better.

Model                       RMSE
Persistence (Baseline)      0.124
Autoregressive (X,0,0)      0.098
ARIMA(X,X,0)                0.11
Linear Regression           8*10>
Lasso Regression            0.094
Ridge Regression            0.087
Decision Tree Regression    0.11
XGB Regressor               0.11
Univariate LSTM             0.092
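For reference, the RMSE metric and the persistence baseline in the table can be computed along these lines. This is a minimal sketch with hypothetical function names and toy numbers; it does not reproduce the figures above.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def persistence_rmse(series, hold_out):
    """RMSE of predicting each hold-out value with the previous observation."""
    if not 0 < hold_out < len(series):
        raise ValueError("hold_out must be between 1 and len(series) - 1")
    actual = series[-hold_out:]
    predicted = series[-hold_out - 1:-1]  # the same values shifted back one step
    return rmse(actual, predicted)


toy_series = [1.0, 2.0, 4.0, 7.0]  # toy values for illustration only
print(persistence_rmse(toy_series, hold_out=2))
```

A model is only worth deploying if it beats this baseline, which is why persistence sits in the first row of the comparison.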

Note that the notebook cells for the LSTM approach will not work in the container environment because TensorFlow is not installed there. I developed the LSTM approach in my local environment to save time.

3.2. Pipeline

The Trial1 notebook has all the details about the database connection, EDA, basic feature engineering, and the experiments with these models and their performance.

Some dataframes were inspected with the pandas profiling library, which produced two HTML outputs. The larger one couldn't be uploaded here, but you can pull the images from DockerHub to access it:

https://hub.docker.com/repository/docker/robeespi/roblast27

Some EDA activities and basic feature engineering techniques explored:

  • Pandas profiling (the reports are in the Docker container as output2.html and output3.html; output2 is the EDA of the SQL query, and output3 covers the dataframe obtained by grouping the timestamp by day and adding category and sector as dummy variables)

  • Lag plots

  • Autocorrelation plots

  • Distribution plots of the response variable against the other variables in the dataset

  • Correlations

  • Category and sector encoded as dummy variables to run regressions

  • Feature importance analysis, although the results were not conclusive

  • The data has three timestamps; I picked the one with the most distinct observations and the longest time span.
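Two of the steps above, grouping by day and checking autocorrelation, can be sketched with the standard library alone. The notebook itself works with pandas dataframes, so this is only an illustration. The input layout (ISO timestamp, sentiment) and the function names are assumptions, and the rows are made-up values.

```python
from collections import defaultdict
from datetime import datetime


def daily_average(rows):
    """Average sentiment per calendar day from (ISO timestamp, sentiment) pairs."""
    buckets = defaultdict(list)
    for timestamp, sentiment in rows:
        buckets[datetime.fromisoformat(timestamp).date()].append(sentiment)
    return [(day, sum(vals) / len(vals)) for day, vals in sorted(buckets.items())]


def autocorrelation(values, lag):
    """Sample autocorrelation of `values` at the given positive lag."""
    n = len(values)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(values) - 1")
    mean = sum(values) / n
    denominator = sum((v - mean) ** 2 for v in values)
    if denominator == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    numerator = sum((values[i] - mean) * (values[i + lag] - mean)
                    for i in range(n - lag))
    return numerator / denominator


# Made-up rows for illustration only.
rows = [("2020-01-01T09:00:00", 0.2),
        ("2020-01-01T15:00:00", 0.4),
        ("2020-01-02T10:00:00", -0.1)]
daily = daily_average(rows)
print(daily)
print(autocorrelation([1.0, 2.0, 3.0, 4.0], lag=1))
```

Lags with high autocorrelation in the daily series are natural candidates for the "lag observations" parameter of the endpoint.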

4. Future Work

The univariate LSTM approach performs well but shows signs of overfitting, so there is still room to find suitable hyperparameters. AutoML/DL and/or a multivariate approach using an attention mechanism will be explored.
