gpt-investar's Introduction

GPT-InvestAR

Enhancing Stock Investment Strategies through Annual Report Analysis with Large Language Models

This repository contains a set of tools and scripts designed to enhance stock investment strategies through the analysis of annual reports using Large Language Models. The components in this repository are organized as follows:

  1. download_10k.py: This Python script downloads 10-K filings of companies from the SEC website, which contain crucial financial information.

  2. convert_html_to_pdf.py: Converts HTML files to PDF files. PDFs are preferred due to their token efficiency for further analysis.

  3. make_targets.py: Generates a DataFrame of stock tickers with target values of different time resolutions, which can be used as investment targets for a Machine Learning model.

  4. embeddings_save.py: Generates embeddings of PDF files and saves them using ChromaDB. These embeddings are numerical representations of the textual content in annual reports.

  5. gpt_scores_as_features.py: Utilizes saved embeddings to query all questions for each annual report using a Large Language Model (LLM) such as GPT-3.5, and uses the scores or answers as features.

  6. modeling_and_return_estimation.ipynb: This Jupyter Notebook contains the core modeling process. It uses machine learning techniques, specifically Linear Regression, to model the dataset and estimate returns. The goal is to create a portfolio of top-k predicted stocks and compare their returns with the S&P 500 index.

By following the sequence of these components, you can analyze annual reports, generate embeddings, and build predictive models to potentially enhance stock investment strategies.
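As a rough illustration of what "target values of different time resolutions" in step 3 could look like, the sketch below computes forward returns over several horizons from a price series. This is an assumption about the target definition, not the logic of make_targets.py; the function name `forward_returns`, the column names, and the prices are all hypothetical.

```python
def forward_returns(prices, horizon):
    """Forward return over `horizon` steps for each position.

    Returns None where the future price is not available yet.
    Hypothetical helper, not taken from make_targets.py.
    """
    out = []
    for i, price in enumerate(prices):
        j = i + horizon
        out.append(prices[j] / price - 1.0 if j < len(prices) else None)
    return out


# Made-up prices for a single ticker, one value per period.
prices = [100.0, 110.0, 121.0, 133.1]

# One target column per horizon (the "time resolution" in this sketch).
targets = {f"target_{h}": forward_returns(prices, h) for h in (1, 2)}
```

Positions near the end of the series have no future price, so they are left as None rather than filled in.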

Feel free to explore each component for more details and usage instructions.
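The top-k comparison described in step 6 can be sketched in a few lines. The example below assumes an equal-weighted portfolio and uses made-up tickers, predictions, and returns; the notebook may weight or evaluate the portfolio differently, and `top_k_portfolio_return` is a hypothetical helper.

```python
def top_k_portfolio_return(predicted, realized, k):
    """Equal-weighted realized return of the k tickers with the highest predictions.

    Equal weighting is an assumption made for this sketch.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top = sorted(predicted, key=predicted.get, reverse=True)[:k]
    return sum(realized[ticker] for ticker in top) / len(top)


# Hypothetical model predictions and realized returns per ticker.
predicted = {"AAA": 0.9, "BBB": 0.2, "CCC": 0.7, "DDD": 0.4}
realized = {"AAA": 0.10, "BBB": -0.05, "CCC": 0.06, "DDD": 0.01}
benchmark_return = 0.04  # stand-in for the index return over the same period

portfolio_return = top_k_portfolio_return(predicted, realized, k=2)
excess_return = portfolio_return - benchmark_return
```

Here the two highest-scored tickers are AAA and CCC, so the portfolio return is the average of their realized returns, and the excess return is that average minus the benchmark.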

Dependencies

  1. LlamaIndex (and related dependencies)

  2. OpenBB (and related dependencies)

  3. Scikit-Learn

  4. PDFKit (and related dependencies)

It is recommended to install libraries 1 and 2 in separate virtual (conda) environments. None of the Python scripts mentioned above requires both of these libraries to be installed in the same environment.

Citation

If you use the code or find this repository helpful, please consider citing the paper:

GPT-InvestAR: Enhancing Stock Investment Strategies through Annual Report Analysis with Large Language Models
Udit Gupta
Publication Links:

  1. arXiv Link
  2. SSRN link
@article{GPT-InvestAR,
  author = {Udit Gupta},
  title = {GPT-InvestAR: Enhancing Stock Investment Strategies through Annual Report Analysis with Large Language Models},
  journal = {arXiv e-prints},
  year = {2023},
  eprint = {arXiv:2309.03079},
  url = {https://arxiv.org/abs/2309.03079},
}

gpt-investar's People

Contributors

uditgupta10

gpt-investar's Issues

Template to create the questions

I tried some questions such as: "profit": "Does the company got a good profit in this year? Does the company do good in this year?"
But the LLM does not give me a numeric answer such as '0.8'. Instead it gives me {"score":"Yes"}.
Are there any tips to make sure the questions get a numeric answer rather than "Yes", "No", or "Uncertain"? Thanks so much!
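One possible approach, offered as a sketch and not confirmed by the repository, is to state the expected numeric format in the prompt itself and to reject any reply that is not a number in range. The helper names `build_prompt` and `parse_score`, the prompt wording, and the 0 to 1 range are assumptions.

```python
import json


def build_prompt(question: str) -> str:
    """Append an explicit output-format instruction to a question (hypothetical wording)."""
    return (
        f"{question}\n"
        'Answer only with JSON of the form {"score": <number between 0 and 1>}. '
        "Do not answer with words such as Yes, No, or Uncertain."
    )


def parse_score(raw: str):
    """Return the score as a float in [0, 1], or None if the reply is not usable."""
    try:
        score = float(json.loads(raw)["score"])
    except (ValueError, KeyError, TypeError):
        # Covers invalid JSON, a missing "score" key, and non-numeric values like "Yes".
        return None
    return score if 0.0 <= score <= 1.0 else None
```

A reply such as {"score":"Yes"} is returned as None by `parse_score`, so it can be retried or skipped instead of being used as a feature.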

Correct the requirements.txt

I've been unable to set up the environment from requirements.txt.
I've tried to go through each conda and pip install manually, but I ran into so many issues that I've given up.
Python=3.9 got me the furthest, but it was still problematic.

It would be great if you could offer a clearer list for users.

Thanks.

Can't find the 27 dimensions question list

Hi, I recently had the opportunity to read your paper and found it to be incredibly insightful and thought-provoking. However, I noticed in your paper you referenced a comprehensive list of questions encompassing 27 dimensions. We are very keen to delve deeper into your research, but upon reviewing the materials on your GitHub, it appears that only a portion of this list is available. (exactly in question.json)

Would it be possible for you to share the complete list of questions? Having access to this information would greatly assist us in fully understanding and potentially replicating your study.

Thank you in advance for your time and assistance. Your work is greatly appreciated and we are looking forward to exploring it further.

Only one question?

I'm curious about which features work well for the non-negative logistic regression, but I only found one question in questions.json. Are there other features used for the modeling?

Issue with the CONFIG_PATH in download_10k.py

Hi Udit,

I have followed the readme dependencies notes but when trying to run I am coming up against an error which reads:

usage: ipykernel_launcher.py [-h] --config_path CONFIG_PATH
ipykernel_launcher.py: error: the following arguments are required: --config_path

Any help appreciated.

Thank you
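The `ipykernel_launcher.py` in the usage line suggests the script was run inside a Jupyter kernel, where `argparse` reads the kernel's own command line and never sees `--config_path`. The sketch below shows the general pattern of passing the arguments explicitly; only the `--config_path` flag comes from the error message, while `parse_args` and the file name `config.json` are hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse the --config_path flag; argv=None falls back to the real command line."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--config_path", required=True)
    return parser.parse_args(argv)


# From a terminal the flag is supplied on the command line, for example:
#   python download_10k.py --config_path config.json
# Inside a notebook, pass the arguments as an explicit list instead:
args = parse_args(["--config_path", "config.json"])
```

Running the script from a terminal with the flag, rather than from a notebook cell, avoids the problem without any code change.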

Where can I find those 27 questions the paper mentioned?

While reading the paper, I noticed that you use 27 questions for the LLM to extract features and scores, but the source code has a question.json file with just one question in it.

So, could you tell me where I can find those questions so I can study them?

about pd.qcut()

In this part, I got an error:
[two screenshots, not preserved]
Does this mean the code needs a small change to avoid the duplicated bin edges? Did you run into this error before? Thanks so much in advance for your reply!
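The screenshots are not available, so the following assumes the error is the `ValueError` that `pd.qcut` raises when several quantile edges coincide, which happens when many values are tied. The sketch shows two common workarounds in pandas; the score values are made up.

```python
import pandas as pd

# Hypothetical scores with many ties, so several quantile edges coincide.
scores = pd.Series([0.0, 0.0, 0.0, 0.0, 0.5, 0.5, 1.0, 1.0])

# pd.qcut(scores, q=4) would raise ValueError here because bin edges repeat.
# Option 1: drop the repeated edges; fewer than q bins may come back.
binned = pd.qcut(scores, q=4, duplicates="drop")

# Option 2: rank first so every value is distinct, which keeps all q bins.
ranked_bins = pd.qcut(scores.rank(method="first"), q=4, labels=False)
```

Option 1 keeps tied values in the same bin but can return fewer bins than requested. Option 2 always returns four equally sized bins but splits ties by their order in the series.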
