This is the main repo for fetching data from external sources and adding it to our database in a multi-step pipeline, using BullMQ as the task queue.
![image](https://private-user-images.githubusercontent.com/395843/296063012-d280fbc0-6fd9-496e-a487-9b37c3ab179f.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjIxNjY0ODEsIm5iZiI6MTcyMjE2NjE4MSwicGF0aCI6Ii8zOTU4NDMvMjk2MDYzMDEyLWQyODBmYmMwLTZmZDktNDk2ZS1hNDg3LTliMzdjM2FiMTc5Zi5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNzI4JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDcyOFQxMTI5NDFaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT0yNzI5YzM4NGJkYTI1ZGUzMGYwOTlhOWY1NTE5NmZiZWMxODQzNDc3MzZhZTc1NzUwYmM3NWIyMDY3MmRlMTVjJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.PRXbhOnc2xGNbkrROenU4TgdN9DoP19uk0sTFg5AeR8)
This is a first working prototype of the pipeline; it does not yet work on large PDF files.
Some of the following steps run in parallel and most are asynchronous. If a step fails, it must be possible to restart it after a new code release, so we can iterate on the prompts etc. without rerunning the whole pipeline.
- Import the PDF from a URL
- Parse the text
- Send the text to OpenAI for embeddings
- Index the embeddings in the vector database
- Build a query from the prompt together with the relevant embeddings
- Send the query to the LLM
- Automatically verify the results
- Verify the results in a Discord channel (separate PR #2)
- Save to Wikidata or another database (not done)
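The steps above could be modeled as a resumable sequence. In the repo each step runs as a BullMQ job so a failed run can pick up where it left off; the following is a minimal self-contained sketch of that idea (the step names and the `nextStep` helper are illustrative, not the repo's actual API):

```typescript
// Illustrative step names mirroring the pipeline list above.
const STEPS = [
  "importPdf", "parseText", "createEmbeddings", "indexVectors",
  "buildQuery", "queryLlm", "autoVerify", "discordVerify", "save",
] as const;
type Step = typeof STEPS[number];

// Track which steps have completed so a failed run can resume
// after a new code release without redoing earlier work.
interface PipelineState {
  completed: Step[];
}

// Return the first step that has not yet completed, or null when
// the whole pipeline is done.
function nextStep(state: PipelineState): Step | null {
  for (const step of STEPS) {
    if (!state.completed.includes(step)) return step;
  }
  return null;
}
```

In the real pipeline, BullMQ's built-in retry and job-state tracking would play the role of `PipelineState` here.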
Get an OPENAI_API_KEY from OpenAI and add it to a `.env` file in the root directory. Run Redis locally, or add REDIS_HOST and REDIS_PORT to the `.env` file.
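For example, a `.env` in the repo root might look like this (the key value is a placeholder, and the REDIS_* variables are only needed if Redis is not running locally on the defaults):

```shell
OPENAI_API_KEY=sk-...   # your OpenAI API key
REDIS_HOST=localhost    # optional
REDIS_PORT=6379         # optional
```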
```sh
npm i
docker run -d -p 6379:6379 redis
docker run -d -p 8000:8000 chromadb/chroma
npm run dev
```
- Test on smaller PDF files
- Split the PDF text into smaller chunks (maybe using the LangChain PDF loader instead of custom code?)
- Add the chunks to the vector database (ChromaDB)
- Use the vector database with LangChain when querying, to limit the number of tokens
- Docker Compose file for dependencies
- DevOps/Kubernetes setup for the databases and deployment
- Tests etc.
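The chunking item above could be sketched like this until a LangChain splitter is adopted; the function, chunk size, and overlap values are all illustrative, not the repo's actual code:

```typescript
// Naive character-based splitter with overlap between consecutive
// chunks, so context is not cut off mid-sentence at chunk borders.
// Sizes are illustrative; token-aware splitting would be better.
function chunkText(text: string, size = 1000, overlap = 200): string[] {
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last chunk reached the end
  }
  return chunks;
}
```

Each chunk would then be embedded via the OpenAI API and stored in ChromaDB together with its id and source document.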
MIT