This is the main repo for fetching data from external sources and adding it to our database in a multi-step pipeline, using BullMQ as the task queue.
![image](https://private-user-images.githubusercontent.com/395843/296063012-d280fbc0-6fd9-496e-a487-9b37c3ab179f.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjIxNjY0ODEsIm5iZiI6MTcyMjE2NjE4MSwicGF0aCI6Ii8zOTU4NDMvMjk2MDYzMDEyLWQyODBmYmMwLTZmZDktNDk2ZS1hNDg3LTliMzdjM2FiMTc5Zi5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNzI4JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDcyOFQxMTI5NDFaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT0yNzI5YzM4NGJkYTI1ZGUzMGYwOTlhOWY1NTE5NmZiZWMxODQzNDc3MzZhZTc1NzUwYmM3NWIyMDY3MmRlMTVjJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.PRXbhOnc2xGNbkrROenU4TgdN9DoP19uk0sTFg5AeR8)
This is a first working prototype of the pipeline; it does not yet work on large PDF files.
Some of the following steps run in parallel and most are asynchronous. If a step fails, it must be possible to restart it after a new code release, so we can iterate on the prompts etc. without rerunning the whole pipeline.
- Import the PDF from a URL
- Parse the text
- Send the text to OpenAI for embeddings
- Index the embeddings in the vector database
- Build a query from the prompt together with the relevant embeddings
- Send the query to the LLM
- Automatically verify the results
- Verify the results in a Discord channel (separate PR #2)
- Save to Wikidata or another database (not done)
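The steps above could be modeled as a resumable sequence. In the repo each step runs as a BullMQ job so a failed run can pick up where it left off; the following is a minimal self-contained sketch of that idea (the step names and the `nextStep` helper are illustrative, not the repo's actual API):

```typescript
// Illustrative step names mirroring the pipeline list above.
const STEPS = [
  "importPdf", "parseText", "createEmbeddings", "indexVectors",
  "buildQuery", "queryLlm", "autoVerify", "discordVerify", "save",
] as const;
type Step = typeof STEPS[number];

// Track which steps have completed so a failed run can resume
// after a new code release without redoing earlier work.
interface PipelineState {
  completed: Step[];
}

// Return the first step that has not yet completed, or null when
// the whole pipeline is done.
function nextStep(state: PipelineState): Step | null {
  for (const step of STEPS) {
    if (!state.completed.includes(step)) return step;
  }
  return null;
}
```

In the real pipeline, BullMQ's built-in retry and job-state tracking would play the role of `PipelineState` here.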
Get an OPENAI_API_KEY from OpenAI and add it to a `.env` file in the root directory. Run Redis locally, or add REDIS_HOST and REDIS_PORT to the `.env` file.
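For example, a `.env` in the repo root might look like this (the key value is a placeholder, and the REDIS_* variables are only needed if Redis is not running locally on the defaults):

```shell
OPENAI_API_KEY=sk-...   # your OpenAI API key
REDIS_HOST=localhost    # optional
REDIS_PORT=6379         # optional
```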
```sh
npm i
docker run -d -p 6379:6379 redis
docker run -d -p 8000:8000 chromadb/chroma
npm run dev
```
- Test on smaller PDF files
- Split the PDF text into smaller chunks (maybe using the LangChain PDF loader instead of custom code?)
- Add the chunks to the vector database (ChromaDB)
- Use the vector database with LangChain when querying, to limit the number of tokens
- Docker Compose file for dependencies
- DevOps/Kubernetes setup for the databases and deployment
- Tests etc.
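The chunking item above could be sketched like this until a LangChain splitter is adopted; the function, chunk size, and overlap values are all illustrative, not the repo's actual code:

```typescript
// Naive character-based splitter with overlap between consecutive
// chunks, so context is not cut off mid-sentence at chunk borders.
// Sizes are illustrative; token-aware splitting would be better.
function chunkText(text: string, size = 1000, overlap = 200): string[] {
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last chunk reached the end
  }
  return chunks;
}
```

Each chunk would then be embedded via the OpenAI API and stored in ChromaDB together with its id and source document.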
MIT