Note
This project is a Community Project.
The project is maintained and supported by the community. Upstash may contribute but does not officially support or assume responsibility for it.
DegreeGuru is a chatbot project designed to effortlessly integrate a chatbot into any web project, enabling seamless question-answering functionality within an hour. The project includes a configurable crawler that meticulously traverses the target website, indexing its contents into an Upstash Vector Database. This database becomes the backbone for the chatbot, allowing it to swiftly retrieve relevant context when users pose questions.
Upon receiving a user query, the chatbot leverages the Upstash Vector Database to deliver accurate responses utilizing streaming with Vercel AI. Additionally, Upstash rate limiting is employed to control access, preventing excessive queries from a particular IP address.
The project employs OpenAI embeddings during both the web crawling phase and the user query embedding process. OpenAI models play a crucial role in generating responses by utilizing the relevant context retrieved from the Upstash Vector Database.
Despite its name, DegreeGuru is not limited to any specific domain; it is domain-agnostic. The chatbot can answer questions on any topic, provided the relevant information is stored in the vector database. The only domain-specific aspect is the crawler configuration in the crawler.yaml file, which is geared towards universities. The chatbot was tested by crawling a university website, with a refined deny-keyword list ensuring that pages containing specific words in their URLs are excluded from the crawl.
The versatility of the DegreeGuru project extends beyond crawling university websites. It can effortlessly be employed to crawl any website, creating a comprehensive vector database that can then be utilized to deploy the chatbot for diverse applications.
- Crawler: scrapy
- Chatbot App: Next.js
- Vector DB: Upstash
- LLM Orchestration: Langchain.js
- Generative Model: OpenAI, gpt-3.5-turbo-1106
- Embedding Model: OpenAI, text-embedding-ada-002
- Text Streaming: Vercel AI
- Rate Limiting: Upstash
Before doing anything else, we recommend forking this repository on GitHub and cloning your fork for local development. Run the following command to clone the repository:
git clone [email protected]:[YOUR_GITHUB_ACCOUNT]/DegreeGuru.git
As outlined in the project description, the project comprises two primary components: the crawler and the chatbot. Naturally, we will initially focus on how the crawler facilitates the creation of an Upstash Vector Database from any given website. In instances where a vector database is already available, the crawler stage can be bypassed.
The crawler is developed in Python as a Scrapy project with a custom spider. The spider defines a `parse_page` callback, invoked each time the spider visits a webpage. This callback splits the text on the page into chunks, generates an embedding for each chunk, and upserts the resulting vectors into the Upstash Vector Database. Alongside each vector, the chunk's text and the page URL are stored as metadata.
To execute the crawler, follow the steps outlined below:
Configure Environment Variables
Before initiating the crawler, you need to configure environment variables. These enable text embedding with OpenAI and allow vectors to be sent to the Upstash Vector Database. If you don't have an Upstash Vector Database yet, create one with a vector size of 1536 to match the `text-embedding-ada-002` model.
The following environment variables should be set:
# UPSTASH VECTOR DB
UPSTASH_VECTOR_REST_URL=****
UPSTASH_VECTOR_REST_TOKEN=****
# OPENAI KEY
OPENAI_API_KEY=****
Install Required Python Libraries
To install the libraries, we suggest setting up a virtual Python environment. Before starting the installation, navigate to the `degreegurucrawler` directory.
To set up a virtual environment, first install the `virtualenv` package:
pip install virtualenv
Then, create a new virtual environment and activate it:
# create environment
python3 -m venv venv
# activate environment
source venv/bin/activate
Finally, use `requirements.txt` to install the required libraries:
pip install -r requirements.txt
Tip
If you have Docker installed, you can skip the "Configure Environment Variables" and "Install Required Python Libraries" sections. Instead, simply update the environment variables in `docker-compose.yml` and run `docker-compose up`. This will create a container running our crawler. Don't forget to configure the crawler as explained in the following sections!
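For reference, the Docker setup amounts to a compose file along these lines. This is a hedged sketch: the service name and build details are illustrative, and the `docker-compose.yml` shipped with the repository is the source of truth.

```yaml
# Illustrative sketch only -- see the repository's docker-compose.yml
services:
  crawler:
    build: .  # build the crawler image from the repository's Dockerfile
    environment:
      UPSTASH_VECTOR_REST_URL: "****"
      UPSTASH_VECTOR_REST_TOKEN: "****"
      OPENAI_API_KEY: "****"
```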
Once the environment variables are configured and the virtual environment is set up, you are almost ready to launch the crawler. The next step is configuring the crawler itself, primarily through the `crawler.yaml` file located in the `degreegurucrawler/utils` directory. There is also one crucial setting to address in the `settings.py` file.
Configuring the Crawler Through `crawler.yaml`
The `crawler.yaml` file has two main sections: `crawler` and `index`:
crawler:
start_urls:
- https://www.some.domain.com
link_extractor:
allow: '.*some\.domain.*'
deny:
- "#"
- '\?'
- about
index:
openAI_embedding_model: text-embedding-ada-002
text_splitter:
chunk_size: 1000
chunk_overlap: 100
Under the `crawler` section, there are two subsections:

- `start_urls`: a list of URLs where the spider starts crawling
- `link_extractor`: a dictionary passed as keyword arguments to `scrapy.linkextractors.LinkExtractor`. Some important parameters are:
  - `allow`: only extract links which match the given regex(es)
  - `allow_domains`: only extract links which match the given domain(s)
  - `deny`: skip links which match the given regex(es)
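The allow/deny filtering can be pictured with plain regexes. This is a simplified re-implementation of the idea for illustration, not Scrapy's actual `LinkExtractor`:

```python
import re

def should_follow(url, allow=None, deny=None):
    """Mimic LinkExtractor's allow/deny logic: follow a link only if it
    matches at least one allow pattern (when given) and no deny pattern."""
    if allow and not any(re.search(p, url) for p in allow):
        return False
    if deny and any(re.search(p, url) for p in deny):
        return False
    return True

# Patterns taken from the example crawler.yaml above
allow = [r".*some\.domain.*"]
deny = ["#", r"\?", "about"]
print(should_follow("https://www.some.domain.com/programs", allow, deny))  # True
print(should_follow("https://www.some.domain.com/about", allow, deny))     # False: deny
print(should_follow("https://other.site.com/page", allow, deny))           # False: no allow match
```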
Under the `index` section, there are two subsections:

- `openAI_embedding_model`: the embedding model to use
- `text_splitter`: a dictionary passed as keyword arguments to `langchain.text_splitter.RecursiveCharacterTextSplitter`
Configuring Depth Through `settings.py`
The `settings.py` file has an important setting called `DEPTH_LIMIT`, which determines how many consecutive links the spider may follow. Set it too high and the spider will visit the deepest corners of the website, taking a long time to finish with diminishing returns. Set it too low and the crawl will end before visiting relevant pages.
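The effect of `DEPTH_LIMIT` can be pictured as a bounded breadth-first traversal. This is a toy model of the behavior, not Scrapy's actual depth middleware:

```python
from collections import deque

def crawl(start_url, links, depth_limit):
    """Visit pages breadth-first, never following more than depth_limit
    consecutive links away from the start URL."""
    visited = []
    queue = deque([(start_url, 0)])
    seen = {start_url}
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == depth_limit:
            continue  # requests beyond this depth are dropped
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return visited

site = {"/": ["/a", "/b"], "/a": ["/a/deep"], "/a/deep": ["/a/deeper"]}
print(crawl("/", site, depth_limit=2))  # ['/', '/a', '/b', '/a/deep']
```

With `depth_limit=2`, the page `/a/deeper` is never visited; raising the limit by one would reach it, at the cost of a larger crawl.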
Scrapy logs the URL of each page it skips because of the depth limit. Since this produces a lot of output, this logger is disabled in our project. To re-enable it, simply remove `"scrapy.spidermiddlewares.depth"` from the `disable_loggers` list in the `degreegurucrawler/spider/configurable.py` file.
When you finish configuring the crawler, you are finally ready to run it to create the Upstash Vector Database! Run the following command to start the crawler:
scrapy crawl configurable --logfile degreegurucrawl.log
Note that this will take some time. You can observe the progress in the log file `degreegurucrawl.log` or via the metrics in the dashboard of your Upstash Vector Database.
Tip
If you want to do a dry run (without creating embeddings or a vector database), you can achieve this by simply commenting out the line where we pass the `callback` parameter to the `Rule` object in `ConfigurableSpider`.
Before running the chatbot locally, we need to set the environment variables listed in `.env.local.example`. To start, copy the example environment file to the actual environment file we will edit:
cp .env.local.example .env.local
`UPSTASH_VECTOR_REST_URL` and `UPSTASH_VECTOR_REST_TOKEN` are needed to access the Upstash Vector Database. Here, we can use the read-only tokens provided by Upstash, since we only need to query the vector database.
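Conceptually, answering a user query boils down to a top-k similarity search over the stored vectors. The following is a pure-Python illustration of that idea; the actual app issues the query through the Upstash Vector REST API, and the vectors and URLs below are made up:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, records, k=3):
    """Return the k records whose vectors are most similar to the query."""
    scored = [(cosine(query_vec, r["vector"]), r["metadata"]) for r in records]
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored[:k]

# Toy 2-dimensional "database" (real embeddings have 1536 dimensions)
db = [
    {"vector": [1.0, 0.0], "metadata": {"url": "/admissions", "text": "..."}},
    {"vector": [0.0, 1.0], "metadata": {"url": "/sports", "text": "..."}},
    {"vector": [0.9, 0.1], "metadata": {"url": "/degrees", "text": "..."}},
]
for score, meta in top_k([1.0, 0.1], db, k=2):
    print(meta["url"])
```

The metadata of the best-matching chunks (text plus URL) is what gets handed to the generative model as context.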
`UPSTASH_REDIS_REST_URL` and `UPSTASH_REDIS_REST_TOKEN` are needed for rate limiting based on IP address. To get these secrets, go to the Upstash dashboard and create a Redis database.
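The rate-limiting idea is simply a request counter per IP per time window. Below is a minimal in-memory sketch of a fixed-window limiter; the app itself relies on Upstash rate limiting backed by the Redis database rather than this code:

```python
import time

class FixedWindowRateLimiter:
    """Allow at most `limit` requests per `window` seconds for each key (IP)."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.counters = {}  # (key, window index) -> request count

    def allow(self, key, now=None):
        now = time.time() if now is None else now
        bucket = (key, int(now // self.window))
        self.counters[bucket] = self.counters.get(bucket, 0) + 1
        return self.counters[bucket] <= self.limit

limiter = FixedWindowRateLimiter(limit=2, window=10)
print(limiter.allow("1.2.3.4", now=0))   # True
print(limiter.allow("1.2.3.4", now=1))   # True
print(limiter.allow("1.2.3.4", now=2))   # False: third request in the same window
print(limiter.allow("1.2.3.4", now=11))  # True: a new window has started
```

Keeping the counters in Redis instead of process memory is what makes the limit hold across serverless invocations.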
Finally, set the `OPENAI_API_KEY` environment variable, which is used to embed user queries and to generate responses.
Once the environment variables are set, DegreeGuru is finally ready to wake up and share its wisdom with the whole world. First, run `npm install` to install the required packages. Then start the DegreeGuru web application with:
npm run dev
The web application will typically be available at http://localhost:3000/, unless stated otherwise in the console where `npm run dev` was run.
The chat bot can be configured to work in two modes:
- streaming mode: the generative model's response is streamed to the web application as it is generated. Interaction with the app feels more fluid.
- non-streaming mode: the response is shown to the user only after generation finishes. The model takes longer to respond, but in this mode DegreeGuru can explicitly provide the URLs of the webpages it used as context.
Changing Streaming Mode
To enable or disable streaming, navigate to the `src/app/route/guru` directory and open the `route.tsx` file. Setting `returnIntermediateSteps` to `true` disables streaming, while setting it to `false` enables streaming.
To customize the chatbot further, you may want to update the `AGENT_SYSTEM_TEMPLATE` in the `route.tsx` file. Note that our template references Stanford University; you will want to change this for your own application if you target a different university.
In conclusion, the DegreeGuru project seamlessly integrates LangChain, Vercel AI, Upstash rate limiting, and the Upstash Vector Database. The chatbot delivers accurate responses by efficiently indexing content, as demonstrated by the tests we carried out on a university website. With a user-friendly interface and adaptable settings, DegreeGuru is a valuable tool for developers, enhancing user interactions and information retrieval.
The project has a few shortcomings we can mention:
- `UpstashVectorStore` extends the LangChain vector store but is not a complete implementation: it only implements the `similaritySearchVectorWithScore` method, which is what our agent needs. Once the vector store is properly added to LangChain, this project can be updated to use the official implementation.
- When non-streaming mode is enabled, the message history causes an error after the user enters another query.
- Our sources are available as URLs in the Upstash Vector Database, but we cannot show them explicitly in streaming mode. Instead, we provide the links to the chatbot as context and expect the bot to include them in its response.