
eraykeskinmac / degreeguru


This project forked from upstash/degree-guru


AI chatbot for expert answers on university degrees

Home Page: https://degreeguru.vercel.app/



DegreeGuru

Deploy with Vercel

overview

Note

This project is a Community Project.

The project is maintained and supported by the community. Upstash may contribute but does not officially support or assume responsibility for it.

DegreeGuru is a chatbot project designed to effortlessly integrate a chatbot into any web project, enabling seamless question-answering functionality within an hour. The project includes a configurable crawler that meticulously traverses the target website, indexing its contents into an Upstash Vector Database. This database becomes the backbone for the chatbot, allowing it to swiftly retrieve relevant context when users pose questions.

Upon receiving a user query, the chatbot leverages the Upstash Vector Database to deliver accurate responses utilizing streaming with Vercel AI. Additionally, Upstash rate limiting is employed to control access, preventing excessive queries from a particular IP address.

The project employs OpenAI embeddings during both the web crawling phase and the user query embedding process. OpenAI models play a crucial role in generating responses by utilizing the relevant context retrieved from the Upstash Vector Database.

Despite its name, DegreeGuru is not limited to any specific domain; it is domain-agnostic. The chatbot can proficiently answer questions from any topic, provided the information is appropriately stored in the vector database. The only domain-specific aspect lies in the crawler settings, which are configurable in the crawler.yaml file and geared towards universities. The chatbot underwent testing by crawling a university website, with a refined denied keyword list ensuring that pages containing specific words in their URLs are excluded from the crawling process.

The versatility of the DegreeGuru project extends beyond crawling university websites. It can effortlessly be employed to crawl any website, creating a comprehensive vector database that can then be utilized to deploy the chatbot for diverse applications.

Overview

  1. Stack
  2. Quickstart
    1. Crawler
    2. ChatBot
  3. Conclusion
  4. Shortcomings

Stack

Quickstart

Before doing anything else, we recommend forking this repository on GitHub and cloning your fork for local development. Execute the following command to clone the repository:

git clone git@github.com:[YOUR_GITHUB_ACCOUNT]/DegreeGuru.git

As outlined in the project description, the project comprises two primary components: the crawler and the chatbot. Naturally, we will initially focus on how the crawler facilitates the creation of an Upstash Vector Database from any given website. In instances where a vector database is already available, the crawler stage can be bypassed.

Crawler

crawler-diagram

The crawler is developed in Python by initializing a Scrapy project and implementing a custom spider. The spider is equipped with a parse_page function, invoked each time the spider visits a webpage. This callback segments the text on the webpage into chunks, generates an embedding for each chunk, and upserts the vectors into the Upstash Vector Database. Alongside the vectors, the chunk's text and the page URL are stored in the database as metadata.
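The chunking step can be sketched as follows. This is a simplified, hypothetical chunk_text helper standing in for the text splitter the crawler actually uses, not the project's own code:

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list[str]:
    """Split text into overlapping chunks, roughly mimicking the behavior
    configured via chunk_size/chunk_overlap in crawler.yaml (simplified)."""
    if chunk_size <= chunk_overlap:
        raise ValueError("chunk_size must exceed chunk_overlap")
    chunks = []
    step = chunk_size - chunk_overlap  # how far each new chunk advances
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # last chunk already covers the end of the text
    return chunks

chunks = chunk_text("a" * 2500, chunk_size=1000, chunk_overlap=100)
print(len(chunks))  # → 3
```

Each resulting chunk would then be embedded and upserted, with the chunk text and source URL attached as metadata.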


To execute the crawler, follow the steps outlined below:

Configure Environment Variables Before initiating the crawler, it is essential to configure environment variables. These variables serve the purpose of enabling text embedding with OpenAI and facilitating the transmission of vectors to the Upstash Vector Database.

If you don't have an Upstash Vector Database already, create one by setting 1536 as the vector size to match the text-embedding-ada-002 model.

vector-db-create

The following environment variables should be set:

# UPSTASH VECTOR DB
UPSTASH_VECTOR_REST_URL=****
UPSTASH_VECTOR_REST_TOKEN=****

# OPENAI KEY
OPENAI_API_KEY=****

Install Required Python Libraries

To install the libraries, we suggest setting up a virtual Python environment. Before starting the installation, navigate to the degreegurucrawler directory.

To set up a virtual environment, first install the virtualenv package:

pip install virtualenv

Then, create a new virtual environment and activate it:

# create environment
python3 -m venv venv

# activate environment
source venv/bin/activate

Finally, use the requirements.txt to install the required libraries:

pip install -r requirements.txt

Tip

If you have Docker installed, you can skip the "Configure Environment Variables" and "Install Required Python Libraries" sections. Instead, simply update the environment variables in docker-compose.yml and run docker-compose up to create a container running the crawler. Don't forget to configure the crawler as explained in the following sections!


With the environment variables configured and the virtual environment established, you are almost ready to launch the crawler. The next step is configuring the crawler itself, primarily through the crawler.yaml file located in the degreegurucrawler/utils directory. One crucial setting in the settings.py file must also be addressed.

Configuring the Crawler Through `crawler.yaml`

The crawler.yaml file has two main sections, crawler and index:

crawler:
  start_urls:
    - https://www.some.domain.com
  link_extractor:
    allow: '.*some\.domain.*'
    deny:
      - "#"
      - '\?'
      - about
index:
  openAI_embedding_model: text-embedding-ada-002
  text_splitter:
    chunk_size: 1000
    chunk_overlap: 100

Under the crawler section, there are two settings:

  • start_urls: the list of URLs from which the spider starts crawling
  • link_extractor: a dictionary passed as arguments to scrapy.linkextractors.LinkExtractor. Some important parameters are:
    • allow: only extract links matching the given regex(es)
    • allow_domains: only extract links within the given domain(s)
    • deny: skip links matching the given regex(es)
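As an illustration (not Scrapy's own implementation), the allow/deny patterns from the crawler.yaml example above behave roughly like this hypothetical filter:

```python
import re

# Patterns copied from the crawler.yaml example; the filter itself is illustrative.
allow = [r".*some\.domain.*"]
deny = [r"#", r"\?", r"about"]

def is_allowed(url: str) -> bool:
    # A URL must match at least one allow pattern...
    if not any(re.search(p, url) for p in allow):
        return False
    # ...and must match no deny pattern.
    return not any(re.search(p, url) for p in deny)

urls = [
    "https://www.some.domain.com/programs",   # kept
    "https://www.some.domain.com/about",      # dropped: matches "about"
    "https://www.some.domain.com/page?id=3",  # dropped: matches "\?"
    "https://other.site.com/page",            # dropped: no allow match
]
print([u for u in urls if is_allowed(u)])
```

Only the first URL survives, which is how the deny list keeps anchors, query strings, and "about" pages out of the crawl.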

Under the index section, there are two settings: openAI_embedding_model, the OpenAI model used to generate embeddings, and text_splitter, which sets the chunk_size and chunk_overlap used when splitting each page's text into chunks.

Configuring Depth Through `settings.py`

The settings.py file has an important setting called DEPTH_LIMIT, which determines how many consecutive links the spider may follow. Set it too high and the spider will visit the deepest corners of the website, taking a long time to finish with possibly diminishing returns. Set it too low and the crawl will end before visiting relevant pages.
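The relevant line in settings.py looks like the following; the value 2 here is an assumption for illustration, not the project's shipped default:

```python
# degreegurucrawler/settings.py (excerpt; the value below is an example)
DEPTH_LIMIT = 2  # spider follows at most 2 consecutive links from start_urls
```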

Scrapy logs the URLs of pages skipped because of the depth limit. Since this produces many log entries, this logger is disabled in our project. To re-enable it, simply remove "scrapy.spidermiddlewares.depth" from the disable_loggers in the degreegurucrawler/spider/configurable.py file.


When you finish configuring the crawler, you are finally ready to run it to create the Upstash Vector Database! Run the following command to start the crawler:

scrapy crawl configurable --logfile degreegurucrawl.log

Note that this will take some time. You can observe the progress by looking at the log file degreegurucrawl.log or from the metrics in the dashboard of your Upstash Vector Database.

vector-db

Tip

If you want to do a dry run (without creating embeddings or a vector database), simply comment out the line where the callback parameter is passed to the Rule object in ConfigurableSpider.

ChatBot

chatbot-diagram

Before running the ChatBot locally, we need to set the environment variables shown in .env.local.example. To start, copy the example environment file to the actual environment file we will update:

cp .env.local.example .env.local

UPSTASH_VECTOR_REST_URL and UPSTASH_VECTOR_REST_TOKEN are needed to access the Upstash Vector Database. Here, we can use the read-only tokens provided by Upstash since we only need to query the vector database.

vector-db-read-only

UPSTASH_REDIS_REST_URL and UPSTASH_REDIS_REST_TOKEN are needed for rate-limiting based on IP address. In order to get these secrets, go to Upstash dashboard and create a Redis database.

redis-create

Finally, set the OPENAI_API_KEY environment variable to embed user queries and to generate a response.

Once the environment variables are set, DegreeGuru is finally ready to wake up and share its wisdom with the whole world. First, run npm install to install the required packages. Then simply run the following to start the DegreeGuru web application:

npm run dev

The web application will typically be available at http://localhost:3000/, unless stated otherwise in the console where npm run dev was run.

The chat bot can be configured to work in two modes:

  • streaming mode: the model's response is streamed to the web application as it is generated, making interaction with the app more fluid.
  • non-streaming mode: the model's response is shown to the user only once generation has finished. The model takes longer to respond, but in this mode DegreeGuru can explicitly provide the URLs of the webpages it used as context.
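The difference between the two modes can be illustrated with a toy Python generator; this is purely illustrative, as the app streams via Vercel AI:

```python
def generate_tokens():
    # Stand-in for tokens arriving incrementally from the language model.
    for token in ["Degree", "Guru ", "says ", "hi"]:
        yield token

# Streaming mode: each token is forwarded to the client as it is produced.
streamed = list(generate_tokens())

# Non-streaming mode: the full response is assembled before being shown.
full_response = "".join(generate_tokens())
print(full_response)  # → DegreeGuru says hi
```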

Changing Streaming Mode

To enable or disable streaming, navigate to the src/app/route/guru directory and open the route.tsx file. Setting returnIntermediateSteps to true disables streaming, while setting it to false enables streaming.

To customize the ChatBot further, you may want to update the AGENT_SYSTEM_TEMPLATE in the route.tsx file. Note that the template references Stanford University; you may want to change this if your application targets a different university.


Conclusion

In conclusion, the DegreeGuru project seamlessly integrates LangChain, Vercel AI, Upstash rate limiting, and the Upstash Vector Database. The chatbot delivers accurate responses by efficiently indexing content, as demonstrated by the tests we carried out on a university website. With a user-friendly interface and adaptable settings, DegreeGuru is a valuable tool for developers, enhancing user interactions and information retrieval.

Shortcomings

The project has a few shortcomings we can mention:

  • UpstashVectorStore extends the LangChain vector store but it is not a complete implementation. It only implements the similaritySearchVectorWithScore method which is needed for our agent. Once the vector store is properly added to LangChain, this project can be updated with the new vector store.
  • When the non-streaming mode is enabled, message history causes an error after the user enters another query.
  • Our sources are available as URLs in the Upstash Vector Database, but we are not able to show them explicitly in streaming mode. Instead, we provide the links to the chatbot as context and expect the bot to include them in the response.
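For reference, similaritySearchVectorWithScore boils down to ranking stored vectors by similarity to the query vector and returning the top matches with their scores. A toy Python analog, using cosine similarity and made-up data:

```python
import math

def similarity_search_with_score(query_vec, stored, k=2):
    """Rank stored (metadata, vector) pairs by cosine similarity to the
    query vector and return the top-k with their scores (illustrative only)."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)
    scored = [(meta, cos(query_vec, vec)) for meta, vec in stored]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:k]

# Hypothetical 2-dimensional vectors; real embeddings have 1536 dimensions.
stored = [("page-a", [1.0, 0.0]), ("page-b", [0.0, 1.0]), ("page-c", [0.7, 0.7])]
print(similarity_search_with_score([1.0, 0.0], stored, k=2))
```

In the real vector store, the metadata would carry the chunk text and source URL described in the Crawler section.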


Contributors

cahidarda, ademilter, enesakar
