vndee / local-rag-example

Build your own ChatPDF and run it locally

Home Page: https://blog.duy-huynh.com/build-your-own-rag-and-run-them-locally/

License: MIT License

Python 99.23% Shell 0.77%
langchain llm ollama rag

local-rag-example's Introduction

local-rag-example

Build your own ChatPDF and run it locally

Dependencies:

  • langchain
  • streamlit
  • streamlit-chat
  • pypdf
  • chromadb
  • fastembed
pip install langchain streamlit streamlit_chat chromadb pypdf fastembed

Blog post: https://blog.duy-huynh.com/build-your-own-rag-and-run-them-locally/

A Tutorial On How to Build Your Own RAG and How to Run It Locally: Langchain + Ollama + Streamlit

With the rise of Large Language Models and their impressive capabilities, many fancy applications are being built on top of giant LLM providers like OpenAI and Anthropic. The magic behind such applications is the RAG framework, which has already been explained thoroughly in many articles.

If you are not yet familiar with RAG, I recommend going through some of those articles first. This post, however, will skip the basics and guide you directly through building your own RAG application that can run locally on your laptop, without any worries about data privacy or token cost.

We will build an application similar to ChatPDF but simpler, where users can upload a PDF document and ask questions about it through a straightforward UI. Our tech stack is super easy: Langchain, Ollama, and Streamlit.

  • LLM Server: The most critical component of this app is the LLM server. Thanks to Ollama, we have a robust LLM Server that can be set up locally, even on a laptop. While llama.cpp is an option, I find Ollama, written in Go, easier to set up and run.

  • RAG: Undoubtedly, the two leading libraries in the LLM domain are Langchain and LlamaIndex. For this project, I’ll be using Langchain due to my familiarity with it from professional experience. An essential component of any RAG framework is vector storage. We’ll be using Chroma here, as it integrates well with Langchain.

  • Chat UI: The user interface is also an important component. Although there are many technologies available, I prefer using Streamlit, a Python library, for peace of mind.

Okay, let’s start setting it up.


Setup Ollama

As mentioned above, setting up and running Ollama is straightforward. First, visit ollama.ai and download the app appropriate for your operating system.

Next, open your terminal, and execute the following command to pull the latest Mistral-7B. While there are many other LLM models available, I choose Mistral-7B for its compact size and competitive quality.

ollama pull mistral
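Before moving on, it is worth checking that the Ollama server is actually reachable. Here is a minimal sanity check in Python (a sketch, assuming Ollama is listening on its default port 11434):

import requests

# The root endpoint replies with "Ollama is running" when the server is up.
response = requests.get("http://localhost:11434")
print(response.status_code, response.text)

If this fails with a connection error, make sure the Ollama app or ollama serve is running before continuing.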

Build the RAG Pipeline

The second step in our process is to build the RAG pipeline. Given the simplicity of our application, we primarily need two methods: ingest and ask.

The ingest method accepts a file path and loads it into vector storage in two steps: first, it splits the document into smaller chunks to accommodate the token limit of the LLM; second, it vectorizes these chunks using Qdrant FastEmbeddings and stores them into Chroma.

The ask method handles user queries. Users can pose a question, and then the RetrievalQAChain retrieves the relevant contexts (document chunks) using vector similarity search techniques.

With the user's question and the retrieved contexts, we can compose a prompt and request a prediction from the LLM server.

from langchain_community.vectorstores import Chroma
from langchain_community.chat_models import ChatOllama
from langchain_community.embeddings import FastEmbedEmbeddings
from langchain.schema.output_parser import StrOutputParser
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema.runnable import RunnablePassthrough
from langchain.prompts import PromptTemplate
from langchain.vectorstores.utils import filter_complex_metadata


class ChatPDF:
    vector_store = None
    retriever = None
    chain = None

    def __init__(self):
        self.model = ChatOllama(model="mistral")
        self.text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=100)
        self.prompt = PromptTemplate.from_template(
            """
            <s> [INST] You are an assistant for question-answering tasks. Use the following pieces of retrieved context 
            to answer the question. If you don't know the answer, just say that you don't know. Use three sentences
             maximum and keep the answer concise. [/INST] </s> 
            [INST] Question: {question} 
            Context: {context} 
            Answer: [/INST]
            """
        )

    def ingest(self, pdf_file_path: str):
        docs = PyPDFLoader(file_path=pdf_file_path).load()
        chunks = self.text_splitter.split_documents(docs)
        chunks = filter_complex_metadata(chunks)

        # Keep the store on the instance so clear() can reset it later.
        self.vector_store = Chroma.from_documents(documents=chunks, embedding=FastEmbedEmbeddings())
        self.retriever = self.vector_store.as_retriever(
            search_type="similarity_score_threshold",
            search_kwargs={
                "k": 3,
                "score_threshold": 0.5,
            },
        )

        self.chain = ({"context": self.retriever, "question": RunnablePassthrough()}
                      | self.prompt
                      | self.model
                      | StrOutputParser())

    def ask(self, query: str):
        if not self.chain:
            return "Please, add a PDF document first."

        return self.chain.invoke(query)

    def clear(self):
        self.vector_store = None
        self.retriever = None
        self.chain = None

The prompt is sourced from the Langchain hub: Langchain RAG Prompt for Mistral. This prompt has been tested and downloaded thousands of times, serving as a reliable resource for learning about LLM prompting techniques.

You can learn more about LLM prompting techniques here.
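If you would rather pull that prompt from the hub instead of hard-coding it, something along these lines should work (a sketch, assuming the langchainhub package is installed and that the prompt id is rlm/rag-prompt-mistral):

from langchain import hub

# Download the community RAG prompt for Mistral from the Langchain hub.
prompt = hub.pull("rlm/rag-prompt-mistral")
print(prompt)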

More details on the implementation:

ingest: We use PyPDFLoader to load the PDF file uploaded by the user. The RecursiveCharacterTextSplitter, provided by Langchain, then splits this PDF into smaller chunks. It is important to filter out complex metadata not supported by ChromaDB, using the filter_complex_metadata function from Langchain.

For vector storage, Chroma is used, coupled with Qdrant FastEmbed as our embedding model. The vector store is then wrapped in a retriever with a score threshold of 0.5 and k=3, meaning it returns up to the top 3 chunks whose similarity scores exceed 0.5. Finally, we construct a simple conversation chain using LCEL (LangChain Expression Language).
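If the retriever comes back empty for a question (Langchain will log a warning that no documents cleared the relevance score threshold), it helps to inspect what it actually returns. A small debugging sketch, assuming a ChatPDF instance named chat_pdf that has already ingested a document:

# Hypothetical snippet: print the chunks the retriever selects for a query.
docs = chat_pdf.retriever.get_relevant_documents("What is this document about?")
for doc in docs:
    print(doc.metadata, doc.page_content[:200])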

ask: This method simply passes the user's question into our predefined chain and then returns the result.

clear: This method is used to clear the previous chat session and storage when a new PDF file is uploaded.
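Putting the pieces together, here is a minimal way to exercise the class outside of Streamlit (a sketch, assuming the class above lives in rag.py, an example.pdf exists in the working directory, and the mistral model has been pulled into a running Ollama server):

from rag import ChatPDF

chat_pdf = ChatPDF()
chat_pdf.ingest("example.pdf")                       # load, split, embed, and store the document
print(chat_pdf.ask("What is this document about?"))  # retrieve context and query Mistral
chat_pdf.clear()                                     # reset before ingesting another file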

Draft A Simple UI

For a simple user interface, we will use Streamlit, a UI framework designed for the fast prototyping of AI/ML applications.

import os
import tempfile
import streamlit as st
from streamlit_chat import message
from rag import ChatPDF

st.set_page_config(page_title="ChatPDF")


def display_messages():
    st.subheader("Chat")
    for i, (msg, is_user) in enumerate(st.session_state["messages"]):
        message(msg, is_user=is_user, key=str(i))
    st.session_state["thinking_spinner"] = st.empty()


def process_input():
    if st.session_state["user_input"] and len(st.session_state["user_input"].strip()) > 0:
        user_text = st.session_state["user_input"].strip()
        with st.session_state["thinking_spinner"], st.spinner("Thinking"):
            agent_text = st.session_state["assistant"].ask(user_text)

        st.session_state["messages"].append((user_text, True))
        st.session_state["messages"].append((agent_text, False))


def read_and_save_file():
    st.session_state["assistant"].clear()
    st.session_state["messages"] = []
    st.session_state["user_input"] = ""

    for file in st.session_state["file_uploader"]:
        with tempfile.NamedTemporaryFile(delete=False) as tf:
            tf.write(file.getbuffer())
            file_path = tf.name

        with st.session_state["ingestion_spinner"], st.spinner(f"Ingesting {file.name}"):
            st.session_state["assistant"].ingest(file_path)
        os.remove(file_path)


def page():
    if len(st.session_state) == 0:
        st.session_state["messages"] = []
        st.session_state["assistant"] = ChatPDF()

    st.header("ChatPDF")

    st.subheader("Upload a document")
    st.file_uploader(
        "Upload document",
        type=["pdf"],
        key="file_uploader",
        on_change=read_and_save_file,
        label_visibility="collapsed",
        accept_multiple_files=True,
    )

    st.session_state["ingestion_spinner"] = st.empty()

    display_messages()
    st.text_input("Message", key="user_input", on_change=process_input)


if __name__ == "__main__":
    page()

Run this code with the command streamlit run app.py to see what it looks like.

Okay, that’s it! We now have a ChatPDF application that runs entirely on your laptop. Since this post mainly focuses on providing a high-level overview of how to build your own RAG application, there are several aspects that need fine-tuning. You may consider the following suggestions to enhance your app and further develop your skills:

  • Add Memory to the Conversation Chain: Currently, the chain doesn’t remember the conversation flow. Adding temporary memory will help your assistant stay aware of the context (see the sketch after this list).

  • Allow multiple file uploads: it’s okay to chat about one document at a time. But imagine if we could chat about multiple documents — you could put your whole bookshelf in there. That would be super cool!

  • Use Other LLM Models: While Mistral is effective, there are many other alternatives available. You might find a model that better fits your needs, like CodeLlama for developers. However, remember that the choice of model depends on your hardware, especially the amount of RAM you have 💵

  • Enhance the RAG Pipeline: There’s room for experimentation within RAG. You might want to change the retrieval metric or the embedding model, or add layers like a re-ranker to improve results.
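As a starting point for the memory suggestion above, here is one possible sketch (not part of the repo): keep a simple list of previous turns and prepend them to each new question before it goes through the chain. It is naive, since the whole history is also fed to the retriever, but it shows the idea.

# Hypothetical extension: naive conversation memory on top of ChatPDF.
class ChatPDFWithMemory(ChatPDF):
    def __init__(self):
        super().__init__()
        self.history = []  # list of (question, answer) pairs

    def ask(self, query: str):
        if not self.chain:
            return "Please, add a PDF document first."

        # Prepend earlier turns so the model sees the conversation so far.
        previous = "\n".join(f"Q: {q}\nA: {a}" for q, a in self.history)
        full_query = f"{previous}\nQ: {query}" if previous else query

        answer = self.chain.invoke(full_query)
        self.history.append((query, answer))
        return answer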

Full source code: https://github.com/vndee/local-rag-example

local-rag-example's People

Contributors: vndee

local-rag-example's Issues

Versioning issues

Would it be possible to produce a requirements.txt (with pinned versions) for pip? I'm attempting to implement this, but hitting various obstacles: a default pip install langchain complains about needing to import various libraries from langchain_community.

I've installed the libraries released prior to your 1st Dec commit:

pip install langchain==0.0.343 streamlit==1.28.2 streamlit_chat chromadb==0.4.18 pypdf==3.17.1 fastembed==0.1.0

But I'm still getting an error:

/home/USERNAME/anaconda3/envs/py310/lib/python3.10/site-packages/langchain_core/vectorstores.py:325: UserWarning: No relevant docs were retrieved using the relevance score threshold 0.5
  warnings.warn(

Any assistance would be appreciated! I've tried with both python 3.10 and 3.11 environments, but 3.11 gives the following warning on running app.py: gio: http://localhost:8501: Operation not supported

Thanks.

No context message

I'm always getting the following message from the bot: I'm an assistant for answering questions. I don't have enough context to identify "it". Please provide more information.

No matter which PDF document I load or the question I ask.

Ollama server is running well, all the previous steps are working, but the bot doesn't seem able to chat on the document I loaded.

What is gio used for?

When running the script on WSL with Ubuntu i get the following output:

  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8501
  Network URL: http://172.21.20.151:8501

gio: http://localhost:8501: Operation not supported

What is gio used for, and which operation is not supported?

After input of my query and pressing ENTER, I get this error: ConnectionError: HTTPConnectionPool(host='localhost', port=11434): Max retries exceeded with url: /api/chat/

I am running the streamlit app locally on my Windows PC, which runs fine.
I can upload a PDF, which runs fine.
But then I input my query, and after pressing ENTER I get the following error, which I cannot get rid of:

ConnectionError: HTTPConnectionPool(host='localhost', port=11434): Max retries exceeded with url: /api/chat/ (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x000001AEDAEFCE10>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it'))
Traceback:
File "C:\ProgramData\Anaconda3\Lib\site-packages\streamlit\runtime\scriptrunner\script_runner.py", line 531, in _run_script
self._session_state.on_script_will_rerun(rerun_data.widget_states)
File "C:\ProgramData\Anaconda3\Lib\site-packages\streamlit\runtime\state\safe_session_state.py", line 63, in on_script_will_rerun
self._state.on_script_will_rerun(latest_widget_states)
File "C:\ProgramData\Anaconda3\Lib\site-packages\streamlit\runtime\state\session_state.py", line 504, in on_script_will_rerun
self._call_callbacks()
File "C:\ProgramData\Anaconda3\Lib\site-packages\streamlit\runtime\state\session_state.py", line 517, in _call_callbacks
self._new_widget_state.call_callback(wid)
File "C:\ProgramData\Anaconda3\Lib\site-packages\streamlit\runtime\state\session_state.py", line 261, in call_callback
callback(*args, **kwargs)
File "C:\Users\HP\RAG\RAG FULLY LOCAL Langchain + Ollama + Streamlit\local-rag-example\app.py", line 21, in process_input
agent_text = st.session_state["assistant"].ask(user_text)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\HP\RAG\RAG FULLY LOCAL Langchain + Ollama + Streamlit\local-rag-example\rag.py", line 54, in ask
return self.chain.invoke(query)
^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\HP\AppData\Roaming\Python\Python311\site-packages\langchain_core\runnables\base.py", line 2053, in invoke
input = step.invoke(
^^^^^^^^^^^^
File "C:\Users\HP\AppData\Roaming\Python\Python311\site-packages\langchain_core\language_models\chat_models.py", line 165, in invoke
self.generate_prompt(
File "C:\Users\HP\AppData\Roaming\Python\Python311\site-packages\langchain_core\language_models\chat_models.py", line 543, in generate_prompt
return self.generate(prompt_messages, stop=stop, callbacks=callbacks, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\HP\AppData\Roaming\Python\Python311\site-packages\langchain_core\language_models\chat_models.py", line 407, in generate
raise e
File "C:\Users\HP\AppData\Roaming\Python\Python311\site-packages\langchain_core\language_models\chat_models.py", line 397, in generate
self._generate_with_cache(
File "C:\Users\HP\AppData\Roaming\Python\Python311\site-packages\langchain_core\language_models\chat_models.py", line 576, in _generate_with_cache
return self._generate(
^^^^^^^^^^^^^^^
File "C:\Users\HP\AppData\Roaming\Python\Python311\site-packages\langchain_community\chat_models\ollama.py", line 250, in _generate
final_chunk = self._chat_stream_with_aggregation(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\HP\AppData\Roaming\Python\Python311\site-packages\langchain_community\chat_models\ollama.py", line 183, in _chat_stream_with_aggregation
for stream_resp in self._create_chat_stream(messages, stop, **kwargs):
File "C:\Users\HP\AppData\Roaming\Python\Python311\site-packages\langchain_community\chat_models\ollama.py", line 156, in _create_chat_stream
yield from self._create_stream(
^^^^^^^^^^^^^^^^^^^^
File "C:\Users\HP\AppData\Roaming\Python\Python311\site-packages\langchain_community\llms\ollama.py", line 215, in _create_stream
response = requests.post(
^^^^^^^^^^^^^^
File "C:\ProgramData\Anaconda3\Lib\site-packages\requests\api.py", line 115, in post
return request("post", url, data=data, json=json, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ProgramData\Anaconda3\Lib\site-packages\requests\api.py", line 59, in request
return session.request(method=method, url=url, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ProgramData\Anaconda3\Lib\site-packages\requests\sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ProgramData\Anaconda3\Lib\site-packages\requests\sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\ProgramData\Anaconda3\Lib\site-packages\requests\adapters.py", line 519, in send
raise ConnectionError(e, request=request)

(two screenshots of the error attached)

What can I do about it?
