yash9439 / rayqdrantfastembed

Generating embeddings for thousands of PDF documents and storing them in Qdrant, using FastEmbed with distributed computing in Ray

License: Apache License 2.0

Jupyter Notebook 100.00%
distributed-computing fastembed qdrant-vector-database rag ray

rayqdrantfastembed's Introduction

Ray Distributed Computing with FastEmbed and Qdrant

This repository demonstrates the use of the Ray distributed computing framework together with FastEmbed for embedding generation and Qdrant for similarity search. Specifically, it shows how to efficiently generate embeddings for text data, store them in Qdrant, and run similarity search queries.

Requirements

  • Python 3.x
  • Jupyter Notebook (for running RayQdrant.ipynb)
  • PyPDF2
  • nltk
  • ray
  • fastembed
  • qdrant_client

You can install the required libraries using pip:

pip install PyPDF2 nltk fastembed qdrant-client[fastembed]
pip install -U "ray[data,train,tune,serve]"

Usage

  1. Clone the Repository:

    Clone this repository to your local machine:

    git clone https://github.com/yash9439/RayQdrantFastEmbed.git
  2. Start Qdrant with Docker:

    Pull the Qdrant image and run it, mapping the REST (6333) and gRPC (6334) ports and mounting a local directory for storage:

    sudo docker pull qdrant/qdrant
    sudo docker run -p 6333:6333 -p 6334:6334 -v $(pwd)/qdrant_storage:/qdrant/storage:z qdrant/qdrant
  3. Run Jupyter Notebook:

    Open the RayQdrant.ipynb file using Jupyter Notebook:

    jupyter notebook RayQdrant.ipynb

    Execute each cell in the notebook sequentially to run the code. Ensure you have the necessary dependencies installed.

  4. Interpret Results:

    After running the notebook, you will see the time taken for embedding generation with Ray distributed computing, along with the results of the similarity search queries against Qdrant.

Folder Structure

  • Docs/: This directory contains the PDF documents for which embeddings are generated.
  • RayQdrant.ipynb: Jupyter Notebook containing the code for embedding generation using Ray and similarity search using Qdrant.

License

This code is provided under the Apache License 2.0.

Feel free to modify and distribute it as needed. If you find any issues or have suggestions for improvements, please feel free to open an issue or create a pull request.

rayqdrantfastembed's People

Contributors: yash9439

rayqdrantfastembed's Issues

use fastembed-gpu Error!

import os
import time

import fitz  # PyMuPDF
import numpy as np
import ray
from fastembed import TextEmbedding
from nltk.tokenize import sent_tokenize
def extract_text_from_pdf(pdf_path):
    reader =  fitz.open(pdf_path)
    extracted_text = ""
    for page_num in range(len(reader)):
        page = reader.load_page(page_num)
        extracted_text += page.get_text()
    return extracted_text


def extract_text_from_pdfs_in_directory(directory):
    for filename in os.listdir(directory):
        if filename.endswith(".pdf"):
            pdf_path = os.path.join(directory, filename)
            extracted_text = extract_text_from_pdf(pdf_path)
            txt_filename = os.path.splitext(filename)[0] + ".txt"
            txt_filepath = os.path.join(directory, txt_filename)

            with open(txt_filepath, "w",encoding='utf-8') as txt_file:
                txt_file.write(extracted_text)

# Specify the directory containing PDF files
directory_path = r"/home/xxx/fastembd/demo"
s_time = time.time()
# Extract text from PDFs in the directory and save as text files
extract_text_from_pdfs_in_directory(directory_path)
e_time = time.time()
print(f"Time taken to extract text from all PDFs: {e_time - s_time} seconds")

# List all .txt files in the directory
txt_files = [file for file in os.listdir(directory_path) if file.endswith('.txt')]

# List to store sentences from all files
all_sentences = []

# Read each text file, split into sentences with NLTK, and store
s_time = time.time()
for txt_file in txt_files:
    file_path = os.path.join(directory_path, txt_file)
    with open(file_path, "r", encoding="utf-8") as file:
        text = file.read()
        sentences = sent_tokenize(text)
        all_sentences.extend(sentences)
e_time = time.time()
print(len(all_sentences))
print(f"Time taken for NLTK sentence splitting: {e_time - s_time} seconds")

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

embedding_model_gpu = TextEmbedding(model_name="BAAI/bge-base-en", cache_dir="./embeddings", providers=['CUDAExecutionProvider'])
print(embedding_model_gpu.model.model.get_providers())

ray.init()

@ray.remote
class EmbeddingWorker:
    def __init__(self):
        embedding_model_gpu = TextEmbedding(model_name="BAAI/bge-base-en", cache_dir="./embeddings", providers=['CUDAExecutionProvider'])
        embedding_model_gpu.model.model.get_providers()

        self.embedding_model = embedding_model_gpu

    def embed_documents(self, documents):
        embeddings = []
        for document in documents:
            embeddings.append(np.array(list(self.embedding_model.embed([document]))))
        return embeddings

# Define the number of workers
num_workers = 1  # Adjust this according to your resources
documents = all_sentences

# Split documents into chunks for each worker.
# Ceil division so the number of chunks never exceeds num_workers and
# no documents are dropped by the zip() below.
chunk_size = -(-len(documents) // num_workers)
document_chunks = [documents[i:i + chunk_size] for i in range(0, len(documents), chunk_size)]

# Start the workers
embedding_workers = [EmbeddingWorker.remote() for _ in range(num_workers)]

# Perform embedding generation in parallel
start_time = time.time()
embedding_tasks = [worker.embed_documents.remote(chunk) for worker, chunk in zip(embedding_workers, document_chunks)]
embeddings = ray.get(embedding_tasks)
end_time = time.time()

# Flatten the embeddings list

embeddings = [embedding for sublist in embeddings for embedding in sublist]
print(len(embeddings))
# print(embeddings)
print("Time taken to generate embeddings with Ray Distributed Computing:", end_time - start_time, "seconds")

# Shutdown Ray
ray.shutdown()
ray::EmbeddingWorker.__init__() (pid=3197588, ip=192.168.45.164, actor_id=7a76e5641370afec20a0db9403000000, repr=<test.EmbeddingWorker object at 0x7f29f8207070>)
  File "/home/xxx/fastembd/test.py", line 82, in __init__
    embedding_model_gpu = TextEmbedding(model_name="BAAI/bge-base-en", cache_dir="./embeddings", providers=['CUDAExecutionProvider'])
  File "/root/miniconda3/envs/FastEmbed/lib/python3.10/site-packages/fastembed/text/text_embedding.py", line 68, in __init__
    self.model = EMBEDDING_MODEL_TYPE(
  File "/root/miniconda3/envs/FastEmbed/lib/python3.10/site-packages/fastembed/text/onnx_embedding.py", line 227, in __init__
    self.load_onnx_model(
  File "/root/miniconda3/envs/FastEmbed/lib/python3.10/site-packages/fastembed/text/onnx_text_model.py", line 46, in load_onnx_model
    super().load_onnx_model(
  File "/root/miniconda3/envs/FastEmbed/lib/python3.10/site-packages/fastembed/common/onnx_model.py", line 84, in load_onnx_model
    self.model = ort.InferenceSession(
  File "/root/miniconda3/envs/FastEmbed/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 432, in __init__
    raise fallback_error from e
  File "/root/miniconda3/envs/FastEmbed/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 427, in __init__
    self._create_inference_session(self._fallback_providers, None)
  File "/root/miniconda3/envs/FastEmbed/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 483, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
RuntimeError: /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:123 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, common::Status> = void] /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:116 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, common::Status> = void] CUDA failure 100: no CUDA-capable device is detected ; GPU=32554 ; hostname=yingke ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_execution_provider.cc ; line=280 ; expr=cudaSetDevice(info_.device_id);

If the program does not use Ray, the model loads on the GPU without any problem. I hope you can provide a GPU version for use. Looking forward to your reply. Thank you very much!
