Coder Social home page Coder Social logo

joontju / nvidai-llm-rag-windows Goto Github PK

View Code? Open in Web Editor NEW

This project forked from nvidia/chatrtx

0.0 0.0 0.0 1.36 MB

A developer reference project for creating Retrieval Augmented Generation (RAG) chatbots on Windows using TensorRT-LLM

License: Other

Python 100.00%

nvidai-llm-rag-windows's Introduction

๐Ÿš€ RAG on Windows using TensorRT-LLM and LlamaIndex ๐Ÿฆ™

This repository showcases a Retrieval-augmented Generation (RAG) pipeline implemented using the llama_index library for Windows. The pipeline incorporates the LLaMa 2 13B model, TensorRT-LLM, and the FAISS vector search library. For demonstration, the dataset consists of thirty recent articles sourced from NVIDIA Geforce News.

What is RAG? ๐Ÿ”

Retrieval-augmented generation (RAG) for large language models (LLMs) seeks to enhance prediction accuracy by leveraging an external datastore during inference. This approach constructs a comprehensive prompt enriched with context, historical data, and recent or relevant knowledge.

Getting Started

Ensure you have the pre-requisites in place:

  1. Install TensorRT-LLM for Windows using the instructions here.

  2. Ensure you have access to the Llama 2 repository on Huggingface

  3. In this project, the LLaMa 2 13B AWQ 4bit quantized model is employed for inference. Before using it, you'll need to compile a TensorRT Engine specific to your GPU. If you're using the GeForce RTX 4090 (TensorRT 9.1.0.4 and TensorRT-LLM release 0.5.0), the compiled TRT Engine is available for download here. For other NVIDIA GPUs or TensorRT versions, please refer to the instructions.

Setup Steps

  1. Clone this repository:
git clone https://github.com/NVIDIA/trt-llm-rag-windows.git
  1. Place the TensorRT engine for LLaMa 2 13B model in the model/ directory
  • For GeForce RTX 4090 users: Download the pre-built TRT engine here and place it in the model/ directory.
  • For other NVIDIA GPU users: Build the TRT engine by following the instructions provided here.
  1. Acquire the llama tokenizer here.
  2. Download AWQ weights for building the TensorRT engine model.pt here. (For RTX 4090, use the pregenerated engine provided earlier.)
  3. Install the necessary libraries:
pip install -r requirements.txt
  1. Launch the application using the following command:
python app.py --trt_engine_path <TRT Engine folder> --trt_engine_name <TRT Engine file>.engine --tokenizer_dir_path <tokernizer folder> --data_dir <Data folder>

In our case, that will be:

python app.py --trt_engine_path model/ --trt_engine_name llama_float16_tp1_rank0.engine --tokenizer_dir_path model/ --data_dir dataset/

Note: On its first run, this example will persist/cache the data folder in vector library. Any modifications in the data folder won't take effect until the "storage-default" cache directory is removed from the application directory.

Detailed Command References

python app.py --trt_engine_path <TRT Engine folder> --trt_engine_name <TRT Engine file>.engine --tokenizer_dir_path <tokernizer folder> --data_dir <Data folder>

Arguments

Name Details
--trt_engine_path <> Directory of TensorRT engine
--trt_engine_name <> Engine file name (e.g. llama_float16_tp1_rank0.engine)
--tokenizer_dir_path <> HF downloaded model files for tokenizer e.g. https://huggingface.co/meta-llama/Llama-2-13b-chat-hf
--data_dir <> Directory with context data (pdf, txt etc.) e.g. ".\dataset"

Building TRT Engine

For RTX 4090 (TensorRT 9.1.0.4 & TensorRT-LLM 0.5.0), a prebuilt TRT engine is provided. For other RTX GPUs or TensorRT versions, follow these steps to build your TRT engine:

Download LLaMa 2 13B chat model from https://huggingface.co/meta-llama/Llama-2-13b-chat-hf

Download LLaMa 2 13B AWQ int4 checkpoints model.pt from here

Clone the TensorRT LLM repository:

git clone https://github.com/NVIDIA/TensorRT-LLM.git

Navigate to the examples\llama directory and run the following script:

python build.py --model_dir <path to llama13_chat model> --quant_ckpt_path <path to model.pt> --dtype float16 --use_gpt_attention_plugin float16 --use_gemm_plugin float16 --use_weight_only --weight_only_precision int4_awq --per_group --enable_context_fmha --max_batch_size 1 --max_input_len 3000 --max_output_len 1024 --output_dir <TRT engine folder>

Adding your own data

  • This app loads data from the dataset/ directory into the vector store. To add support for your own data, replace the files in the dataset/ directory with your own data. By default, the script uses llamaindex's SimpleDirectoryLoader which supports text files in several platforms such as .txt, PDF, and so on.

This project requires additional third-party open source software projects as specified in the documentation. Review the license terms of these open source projects before use.

nvidai-llm-rag-windows's People

Contributors

kedarpotdar-nv avatar blsharda avatar anujj avatar shishirgoyal85 avatar akaanupkumar avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.