

License: GNU General Public License v3.0
Languages: Python 86.54%, Shell 13.46%
Topics: chatbot, docker-image, gpt-4, llava, multimodal, runpod, runpod-worker, vision-language-model


LLaVA: Large Language and Vision Assistant | RunPod Serverless Worker

This is the source code for a RunPod Serverless worker for LLaVA: Large Language and Vision Assistant.


Model

LLaVA-v1.6

| Model | Environment Variable Value | Version | LLM | Default |
|---|---|---|---|---|
| llava-v1.6-vicuna-7b | liuhaotian/llava-v1.6-vicuna-7b | LLaVA-1.6 | Vicuna-7B | no |
| llava-v1.6-vicuna-13b | liuhaotian/llava-v1.6-vicuna-13b | LLaVA-1.6 | Vicuna-13B | no |
| llava-v1.6-mistral-7b | liuhaotian/llava-v1.6-mistral-7b | LLaVA-1.6 | Mistral-7B | yes |
| llava-v1.6-34b | liuhaotian/llava-v1.6-34b | LLaVA-1.6 | Hermes-Yi-34B | no |

LLaVA-v1.5

| Model | Environment Variable Value | Version | Size | Default |
|---|---|---|---|---|
| llava-v1.5-7b | liuhaotian/llava-v1.5-7b | LLaVA-1.5 | 7B | no |
| llava-v1.5-13b | liuhaotian/llava-v1.5-13b | LLaVA-1.5 | 13B | no |
| BakLLaVA-1 | SkunkworksAI/BakLLaVA-1 | LLaVA-1.5 | 7B | no |
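
The Environment Variable Value column is the value used to select which checkpoint the worker loads. As a minimal sketch only (the variable name MODEL is an assumption; check the repository's build instructions for the actual name), the worker might read it like this, defaulting to the model marked as the default in the tables above:

```python
import os

# Assumed environment variable name; the fallback mirrors the model marked
# as the default in the tables above.
MODEL_ID = os.getenv("MODEL", "liuhaotian/llava-v1.6-mistral-7b")
print(f"Loading LLaVA checkpoint: {MODEL_ID}")
```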

Testing

  1. Local Testing
  2. RunPod Testing

Building the Docker image that will be used by the Serverless Worker

There are two options:

  1. Network Volume
  2. Standalone (without Network Volume)

RunPod API Endpoint

You can send requests to your RunPod API Endpoint using the /run or /runsync endpoints.

Requests sent to the /run endpoint are handled asynchronously and are non-blocking operations. The first response status will always be IN_QUEUE. You then need to poll the /status endpoint for further status updates, and the COMPLETED status will eventually be returned if your request is successful.

Requests sent to the /runsync endpoint are handled synchronously and are blocking operations. If a worker processes the request within 90 seconds, the result is returned directly in the response; if processing takes longer than 90 seconds, you need to poll the /status endpoint for status updates until you receive the COMPLETED status, which indicates that your request was successful.
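
For illustration only (this is a minimal sketch, not taken from the repository's own examples), a synchronous request to the /runsync endpoint could be sent like this; the endpoint ID, API key, and input fields are placeholders, and the exact input schema is defined by the worker's handler:

```python
import requests

ENDPOINT_ID = "your-endpoint-id"   # placeholder
API_KEY = "your-runpod-api-key"    # placeholder

# Hypothetical input payload; consult the repository's docs for the real schema.
payload = {
    "input": {
        "image": "https://example.com/image.jpg",
        "prompt": "Describe this image."
    }
}

response = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=120,
)
print(response.json())
```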

RunPod API Examples

Endpoint Status Codes

| Status | Description |
|---|---|
| IN_QUEUE | Request is in the queue waiting to be picked up by a worker. You can call the /status endpoint to check for status updates. |
| IN_PROGRESS | Request is currently being processed by a worker. You can call the /status endpoint to check for status updates. |
| FAILED | The request failed, most likely due to encountering an error. |
| CANCELLED | The request was cancelled. This usually happens when you call the /cancel endpoint to cancel the request. |
| TIMED_OUT | The request timed out. This usually happens when your handler throws some kind of exception that does not return a valid response. |
| COMPLETED | The request completed successfully and the output is available in the output field of the response. |
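
For the asynchronous /run flow, a minimal polling loop (again a sketch, with placeholder endpoint ID, API key, and payload) could handle the statuses listed above as follows:

```python
import time
import requests

ENDPOINT_ID = "your-endpoint-id"   # placeholder
API_KEY = "your-runpod-api-key"    # placeholder
HEADERS = {"Authorization": f"Bearer {API_KEY}"}
BASE_URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"

# Hypothetical payload; the exact input schema is defined by the worker's handler.
payload = {"input": {"image": "https://example.com/image.jpg", "prompt": "Describe this image."}}

# Submit the job asynchronously; the first response status is expected to be IN_QUEUE.
job = requests.post(f"{BASE_URL}/run", headers=HEADERS, json=payload, timeout=30).json()
job_id = job["id"]

# Poll /status until a terminal status is reached.
while True:
    status = requests.get(f"{BASE_URL}/status/{job_id}", headers=HEADERS, timeout=30).json()
    if status["status"] == "COMPLETED":
        print(status["output"])
        break
    if status["status"] in ("FAILED", "CANCELLED", "TIMED_OUT"):
        raise RuntimeError(f"Job ended with status {status['status']}: {status}")
    time.sleep(2)  # still IN_QUEUE or IN_PROGRESS
```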

Serverless Handler

The serverless handler (rp_handler.py) is a Python script that handles the API requests to your Endpoint using the runpod Python library. It defines a function handler(event) that takes an API request (event), runs the inference using LLaVA with the input, and returns the output in the JSON response.
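
The real handler lives in rp_handler.py in this repository; the skeleton below is only a structural sketch with the LLaVA inference call stubbed out, showing how the runpod library wires a handler(event) function into the worker:

```python
import runpod


def run_llava_inference(job_input):
    # Stub standing in for the actual LLaVA inference performed in rp_handler.py.
    return {"response": f"(stub) prompt was: {job_input.get('prompt')}"}


def handler(event):
    # The request payload arrives under the "input" key of the event.
    job_input = event["input"]
    # Run inference and return the result; it is placed in the "output" field
    # of the API response.
    return run_llava_inference(job_input)


if __name__ == "__main__":
    # Start the RunPod serverless worker loop with this handler.
    runpod.serverless.start({"handler": handler})
```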

Acknowledgements

Additional Resources

Community and Contributing

Pull requests and issues on GitHub are welcome. Bug fixes and new features are encouraged.

Appreciate my work?

Buy Me A Coffee

runpod-worker-llava's People

Contributors: ashleykleynhans

runpod-worker-llava's Issues

Multiple Parallel Request Handling

Hi Ashley,

Thank you for the great repo and making deployment on runpod serverless a breeze. It looks like the server can only handle 1 request at a time. However, my GPU utilization (2x A6000) is at about 50% so I should be able to handle 2-3 requests at the same time for any given worker. Is there a way to enable this or is it strictly 1 at a time? I know that I can increase the number of workers but it would be great to saturate each of the workers before I add another one.

Also, are you using SGLang for this server or are you wrapping it some other way?

Thank you again for the generous contribution to the open-source LLM community.

Cheers,
Daniel
