bpevangelista / vfastml

Inference and Training Engine for LLMs, Image2Image and Other Models

License: Apache License 2.0

gpt inference llm machine-learning mistral phi transformers


VFastML

Architecture:

(architecture diagram screenshot)

Running Prebuilt Apps

pip install -r requirements.txt
python -m prebuilt_apps.openai.api_server   # run in terminal 1
python -m prebuilt_apps.openai.model_server # run in terminal 2
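Once both servers are up, the OpenAI-style endpoint can be exercised with a short client. The endpoint path and payload shape below mirror the chat-completions handler in this README; the host and port are assumptions (the README doesn't state where the api_server listens):

```python
import json
import urllib.request

def build_chat_payload(model: str, messages: list) -> bytes:
    """Serialize an OpenAI-style chat-completions payload."""
    return json.dumps({'model': model, 'messages': messages}).encode('utf-8')

def chat(base_url: str, model: str, messages: list) -> dict:
    """POST to the chat endpoint and decode the JSON reply."""
    req = urllib.request.Request(
        f'{base_url}/v1/chat/completions',
        data=build_chat_payload(model, messages),
        headers={'Content-Type': 'application/json'})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# e.g. chat('http://localhost:8000', 'mistral',
#           [{'role': 'user', 'content': 'Hello!'}])
```

Swap `localhost:8000` for wherever your api_server actually binds.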

Writing Your Own Apps

# Model Server
from vfastml.models_servers.model_server_text_hf import TextGenerationModelServerHF

model_server = TextGenerationModelServerHF(
    model_type='text_generation',
    model_uri='mistralai/Mistral-7B-v0.1',
    model_device='cuda:0',
    model_forward_kwargs={
        'top_p': 0.9,
    },
    log_level='debug',
)
model_server.run_forever()
# API Server
from vfastml.engine.dispatch_requests import TextGenerationReq
from vfastml.entrypoints.api_server import FastMLServer as ml_server
# ChatCompletionsRequest and api_utils are used below but not imported
# in this snippet; their import paths are omitted here.

@ml_server.app.post(path='/v1/chat/completions')
async def chat_completions(request: ChatCompletionsRequest):
    request_id = api_utils.gen_request_id('completions')
    
    selected_model = {
        'mistral': 'mistralai/Mistral-7B-v0.1',
        'phi2': 'microsoft/phi-2',
    }.get(request.model, 'mistralai/Mistral-7B-v0.1')

    dispatch_task = ml_server.dispatch_engine.dispatch(
        TextGenerationReq(
            request_id=request_id,
            model_uri=selected_model,
            model_adapter_uri=None,
            messages=request.messages))
    
    task_result = await dispatch_task.get_result()
    return api_utils.build_json_response(request_id, task_result)
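The handler above leans on two api_utils helpers it doesn't define. A minimal sketch of what they might look like (the names come from the snippet; the bodies and the response shape are assumptions, not vfastml's actual implementation):

```python
import time
import uuid

def gen_request_id(prefix: str) -> str:
    """Build a unique request id like 'completions-<hex>' (format assumed)."""
    return f'{prefix}-{uuid.uuid4().hex}'

def build_json_response(request_id: str, task_result: dict) -> dict:
    """Wrap a model result in an OpenAI-style envelope (shape assumed)."""
    return {
        'id': request_id,
        'object': 'chat.completion',
        'created': int(time.time()),
        **task_result,
    }
```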

Why?

  • Easy serving and training of diverse ML pipelines
  • Horizontal Scalability and Stability with k8s (deploy and forget)

Next Steps?

No code! Just YAML → inference-server Docker image and Kubernetes deployments.

apis:
  - path: /v1/chat/completions
    inputs:
      - messages: list[dict] | str
      - model: str
        optional: true
    outputs:
      - choices: list[dict]
servers:
  - model:
      type: text_generation
      uri: mistralai/Mistral-7B-v0.1
      device: cuda:0
      generation_params:
        top_p: 0.9
    resources:
      cpus: 2
      memory: 16GB
      gpus: 1
      gpu_memory: 24GB
      gpu_cuda_capability: 9.0
    rpc_port: 6500

TODOs:

X-Large:

  • Support continuous batching (per next-token iteration)
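Continuous batching means re-forming the batch at every decode step: finished sequences free their slot immediately and queued requests join mid-flight, instead of everyone waiting for the slowest sequence. A toy scheduler loop illustrating the idea (all names hypothetical; a real engine does this around the model's forward pass):

```python
from collections import deque

def continuous_batching(requests, max_batch, step_fn):
    """Toy per-token scheduler: admit and evict at every iteration.

    requests: list of (request_id, tokens_to_generate)
    step_fn:  called once per (request_id, remaining) decode step
    """
    queue = deque(requests)
    running = {}   # request_id -> remaining tokens
    finished = []
    while queue or running:
        # Admit new requests into free slots (per-iteration, not per-batch).
        while queue and len(running) < max_batch:
            rid, n = queue.popleft()
            running[rid] = n
        # One decode step for every running request.
        for rid in list(running):
            step_fn(rid, running[rid])
            running[rid] -= 1
            if running[rid] == 0:   # done: slot frees up immediately
                del running[rid]
                finished.append(rid)
    return finished
```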

Core:

  • Wait for model_server readiness before exposing the REST API
  • Support an api_server → models router (for multi-model support)

Models

  • Support heterogeneous sampling on same TextGen batch (e.g. beam and sample)
  • Refactor wdnet and move it to models
  • Add whisper model support
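For the heterogeneous-sampling TODO above: one simple approach is to split an incoming batch into per-strategy sub-batches (beam vs. sampled) and run one forward call per group. A sketch of just the grouping step (names hypothetical):

```python
from collections import defaultdict

def group_by_strategy(requests):
    """Split a batch into per-strategy sub-batches.

    requests: list of dicts with a 'sampling' key, e.g.
              {'id': 1, 'sampling': 'beam'} or {'id': 2, 'sampling': 'top_p'}
    Returns {strategy: [requests...]}, preserving arrival order per group.
    """
    groups = defaultdict(list)
    for req in requests:
        groups[req['sampling']].append(req)
    return dict(groups)
```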

Performance:

  • Profile and calculate optimal batch size on start
  • Implement benchmark for TGI and HF
    • Actually count generated tokens (don't trust that it respects forward_params)
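On counting generated tokens: decoder-only HF models echo the prompt in the output ids, so the generated portion is the output minus the prompt prefix. Counting it directly avoids trusting that the model respected forward_params. A minimal sketch over plain id lists (assumes the prompt-echo convention):

```python
def count_generated_tokens(input_ids, output_ids):
    """Count only the completion tokens in a prompt+completion sequence.

    Assumes output_ids starts with input_ids (decoder-only convention);
    raises if it doesn't, rather than silently miscounting.
    """
    if output_ids[:len(input_ids)] != input_ids:
        raise ValueError('output does not start with the prompt ids')
    return len(output_ids) - len(input_ids)
```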

Stability & QoL:

  • Add and improve input validation
  • Refactor TextGeneration classes (internals and openai are a bit mixed)
  • Expose classes through main package (avoid random imports)

Frequent Issues:

Docker GPU build / run Issues?

  • sudo apt-get install -y nvidia-container-toolkit
  • docker run --gpus all -e HF_ACCESS_TOKEN=TOKEN vfastml.apps.openai.model:v1

Torch profiler not working on WSL2?

  • NVIDIA Control Panel → Developer → Manage GPU Performance Counters → Allow access to all users
  • Windows Settings → System → For Developers → Developer Mode ON

Incorrect CUDA version? (We are on cu118)

How to create an SSH key pair?

  • ssh-keygen -t ed25519 -C [email protected] -f ~/.ssh/any_ed25519
    • eval $(ssh-agent -s) && ssh-add ~/.ssh/any_ed25519
    • add to ~/.ssh/config: "IdentityFile ~/.ssh/any_ed25519"

Performance Notes:

LLM Engines That Beat vLLM:

What is VFastML missing for a 5x speed-up?

What I've tested:

  • Flash Attention: ~2x speed-up over SDPA on RTX 3090 24GB
