
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs

Home Page: https://loraexchange.ai

License: Apache License 2.0

Dockerfile 0.43% Makefile 0.17% Rust 13.89% Shell 0.37% Python 58.11% JavaScript 0.10% Cuda 17.24% C++ 9.51% C 0.17% Smarty 0.01%
fine-tuning gpt llama llm llm-inference llm-serving llmops lora model-serving pytorch

lorax's Introduction


LoRAX: Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs


LoRAX (LoRA eXchange) is a framework that allows users to serve thousands of fine-tuned models on a single GPU, dramatically reducing the cost of serving without compromising on throughput or latency.


🌳 Features

  • 🚅 Dynamic Adapter Loading: include any fine-tuned LoRA adapter from HuggingFace, Predibase, or any filesystem in your request; it will be loaded just-in-time without blocking concurrent requests. Merge adapters per request to instantly create powerful ensembles.
  • 🏋️‍♀️ Heterogeneous Continuous Batching: packs requests for different adapters together into the same batch, keeping latency and throughput nearly constant with the number of concurrent adapters.
  • 🧁 Adapter Exchange Scheduling: asynchronously prefetches and offloads adapters between GPU and CPU memory, and schedules request batching to optimize the aggregate throughput of the system.
  • 👬 Optimized Inference: high-throughput and low-latency optimizations including tensor parallelism, pre-compiled CUDA kernels (flash-attention, paged attention, SGMV), quantization, and token streaming.
  • 🚢 Ready for Production: prebuilt Docker images, Helm charts for Kubernetes, Prometheus metrics, and distributed tracing with Open Telemetry. OpenAI compatible API supporting multi-turn chat conversations. Private adapters through per-request tenant isolation. Structured Output (JSON mode).
  • 🤯 Free for Commercial Use: Apache 2.0 License. Enough said 😎.

🏠 Models

Serving a fine-tuned model with LoRAX consists of two components:

  • Base Model: pretrained large model shared across all adapters.
  • Adapter: task-specific adapter weights dynamically loaded per request.

LoRAX supports a number of Large Language Models as the base model including Llama (including CodeLlama), Mistral (including Zephyr), and Qwen. See Supported Architectures for a complete list of supported base models.

Base models can be loaded in fp16 or quantized with bitsandbytes, GPT-Q, or AWQ.

Supported adapters include LoRA adapters trained using the PEFT and Ludwig libraries. Any of the linear layers in the model can be adapted via LoRA and loaded in LoRAX.
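As an illustration, here is a minimal sketch of training such an adapter with PEFT; the base model, hyperparameters, target modules, and output path are placeholder choices for the example, not LoRAX requirements.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder base model; any LoRAX-supported architecture works the same way.
base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

# Adapt a subset of the model's linear layers (module names follow the Llama/Mistral convention).
config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)

# ... train with your preferred trainer, then save the adapter:
model.save_pretrained("./my-adapter")  # writes adapter_config.json and the adapter weights

The resulting directory, or a HuggingFace repo containing it, can then be referenced as the adapter_id in a request.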

πŸƒβ€β™‚οΈ Getting Started

We recommend starting with our pre-built Docker image to avoid compiling custom CUDA kernels and other dependencies.

Requirements

The minimum system requirements needed to run LoRAX include:

  • Nvidia GPU (Ampere generation or above)
  • CUDA 11.8 compatible device drivers and above
  • Linux OS
  • Docker (for this guide)

Launch LoRAX Server

model=mistralai/Mistral-7B-Instruct-v0.1
volume=$PWD/data

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/predibase/lorax:latest --model-id $model

For a full tutorial including token streaming and the Python client, see Getting Started - Docker.

Prompt via REST API

Prompt base LLM:

curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{
        "inputs": "[INST] Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? [/INST]",
        "parameters": {
            "max_new_tokens": 64
        }
    }' \
    -H 'Content-Type: application/json'

Prompt a LoRA adapter:

curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{
        "inputs": "[INST] Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? [/INST]",
        "parameters": {
            "max_new_tokens": 64,
            "adapter_id": "vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k"
        }
    }' \
    -H 'Content-Type: application/json'

See Reference - REST API for full details.
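The same endpoint can also be called from plain Python without the LoRAX client; here is a minimal sketch using the requests library with the same payload as the curl examples above.

import requests

resp = requests.post(
    "http://127.0.0.1:8080/generate",
    headers={"Content-Type": "application/json"},
    json={
        "inputs": "[INST] Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? [/INST]",
        "parameters": {
            "max_new_tokens": 64,
            "adapter_id": "vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k",
        },
    },
)
print(resp.json()["generated_text"])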

Prompt via Python Client

Install:

pip install lorax-client

Run:

from lorax import Client

client = Client("http://127.0.0.1:8080")

# Prompt the base LLM
prompt = "[INST] Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? [/INST]"
print(client.generate(prompt, max_new_tokens=64).generated_text)

# Prompt a LoRA adapter
adapter_id = "vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k"
print(client.generate(prompt, max_new_tokens=64, adapter_id=adapter_id).generated_text)

See Reference - Python Client for full details.
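Token streaming works from the same client; a minimal sketch, assuming the client exposes a generate_stream method analogous to the text-generation-inference client it is forked from:

from lorax import Client

client = Client("http://127.0.0.1:8080")
prompt = "[INST] What is deep learning? [/INST]"

# Stream tokens as they are generated (generate_stream is assumed from the TGI-style client API).
text = ""
for response in client.generate_stream(prompt, max_new_tokens=64):
    if not response.token.special:
        text += response.token.text
print(text)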

For other ways to run LoRAX, see Getting Started - Kubernetes, Getting Started - SkyPilot, and Getting Started - Local.

Chat via OpenAI API

LoRAX supports multi-turn chat conversations combined with dynamic adapter loading through an OpenAI compatible API. Just specify any adapter as the model parameter.

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://127.0.0.1:8080/v1",
)

resp = client.chat.completions.create(
    model="alignment-handbook/zephyr-7b-dpo-lora",
    messages=[
        {
            "role": "system",
            "content": "You are a friendly chatbot who always responds in the style of a pirate",
        },
        {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
    ],
    max_tokens=100,
)
print("Response:", resp.choices[0].message.content)

See OpenAI Compatible API for details.

Next steps

There are many other interesting Mistral-7B fine-tuned models to try out. You can find more LoRA adapters on the HuggingFace Hub, or try fine-tuning your own with PEFT or Ludwig.

🙇 Acknowledgements

LoRAX is built on top of HuggingFace's text-generation-inference, forked from v0.9.4 (Apache 2.0).

We'd also like to acknowledge Punica for their work on the SGMV kernel, which is used to speed up multi-adapter inference under heavy load.

πŸ—ΊοΈ Roadmap

Our roadmap is tracked here.

lorax's People

Contributors

abidwael, arnavgarg1, atry, claudiomontanari, flozi00, gary149, geoffreyangus, girinman, gsaivinay, huytuong010101, infernaught, jeffreyftang, jts22, lewtun, llama-shepard, magdyksaleh, michaelfeil, narsil, njhill, noyoshi, olivierdehaene, regisss, rkimball, ssmi153, tgaddair, thincal, thomasw21, xyang16, yard1, yk


lorax's Issues

Surface more informative error when adapter has NaN weights

Feature request

When querying a base model with an adapter that has NaN or Inf weight tensors, LoRAX returns the following error:

The output tensors do not match for key base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight

It would be more helpful if the error message indicated that the reason the tensors don't match during the merge is that LoRAX detected NaN/Inf tensors in the adapter weights.

Motivation

This would give users who fine-tuned models and are testing them out a rectifiable, actionable path: it makes clear that this isn't an issue with LoRAX, but rather an issue with their trained adapter weights.

Your contribution

Happy to help surface a better error message! It seems like the issue is raised from this line in particular:

if not torch.equal(pt_tensor, sf_tensor):
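For example, a minimal sketch (not the actual LoRAX code; the helper name and messages are illustrative) of how that check could surface NaN/Inf weights explicitly:

import torch

def check_tensors_match(key: str, pt_tensor: torch.Tensor, sf_tensor: torch.Tensor) -> None:
    # Raise a targeted error when the mismatch is caused by bad adapter weights.
    if not torch.equal(pt_tensor, sf_tensor):
        if torch.isnan(pt_tensor).any() or torch.isinf(pt_tensor).any():
            raise ValueError(
                f"Adapter weight '{key}' contains NaN/Inf values; "
                "this points to a problem with the trained adapter, not LoRAX."
            )
        raise ValueError(f"The output tensors do not match for key {key}")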

Extend SGMV kernel to support ranks < 8

Currently, the SGMV kernel will fail if the rank is < 8; this is also an issue with tensor parallelism for ranks > 8. We should extend the kernel to support these cases:

  • rank 2
  • rank 4

Sliding block window error when running Mixtral 8x7B

System Info

Lorax version: 0.4.1
Lorax_launcher: 0.1.0
Model: mistralai/Mixtral-8x7B-Instruct-v0.1
GPUS: 3090 (24 gb) P100 (16 gb)

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

model=mistralai/Mixtral-8x7B-Instruct-v0.1
volume=$PWD/data

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/predibase/lorax:latest --model-id $model --quantize bitsandbytes-nf4 --trust-remote-code

Upon executing this code I receive the following traceback:

Traceback (most recent call last):
File "/opt/conda/bin/lorax-server", line 8, in
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in call
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in call
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 84, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 271, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
handle._run()
File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)

File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 223, in serve_inner
model = get_model(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/init.py", line 305, in get_model
return FlashMixtral(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_mixtral.py", line 346, in init
SLIDING_WINDOW_BLOCKS = math.ceil(config.sliding_window / BLOCK_SIZE)
TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'

2023-12-19T02:06:51.578108Z ERROR shard-manager: lorax_launcher: Shard complete standard error output:

Traceback (most recent call last):

File "/opt/conda/bin/lorax-server", line 8, in
sys.exit(app())

File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 84, in serve
server.serve(

File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 271, in serve
asyncio.run(

File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)

File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()

File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 223, in serve_inner
model = get_model(

File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/init.py", line 305, in get_model
return FlashMixtral(

File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_mixtral.py", line 346, in init
SLIDING_WINDOW_BLOCKS = math.ceil(config.sliding_window / BLOCK_SIZE)

TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'

It appears that for some reason config.sliding_window is set to None, which doesn't make sense because there is an if statement that forces it to equal config.max_position_embeddings a few lines above.

I will be out of town, so I do not have a chance to build the server locally, but I can look at it when I get back.

Expected behavior

The mixtral model runs without issue.

Add nf4 support for model quantization

Feature request

There is existing code in LoRAX towards 8bit bitsandbytes quantization:

class Linear8bitLt(nn.Module):
    def __init__(
        self,
        weight,
        bias,
        has_fp16_weights=True,
        memory_efficient_backward=False,
        threshold=0.0,
        index=None,
    ):

Supporting 4bit bitsandbytes quantization would enable us to serve models trained in 4bit.

The change should involve following the patterns implemented for 8bit quantization.
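A minimal sketch of the 4-bit counterpart, assuming bitsandbytes exposes Linear4bit/Params4bit with an "nf4" quant type; this mirrors the 8-bit wrapper pattern and is not the actual LoRAX implementation.

import torch
import bitsandbytes as bnb

def to_nf4(linear: torch.nn.Linear) -> bnb.nn.Linear4bit:
    # Replace an fp16 linear layer with a 4-bit NF4 quantized equivalent.
    qlinear = bnb.nn.Linear4bit(
        linear.in_features,
        linear.out_features,
        bias=linear.bias is not None,
        compute_dtype=torch.float16,
        quant_type="nf4",
    )
    qlinear.weight = bnb.nn.Params4bit(
        linear.weight.data, requires_grad=False, quant_type="nf4"
    )
    if linear.bias is not None:
        qlinear.bias = torch.nn.Parameter(linear.bias.data)
    return qlinear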

Motivation

No response

Your contribution

No response

Second GPU is not found when running --sharded true

System Info

Lorax version: 0.4.1
Lorax_launcher: 0.1.0
Model: mistralai/Mixtral-8x7B-Instruct-v0.1
GPUS: 3090 (24 gb) 3060 (12 gb)

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

model=mistralai/Mixtral-8x7B-Instruct-v0.1
volume=$PWD/data

sudo docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/predibase/lorax:latest --model-id $model --trust-remote-code --quantize bitsandbytes-nf4 --max-batch-prefill-tokens 2048 --sharded true

Error Message:
2023-12-24T07:02:10.759386Z INFO lorax_launcher: Parsing num_shard from CUDA_VISIBLE_DEVICES/NVIDIA_VISIBLE_DEVICES
Error: NotEnoughCUDADevices("sharded is true but only found 1 CUDA devices")

Expected behavior

The expected behavior is for LoRAX to find both GPUs. For reference, here is the output of nvidia-smi:

'''
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3060 On | 00000000:01:00.0 Off | N/A |
| 0% 49C P8 15W / 170W | 9MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce RTX 3090 On | 00000000:06:00.0 Off | N/A |
| 0% 51C P8 18W / 350W | 12MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 2249 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 2249 G /usr/lib/xorg/Xorg 4MiB |
+---------------------------------------------------------------------------------------+
'''

I checked the documentation and it said that --sharded true is the default setting of the server; however, when I do not pass --sharded true, I get an out-of-memory error and need to use a much smaller --max-batch-prefill-tokens (1024 to be exact). When I print nvidia-smi I get the following output:

'''
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3060 On | 00000000:01:00.0 Off | N/A |
| 0% 44C P8 15W / 170W | 12MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce RTX 3090 On | 00000000:06:00.0 Off | N/A |
| 81% 57C P2 114W / 350W | 23873MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 2249 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 2249 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 7439 C /opt/conda/bin/python3.10 23856MiB |
+---------------------------------------------------------------------------------------+
'''

It appears as if the server cannot find the 3060. I swapped the 3060 for one of my other GPUs (a Tesla P100 16GB), yet I still received the same error.

How to use --master-addr <MASTER_ADDR>|--master-port <MASTER_PORT>?

System Info

Hi there, thank you for your excellent project.
I see that you have --master-addr <MASTER_ADDR> | --master-port <MASTER_PORT> parameters when running the server.
Do you have any guide on using torch distributed in your project? It would be really helpful if I need to run on multiple machines.
Thank you,

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

model=mistralai/Mistral-7B-Instruct-v0.1
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/predibase/lorax:latest --model-id $model

Expected behavior

More detail about torch distributed

Add Helm charts

Feature request

It should be possible to easily deploy LoRAX on Kubernetes via Helm. We only really need a Deployment and Service resource for now.

Motivation

No response

Your contribution

No response

Does lorax currently support GPT2 finetuned adapters?

System Info

lorax:latest

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

@tgaddair I have a few adapters fine-tuned using GPT2 as the base model.

Architecture of GPT2:

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

The adapters are fine-tuned on the "c_attn" and "c_proj" layers; does LoRAX currently support this?

Expected behavior

Question about compatibility.

Lorax Hanging in production

System Info

ghcr.io/predibase/lorax:latest
Running within Kubernetes on H100

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

When putting the instance in production, while it receives simultaneous requests for different adapters, it will just hang there.
/generate and /health will stop answering
but /info and /docs will continue to be available.

There's no error getting displayed in the logs

Not sure what's the best way to diagnose the issue, but it looks to me like it's having some issues fetching multiple adapters in parallel and processing requests queued at the same time?

Expected behavior

Should handle live requests for multiple adapters

[Feature Request]: Add support for bloom model adapters

Model description

I am trying to run the bloom-7b1 model using Docker locally [both my model and adapters]. Here is the script for running bloom on LoRAX:

#!/bin/bash
# PATH to model 
model="bloom-7b1"
# VOLUME to share: pwd/../models -> /data
volume=$PWD/../models:/data # share a volume with the Docker container to avoid downloading weights every run
echo $volume

docker run --gpus 0 --shm-size 1g -p 7070:80 \
    --volume $volume \
    ghcr.io/predibase/lorax:latest \
    --model-id /data/$model \
    --num-shard 1 \
    --quantize bitsandbytes-nf4 \
    --max-concurrent-requests 256 \
    --cuda-memory-fraction 0.5 \

When I pass the adapter located at peft-models/bloom-alpaca-ne, the request fails (screenshots not reproduced here).

Open source status

  • The model implementation is available
  • The model weights are available

Provide useful links for the implementation

No response

Dockerfile build failed on 5e6215b4cdfbdce345806e7b504f36948abee126 (main today)

System Info

Hello, this is the end of the Dockerfile build log (I am in EC2 in the instance I was using for inference)

[19/49] /opt/conda/bin/nvcc -I/usr/src/flash-attention-v2/csrc/flash_attn -I/usr/src/flash-attention-v2/csrc/flash_attn/src -I/usr/src/flash-attention-v2/csrc/cutlass/include -I/opt/conda/lib/python3.9/site-packages/torch/include -I/opt/conda/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.9/site-packages/torch/include/TH -I/opt/conda/lib/python3.9/site-packages/torch/include/THC -I/opt/conda/include -I/opt/conda/include/python3.9 -c -c /usr/src/flash-attention-v2/csrc/flash_attn/src/flash_bwd_hdim64_bf16_sm80.cu -o /usr/src/flash-attention-v2/build/temp.linux-x86_64-cpython-39/csrc/flash_attn/src/flash_bwd_hdim64_bf16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 4 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
/opt/conda/lib/python3.9/site-packages/torch/include/c10/util/irange.h(54): warning #186-D: pointless comparison of unsigned integer with zero
detected during:
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator==(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=size_t, one_sided=false, =0]"
(61): here
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator!=(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=size_t, one_sided=false, =0]"
/opt/conda/lib/python3.9/site-packages/torch/include/c10/core/TensorImpl.h(77): here

/opt/conda/lib/python3.9/site-packages/torch/include/c10/util/irange.h(54): warning #186-D: pointless comparison of unsigned integer with zero
detected during:
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator==(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=size_t, one_sided=true, =0]"
(61): here
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator!=(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=size_t, one_sided=true, =0]"
/opt/conda/lib/python3.9/site-packages/torch/include/c10/core/TensorImpl.h(2327): here
instantiation of "__nv_bool c10::TensorImpl::SetDimsTemplate(c10::ArrayRef) [with T=int64_t, =void]"
/opt/conda/lib/python3.9/site-packages/torch/include/c10/core/TensorImpl.h(2337): here

/opt/conda/lib/python3.9/site-packages/torch/include/c10/util/irange.h(54): warning #186-D: pointless comparison of unsigned integer with zero
detected during:
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator==(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=size_t, one_sided=false, =0]"
(61): here
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator!=(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=size_t, one_sided=false, =0]"
/opt/conda/lib/python3.9/site-packages/torch/include/c10/core/TensorImpl.h(77): here

/opt/conda/lib/python3.9/site-packages/torch/include/c10/util/irange.h(54): warning #186-D: pointless comparison of unsigned integer with zero
detected during:
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator==(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=size_t, one_sided=true, =0]"
(61): here
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator!=(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=size_t, one_sided=true, =0]"
/opt/conda/lib/python3.9/site-packages/torch/include/c10/core/TensorImpl.h(2327): here
instantiation of "__nv_bool c10::TensorImpl::SetDimsTemplate(c10::ArrayRef) [with T=int64_t, =void]"
/opt/conda/lib/python3.9/site-packages/torch/include/c10/core/TensorImpl.h(2337): here

ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/opt/conda/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
subprocess.run(
File "/opt/conda/lib/python3.9/subprocess.py", line 528, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/src/flash-attention-v2/setup.py", line 288, in
setup(
File "/opt/conda/lib/python3.9/site-packages/setuptools/init.py", line 87, in setup
return distutils.core.setup(**attrs)
File "/opt/conda/lib/python3.9/site-packages/setuptools/_distutils/core.py", line 185, in setup
return run_commands(dist)
File "/opt/conda/lib/python3.9/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
dist.run_commands()
File "/opt/conda/lib/python3.9/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
self.run_command(cmd)
File "/opt/conda/lib/python3.9/site-packages/setuptools/dist.py", line 1208, in run_command
super().run_command(command)
File "/opt/conda/lib/python3.9/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
cmd_obj.run()
File "/opt/conda/lib/python3.9/site-packages/setuptools/_distutils/command/build.py", line 132, in run
self.run_command(cmd_name)
File "/opt/conda/lib/python3.9/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
self.distribution.run_command(command)
File "/opt/conda/lib/python3.9/site-packages/setuptools/dist.py", line 1208, in run_command
super().run_command(command)
File "/opt/conda/lib/python3.9/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
cmd_obj.run()
File "/opt/conda/lib/python3.9/site-packages/setuptools/command/build_ext.py", line 84, in run
_build_ext.run(self)
File "/opt/conda/lib/python3.9/site-packages/setuptools/_distutils/command/build_ext.py", line 346, in run
self.build_extensions()
File "/opt/conda/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 843, in build_extensions
build_ext.build_extensions(self)
File "/opt/conda/lib/python3.9/site-packages/setuptools/_distutils/command/build_ext.py", line 468, in build_extensions
self._build_extensions_serial()
File "/opt/conda/lib/python3.9/site-packages/setuptools/_distutils/command/build_ext.py", line 494, in _build_extensions_serial
self.build_extension(ext)
File "/opt/conda/lib/python3.9/site-packages/setuptools/command/build_ext.py", line 246, in build_extension
_build_ext.build_extension(self, ext)
File "/opt/conda/lib/python3.9/site-packages/setuptools/_distutils/command/build_ext.py", line 549, in build_extension
objects = self.compiler.compile(
File "/opt/conda/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 658, in unix_wrap_ninja_compile
_write_ninja_file_and_compile_objects(
File "/opt/conda/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1574, in _write_ninja_file_and_compile_objects
_run_ninja_build(
File "/opt/conda/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1909, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error compiling objects for extension
make: *** [Makefile:10: build-flash-attention-v2] Error 1
The command '/bin/sh -c make build-flash-attention-v2' returned a non-zero code: 2

I appreciate any time you have for hints.

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Clone and ./build.sh

Expected behavior

Build completes.

Merging non gptq adapter to gptq model

System Info

master branch

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Start the launcher with a GPTQ model, then try to load a non-GPTQ adapter.

2023-11-22T19:34:06.185127Z ERROR lorax_client: router/client/src/lib.rs:33: Server error: 'QuantLinear' object has no attribute 'weight'

For testing, I commented out the ID check:


# if adapter_config.base_model_name_or_path != model_id:
#     raise ValueError(f"Adapter '{adapter_id}' is not compatible with model '{model_id}'. "
#                      f"Use --model-id '{adapter_config.base_model_name_or_path}' instead.")

I am already thinking about a better check, because compatibility depends on the model architecture and parameter count instead of the ID directly, so a Zephyr LoRA could be merged successfully into Mistral Instruct too.

Expected behavior

Detect GPTQ models and use qweight instead of weight when merging.
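As a minimal sketch of that expected behavior (a hypothetical helper, not actual LoRAX code; full support would also require dequantizing qweight before merging):

def merge_target_weight(layer):
    # GPTQ QuantLinear layers pack their weights into `qweight`; fp16/bnb layers expose `weight`.
    if hasattr(layer, "qweight"):
        return layer.qweight
    return layer.weight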

Quantized models fail to generate expected output

Example:

mistralai/Mistral-7B-v0.1 --quantize bitsandbytes-nf4

Request:

curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs": "<|system|> You are a helpful assistant <|user|> What is deep learning? </s> <|assistant|>", "parameters": {"max_new_tokens": 64, "adapter_id": "qblocks/mistral_7b_norobots"}}' \
    -H 'Content-Type: application/json'

Response:

{"generated_text":""}

Expected:

{"generated_text":"Deep learning is a subset of machine learning that uses artificial neural networks to learn from data. It is a powerful tool for solving complex problems in fields such as natural language processing, computer vision, and speech recognition. Deep learning algorithms can learn from large amounts of data and make predictions or decisions based on that data. They can"}

Support custom tokenizer when loading a local model

Feature request

I have downloaded the model, so I want to run it as a local model. The sample is:

docker run --gpus all --shm-size 1g -p 8080:80 -v /data/model/:/data/ \
    ghcr.io/predibase/lorax:latest --model-id /data/model/Qwen-14B-Chat

Motivation

I want to use the local model. Our machines are not allowed to access huggingface.co.

Your contribution

No.

Not able to load adapter from local dir

System Info

base model: meta-llama/Llama-2-13b-chat-hf

docker cmd:
docker run --gpus '"device=3"' --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=XXXXX -p 8082:82 -v $volume:/data ghcr.io/predibase/lorax:latest --model-id $model --quantize bitsandbytes

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

from lorax import Client
URL="http://127.0.0.0:8082"
client = Client(URL,timeout=20)

adapter_id="./data/lora_models/unsighing/" #this dir contain adapter_config.json or adapter_model.bin
adapter_source="local"
client.generate(prompt,max_new_tokens=128,temperature=0.001,adapter_id=adapter_id,adapter_source=adapter_source).generated_text

getting this error while running the above code :
GenerationError: Request failed during generation: Server error: No local weights found in ./data/lora_models/unsighing/ with extension .safetensors

Also getting a different error while putting the complete path for adapter_id:

adapter_id="home/code/data/lora_models/unsighing/"

GenerationError: Request failed during generation: Server error: Can't find 'adapter_config.json' at 'home/code/data/lora_models/unsighing/'

More info:
base model: meta-llama/Llama-2-13b-chat-hf
docker cmd:
docker run --gpus '"device=3"' --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=XXXXX -p 8082:82 -v $volume:/data ghcr.io/predibase/lorax:latest --model-id $model --quantize bitsandbytes

Note: I am able to make successful predictions with the base model if I don't provide adapter_id in the same setup above:

client.generate(prompt,max_new_tokens=128,temperature=0.001).generated_text

This works fine for me.

Any help will be appreciated. Thanks!

Expected behavior

Should load the adapter from local dir.

getting error during inference "Unsupported head size: 32"

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

I tried LoRA fine-tuning a smaller variant of the Mistral architecture, but I am getting the error below:

GenerationError: Request failed during generation: Server error: Unsupported head size: 32

I used rank: 16, alpha: 32

https://huggingface.co/Locutusque/TinyMistral-248M-Instruct

Expected behavior

It should have worked since it's following the mistral architecture. (TinyLlama was working fine)

Extend testing

Feature request

Extend the testing with tiny dummy models

Motivation

Some CPU-based tests, for example for quantization, decoding, and model loading across architectures.

Your contribution

Can open a PR.
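For example, a minimal sketch of such a CPU-only smoke test (the tiny checkpoint and test name are illustrative):

from transformers import AutoModelForCausalLM, AutoTokenizer

def test_tiny_model_generates_on_cpu():
    # A tiny random checkpoint keeps the test fast and GPU-free.
    name = "hf-internal-testing/tiny-random-gpt2"
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)
    ids = tok("hello world", return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=4, do_sample=False)
    assert out.shape[-1] > ids.shape[-1]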

adapters produce unk tokens only

System Info

docker from latest

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

example prompt:


User: Who are you ? 

Assistant: You mean me ?

User: Who are you ? 

Assistant: You mean me ?

User: Who are you ? 

Assistant: You mean me ?

User: Who are you ? 

Assistant: You mean me ?

User: Who are you ? 

Assistant: You mean me ?

This is enough to make the model produce only unk tokens for any prompt.
When cutting it down to only 2 lines, it works as expected.

base model is:
mistralai/Mistral-7B-v0.1

adapters tested:
https://huggingface.co/qblocks/mistral_7b_norobots
https://huggingface.co/flozi00/mistral-zephyr-lora
https://huggingface.co/flozi00/mistral-germanassistantv4

start command

docker run --pull always --gpus all -d --shm-size 1g -p 8080:80 ghcr.io/predibase/lorax:latest --model-id mistralai/Mistral-7B-Instruct-v0.1 --cuda-memory-fraction 0.5 --max-total-tokens 8192 --max-batch-prefill-tokens 7000 --max-input-length 7000

Expected behavior

Working with and without adapters loaded

HQQ just in time quantization

Feature request

https://github.com/mobiusml/hqq

Adding HQQ as a quantization function, similar to bitsandbytes, so that it works just-in-time, taking only about 5 minutes for a 70B model.

Motivation

2 bit quantization

Your contribution

Can take this to a PR, adding it as a quantization runtime as a first step.

Lorax Launcher Fails with Unsupported Models Due to Adapter Loading Issue

System Info

Target: x86_64-unknown-linux-gnu
Cargo version: 1.70.0
Commit sha: N/A
Docker label: N/A
NVIDIA-SMI 545.23.06 Driver Version: 545.23.06 CUDA Version: 12.3
model_id = "bigscience/bloom-560m"

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

run "lorax-launcher" without specifying model-id (it defaults to the "bigscience/bloom-560m") model or run any model not yet supported by lorax.

lorax-launcher

error message (client side):
{"error":"Request failed during generation: Server error: 'BLOOMSharded' object has no attribute 'load_adapter'","error_type":"generation"}

error message (server side):
2023-11-27T08:25:59.281666Z INFO lorax_router::loader: router/src/loader.rs:146: adapter vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k downloaded
2023-11-27T08:25:59.281719Z INFO lorax_router::queue: router/src/queue.rs:135: set adapter vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k status to Downloaded
2023-11-27T08:25:59.318818Z ERROR lorax_client: router/client/src/lib.rs:33: Server error: 'BLOOMSharded' object has no attribute 'load_adapter'
2023-11-27T08:25:59.318826Z INFO lorax_router::loader: router/src/loader.rs:201: FAILED loading adapter vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k
2023-11-27T08:25:59.318833Z INFO lorax_router::queue: router/src/queue.rs:135: set adapter vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k status to Errored
2023-11-27T08:25:59.318862Z INFO lorax_router::loader: router/src/loader.rs:271: terminating adapter vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k loader

Expected behavior

When an unsupported model is run, Lorax should revert to its original implementation, which does not attempt to load an adapter for models that are not supported. In cases where an unsupported model is run without specifying an adapter_id, the system should still attempt to generate a response. If an adapter_id is specified, Lorax should notify the user that the model is not supported, rather than crashing.

Punica kernel build fails

I am trying to rebuild the LoRAX Docker image, which is failing in the punica-builder stage. Error logs are attached; could you advise?

My final goal is to make Lorax deployable on Sagemaker by adding back the entrypoint for a sagemaker stage which was originally in the Dockerfile.

Thanks.

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

docker build --target base -t lorax:base
build.log

Expected behavior

Successful Docker build.

Use special tokens specific to the fine-tuned adapter during decoding

During fine-tuning, it's possible that special tokens are added that are specific to the adapter. During decoding, we should use those special tokens and ensure the correct stop tokens, padding, etc. are properly honored.

Repro from @runvnc, related: #68

Model ID: https://huggingface.co/qblocks/mistral_7b_norobots/tree/main

QLoRA repo example uses this AutoTokenizer with special tokens:

https://github.com/artidoro/qlora/blob/7f4e95a68dc076bea9b3a413d2b512eca6d004e5/qlora.py#L347
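A minimal sketch of inspecting the adapter-specific special tokens, assuming the adapter repo ships its own tokenizer files:

from transformers import AutoTokenizer

base_tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
adapter_tok = AutoTokenizer.from_pretrained("qblocks/mistral_7b_norobots")

# Tokens the adapter added during fine-tuning that the base tokenizer doesn't know about,
# plus the stop/pad tokens that decoding should honor.
extra = sorted(set(adapter_tok.get_vocab()) - set(base_tok.get_vocab()))
print(extra, adapter_tok.eos_token, adapter_tok.pad_token)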

Unexpected CUDA out of memory errors

System Info

ghcr.io/predibase/lorax:latest

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

First, run

docker run --gpus all --shm-size 1g -p 8000:80 -e "HUGGING_FACE_HUB_TOKEN=<token>" -v $volume:/data ghcr.io/predibase/lorax:latest --trust-remote-code --model-id meta-llama/Llama-2-7b-chat-hf

Then, send the following requests:

from concurrent.futures import ThreadPoolExecutor
import time
import requests

url = "http://127.0.0.1:8000/generate"
headers = {"Content-Type": "application/json"}

# Function to send a request
def send_request(payload):
    start_time = time.time()
    response = requests.post(url, headers=headers, json=payload)
    elapsed_time = time.time() - start_time
    return response.json(), elapsed_time

# Number of concurrent requests

adapter_list = [<list of 15 adapters>]
num_requests = len(adapter_list)

# Using ThreadPoolExecutor to send requests concurrently
with ThreadPoolExecutor(max_workers=num_requests) as executor:
    # Submit the requests
    futures = []
    for i in range(num_requests):
        payload = {
            "inputs": "Hello, my name is",
            "parameters": {"max_new_tokens": 100, "adapter_id": adapter_list[i]},
        }
        futures.append(executor.submit(send_request, payload))

    # Wait for all requests to complete
    results = [future.result() for future in futures]

print(results)

Expected behavior

First of all, awesome work with lorax! When sending requests at the same time like this, I'm receiving Cuda Out Of Memory errors on the server. I thought that actually, the server would check beforehand how many tokens can be handled and consequently enqueue requests that cannot be served. Have you encountered this before?

Add RoPE scaling CLI args

Currently the user can configure dynamic RoPE scaling by setting the environment variables ROPE_SCALING and ROPE_FACTOR like so:

export ROPE_SCALING=dynamic 
export ROPE_FACTOR=2

But this is very clunky and not documented. We should add these as CLI args to the lorax-launcher so they can be better documented and less error prone.

Training example?

Feature request

I'm not sure how I managed it, but it seems like I have got a training script that creates a LoRA that loads but has little to no effect. I have been modifying the qLoRA script to try to work with lorax.

By any chance can you point me to a training script that is known to work effectively with this system? Apologies if this is too obvious. Most of what I see when I search for fine-tuning examples right now are for qLoRA or 8 bit.

Motivation

Just trying to verify that I didn't do something wrong.

Your contribution

Happy to test whatever you suggest with our dataset. To be honest, our dataset might be part of the problem, but I'm not sure.

Project Roadmap

WIP project roadmap for LoRAX. We'll continue to update this over time.

v0.10

  • Speculative decoding adapters
  • AQLM

v0.11

  • Prefix caching
  • Embedding endpoint
  • BERT support
  • Embedding adapters
  • Classification adapters

Previous Releases

v0.9

  • Adapter memory pool

Backlog

Models

  • Llama
  • Mistral
  • GPT2
  • Qwen
  • Mixtral
  • Phi
  • Bloom
  • BERT
  • Stable-Diffusion

Adapters

Throughput / Latency

  • Paged Attention v2
  • Lookahead Decoding
  • SGMV with variable ranks
  • SGMV with tensor parallelism

Quantization

  • bitsandbytes
  • GPT-Q
  • AWQ

Usability

  • Prebuilt server wheels
  • SkyPilot usage guide
  • Example notebooks

Reduce docker image size

Feature request

Not sure if this should be a feature request.
The current Docker image contains two copies of PyTorch, which results in an extra ~6 GB of image size.

Motivation

Faster startup time in a Function-as-a-Service environment, and less docker pull wait time.

Your contribution

diff --git a/Dockerfile b/Dockerfile
index 19b2e06..273278a 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -199,7 +199,8 @@ RUN pip install einops --no-cache-dir

 # Install the pip requirements
 COPY server/requirements.txt .
-RUN pip install -r requirements.txt
+# HACK: make torch version same as the one installed by conda
+RUN sed -i 's/+cu118//g' requirements.txt; pip install -r requirements.txt --no-cache-dir

 # Install server
 COPY proto proto
@@ -234,7 +235,8 @@ RUN chmod +x sync.sh

 RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" && \
     unzip awscliv2.zip && \
-    sudo ./aws/install
+    sudo ./aws/install && \
+    rm -rf aws awscliv2.zip

 # ENTRYPOINT ["./entrypoint.sh"]
 ENTRYPOINT ["lorax-launcher"]
diff --git a/server/poetry.lock b/server/poetry.lock
index 8195489..7443101 100644
--- a/server/poetry.lock
+++ b/server/poetry.lock
@@ -2787,4 +2787,4 @@ quantize = ["accelerate", "datasets", "texttable"]
 [metadata]
 lock-version = "2.0"
 python-versions = "^3.9"
-content-hash = "151ae83f306aafec7e9fe044359d9eaada48c55910ad7f25de7461507f6adfe6"
+content-hash = "c9f828f35184814a2017369a1cbe783f42931a9c034d1ac5f5de377cbb69ffdc"
diff --git a/server/pyproject.toml b/server/pyproject.toml
index 206a3f8..09fa4b4 100644
--- a/server/pyproject.toml
+++ b/server/pyproject.toml
@@ -32,7 +32,7 @@ einops = "^0.6.1"
 tiktoken = "^0.5.2"
 texttable = { version = "^1.6.7", optional = true }
 datasets = { version = "^2.14.0", optional = true }
-torch = {version = "2.1.1+cu118", source = "torch"}
+torch = {version = "2.1.1", source = "torch"}
 peft = "0.4.0"
 boto3 = "^1.28.34"
 urllib3 = "<=1.26.18"

I was able to build a smaller Docker image with the above patch, but it's quite hacky.

REPOSITORY                TAG       IMAGE ID       CREATED        SIZE
test                      latest    bedbe3725de5   17 hours ago   10.5GB
ghcr.io/predibase/lorax   0.4.1     36d7669de298   31 hours ago   17.5GB

Latency increase when run on multi-GPU

System Info

I run your docker image in 2 cases:

  • single gpu (--sharded false)
  • multi-gpu (--sharded false --num_shard 4)
    => When I run on a single GPU, the total time is around 1.5 seconds and it takes ~21GB of GPU memory, but when I run on multi-GPU, it takes ~2.4 seconds and 19GB per GPU :( It seems performance is lower when running multi-GPU.
    Have you met this problem?
{
  "model_id": "Open-Orca/Mistral-7B-OpenOrca",
  "adapter_id": "",
  "source": "hub",
  "adapter_source": "hub",
  "revision": null,
  "validation_workers": 2,
  "sharded": true,
  "num_shard": 4,
  "quantize": "BitsandbytesNF4",
  "dtype": null,
  "trust_remote_code": false,
  "max_concurrent_requests": 128,
  "max_best_of": 1,
  "max_stop_sequences": 4,
  "max_input_length": 2048,
  "max_total_tokens": 4096,
  "waiting_served_ratio": 1.2,
  "max_batch_prefill_tokens": 4096,
  "max_batch_total_tokens": 100000,
  "max_waiting_tokens": 20,
  "max_active_adapters": 10,
  "adapter_cycle_time_s": 2,
  "hostname": "0.0.0.0",
  "port": 8000,
  "shard_uds_path": "/tmp/lorax-server",
  "master_addr": "localhost",
  "master_port": 29500,
  "huggingface_hub_cache": "/data",
  "weights_cache_override": null,
  "disable_custom_kernels": false,
  "cuda_memory_fraction": 1,
  "json_output": true,
  "otlp_endpoint": null,
  "cors_allow_origin": [],
  "watermark_gamma": null,
  "watermark_delta": null,
  "ngrok": false,
  "ngrok_authtoken": null,
  "ngrok_edge": null,
  "env": false,
  "download_only": false
}

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Run docker with --sharded true --num_shard 4

Expected behavior

Same or better performance when running multi-GPU.

Return number of input tokens when `details=True`

We return the number of generated tokens when details=True, but systems like the OpenAI API also return the number of input tokens. This is useful, for example, for metering systems that limit users based on the number of input + output tokens.

Current:

'details': {'generated_tokens': 20}

Proposal:

'details': {'prompt_tokens': 120, 'generated_tokens': 20}
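Until then, a minimal client-side sketch for approximating prompt_tokens, assuming the caller tokenizes with the same tokenizer the server uses:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
prompt = "[INST] What is deep learning? [/INST]"
prompt_tokens = len(tok(prompt)["input_ids"])

details = {"prompt_tokens": prompt_tokens, "generated_tokens": 20}
print(details)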

Fail to load gptq base model in 0.4

System Info

ghcr.io/predibase/lorax:0.4 fails to load a GPTQ model.

command: --model-id /mnt/local-model/Qwen-14B-Chat-Int4/ --quantize gptq --trust-remote-code

Using model:

2023-12-17T14:25:01.295949Z ERROR lorax_launcher: interceptor.py:41 Method Warmup encountered an error.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/custom_autotune.py", line 90, in _bench
    return triton.testing.do_bench(
TypeError: do_bench() got an unexpected keyword argument 'percentiles'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/bin/lorax-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 84, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 277, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
    return await self.intercept(
> File "/opt/conda/lib/python3.10/site-packages/lorax_server/interceptor.py", line 38, in intercept
    return await response
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
    raise error
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 74, in Warmup
    max_supported_total_tokens = self.model.warmup(batch)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 864, in warmup
    _, batch = self.generate_token(batch)
  File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 963, in generate_token
    raise e
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 960, in generate_token
    out = self.forward(batch, adapter_data)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 919, in forward
    return self.model.forward(
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_qwen_modeling.py", line 475, in forward
    hidden_states = self.transformer(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_qwen_modeling.py", line 432, in forward
    hidden_states, residual = layer(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_qwen_modeling.py", line 357, in forward
    attn_output = self.attn(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_qwen_modeling.py", line 226, in forward
    qkv = self.c_attn(hidden_states, adapter_data)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/layers.py", line 481, in forward
    result = self.base_layer(input)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/layers.py", line 285, in forward
    return self.linear.forward(x)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/quant_linear.py", line 349, in forward
    out = QuantLinearFunction.apply(
  File "/opt/conda/lib/python3.10/site-packages/torch/autograd/function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/amp/autocast_mode.py", line 121, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/quant_linear.py", line 244, in forward
    output = matmul248(input, qweight, scales, qzeros, g_idx, bits, maxq)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/quant_linear.py", line 216, in matmul248
    matmul_248_kernel[grid](
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/custom_autotune.py", line 110, in run
    timings = {
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/custom_autotune.py", line 111, in <dictcomp>
    config: self._bench(*args, config=config, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/custom_autotune.py", line 93, in _bench
    except triton.compiler.OutOfResources:
AttributeError: module 'triton.compiler' has no attribute 'OutOfResources'

2023-12-17T14:25:01.296225Z ERROR warmup{max_input_length=1024 max_prefill_tokens=4096}:warmup: lorax_client: router/client/src/lib.rs:33: Server error: module 'triton.compiler' has no attribute 'OutOfResources'
Error: Warmup(Generation("module 'triton.compiler' has no attribute 'OutOfResources'"))

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/predibase/lorax:0.4 --model-id Qwen/Qwen-14B-Chat-Int4

Expected behavior

It runs ok with ghcr.io/predibase/lorax:0.3

issues launching docker cmd for "mistralai/Mistral-7B-Instruct-v0.2"

System Info

Using the official docker run cmd:

model=mistralai/Mistral-7B-Instruct-v0.2
volume=$PWD/data

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/predibase/lorax:latest --model-id $model

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

model=mistralai/Mistral-7B-Instruct-v0.2
volume=$PWD/data

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/predibase/lorax:latest --model-id $model

error:
2023-12-13T10:44:53.233748Z ERROR shard-manager: lorax_launcher: Shard complete standard error output:

Traceback (most recent call last):
  File "/opt/conda/bin/lorax-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/lorax_server/cli.py", line 81, in serve
    server.serve(
  File "/opt/conda/lib/python3.9/site-packages/lorax_server/server.py", line 262, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/opt/conda/lib/python3.9/site-packages/lorax_server/server.py", line 214, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.9/site-packages/lorax_server/models/__init__.py", line 274, in get_model
    return FlashMistral(
  File "/opt/conda/lib/python3.9/site-packages/lorax_server/models/flash_mistral.py", line 347, in __init__
    SLIDING_WINDOW_BLOCKS = math.ceil(config.sliding_window / BLOCK_SIZE)

TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'

Expected behavior

It should run the server with the model.
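The failure comes from Mistral-7B-Instruct-v0.2 shipping "sliding_window": null in its config.json, so the division by BLOCK_SIZE receives None. A minimal guard sketch (the BLOCK_SIZE value and function name here are illustrative, not the project's code):

import math
from typing import Optional

BLOCK_SIZE = 16  # illustrative placeholder; the real value comes from the server config

def sliding_window_blocks(config) -> Optional[int]:
    # Mistral-7B-Instruct-v0.2 sets "sliding_window": null in config.json,
    # so guard against None before dividing to avoid the TypeError above.
    if getattr(config, "sliding_window", None) is None:
        return None
    return math.ceil(config.sliding_window / BLOCK_SIZE)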

Always consider base model (no adapter) to be active

Feature request

There is no exchange cost for including a base model request in the batch, so we should always consider such requests as part of the "active set" that can be included in a given batch.
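A sketch of the proposed rule, assuming the scheduler tracks an active-adapter set keyed by adapter id and uses an empty string as the "no adapter" sentinel (both assumptions for illustration):

BASE_MODEL_ADAPTER_ID = ""  # assumed sentinel meaning "no adapter"

def is_schedulable(adapter_id: str, active_adapters: set) -> bool:
    # Base-model requests never trigger an adapter exchange, so they are
    # always treated as part of the active set.
    if adapter_id == BASE_MODEL_ADAPTER_ID:
        return True
    return adapter_id in active_adapters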

Motivation

No response

Your contribution

No response

Sharded adapters not working

System Info

Model info:

{
  "model_id": "mistralai/Mistral-7B-Instruct-v0.1",
  "model_sha": "7ad5799710574ba1c1d953eba3077af582f3a773",
  "model_dtype": "torch.float16",
  "model_device_type": "cuda",
  "model_pipeline_tag": "text-generation",
  "max_concurrent_requests": 128,
  "max_best_of": 2,
  "max_stop_sequences": 4,
  "max_input_length": 1024,
  "max_total_tokens": 2048,
  "waiting_served_ratio": 1.2,
  "max_batch_total_tokens": 1102544,
  "max_waiting_tokens": 20,
  "validation_workers": 2,
  "version": "0.1.0",
  "sha": null,
  "docker_label": null
}

2 A100 gpus, NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 outside docker.

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Run mistral example with docker on 2 gpus:

model=mistralai/Mistral-7B-Instruct-v0.1
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/predibase/lorax:latest --model-id $model --num-shard 2

Then try to generate:

❯ curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs": "[INST] Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? [/INST]", "parameters": {"max_new_tokens": 64, "adapter_id": "vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k"}}' \
    -H 'Content-Type: application/json'
{"error":"Request failed during generation: Server error: local variable 'lora_b' referenced before assignment","error_type":"generation"}%

Basically, the issue is that when multiplying by the first lora_a matrix, it arrives sharded with shape [2048, r] while the input is not sharded and has shape [49, 4096].
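For illustration, a shape check along these lines (the function and argument names are hypothetical) makes the mismatch explicit: under tensor parallelism the LoRA A matrix must keep the full hidden size on its input dimension, and only B should be split across shards:

import torch

def check_lora_shapes(hidden_states: torch.Tensor, lora_a: torch.Tensor, lora_b: torch.Tensor) -> None:
    # hidden_states: [num_tokens, hidden_size], lora_a: [hidden_size, r], lora_b: [r, out_features]
    assert hidden_states.shape[-1] == lora_a.shape[0], (
        f"lora_a input dim {lora_a.shape[0]} != hidden size {hidden_states.shape[-1]}; "
        "the A matrix appears to have been sharded along the wrong dimension"
    )
    assert lora_a.shape[1] == lora_b.shape[0], "rank mismatch between lora_a and lora_b"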

Expected behavior

Generation completed successfully

Question regarding Punica integration

The acknowledgements of this project mention the SGMV kernels created by the Punica project. Is there a way to run multiple adapters simultaneously with LoRAX, similar to what the Punica example shows? Can this be done via the AsyncClient?
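Heterogeneous continuous batching means concurrent requests can each name a different adapter_id, so one way to exercise this from Python is sketched below. The AsyncClient import path and constructor are assumptions modeled on the synchronous client shown elsewhere on this page:

import asyncio
from lorax import AsyncClient  # assumed import path, mirroring the sync Client

async def main():
    client = AsyncClient("http://127.0.0.1:8080")
    requests = [
        ("vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k", "[INST] 2 + 2? [/INST]"),
        ("", "[INST] Tell me a joke. [/INST]"),  # empty adapter_id -> base model (assumption)
    ]
    # Requests targeting different adapters are issued concurrently; the server
    # can then batch them together.
    responses = await asyncio.gather(
        *[client.generate(prompt, adapter_id=adapter_id, max_new_tokens=64)
          for adapter_id, prompt in requests]
    )
    for r in responses:
        print(r.generated_text)

asyncio.run(main())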

Error running Mixtral: 'TensorParallelHead' object has no attribute 'base_layer'

System Info

Running latest docker image

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

I started a mixtral server on 2 A100 (80GB) GPUs:
lorax-launcher --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 --num-shard 2

Then I sent a generate_stream request to my_adapter, which is an adapter trained with target_modules= ["q_proj", "v_proj"].

I then get the following error:

2023-12-13T22:50:01.147514Z  INFO lorax_launcher: flash_causal_lm.py:742 Loading adapter weights into model: my_adater
2023-12-13T22:50:08.489110Z ERROR lorax_launcher: server.py:170 Error when loading adapter
Traceback (most recent call last):
  File "/opt/conda/bin/lorax-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 84, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 271, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
    return await self.intercept(
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/interceptor.py", line 38, in intercept
    return await response
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
    return await behavior(request_or_iterator, context)
> File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 162, in LoadAdapter
    self.model.load_adapter(adapter_id, adapter_source, adapter_index)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 750, in load_adapter
    self.load_batched_adapter_weights(
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 796, in load_batched_adapter_weights
    base_weight = layer.base_layer.linear.weight
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1695, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'TensorParallelHead' object has no attribute 'base_layer'

Given the merged Mixtral pull request, I assumed that the model would be supported. Does this only apply to the base model or are adapters also supported?
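The traceback suggests load_batched_adapter_weights assumes every target layer exposes a base_layer, which the tensor-parallel LM head does not. A hypothetical guard (not the project's actual fix) would skip such layers instead of raising:

def get_base_weight(layer):
    # Layers like TensorParallelHead have no `base_layer`; return None so the
    # caller can skip adapting them rather than crash with AttributeError.
    base = getattr(layer, "base_layer", None)
    if base is None:
        return None
    return base.linear.weight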

Expected behavior

I would expect the model to return a text response

Make lorax hyperparams configurable

Feature request

When starting the router, the following params should be configurable:

  • --adapter-cycle-time-s (default: 2)
  • --max-active-adapters (default: 128)

Motivation

No response

Your contribution

No response

Fuse allgather requests across adapters and q, k, v to reduce small network requests

Feature request

The current approach to tensor parallelism from #5 is not latency optimized. We make an allgather call for every adapter, which will be quite slow for many adapters. Additionally, we don't fuse together the q and v matrices, which would further halve the number of allgathers.

A better approach would be to pre-allocate a large tensor and then slice in and out the individual tensors, as shown here:

https://discuss.pytorch.org/t/concatenate-tensors-without-memory-copying/34609
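An illustrative sketch of the idea (names and signature are assumptions, not LoRAX code): pack the per-adapter q/v slices into one pre-allocated buffer so a single collective replaces many small ones:

import torch
import torch.distributed as dist

def fused_all_gather(tensors, group=None) -> torch.Tensor:
    """Gather several small tensors with one collective instead of one per tensor."""
    sizes = [t.numel() for t in tensors]
    buf = tensors[0].new_empty(sum(sizes))   # pre-allocated send buffer
    offset = 0
    for t, n in zip(tensors, sizes):         # slice each tensor into the buffer
        buf[offset:offset + n].copy_(t.reshape(-1))
        offset += n
    world_size = dist.get_world_size(group)
    out = buf.new_empty(world_size * buf.numel())
    dist.all_gather_into_tensor(out, buf, group=group)  # single network request
    return out  # callers slice the per-rank, per-tensor views back out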

Motivation

No response

Your contribution

No response

Panic when adapter cannot be loaded

System Info

No response

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Weights are already downloaded, but failure occurs during load (e.g., local model with adapter but the config hasn't been downloaded). Likely a race condition with cleanup logic.

2023-11-12T05:38:04.192534Z ERROR text_generation_client: router/client/src/lib.rs:33: Server error: Can't find 'adapter_config.json' at '/data/models--arnavgrg--codealpaca_v3/snapshots/834b33af35ff5965ea3e4bc18b51ad5d65da7466'
2023-11-12T05:38:04.192612Z  INFO text_generation_router::loader: router/src/loader.rs:184: FAILED loading adapter /data/models--arnavgrg--codealpaca_v3/snapshots/834b33af35ff5965ea3e4bc18b51ad5d65da7466
2023-11-12T05:38:04.192682Z ERROR text_generation_router::queue: router/src/queue.rs:240: adapter /data/models--arnavgrg--codealpaca_v3/snapshots/834b33af35ff5965ea3e4bc18b51ad5d65da7466 not found in queue_map
Backtrace:
  text_generation_router::queue::AdapterQueuesState::set_status (./src/queue.rs:241)
  text_generation_router::loader::loader_task::{{closure}} (./src/loader.rs:186)
  <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll (/build/rustc-v6rcRM/rustc-1.66.1+dfsg0ubuntu1~llvm/library/core/src/future/mod.rs:91)
  tokio::runtime::task::core::Core<T,S>::poll::{{closure}} (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/task/core.rs:311)
  tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/loom/std/unsafe_cell.rs:14)
  tokio::runtime::task::core::Core<T,S>::poll (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/task/core.rs:300)
  tokio::runtime::task::harness::poll_future::{{closure}} (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/task/harness.rs:476)
  <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once (/build/rustc-v6rcRM/rustc-1.66.1+dfsg0ubuntu1~llvm/library/core/src/panic/unwind_safe.rs:271)
  std::panicking::try::do_call (/build/rustc-v6rcRM/rustc-1.66.1+dfsg0ubuntu1~llvm/library/std/src/panicking.rs:483)
  __rust_try
  std::panicking::try (/build/rustc-v6rcRM/rustc-1.66.1+dfsg0ubuntu1~llvm/library/std/src/panicking.rs:447)
  std::panic::catch_unwind (/build/rustc-v6rcRM/rustc-1.66.1+dfsg0ubuntu1~llvm/library/std/src/panic.rs:137)
  tokio::runtime::task::harness::poll_future (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/task/harness.rs:464)
  tokio::runtime::task::harness::Harness<T,S>::poll_inner (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/task/harness.rs:198)
  tokio::runtime::task::harness::Harness<T,S>::poll (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/task/harness.rs:152)
  tokio::runtime::task::raw::poll (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/task/raw.rs:276)
  tokio::runtime::task::raw::RawTask::poll (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/task/raw.rs:200)
  tokio::runtime::task::LocalNotified<S>::run (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/task/mod.rs:400)
  tokio::runtime::scheduler::multi_thread::worker::Context::run_task::{{closure}} (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/scheduler/multi_thread/worker.rs:639)
  tokio::runtime::coop::with_budget (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/coop.rs:107)
  tokio::runtime::coop::budget (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/coop.rs:73)
  tokio::runtime::scheduler::multi_thread::worker::Context::run_task (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/scheduler/multi_thread/worker.rs:575)
  tokio::runtime::scheduler::multi_thread::worker::Context::run (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/scheduler/multi_thread/worker.rs:526)
  tokio::runtime::scheduler::multi_thread::worker::run::{{closure}}::{{closure}} (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/scheduler/multi_thread/worker.rs:491)
  tokio::runtime::context::scoped::Scoped<T>::set (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/context/scoped.rs:40)
  tokio::runtime::context::set_scheduler::{{closure}} (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/context.rs:176)
  std::thread::local::LocalKey<T>::try_with (/build/rustc-v6rcRM/rustc-1.66.1+dfsg0ubuntu1~llvm/library/std/src/thread/local.rs:446)
  std::thread::local::LocalKey<T>::with (/build/rustc-v6rcRM/rustc-1.66.1+dfsg0ubuntu1~llvm/library/std/src/thread/local.rs:422)
  tokio::runtime::context::set_scheduler (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/context.rs:176)
  tokio::runtime::scheduler::multi_thread::worker::run::{{closure}} (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/scheduler/multi_thread/worker.rs:486)
  tokio::runtime::context::runtime::enter_runtime (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/context/runtime.rs:65)
  tokio::runtime::scheduler::multi_thread::worker::run (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/scheduler/multi_thread/worker.rs:478)
  tokio::runtime::scheduler::multi_thread::worker::Launch::launch::{{closure}} (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/scheduler/multi_thread/worker.rs:447)
  <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/blocking/task.rs:42)
  tokio::runtime::task::core::Core<T,S>::poll::{{closure}} (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/task/core.rs:311)
  tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/loom/std/unsafe_cell.rs:14)
  tokio::runtime::task::core::Core<T,S>::poll (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/task/core.rs:300)
  tokio::runtime::task::harness::poll_future::{{closure}} (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/task/harness.rs:476)
  <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once (/build/rustc-v6rcRM/rustc-1.66.1+dfsg0ubuntu1~llvm/library/core/src/panic/unwind_safe.rs:271)
  std::panicking::try::do_call (/build/rustc-v6rcRM/rustc-1.66.1+dfsg0ubuntu1~llvm/library/std/src/panicking.rs:483)
  __rust_try
  std::panicking::try (/build/rustc-v6rcRM/rustc-1.66.1+dfsg0ubuntu1~llvm/library/std/src/panicking.rs:447)
  std::panic::catch_unwind (/build/rustc-v6rcRM/rustc-1.66.1+dfsg0ubuntu1~llvm/library/std/src/panic.rs:137)
  tokio::runtime::task::harness::poll_future (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/task/harness.rs:464)
  tokio::runtime::task::harness::Harness<T,S>::poll_inner (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/task/harness.rs:198)
  tokio::runtime::task::harness::Harness<T,S>::poll (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/task/harness.rs:152)
  tokio::runtime::task::raw::poll (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/task/raw.rs:276)
  tokio::runtime::task::raw::RawTask::poll (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/task/raw.rs:200)
  tokio::runtime::task::UnownedTask<S>::run (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/task/mod.rs:437)
  tokio::runtime::blocking::pool::Task::run (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/blocking/pool.rs:159)
  tokio::runtime::blocking::pool::Inner::run (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/blocking/pool.rs:513)
  tokio::runtime::blocking::pool::Spawner::spawn_thread::{{closure}} (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/blocking/pool.rs:471)
  std::sys_common::backtrace::__rust_begin_short_backtrace (/build/rustc-v6rcRM/rustc-1.66.1+dfsg0ubuntu1~llvm/library/std/src/sys_common/backtrace.rs:121)
  std::thread::Builder::spawn_unchecked_::{{closure}}::{{closure}} (/build/rustc-v6rcRM/rustc-1.66.1+dfsg0ubuntu1~llvm/library/std/src/thread/mod.rs:551)
  <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once (/build/rustc-v6rcRM/rustc-1.66.1+dfsg0ubuntu1~llvm/library/core/src/panic/unwind_safe.rs:271)
  std::panicking::try::do_call (/build/rustc-v6rcRM/rustc-1.66.1+dfsg0ubuntu1~llvm/library/std/src/panicking.rs:483)
  __rust_try
  std::panicking::try (/build/rustc-v6rcRM/rustc-1.66.1+dfsg0ubuntu1~llvm/library/std/src/panicking.rs:447)
  std::panic::catch_unwind (/build/rustc-v6rcRM/rustc-1.66.1+dfsg0ubuntu1~llvm/library/std/src/panic.rs:137)
  std::thread::Builder::spawn_unchecked_::{{closure}} (/build/rustc-v6rcRM/rustc-1.66.1+dfsg0ubuntu1~llvm/library/std/src/thread/mod.rs:550)
  core::ops::function::FnOnce::call_once{{vtable.shim}} (/build/rustc-v6rcRM/rustc-1.66.1+dfsg0ubuntu1~llvm/library/core/src/ops/function.rs:251)
  <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once (/build/rustc-v6rcRM/rustc-1.66.1+dfsg0ubuntu1~llvm/library/alloc/src/boxed.rs:1987)
  <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once (/build/rustc-v6rcRM/rustc-1.66.1+dfsg0ubuntu1~llvm/library/alloc/src/boxed.rs:1987)
  std::sys::unix::thread::Thread::new::thread_start (/build/rustc-v6rcRM/rustc-1.66.1+dfsg0ubuntu1~llvm/library/std/src/sys/unix/thread.rs:108)
  start_thread (/build/glibc-BHL3KM/glibc-2.31/nptl/pthread_create.c:477)
  clone (/build/glibc-BHL3KM/glibc-2.31/misc/../sysdeps/unix/sysv/linux/x86_64/clone.S:95)
thread 'tokio-runtime-worker' panicked at 'called `Option::unwrap()` on a `None` value', router/src/queue.rs:243:23

Expected behavior

No response
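Until the router handles a missing adapter_config.json gracefully instead of panicking, a client-side pre-check is a cheap mitigation. A minimal sketch (the helper name is illustrative; the path is the one from the logs above):

from pathlib import Path

def adapter_is_loadable(adapter_path: str) -> bool:
    # A local PEFT adapter needs adapter_config.json next to its weights;
    # checking first avoids sending a request the router cannot serve.
    return (Path(adapter_path) / "adapter_config.json").is_file()

adapter_dir = "/data/models--arnavgrg--codealpaca_v3/snapshots/834b33af35ff5965ea3e4bc18b51ad5d65da7466"
if not adapter_is_loadable(adapter_dir):
    raise FileNotFoundError("adapter_config.json missing; re-download or fix the adapter directory")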

Problem with quantize model

System Info

Can you tell me a bit about how to serve a model in 4-bit quantized mode?
I added --quantize bitsandbytes-nf4 when running the Docker container, but nothing changed: GPU memory usage stayed the same.

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

  1. docker run --gpus all --shm-size 1g -p 8080:80 -v ./ckpts:/data ghcr.io/predibase/lorax:latest --model-id /data/OpenHermes-2-7B-base-2.3 --quantize bitsandbytes-nf4

Expected behavior

Reduce the GPU memory
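As a first check, it can help to confirm what the shard actually loaded; a minimal sketch, assuming the router exposes the same GET /info route whose JSON response appears earlier on this page:

import requests

# Query the router's info endpoint (assumed to match the JSON shown in the
# "Sharded adapters not working" issue above) and inspect the reported settings.
info = requests.get("http://127.0.0.1:8080/info", timeout=10).json()
print(info.get("model_dtype"), info.get("model_device_type"))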

Some error records and questions

System Info

Docker image: 2023-12-06
GPUs: 2x A40 (48 GB)
OS: CentOS 7

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

null

Expected behavior

I tested three models: Qwen-14B, Yi-34B-Chat (Llama2-based), and XuanYuan-70B-Chat (Llama2-based). For each model I prepared 2-3 LoRA adapters and ran into some problems.
XuanYuan runs completely normally.
All of the following questions concern Qwen and Yi.

  1. Without a LoRA adapter, generation runs to the maximum new-token length unless stop words are added, but the output is not nonsense; with stop words it behaves normally. After adding a LoRA adapter, it generates up to max_new_tokens and produces nonsense. This may be a prompt-template configuration issue from fine-tuning.

  2. When num_shard is set to 2 (two GPUs), a prefill size of 4096 causes insufficient memory, and even 1024 reports the error below (Qwen). num_shard=1 runs normally.

RuntimeError: Not enough memory to handle 1028 prefill tokens. You need to decrease --max-batch-prefill-tokens
2023-12-08T10:14:01.820643Z ERROR warmup{max_input_length=1024 max_prefill_tokens=1028}:warmup: lorax_client: router/client/src/lib.rs:33: Server error: Not enough memory to handle 1028 prefill tokens. You need to decrease --max-batch-prefill-tokens
Error: Warmup(Generation("Not enough memory to handle 1028 prefill tokens. You need to decrease --max-batch-prefill-tokens"))
2023-12-08T10:14:01.894611Z ERROR lorax_launcher: Webserver Crashed
2023-12-08T10:14:01.894628Z INFO lorax_launcher: Shutting down shards
2023-12-08T10:14:02.427130Z INFO shard-manager: lorax_launcher: Shard terminated rank=1
2023-12-08T10:14:03.432023Z INFO shard-manager: lorax_launcher: Shard terminated rank=0
Error: WebserverFailed

  3. As long as max_new_tokens is higher than 200, the connection fails; setting it under 200 runs normally (see the client-timeout sketch after this list).
    Settings:

Args { model_id: "/data/yi-34b-chat", adapter_id: "", source: "hub", adapter_source: "hub", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: Some(BitsandbytesNF4), dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, max_active_adapters: 128, adapter_cycle_time_s: 2, hostname: "c92d36636b23", port: 80, shard_uds_path: "/tmp/lorax-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false, download_only: false }

Post:

prompt = "<|im_start|>user\nTell me a story<|im_end|>\n<|im_start|>assistant\n"
adapter_id = "/data/chat-int8-3-epoch-1024-manual_2360-self-5000_1207-1"
print(client.generate(prompt, max_new_tokens=300,temperature=0.8, do_sample=True, stop_sequences=["<|im_end|>"], adapter_id=adapter_id).generated_text)

Error :

Traceback (most recent call last):
  File "/home/shaohongen/Temp/WZ_test/lorax/test_lorax_yi.py", line 9, in <module>
    print(client.generate(prompt, max_new_tokens=300, temperature=0.8, do_sample=True, stop_sequences=["<|im_end|>"], adapter_id=adapter_id).generated_text)
  File "/home/shaohongen/miniconda3/envs/slora/lib/python3.9/site-packages/lorax/client.py", line 148, in generate
    resp = requests.post(
  File "/home/shaohongen/miniconda3/envs/slora/lib/python3.9/site-packages/requests/api.py", line 115, in post
    return request("post", url, data=data, json=json, **kwargs)
  File "/home/shaohongen/miniconda3/envs/slora/lib/python3.9/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/shaohongen/miniconda3/envs/slora/lib/python3.9/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/shaohongen/miniconda3/envs/slora/lib/python3.9/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/home/shaohongen/miniconda3/envs/slora/lib/python3.9/site-packages/requests/adapters.py", line 532, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='127.0.0.1', port=8081): Read timed out. (read timeout=10)

  4. Question: three models of different sizes (14B, 34B, 70B) occupy roughly the same GPU memory under int4 quantization, about 44 GB. With num_shard=2, the usage is about 37 GB per GPU in the int4 case, which may be the reason for the max_new_tokens limit?
    Also, why is the memory usage the same across model sizes?
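Regarding item 3 above, the read timeout comes from the Python client rather than the server (note read timeout=10 in the traceback). A hedged workaround sketch, assuming the client constructor exposes a timeout argument as that default suggests:

from lorax import Client

prompt = "<|im_start|>user\nTell me a story<|im_end|>\n<|im_start|>assistant\n"
adapter_id = "/data/chat-int8-3-epoch-1024-manual_2360-self-5000_1207-1"

# Longer generations need a longer client-side read timeout; the kwarg name is
# an assumption based on the default (10s) shown in the ReadTimeout error.
client = Client("http://127.0.0.1:8081", timeout=120)
print(
    client.generate(
        prompt,
        max_new_tokens=300,
        temperature=0.8,
        do_sample=True,
        stop_sequences=["<|im_end|>"],
        adapter_id=adapter_id,
    ).generated_text
)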
