
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs

Home Page: https://loraexchange.ai

License: Apache License 2.0

Dockerfile 0.43% Makefile 0.17% Rust 13.89% Shell 0.37% Python 58.11% JavaScript 0.10% Cuda 17.24% C++ 9.51% C 0.17% Smarty 0.01%
fine-tuning gpt llama llm llm-inference llm-serving llmops lora model-serving pytorch

lorax's Introduction


LoRAX: Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs


LoRAX (LoRA eXchange) is a framework that allows users to serve thousands of fine-tuned models on a single GPU, dramatically reducing the cost of serving without compromising on throughput or latency.


🌳 Features

  • 🚅 Dynamic Adapter Loading: include any fine-tuned LoRA adapter from HuggingFace, Predibase, or any filesystem in your request; it will be loaded just-in-time without blocking concurrent requests. Merge adapters per request to instantly create powerful ensembles.
  • 🏋️‍♀️ Heterogeneous Continuous Batching: packs requests for different adapters together into the same batch, keeping latency and throughput nearly constant with the number of concurrent adapters.
  • 🧁 Adapter Exchange Scheduling: asynchronously prefetches and offloads adapters between GPU and CPU memory, and schedules request batching to optimize the aggregate throughput of the system.
  • 👬 Optimized Inference: high-throughput and low-latency optimizations including tensor parallelism, pre-compiled CUDA kernels (flash-attention, paged attention, SGMV), quantization, and token streaming.
  • 🚢 Ready for Production: prebuilt Docker images, Helm charts for Kubernetes, Prometheus metrics, and distributed tracing with Open Telemetry. OpenAI compatible API supporting multi-turn chat conversations. Private adapters through per-request tenant isolation. Structured Output (JSON mode).
  • 🤯 Free for Commercial Use: Apache 2.0 License. Enough said 😎.

🏠 Models

Serving a fine-tuned model with LoRAX consists of two components:

  • Base Model: pretrained large model shared across all adapters.
  • Adapter: task-specific adapter weights dynamically loaded per request.

LoRAX supports a number of Large Language Models as the base model including Llama (including CodeLlama), Mistral (including Zephyr), and Qwen. See Supported Architectures for a complete list of supported base models.

Base models can be loaded in fp16 or quantized with bitsandbytes, GPT-Q, or AWQ.

Supported adapters include LoRA adapters trained using the PEFT and Ludwig libraries. Any of the linear layers in the model can be adapted via LoRA and loaded in LoRAX.
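As an illustration, here is a minimal sketch of training such an adapter with PEFT; the base model, hyperparameters, target modules, and output path are placeholder choices for the example, not LoRAX requirements.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder base model; any LoRAX-supported architecture works the same way.
base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

# Adapt a subset of the model's linear layers (module names follow the Llama/Mistral convention).
config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)

# ... train with your preferred trainer, then save the adapter:
model.save_pretrained("./my-adapter")  # writes adapter_config.json and the adapter weights

The resulting directory, or a HuggingFace repo containing it, can then be referenced as the adapter_id in a request.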

πŸƒβ€β™‚οΈ Getting Started

We recommend starting with our pre-built Docker image to avoid compiling custom CUDA kernels and other dependencies.

Requirements

The minimum system requirements needed to run LoRAX include:

  • Nvidia GPU (Ampere generation or above)
  • CUDA 11.8 compatible device drivers and above
  • Linux OS
  • Docker (for this guide)

Launch LoRAX Server

model=mistralai/Mistral-7B-Instruct-v0.1
volume=$PWD/data

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/predibase/lorax:latest --model-id $model

For a full tutorial including token streaming and the Python client, see Getting Started - Docker.

Prompt via REST API

Prompt base LLM:

curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{
        "inputs": "[INST] Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? [/INST]",
        "parameters": {
            "max_new_tokens": 64
        }
    }' \
    -H 'Content-Type: application/json'

Prompt a LoRA adapter:

curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{
        "inputs": "[INST] Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? [/INST]",
        "parameters": {
            "max_new_tokens": 64,
            "adapter_id": "vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k"
        }
    }' \
    -H 'Content-Type: application/json'

See Reference - REST API for full details.
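The same endpoint can also be called from plain Python without the LoRAX client; here is a minimal sketch using the requests library with the same payload as the curl examples above.

import requests

resp = requests.post(
    "http://127.0.0.1:8080/generate",
    headers={"Content-Type": "application/json"},
    json={
        "inputs": "[INST] Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? [/INST]",
        "parameters": {
            "max_new_tokens": 64,
            "adapter_id": "vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k",
        },
    },
)
print(resp.json()["generated_text"])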

Prompt via Python Client

Install:

pip install lorax-client

Run:

from lorax import Client

client = Client("http://127.0.0.1:8080")

# Prompt the base LLM
prompt = "[INST] Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? [/INST]"
print(client.generate(prompt, max_new_tokens=64).generated_text)

# Prompt a LoRA adapter
adapter_id = "vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k"
print(client.generate(prompt, max_new_tokens=64, adapter_id=adapter_id).generated_text)

See Reference - Python Client for full details.
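Token streaming works from the same client; a minimal sketch, assuming the client exposes a generate_stream method analogous to the text-generation-inference client it is forked from:

from lorax import Client

client = Client("http://127.0.0.1:8080")
prompt = "[INST] What is deep learning? [/INST]"

# Stream tokens as they are generated (generate_stream is assumed from the TGI-style client API).
text = ""
for response in client.generate_stream(prompt, max_new_tokens=64):
    if not response.token.special:
        text += response.token.text
print(text)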

For other ways to run LoRAX, see Getting Started - Kubernetes, Getting Started - SkyPilot, and Getting Started - Local.

Chat via OpenAI API

LoRAX supports multi-turn chat conversations combined with dynamic adapter loading through an OpenAI compatible API. Just specify any adapter as the model parameter.

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://127.0.0.1:8080/v1",
)

resp = client.chat.completions.create(
    model="alignment-handbook/zephyr-7b-dpo-lora",
    messages=[
        {
            "role": "system",
            "content": "You are a friendly chatbot who always responds in the style of a pirate",
        },
        {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
    ],
    max_tokens=100,
)
print("Response:", resp.choices[0].message.content)

See OpenAI Compatible API for details.

Next steps

There are many other interesting Mistral-7B fine-tuned models to try out. You can find more LoRA adapters on the HuggingFace Hub, or try fine-tuning your own with PEFT or Ludwig.

🙇 Acknowledgements

LoRAX is built on top of HuggingFace's text-generation-inference, forked from v0.9.4 (Apache 2.0).

We'd also like to acknowledge Punica for their work on the SGMV kernel, which is used to speed up multi-adapter inference under heavy load.

πŸ—ΊοΈ Roadmap

Our roadmap is tracked here.

lorax's People

Contributors

abidwael, arnavgarg1, atry, claudiomontanari, flozi00, gary149, geoffreyangus, girinman, gsaivinay, huytuong010101, infernaught, jeffreyftang, jts22, lewtun, llama-shepard, magdyksaleh, michaelfeil, narsil, njhill, noyoshi, olivierdehaene, regisss, rkimball, ssmi153, tgaddair, thincal, thomasw21, xyang16, yard1, yk


lorax's Issues

Surface more informative error when adapter has NaN weights

Feature request

When querying a base model with an adapter that has NaN or Inf weight tensors, LoRAX returns the following error:

The output tensors do not match for key base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight

It would be more helpful if the error message indicated that the reason the tensors don't match during the merge is that LoRAX detected NaN/Inf tensors in the adapter weights.

Motivation

This would give users who fine-tuned models and are testing them out a rectifiable, actionable path: it makes clear that this isn't an issue with LoRAX, but rather an issue with their trained adapter weights.

Your contribution

Happy to help surface a better error message! It seems like the issue is raised from this line in particular:

if not torch.equal(pt_tensor, sf_tensor):
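For example, a minimal sketch (not the actual LoRAX code; the helper name and messages are illustrative) of how that check could surface NaN/Inf weights explicitly:

import torch

def check_tensors_match(key: str, pt_tensor: torch.Tensor, sf_tensor: torch.Tensor) -> None:
    # Raise a targeted error when the mismatch is caused by bad adapter weights.
    if not torch.equal(pt_tensor, sf_tensor):
        if torch.isnan(pt_tensor).any() or torch.isinf(pt_tensor).any():
            raise ValueError(
                f"Adapter weight '{key}' contains NaN/Inf values; "
                "this points to a problem with the trained adapter, not LoRAX."
            )
        raise ValueError(f"The output tensors do not match for key {key}")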

Extend SGMV kernel to support ranks < 8

Currently, the SGMV kernel will fail if the rank is < 8; this is also an issue with tensor parallelism for ranks > 8. We should extend the kernel to support these cases:

  • rank 2
  • rank 4

Sliding block window error when running Mixtral 8x7B

System Info

Lorax version: 0.4.1
Lorax_launcher: 0.1.0
Model: mistralai/Mixtral-8x7B-Instruct-v0.1
GPUS: 3090 (24 gb) P100 (16 gb)

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

model=mistralai/Mixtral-8x7B-Instruct-v0.1
volume=$PWD/data

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/predibase/lorax:latest --model-id $model --quantize bitsandbytes-nf4 --trust-remote-code

Upon executing this code I receive the following traceback:

Traceback (most recent call last):
File "/opt/conda/bin/lorax-server", line 8, in
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in call
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in call
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 84, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 271, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
handle._run()
File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)

File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 223, in serve_inner
model = get_model(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/init.py", line 305, in get_model
return FlashMixtral(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_mixtral.py", line 346, in init
SLIDING_WINDOW_BLOCKS = math.ceil(config.sliding_window / BLOCK_SIZE)
TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'

2023-12-19T02:06:51.578108Z ERROR shard-manager: lorax_launcher: Shard complete standard error output:

Traceback (most recent call last):

File "/opt/conda/bin/lorax-server", line 8, in
sys.exit(app())

File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 84, in serve
server.serve(

File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 271, in serve
asyncio.run(

File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)

File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()

File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 223, in serve_inner
model = get_model(

File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/init.py", line 305, in get_model
return FlashMixtral(

File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_mixtral.py", line 346, in init
SLIDING_WINDOW_BLOCKS = math.ceil(config.sliding_window / BLOCK_SIZE)

TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'

It appears that for some reason config.sliding_window is set to None, which doesn't make sense because there is an if statement that forces it to equal config.max_position_embeddings a few lines above.

I will be out of town, so I do not have a chance to build the server locally, but I can look at it when I get back.

Expected behavior

The mixtral model runs without issue.

Add nf4 support for model quantization

Feature request

There is existing code in LoRAX towards 8bit bitsandbytes quantization:

class Linear8bitLt(nn.Module):
    def __init__(
        self,
        weight,
        bias,
        has_fp16_weights=True,
        memory_efficient_backward=False,
        threshold=0.0,
        index=None,
    ):

Supporting 4bit bitsandbytes quantization would enable us to serve models trained in 4bit.

The change should involve following the patterns implemented for 8bit quantization.
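A minimal sketch of the 4-bit counterpart, assuming bitsandbytes exposes Linear4bit/Params4bit with an "nf4" quant type; this mirrors the 8-bit wrapper pattern and is not the actual LoRAX implementation.

import torch
import bitsandbytes as bnb

def to_nf4(linear: torch.nn.Linear) -> bnb.nn.Linear4bit:
    # Replace an fp16 linear layer with a 4-bit NF4 quantized equivalent.
    qlinear = bnb.nn.Linear4bit(
        linear.in_features,
        linear.out_features,
        bias=linear.bias is not None,
        compute_dtype=torch.float16,
        quant_type="nf4",
    )
    qlinear.weight = bnb.nn.Params4bit(
        linear.weight.data, requires_grad=False, quant_type="nf4"
    )
    if linear.bias is not None:
        qlinear.bias = torch.nn.Parameter(linear.bias.data)
    return qlinear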

Motivation

No response

Your contribution

No response

Second GPU is not found when running --sharded true

System Info

Lorax version: 0.4.1
Lorax_launcher: 0.1.0
Model: mistralai/Mixtral-8x7B-Instruct-v0.1
GPUS: 3090 (24 gb) 3060 (12 gb)

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

model=mistralai/Mixtral-8x7B-Instruct-v0.1
volume=$PWD/data

sudo docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/predibase/lorax:latest --model-id $model --trust-remote-code --quantize bitsandbytes-nf4 --max-batch-prefill-tokens 2048 --sharded true

Error Message:
2023-12-24T07:02:10.759386Z INFO lorax_launcher: Parsing num_shard from CUDA_VISIBLE_DEVICES/NVIDIA_VISIBLE_DEVICES
Error: NotEnoughCUDADevices("sharded is true but only found 1 CUDA devices")

Expected behavior

The expected behavior is for LoRAX to find both GPUs. For reference, here is the output of nvidia-smi:

'''
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3060 On | 00000000:01:00.0 Off | N/A |
| 0% 49C P8 15W / 170W | 9MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce RTX 3090 On | 00000000:06:00.0 Off | N/A |
| 0% 51C P8 18W / 350W | 12MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 2249 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 2249 G /usr/lib/xorg/Xorg 4MiB |
+---------------------------------------------------------------------------------------+
'''

I checked the documentation and it said that --sharded true is the default setting of the server; however, when I do not pass --sharded true, I get an out-of-memory error and need to use a much smaller --max-batch-prefill-tokens (1024 to be exact). When I print nvidia-smi I get the following output:

'''
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3060 On | 00000000:01:00.0 Off | N/A |
| 0% 44C P8 15W / 170W | 12MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce RTX 3090 On | 00000000:06:00.0 Off | N/A |
| 81% 57C P2 114W / 350W | 23873MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 2249 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 2249 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 7439 C /opt/conda/bin/python3.10 23856MiB |
+---------------------------------------------------------------------------------------+
'''

It appears as if the server cannot find the 3060. I swapped the 3060 for one of my other GPUs (a Tesla P100 16GB), yet I still received the same error.

How to use --master-addr <MASTER_ADDR>|--master-port <MASTER_PORT>?

System Info

Hi there, thank you for your excellent project.
I see that you have --master-addr <MASTER_ADDR> | --master-port <MASTER_PORT> parameters when running the server.
Do you have any guide on using torch distributed in your project? It would be really helpful if I need to run on multiple machines.
Thank you,

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

model=mistralai/Mistral-7B-Instruct-v0.1
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/predibase/lorax:latest --model-id $model

Expected behavior

More detail about torch distributed

Add Helm charts

Feature request

It should be possible to easily deploy LoRAX on Kubernetes via Helm. We only really need a Deployment and Service resource for now.

Motivation

No response

Your contribution

No response

Does lorax currently support GPT2 finetuned adapters?

System Info

lorax:latest

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

@tgaddair I have a few adapters fine-tuned using GPT2 as the base model.

Architecture of GPT2:

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

The adapters are fine-tuned on the "c_attn" and "c_proj" layers; does LoRAX currently support this?

Expected behavior

Question about compatibility.

Lorax Hanging in production

System Info

ghcr.io/predibase/lorax:latest
Running within Kubernetes on H100

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

When putting the instance in production, while it receives simultaneous requests for different adapters, it will just hang there.
/generate and /health will stop answering
but /info and /docs will continue to be available.

There's no error getting displayed in the logs

Not sure what's the best way to diagnose the issue, but it looks to me like it's having some issues fetching multiple adapters in parallel and processing requests queued at the same time?

Expected behavior

Should handle live requests for multiple adapters

[Feature Request]: Add support for bloom model adapters

Model description

I am trying to run the bloom-7b1 model using Docker locally [both my model and adapters]. Here is the script for running bloom on LoRAX:

#!/bin/bash
# PATH to model 
model="bloom-7b1"
# VOLUME to share: pwd/../models -> /data
volume=$PWD/../models:/data # share a volume with the Docker container to avoid downloading weights every run
echo $volume

docker run --gpus 0 --shm-size 1g -p 7070:80 \
    --volume $volume \
    ghcr.io/predibase/lorax:latest \
    --model-id /data/$model \
    --num-shard 1 \
    --quantize bitsandbytes-nf4 \
    --max-concurrent-requests 256 \
    --cuda-memory-fraction 0.5 \

When I pass the adapter located at peft-models/bloom-alpaca-ne, the request fails (screenshots not reproduced here).

Open source status

  • The model implementation is available
  • The model weights are available

Provide useful links for the implementation

No response

Dockerfile build failed on 5e6215b4cdfbdce345806e7b504f36948abee126 (main today)

System Info

Hello, this is the end of the Dockerfile build log (I am in EC2 in the instance I was using for inference)

[19/49] /opt/conda/bin/nvcc -I/usr/src/flash-attention-v2/csrc/flash_attn -I/usr/src/flash-attention-v2/csrc/flash_attn/src -I/usr/src/flash-attention-v2/csrc/cutlass/include -I/opt/conda/lib/python3.9/site-packages/torch/include -I/opt/conda/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.9/site-packages/torch/include/TH -I/opt/conda/lib/python3.9/site-packages/torch/include/THC -I/opt/conda/include -I/opt/conda/include/python3.9 -c -c /usr/src/flash-attention-v2/csrc/flash_attn/src/flash_bwd_hdim64_bf16_sm80.cu -o /usr/src/flash-attention-v2/build/temp.linux-x86_64-cpython-39/csrc/flash_attn/src/flash_bwd_hdim64_bf16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 4 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
/opt/conda/lib/python3.9/site-packages/torch/include/c10/util/irange.h(54): warning #186-D: pointless comparison of unsigned integer with zero
detected during:
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator==(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=size_t, one_sided=false, =0]"
(61): here
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator!=(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=size_t, one_sided=false, =0]"
/opt/conda/lib/python3.9/site-packages/torch/include/c10/core/TensorImpl.h(77): here

/opt/conda/lib/python3.9/site-packages/torch/include/c10/util/irange.h(54): warning #186-D: pointless comparison of unsigned integer with zero
detected during:
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator==(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=size_t, one_sided=true, =0]"
(61): here
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator!=(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=size_t, one_sided=true, =0]"
/opt/conda/lib/python3.9/site-packages/torch/include/c10/core/TensorImpl.h(2327): here
instantiation of "__nv_bool c10::TensorImpl::SetDimsTemplate(c10::ArrayRef) [with T=int64_t, =void]"
/opt/conda/lib/python3.9/site-packages/torch/include/c10/core/TensorImpl.h(2337): here

/opt/conda/lib/python3.9/site-packages/torch/include/c10/util/irange.h(54): warning #186-D: pointless comparison of unsigned integer with zero
detected during:
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator==(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=size_t, one_sided=false, =0]"
(61): here
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator!=(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=size_t, one_sided=false, =0]"
/opt/conda/lib/python3.9/site-packages/torch/include/c10/core/TensorImpl.h(77): here

/opt/conda/lib/python3.9/site-packages/torch/include/c10/util/irange.h(54): warning #186-D: pointless comparison of unsigned integer with zero
detected during:
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator==(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=size_t, one_sided=true, =0]"
(61): here
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator!=(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=size_t, one_sided=true, =0]"
/opt/conda/lib/python3.9/site-packages/torch/include/c10/core/TensorImpl.h(2327): here
instantiation of "__nv_bool c10::TensorImpl::SetDimsTemplate(c10::ArrayRef) [with T=int64_t, =void]"
/opt/conda/lib/python3.9/site-packages/torch/include/c10/core/TensorImpl.h(2337): here

ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/opt/conda/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
subprocess.run(
File "/opt/conda/lib/python3.9/subprocess.py", line 528, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/src/flash-attention-v2/setup.py", line 288, in
setup(
File "/opt/conda/lib/python3.9/site-packages/setuptools/init.py", line 87, in setup
return distutils.core.setup(**attrs)
File "/opt/conda/lib/python3.9/site-packages/setuptools/_distutils/core.py", line 185, in setup
return run_commands(dist)
File "/opt/conda/lib/python3.9/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
dist.run_commands()
File "/opt/conda/lib/python3.9/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
self.run_command(cmd)
File "/opt/conda/lib/python3.9/site-packages/setuptools/dist.py", line 1208, in run_command
super().run_command(command)
File "/opt/conda/lib/python3.9/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
cmd_obj.run()
File "/opt/conda/lib/python3.9/site-packages/setuptools/_distutils/command/build.py", line 132, in run
self.run_command(cmd_name)
File "/opt/conda/lib/python3.9/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
self.distribution.run_command(command)
File "/opt/conda/lib/python3.9/site-packages/setuptools/dist.py", line 1208, in run_command
super().run_command(command)
File "/opt/conda/lib/python3.9/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
cmd_obj.run()
File "/opt/conda/lib/python3.9/site-packages/setuptools/command/build_ext.py", line 84, in run
_build_ext.run(self)
File "/opt/conda/lib/python3.9/site-packages/setuptools/_distutils/command/build_ext.py", line 346, in run
self.build_extensions()
File "/opt/conda/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 843, in build_extensions
build_ext.build_extensions(self)
File "/opt/conda/lib/python3.9/site-packages/setuptools/_distutils/command/build_ext.py", line 468, in build_extensions
self._build_extensions_serial()
File "/opt/conda/lib/python3.9/site-packages/setuptools/_distutils/command/build_ext.py", line 494, in _build_extensions_serial
self.build_extension(ext)
File "/opt/conda/lib/python3.9/site-packages/setuptools/command/build_ext.py", line 246, in build_extension
_build_ext.build_extension(self, ext)
File "/opt/conda/lib/python3.9/site-packages/setuptools/_distutils/command/build_ext.py", line 549, in build_extension
objects = self.compiler.compile(
File "/opt/conda/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 658, in unix_wrap_ninja_compile
_write_ninja_file_and_compile_objects(
File "/opt/conda/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1574, in _write_ninja_file_and_compile_objects
_run_ninja_build(
File "/opt/conda/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1909, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error compiling objects for extension
make: *** [Makefile:10: build-flash-attention-v2] Error 1
The command '/bin/sh -c make build-flash-attention-v2' returned a non-zero code: 2

I appreciate any time you have for hints.

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Clone and ./build.sh

Expected behavior

Build completes.

Merging non gptq adapter to gptq model

System Info

master branch

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Start the launcher with a GPTQ model, then try to load a non-GPTQ adapter.

2023-11-22T19:34:06.185127Z ERROR lorax_client: router/client/src/lib.rs:33: Server error: 'QuantLinear' object has no attribute 'weight'

For testing, I commented out the ID check:


# if adapter_config.base_model_name_or_path != model_id:
#     raise ValueError(f"Adapter '{adapter_id}' is not compatible with model '{model_id}'. "
#                      f"Use --model-id '{adapter_config.base_model_name_or_path}' instead.")

I am already thinking about a better check, because compatibility depends on the model architecture and parameter count instead of the ID directly, so a Zephyr LoRA could be merged successfully into Mistral Instruct too.

Expected behavior

Detect GPTQ models and use qweight instead of weight when merging.
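As a minimal sketch of that expected behavior (a hypothetical helper, not actual LoRAX code; full support would also require dequantizing qweight before merging):

def merge_target_weight(layer):
    # GPTQ QuantLinear layers pack their weights into `qweight`; fp16/bnb layers expose `weight`.
    if hasattr(layer, "qweight"):
        return layer.qweight
    return layer.weight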

Quantized models fail to generate expected output

Example:

mistralai/Mistral-7B-v0.1 --quantize bitsandbytes-nf4

Request:

curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs": "<|system|> You are a helpful assistant <|user|> What is deep learning? </s> <|assistant|>", "parameters": {"max_new_tokens": 64, "adapter_id": "qblocks/mistral_7b_norobots"}}' \
    -H 'Content-Type: application/json'

Response:

{"generated_text":""}

Expected:

{"generated_text":"Deep learning is a subset of machine learning that uses artificial neural networks to learn from data. It is a powerful tool for solving complex problems in fields such as natural language processing, computer vision, and speech recognition. Deep learning algorithms can learn from large amounts of data and make predictions or decisions based on that data. They can"}

Support custom tokenizer when loading a local model

Feature request

I have downloaded the model, so I want to run it as a local model. The sample is:

docker run --gpus all --shm-size 1g -p 8080:80 -v /data/model/:/data/ \
    ghcr.io/predibase/lorax:latest --model-id /data/model/Qwen-14B-Chat

Motivation

I want to use the local model. Our machines are not allowed to access huggingface.co.

Your contribution

No.

Not able to load adapter from local dir

System Info

base model: meta-llama/Llama-2-13b-chat-hf

docker cmd:
docker run --gpus '"device=3"' --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=XXXXX -p 8082:82 -v $volume:/data ghcr.io/predibase/lorax:latest --model-id $model --quantize bitsandbytes

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

from lorax import Client
URL="http://127.0.0.0:8082"
client = Client(URL,timeout=20)

adapter_id="./data/lora_models/unsighing/" #this dir contain adapter_config.json or adapter_model.bin
adapter_source="local"
client.generate(prompt,max_new_tokens=128,temperature=0.001,adapter_id=adapter_id,adapter_source=adapter_source).generated_text

getting this error while running the above code :
GenerationError: Request failed during generation: Server error: No local weights found in ./data/lora_models/unsighing/ with extension .safetensors

Also getting a different error while putting the complete path for adapter_id:

adapter_id="home/code/data/lora_models/unsighing/"

GenerationError: Request failed during generation: Server error: Can't find 'adapter_config.json' at 'home/code/data/lora_models/unsighing/'

More info:
base model: meta-llama/Llama-2-13b-chat-hf
docker cmd:
docker run --gpus '"device=3"' --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=XXXXX -p 8082:82 -v $volume:/data ghcr.io/predibase/lorax:latest --model-id $model --quantize bitsandbytes

Note: I am able to make successful predictions with the base model if I don't provide adapter_id in the same setup above:

client.generate(prompt,max_new_tokens=128,temperature=0.001).generated_text

This works fine for me.

Any help will be appreciated. Thanks!

Expected behavior

Should load the adapter from local dir.

getting error during inference "Unsupported head size: 32"

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

I tried LoRA fine-tuning a smaller variant of the Mistral architecture, but I am getting the error below:

GenerationError: Request failed during generation: Server error: Unsupported head size: 32

I used rank: 16, alpha: 32

https://huggingface.co/Locutusque/TinyMistral-248M-Instruct

Expected behavior

It should have worked since it's following the mistral architecture. (TinyLlama was working fine)

Extend testing

Feature request

Extend the testing with tiny dummy models

Motivation

Some CPU-based tests, for example for quantization, decoding, and model loading across architectures.

Your contribution

Can open a PR.
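For example, a minimal sketch of such a CPU-only smoke test (the tiny checkpoint and test name are illustrative):

from transformers import AutoModelForCausalLM, AutoTokenizer

def test_tiny_model_generates_on_cpu():
    # A tiny random checkpoint keeps the test fast and GPU-free.
    name = "hf-internal-testing/tiny-random-gpt2"
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)
    ids = tok("hello world", return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=4, do_sample=False)
    assert out.shape[-1] > ids.shape[-1]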

adapters produce unk tokens only

System Info

docker from latest

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

example prompt:


User: Who are you ? 

Assistant: You mean me ?

User: Who are you ? 

Assistant: You mean me ?

User: Who are you ? 

Assistant: You mean me ?

User: Who are you ? 

Assistant: You mean me ?

User: Who are you ? 

Assistant: You mean me ?

This is enough to make the model produce only unk tokens for any prompt.
When cutting it down to only 2 lines, it works as expected.

base model is:
mistralai/Mistral-7B-v0.1

adapters tested:
https://huggingface.co/qblocks/mistral_7b_norobots
https://huggingface.co/flozi00/mistral-zephyr-lora
https://huggingface.co/flozi00/mistral-germanassistantv4

start command

docker run --pull always --gpus all -d --shm-size 1g -p 8080:80 ghcr.io/predibase/lorax:latest --model-id mistralai/Mistral-7B-Instruct-v0.1 --cuda-memory-fraction 0.5 --max-total-tokens 8192 --max-batch-prefill-tokens 7000 --max-input-length 7000

Expected behavior

Working with and without adapters loaded

HQQ just in time quantization

Feature request

https://github.com/mobiusml/hqq

Adding HQQ as a quantization function, similar to bitsandbytes, so that it works just-in-time, taking only about 5 minutes for a 70B model.

Motivation

2 bit quantization

Your contribution

Can take this to a PR, adding it as a quantization runtime as a first step.

Lorax Launcher Fails with Unsupported Models Due to Adapter Loading Issue

System Info

Target: x86_64-unknown-linux-gnu
Cargo version: 1.70.0
Commit sha: N/A
Docker label: N/A
NVIDIA-SMI 545.23.06 Driver Version: 545.23.06 CUDA Version: 12.3
model_id = "bigscience/bloom-560m"

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

run "lorax-launcher" without specifying model-id (it defaults to the "bigscience/bloom-560m") model or run any model not yet supported by lorax.

lorax-launcher

error message (client side):
{"error":"Request failed during generation: Server error: 'BLOOMSharded' object has no attribute 'load_adapter'","error_type":"generation"}

error message (server side):
2023-11-27T08:25:59.281666Z INFO lorax_router::loader: router/src/loader.rs:146: adapter vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k downloaded
2023-11-27T08:25:59.281719Z INFO lorax_router::queue: router/src/queue.rs:135: set adapter vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k status to Downloaded
2023-11-27T08:25:59.318818Z ERROR lorax_client: router/client/src/lib.rs:33: Server error: 'BLOOMSharded' object has no attribute 'load_adapter'
2023-11-27T08:25:59.318826Z INFO lorax_router::loader: router/src/loader.rs:201: FAILED loading adapter vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k
2023-11-27T08:25:59.318833Z INFO lorax_router::queue: router/src/queue.rs:135: set adapter vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k status to Errored
2023-11-27T08:25:59.318862Z INFO lorax_router::loader: router/src/loader.rs:271: terminating adapter vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k loader

Expected behavior

When an unsupported model is run, Lorax should revert to its original implementation, which does not attempt to load an adapter for models that are not supported. In cases where an unsupported model is run without specifying an adapter_id, the system should still attempt to generate a response. If an adapter_id is specified, Lorax should notify the user that the model is not supported, rather than crashing.

Punica kernel build fails

I am trying to rebuild the LoRAX Docker image, which is failing in the punica-builder stage. Error logs are attached; could you advise?

My final goal is to make Lorax deployable on Sagemaker by adding back the entrypoint for a sagemaker stage which was originally in the Dockerfile.

Thanks.

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

docker build --target base -t lorax:base
build.log

Expected behavior

Successful Docker build.

Use special tokens specific to the fine-tuned adapter during decoding

During fine-tuning, it's possible that special tokens are added that are specific to the adapter. During decoding, we should use those special tokens and ensure the correct stop tokens, padding, etc. are properly honored.

Repro from @runvnc, related: #68

Model ID: https://huggingface.co/qblocks/mistral_7b_norobots/tree/main

QLoRA repo example uses this AutoTokenizer with special tokens:

https://github.com/artidoro/qlora/blob/7f4e95a68dc076bea9b3a413d2b512eca6d004e5/qlora.py#L347
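A minimal sketch of inspecting the adapter-specific special tokens, assuming the adapter repo ships its own tokenizer files:

from transformers import AutoTokenizer

base_tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
adapter_tok = AutoTokenizer.from_pretrained("qblocks/mistral_7b_norobots")

# Tokens the adapter added during fine-tuning that the base tokenizer doesn't know about,
# plus the stop/pad tokens that decoding should honor.
extra = sorted(set(adapter_tok.get_vocab()) - set(base_tok.get_vocab()))
print(extra, adapter_tok.eos_token, adapter_tok.pad_token)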

Unexpected CUDA out of memory errors

System Info

ghcr.io/predibase/lorax:latest

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

First, run

docker run --gpus all --shm-size 1g -p 8000:80 -e "HUGGING_FACE_HUB_TOKEN=<token>" -v $volume:/data ghcr.io/predibase/lorax:latest --trust-remote-code --model-id meta-llama/Llama-2-7b-chat-hf

Then, send the following requests:

from concurrent.futures import ThreadPoolExecutor
import time
import requests

url = "http://127.0.0.1:8000/generate"
headers = {"Content-Type": "application/json"}

# Function to send a request
def send_request(payload):
    start_time = time.time()
    response = requests.post(url, headers=headers, json=payload)
    elapsed_time = time.time() - start_time
    return response.json(), elapsed_time

# Number of concurrent requests

adapter_list = [<list of 15 adapters>]
num_requests = len(adapter_list)

# Using ThreadPoolExecutor to send requests concurrently
with ThreadPoolExecutor(max_workers=num_requests) as executor:
    # Submit the requests
    futures = []
    for i in range(num_requests):
        payload = {
            "inputs": "Hello, my name is",
            "parameters": {"max_new_tokens": 100, "adapter_id": adapter_list[i]},
        }
        futures.append(executor.submit(send_request, payload))

    # Wait for all requests to complete
    results = [future.result() for future in futures]

print(results)

Expected behavior

First of all, awesome work with lorax! When sending requests at the same time like this, I'm receiving Cuda Out Of Memory errors on the server. I thought that actually, the server would check beforehand how many tokens can be handled and consequently enqueue requests that cannot be served. Have you encountered this before?

Add RoPE scaling CLI args

Currently the user can configure dynamic RoPE scaling by setting the environment variables ROPE_SCALING and ROPE_FACTOR like so:

export ROPE_SCALING=dynamic 
export ROPE_FACTOR=2

But this is very clunky and not documented. We should add these as CLI args to the lorax-launcher so they can be better documented and less error prone.

Training example?

Feature request

I'm not sure how I managed it, but it seems like I have got a training script that creates a LoRA that loads but has little to no effect. I have been modifying the qLoRA script to try to work with lorax.

By any chance can you point me to a training script that is known to work effectively with this system? Apologies if this is too obvious. Most of what I see when I search for fine-tuning examples right now are for qLoRA or 8 bit.

Motivation

Just trying to verify that I didn't do something wrong.

Your contribution

Happy to test whatever you suggest with our dataset. To be honest, our dataset might be part of the problem, but I'm not sure.

Project Roadmap

WIP project roadmap for LoRAX. We'll continue to update this over time.

v0.10

  • Speculative decoding adapters
  • AQLM

v0.11

  • Prefix caching
  • Embedding endpoint
  • BERT support
  • Embedding adapters
  • Classification adapters

Previous Releases

v0.9

  • Adapter memory pool

Backlog

Models

  • Llama
  • Mistral
  • GPT2
  • Qwen
  • Mixtral
  • Phi
  • Bloom
  • BERT
  • Stable-Diffusion

Adapters

Throughput / Latency

  • Paged Attention v2
  • Lookahead Decoding
  • SGMV with variable ranks
  • SGMV with tensor parallelism

Quantization

  • bitsandbytes
  • GPT-Q
  • AWQ

Usability

  • Prebuilt server wheels
  • SkyPilot usage guide
  • Example notebooks

Reduce docker image size

Feature request

Not sure if this should be a feature request.
The current Docker image contains two copies of PyTorch, which results in an extra ~6 GB of image size.

Motivation

Faster startup time in a Function-as-a-Service environment, and less docker pull wait time.

Your contribution

diff --git a/Dockerfile b/Dockerfile
index 19b2e06..273278a 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -199,7 +199,8 @@ RUN pip install einops --no-cache-dir

 # Install the pip requirements
 COPY server/requirements.txt .
-RUN pip install -r requirements.txt
+# HACK: make torch version same as the one installed by conda
+RUN sed -i 's/+cu118//g' requirements.txt; pip install -r requirements.txt --no-cache-dir

 # Install server
 COPY proto proto
@@ -234,7 +235,8 @@ RUN chmod +x sync.sh

 RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" && \
     unzip awscliv2.zip && \
-    sudo ./aws/install
+    sudo ./aws/install && \
+    rm -rf aws awscliv2.zip

 # ENTRYPOINT ["./entrypoint.sh"]
 ENTRYPOINT ["lorax-launcher"]
diff --git a/server/poetry.lock b/server/poetry.lock
index 8195489..7443101 100644
--- a/server/poetry.lock
+++ b/server/poetry.lock
@@ -2787,4 +2787,4 @@ quantize = ["accelerate", "datasets", "texttable"]
 [metadata]
 lock-version = "2.0"
 python-versions = "^3.9"
-content-hash = "151ae83f306aafec7e9fe044359d9eaada48c55910ad7f25de7461507f6adfe6"
+content-hash = "c9f828f35184814a2017369a1cbe783f42931a9c034d1ac5f5de377cbb69ffdc"
diff --git a/server/pyproject.toml b/server/pyproject.toml
index 206a3f8..09fa4b4 100644
--- a/server/pyproject.toml
+++ b/server/pyproject.toml
@@ -32,7 +32,7 @@ einops = "^0.6.1"
 tiktoken = "^0.5.2"
 texttable = { version = "^1.6.7", optional = true }
 datasets = { version = "^2.14.0", optional = true }
-torch = {version = "2.1.1+cu118", source = "torch"}
+torch = {version = "2.1.1", source = "torch"}
 peft = "0.4.0"
 boto3 = "^1.28.34"
 urllib3 = "<=1.26.18"

I was able to build a smaller Docker image with the above patch, but it's quite hacky.

REPOSITORY                TAG       IMAGE ID       CREATED        SIZE
test                      latest    bedbe3725de5   17 hours ago   10.5GB
ghcr.io/predibase/lorax   0.4.1     36d7669de298   31 hours ago   17.5GB

Latency increase when run on multi-GPU

System Info

I run your docker image in 2 cases:

  • single gpu (--sharded false)
  • multi-gpu (--sharded false --num_shard 4)
    => When I run on a single GPU, the total time is around 1.5 seconds and it takes ~21GB of GPU memory, but when I run on multi-GPU, it takes ~2.4 seconds and 19GB per GPU :( It seems performance is lower when running multi-GPU.
    Have you met this problem?
{
  "model_id": "Open-Orca/Mistral-7B-OpenOrca",
  "adapter_id": "",
  "source": "hub",
  "adapter_source": "hub",
  "revision": null,
  "validation_workers": 2,
  "sharded": true,
  "num_shard": 4,
  "quantize": "BitsandbytesNF4",
  "dtype": null,
  "trust_remote_code": false,
  "max_concurrent_requests": 128,
  "max_best_of": 1,
  "max_stop_sequences": 4,
  "max_input_length": 2048,
  "max_total_tokens": 4096,
  "waiting_served_ratio": 1.2,
  "max_batch_prefill_tokens": 4096,
  "max_batch_total_tokens": 100000,
  "max_waiting_tokens": 20,
  "max_active_adapters": 10,
  "adapter_cycle_time_s": 2,
  "hostname": "0.0.0.0",
  "port": 8000,
  "shard_uds_path": "/tmp/lorax-server",
  "master_addr": "localhost",
  "master_port": 29500,
  "huggingface_hub_cache": "/data",
  "weights_cache_override": null,
  "disable_custom_kernels": false,
  "cuda_memory_fraction": 1,
  "json_output": true,
  "otlp_endpoint": null,
  "cors_allow_origin": [],
  "watermark_gamma": null,
  "watermark_delta": null,
  "ngrok": false,
  "ngrok_authtoken": null,
  "ngrok_edge": null,
  "env": false,
  "download_only": false
}

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Run docker with --sharded true --num_shard 4

Expected behavior

Same or better performance when running multi-GPU.

Return number of input tokens when `details=True`

We return the number of generated tokens when details=True, but systems like the OpenAI API also return the number of input tokens. This is useful, for example, for metering systems that limit users based on the number of input + output tokens.

Current:

'details': {'generated_tokens': 20}

Proposal:

'details': {'prompt_tokens': 120, 'generated_tokens': 20}
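Until then, a minimal client-side sketch for approximating prompt_tokens, assuming the caller tokenizes with the same tokenizer the server uses:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
prompt = "[INST] What is deep learning? [/INST]"
prompt_tokens = len(tok(prompt)["input_ids"])

details = {"prompt_tokens": prompt_tokens, "generated_tokens": 20}
print(details)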

Fail to load gptq base model in 0.4

System Info

ghcr.io/predibase/lorax:0.4 fails to load a GPTQ model.

command: --model-id /mnt/local-model/Qwen-14B-Chat-Int4/ --quantize gptq --trust-remote-code

Using model:

2023-12-17T14:25:01.295949Z ERROR lorax_launcher: interceptor.py:41 Method Warmup encountered an error.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/custom_autotune.py", line 90, in _bench
    return triton.testing.do_bench(
TypeError: do_bench() got an unexpected keyword argument 'percentiles'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/bin/lorax-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 84, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 277, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
    return await self.intercept(
> File "/opt/conda/lib/python3.10/site-packages/lorax_server/interceptor.py", line 38, in intercept
    return await response
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
    raise error
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 74, in Warmup
    max_supported_total_tokens = self.model.warmup(batch)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 864, in warmup
    _, batch = self.generate_token(batch)
  File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 963, in generate_token
    raise e
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 960, in generate_token
    out = self.forward(batch, adapter_data)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 919, in forward
    return self.model.forward(
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_qwen_modeling.py", line 475, in forward
    hidden_states = self.transformer(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_qwen_modeling.py", line 432, in forward
    hidden_states, residual = layer(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_qwen_modeling.py", line 357, in forward
    attn_output = self.attn(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_qwen_modeling.py", line 226, in forward
    qkv = self.c_attn(hidden_states, adapter_data)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/layers.py", line 481, in forward
    result = self.base_layer(input)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/layers.py", line 285, in forward
    return self.linear.forward(x)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/quant_linear.py", line 349, in forward
    out = QuantLinearFunction.apply(
  File "/opt/conda/lib/python3.10/site-packages/torch/autograd/function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/amp/autocast_mode.py", line 121, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/quant_linear.py", line 244, in forward
    output = matmul248(input, qweight, scales, qzeros, g_idx, bits, maxq)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/quant_linear.py", line 216, in matmul248
    matmul_248_kernel[grid](
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/custom_autotune.py", line 110, in run
    timings = {
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/custom_autotune.py", line 111, in <dictcomp>
    config: self._bench(*args, config=config, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/custom_autotune.py", line 93, in _bench
    except triton.compiler.OutOfResources:
AttributeError: module 'triton.compiler' has no attribute 'OutOfResources'

2023-12-17T14:25:01.296225Z ERROR warmup{max_input_length=1024 max_prefill_tokens=4096}:warmup: lorax_client: router/client/src/lib.rs:33: Server error: module 'triton.compiler' has no attribute 'OutOfResources'
Error: Warmup(Generation("module 'triton.compiler' has no attribute 'OutOfResources'"))

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/predibase/lorax:0.4 --model-id Qwen/Qwen-14B-Chat-Int4

Expected behavior

It runs ok with ghcr.io/predibase/lorax:0.3

issues launching docker cmd for "mistralai/Mistral-7B-Instruct-v0.2"

System Info

Using the official docker run cmd:

model=mistralai/Mistral-7B-Instruct-v0.2
volume=$PWD/data

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/predibase/lorax:latest --model-id $model

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

model=mistralai/Mistral-7B-Instruct-v0.2
volume=$PWD/data

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/predibase/lorax:latest --model-id $model

error:
2023-12-13T10:44:53.233748Z ERROR shard-manager: lorax_launcher: Shard complete standard error output:

Traceback (most recent call last):
  File "/opt/conda/bin/lorax-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/lorax_server/cli.py", line 81, in serve
    server.serve(
  File "/opt/conda/lib/python3.9/site-packages/lorax_server/server.py", line 262, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/opt/conda/lib/python3.9/site-packages/lorax_server/server.py", line 214, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.9/site-packages/lorax_server/models/__init__.py", line 274, in get_model
    return FlashMistral(
  File "/opt/conda/lib/python3.9/site-packages/lorax_server/models/flash_mistral.py", line 347, in __init__
    SLIDING_WINDOW_BLOCKS = math.ceil(config.sliding_window / BLOCK_SIZE)

TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'

Expected behavior

It should run the server with the model.
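The failure comes from Mistral-7B-Instruct-v0.2 shipping "sliding_window": null in its config.json, so the division by BLOCK_SIZE receives None. A minimal guard sketch (the BLOCK_SIZE value and function name here are illustrative, not the project's code):

import math
from typing import Optional

BLOCK_SIZE = 16  # illustrative placeholder; the real value comes from the server config

def sliding_window_blocks(config) -> Optional[int]:
    # Mistral-7B-Instruct-v0.2 sets "sliding_window": null in config.json,
    # so guard against None before dividing to avoid the TypeError above.
    if getattr(config, "sliding_window", None) is None:
        return None
    return math.ceil(config.sliding_window / BLOCK_SIZE)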

Always consider base model (no adapter) to be active

Feature request

There is no exchange cost for including a base model request in the batch, so we should always consider such requests as part of the "active set" that can be included in a given batch.
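A sketch of the proposed rule, assuming the scheduler tracks an active-adapter set keyed by adapter id and uses an empty string as the "no adapter" sentinel (both assumptions for illustration):

BASE_MODEL_ADAPTER_ID = ""  # assumed sentinel meaning "no adapter"

def is_schedulable(adapter_id: str, active_adapters: set) -> bool:
    # Base-model requests never trigger an adapter exchange, so they are
    # always treated as part of the active set.
    if adapter_id == BASE_MODEL_ADAPTER_ID:
        return True
    return adapter_id in active_adapters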

Motivation

No response

Your contribution

No response

Sharded adapters not working

System Info

Model info:

{
  "model_id": "mistralai/Mistral-7B-Instruct-v0.1",
  "model_sha": "7ad5799710574ba1c1d953eba3077af582f3a773",
  "model_dtype": "torch.float16",
  "model_device_type": "cuda",
  "model_pipeline_tag": "text-generation",
  "max_concurrent_requests": 128,
  "max_best_of": 2,
  "max_stop_sequences": 4,
  "max_input_length": 1024,
  "max_total_tokens": 2048,
  "waiting_served_ratio": 1.2,
  "max_batch_total_tokens": 1102544,
  "max_waiting_tokens": 20,
  "validation_workers": 2,
  "version": "0.1.0",
  "sha": null,
  "docker_label": null
}

2 A100 gpus, NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 outside docker.

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Run mistral example with docker on 2 gpus:

model=mistralai/Mistral-7B-Instruct-v0.1
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/predibase/lorax:latest --model-id $model --num-shard 2

Then try to generate:

❯ curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs": "[INST] Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? [/INST]", "parameters": {"max_new_tokens": 64, "adapter_id": "vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k"}}' \
    -H 'Content-Type: application/json'
{"error":"Request failed during generation: Server error: local variable 'lora_b' referenced before assignment","error_type":"generation"}%

Basically, the issue is that when multiplying by the first lora_a matrix, it arrives sharded with shape [2048, r] while the input is not sharded and has shape [49, 4096].
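For illustration, a shape check along these lines (the function and argument names are hypothetical) makes the mismatch explicit: under tensor parallelism the LoRA A matrix must keep the full hidden size on its input dimension, and only B should be split across shards:

import torch

def check_lora_shapes(hidden_states: torch.Tensor, lora_a: torch.Tensor, lora_b: torch.Tensor) -> None:
    # hidden_states: [num_tokens, hidden_size], lora_a: [hidden_size, r], lora_b: [r, out_features]
    assert hidden_states.shape[-1] == lora_a.shape[0], (
        f"lora_a input dim {lora_a.shape[0]} != hidden size {hidden_states.shape[-1]}; "
        "the A matrix appears to have been sharded along the wrong dimension"
    )
    assert lora_a.shape[1] == lora_b.shape[0], "rank mismatch between lora_a and lora_b"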

Expected behavior

Generation completed successfully

Question regarding Punica integration

The acknowledgements of this project mention the SGMV kernels created by the Punica project. Is there a way to run multiple adapters simultaneously with LoRAX, similar to what the Punica example shows? Can this be done via the AsyncClient?
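Heterogeneous continuous batching means concurrent requests can each name a different adapter_id, so one way to exercise this from Python is sketched below. The AsyncClient import path and constructor are assumptions modeled on the synchronous client shown elsewhere on this page:

import asyncio
from lorax import AsyncClient  # assumed import path, mirroring the sync Client

async def main():
    client = AsyncClient("http://127.0.0.1:8080")
    requests = [
        ("vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k", "[INST] 2 + 2? [/INST]"),
        ("", "[INST] Tell me a joke. [/INST]"),  # empty adapter_id -> base model (assumption)
    ]
    # Requests targeting different adapters are issued concurrently; the server
    # can then batch them together.
    responses = await asyncio.gather(
        *[client.generate(prompt, adapter_id=adapter_id, max_new_tokens=64)
          for adapter_id, prompt in requests]
    )
    for r in responses:
        print(r.generated_text)

asyncio.run(main())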

Error running Mixtral: 'TensorParallelHead' object has no attribute 'base_layer'

System Info

Running latest docker image

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

I started a mixtral server on 2 A100 (80GB) GPUs:
lorax-launcher --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 --num-shard 2

Then I sent a generate_stream request to my_adapter, which is an adapter trained with target_modules= ["q_proj", "v_proj"].

I then get the following error:

2023-12-13T22:50:01.147514Z  INFO lorax_launcher: flash_causal_lm.py:742 Loading adapter weights into model: my_adater
2023-12-13T22:50:08.489110Z ERROR lorax_launcher: server.py:170 Error when loading adapter
Traceback (most recent call last):
  File "/opt/conda/bin/lorax-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 84, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 271, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
    return await self.intercept(
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/interceptor.py", line 38, in intercept
    return await response
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
    return await behavior(request_or_iterator, context)
> File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 162, in LoadAdapter
    self.model.load_adapter(adapter_id, adapter_source, adapter_index)
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 750, in load_adapter
    self.load_batched_adapter_weights(
  File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 796, in load_batched_adapter_weights
    base_weight = layer.base_layer.linear.weight
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1695, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'TensorParallelHead' object has no attribute 'base_layer'

Given the merged Mixtral pull request, I assumed that the model would be supported. Does this only apply to the base model or are adapters also supported?
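The traceback suggests load_batched_adapter_weights assumes every target layer exposes a base_layer, which the tensor-parallel LM head does not. A hypothetical guard (not the project's actual fix) would skip such layers instead of raising:

def get_base_weight(layer):
    # Layers like TensorParallelHead have no `base_layer`; return None so the
    # caller can skip adapting them rather than crash with AttributeError.
    base = getattr(layer, "base_layer", None)
    if base is None:
        return None
    return base.linear.weight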

Expected behavior

I would expect the model to return a text response

Make lorax hyperparams configurable

Feature request

When starting the router, the following params should be configurable:

  • --adapter-cycle-time-s (default: 2)
  • --max-active-adapters (default: 128)

Motivation

No response

Your contribution

No response

Fuse allgather requests across adapters and q, k, v to reduce small network requests

Feature request

The current approach to tensor parallelism from #5 is not latency optimized. We make an allgather call for every adapter, which will be quite slow for many adapters. Additionally, we don't fuse together the q and v matrices, which would further halve the number of allgathers.

A better approach would be to pre-allocate a large tensor and then slice in and out the individual tensors, as shown here:

https://discuss.pytorch.org/t/concatenate-tensors-without-memory-copying/34609
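An illustrative sketch of the idea (names and signature are assumptions, not LoRAX code): pack the per-adapter q/v slices into one pre-allocated buffer so a single collective replaces many small ones:

import torch
import torch.distributed as dist

def fused_all_gather(tensors, group=None) -> torch.Tensor:
    """Gather several small tensors with one collective instead of one per tensor."""
    sizes = [t.numel() for t in tensors]
    buf = tensors[0].new_empty(sum(sizes))   # pre-allocated send buffer
    offset = 0
    for t, n in zip(tensors, sizes):         # slice each tensor into the buffer
        buf[offset:offset + n].copy_(t.reshape(-1))
        offset += n
    world_size = dist.get_world_size(group)
    out = buf.new_empty(world_size * buf.numel())
    dist.all_gather_into_tensor(out, buf, group=group)  # single network request
    return out  # callers slice the per-rank, per-tensor views back out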

Motivation

No response

Your contribution

No response

Panic when adapter cannot be loaded

System Info

No response

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Weights are already downloaded, but failure occurs during load (e.g., local model with adapter but the config hasn't been downloaded). Likely a race condition with cleanup logic.

2023-11-12T05:38:04.192534Z ERROR text_generation_client: router/client/src/lib.rs:33: Server error: Can't find 'adapter_config.json' at '/data/models--arnavgrg--codealpaca_v3/snapshots/834b33af35ff5965ea3e4bc18b51ad5d65da7466'
2023-11-12T05:38:04.192612Z  INFO text_generation_router::loader: router/src/loader.rs:184: FAILED loading adapter /data/models--arnavgrg--codealpaca_v3/snapshots/834b33af35ff5965ea3e4bc18b51ad5d65da7466
2023-11-12T05:38:04.192682Z ERROR text_generation_router::queue: router/src/queue.rs:240: adapter /data/models--arnavgrg--codealpaca_v3/snapshots/834b33af35ff5965ea3e4bc18b51ad5d65da7466 not found in queue_map
Backtrace:
  text_generation_router::queue::AdapterQueuesState::set_status (./src/queue.rs:241)
  text_generation_router::loader::loader_task::{{closure}} (./src/loader.rs:186)
  <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll (/build/rustc-v6rcRM/rustc-1.66.1+dfsg0ubuntu1~llvm/library/core/src/future/mod.rs:91)
  tokio::runtime::task::core::Core<T,S>::poll::{{closure}} (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/task/core.rs:311)
  tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/loom/std/unsafe_cell.rs:14)
  tokio::runtime::task::core::Core<T,S>::poll (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/task/core.rs:300)
  tokio::runtime::task::harness::poll_future::{{closure}} (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/task/harness.rs:476)
  <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once (/build/rustc-v6rcRM/rustc-1.66.1+dfsg0ubuntu1~llvm/library/core/src/panic/unwind_safe.rs:271)
  std::panicking::try::do_call (/build/rustc-v6rcRM/rustc-1.66.1+dfsg0ubuntu1~llvm/library/std/src/panicking.rs:483)
  __rust_try
  std::panicking::try (/build/rustc-v6rcRM/rustc-1.66.1+dfsg0ubuntu1~llvm/library/std/src/panicking.rs:447)
  std::panic::catch_unwind (/build/rustc-v6rcRM/rustc-1.66.1+dfsg0ubuntu1~llvm/library/std/src/panic.rs:137)
  tokio::runtime::task::harness::poll_future (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/task/harness.rs:464)
  tokio::runtime::task::harness::Harness<T,S>::poll_inner (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/task/harness.rs:198)
  tokio::runtime::task::harness::Harness<T,S>::poll (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/task/harness.rs:152)
  tokio::runtime::task::raw::poll (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/task/raw.rs:276)
  tokio::runtime::task::raw::RawTask::poll (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/task/raw.rs:200)
  tokio::runtime::task::LocalNotified<S>::run (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/task/mod.rs:400)
  tokio::runtime::scheduler::multi_thread::worker::Context::run_task::{{closure}} (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/scheduler/multi_thread/worker.rs:639)
  tokio::runtime::coop::with_budget (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/coop.rs:107)
  tokio::runtime::coop::budget (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/coop.rs:73)
  tokio::runtime::scheduler::multi_thread::worker::Context::run_task (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/scheduler/multi_thread/worker.rs:575)
  tokio::runtime::scheduler::multi_thread::worker::Context::run (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/scheduler/multi_thread/worker.rs:526)
  tokio::runtime::scheduler::multi_thread::worker::run::{{closure}}::{{closure}} (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/scheduler/multi_thread/worker.rs:491)
  tokio::runtime::context::scoped::Scoped<T>::set (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/context/scoped.rs:40)
  tokio::runtime::context::set_scheduler::{{closure}} (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/context.rs:176)
  std::thread::local::LocalKey<T>::try_with (/build/rustc-v6rcRM/rustc-1.66.1+dfsg0ubuntu1~llvm/library/std/src/thread/local.rs:446)
  std::thread::local::LocalKey<T>::with (/build/rustc-v6rcRM/rustc-1.66.1+dfsg0ubuntu1~llvm/library/std/src/thread/local.rs:422)
  tokio::runtime::context::set_scheduler (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/context.rs:176)
  tokio::runtime::scheduler::multi_thread::worker::run::{{closure}} (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/scheduler/multi_thread/worker.rs:486)
  tokio::runtime::context::runtime::enter_runtime (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/context/runtime.rs:65)
  tokio::runtime::scheduler::multi_thread::worker::run (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/scheduler/multi_thread/worker.rs:478)
  tokio::runtime::scheduler::multi_thread::worker::Launch::launch::{{closure}} (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/scheduler/multi_thread/worker.rs:447)
  <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/blocking/task.rs:42)
  tokio::runtime::task::core::Core<T,S>::poll::{{closure}} (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/task/core.rs:311)
  tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/loom/std/unsafe_cell.rs:14)
  tokio::runtime::task::core::Core<T,S>::poll (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/task/core.rs:300)
  tokio::runtime::task::harness::poll_future::{{closure}} (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/task/harness.rs:476)
  <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once (/build/rustc-v6rcRM/rustc-1.66.1+dfsg0ubuntu1~llvm/library/core/src/panic/unwind_safe.rs:271)
  std::panicking::try::do_call (/build/rustc-v6rcRM/rustc-1.66.1+dfsg0ubuntu1~llvm/library/std/src/panicking.rs:483)
  __rust_try
  std::panicking::try (/build/rustc-v6rcRM/rustc-1.66.1+dfsg0ubuntu1~llvm/library/std/src/panicking.rs:447)
  std::panic::catch_unwind (/build/rustc-v6rcRM/rustc-1.66.1+dfsg0ubuntu1~llvm/library/std/src/panic.rs:137)
  tokio::runtime::task::harness::poll_future (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/task/harness.rs:464)
  tokio::runtime::task::harness::Harness<T,S>::poll_inner (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/task/harness.rs:198)
  tokio::runtime::task::harness::Harness<T,S>::poll (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/task/harness.rs:152)
  tokio::runtime::task::raw::poll (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/task/raw.rs:276)
  tokio::runtime::task::raw::RawTask::poll (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/task/raw.rs:200)
  tokio::runtime::task::UnownedTask<S>::run (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/task/mod.rs:437)
  tokio::runtime::blocking::pool::Task::run (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/blocking/pool.rs:159)
  tokio::runtime::blocking::pool::Inner::run (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/blocking/pool.rs:513)
  tokio::runtime::blocking::pool::Spawner::spawn_thread::{{closure}} (/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.29.1/src/runtime/blocking/pool.rs:471)
  std::sys_common::backtrace::__rust_begin_short_backtrace (/build/rustc-v6rcRM/rustc-1.66.1+dfsg0ubuntu1~llvm/library/std/src/sys_common/backtrace.rs:121)
  std::thread::Builder::spawn_unchecked_::{{closure}}::{{closure}} (/build/rustc-v6rcRM/rustc-1.66.1+dfsg0ubuntu1~llvm/library/std/src/thread/mod.rs:551)
  <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once (/build/rustc-v6rcRM/rustc-1.66.1+dfsg0ubuntu1~llvm/library/core/src/panic/unwind_safe.rs:271)
  std::panicking::try::do_call (/build/rustc-v6rcRM/rustc-1.66.1+dfsg0ubuntu1~llvm/library/std/src/panicking.rs:483)
  __rust_try
  std::panicking::try (/build/rustc-v6rcRM/rustc-1.66.1+dfsg0ubuntu1~llvm/library/std/src/panicking.rs:447)
  std::panic::catch_unwind (/build/rustc-v6rcRM/rustc-1.66.1+dfsg0ubuntu1~llvm/library/std/src/panic.rs:137)
  std::thread::Builder::spawn_unchecked_::{{closure}} (/build/rustc-v6rcRM/rustc-1.66.1+dfsg0ubuntu1~llvm/library/std/src/thread/mod.rs:550)
  core::ops::function::FnOnce::call_once{{vtable.shim}} (/build/rustc-v6rcRM/rustc-1.66.1+dfsg0ubuntu1~llvm/library/core/src/ops/function.rs:251)
  <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once (/build/rustc-v6rcRM/rustc-1.66.1+dfsg0ubuntu1~llvm/library/alloc/src/boxed.rs:1987)
  <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once (/build/rustc-v6rcRM/rustc-1.66.1+dfsg0ubuntu1~llvm/library/alloc/src/boxed.rs:1987)
  std::sys::unix::thread::Thread::new::thread_start (/build/rustc-v6rcRM/rustc-1.66.1+dfsg0ubuntu1~llvm/library/std/src/sys/unix/thread.rs:108)
  start_thread (/build/glibc-BHL3KM/glibc-2.31/nptl/pthread_create.c:477)
  clone (/build/glibc-BHL3KM/glibc-2.31/misc/../sysdeps/unix/sysv/linux/x86_64/clone.S:95)
thread 'tokio-runtime-worker' panicked at 'called `Option::unwrap()` on a `None` value', router/src/queue.rs:243:23

Expected behavior

No response
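Until the router handles a missing adapter_config.json gracefully instead of panicking, a client-side pre-check is a cheap mitigation. A minimal sketch (the helper name is illustrative; the path is the one from the logs above):

from pathlib import Path

def adapter_is_loadable(adapter_path: str) -> bool:
    # A local PEFT adapter needs adapter_config.json next to its weights;
    # checking first avoids sending a request the router cannot serve.
    return (Path(adapter_path) / "adapter_config.json").is_file()

adapter_dir = "/data/models--arnavgrg--codealpaca_v3/snapshots/834b33af35ff5965ea3e4bc18b51ad5d65da7466"
if not adapter_is_loadable(adapter_dir):
    raise FileNotFoundError("adapter_config.json missing; re-download or fix the adapter directory")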

Problem with quantize model

System Info

Can you tell me a bit about how to serve a model in 4-bit quantized mode?
I added --quantize bitsandbytes-nf4 when running the Docker container, but nothing changed: GPU memory usage stayed the same.

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

  1. docker run --gpus all --shm-size 1g -p 8080:80 -v ./ckpts:/data ghcr.io/predibase/lorax:latest --model-id /data/OpenHermes-2-7B-base-2.3 --quantize bitsandbytes-nf4

Expected behavior

Reduce the GPU memory
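As a first check, it can help to confirm what the shard actually loaded; a minimal sketch, assuming the router exposes the same GET /info route whose JSON response appears earlier on this page:

import requests

# Query the router's info endpoint (assumed to match the JSON shown in the
# "Sharded adapters not working" issue above) and inspect the reported settings.
info = requests.get("http://127.0.0.1:8080/info", timeout=10).json()
print(info.get("model_dtype"), info.get("model_device_type"))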

Some error records and questions

System Info

Docker image: 2023-12-06
GPUs: 2x A40 (48 GB)
OS: CentOS 7

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

null

Expected behavior

I tested three models: Qwen-14B, Yi-34B-Chat (Llama2-based), and XuanYuan-70B-Chat (Llama2-based). For each model I prepared 2-3 LoRA adapters and ran into some problems.
XuanYuan runs completely normally.
All of the following questions concern Qwen and Yi.

  1. Without a LoRA adapter, generation runs to the maximum new-token length unless stop words are added, but the output is not nonsense; with stop words it behaves normally. After adding a LoRA adapter, it generates up to max_new_tokens and produces nonsense. This may be a prompt-template configuration issue from fine-tuning.

  2. When num_shard is set to 2 (two GPUs), a prefill size of 4096 causes insufficient memory, and even 1024 reports the error below (Qwen). num_shard=1 runs normally.

RuntimeError: Not enough memory to handle 1028 prefill tokens. You need to decrease --max-batch-prefill-tokens
2023-12-08T10:14:01.820643Z ERROR warmup{max_input_length=1024 max_prefill_tokens=1028}:warmup: lorax_client: router/client/src/lib.rs:33: Server error: Not enough memory to handle 1028 prefill tokens. You need to decrease --max-batch-prefill-tokens
Error: Warmup(Generation("Not enough memory to handle 1028 prefill tokens. You need to decrease --max-batch-prefill-tokens"))
2023-12-08T10:14:01.894611Z ERROR lorax_launcher: Webserver Crashed
2023-12-08T10:14:01.894628Z INFO lorax_launcher: Shutting down shards
2023-12-08T10:14:02.427130Z INFO shard-manager: lorax_launcher: Shard terminated rank=1
2023-12-08T10:14:03.432023Z INFO shard-manager: lorax_launcher: Shard terminated rank=0
Error: WebserverFailed

  3. As long as max_new_tokens is higher than 200, the connection fails; setting it under 200 runs normally (see the client-timeout sketch after this list).
    Settings:

Args { model_id: "/data/yi-34b-chat", adapter_id: "", source: "hub", adapter_source: "hub", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: Some(BitsandbytesNF4), dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, max_active_adapters: 128, adapter_cycle_time_s: 2, hostname: "c92d36636b23", port: 80, shard_uds_path: "/tmp/lorax-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false, download_only: false }

Post:

prompt = "<|im_start|>user\nTell me a story<|im_end|>\n<|im_start|>assistant\n"
adapter_id = "/data/chat-int8-3-epoch-1024-manual_2360-self-5000_1207-1"
print(client.generate(prompt, max_new_tokens=300,temperature=0.8, do_sample=True, stop_sequences=["<|im_end|>"], adapter_id=adapter_id).generated_text)

Error :

Traceback (most recent call last):
  File "/home/shaohongen/Temp/WZ_test/lorax/test_lorax_yi.py", line 9, in <module>
    print(client.generate(prompt, max_new_tokens=300, temperature=0.8, do_sample=True, stop_sequences=["<|im_end|>"], adapter_id=adapter_id).generated_text)
  File "/home/shaohongen/miniconda3/envs/slora/lib/python3.9/site-packages/lorax/client.py", line 148, in generate
    resp = requests.post(
  File "/home/shaohongen/miniconda3/envs/slora/lib/python3.9/site-packages/requests/api.py", line 115, in post
    return request("post", url, data=data, json=json, **kwargs)
  File "/home/shaohongen/miniconda3/envs/slora/lib/python3.9/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/shaohongen/miniconda3/envs/slora/lib/python3.9/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/shaohongen/miniconda3/envs/slora/lib/python3.9/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/home/shaohongen/miniconda3/envs/slora/lib/python3.9/site-packages/requests/adapters.py", line 532, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='127.0.0.1', port=8081): Read timed out. (read timeout=10)

  4. Question: three models of different sizes (14B, 34B, 70B) occupy roughly the same GPU memory under int4 quantization, about 44 GB. With num_shard=2, the usage is about 37 GB per GPU in the int4 case, which may be the reason for the max_new_tokens limit?
    Also, why is the memory usage the same across model sizes?
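Regarding item 3 above, the read timeout comes from the Python client rather than the server (note read timeout=10 in the traceback). A hedged workaround sketch, assuming the client constructor exposes a timeout argument as that default suggests:

from lorax import Client

prompt = "<|im_start|>user\nTell me a story<|im_end|>\n<|im_start|>assistant\n"
adapter_id = "/data/chat-int8-3-epoch-1024-manual_2360-self-5000_1207-1"

# Longer generations need a longer client-side read timeout; the kwarg name is
# an assumption based on the default (10s) shown in the ReadTimeout error.
client = Client("http://127.0.0.1:8081", timeout=120)
print(
    client.generate(
        prompt,
        max_new_tokens=300,
        temperature=0.8,
        do_sample=True,
        stop_sequences=["<|im_end|>"],
        adapter_id=adapter_id,
    ).generated_text
)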
