predibase / lorax
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
Home Page: https://loraexchange.ai
License: Apache License 2.0
master branch
start launcher with gptq model, then try to load non gptq adapter
2023-11-22T19:34:06.185127Z ERROR lorax_client: router/client/src/lib.rs:33: Server error: 'QuantLinear' object has no attribute 'weight'
For testing, I commented out the ID check:
#if adapter_config.base_model_name_or_path != model_id:
#    raise ValueError(f"Adapter '{adapter_id}' is not compatible with model '{model_id}'. "
#                     f"Use --model-id '{adapter_config.base_model_name_or_path}' instead.")
I am already thinking about a better check, since compatibility depends on the model architecture and parameter count rather than on the ID directly; that way a zephyr LoRA could also be merged successfully onto mistral-instruct (see the sketch below).
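A minimal sketch of what such a compatibility check could look like, assuming the adapter's PEFT config is available; adapter_compatible and the specific fields compared are illustrative, not LoRAX code:
from peft import PeftConfig
from transformers import AutoConfig

def adapter_compatible(adapter_id: str, model_id: str) -> bool:
    adapter_cfg = PeftConfig.from_pretrained(adapter_id)
    base_cfg = AutoConfig.from_pretrained(adapter_cfg.base_model_name_or_path)
    model_cfg = AutoConfig.from_pretrained(model_id)
    # Same architecture and the same core dimensions mean the LoRA shapes line up,
    # e.g. a zephyr adapter merging onto mistral-instruct.
    return (
        base_cfg.architectures == model_cfg.architectures
        and base_cfg.hidden_size == model_cfg.hidden_size
        and base_cfg.num_hidden_layers == model_cfg.num_hidden_layers
    )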
detect gptq model and use qweight instead of weight when merging
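A rough sketch of that qweight fallback (the layer.base_layer.linear naming follows the error above; get_base_weight is a hypothetical helper, not the actual LoRAX implementation):
def get_base_weight(linear):
    # GPTQ QuantLinear stores its packed weights in qweight rather than weight.
    if hasattr(linear, "qweight"):
        return linear.qweight
    return linear.weight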
During fine-tuning, it's possible that special tokens are added that are specific to the adapter. During decoding, we should be using the special tokens, and ensure the correct stop tokens, padding, etc. are properly honored.
Repro from @runvnc, related: #68
Model ID: https://huggingface.co/qblocks/mistral_7b_norobots/tree/main
QLoRA repo example uses this AutoTokenizer with special tokens:
https://github.com/artidoro/qlora/blob/7f4e95a68dc076bea9b3a413d2b512eca6d004e5/qlora.py#L347
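A small sketch of how the adapter-specific special tokens could be detected, assuming the adapter repo (qblocks/mistral_7b_norobots here) ships its own tokenizer files:
from transformers import AutoTokenizer

base_tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
adapter_tok = AutoTokenizer.from_pretrained("qblocks/mistral_7b_norobots")

# Tokens the adapter added on top of the base vocabulary; decoding and the
# stop/pad tokens should come from the adapter tokenizer, not the base one.
added = set(adapter_tok.get_vocab()) - set(base_tok.get_vocab())
print(added, adapter_tok.eos_token, adapter_tok.pad_token)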
Can you tell me a bit about how to serve a model in 4-bit quantized mode?
I added --quantize bitsandbytes-nf4 when running the docker container, but nothing changed; GPU memory usage stays the same.
docker run --gpus all --shm-size 1g -p 8080:80 -v ./ckpts:/data ghcr.io/predibase/lorax:latest --model-id /data/OpenHermes-2-7B-base-2.3 --quantize bitsandbytes-nf4
Reduce the GPU memory
The acknowledgements of this project mention the SGMV kernels created by the Punica project. Is there a way we can run multiple adapters simultaneously using LoRAX in a similar way shown in the Punica example? Can this be done via the AsyncClient?
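A minimal sketch of concurrent multi-adapter requests, assuming lorax's AsyncClient mirrors the Client.generate signature used elsewhere in these issues (the adapter IDs are placeholders):
import asyncio
from lorax import AsyncClient

async def main():
    client = AsyncClient("http://127.0.0.1:8080")
    adapters = ["adapter-a", "adapter-b", "adapter-c"]  # placeholder adapter IDs
    # Requests for different adapters go out concurrently; LoRAX batches them
    # server-side, which is where the SGMV kernels come in.
    results = await asyncio.gather(
        *(client.generate("What is deep learning?", adapter_id=a) for a in adapters)
    )
    for adapter, result in zip(adapters, results):
        print(adapter, result.generated_text)

asyncio.run(main())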
Docker images: 2023-12-06
GPUs: 2x A40 (48 GB)
OS: CentOS 7
I tested three models: qwen-14b, yi-34b-chat (llama2 based), and xuanyuan-70b-chat (llama2 based). For each model I prepared 2-3 LoRA adapters and ran into some problems.
Xuanyuan runs completely normally.
All the following questions are based on Qwen and Yi.
Without a LoRA adapter, the output reaches max_new_tokens if no stop words are set, but it is not nonsense; with stop words it outputs normally. After adding a LoRA adapter, however, it generates up to max_new_tokens and produces nonsense. I suspect this may be a template configuration issue in fine-tuning.
When num_shard is set to 2 (two GPUs), a prefill length of 4096 causes insufficient memory, and even 1024 reports the error (Qwen). num_shard = 1 runs normally.
RuntimeError: Not enough memory to handle 1028 prefill tokens. You need to decrease --max-batch-prefill-tokens
2023-12-08T10:14:01.820643Z ERROR warmup{max_input_length=1024 max_prefill_tokens=1028}:warmup: lorax_client: router/client/src/lib.rs:33: Server error: Not enough memory to handle 1028 prefill tokens. You need to decrease --max-batch-prefill-tokens
Error: Warmup(Generation("Not enough memory to handle 1028 prefill tokens. You need to decrease --max-batch-prefill-tokens"))
2023-12-08T10:14:01.894611Z ERROR lorax_launcher: Webserver Crashed
2023-12-08T10:14:01.894628Z INFO lorax_launcher: Shutting down shards
2023-12-08T10:14:02.427130Z INFO shard-manager: lorax_launcher: Shard terminated rank=1
2023-12-08T10:14:03.432023Z INFO shard-manager: lorax_launcher: Shard terminated rank=0
Error: WebserverFailed
Args { model_id: "/data/yi-34b-chat", adapter_id: "", source: "hub", adapter_source: "hub", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: Some(BitsandbytesNF4), dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, max_active_adapters: 128, adapter_cycle_time_s: 2, hostname: "c92d36636b23", port: 80, shard_uds_path: "/tmp/lorax-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false, download_only: false }
Post:
prompt = "<|im_start|>user\nTell me a story<|im_end|>\n<|im_start|>assistant\n"
adapter_id = "/data/chat-int8-3-epoch-1024-manual_2360-self-5000_1207-1"
print(client.generate(prompt, max_new_tokens=300,temperature=0.8, do_sample=True, stop_sequences=["<|im_end|>"], adapter_id=adapter_id).generated_text)
Error:
Traceback (most recent call last):
File "/home/shaohongen/Temp/WZ_test/lorax/test_lorax_yi.py", line 9, in
print(client.generate(prompt, max_new_tokens=300,temperature=0.8, do_sample=True, stop_sequences=["<|im_end|>"], adapter_id=adapter_id).generated_text)
File "/home/shaohongen/miniconda3/envs/slora/lib/python3.9/site-packages/lorax/client.py", line 148, in generate
resp = requests.post(
File "/home/shaohongen/miniconda3/envs/slora/lib/python3.9/site-packages/requests/api.py", line 115, in post
return request("post", url, data=data, json=json, **kwargs)
File "/home/shaohongen/miniconda3/envs/slora/lib/python3.9/site-packages/requests/api.py", line 59, in request
return session.request(method=method, url=url, **kwargs)
File "/home/shaohongen/miniconda3/envs/slora/lib/python3.9/site-packages/requests/sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
File "/home/shaohongen/miniconda3/envs/slora/lib/python3.9/site-packages/requests/sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
File "/home/shaohongen/miniconda3/envs/slora/lib/python3.9/site-packages/requests/adapters.py", line 532, in send
raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='127.0.0.1', port=8081): Read timed out. (read timeout=10)
There is no exchange cost for including a base model request in the batch, so we should always consider such requests as part of the "active set" that can be included in a given batch.
using this official docker run cmd:
model=mistralai/Mistral-7B-Instruct-v0.2
volume=$PWD/data
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/predibase/lorax:latest --model-id $model
error:
2023-12-13T10:44:53.233748Z ERROR shard-manager: lorax_launcher: Shard complete standard error output:
Traceback (most recent call last):
File "/opt/conda/bin/lorax-server", line 8, in
sys.exit(app())
File "/opt/conda/lib/python3.9/site-packages/lorax_server/cli.py", line 81, in serve
server.serve(
File "/opt/conda/lib/python3.9/site-packages/lorax_server/server.py", line 262, in serve
asyncio.run(
File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
return future.result()
File "/opt/conda/lib/python3.9/site-packages/lorax_server/server.py", line 214, in serve_inner
model = get_model(
File "/opt/conda/lib/python3.9/site-packages/lorax_server/models/init.py", line 274, in get_model
return FlashMistral(
File "/opt/conda/lib/python3.9/site-packages/lorax_server/models/flash_mistral.py", line 347, in init
SLIDING_WINDOW_BLOCKS = math.ceil(config.sliding_window / BLOCK_SIZE)
TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'
Should run the server with the model.
Cool job! I have successfully run multi-LoRA with llama2-70b.
I would like to ask if the author has any plans to support other models, such as Qwen, which would be very helpful.
It should be possible to easily deploy LoRAX on Kubernetes via Helm. We only really need a Deployment and Service resource for now.
Running latest docker image
I started a mixtral server on 2 A100 (80GB) GPUs:
lorax-launcher --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 --num-shard 2
Then I sent a generate_stream request to my_adapter, which is an adapter trained with target_modules=["q_proj", "v_proj"].
I then get the following error:
2023-12-13T22:50:01.147514Z INFO lorax_launcher: flash_causal_lm.py:742 Loading adapter weights into model: my_adater
2023-12-13T22:50:08.489110Z ERROR lorax_launcher: server.py:170 Error when loading adapter
Traceback (most recent call last):
File "/opt/conda/bin/lorax-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 84, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 271, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
handle._run()
File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/interceptor.py", line 38, in intercept
return await response
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
return await behavior(request_or_iterator, context)
> File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 162, in LoadAdapter
self.model.load_adapter(adapter_id, adapter_source, adapter_index)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 750, in load_adapter
self.load_batched_adapter_weights(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 796, in load_batched_adapter_weights
base_weight = layer.base_layer.linear.weight
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1695, in __getattr__
raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'TensorParallelHead' object has no attribute 'base_layer'
Given the merged Mixtral pull request, I assumed that the model would be supported. Does this only apply to the base model or are adapters also supported?
I would expect the model to return a text response
Lorax version: 0.4.1
Lorax_launcher: 0.1.0
Model: mistralai/Mixtral-8x7B-Instruct-v0.1
GPUS: 3090 (24 gb) 3060 (12 gb)
model= mistralai/Mixtral-8x7B-Instruct-v0.1
volume=$PWD/data
sudo docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/predibase/lorax:latest --model-id $model --trust-remote-code --quantize bitsandbytes-nf4 --max-batch-prefill-tokens 2048 --sharded true
Error Message:
2023-12-24T07:02:10.759386Z INFO lorax_launcher: Parsing num_shard from CUDA_VISIBLE_DEVICES/NVIDIA_VISIBLE_DEVICES
Error: NotEnoughCUDADevices("sharded is true but only found 1 CUDA devices")
The expected behavior is for LoRAX to find both GPUs. For reference here is the output of nvidia-smi
'''
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3060 On | 00000000:01:00.0 Off | N/A |
| 0% 49C P8 15W / 170W | 9MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce RTX 3090 On | 00000000:06:00.0 Off | N/A |
| 0% 51C P8 18W / 350W | 12MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 2249 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 2249 G /usr/lib/xorg/Xorg 4MiB |
+---------------------------------------------------------------------------------------+
'''
I checked the documentation and it said that --sharded true is the default setting of the server. However, when I do not pass --sharded true, I get an out-of-memory error and need to use a much smaller --max-batch-prefill-tokens (1024 to be exact). When I print nvidia-smi I get the following output:
'''
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3060 On | 00000000:01:00.0 Off | N/A |
| 0% 44C P8 15W / 170W | 12MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce RTX 3090 On | 00000000:06:00.0 Off | N/A |
| 81% 57C P2 114W / 350W | 23873MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 2249 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 2249 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 7439 C /opt/conda/bin/python3.10 23856MiB |
+---------------------------------------------------------------------------------------+
'''
It appears as if the server cannot find the 3060. I swapped the 3060 for one of my other GPUs (a Tesla P100 16 GB), yet I still received the same error.
Support 'gate_proj', 'down_proj', 'up_proj', 'lm_head' for Llama and Mistral
This would allow serving all linear layers in Llama and Mistral.
Contributions welcome!
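For reference, a hypothetical PEFT config for an adapter that targets all of these layers (values are illustrative, not taken from this issue):
from peft import LoraConfig

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj", "lm_head"],
    task_type="CAUSAL_LM",
)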
base model: meta-llama/Llama-2-13b-chat-hf
docker cmd:
docker run --gpus '"device=3"' --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=XXXXX -p 8082:82 -v $volume:/data ghcr.io/predibase/lorax:latest --model-id $model --quantize bitsandbytes
from lorax import Client

URL = "http://127.0.0.0:8082"
client = Client(URL, timeout=20)
prompt = "Hello"  # placeholder test prompt (not defined in the original snippet)
adapter_id = "./data/lora_models/unsighing/"  # this dir contains adapter_config.json and adapter_model.bin
adapter_source = "local"
client.generate(prompt, max_new_tokens=128, temperature=0.001, adapter_id=adapter_id, adapter_source=adapter_source).generated_text
Getting this error while running the above code:
GenerationError: Request failed during generation: Server error: No local weights found in ./data/lora_models/unsighing/ with extension .safetensors
Also getting a different error when putting the complete path for adapter_id:
adapter_id = "home/code/data/lora_models/unsighing/"
GenerationError: Request failed during generation: Server error: Can't find 'adapter_config.json' at 'home/code/data/lora_models/unsighing/'
More info:
base model: meta-llama/Llama-2-13b-chat-hf
docker cmd:
docker run --gpus '"device=3"' --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=XXXXX -p 8082:82 -v $volume:/data ghcr.io/predibase/lorax:latest --model-id $model --quantize bitsandbytes
Note: I am able to get successful predictions from the base model if I don't provide adapter_id in the same setup above.
client.generate(prompt, max_new_tokens=128, temperature=0.001).generated_text
This works fine for me.
Any help will be appreciated, thanks.
Should load the adapter from local dir.
The current approach to tensor parallelism from #5 is not latency optimized. We make an allgather call for every adapter, which will be quite slow for many adapters. Additionally, we don't fuse together the q and v matrices, which would further halve the number of allgathers.
A better approach would be to pre-allocate a large tensor and then slice in and out the individual tensors, as shown here:
https://discuss.pytorch.org/t/concatenate-tensors-without-memory-copying/34609
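A small sketch of the idea from that thread (shapes and ranks are made up for illustration):
import torch

ranks = [8, 16, 8]                      # hypothetical LoRA ranks for three adapters
hidden = 4096
big = torch.empty(sum(ranks), hidden)   # one contiguous allocation up front

offset = 0
views = []
for r in ranks:
    views.append(big[offset:offset + r])  # views share storage with big, no copies
    offset += r

# Writing into a view updates the pre-allocated buffer in place, so per-adapter
# tensors can be sliced in and out without concatenation or extra allgathers.
views[0].copy_(torch.randn(ranks[0], hidden))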
Target: x86_64-unknown-linux-gnu
Cargo version: 1.70.0
Commit sha: N/A
Docker label: N/A
NVIDIA-SMI 545.23.06 Driver Version: 545.23.06 CUDA Version: 12.3
model_id = "bigscience/bloom-560m"
run "lorax-launcher" without specifying model-id (it defaults to the "bigscience/bloom-560m") model or run any model not yet supported by lorax.
lorax-launcher
error message (client side):
{"error":"Request failed during generation: Server error: 'BLOOMSharded' object has no attribute 'load_adapter'","error_type":"generation"}
error message (server side):
2023-11-27T08:25:59.281666Z INFO lorax_router::loader: router/src/loader.rs:146: adapter vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k downloaded
2023-11-27T08:25:59.281719Z INFO lorax_router::queue: router/src/queue.rs:135: set adapter vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k status to Downloaded
2023-11-27T08:25:59.318818Z ERROR lorax_client: router/client/src/lib.rs:33: Server error: 'BLOOMSharded' object has no attribute 'load_adapter'
2023-11-27T08:25:59.318826Z INFO lorax_router::loader: router/src/loader.rs:201: FAILED loading adapter vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k
2023-11-27T08:25:59.318833Z INFO lorax_router::queue: router/src/queue.rs:135: set adapter vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k status to Errored
2023-11-27T08:25:59.318862Z INFO lorax_router::loader: router/src/loader.rs:271: terminating adapter vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k loader
When an unsupported model is run, Lorax should revert to its original implementation, which does not attempt to load an adapter for models that are not supported. In cases where an unsupported model is run without specifying an adapter_id, the system should still attempt to generate a response. If an adapter_id is specified, Lorax should notify the user that the model is not supported, rather than crashing.
I have downloaded the model, so I want to run it as a local model. The sample is:
docker run --gpus all --shm-size 1g -p 8080:80 -v /data/model/:/data/ \
    ghcr.io/predibase/lorax:latest --model-id /data/model/Qwen-14B-Chat
I want to use the local model; our machines are not allowed to access huggingface.co.
No.
ghcr.io/predibase/lorax:latest
Running within Kubernetes on H100
When the instance is put in production and receives simultaneous requests for different adapters, it just hangs.
/generate and /health stop answering,
but /info and /docs continue to be available.
There is no error displayed in the logs.
Not sure what's the best way to diagnose the issue, but it looks to me like it has trouble fetching multiple adapters in parallel and processing requests queued at the same time.
Should handle live requests for multiple adapters.
Weights are already downloaded, but failure occurs during load (e.g., local model with adapter but the config hasn't been downloaded). Likely a race condition with cleanup logic.
2023-11-12T05:38:04.192534Z ERROR text_generation_client: router/client/src/lib.rs:33: Server error: Can't find 'adapter_config.json' at '/data/models--arnavgrg--codealpaca_v3/snapshots/834b33af35ff5965ea3e4bc18b51ad5d65da7466'
2023-11-12T05:38:04.192612Z INFO text_generation_router::loader: router/src/loader.rs:184: FAILED loading adapter /data/models--arnavgrg--codealpaca_v3/snapshots/834b33af35ff5965ea3e4bc18b51ad5d65da7466
2023-11-12T05:38:04.192682Z ERROR text_generation_router::queue: router/src/queue.rs:240: adapter /data/models--arnavgrg--codealpaca_v3/snapshots/834b33af35ff5965ea3e4bc18b51ad5d65da7466 not found in queue_map
Backtrace [{ fn: "text_generation_router::queue::AdapterQueuesState::set_status", file: "./src/queue.rs", line: 241 }, { fn: "text_generation_router::loader::loader_task::{{closure}}", file: "./src/loader.rs", line: 186 }, ... (remaining frames are tokio runtime, std::panicking, and thread-start internals) ...]
thread 'tokio-runtime-worker' panicked at 'called `Option::unwrap()` on a `None` value', router/src/queue.rs:243:23
docker from latest
example prompt:
User: Who are you ?
Assistant: You mean me ?
User: Who are you ?
Assistant: You mean me ?
User: Who are you ?
Assistant: You mean me ?
User: Who are you ?
Assistant: You mean me ?
User: Who are you ?
Assistant: You mean me ?
This is enough to make any prompt produce only unk tokens.
When the prompt is cut down to 2 lines, it works as expected.
base model is:
mistralai/Mistral-7B-v0.1
adapters tested:
https://huggingface.co/qblocks/mistral_7b_norobots
https://huggingface.co/flozi00/mistral-zephyr-lora
https://huggingface.co/flozi00/mistral-germanassistantv4
start command
docker run --pull always --gpus all -d --shm-size 1g -p 8080:80 ghcr.io/predibase/lorax:latest --model-id mistralai/Mistral-7B-Instruct-v0.1 --cuda-memory-fraction 0.5 --max-total-tokens 8192 --max-batch-prefill-tokens 7000 --max-input-length 7000
Working with and without adapters loaded
https://github.com/mobiusml/hqq
Adding hqq as a quantization function, similar to bitsandbytes, to make it work just in time (only about 5 minutes for a 70B model).
2-bit quantization.
I can take this to a PR as a quantization runtime for the first step.
Example from Punica:
as discussed here: https://discord.com/channels/1174495433565945916/1176984269558652998/1179487788266172508
If there are too many requests for different adapters you run out of memory, since the current system only checks memory against batch sizes and not against the number of concurrent adapters.
I looked at the effort it would take to add this feature and realized it was beyond my ability with Rust.
When starting the router, the following params should be configurable:
--adapter-cycle-time-s (default: 2)
--max-active-adapters (default: 128)
Not sure if this should be a feature request.
The current docker image contains two copies of PyTorch, which results in an extra ~6 GB of image size.
Fast startup time in Function-as-a-Service environments.
Less docker pull wait time.
diff --git a/Dockerfile b/Dockerfile
index 19b2e06..273278a 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -199,7 +199,8 @@ RUN pip install einops --no-cache-dir
# Install the pip requirements
COPY server/requirements.txt .
-RUN pip install -r requirements.txt
+# HACK: make torch version same as the one installed by conda
+RUN sed -i 's/+cu118//g' requirements.txt; pip install -r requirements.txt --no-cache-dir
# Install server
COPY proto proto
@@ -234,7 +235,8 @@ RUN chmod +x sync.sh
RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" && \
unzip awscliv2.zip && \
- sudo ./aws/install
+ sudo ./aws/install && \
+ rm -rf aws awscliv2.zip
# ENTRYPOINT ["./entrypoint.sh"]
ENTRYPOINT ["lorax-launcher"]
diff --git a/server/poetry.lock b/server/poetry.lock
index 8195489..7443101 100644
--- a/server/poetry.lock
+++ b/server/poetry.lock
@@ -2787,4 +2787,4 @@ quantize = ["accelerate", "datasets", "texttable"]
[metadata]
lock-version = "2.0"
python-versions = "^3.9"
-content-hash = "151ae83f306aafec7e9fe044359d9eaada48c55910ad7f25de7461507f6adfe6"
+content-hash = "c9f828f35184814a2017369a1cbe783f42931a9c034d1ac5f5de377cbb69ffdc"
diff --git a/server/pyproject.toml b/server/pyproject.toml
index 206a3f8..09fa4b4 100644
--- a/server/pyproject.toml
+++ b/server/pyproject.toml
@@ -32,7 +32,7 @@ einops = "^0.6.1"
tiktoken = "^0.5.2"
texttable = { version = "^1.6.7", optional = true }
datasets = { version = "^2.14.0", optional = true }
-torch = {version = "2.1.1+cu118", source = "torch"}
+torch = {version = "2.1.1", source = "torch"}
peft = "0.4.0"
boto3 = "^1.28.34"
urllib3 = "<=1.26.18"
I was able to build a smaller docker image with the above patch, but it's quite hacky.
REPOSITORY TAG IMAGE ID CREATED SIZE
test latest bedbe3725de5 17 hours ago 10.5GB
ghcr.io/predibase/lorax 0.4.1 36d7669de298 31 hours ago 17.5GB
Currently, the SGMV kernel will fail if the rank < 8, which is also an issue with tensor parallelism for ranks > 8. We should extend the kernel to support these cases:
I am trying to rebuild the Lorax docker image, which is failing in the punica-builder stage. Error logs are attached, could you advise?
My final goal is to make Lorax deployable on Sagemaker by adding back the entrypoint for a sagemaker stage which was originally in the Dockerfile.
Thanks.
docker build --target base -t lorax:base
build.log
Successful docker build.
I ran your docker image in 2 cases:
(--sharded false)
(--sharded false --num_shard 4)
{
"model_id": "Open-Orca/Mistral-7B-OpenOrca",
"adapter_id": "",
"source": "hub",
"adapter_source": "hub",
"revision": null,
"validation_workers": 2,
"sharded": true,
"num_shard": 4,
"quantize": "BitsandbytesNF4",
"dtype": null,
"trust_remote_code": false,
"max_concurrent_requests": 128,
"max_best_of": 1,
"max_stop_sequences": 4,
"max_input_length": 2048,
"max_total_tokens": 4096,
"waiting_served_ratio": 1.2,
"max_batch_prefill_tokens": 4096,
"max_batch_total_tokens": 100000,
"max_waiting_tokens": 20,
"max_active_adapters": 10,
"adapter_cycle_time_s": 2,
"hostname": "0.0.0.0",
"port": 8000,
"shard_uds_path": "/tmp/lorax-server",
"master_addr": "localhost",
"master_port": 29500,
"huggingface_hub_cache": "/data",
"weights_cache_override": null,
"disable_custom_kernels": false,
"cuda_memory_fraction": 1,
"json_output": true,
"otlp_endpoint": null,
"cors_allow_origin": [],
"watermark_gamma": null,
"watermark_delta": null,
"ngrok": false,
"ngrok_authtoken": null,
"ngrok_edge": null,
"env": false,
"download_only": false
}
Run docker with --sharded true --num_shard 4
Same or better performance when running multi-GPU
We should support a version of the REST API that mirrors the OpenAI completion API, similar to vLLM:
https://docs.vllm.ai/en/latest/getting_started/quickstart.html#openai-compatible-server
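A sketch of what a client call could look like once such an endpoint exists; the /v1/completions route and the use of the adapter ID as the model name are assumptions modeled on vLLM, not current LoRAX behavior:
import requests

resp = requests.post(
    "http://127.0.0.1:8080/v1/completions",
    json={
        "model": "qblocks/mistral_7b_norobots",  # adapter ID standing in for the model name
        "prompt": "What is deep learning?",
        "max_tokens": 64,
    },
    timeout=60,
)
print(resp.json())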
ghcr.io/predibase/lorax:0.4
failed to load gptq image
command: --model-id /mnt/local-model/Qwen-14B-Chat-Int4/ --quantize gptq --trust-remote-code
Using model:
2023-12-17T14:25:01.295949Z ERROR lorax_launcher: interceptor.py:41 Method Warmup encountered an error.
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/custom_autotune.py", line 90, in _bench
return triton.testing.do_bench(
TypeError: do_bench() got an unexpected keyword argument 'percentiles'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/bin/lorax-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 84, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 277, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
handle._run()
File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
> File "/opt/conda/lib/python3.10/site-packages/lorax_server/interceptor.py", line 38, in intercept
return await response
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
raise error
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 74, in Warmup
max_supported_total_tokens = self.model.warmup(batch)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 864, in warmup
_, batch = self.generate_token(batch)
File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 963, in generate_token
raise e
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 960, in generate_token
out = self.forward(batch, adapter_data)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 919, in forward
return self.model.forward(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_qwen_modeling.py", line 475, in forward
hidden_states = self.transformer(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_qwen_modeling.py", line 432, in forward
hidden_states, residual = layer(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_qwen_modeling.py", line 357, in forward
attn_output = self.attn(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_qwen_modeling.py", line 226, in forward
qkv = self.c_attn(hidden_states, adapter_data)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/layers.py", line 481, in forward
result = self.base_layer(input)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/layers.py", line 285, in forward
return self.linear.forward(x)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/quant_linear.py", line 349, in forward
out = QuantLinearFunction.apply(
File "/opt/conda/lib/python3.10/site-packages/torch/autograd/function.py", line 539, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/opt/conda/lib/python3.10/site-packages/torch/cuda/amp/autocast_mode.py", line 121, in decorate_fwd
return fwd(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/quant_linear.py", line 244, in forward
output = matmul248(input, qweight, scales, qzeros, g_idx, bits, maxq)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/quant_linear.py", line 216, in matmul248
matmul_248_kernel[grid](
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/custom_autotune.py", line 110, in run
timings = {
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/custom_autotune.py", line 111, in <dictcomp>
config: self._bench(*args, config=config, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/custom_autotune.py", line 93, in _bench
except triton.compiler.OutOfResources:
AttributeError: module 'triton.compiler' has no attribute 'OutOfResources'
2023-12-17T14:25:01.296225Z ERROR warmup{max_input_length=1024 max_prefill_tokens=4096}:warmup: lorax_client: router/client/src/lib.rs:33: Server error: module 'triton.compiler' has no attribute 'OutOfResources'
Error: Warmup(Generation("module 'triton.compiler' has no attribute 'OutOfResources'"))
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
ghcr.io/predibase/lorax:0.4 --model-id Qwen/Qwen-14B-Chat-Int4
It runs ok with ghcr.io/predibase/lorax:0.3
I'm not sure how I managed it, but it seems like I have got a training script that creates a LoRA that loads but has little to no effect. I have been modifying the qLoRA script to try to work with lorax.
By any chance can you point me to a training script that is known to work effectively with this system? Apologies if this is too obvious. Most of what I see when I search for fine-tuning examples right now are for qLoRA or 8 bit.
Just trying to verify that I didn't do something wrong.
Happy to test whatever you suggest with our dataset, which, to be honest, might be part of the problem. But I'm not sure.
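Not an official recipe, but a minimal PEFT LoRA setup (fp16, q_proj/v_proj targets, no quantization) that writes a standard adapter_config.json plus adapter weights of the kind LoRAX loads; the model ID and hyperparameters are placeholders:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                  lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()

# ... train with the Trainer of your choice, then:
model.save_pretrained("./my-adapter")  # writes adapter_config.json + adapter weights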
I am trying to run the bloom-7b1 model using docker locally [both my model and adapters]. Here is the script for running bloom on lorax:
#!/bin/bash
# PATH to model
model="bloom-7b1"
# VOLUME to share: pwd/../models -> /data
volume=$PWD/../models:/data # share a volume with the Docker container to avoid downloading weights every run
echo $volume
docker run --gpus 0 --shm-size 1g -p 7070:80 \
--volume $volume \
ghcr.io/predibase/lorax:latest \
--model-id /data/$model \
--num-shard 1 \
--quantize bitsandbytes-nf4 \
--max-concurrent-requests 256 \
--cuda-memory-fraction 0.5 \
When I pass the adapter located at peft-models/bloom-alpaca-ne, I get the error shown in the attached screenshot:
We return the number of generated tokens when details=True, but systems like the OpenAI API also return the number of input tokens. This is useful, for example, for metering systems that limit users based on the number of input + output tokens.
Current:
'details': {'generated_tokens': 20}
Proposal:
'details': {'prompt_tokens': 120, 'generated_tokens': 20}
Extend the testing with tiny dummy models.
Some CPU-based tests, for example for quantization, decoding, and model loading across architectures.
I can open a PR.
Hi
I got the following error:
NotImplementedError: Mistral model requires flash attention v2
I tried using model=TheBloke/Mistral-7B-Instruct-v0.1-GPTQ
Thanks
Example:
mistralai/Mistral-7B-v0.1 --quantize bitsandbytes-nf4
Request:
curl 127.0.0.1:8080/generate \
-X POST \
-d '{"inputs": "<|system|> You are a helpful assistant <|user|> What is deep learning? </s> <|assistant|>", "parameters": {"max_new_tokens": 64, "adapter_id": "qblocks/mistral_7b_norobots"}}' \
-H 'Content-Type: application/json'
Response:
{"generated_text":""}
Expected:
{"generated_text":"Deep learning is a subset of machine learning that uses artificial neural networks to learn from data. It is a powerful tool for solving complex problems in fields such as natural language processing, computer vision, and speech recognition. Deep learning algorithms can learn from large amounts of data and make predictions or decisions based on that data. They can"}
Really cool project! I'm wondering how it's different from S-LoRA? https://github.com/S-LoRA/S-LoRA
When querying a base model with an adapter that has NaN or Inf weight tensors, LoRAX returns the following error:
The output tensors do not match for key base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight
It would be more helpful if the error message indicated that the reason the tensors don't match during the merge is that LoRAX detected NaN/Inf tensors in the adapter weights.
This would provide a rectifiable/actionable path for users who fine-tuned models and are testing them out, making clear that this isn't an issue with LoRAX but rather with their trained adapter weights.
Happy to help surface a better error message! Seems like the issue is raised from this line in particular?
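A hedged sketch of the kind of pre-merge check that could surface a clearer error, scanning the adapter tensors for NaN/Inf before comparing outputs (the file path is illustrative):
import torch
from safetensors.torch import load_file

weights = load_file("adapter_model.safetensors")  # illustrative path
bad = [name for name, tensor in weights.items()
       if torch.isnan(tensor).any() or torch.isinf(tensor).any()]
if bad:
    raise ValueError(f"Adapter contains NaN/Inf weights in: {bad}")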
lorax:latest
@tgaddair I have a few adapters fine-tuned using GPT2 as the base model.
Architecture of GPT2:
GPT2LMHeadModel(
(transformer): GPT2Model(
(wte): Embedding(50257, 768)
(wpe): Embedding(1024, 768)
(drop): Dropout(p=0.1, inplace=False)
(h): ModuleList(
(0-11): 12 x GPT2Block(
(ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(attn): GPT2Attention(
(c_attn): Conv1D()
(c_proj): Conv1D()
(attn_dropout): Dropout(p=0.1, inplace=False)
(resid_dropout): Dropout(p=0.1, inplace=False)
)
(ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(mlp): GPT2MLP(
(c_fc): Conv1D()
(c_proj): Conv1D()
(act): NewGELUActivation()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
(ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
(lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
The adapters are fine-tuned on the "c_attn" and "c_proj" layers; does lorax currently support this?
Question about compatibility.
Model info:
{
"model_id": "mistralai/Mistral-7B-Instruct-v0.1",
"model_sha": "7ad5799710574ba1c1d953eba3077af582f3a773",
"model_dtype": "torch.float16",
"model_device_type": "cuda",
"model_pipeline_tag": "text-generation",
"max_concurrent_requests": 128,
"max_best_of": 2,
"max_stop_sequences": 4,
"max_input_length": 1024,
"max_total_tokens": 2048,
"waiting_served_ratio": 1.2,
"max_batch_total_tokens": 1102544,
"max_waiting_tokens": 20,
"validation_workers": 2,
"version": "0.1.0",
"sha": null,
"docker_label": null
}
2 A100 gpus, NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2
outside docker.
Run mistral example with docker on 2 gpus:
model=mistralai/Mistral-7B-Instruct-v0.1
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/predibase/lorax:latest --model-id $model --num-shard 2
Then try to generate:
❯ curl 127.0.0.1:8080/generate \
-X POST \
-d '{"inputs": "[INST] Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? [/INST]", "parameters": {"max_new_tokens": 64, "adapter_id": "vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k"}}' \
-H 'Content-Type: application/json'
{"error":"Request failed during generation: Server error: local variable 'lora_b' referenced before assignment","error_type":"generation"}%
Basically the issue is that when trying to multiply by the first lora_a matrix, we get it sharded with shape [2048, r], while the input is not sharded and has shape [49, 4096].
Generation completed successfully
There is existing code in LoRAX towards 8-bit bitsandbytes quantization:
lorax/server/text_generation_server/utils/layers.py, lines 86 to 95 at commit c747d27
Supporting 4bit bitsandbytes quantization would enable us to serve models trained in 4bit.
The change should involve following the patterns implemented for 8bit quantization.
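By analogy with that 8-bit path, a sketch of what wrapping a linear layer in bitsandbytes' NF4 module might look like; this is an assumption-laden illustration, not LoRAX's actual code, and the Params4bit handling in particular may differ across bitsandbytes versions:
import torch
import bitsandbytes as bnb

def quantize_linear_nf4(linear: torch.nn.Linear) -> bnb.nn.Linear4bit:
    qlinear = bnb.nn.Linear4bit(
        linear.in_features,
        linear.out_features,
        bias=linear.bias is not None,
        compute_dtype=torch.float16,
        quant_type="nf4",
    )
    # Hand the fp16 weights to a 4-bit parameter; quantization happens on .cuda().
    qlinear.weight = bnb.nn.Params4bit(linear.weight.data, requires_grad=False, quant_type="nf4")
    if linear.bias is not None:
        qlinear.bias = torch.nn.Parameter(linear.bias.data)
    return qlinear.cuda()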
Currently the user can configure dynamic RoPE scaling by setting the environment variables ROPE_SCALING and ROPE_FACTOR like so:
export ROPE_SCALING=dynamic
export ROPE_FACTOR=2
But this is very clunky and not documented. We should add these as CLI args to the lorax-launcher so they can be better documented and less error prone.
WIP project roadmap for LoRAX. We'll continue to update this over time.
Lorax version: 0.4.1
Lorax_launcher: 0.1.0
Model: mistralai/Mixtral-8x7B-Instruct-v0.1
GPUS: 3090 (24 gb) P100 (16 gb)
model= mistralai/Mixtral-8x7B-Instruct-v0.1
volume=$PWD/data
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/predibase/lorax:latest --model-id $model --quantize bitsandbytes-nf4 --trust-remote-code
Upon executing this code I receive the following traceback:
Traceback (most recent call last):
File "/opt/conda/bin/lorax-server", line 8, in
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in call
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in call
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 84, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 271, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
handle._run()
File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 223, in serve_inner
model = get_model(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/init.py", line 305, in get_model
return FlashMixtral(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_mixtral.py", line 346, in init
SLIDING_WINDOW_BLOCKS = math.ceil(config.sliding_window / BLOCK_SIZE)
TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'
2023-12-19T02:06:51.578108Z ERROR shard-manager: lorax_launcher: Shard complete standard error output:
Traceback (most recent call last):
File "/opt/conda/bin/lorax-server", line 8, in
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 84, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 271, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 223, in serve_inner
model = get_model(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/init.py", line 305, in get_model
return FlashMixtral(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_mixtral.py", line 346, in init
SLIDING_WINDOW_BLOCKS = math.ceil(config.sliding_window / BLOCK_SIZE)
TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'
It appears that config.sliding_window is somehow set to None, which doesn't make sense, because an if statement a few lines above forces it to equal config.max_position_embeddings.
I will be out of town, so I do not have a chance to build the server locally, but I can look at it when I get back.
The mixtral model runs without issue.
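In the meantime, a defensive guard along these lines would avoid the crash (a hedged sketch, not the actual upstream fix; the BLOCK_SIZE value is assumed):

import math

BLOCK_SIZE = 16  # assumed paging block size

def sliding_window_blocks(config) -> int:
    # Mixtral configs can ship with "sliding_window": null; fall back to
    # max_position_embeddings before computing the block count.
    sliding_window = getattr(config, "sliding_window", None)
    if sliding_window is None:
        sliding_window = config.max_position_embeddings
    return math.ceil(sliding_window / BLOCK_SIZE)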
Hello, this is the end of the Dockerfile build log (I am on the EC2 instance I was using for inference):
[19/49] /opt/conda/bin/nvcc -I/usr/src/flash-attention-v2/csrc/flash_attn -I/usr/src/flash-attention-v2/csrc/flash_attn/src -I/usr/src/flash-attention-v2/csrc/cutlass/include -I/opt/conda/lib/python3.9/site-packages/torch/include -I/opt/conda/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.9/site-packages/torch/include/TH -I/opt/conda/lib/python3.9/site-packages/torch/include/THC -I/opt/conda/include -I/opt/conda/include/python3.9 -c -c /usr/src/flash-attention-v2/csrc/flash_attn/src/flash_bwd_hdim64_bf16_sm80.cu -o /usr/src/flash-attention-v2/build/temp.linux-x86_64-cpython-39/csrc/flash_attn/src/flash_bwd_hdim64_bf16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 4 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
/opt/conda/lib/python3.9/site-packages/torch/include/c10/util/irange.h(54): warning #186-D: pointless comparison of unsigned integer with zero
detected during:
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator==(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=size_t, one_sided=false, =0]"
(61): here
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator!=(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=size_t, one_sided=false, =0]"
/opt/conda/lib/python3.9/site-packages/torch/include/c10/core/TensorImpl.h(77): here
/opt/conda/lib/python3.9/site-packages/torch/include/c10/util/irange.h(54): warning #186-D: pointless comparison of unsigned integer with zero
detected during:
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator==(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=size_t, one_sided=true, =0]"
(61): here
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator!=(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=size_t, one_sided=true, =0]"
/opt/conda/lib/python3.9/site-packages/torch/include/c10/core/TensorImpl.h(2327): here
instantiation of "__nv_bool c10::TensorImpl::SetDimsTemplate(c10::ArrayRef) [with T=int64_t, =void]"
/opt/conda/lib/python3.9/site-packages/torch/include/c10/core/TensorImpl.h(2337): here
/opt/conda/lib/python3.9/site-packages/torch/include/c10/util/irange.h(54): warning #186-D: pointless comparison of unsigned integer with zero
detected during:
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator==(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=size_t, one_sided=false, =0]"
(61): here
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator!=(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=size_t, one_sided=false, =0]"
/opt/conda/lib/python3.9/site-packages/torch/include/c10/core/TensorImpl.h(77): here
/opt/conda/lib/python3.9/site-packages/torch/include/c10/util/irange.h(54): warning #186-D: pointless comparison of unsigned integer with zero
detected during:
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator==(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=size_t, one_sided=true, =0]"
(61): here
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator!=(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=size_t, one_sided=true, =0]"
/opt/conda/lib/python3.9/site-packages/torch/include/c10/core/TensorImpl.h(2327): here
instantiation of "__nv_bool c10::TensorImpl::SetDimsTemplate(c10::ArrayRef) [with T=int64_t, =void]"
/opt/conda/lib/python3.9/site-packages/torch/include/c10/core/TensorImpl.h(2337): here
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/opt/conda/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
subprocess.run(
File "/opt/conda/lib/python3.9/subprocess.py", line 528, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/src/flash-attention-v2/setup.py", line 288, in
setup(
File "/opt/conda/lib/python3.9/site-packages/setuptools/init.py", line 87, in setup
return distutils.core.setup(**attrs)
File "/opt/conda/lib/python3.9/site-packages/setuptools/_distutils/core.py", line 185, in setup
return run_commands(dist)
File "/opt/conda/lib/python3.9/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
dist.run_commands()
File "/opt/conda/lib/python3.9/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
self.run_command(cmd)
File "/opt/conda/lib/python3.9/site-packages/setuptools/dist.py", line 1208, in run_command
super().run_command(command)
File "/opt/conda/lib/python3.9/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
cmd_obj.run()
File "/opt/conda/lib/python3.9/site-packages/setuptools/_distutils/command/build.py", line 132, in run
self.run_command(cmd_name)
File "/opt/conda/lib/python3.9/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
self.distribution.run_command(command)
File "/opt/conda/lib/python3.9/site-packages/setuptools/dist.py", line 1208, in run_command
super().run_command(command)
File "/opt/conda/lib/python3.9/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
cmd_obj.run()
File "/opt/conda/lib/python3.9/site-packages/setuptools/command/build_ext.py", line 84, in run
_build_ext.run(self)
File "/opt/conda/lib/python3.9/site-packages/setuptools/_distutils/command/build_ext.py", line 346, in run
self.build_extensions()
File "/opt/conda/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 843, in build_extensions
build_ext.build_extensions(self)
File "/opt/conda/lib/python3.9/site-packages/setuptools/_distutils/command/build_ext.py", line 468, in build_extensions
self._build_extensions_serial()
File "/opt/conda/lib/python3.9/site-packages/setuptools/_distutils/command/build_ext.py", line 494, in _build_extensions_serial
self.build_extension(ext)
File "/opt/conda/lib/python3.9/site-packages/setuptools/command/build_ext.py", line 246, in build_extension
_build_ext.build_extension(self, ext)
File "/opt/conda/lib/python3.9/site-packages/setuptools/_distutils/command/build_ext.py", line 549, in build_extension
objects = self.compiler.compile(
File "/opt/conda/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 658, in unix_wrap_ninja_compile
_write_ninja_file_and_compile_objects(
File "/opt/conda/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1574, in _write_ninja_file_and_compile_objects
_run_ninja_build(
File "/opt/conda/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1909, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error compiling objects for extension
make: *** [Makefile:10: build-flash-attention-v2] Error 1
The command '/bin/sh -c make build-flash-attention-v2' returned a non-zero code: 2
I appreciate any time you have for hints.
To reproduce: clone the repo and run ./build.sh. Expected behavior: the build completes.
This should work similarly to Ludwig: take the hidden state, remove the LM head, and swap in an MLP for either embedding generation or classification/regression.
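A minimal sketch of the idea, assuming a Hugging Face causal LM; the model name, pooling choice, and dimensions below are illustrative, not LoRAX's actual implementation:

import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM

# Load any causal LM; we only use its transformer body, not the LM head.
base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

class MLPHead(nn.Module):
    """Small task head that replaces the LM head for classification/regression."""
    def __init__(self, hidden_size: int, num_labels: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, num_labels),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Pool the last token's hidden state (one of several possible pooling choices).
        return self.mlp(hidden_states[:, -1, :])

head = MLPHead(base.config.hidden_size, num_labels=3)

@torch.no_grad()
def predict(input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # base.model is the transformer body without the LM head (Mistral/Llama-style models).
    hidden = base.model(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
    return head(hidden)

For embedding generation, the MLP could simply project to the desired embedding size instead of num_labels.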
ghcr.io/predibase/lorax:latest
First, run
docker run --gpus all --shm-size 1g -p 8000:80 -e "HUGGING_FACE_HUB_TOKEN=<token>" -v $volume:/data ghcr.io/predibase/lorax:latest --trust-remote-code --model-id meta-llama/Llama-2-7b-chat-hf
Then, send the following requests:
from concurrent.futures import ThreadPoolExecutor
import time
import requests

url = "http://127.0.0.1:8000/generate"
headers = {"Content-Type": "application/json"}

# Function to send a request
def send_request(payload):
    start_time = time.time()
    response = requests.post(url, headers=headers, json=payload)
    elapsed_time = time.time() - start_time
    return response.json(), elapsed_time

# Number of concurrent requests
adapter_list = [<list of 15 adapters>]
num_requests = len(adapter_list)

# Using ThreadPoolExecutor to send requests concurrently
with ThreadPoolExecutor(max_workers=num_requests) as executor:
    # Submit the requests
    futures = []
    for i in range(num_requests):
        payload = {
            "inputs": "Hello, my name is",
            "parameters": {"max_new_tokens": 100, "adapter_id": adapter_list[i]},
        }
        futures.append(executor.submit(send_request, payload))
    # Wait for all requests to complete
    results = [future.result() for future in futures]

print(results)
First of all, awesome work with LoRAX! When sending requests concurrently like this, I'm receiving CUDA out-of-memory errors on the server. I thought the server would check beforehand how many tokens it can handle and enqueue requests that it cannot serve yet. Have you encountered this before?
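One possible client-side mitigation (a hedged sketch, not an official LoRAX feature), reusing send_request and adapter_list from the snippet above: cap how many requests are in flight at once so the server sees a smaller concurrent batch.

from concurrent.futures import ThreadPoolExecutor

MAX_IN_FLIGHT = 4  # illustrative value; tune to the GPU's memory

with ThreadPoolExecutor(max_workers=MAX_IN_FLIGHT) as executor:
    futures = [
        executor.submit(
            send_request,
            {
                "inputs": "Hello, my name is",
                "parameters": {"max_new_tokens": 100, "adapter_id": adapter_id},
            },
        )
        for adapter_id in adapter_list
    ]
    results = [future.result() for future in futures]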
I tried LoRA fine-tuning a smaller variant of the Mistral architecture, but I am getting the error below:
GenerationError: Request failed during generation: Server error: Unsupported head size: 32
I used rank 16 and alpha 32.
https://huggingface.co/Locutusque/TinyMistral-248M-Instruct
It should have worked since it follows the Mistral architecture (TinyLlama was working fine).
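For reference, the head size the kernel rejects is hidden_size divided by num_attention_heads; a quick hedged check (the exact set of supported head sizes depends on the attention kernels bundled in the image):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("Locutusque/TinyMistral-248M-Instruct")
head_size = config.hidden_size // config.num_attention_heads
print(head_size)  # the error message reports 32, a size the server's attention kernels don't support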
Hi there, thank you for this excellent project.
I see that the launcher exposes --master-addr <MASTER_ADDR> and --master-port <MASTER_PORT>
parameters when running the server.
Do you have a guide on using torch distributed with this project? It would be really helpful for running across multiple machines.
Thank you.
model=mistralai/Mistral-7B-Instruct-v0.1
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/predibase/lorax:latest --model-id $model
More detail about torch distributed would be appreciated.