Comments
how can I serve my local vicuna-13B model? Is that supported?

Yes, it's supported. You can pass an additional parameter, --adapter-source local, which indicates that the --adapter-id is a local file path pointing to your local model.
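For example, a minimal sketch of such a launch (the model and adapter paths here are placeholders; the flags are the ones described above):

# serve a local base model together with a local adapter (paths are hypothetical)
lorax-launcher \
  --model-id /data/vicuna-13b/vicuna-13b-v1.5/ \
  --adapter-source local \
  --adapter-id /data/adapters/my-lora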
Hi @sleepwalker2017, I think the issue with loading the adapter is the same as the issue in #311, which should be fixed by #317. There should be no issue with serving your local vicuna-13b model following this change, but let me know if you run into more issues.
Regarding your second question: when you say you want to specify multiple local LoRAs, do you mean you wish to merge them together? If so, we support this, just not during initialization. We have some docs on LoRA merging here. But if you mean you want to be able to call different adapters for each request, you can do so by specifying the adapter_id (can be a local file) in the request instead of during initialization.
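For example, a minimal sketch of a per-request adapter against the generate endpoint (the prompt and the local adapter path are placeholders; the address assumes the port mapping used elsewhere in this thread):

# each request can name its own adapter_id, including a local path
curl http://127.0.0.1:8080/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{
    "inputs": "What is deep learning?",
    "parameters": {
      "adapter_id": "/data/adapters/my-lora",
      "max_new_tokens": 64
    }
  }'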
When I use my code to benchmark LoRAX on two A30 GPUs, I get poor benchmark results. There must be something wrong. Please take a look, thank you.
When I try this:
model=codellama/CodeLlama-13b-hf
docker run --gpus all --shm-size 1g -p 8080:80 -v /data/:/data \
ghcr.io/predibase/lorax:latest --model-id $model --sharded true --num-shard 2 \
--adapter-id shibing624/llama-13b-belle-zh-lora
I get this error:
OSError: decapoda-research/llama-13b-hf is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo either by logging in with `huggingface-cli login` or by passing `token=<your_token>`
rank=1
I wonder where this decapoda-research/llama-13b-hf comes from? I didn't specify it.
Finally, I want to know: how can I serve my local vicuna-13B model? Is that supported?
Another question: I want to specify multiple local LoRAs when launching the server. How can I achieve this?
specify multiple local LoRAs when launching the server

It seems this feature is already on the roadmap (LoRA blending): #57
Thank you, I checked this and the first issue has been solved.
About the second question: I launch the server with this:

lorax-launcher --model-id /data/vicuna-13b/vicuna-13b-v1.5/ --sharded true --num-shard 2

and then send requests with different LoRA IDs like this:
adapters = ['mattreid/alpaca-lora-13b', 'merror/llama_13b_lora_beauty',
            'shibing624/llama-13b-belle-zh-lora', 'shibing624/ziya-llama-13b-medical-lora']

def build_request(output_len):
    global req_cnt
    idx = req_cnt % len(test_data)
    lora_id = idx % 4  # cycle through the four adapters
    input_dict = {
        "inputs": test_data[idx],
        "parameters": {
            "adapter_id": adapters[lora_id],
            "max_new_tokens": 256,
            "top_p": 0.7
        }
    }
    req_cnt += 1
    return input_dict
I want to know whether the adapter computations for these requests are merged, i.e., whether requests that use different adapters are batched together.