Comments (9)
Here's the trace:
from lorax.
We’re running H100s on NebiusAI Kubernetes. I’ll have to get back to you on Tuesday with info on the drivers.
Hey @karlbernard2, thanks for reporting. It sounds like there's a deadlock occurring here that may be triggered under very specific conditions (requests coming in at just the wrong time). Can you share any additional details about your setup (args to `lorax-launcher`, for example) that can help with reproducing the error?
One thing that stands out from the logs you provided is that the adapter `NextDayAI/xtraspicy1.0_13b_r32_800` was loaded, successfully processed a request, then offloaded, then loaded back, but never successfully processed any other requests. It's curious that it was offloaded at all, as it looks like only two adapters were loaded, while by default we will allow up to 128 to be loaded before doing any offloading. So whatever is causing the deadlock may be related to that behavior.
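The expected policy described here (adapters stay resident until a capacity limit, 128 by default, is exceeded) can be modeled as a simple capacity-bounded LRU cache. This is only an illustrative sketch of that policy, not LoRAX's actual loader code; the class and method names are made up for the example:

```python
from collections import OrderedDict

class AdapterCache:
    """Toy model of a capacity-bounded adapter cache: adapters are only
    offloaded (evicted) once more than `capacity` are resident."""

    def __init__(self, capacity=128):
        self.capacity = capacity
        self.resident = OrderedDict()  # adapter_id -> status
        self.offloaded = []            # eviction log, oldest first

    def load(self, adapter_id):
        # Refresh recency if the adapter is already resident.
        if adapter_id in self.resident:
            self.resident.move_to_end(adapter_id)
            return
        # Evict least-recently-used adapters only when at capacity.
        while len(self.resident) >= self.capacity:
            evicted, _ = self.resident.popitem(last=False)
            self.offloaded.append(evicted)
        self.resident[adapter_id] = "Ready"

# With the default capacity of 128, loading only two adapters should
# never trigger an offload -- which is why the log above is surprising.
cache = AdapterCache(capacity=128)
cache.load("NextDayAI/xtraspicy1.0_13b_r32_800")
cache.load("__base_model__")
```

Under this model, any `offloaded` log line with only two adapters in play would indicate the eviction path is being taken when it shouldn't be.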
I'll try and take a closer look, but if there's anything you can provide to help me repro that would be helpful.
The fact that the `/health` endpoint is unresponsive but the `/info` endpoint works would suggest that there's an issue with the Python server, rather than the router. It's possible that the Python server is stuck on some operation.
Something you could try:
- Make sure your container is running in privileged mode by adding `SYS_PTRACE` to the security context of the container as shown here.
- SSH into the pod with `kubectl exec -it <pod_name> -- /bin/bash`
- Install py-spy so you can get a backtrace from the Python server: `pip install py-spy`
- Find the Python server process: `ps aux | grep python`
- Run py-spy on the Python server to obtain the backtrace: `sudo py-spy dump -p <pid>`
If you're able to run that on one of the hung pods, that would be very helpful for debugging the error.
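For reference, granting the `SYS_PTRACE` capability in a Kubernetes pod spec looks roughly like the fragment below. This is a sketch only: the container name and image match the deployment shared later in this thread, and the `securityContext.capabilities` fields are standard Kubernetes pod-spec fields.

```yaml
# Sketch: grant the container the ptrace capability so py-spy can attach.
containers:
  - name: lorax-container
    image: ghcr.io/predibase/lorax:latest
    securityContext:
      capabilities:
        add:
          - SYS_PTRACE
```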
Thanks for the detailed instructions, I'll try to do that.
Here's how I launched the container:
```yaml
containers:
  - name: lorax-container
    image: ghcr.io/predibase/lorax:latest
    ports:
      - containerPort: 8001
    env:
      - name: HUGGING_FACE_HUB_TOKEN
        value: hf_secret
      - name: PORT
        value: "8001"
      - name: ROPE_SCALING
        value: "dynamic"
      - name: ROPE_FACTOR
        value: "2.0"
    args:
      - "--max-input-length=7900"
      - "--max-total-tokens=8192"
      - "--max-batch-prefill-tokens=8192"
      - "--model-id=NextDayAI/extraspicy"
```
@tgaddair My first attempt to replicate didn't have the same issue (although earlier today I got it all the time, so I will try more).
However, since you mentioned offloading that shouldn't happen, you might find these logs strange:
```
2023-12-23T03:04:06.827779Z INFO HTTP request{otel.name=POST /generate http.client_ip= http.flavor=1.1 http.host=inference.spicychat.ai:7001 http.method=POST http.route=/generate http.scheme=HTTP http.target=/generate http.user_agent=axios/1.4.0 otel.kind=server trace_id=1b65c393d478941d6b16797446fc1519}:generate{parameters=GenerateParameters { adapter_id: Some("NextDayAI/xtraspicy1.0_13b_r32_720_adapter"), adapter_source: None, api_token: None, best_of: None, temperature: Some(0.7), repetition_penalty: Some(1.1), top_k: Some(90), top_p: Some(0.7), typical_p: None, do_sample: true, max_new_tokens: 180, return_full_text: None, truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None } total_time="3.414291187s" validation_time="3.11ms" queue_time="46.871µs" inference_time="3.41113458s" time_per_token="18.950747ms" seed="Some(18257989878521111275)"}: lorax_router::server: router/src/server.rs:298: Success
2023-12-23T03:04:07.058713Z INFO lorax_router::loader: router/src/loader.rs:241: adapter __base_model__ offloaded
2023-12-23T03:04:07.058731Z INFO lorax_router::queue: router/src/queue.rs:135: set adapter __base_model__ status to Downloaded
2023-12-23T03:04:07.095727Z INFO lorax_router::loader: router/src/loader.rs:197: adapter NextDayAI/xtraspicy1.0_13b_r32_760_adapter loaded
2023-12-23T03:04:07.095745Z INFO lorax_router::queue: router/src/queue.rs:135: set adapter NextDayAI/xtraspicy1.0_13b_r32_760_adapter status to Ready
2023-12-23T03:04:07.588268Z INFO HTTP request{otel.name=POST /generate http.client_ip= http.flavor=1.1 http.host=inference.spicychat.ai:7001 http.method=POST http.route=/generate http.scheme=HTTP http.target=/generate http.user_agent=axios/1.4.0 otel.kind=server trace_id=f94f4f52e417b624f40329c484ee954f}:generate{parameters=GenerateParameters { adapter_id: Some("NextDayAI/xtraspicy1.0_13b_r32_760_adapter"), adapter_source: None, api_token: None, best_of: None, temperature: Some(0.7), repetition_penalty: Some(1.1), top_k: Some(90), top_p: Some(0.7), typical_p: None, do_sample: true, max_new_tokens: 180, return_full_text: None, truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None } total_time="541.9253ms" validation_time="2.590788ms" queue_time="64.482779ms" inference_time="474.851989ms" time_per_token="23.742599ms" seed="Some(14941048975292732004)"}: lorax_router::server: router/src/server.rs:298: Success
2023-12-23T03:04:07.625223Z INFO HTTP request{otel.name=POST /generate http.client_ip= http.flavor=1.1 http.host=inference.spicychat.ai:7001 http.method=POST http.route=/generate http.scheme=HTTP http.target=/generate http.user_agent=axios/1.4.0 otel.kind=server trace_id=676504f7982ff96d058121f59d961dd4}:generate{parameters=GenerateParameters { adapter_id: Some("NextDayAI/xtraspicy1.0_13b_r32_800_adapter"), adapter_source: None, api_token: None, best_of: None, temperature: Some(0.7), repetition_penalty: Some(1.1), top_k: Some(90), top_p: Some(0.7), typical_p: None, do_sample: true, max_new_tokens: 300, return_full_text: None, stop: ["\nEva:", "\nShizuka:", "\n###"], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None } total_time="1.211921872s" validation_time="778.706µs" queue_time="13.944244ms" inference_time="1.197199151s" time_per_token="21.003493ms" seed="Some(5853084379632762489)"}: lorax_router::server: router/src/server.rs:298: Success
2023-12-23T03:04:09.218011Z INFO lorax_router::loader: router/src/loader.rs:241: adapter NextDayAI/xtraspicy1.0_13b_r32_800_adapter offloaded
2023-12-23T03:04:09.218033Z INFO lorax_router::queue: router/src/queue.rs:135: set adapter NextDayAI/xtraspicy1.0_13b_r32_800_adapter status to Downloaded
2023-12-23T03:04:09.218716Z INFO lorax_router::loader: router/src/loader.rs:241: adapter NextDayAI/xtraspicy1.0_13b_r32_720_adapter offloaded
2023-12-23T03:04:09.218739Z INFO lorax_router::queue: router/src/queue.rs:135: set adapter NextDayAI/xtraspicy1.0_13b_r32_720_adapter status to Downloaded
2023-12-23T03:04:09.239236Z INFO lorax_router::loader: router/src/loader.rs:197: adapter NextDayAI/xtraspicy1.0_13b_r32_400_adapter loaded
2023-12-23T03:04:09.239269Z INFO lorax_router::queue: router/src/queue.rs:135: set adapter NextDayAI/xtraspicy1.0_13b_r32_400_adapter status to Ready
2023-12-23T03:04:09.258709Z INFO lorax_router::loader: router/src/loader.rs:197: adapter NextDayAI/xtraspicy1.0_13b_r32_720_adapter loaded
2023-12-23T03:04:09.258724Z INFO lorax_router::queue: router/src/queue.rs:135: set adapter NextDayAI/xtraspicy1.0_13b_r32_720_adapter status to Ready
2023-12-23T03:04:09.701608Z INFO HTTP request{otel.name=POST /generate http.client_ip= http.flavor=1.1 http.host=inference.spicychat.ai:7001 http.method=POST http.route=/generate http.scheme=HTTP http.target=/generate http.user_agent=axios/1.4.0 otel.kind=server trace_id=a9611795aa4b0336970442689421c739}:generate{parameters=GenerateParameters { adapter_id: Some("NextDayAI/xtraspicy1.0_13b_r32_400_adapter"), adapter_source: None, api_token: None, best_of: None, temperature: Some(0.7), repetition_penalty: Some(1.1), top_k: Some(90), top_p: Some(0.7), typical_p: None, do_sample: true, max_new_tokens: 180, return_full_text: None, stop: ["\nYour roommate Amber :", "\nWinston:", "\n###"], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None } total_time="489.291069ms" validation_time="4.884384ms" queue_time="22.199583ms" inference_time="462.207337ms" time_per_token="24.326701ms" seed="Some(16751434335624526013)"}: lorax_router::server: router/src/server.rs:298: Success
```
A screenshot might be easier to read:
We are only dealing with 4 adapters.
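The load/offload churn is visible even mechanically: a small script can tally `loaded`/`offloaded` events per adapter from the router log lines. This is an illustrative sketch only; the regular expression just matches the loader lines as they appear in the trace above, and `tally_adapter_events` is a made-up helper name.

```python
import re
from collections import Counter

# Matches loader lines in the router log above, e.g.
# "... lorax_router::loader: router/src/loader.rs:241: adapter <id> offloaded"
EVENT_RE = re.compile(r"lorax_router::loader: \S+ adapter (\S+) (loaded|offloaded)")

def tally_adapter_events(log_lines):
    """Count load/offload events per adapter id from router log lines."""
    counts = Counter()
    for line in log_lines:
        m = EVENT_RE.search(line)
        if m:
            adapter, event = m.groups()
            counts[(adapter, event)] += 1
    return counts

# A few lines taken verbatim from the trace above.
sample = [
    "2023-12-23T03:04:07.058713Z INFO lorax_router::loader: router/src/loader.rs:241: adapter __base_model__ offloaded",
    "2023-12-23T03:04:09.218011Z INFO lorax_router::loader: router/src/loader.rs:241: adapter NextDayAI/xtraspicy1.0_13b_r32_800_adapter offloaded",
    "2023-12-23T03:04:09.258709Z INFO lorax_router::loader: router/src/loader.rs:197: adapter NextDayAI/xtraspicy1.0_13b_r32_720_adapter loaded",
]
counts = tally_adapter_events(sample)
```

Run over the full log, any nonzero `offloaded` count with only 4 adapters in rotation points at the unexpected eviction behavior.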
Thanks for the backtrace @karlbernard2, this is very helpful!
Definitely looks like the hanging is occurring in the SGMV kernel.
In the short term, you can try disabling SGMV with an environment variable: `DISABLE_SGMV=1`. That's not a great long-term solution since SGMV is very fast when you have lots of adapters, but it should at least unblock you while I try and repro the issue, and the performance hit shouldn't be very noticeable with fewer than 10 adapters.
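In a Kubernetes deployment like the one shared earlier in the thread, the workaround would be just another entry in the container's `env` list. Sketch only; `DISABLE_SGMV=1` is the variable named above, the surrounding fields are standard Kubernetes pod-spec fields:

```yaml
env:
  - name: DISABLE_SGMV
    value: "1"
```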
I'll see if I can repro this behavior with the adapters you're using here.
Hey @karlbernard2, update on this: I tried running some stress tests today with a variety of request patterns to try and replicate your setup, but was unable to trigger the hanging behavior.
Can you share a few more details about your environment?
- What GPU are you running on?
- What Nvidia device driver version are you using (from `nvidia-smi`)?
- Is this running on prem or in the cloud? If cloud, which one?
Thanks.
Hey @karlbernard2, I managed to track down the root cause of the deadlock, and it has been fixed in #156.