Comments (10)
Hey @hayleyhu, thanks for reporting this. This is a surprising error. Could you try running the same command with the environment variable RUST_BACKTRACE=1 set, and share the full log output?
Example:
docker run -e RUST_BACKTRACE=1 ...
Hello @tgaddair ,
I encountered the same problem when testing the image "ghcr.io/predibase/lorax:latest". Here are the logs:
docker run --gpus '"device=7"' -e RUST_BACKTRACE=1 --shm-size 1g -p 8081:80 -v /model_dir:/data ghcr.io/predibase/lorax:latest --model-id /data/Qwen-14B-Chat --trust-remote-code
2024-03-11T02:31:06.117503Z INFO lorax_launcher: Args { model_id: "/data/Qwen-14B-Chat", adapter_id: None, source: "hub", adapter_source: "hub", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, compile: false, dtype: None, trust_remote_code: true, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, max_active_adapters: 128, adapter_cycle_time_s: 2, adapter_memory_fraction: 0.0, hostname: "3ef400c8e367", port: 80, shard_uds_path: "/tmp/lorax-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, json_output: false, otlp_endpoint: None, cors_allow_origin: [], cors_allow_header: [], cors_expose_header: [], cors_allow_method: [], cors_allow_credentials: None, watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false, download_only: false }
2024-03-11T02:31:06.117556Z WARN lorax_launcher: `trust_remote_code` is set. Trusting that model `/data/Qwen-14B-Chat` do not contain malicious code.
2024-03-11T02:31:06.117744Z INFO download: lorax_launcher: Starting download process.
2024-03-11T02:31:09.676052Z INFO lorax_launcher: cli.py:109 Files are already present on the host. Skipping download.
2024-03-11T02:31:10.721726Z INFO download: lorax_launcher: Successfully downloaded weights.
2024-03-11T02:31:10.722129Z INFO shard-manager: lorax_launcher: Starting shard rank=0
2024-03-11T02:31:20.730706Z INFO shard-manager: lorax_launcher: Waiting for shard to be ready... rank=0
2024-03-11T02:31:25.287915Z INFO lorax_launcher: server.py:291 Server started at unix:///tmp/lorax-server-0
2024-03-11T02:31:25.334414Z INFO shard-manager: lorax_launcher: Shard ready in 14.611113274s rank=0
2024-03-11T02:31:25.432031Z INFO lorax_launcher: Starting Webserver
2024-03-11T02:31:25.464515Z INFO lorax_router: router/src/main.rs:202: Loading tokenizer /data/Qwen-14B-Chat
2024-03-11T02:31:25.464578Z INFO lorax_router: router/src/main.rs:206: Using local tokenizer: /data/Qwen-14B-Chat
2024-03-11T02:31:25.464601Z WARN lorax_router: router/src/main.rs:251: Could not find a fast tokenizer implementation for /data/Qwen-14B-Chat
2024-03-11T02:31:25.464605Z WARN lorax_router: router/src/main.rs:252: Rust input length validation and truncation is disabled
2024-03-11T02:31:25.464609Z WARN lorax_router: router/src/main.rs:277: no pipeline tag found for model /data/Qwen-14B-Chat
2024-03-11T02:31:25.485387Z INFO lorax_router: router/src/main.rs:296: Warming up model
2024-03-11T02:31:57.331056Z INFO lorax_launcher: flash_causal_lm.py:781 Memory remaining for kv cache: 3082.375 MB
2024-03-11T02:31:57.572087Z INFO lorax_router: router/src/main.rs:335: Setting max batch total tokens to 12128
2024-03-11T02:31:57.572120Z INFO lorax_router: router/src/main.rs:336: Connected
2024-03-11T02:31:57.572134Z WARN lorax_router: router/src/main.rs:341: Invalid hostname, defaulting to 0.0.0.0
2024-03-11T02:31:57.573058Z INFO lorax_router::server: router/src/server.rs:974: CORS: origin: Const("*"), methods: Const(Some("GET,POST")), headers: Const(Some("content-type")), expose-headers: Const(None) credentials: No
2024-03-11T02:31:57.573079Z INFO lorax_router::server: router/src/server.rs:986: CORS: CorsLayer { allow_credentials: No, allow_headers: Const(Some("content-type")), allow_methods: Const(Some("GET,POST")), allow_origin: Const("*"), allow_private_network: No, expose_headers: Const(None), max_age: Exact(None), vary: Vary(["origin", "access-control-request-method", "access-control-request-headers"]) }
thread 'tokio-runtime-worker' panicked at /usr/src/router/src/server.rs:794:26:
called `Option::unwrap()` on a `None` value
stack backtrace:
0: rust_begin_unwind
at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/std/src/panicking.rs:597:5
1: core::panicking::panic_fmt
at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/core/src/panicking.rs:72:14
2: core::panicking::panic
at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/core/src/panicking.rs:127:5
3: core::option::Option<T>::unwrap
at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/core/src/option.rs:935:21
4: lorax_router::server::request_logger::{{closure}}
at ./router/src/server.rs:794:22
5: tokio::runtime::task::core::Core<T,S>::poll::{{closure}}
at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/task/core.rs:328:17
6: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/loom/std/unsafe_cell.rs:16:9
7: tokio::runtime::task::core::Core<T,S>::poll
at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/task/core.rs:317:30
8: std::panicking::try::do_call
at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/std/src/panicking.rs:504:40
9: std::panicking::try
at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/std/src/panicking.rs:468:19
10: std::panic::catch_unwind
at /rustc/79e9716c980570bfd1f666e3b16ac583f0168962/library/std/src/panic.rs:142:14
11: tokio::runtime::task::harness::poll_future
at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/task/harness.rs:473:18
12: tokio::runtime::task::harness::Harness<T,S>::poll_inner
at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/task/harness.rs:208:27
13: tokio::runtime::task::harness::Harness<T,S>::poll
at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/task/harness.rs:153:15
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
2024-03-11T02:31:57.855398Z ERROR lorax_launcher: Webserver Crashed
2024-03-11T02:31:57.855429Z INFO lorax_launcher: Shutting down shards
2024-03-11T02:31:57.931701Z INFO shard-manager: lorax_launcher: Shard terminated rank=0
Error: WebserverFailed
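For context on the panic in the trace above: Rust's Option::unwrap() aborts when the value is None. A minimal Python analogue of that pattern, just to illustrate the failure class and a guarded alternative (the helper names below are hypothetical, not lorax code):

```python
from typing import Optional


def unwrap(value: Optional[str]) -> str:
    """Mimic Rust's Option::unwrap(): fail loudly when the value is absent."""
    if value is None:
        raise ValueError("called unwrap() on a None value")
    return value


def unwrap_or(value: Optional[str], default: str) -> str:
    """Mimic Option::unwrap_or(): fall back to a default instead of failing."""
    return value if value is not None else default
```

A fix for this class of bug typically replaces the bare unwrap with a default or a propagated error rather than crashing the worker thread.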
Hey @Nipi64310, thanks for providing this additional context. Unfortunately, it looks like the offending call to Option::unwrap()
is still being hidden somehow. Can you try running docker pull ghcr.io/predibase/lorax:latest
to ensure you're running the latest image and set RUST_BACKTRACE=full
to get the full stack trace? Thanks.
Hi @tgaddair , thanks for getting back to me. I've now updated to the latest Docker image and I can start it now.
Hello @tgaddair ,
Loading Qwen-72B-Chat-Int4 fails with RuntimeError: CUDA error: an illegal memory access was encountered, while loading Qwen-14B-Chat-Int4 gives the correct result. Here is the error log:
docker run --gpus '"device=2,3,4,5"' -e RUST_BACKTRACE=full --shm-size 1g -p 8081:80 -v /Qwen/:/data ghcr.nju.edu.cn/predibase/lorax:latest --model-id /data/Qwen-72B-Chat-Int4 --adapter-source local --trust-remote-code --quantize gptq
2024-03-11T09:24:56.420409Z INFO lorax_launcher: Starting Webserver
2024-03-11T09:24:56.457190Z INFO lorax_router: router/src/main.rs:202: Loading tokenizer /data/Qwen-72B-Chat-Int4
2024-03-11T09:24:56.459163Z INFO lorax_router: router/src/main.rs:206: Using local tokenizer: /data/Qwen-72B-Chat-Int4
2024-03-11T09:24:56.459186Z WARN lorax_router: router/src/main.rs:251: Could not find a fast tokenizer implementation for /data/Qwen-72B-Chat-Int4
2024-03-11T09:24:56.459265Z WARN lorax_router: router/src/main.rs:252: Rust input length validation and truncation is disabled
2024-03-11T09:24:56.459270Z WARN lorax_router: router/src/main.rs:277: no pipeline tag found for model /data/Qwen-72B-Chat-Int4
2024-03-11T09:24:56.503452Z INFO lorax_router: router/src/main.rs:296: Warming up model
2024-03-11T09:24:59.348856Z ERROR lorax_launcher: interceptor.py:41 Method Warmup encountered an error.
Traceback (most recent call last):
File "/opt/conda/bin/lorax-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 89, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 330, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
handle._run()
File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
> File "/opt/conda/lib/python3.10/site-packages/lorax_server/interceptor.py", line 38, in intercept
return await response
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
raise error
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 80, in Warmup
max_supported_total_tokens = self.model.warmup(batch, request.max_new_tokens)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 746, in warmup
_, batch = self.generate_token(batch, is_warmup=True)
File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 878, in generate_token
raise e
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 875, in generate_token
out = self.forward(batch, adapter_data)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_causal_lm.py", line 833, in forward
return model.forward(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_qwen_modeling.py", line 476, in forward
hidden_states = self.transformer(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_qwen_modeling.py", line 433, in forward
hidden_states, residual = layer(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_qwen_modeling.py", line 358, in forward
attn_output = self.attn(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_qwen_modeling.py", line 227, in forward
qkv = self.c_attn(hidden_states, adapter_data)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/layers.py", line 601, in forward
result = self.base_layer(input)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/layers.py", line 399, in forward
return self.linear.forward(x)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/quant_linear.py", line 349, in forward
out = QuantLinearFunction.apply(
File "/opt/conda/lib/python3.10/site-packages/torch/autograd/function.py", line 553, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/opt/conda/lib/python3.10/site-packages/torch/cuda/amp/autocast_mode.py", line 123, in decorate_fwd
return fwd(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/quant_linear.py", line 244, in forward
output = matmul248(input, qweight, scales, qzeros, g_idx, bits, maxq)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/quant_linear.py", line 216, in matmul248
matmul_248_kernel[grid](
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/custom_autotune.py", line 110, in run
timings = {
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/custom_autotune.py", line 111, in <dictcomp>
config: self._bench(*args, config=config, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/lorax_server/utils/gptq/custom_autotune.py", line 90, in _bench
return triton.testing.do_bench(
File "/opt/conda/lib/python3.10/site-packages/triton/testing.py", line 103, in do_bench
torch.cuda.synchronize()
File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 801, in synchronize
return torch._C._cuda_synchronize()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2024-03-11T09:25:05.191066Z ERROR warmup{max_input_length=1024 max_prefill_tokens=4096 max_total_tokens=2048}:warmup: lorax_client: router/client/src/lib.rs:34: Server error: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Error: Warmup(Generation("CUDA error: an illegal memory access was encountered\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1.\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n"))
2024-03-11T09:25:05.227846Z ERROR lorax_launcher: Webserver Crashed
2024-03-11T09:25:05.227884Z INFO lorax_launcher: Shutting down shards
2024-03-11T09:25:05.576928Z INFO shard-manager: lorax_launcher: Shard terminated rank=0
2024-03-11T09:25:05.599339Z INFO shard-manager: lorax_launcher: Shard terminated rank=2
2024-03-11T09:25:05.599523Z INFO shard-manager: lorax_launcher: Shard terminated rank=3
2024-03-11T09:25:05.643815Z INFO shard-manager: lorax_launcher: Shard terminated rank=1
Error: WebserverFailed
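As the error message itself suggests, CUDA reports illegal memory accesses asynchronously, so the Python stack trace may point at an unrelated call (here, torch.cuda.synchronize() during autotuning). Setting CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous so the trace lands on the faulting kernel. A minimal sketch of doing this from Python; in the container you would instead pass -e CUDA_LAUNCH_BLOCKING=1 to docker run:

```python
import os

# Must be set before the CUDA context is created, i.e. before the first
# CUDA call (safest: before importing torch at all).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # kernel launch errors now surface at the launching call
```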
Hey @Nipi64310, can you share the output of nvidia-smi? It looks like the warmup process is running out of memory. You may need to try reducing these values: max_input_length=1024 max_prefill_tokens=4096 max_total_tokens=2048
@hayleyhu can you try pulling the latest image and see if that resolves the unwrap() panic?
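To see why those limits matter, the warmup's token budget is roughly the free GPU memory divided by the per-token KV-cache footprint. A back-of-the-envelope sketch, using the "Memory remaining for kv cache" figure from the first log; the layer/head dimensions below are illustrative assumptions, not Qwen's actual config, and the launcher additionally rounds to block sizes:

```python
# Illustrative model dimensions -- assumptions for the sketch, not Qwen's config.
num_layers = 40
num_kv_heads = 40
head_dim = 128
bytes_per_value = 2  # fp16 KV cache

# Each token stores one key and one value vector per layer.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

free_mb = 3082.375  # "Memory remaining for kv cache" from the log above
max_total_tokens_budget = int(free_mb * 1024 * 1024 / bytes_per_token)
```

If the largest batch the warmup tries to allocate exceeds this budget, warmup fails; lowering max_input_length, max_prefill_tokens, and max_total_tokens shrinks that batch.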
Okay, I think I see what's happening here. The unwrap error is occurring because of PR #309, which was accidentally pushing latest images during development.
cc @magdyksaleh
Let's make sure we only push dev images with a specific tag for the branch. I'll see if there's something we can do to prevent this automatically. In the meantime, I'll see if we can retag the current latest with the last commit to main.
@magdyksaleh confirmed the latest image has been fixed to be tagged from main.
Hi @tgaddair, I specified --max-input-length 128 --max-batch-prefill-tokens 512 --max-batch-total-tokens 512 --max-total-tokens 512, but I'm still getting the same error log.
docker run --gpus '"device=2,3,4,5"' -e RUST_BACKTRACE=full --shm-size 1g -p 8081:80 -v /Qwen:/data ghcr.nju.edu.cn/predibase/lorax:latest --model-id /data/Qwen-72B-Chat-Int4 --adapter-source local --quantize gptq --max-input-length 128 --max-batch-prefill-tokens 512 --max-batch-total-tokens 512 --max-total-tokens 512 --trust-remote-code
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2024-03-12T02:40:12.730824Z ERROR warmup{max_input_length=128 max_prefill_tokens=512 max_total_tokens=512}:warmup: lorax_client: router/client/src/lib.rs:34: Server error: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Error: Warmup(Generation("CUDA error: an illegal memory access was encountered\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1.\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n"))
2024-03-12T02:40:12.751591Z ERROR lorax_launcher: Webserver Crashed
2024-03-12T02:40:12.751620Z INFO lorax_launcher: Shutting down shards
2024-03-12T02:40:13.041195Z INFO shard-manager: lorax_launcher: Shard terminated rank=2
2024-03-12T02:40:13.064553Z INFO shard-manager: lorax_launcher: Shard terminated rank=1
2024-03-12T02:40:13.091416Z INFO shard-manager: lorax_launcher: Shard terminated rank=3
2024-03-12T02:40:13.138504Z INFO shard-manager: lorax_launcher: Shard terminated rank=0
Thanks, my original question was resolved!