Comments (10)
This is CUDA 11.3, which I didn't test on. Can you try CUDA 11.8?
Let me add a section to the readme about known CUDA support.
from openllm.
Thanks for your answer (and the great lib by the way!)
Starting from another fresh install and running:
# uninstall previous cuda install
sudo /usr/bin/nvidia-uninstall
# install cuda 11.8
wget https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda_11.8.0_520.61.05_linux.run
sudo sh cuda_11.8.0_520.61.05_linux.run --silent
# install openllm
conda create -n py10 python=3.10 -y
conda activate py10
pip install "openllm[llama, fine-tune, vllm]"
openllm start llama --model-id huggyllama/llama-13b
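Before launching the server, a quick sanity check that the fresh CUDA install is actually visible from Python can rule out a driver/toolkit mismatch. This is a sketch of my own, not part of the original steps; it assumes `torch` was pulled in by the `pip install "openllm[...]"` step above:

```python
# Hedged sanity check: confirm the new CUDA toolkit is visible to PyTorch
# before starting OpenLLM. Degrades gracefully if torch is not installed.
import importlib.util

def cuda_status() -> str:
    """Return a one-line summary of CUDA visibility from Python."""
    if importlib.util.find_spec("torch") is None:
        return "torch not installed in this environment"
    import torch
    if not torch.cuda.is_available():
        return "torch installed, but no CUDA device visible"
    return f"CUDA OK: {torch.cuda.get_device_name(0)}"

print(cuda_status())
```

On a healthy A100 box this should report the device name; any other output points at the environment rather than at OpenLLM.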
The missing SciPy issue still shows up. After installing it, the logs jump straight to loading the checkpoint shards (without showing anything about downloading the model weights). Then nothing much happens: OpenLLM slowly uses more and more RAM, but barely any CPU and no GPU. Any chance loading via CPU is the bottleneck here (despite the GPU being found, as evidenced by DeepSpeed setting the right accelerator)?
I just fixed a bug for loading on a single GPU.
Can you try with 0.2.6?
I guess since you are using an A100, it should be fine to load the whole model into memory.
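For context, a rough back-of-the-envelope estimate (my own illustrative numbers, not from the thread) of why a 13B model in half precision should comfortably fit on a single 80 GB A100:

```python
# Rough fp16 memory estimate for a 13B-parameter model (illustrative only;
# ignores activations, KV cache, and framework overhead).
params = 13_000_000_000   # llama-13b parameter count, approximate
bytes_per_param = 2       # fp16 / bf16
weights_gib = params * bytes_per_param / 1024**3
print(f"~{weights_gib:.1f} GiB of weights")   # ~24.2 GiB, well under 80 GB
```

So if nothing lands on the GPU, the issue is in device placement, not capacity.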
The logs, an hour and a half after running openllm start llama --model-id huggyllama/llama-13b:
bin /opt/conda/envs/py10/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda113.so
[2023-07-24 07:39:35,243] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Downloading (…)fetensors.index.json: 100% 33.4k/33.4k [00:00<00:00, 13.5MB/s]
Downloading (…)of-00003.safetensors: 100% 9.95G/9.95G [02:43<00:00, 60.7MB/s]
Downloading (…)of-00003.safetensors: 100% 9.90G/9.90G [02:40<00:00, 61.7MB/s]
Downloading (…)of-00003.safetensors: 100% 6.18G/6.18G [01:41<00:00, 61.0MB/s]
Downloading shards: 100% 3/3 [07:06<00:00, 142.29s/it]
Loading checkpoint shards: 100% 3/3 [00:03<00:00, 1.21s/it]
Downloading (…)neration_config.json: 100% 137/137 [00:00<00:00, 1.01MB/s]
Downloading (…)okenizer_config.json: 100% 700/700 [00:00<00:00, 5.03MB/s]
Downloading tokenizer.model: 100% 500k/500k [00:00<00:00, 5.05MB/s]
Downloading (…)/main/tokenizer.json: 100% 1.84M/1.84M [00:00<00:00, 12.5MB/s]
Downloading (…)cial_tokens_map.json: 100% 411/411 [00:00<00:00, 3.17MB/s]
^C^C^C^C^C^C2023-07-24T08:00:54+0000 [DEBUG] [cli] Importing service "_service.py:svc" from working dir: "/opt/conda/envs/py10/lib/python3.10/site-packages/openllm"
bin /opt/conda/envs/py10/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda113.so
2023-07-24T08:01:14+0000 [INFO] [cli] Created a temporary directory at /tmp/tmpqthsnq8d
2023-07-24T08:01:14+0000 [INFO] [cli] Writing /tmp/tmpqthsnq8d/_remote_module_non_scriptable.py
[2023-07-24 08:01:14,881] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
2023-07-24T08:01:16+0000 [DEBUG] [cli] Popen(['git', 'version'], cwd=/opt/conda/envs/py10/lib/python3.10/site-packages/openllm, universal_newlines=False, shell=None, istream=None)
2023-07-24T08:01:17+0000 [DEBUG] [cli] Popen(['git', 'version'], cwd=/opt/conda/envs/py10/lib/python3.10/site-packages/openllm, universal_newlines=False, shell=None, istream=None)
2023-07-24T08:01:17+0000 [DEBUG] [cli] Trying paths: ['/home/user/.docker/config.json', '/home/user/.dockercfg']
2023-07-24T08:01:17+0000 [DEBUG] [cli] Found file at path: /home/user/.docker/config.json
2023-07-24T08:01:17+0000 [DEBUG] [cli] Found 'credHelpers' section
2023-07-24T08:01:17+0000 [DEBUG] [cli] [Tracing] Create new propagation context: {'trace_id': 'daf4767d6aa948b4b96d0cdc18949e70', 'span_id': '8ddcc746bd7df314', 'parent_span_id': None, 'dynamic_sampling_context': None}
Loading checkpoint shards: 100% 3/3 [12:19<00:00, 246.41s/it]
Using pad_token, but it is not set yet.
Still nothing loaded on the GPU by that time unfortunately.
What happens with openllm start llama --model-id huggyllama/llama-13b --debug?
Pretty much the same thing at first (using 0.2.9):
[2023-07-25 14:03:55,952] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
DEBUG:tensorflow:Falling back to TensorFlow client; we recommended you install the Cloud TPU client directly with pip install cloud-tpu-client.
Loading checkpoint shards: 100% 3/3 [04:04<00:00, 81.50s/it]
But things got moving when I tried to shut down the command:
^C^C^C^C^C^CStarting server with arguments: ['/opt/conda/envs/py10/bin/python3.10', '-m', 'bentoml', 'serve-http', '_service.py:svc', '--host', '0.0.0.0', '--port', '3000', '--backlog', '2048', '--api-workers', '12', '--working-dir', '/opt/conda/envs/py10/lib/python3.10/site-packages/openllm', '--ssl-version', '17', '--ssl-ciphers', 'TLSv1']
2023-07-25T14:25:28+0000 [DEBUG] [cli] Importing service "_service.py:svc" from working dir: "/opt/conda/envs/py10/lib/python3.10/site-packages/openllm"
2023-07-25T14:25:31+0000 [DEBUG] [cli] Initializing MLIR with module: _site_initialize_0
2023-07-25T14:25:31+0000 [DEBUG] [cli] Registering dialects from initializer <module 'jaxlib.mlir._mlir_libs._site_initialize_0' from '/opt/conda/envs/py10/lib/python3.10/site-packages/jaxlib/mlir/_mlir_libs/_site_initialize_0.so'>
2023-07-25T14:25:32+0000 [DEBUG] [cli] No jax_plugins namespace packages available
2023-07-25T14:25:33+0000 [DEBUG] [cli] etils.epath found. Using etils.epath for file I/O.
2023-07-25T14:25:51+0000 [INFO] [cli] Created a temporary directory at /tmp/tmpgwt7mutk
2023-07-25T14:25:51+0000 [INFO] [cli] Writing /tmp/tmpgwt7mutk/_remote_module_non_scriptable.py
[2023-07-25 14:25:52,312] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
2023-07-25T14:26:01+0000 [DEBUG] [cli] Falling back to TensorFlow client; we recommended you install the Cloud TPU client directly with pip install cloud-tpu-client.
2023-07-25T14:26:02+0000 [DEBUG] [cli] Creating converter from 7 to 5
2023-07-25T14:26:02+0000 [DEBUG] [cli] Creating converter from 5 to 7
2023-07-25T14:26:02+0000 [DEBUG] [cli] Creating converter from 7 to 5
2023-07-25T14:26:02+0000 [DEBUG] [cli] Creating converter from 5 to 7
2023-07-25T14:26:11+0000 [DEBUG] [cli] Popen(['git', 'version'], cwd=/opt/conda/envs/py10/lib/python3.10/site-packages/openllm, universal_newlines=False, shell=None, istream=None)
2023-07-25T14:26:11+0000 [DEBUG] [cli] Popen(['git', 'version'], cwd=/opt/conda/envs/py10/lib/python3.10/site-packages/openllm, universal_newlines=False, shell=None, istream=None)
2023-07-25T14:26:11+0000 [DEBUG] [cli] Trying paths: ['/home/user/.docker/config.json', '/home/qlutz/.dockercfg']
2023-07-25T14:26:11+0000 [DEBUG] [cli] Found file at path: /home/user/.docker/config.json
2023-07-25T14:26:11+0000 [DEBUG] [cli] Found 'credHelpers' section
2023-07-25T14:26:11+0000 [DEBUG] [cli] [Tracing] Create new propagation context: {'trace_id': '663640676af84209a41185161a0d1eac', 'span_id': 'b2ab05f9966f5d45', 'parent_span_id': None, 'dynamic_sampling_context': None}
Loading checkpoint shards: 0% 0/3 [00:00<?, ?it/s]
Either way, nothing is loaded on the GPU.
How many GPUs do you have? What does nvidia-smi show?
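To make the answer reproducible, here is one hedged way to capture the GPU inventory from Python. This is a sketch of my own; it just shells out to nvidia-smi and degrades gracefully on machines where the binary is absent:

```python
# List visible NVIDIA GPUs by shelling out to `nvidia-smi -L`.
# Returns an empty list when nvidia-smi is missing or fails.
import subprocess

def list_gpus() -> list[str]:
    try:
        out = subprocess.run(
            ["nvidia-smi", "-L"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return [line for line in out.splitlines() if line.strip()]

print(list_gpus() or "no NVIDIA GPU detected")
```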
Still the same setup as in the original post: 1x A100 80GB. I tested on CUDA 11.6 and 11.8.
Fixed in the latest version (0.2.25) for the described setup and model. Thanks!
@aarnphm I still have the same problem when using openllm start baichuan to load a Baichuan LLM: no GPU usage, and it cannot accept requests.