Comments (12)
Yep, unfortunately the FasterTransformer code is very tied to NVIDIA cards (perhaps unsurprising, since FasterTransformer is made by... NVIDIA).
However, there have been some really exciting recent improvements in low-latency inference via DeepSpeed and INT8 quantization that might allow us to replace the FT backend with something that works on a wider variety of hardware (with lower memory usage, too) without sacrificing performance:
https://huggingface.co/blog/bloom-inference-pytorch-scripts
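For reference, here's a minimal sketch of that int8 path, assuming the transformers + bitsandbytes + accelerate stack (the same packages the Triton image below installs); the codegen-350M-mono checkpoint is just an example:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a CodeGen checkpoint with its weights quantized to int8 via
# bitsandbytes. Note this still needs a CUDA-capable GPU; device_map="auto"
# lets accelerate decide where to place each layer.
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
model = AutoModelForCausalLM.from_pretrained(
    "Salesforce/codegen-350M-mono",
    device_map="auto",
    load_in_8bit=True,
)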
For what it's worth, PyTorch supports an mps backend that you can query for and select, which drastically improves performance on Apple silicon. For most things it's as simple as setting torch.device("mps").
https://pytorch.org/docs/stable/notes/mps.html
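A minimal sketch of the availability check from those docs:

import torch

# Use Apple's Metal (MPS) backend when available, otherwise fall back to CPU.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

x = torch.randn(4, 4, device=device)  # tensor created on the selected device
y = (x @ x).cpu()                     # compute on-device, copy back to CPU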
Looks like nvidia-docker still isn't fully supported on Apple M1 (see NVIDIA/nvidia-docker#101 and nathanwbrei/phasm#8 (comment)). I don't have a Mac with me at the moment, so unfortunately I can't try this out myself.
Any progress on this? Or other alternatives?
Does the new Apple hardware include NVIDIA graphics cards? If not, this repo will not work for you. For more information, see issue #4
Does the new Apple hardware include NVIDIA graphics cards? If not, this repo will not work for you. For more information, see issue #4
Did you manage to get it running? I tried but never succeeded.
Does the new Apple hardware include NVIDIA graphics cards? If not, this repo will not work for you. For more information, see issue #4
Unfortunately, the new MacBooks with Apple silicon do not contain NVIDIA cards. Maybe we could consider using the Neural Engine on the M1 chips.
Now that we have a Python backend, would you be able to get this working?
On a MacBook Pro (Ventura 13.1) with the Python backend:
Choose your backend:
[1] FasterTransformer backend (faster, but limited models)
[2] Python backend (slower, but more models, and allows loading with int8)
Enter your choice [1]: 2
Models available:
[1] codegen-350M-mono (1GB total VRAM required; Python-only)
[2] codegen-350M-multi (1GB total VRAM required; multi-language)
[3] codegen-2B-mono (4GB total VRAM required; Python-only)
[4] codegen-2B-multi (4GB total VRAM required; multi-language)
Enter your choice [4]: 1
it still seems to fail with an error about libnvidia-ml.so.1:
Attaching to fauxpilot-copilot_proxy-1, fauxpilot-triton-1
Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown
Full trace:
./setup.sh Node 16.13.1
Checking for curl ...
/usr/bin/curl
Checking for zstd ...
/usr/local/bin/zstd
Checking for docker ...
/usr/local/bin/docker
Enter number of GPUs [1]:
External port for the API [5000]:
Address for Triton [triton]:
Port of Triton host [8001]:
Where do you want to save your models [/Users/romain.rigaux/projects/fauxpilot/models]?
Choose your backend:
[1] FasterTransformer backend (faster, but limited models)
[2] Python backend (slower, but more models, and allows loading with int8)
Enter your choice [1]: 2
Models available:
[1] codegen-350M-mono (1GB total VRAM required; Python-only)
[2] codegen-350M-multi (1GB total VRAM required; multi-language)
[3] codegen-2B-mono (4GB total VRAM required; Python-only)
[4] codegen-2B-multi (4GB total VRAM required; multi-language)
Enter your choice [4]: 1
Do you want to share your huggingface cache between host and docker container? y/n [n]:
Do you want to use int8? y/n [y]:
Config written to /Users/romain.rigaux/projects/fauxpilot/models/py-Salesforce-codegen-350M-mono/py-model/config.pbtxt
[+] Building 0.0s (0/0)
[+] Building 0.1s (2/3)
=> [internal] load build definition from Dockerfile 0.0s
[+] Building 0.3s (2/3)
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 32B 0.0s
[+] Building 2.1s (10/10) FINISHED
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 32B 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 35B 0.0s
=> [internal] load metadata for docker.io/library/python:3.10-slim-buster 2.0s
=> [internal] load build context 0.0s
=> => transferring context: 1.15kB 0.0s
=> [1/5] FROM docker.io/library/python:3.10-slim-buster@sha256:8c2ff857fff9df7905b299647176e16c2a606ff65fa479ba9cad61acbee3123c 0.0s
=> CACHED [2/5] WORKDIR /python-docker 0.0s
=> CACHED [3/5] COPY copilot_proxy/requirements.txt requirements.txt 0.0s
[+] Building 2.3s (7/7) FINISHED
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 32B 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 35B 0.0s
=> [internal] load metadata for docker.io/moyix/triton_with_ft:22.09 2.1s
=> [1/3] FROM docker.io/moyix/triton_with_ft:22.09@sha256:5a15c1f29c6b018967b49c588eb0ea67acbf897abb7f26e509ec21844574c9b1 0.0s
=> CACHED [2/3] RUN python3 -m pip install --disable-pip-version-check -U torch --extra-index-url https://download.pytorch.org/whl/cu116 0.0s
=> CACHED [3/3] RUN python3 -m pip install --disable-pip-version-check -U transformers bitsandbytes accelerate 0.0s
=> exporting to image 0.0s
=> => exporting layers 0.0s
=> => writing image sha256:1d22eab54aab4755ffffeb7627dcb8041ebc2be321cb3865d574ec9fb346321b 0.0s
=> => naming to docker.io/library/fauxpilot-triton 0.0s
Config complete, do you want to run FauxPilot? [y/n]
[+] Running 2/2
⠿ Container fauxpilot-copilot_proxy-1 Recreated 0.4s
⠿ Container fauxpilot-triton-1 Recreated 0.1s
Attaching to fauxpilot-copilot_proxy-1, fauxpilot-triton-1
Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown
[+] Running 1/0
⠿ Container fauxpilot-copilot_proxy-1 Running 0.0s
Attaching to fauxpilot-copilot_proxy-1, fauxpilot-triton-1
Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown
About MPS:
NotImplementedError: The operator 'aten::cumsum.out' is not currently implemented for the MPS device.
If you want this op to be added in priority during the prototype phase of this feature, please comment on
https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable
`PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be
slower than running natively on MPS.
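For anyone else hitting this, a minimal sketch of the workaround the error message suggests; the variable has to be set before torch is imported:

import os

# Let ops missing on MPS (like aten::cumsum.out) fall back to the CPU.
# This must be set before the first `import torch`.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch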
Any updates as of now?
Related Issues (20)
- Maybe add windows/etc installer all-in-one in this project's 'releases'.
- 400 Bad Request when file has around 100 lines of code HOT 3
- C# support! HOT 2
- Hello all. The comments above have been very helpful in setting up the Copilot extension. I managed to get it to work with my instance and figured I would combine the steps I used (this is for Windows. Linux installation is similar, just different locations):
- It was working fine before... HOT 1
- Support for AMD GPUs HOT 1
- Triton doesnt exist anymore I think? HOT 3
- K8s deployment (via helm chart) HOT 2
- Caught signal 11 (Segmentation fault: address not mapped to object at address (nil)) HOT 1
- why my response are all !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! HOT 3
- Can I merge images of triton and client into one?eg fastertransformer_backend get content_fetch <fastertransformer&client>in CMakeLists ? HOT 1
- help me HOT 1
- What is the comparison of these model in huggingface? HOT 2
- Python Backend: "Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0" HOT 2
- [promptlib] proxy {"cause":{}} HOT 1
- ollama HOT 2
- Company Proxy HOT 1
- is documentation outdated?
- Jetbrains Support
- RTX 4060 Unsupported Message