
fauxpilot's Introduction

FauxPilot

This is an attempt to build a locally hosted version of GitHub Copilot. It uses the Salesforce CodeGen models inside NVIDIA's Triton Inference Server with the FasterTransformer backend.

Prerequisites

You'll need:

  • Docker
  • docker compose >= 1.28
  • An NVIDIA GPU with Compute Capability >= 6.0 and enough VRAM to run the model you want.
  • nvidia-docker
  • curl and zstd for downloading and unpacking the models.

Note that the VRAM requirements listed by setup.sh are total -- if you have multiple GPUs, you can split the model across them. So, if you have two NVIDIA RTX 3080 GPUs, you should be able to run the 6B model by putting half on each GPU.
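The arithmetic behind that note can be sketched as follows. This is only a rough check that assumes the model is split evenly across the GPUs and ignores any per-GPU overhead; the function name and the figures in the usage lines (13 GB total, 10 GB per card) are illustrative assumptions, not values taken from setup.sh.

```python
from typing import Sequence


def fits_when_split(total_vram_gb: float, gpu_vram_gb: Sequence[float]) -> bool:
    """Return True if an even split of the model fits on every listed GPU.

    total_vram_gb: total VRAM requirement for the chosen model.
    gpu_vram_gb:   VRAM capacity of each available GPU.
    Assumption: the model is divided evenly and overhead is ignored.
    """
    if not gpu_vram_gb:
        return False
    per_gpu_share = total_vram_gb / len(gpu_vram_gb)
    return all(per_gpu_share <= capacity for capacity in gpu_vram_gb)


# Illustrative numbers only: a 13 GB requirement on one or two 10 GB cards.
print(fits_when_split(13.0, [10.0]))
print(fits_when_split(13.0, [10.0, 10.0]))
```

In practice, leave some headroom: the desktop session and the inference server itself also use VRAM.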

Support and Warranty

lmao

Okay, fine, we now have some minimal information on the wiki and a discussion forum where you can ask questions. Still no formal support or warranty though!

Setup

This section describes how to install a FauxPilot server and clients.

Setting up a FauxPilot Server

Run the setup script to choose a model to use. This will download the model from Hugging Face/Moyix in GPT-J format and then convert it for use with FasterTransformer.

Please refer to How to set-up a FauxPilot server.

Client configuration for FauxPilot

There are several ways to connect to a FauxPilot server. For example, you can build a client using the OpenAI API, the Copilot plugin, or the REST API.

Please refer to How to set-up a client.

Terminology

  • API: Application Programming Interface
  • CC: Compute Capability
  • CUDA: Compute Unified Device Architecture
  • FT: FasterTransformer
  • JSON: JavaScript Object Notation
  • gRPC: a remote procedure call framework from Google
  • GPT-J: A transformer model trained using Ben Wang's Mesh Transformer JAX
  • REST: REpresentational State Transfer

fauxpilot's People

Contributors

akay, arturhoo, claudiosv, fdegier, frederisk, jsoref, leemgs, luanshaotong, moyix, petronny, thakkarparth007


fauxpilot's Issues

Any VS Code plugin recommendations?

I have tried the fauxpilot client, but I ran into a lot of bugs while running it. Are there any other plugins available?

CUDA runtime error: invalid device function sampling_topp_kernels.cu

Hi guys,

I'm using:

  • Model: codegen-2B-multi
  • GPU: GTX 1070 w/ 8G VRAM
  • Sys: Fedora36 5.18.16-200.fc36.x86_64
  • NVIDIA driver 515.57 w/ CUDA 11.7

Using Podman as the container runtime with the NVIDIA Container Toolkit:

Client:       Podman Engine
Version:      4.1.1
API Version:  4.1.1
Go Version:   go1.18.4
Built:        Fri Jul 22 15:05:59 2022
OS/Arch:      linux/amd64
cat  /usr/share/containers/oci/hooks.d/oci-nvidia-hook.json
{
    "version": "1.0.0",
    "hook": {
        "path": "/usr/bin/nvidia-container-toolkit",
        "args": ["nvidia-container-toolkit", "prestart"],
        "env": [
            "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
        ]
    },
    "when": {
        "always": true,
        "commands": [".*"]
    },
    "stages": ["prestart"]
}

The nvidia-smi command works fine in the container.


Problem

The Triton server starts fine, but it crashes when I send it a request using the OpenAI API demo from the README.

Is this a GPU compatibility issue? If so, which GPU models are supported?

Any help will be appreciated!

No CUDA-capable device is detected

Hi all,

My host is Windows 10 with an NVIDIA 3090 (24 GB VRAM). I cannot start the Triton container; it fails with the error message "no CUDA-capable device is detected". Do you know why? I can detect CUDA with PyTorch on the host.

=============================
== Triton Inference Server ==

NVIDIA Release 22.06 (build 39726160)
Triton Server Version 2.23.0

Copyright (c) 2018-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

W0903 01:19:39.717609 88 pinned_memory_manager.cc:236] Unable to allocate pinned system memory, pinned memory pool will not be available: no CUDA-capable device is detected
I0903 01:19:39.717744 88 cuda_memory_manager.cc:115] CUDA memory pool disabled
E0903 01:19:39.753366 88 model_repository_manager.cc:2063] Poll failed for model directory 'codegen-16B-mono-1gpu': failed to open text file for read /model/codegen-16B-mono-1gpu/config.pbtxt: No such file or directory
E0903 01:19:39.773178 88 model_repository_manager.cc:2063] Poll failed for model directory 'codegen-16B-mono-2gpu': failed to open text file for read /model/codegen-16B-mono-2gpu/config.pbtxt: No such file or directory
E0903 01:19:39.795737 88 model_repository_manager.cc:2063] Poll failed for model directory 'codegen-16B-multi-1gpu': failed to open text file for read /model/codegen-16B-multi-1gpu/config.pbtxt: No such file or directory
E0903 01:19:39.815933 88 model_repository_manager.cc:2063] Poll failed for model directory 'codegen-16B-multi-2gpu': failed to open text file for read /model/codegen-16B-multi-2gpu/config.pbtxt: No such file or directory
E0903 01:19:39.836123 88 model_repository_manager.cc:2063] Poll failed for model directory 'codegen-16B-nl-1gpu': failed to open text file for read /model/codegen-16B-nl-1gpu/config.pbtxt: No such file or directory
E0903 01:19:39.857360 88 model_repository_manager.cc:2063] Poll failed for model directory 'codegen-16B-nl-2gpu': failed to open text file for read /model/codegen-16B-nl-2gpu/config.pbtxt: No such file or directory
E0903 01:19:39.881019 88 model_repository_manager.cc:2063] Poll failed for model directory 'codegen-2B-mono-1gpu': failed to open text file for read /model/codegen-2B-mono-1gpu/config.pbtxt: No such file or directory

[Q] run on Android ?

Is there any chance that this project can run on Android TV (NVIDIA Shield Pro, which has a Tegra X1)?

API rate limit for fauxpilot?

Have you tried measuring how much load FauxPilot can handle? I know it depends heavily on the hardware that is provisioned.

Still, I would be curious to know how many API requests per minute a typical GPU such as an RTX 3090 or RTX A6000 could handle.

Error response from daemon: could not select device driver "nvidia" with capabilities: [[gpu]]

Hey,

Installation went fine, no problems whatsoever. However, when trying to run launch.sh I get the following error:

➜ ./launch.sh       
[+] Running 1/0
 ⠿ Container fauxpilot-copilot_proxy-1  Running
Attaching to fauxpilot-copilot_proxy-1, fauxpilot-triton-1
Error response from daemon: could not select device driver "nvidia" with capabilities: [[gpu]]

I have all dependencies installed:

  • docker (v. 20.10.17, build 100c70180f)
  • docker-compose (v. 2.6.1)
  • nvidia driver (v. 515.57, CUDA v. 11.7)
  • nvidia-docker (v. 2.11.0-1)

Running on Arch Linux, kernel: 5.15.55-1-lts

`env_file` is incorrectly used by compose for variable interpolation

According to the response from the Docker developers:

env_file is not used for variable interpolation in the container specification used by compose to create containers, but passed to the docker API as runtime definition for the process running inside the container.
...

So docker-compose.yaml in 8895b74 relies on undefined behavior to construct itself, which may have unintended effects on some systems. (In fact, as mentioned in the issue above, this does not work properly on Windows, for example in fauxpilot-windows.)

I noticed this is already the second bug in #49, and @dslandry said he did not finish checking it. Should we completely re-evaluate and re-check this PR?
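One possible workaround, sketched here and not the project's actual fix, builds on the fact that Compose interpolates `${VAR}` references from the environment of the process that invokes it. The sketch reads a simple KEY=VALUE file and merges it into an environment mapping that could be passed to the compose process. The function names are hypothetical, and the parser deliberately ignores quoting and `export` prefixes.

```python
import os
from typing import Dict, Mapping, Optional


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def compose_environment(
    env_text: str, base_env: Optional[Mapping[str, str]] = None
) -> Dict[str, str]:
    """Return base_env (default: os.environ) overlaid with the file's values."""
    merged = dict(os.environ if base_env is None else base_env)
    merged.update(parse_env_text(env_text))
    return merged
```

The resulting dictionary can be handed to `subprocess.run(..., env=...)` when launching compose, so interpolation no longer depends on how `env_file` is treated.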

How to optimize CodeGen for my code before launching FauxPilot

With a properly set-up FauxPilot, we can run inference with the CodeGen model. I recently read that the CodeGen model can be fine-tuned as follows.

$ deepspeed --num_gpus 1 --num_nodes 1 run_clm.py --model_name_or_path=Salesforce/codegen-6B-multi --per_device_train_batch_size=1 --learning_rate 2e-5 --num_train_epochs 1 --output_dir=./codegen-6B-finetuned --dataset_name your_dataset --tokenizer_name Salesforce/codegen-6B-multi --block_size 2048 --gradient_accumulation_steps 32 --do_train --fp16 --overwrite_output_dir --deepspeed ds_config.json

Is there a GitHub repository that describes how to fine-tune with additional source code (e.g., my own source code) using DeepSpeed? In particular, I am looking for a more detailed example of the "--dataset_name your_dataset" option. Where can I find such a repository? Are there any web pages that explain how to run fine-tuning with DeepSpeed? Any comments on this issue are welcome.
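As a starting point for the dataset question, here is a hedged sketch that collects source files into a JSON Lines file with one `{"text": ...}` record per file. It assumes that the run_clm.py in the command above is the Hugging Face Transformers example script, which can read a local file through `--train_file` instead of `--dataset_name`. The function name, the default suffix list, and the output file name are illustrative choices, not part of FauxPilot.

```python
import json
from pathlib import Path
from typing import Sequence, Union


def write_jsonl_dataset(
    source_root: Union[str, Path],
    output_path: Union[str, Path],
    suffixes: Sequence[str] = (".py",),
) -> int:
    """Write one {"text": <file contents>} JSON line per matching source file.

    Returns the number of records written. Empty files are skipped.
    """
    count = 0
    with Path(output_path).open("w", encoding="utf-8") as out:
        for path in sorted(Path(source_root).rglob("*")):
            if not path.is_file() or path.suffix not in suffixes:
                continue
            text = path.read_text(encoding="utf-8", errors="ignore")
            if not text.strip():
                continue
            out.write(json.dumps({"text": text}) + "\n")
            count += 1
    return count
```

Under that assumption, the resulting file would be passed as `--train_file train.jsonl` in place of `--dataset_name your_dataset`.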

Recommended hardware / tutorial / full setup instructions?

It is so inspiring to find this project.

I am a disabled software developer who really struggles to code now. I've been staying away from Copilot, worried about becoming reliant on cloud infrastructure, but I wonder if these tools could really help me.

Are there clear "for dummies" instructions to get set up anywhere? Including what hardware is recommended, and how to configure popular editors? (I use vim, and I was thinking of setting up a Dasharo ASUS D16 motherboard ...)

Can we get the execution result of OpenAI API with curl command?

Can the curl command return the OpenAI API execution result?

When I executed the OpenAI API as follows using the curl command, I did not receive a successful execution result. What part is wrong? Any hints or clues are welcome. :)

How to reproduce this issue

$ cat ./openai-apitest.py
#!/usr/bin/env python3
import openai
openai.api_key = 'dummy'
openai.api_base = 'http://localhost:5000/v1'
result = openai.Completion.create(engine='codegen', prompt='def bye', max_tokens=16, temperature=0.1, stop=["\n\n"])
print ("################ result #################")
print (result)
$ python3 ./openai-apitest.py
################ result #################
{
  "choices": [
    {
      "finish_reason": "length",
      "index": 0,
      "logprobs": null,
      "text": "() {\n        System.exit(0);\n    }\n}\n"
    }
  ],
  "created": 1660611975,
  "id": "cmpl-AYJ3pTvp3USyMhFmL0BoewUWZIA5h",
  "model": "codegen",
  "object": "text_completion",
  "usage": {
    "completion_tokens": 16,
    "prompt_tokens": 2,
    "total_tokens": 18
  }
}
$ curl --location --globoff --request POST 'http://localhost:5000/v1/engines/:codegen/completions?prompt="def hello"&max_tokens=16&temperature=0.1&stop=["\n\n"]'
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
        "http://www.w3.org/TR/html4/strict.dtd">
<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html;charset=utf-8">
        <title>Error response</title>
    </head>
    <body>
        <h1>Error response</h1>
        <p>Error code: 400</p>
        <p>Message: Bad request syntax ('POST /v1/engines/:codegen/completions?prompt="def hello"&amp;max_tokens=16&amp;temperature=0.1&amp;stop=["\\n\\n"] HTTP/1.1').</p>
        <p>Error code explanation: HTTPStatus.BAD_REQUEST - Bad request syntax or unsupported method.</p>
    </body>
</html>
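A likely explanation, not confirmed in this thread, is that the proxy expects the parameters as a JSON request body rather than in the query string. The curl request with `-d '{...}'` quoted in the "[FT][ERROR] CUDA runtime error" issue further down is shaped that way and succeeds. The sketch below builds the same kind of request with Python's standard library; the function names are hypothetical, and the base URL is the one used by the Python script above.

```python
import json
import urllib.request
from typing import List, Optional


def build_completion_request(
    base_url: str,
    prompt: str,
    max_tokens: int = 16,
    temperature: float = 0.1,
    stop: Optional[List[str]] = None,
) -> urllib.request.Request:
    """Build a POST request that carries the parameters as a JSON body."""
    payload = {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stop": stop if stop is not None else ["\n\n"],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/engines/codegen/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


def fetch_completion(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server listening on localhost:5000, `fetch_completion(build_completion_request("http://localhost:5000", "def hello"))` should return a dictionary shaped like the output of the Python script above.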

[FT][ERROR] CUDA runtime error: invalid device function

I installed and ran FauxPilot on Ubuntu 18.04/NVIDIA RTX 2080 (192.168.0.201) and Ubuntu 18.04/NVIDIA Titan Xp (192.168.0.179).
Then, from the Ubuntu environment on my laptop, I called the OpenAI API with the curl command, as shown below.
Unfortunately, sending the curl command to the Ubuntu 18.04/NVIDIA Titan Xp machine (192.168.0.179) throws an error.
In summary, FauxPilot on Ubuntu 18.04/NVIDIA Titan Xp produces the "CUDA runtime error: invalid device function" error message.
Is the NVIDIA Titan Xp perhaps not supported by FauxPilot?

The configuration file is as follows.

cat ./config.env
MODEL=codegen-2B-multi
NUM_GPUS=1
MODEL_DIR=/work/fauxpilot/models

Case 1: When I send an OpenAI API request to Ubuntu 18.04/NVIDIA RTX 2080 (192.168.0.201), it works.

fauxpilot$ curl -s -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{"prompt":"def hello","max_tokens":16,"temperature":0.1,"stop":["\n\n"]}' http://192.168.0.201:5000/v1/engines/codegen/completions


{"id": "cmpl-eww3WHuWSjUMdfLb5tBfxVxRoJUIs", "model": "codegen", "object": "text_completion", "created": 1660749662, "choices": [{"text": "(self):\n        return \"Hello World!\"", "index": 0, "finish_reason": "stop", "logprobs": null}], "usage": 
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 30%   29C    P8     3W / 225W |   6035MiB /  7982MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1198      G   /usr/lib/xorg/Xorg                  9MiB |
|    0   N/A  N/A      1403      G   /usr/bin/gnome-shell                3MiB |
|    0   N/A  N/A    768980      C   ...onserver/bin/tritonserver     6017MiB |
+-----------------------------------------------------------------------------+

Case 2: When I send an OpenAI API request to Ubuntu 18.04/NVIDIA Titan Xp (192.168.0.179), it fails.

fauxpilot$ curl -s -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{"prompt":"def hello","max_tokens":16,"temperature":0.1,"stop":["\n\n"]}' http://192.168.0.179:5000/v1/engines/codegen/completions


<!doctype html>
<html lang=en>
<title>500 Internal Server Error</title>
<h1>Internal Server Error</h1>
<p>The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.</p>
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA TITAN Xp     On   | 00000000:01:00.0 Off |                  N/A |
| 23%   39C    P2    61W / 250W |   5919MiB / 12194MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1592      G   /usr/lib/xorg/Xorg                 41MiB |
|    0   N/A  N/A     24177      C   ...onserver/bin/tritonserver     5873MiB |
+-----------------------------------------------------------------------------+

Issue report:

Below is the error log output when running ./launch.sh on Ubuntu18.04/Nvidia Titan Xp (192.168.0.179).

$ ./launch.sh
...... Omission ......
triton_1         |
triton_1         | I0817 15:19:21.682222 96 grpc_server.cc:4587] Started GRPCInferenceService at 0.0.0.0:8001
triton_1         | I0817 15:19:21.682527 96 http_server.cc:3303] Started HTTPService at 0.0.0.0:8000
triton_1         | I0817 15:19:21.724786 96 http_server.cc:178] Started Metrics Service at 0.0.0.0:8002


triton_1         | W0817 15:21:08.354892 96 libfastertransformer.cc:1397] model fastertransformer, instance fastertransformer_0, executing 1 requests
triton_1         | W0817 15:21:08.354910 96 libfastertransformer.cc:638] TRITONBACKEND_ModelExecute: Running fastertransformer_0 with 1 requests
triton_1         | W0817 15:21:08.354916 96 libfastertransformer.cc:693] get total batch_size = 1
triton_1         | W0817 15:21:08.354922 96 libfastertransformer.cc:1051] get input count = 16
triton_1         | W0817 15:21:08.354930 96 libfastertransformer.cc:1117] collect name: start_id size: 4 bytes
triton_1         | W0817 15:21:08.354935 96 libfastertransformer.cc:1117] collect name: input_ids size: 8 bytes
triton_1         | W0817 15:21:08.354939 96 libfastertransformer.cc:1117] collect name: bad_words_list size: 8 bytes
triton_1         | W0817 15:21:08.354944 96 libfastertransformer.cc:1117] collect name: random_seed size: 4 bytes
triton_1         | W0817 15:21:08.354948 96 libfastertransformer.cc:1117] collect name: end_id size: 4 bytes
triton_1         | W0817 15:21:08.354952 96 libfastertransformer.cc:1117] collect name: input_lengths size: 4 bytes
triton_1         | W0817 15:21:08.354956 96 libfastertransformer.cc:1117] collect name: request_output_len size: 4 bytes
triton_1         | W0817 15:21:08.354960 96 libfastertransformer.cc:1117] collect name: runtime_top_k size: 4 bytes
triton_1         | W0817 15:21:08.354964 96 libfastertransformer.cc:1117] collect name: runtime_top_p size: 4 bytes
triton_1         | W0817 15:21:08.354968 96 libfastertransformer.cc:1117] collect name: is_return_log_probs size: 1 bytes
triton_1         | W0817 15:21:08.354972 96 libfastertransformer.cc:1117] collect name: stop_words_list size: 24 bytes
triton_1         | W0817 15:21:08.354976 96 libfastertransformer.cc:1117] collect name: temperature size: 4 bytes
triton_1         | W0817 15:21:08.354979 96 libfastertransformer.cc:1117] collect name: len_penalty size: 4 bytes
triton_1         | W0817 15:21:08.354988 96 libfastertransformer.cc:1117] collect name: beam_width size: 4 bytes
triton_1         | W0817 15:21:08.354998 96 libfastertransformer.cc:1117] collect name: beam_search_diversity_rate size: 4 bytes
triton_1         | W0817 15:21:08.355005 96 libfastertransformer.cc:1117] collect name: repetition_penalty size: 4 bytes
triton_1         | W0817 15:21:08.355010 96 libfastertransformer.cc:1130] the data is in CPU
triton_1         | W0817 15:21:08.355015 96 libfastertransformer.cc:1137] the data is in CPU
triton_1         | W0817 15:21:08.355025 96 libfastertransformer.cc:999] before ThreadForward 0
triton_1         | W0817 15:21:08.355069 96 libfastertransformer.cc:1006] after ThreadForward 0
triton_1         | I0817 15:21:08.355097 96 libfastertransformer.cc:834] Start to forward
triton_1         | terminate called after throwing an instance of 'std::runtime_error'
triton_1         |   what():  [FT][ERROR] CUDA runtime error: invalid device function /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/kernels/sampling_topp_kernels.cu:1057
triton_1         |
triton_1         | Signal (6) received.
triton_1         |  0# 0x000055ACE072C699 in /opt/tritonserver/bin/tritonserver
triton_1         |  1# 0x00007F0F78E2D090 in /usr/lib/x86_64-linux-gnu/libc.so.6
triton_1         |  2# gsignal in /usr/lib/x86_64-linux-gnu/libc.so.6
triton_1         |  3# abort in /usr/lib/x86_64-linux-gnu/libc.so.6
triton_1         |  4# 0x00007F0F791E6911 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
triton_1         |  5# 0x00007F0F791F238C in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
triton_1         |  6# 0x00007F0F791F23F7 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
triton_1         |  7# 0x00007F0F791F26A9 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
triton_1         |  8# void fastertransformer::check<cudaError>(cudaError, char const*, char const*, int) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
triton_1         |  9# void fastertransformer::invokeTopPSampling<float>(void*, unsigned long&, unsigned long&, int*, int*, bool*, float*, float*, float const*, int const*, int*, int*, curandStateXORWOW*, int, unsigned long, int const*, float, CUstream_st*, cudaDeviceProp*) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
triton_1         | 10# fastertransformer::TopPSamplingLayer<float>::allocateBuffer(unsigned long, unsigned long, float) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
triton_1         | 11# fastertransformer::TopPSamplingLayer<float>::runSampling(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > >*, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > > const*) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
triton_1         | 12# fastertransformer::BaseSamplingLayer<float>::forward(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > >*, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > > const*) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
triton_1         | 13# fastertransformer::DynamicDecodeLayer<float>::forward(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > >*, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > > const*) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
triton_1         | 14# fastertransformer::GptJ<__half>::forward(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > >*, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > > const*, fastertransformer::GptJWeight<__half> const*) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
triton_1         | 15# GptJTritonModelInstance<__half>::forward(std::shared_ptr<std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, triton::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, triton::Tensor> > > >) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
triton_1         | 16# 0x00007F0F700ED40A in /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so
triton_1         | 17# 0x00007F0F7921EDE4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
triton_1         | 18# 0x00007F0F7A42D609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0
triton_1         | 19# clone in /usr/lib/x86_64-linux-gnu/libc.so.6
triton_1         |
copilot_proxy_1  | [2022-08-17 15:21:08,929] ERROR in app: Exception on /v1/engines/codegen/completions [POST]
copilot_proxy_1  | Traceback (most recent call last):
copilot_proxy_1  |   File "/usr/local/lib/python3.8/site-packages/flask/app.py", line 2463, in wsgi_app
copilot_proxy_1  |     response = self.full_dispatch_request()
copilot_proxy_1  |   File "/usr/local/lib/python3.8/site-packages/flask/app.py", line 1760, in full_dispatch_request
copilot_proxy_1  |     rv = self.handle_user_exception(e)
copilot_proxy_1  |   File "/usr/local/lib/python3.8/site-packages/flask/app.py", line 1758, in full_dispatch_request
copilot_proxy_1  |     rv = self.dispatch_request()
copilot_proxy_1  |   File "/usr/local/lib/python3.8/site-packages/flask/app.py", line 1734, in dispatch_request
copilot_proxy_1  |     return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
copilot_proxy_1  |   File "/python-docker/app.py", line 258, in completions
copilot_proxy_1  |     response=codegen(data),
copilot_proxy_1  |   File "/python-docker/app.py", line 234, in __call__
copilot_proxy_1  |     completion, choices = self.generate(data)
copilot_proxy_1  |   File "/python-docker/app.py", line 146, in generate
copilot_proxy_1  |     result = self.client.infer(model_name, inputs)
copilot_proxy_1  |   File "/usr/local/lib/python3.8/site-packages/tritonclient/grpc/__init__.py", line 1322, in infer
copilot_proxy_1  |     raise_error_grpc(rpc_error)
copilot_proxy_1  |   File "/usr/local/lib/python3.8/site-packages/tritonclient/grpc/__init__.py", line 62, in raise_error_grpc
copilot_proxy_1  |     raise get_error_grpc(rpc_error) from None
copilot_proxy_1  | tritonclient.utils.InferenceServerException: [StatusCode.UNAVAILABLE] Socket closed
copilot_proxy_1  | 192.168.0.179 - - [17/Aug/2022 15:21:08] "POST /v1/engines/codegen/completions HTTP/1.1" 500 -
triton_1         | --------------------------------------------------------------------------
triton_1         | Primary job  terminated normally, but 1 process returned
triton_1         | a non-zero exit code. Per user-direction, the job has been aborted.
triton_1         | --------------------------------------------------------------------------
triton_1         | --------------------------------------------------------------------------
triton_1         | mpirun noticed that process rank 0 with PID 0 on node 1f7b69d48c22 exited on signal 6 (Aborted).
triton_1         | --------------------------------------------------------------------------
fauxpilot_triton_1 exited with code 134

What could be causing this issue?
Any hints or clues are welcome. Thank you.

Different executable name for newer Docker Compose

With recent versions of Docker Compose (2.6), installed as the Debian package "docker-compose-plugin", the docker-compose executable has been replaced by a subcommand, docker compose.
launch.sh has the docker-compose command hard-coded. It should choose the correct command depending on the version installed (or just try one and fall back to the other).
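The "try it and fall back" idea can be sketched as follows. launch.sh is a shell script, so this Python version only illustrates the logic; the function names are hypothetical. It probes `docker compose version` first and falls back to `docker-compose version`, and the probe is injectable so the selection logic can be exercised without Docker installed.

```python
import subprocess
from typing import Callable, List, Optional


def _runs_ok(command: List[str]) -> bool:
    """Return True if the command exists and exits with status 0."""
    try:
        completed = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except OSError:  # includes FileNotFoundError when the binary is missing
        return False
    return completed.returncode == 0


def find_compose_command(
    runs_ok: Callable[[List[str]], bool] = _runs_ok,
) -> Optional[List[str]]:
    """Prefer the `docker compose` subcommand, then the standalone binary."""
    if runs_ok(["docker", "compose", "version"]):
        return ["docker", "compose"]
    if runs_ok(["docker-compose", "version"]):
        return ["docker-compose"]
    return None
```

A caller would append its own arguments, for example `find_compose_command() + ["up"]`, after checking that the result is not `None`.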

Small issues

Thanks for the great repo! Just a couple of issues:

  • the OpenAI root URL should not include the 'v1'
  • docker-compose could be renamed to docker compose in the README, as docker-compose is now deprecated
  • Issue with config files for setups with >2 GPUs.

Code Block suggestion

Can FauxPilot suggest a full code block like GitHub Copilot does? Right now, even when writing a quicksort method, I need to press Tab 100 times.

Running FauxPilot with A6000?

A couple of questions:

  • Could I run the codegen-16B-multi model on a single A6000?
  • Which GPU did you test against during development? I am asking so that I can choose the same GPU and avoid hiccups.

Bad performance for running this project

Hi, I am trying to run the tritonserver and flask proxy in the same container and found that the performance is bad.

10.110.2.179 - - [10/Sep/2022 08:46:43] "POST /v1/engines/codegen/completions HTTP/1.1" 200 -
Returned completion in 6501.662492752075 ms
10.110.2.179 - - [10/Sep/2022 08:46:45] "POST /v1/engines/codegen/completions HTTP/1.1" 200 -
Returned completion in 6626.265287399292 ms
10.110.2.179 - - [10/Sep/2022 08:46:45] "POST /v1/engines/codegen/completions HTTP/1.1" 200 -
Returned completion in 6687.429904937744 ms
10.110.2.179 - - [10/Sep/2022 08:46:46] "POST /v1/engines/codegen/completions HTTP/1.1" 200 -
Returned completion in 6587.698698043823 ms
10.110.2.179 - - [10/Sep/2022 08:46:46] "POST /v1/engines/codegen/completions HTTP/1.1" 200 -
Returned completion in 5684.9658489227295 ms
10.110.2.179 - - [10/Sep/2022 08:46:46] "POST /v1/engines/codegen/completions HTTP/1.1" 200 -
Returned completion in 5248.137474060059 ms
10.110.2.179 - - [10/Sep/2022 08:46:46] "POST /v1/engines/codegen/completions HTTP/1.1" 200 -
Returned completion in 4866.7027950286865 ms
10.110.2.179 - - [10/Sep/2022 08:46:46] "POST /v1/engines/codegen/completions HTTP/1.1" 200 -
Returned completion in 4606.655836105347 ms
10.110.2.179 - - [10/Sep/2022 08:46:46] "POST /v1/engines/codegen/completions HTTP/1.1" 200 -

Every request needs 4 or more seconds. This is not acceptable, especially since I am using a V100 GPU, which I think should be good enough. I hope someone can help me figure out the reason.
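To put numbers on the latency, the "Returned completion in ... ms" lines can be summarised with a few lines of Python. This is a minimal sketch: the helper name `latencies_ms` is hypothetical, and the sample contains only three of the lines from the log above.

```python
import re
import statistics

# Matches the proxy's timing lines, e.g. "Returned completion in 6501.66 ms".
LATENCY_RE = re.compile(r"Returned completion in ([0-9.]+) ms")


def latencies_ms(log_text):
    """Extract every reported completion latency (in milliseconds) from a log."""
    return [float(value) for value in LATENCY_RE.findall(log_text)]


sample = """\
Returned completion in 6501.662492752075 ms
Returned completion in 5248.137474060059 ms
Returned completion in 4606.655836105347 ms
"""

values = latencies_ms(sample)
print(
    f"n={len(values)} min={min(values):.0f} ms "
    f"median={statistics.median(values):.0f} ms max={max(values):.0f} ms"
)
```

Feeding the whole proxy log through the same function makes it easier to compare runs after changing the model or the deployment.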

This is the config.env:

MODEL=codegen-350M-multi
NUM_GPUS=1
MODEL_DIR=...

This is the nvidia-smi info:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.86       Driver Version: 470.86       CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:06:00.0 Off |                    0 |
| N/A   32C    P0    36W / 250W |   1649MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

And here is the log from starting tritonserver:

I0910 08:43:31.338765 258862 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f5498000000' with size 268435456
I0910 08:43:31.339558 258862 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0910 08:43:31.346725 258862 model_repository_manager.cc:1191] loading: fastertransformer:1
I0910 08:43:31.567535 258862 libfastertransformer.cc:1226] TRITONBACKEND_Initialize: fastertransformer
I0910 08:43:31.567573 258862 libfastertransformer.cc:1236] Triton TRITONBACKEND API version: 1.10
I0910 08:43:31.567581 258862 libfastertransformer.cc:1242] 'fastertransformer' TRITONBACKEND API version: 1.10
I0910 08:43:31.567638 258862 libfastertransformer.cc:1274] TRITONBACKEND_ModelInitialize: fastertransformer (version 1)
W0910 08:43:31.569452 258862 libfastertransformer.cc:149] model configuration:
{
    "name": "fastertransformer",
    "platform": "",
    "backend": "fastertransformer",
    "version_policy": {
        "latest": {
            "num_versions": 1
        }
    },
    "max_batch_size": 1024,
    "input": [
        {
            "name": "input_ids",
            "data_type": "TYPE_UINT32",
            "format": "FORMAT_NONE",
            "dims": [
                -1
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": false
        },
        {
            "name": "start_id",
            "data_type": "TYPE_UINT32",
            "format": "FORMAT_NONE",
            "dims": [
                1
            ],
            "reshape": {
                "shape": []
            },
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": true
        },
        {
            "name": "end_id",
            "data_type": "TYPE_UINT32",
            "format": "FORMAT_NONE",
            "dims": [
                1
            ],
            "reshape": {
                "shape": []
            },
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": true
        },
        {
            "name": "input_lengths",
            "data_type": "TYPE_UINT32",
            "format": "FORMAT_NONE",
            "dims": [
                1
            ],
            "reshape": {
                "shape": []
            },
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": false
        },
        {
            "name": "request_output_len",
            "data_type": "TYPE_UINT32",
            "format": "FORMAT_NONE",
            "dims": [
                -1
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": false
        },
        {
            "name": "runtime_top_k",
            "data_type": "TYPE_UINT32",
            "format": "FORMAT_NONE",
            "dims": [
                1
            ],
            "reshape": {
                "shape": []
            },
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": true
        },
        {
            "name": "runtime_top_p",
            "data_type": "TYPE_FP32",
            "format": "FORMAT_NONE",
            "dims": [
                1
            ],
            "reshape": {
                "shape": []
            },
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": true
        },
        {
            "name": "beam_search_diversity_rate",
            "data_type": "TYPE_FP32",
            "format": "FORMAT_NONE",
            "dims": [
                1
            ],
            "reshape": {
                "shape": []
            },
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": true
        },
        {
            "name": "temperature",
            "data_type": "TYPE_FP32",
            "format": "FORMAT_NONE",
            "dims": [
                1
            ],
            "reshape": {
                "shape": []
            },
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": true
        },
        {
            "name": "len_penalty",
            "data_type": "TYPE_FP32",
            "format": "FORMAT_NONE",
            "dims": [
                1
            ],
            "reshape": {
                "shape": []
            },
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": true
        },
        {
            "name": "repetition_penalty",
            "data_type": "TYPE_FP32",
            "format": "FORMAT_NONE",
            "dims": [
                1
            ],
            "reshape": {
                "shape": []
            },
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": true
        },
        {
            "name": "random_seed",
            "data_type": "TYPE_INT32",
            "format": "FORMAT_NONE",
            "dims": [
                1
            ],
            "reshape": {
                "shape": []
            },
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": true
        },
        {
            "name": "is_return_log_probs",
            "data_type": "TYPE_BOOL",
            "format": "FORMAT_NONE",
            "dims": [
                1
            ],
            "reshape": {
                "shape": []
            },
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": true
        },
        {
            "name": "beam_width",
            "data_type": "TYPE_UINT32",
            "format": "FORMAT_NONE",
            "dims": [
                1
            ],
            "reshape": {
                "shape": []
            },
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": true
        },
        {
            "name": "bad_words_list",
            "data_type": "TYPE_INT32",
            "format": "FORMAT_NONE",
            "dims": [
                2,
                -1
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": true
        },
        {
            "name": "stop_words_list",
            "data_type": "TYPE_INT32",
            "format": "FORMAT_NONE",
            "dims": [
                2,
                -1
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": true
        }
    ],
    "output": [
        {
            "name": "output_ids",
            "data_type": "TYPE_UINT32",
            "dims": [
                -1,
                -1
            ],
            "label_filename": "",
            "is_shape_tensor": false
        },
        {
            "name": "sequence_length",
            "data_type": "TYPE_UINT32",
            "dims": [
                -1
            ],
            "label_filename": "",
            "is_shape_tensor": false
        },
        {
            "name": "cum_log_probs",
            "data_type": "TYPE_FP32",
            "dims": [
                -1
            ],
            "label_filename": "",
            "is_shape_tensor": false
        },
        {
            "name": "output_log_probs",
            "data_type": "TYPE_FP32",
            "dims": [
                -1,
                -1
            ],
            "label_filename": "",
            "is_shape_tensor": false
        }
    ],
    "batch_input": [],
    "batch_output": [],
    "optimization": {
        "priority": "PRIORITY_DEFAULT",
        "input_pinned_memory": {
            "enable": true
        },
        "output_pinned_memory": {
            "enable": true
        },
        "gather_kernel_buffer_threshold": 0,
        "eager_batching": false
    },
    "instance_group": [
        {
            "name": "fastertransformer_0",
            "kind": "KIND_CPU",
            "count": 1,
            "gpus": [],
            "secondary_devices": [],
            "profile": [],
            "passive": false,
            "host_policy": ""
        }
    ],
    "default_model_filename": "codegen-350M-multi",
    "cc_model_filenames": {},
    "metric_tags": {},
    "parameters": {
        "start_id": {
            "string_value": "50256"
        },
        "model_name": {
            "string_value": "codegen-350M-multi"
        },
        "is_half": {
            "string_value": "1"
        },
        "enable_custom_all_reduce": {
            "string_value": "0"
        },
        "vocab_size": {
            "string_value": "51200"
        },
        "tensor_para_size": {
            "string_value": "1"
        },
        "decoder_layers": {
            "string_value": "20"
        },
        "size_per_head": {
            "string_value": "64"
        },
        "max_seq_len": {
            "string_value": "2048"
        },
        "end_id": {
            "string_value": "50256"
        },
        "inter_size": {
            "string_value": "4096"
        },
        "head_num": {
            "string_value": "16"
        },
        "model_type": {
            "string_value": "GPT-J"
        },
        "model_checkpoint_path": {
            "string_value": "/model/fastertransformer/1/1-gpu"
        },
        "rotary_embedding": {
            "string_value": "32"
        },
        "pipeline_para_size": {
            "string_value": "1"
        }
    },
    "model_warmup": []
}
I0910 08:43:31.569890 258862 libfastertransformer.cc:1320] TRITONBACKEND_ModelInstanceInitialize: fastertransformer_0 (device 0)
W0910 08:43:31.569915 258862 libfastertransformer.cc:453] Faster transformer model instance is created at GPU '0'
W0910 08:43:31.569922 258862 libfastertransformer.cc:459] Model name codegen-350M-multi
W0910 08:43:31.569940 258862 libfastertransformer.cc:578] Get input name: input_ids, type: TYPE_UINT32, shape: [-1]
W0910 08:43:31.569948 258862 libfastertransformer.cc:578] Get input name: start_id, type: TYPE_UINT32, shape: [1]
W0910 08:43:31.569954 258862 libfastertransformer.cc:578] Get input name: end_id, type: TYPE_UINT32, shape: [1]
W0910 08:43:31.569960 258862 libfastertransformer.cc:578] Get input name: input_lengths, type: TYPE_UINT32, shape: [1]
W0910 08:43:31.569966 258862 libfastertransformer.cc:578] Get input name: request_output_len, type: TYPE_UINT32, shape: [-1]
W0910 08:43:31.569972 258862 libfastertransformer.cc:578] Get input name: runtime_top_k, type: TYPE_UINT32, shape: [1]
W0910 08:43:31.569978 258862 libfastertransformer.cc:578] Get input name: runtime_top_p, type: TYPE_FP32, shape: [1]
W0910 08:43:31.569984 258862 libfastertransformer.cc:578] Get input name: beam_search_diversity_rate, type: TYPE_FP32, shape: [1]
W0910 08:43:31.569990 258862 libfastertransformer.cc:578] Get input name: temperature, type: TYPE_FP32, shape: [1]
W0910 08:43:31.569995 258862 libfastertransformer.cc:578] Get input name: len_penalty, type: TYPE_FP32, shape: [1]
W0910 08:43:31.570001 258862 libfastertransformer.cc:578] Get input name: repetition_penalty, type: TYPE_FP32, shape: [1]
W0910 08:43:31.570006 258862 libfastertransformer.cc:578] Get input name: random_seed, type: TYPE_INT32, shape: [1]
W0910 08:43:31.570012 258862 libfastertransformer.cc:578] Get input name: is_return_log_probs, type: TYPE_BOOL, shape: [1]
W0910 08:43:31.570018 258862 libfastertransformer.cc:578] Get input name: beam_width, type: TYPE_UINT32, shape: [1]
W0910 08:43:31.570024 258862 libfastertransformer.cc:578] Get input name: bad_words_list, type: TYPE_INT32, shape: [2, -1]
W0910 08:43:31.570034 258862 libfastertransformer.cc:578] Get input name: stop_words_list, type: TYPE_INT32, shape: [2, -1]
W0910 08:43:31.570046 258862 libfastertransformer.cc:620] Get output name: output_ids, type: TYPE_UINT32, shape: [-1, -1]
W0910 08:43:31.570053 258862 libfastertransformer.cc:620] Get output name: sequence_length, type: TYPE_UINT32, shape: [-1]
W0910 08:43:31.570059 258862 libfastertransformer.cc:620] Get output name: cum_log_probs, type: TYPE_FP32, shape: [-1]
W0910 08:43:31.570065 258862 libfastertransformer.cc:620] Get output name: output_log_probs, type: TYPE_FP32, shape: [-1, -1]
[FT][WARNING] Custom All Reduce only supports 8 Ranks currently. Using NCCL as Comm.
I0910 08:43:31.853587 258862 libfastertransformer.cc:307] Before Loading Model:
after allocation, free 31.19 GB total 31.75 GB
[WARNING] gemm_config.in is not found; using default GEMM algo
I0910 08:43:34.250280 258862 libfastertransformer.cc:321] After Loading Model:
after allocation, free 30.33 GB total 31.75 GB
I0910 08:43:34.251457 258862 libfastertransformer.cc:537] Model instance is created on GPU Tesla V100-PCIE-32GB
I0910 08:43:34.252189 258862 model_repository_manager.cc:1345] successfully loaded 'fastertransformer' version 1
I0910 08:43:34.252418 258862 server.cc:556]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0910 08:43:34.252533 258862 server.cc:583]
+-------------------+--------------------------------------------+--------------------------------------------+
| Backend           | Path                                       | Config                                     |
+-------------------+--------------------------------------------+--------------------------------------------+
| fastertransformer | /opt/tritonserver/backends/fastertransform | {"cmdline":{"auto-complete-config":"false" |
|                   | er/libtriton_fastertransformer.so          | ,"min-compute-capability":"6.000000","back |
|                   |                                            | end-directory":"/opt/tritonserver/backends |
|                   |                                            | ","default-max-batch-size":"4"}}           |
|                   |                                            |                                            |
+-------------------+--------------------------------------------+--------------------------------------------+

I0910 08:43:34.252618 258862 server.cc:626]
+-------------------+---------+--------+
| Model             | Version | Status |
+-------------------+---------+--------+
| fastertransformer | 1       | READY  |
+-------------------+---------+--------+

I0910 08:43:34.298144 258862 metrics.cc:650] Collecting metrics for GPU 0: Tesla V100-PCIE-32GB
I0910 08:43:34.298580 258862 tritonserver.cc:2159]
+----------------------------------+----------------------------------------------------------------------------+
| Option                           | Value                                                                      |
+----------------------------------+----------------------------------------------------------------------------+
| server_id                        | triton                                                                     |
| server_version                   | 2.23.0                                                                     |
| server_extensions                | classification sequence model_repository model_repository(unload_dependent |
|                                  | s) schedule_policy model_configuration system_shared_memory cuda_shared_me |
|                                  | mory binary_tensor_data statistics trace                                   |
| model_repository_path[0]         | /openbayes/home/fauxpilot/models/codegen-350M-multi-1gpu                   |
| model_control_mode               | MODE_NONE                                                                  |
| strict_model_config              | 1                                                                          |
| rate_limit                       | OFF                                                                        |
| pinned_memory_pool_byte_size     | 268435456                                                                  |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                   |
| response_cache_byte_size         | 0                                                                          |
| min_supported_compute_capability | 6.0                                                                        |
| strict_readiness                 | 1                                                                          |
| exit_timeout                     | 30                                                                         |
+----------------------------------+----------------------------------------------------------------------------+

I0910 08:43:34.329038 258862 grpc_server.cc:4587] Started GRPCInferenceService at 0.0.0.0:8001
I0910 08:43:34.329462 258862 http_server.cc:3303] Started HTTPService at 0.0.0.0:8000
I0910 08:43:34.370706 258862 http_server.cc:178] Started Metrics Service at 0.0.0.0:8002

Create a FAQ document and known working configurations

Now that FauxPilot has been used by quite a few people, it would be great to collect the questions that come up repeatedly in a Frequently Asked Questions (FAQ) page. I have started tagging issues with a label to collect such questions.

Another helpful thing would be a list of GPUs and model sizes that are known to work, so that people can easily see if their configuration should work.

Proposal: Participating in Hacktoberfest (10/1 - 10/31)

This year, Hacktoberfest will run for the whole month of October.
Many GitHub open source projects are participating.

To encourage code contribution, I would like to suggest that this community participate here. 😺
Methods for participation can be found in the ISSUE and Pull Request (PR) menu.
@moyix, you could add the two labels HACKTOBERFEST and HACKTOBERFEST-ACCEPTED.

Hacktoberfest is a month-long celebration of open source projects, their maintainers, 
and the entire community of contributors. Each October, open source maintainers 
give new contributors extra attention as they guide developers through their first 
pull requests on GitHub.

Dockerfile to build the moyix/triton_with_ft:22.06 image

The repository contains Dockerfiles to recreate the moyix/model_converter:latest and moyix/copilot_proxy:latest images, but not the moyix/triton_with_ft:22.06 image. Would it be possible to add the Dockerfile and config needed to build this image?

CPU instead of GPU

Is there a way to do such a thing with CPU instead of GPU? I know this would be slower, but it would be a cheaper solution and would not depend on NVIDIA.

Do we know the server name used by Copilot?

Hi,

I'm using the JetBrains Copilot plugin, which is not configurable. I've tried pointing api.openai.com elsewhere in my hosts file, but the server isn't being hit.

Is it a different hostname, such as copilot.github.com or something?

Thanks!

Build fails: https://huggingface.co/codegen-16B-mono-hf is private

It seems that the repository https://huggingface.co/codegen-16B-mono-hf/resolve/main/config.json requires authentication. This makes the build fail for me.

$ ~/fauxpilot (main) $ ./setup.sh 
Models available:
[1] codegen-350M-mono (2GB total VRAM required; Python-only)
[2] codegen-350M-multi (2GB total VRAM required; multi-language)
[3] codegen-2B-mono (7GB total VRAM required; Python-only)
[4] codegen-2B-multi (7GB total VRAM required; multi-language)
[5] codegen-6B-mono (13GB total VRAM required; Python-only)
[6] codegen-6B-multi (13GB total VRAM required; multi-language)
[7] codegen-16B-mono (32GB total VRAM required; Python-only)
[8] codegen-16B-multi (32GB total VRAM required; multi-language)
Enter your choice [6]: 7
Enter number of GPUs [1]: 1
Where do you want to save the model [/home/user/fauxpilot/models]? 
Downloading and converting the model, this will take a while...
Unable to find image 'moyix/model_conveter:latest' locally
latest: Pulling from moyix/model_conveter
[many "Pull complete"s]
Digest: sha256:744858f56b5eef785fde79b0f3bc76887fe34f14d0f8c01b06bf92ccd551b3ac
Status: Downloaded newer image for moyix/model_conveter:latest
Converting model codegen-16B-mono with 1 GPUs
Downloading config.json:   0%|          | 0.00/994 [00:00<?, ?B/s]Loading CodeGen model
Downloading config.json: 100%|██████████| 994/994 [00:00<00:00, 1.59MB/s]
Downloading pytorch_model.bin: 100%|██████████| 30.0G/30.0G [06:19<00:00, 84.9MB/s]
download_and_convert_model.sh: line 9:     8 Killed                  python3 codegen_gptj_convert.py --code_model Salesforce/${MODEL} ${MODEL}-hf

=============== Argument ===============
saved_dir: /models/codegen-16B-mono-1gpu/fastertransformer/1
in_file: codegen-16B-mono-hf
trained_gpu_num: 1
infer_gpu_num: 1
processes: 4
weight_data_type: fp32
========================================
Traceback (most recent call last):
  File "/transformers/src/transformers/configuration_utils.py", line 619, in _get_config_dict
    resolved_config_file = cached_path(
  File "/transformers/src/transformers/utils/hub.py", line 285, in cached_path
    output_path = get_from_cache(
  File "/transformers/src/transformers/utils/hub.py", line 503, in get_from_cache
    _raise_for_status(r)
  File "/transformers/src/transformers/utils/hub.py", line 418, in _raise_for_status
    raise RepositoryNotFoundError(
transformers.utils.hub.RepositoryNotFoundError: 401 Client Error: Repository not found for url: https://huggingface.co/codegen-16B-mono-hf/resolve/main/config.json. If the repo is private, make sure you are authenticated.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "huggingface_gptj_convert.py", line 188, in <module>
    split_and_convert(args)
  File "huggingface_gptj_convert.py", line 86, in split_and_convert
    model = GPTJForCausalLM.from_pretrained(args.in_file)
  File "/transformers/src/transformers/modeling_utils.py", line 1844, in from_pretrained
    config, model_kwargs = cls.config_class.from_pretrained(
  File "/transformers/src/transformers/configuration_utils.py", line 530, in from_pretrained
    config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/transformers/src/transformers/configuration_utils.py", line 557, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/transformers/src/transformers/configuration_utils.py", line 631, in _get_config_dict
    raise EnvironmentError(
OSError: codegen-16B-mono-hf is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo with `use_auth_token` or log in with `huggingface-cli login` and pass `use_auth_token=True`.
Done! Now run ./launch.sh to start the FauxPilot server.
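One reading of the log above: the first conversion step (codegen_gptj_convert.py) was reported as "Killed" before it could write the local codegen-16B-mono-hf directory, so the second step passed a path that does not exist to from_pretrained, which then treated it as a Hub repository ID. That would explain the 401 without the repository actually being private; the log does not show why the process was killed. A guard like the following would surface the real problem earlier. It is only a sketch, and `require_local_model_dir` is a hypothetical helper, not part of the converter scripts.

```python
from pathlib import Path


def require_local_model_dir(path):
    """Fail early with a clear message if the converted checkpoint is missing."""
    model_dir = Path(path)
    if not (model_dir / "config.json").is_file():
        raise FileNotFoundError(
            f"{model_dir} has no config.json; the earlier conversion step "
            "probably did not finish, so there is nothing to load"
        )
    return model_dir
```

Calling it on the intermediate directory before the second conversion step would stop the script with a local error instead of a misleading Hub lookup.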

Apple silicon support?

Hi, I discovered this project from a blog and I find it interesting. The only issue is that I work on a MacBook Pro with an M2 chip, and I don't know how to adapt this project to run on Apple silicon and leverage its Neural Engine.

I'm a web developer and don't know much about machine learning.

Triton "no CUDA-capable device is detected" and "fastertransformer is not found"

Looking for help with the Triton Inference Server setup!

After running ./launch.sh, the following log is generated:
"
...
fauxpilot-triton-1 | W0907 17:13:12.624284 88 pinned_memory_manager.cc:236] Unable to allocate pinned system memory, pinned memory pool will not be available: no CUDA-capable device is detected
fauxpilot-triton-1 | I0907 17:13:12.624501 88 cuda_memory_manager.cc:115] CUDA memory pool disabled
fauxpilot-triton-1 | I0907 17:13:12.624647 88 server.cc:556]
...
W0907 17:13:12.704691 88 metrics.cc:634] Cannot get CUDA device count, GPU metrics will not be available
...
"
But the container is running. When I try to request inference, I get:
"tritonclient.utils.InferenceServerException: [StatusCode.UNAVAILABLE] Request for unknown model: 'fastertransformer' is not found"

Dependencies and system specs:
1xV100 gpu
Driver Version: 515.65.01
Docker Compose version v2.6.0
Nvidia docker: nvidia/cuda:11.0.3-base-ubuntu20.04

ERROR: Version in "./docker-compose.yaml" is unsupported.

bash launch.sh

ERROR: Version in "./docker-compose.yaml" is unsupported.
You might be seeing this error because you're using the wrong Compose file version.
Either specify a version of "2" (or "2.0") and place your service definitions under the services key,
or omit the version key and place your service definitions at the root of the file to use version 1.
For more on the Compose file format versions, see https://docs.docker.com/compose/compose-file/

head docker-compose.yaml

version: '3.3'
services:
  triton:
    image: moyix/triton_with_ft:22.09

/etc/issue

Ubuntu 16.04 LTS

docker --version

Server: Docker Engine - Community
Engine:
Version: 20.10.7
API version: 1.41 (minimum version 1.12)
Go version: go1.13.15
Git commit: b0f5bc3
Built: Wed Jun 2 11:54:58 2021
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.4.6
GitCommit: d71fcd7d8303cbf684402823e425e9dd2e99285d
runc:
Version: 1.0.0-rc95
GitCommit: b9ee9c6314599f1b4a7f497e1f1f856fe433d3b7
docker-init:
Version: 0.19.0
GitCommit: de40ad0

docker-compose --version

docker-compose version 1.8.0, build unknown
docker-py version: 1.9.0
CPython version: 2.7.12
OpenSSL version: OpenSSL 1.0.2g 1 Mar 2016
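The README prerequisites ask for docker compose >= 1.28, and the installed docker-compose here reports 1.8.0, which would be consistent with the "Version ... is unsupported" error. A small version check can make that visible before launching. This is a sketch; the helper names are hypothetical, and it only compares the major and minor numbers.

```python
import re

MIN_COMPOSE = (1, 28)  # from the README prerequisites: docker compose >= 1.28


def parse_compose_version(output):
    """Pull the first dotted version number out of `--version` output."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", output)
    if match is None:
        raise ValueError(f"no version number found in {output!r}")
    return tuple(int(part) for part in match.groups(default="0"))


def is_supported(output):
    """Compare (major, minor) against the minimum the README asks for."""
    return parse_compose_version(output)[:2] >= MIN_COMPOSE


print(is_supported("docker-compose version 1.8.0, build unknown"))  # False
print(is_supported("Docker Compose version v2.6.0"))                # True
```

Comparing tuples of integers avoids the trap of comparing "1.8" and "1.28" as strings or floats.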

[FT][ERROR] CUDA runtime error: operation not supported

I am using a T4 GPU; the host machine's CUDA version is 11.0 and the driver is 450.102.04. Running launch.sh produces the error below.
Detailed log:

fauxpilot-triton-1         | W0812 03:06:40.864778 92 libfastertransformer.cc:620] Get output name: cum_log_probs, type: TYPE_FP32, shape: [-1]
fauxpilot-triton-1         | W0812 03:06:40.864782 92 libfastertransformer.cc:620] Get output name: output_log_probs, type: TYPE_FP32, shape: [-1, -1]
fauxpilot-triton-1         | [FT][WARNING] Custom All Reduce only supports 8 Ranks currently. Using NCCL as Comm.
fauxpilot-triton-1         | I0812 03:06:41.156692 92 libfastertransformer.cc:307] Before Loading Model:
fauxpilot-triton-1         | after allocation, free 6.56 GB total 8.00 GB
fauxpilot-triton-1         | [WARNING] gemm_config.in is not found; using default GEMM algo
fauxpilot-triton-1         | terminate called after throwing an instance of 'std::runtime_error'
fauxpilot-triton-1         |   what():  [FT][ERROR] CUDA runtime error: operation not supported /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/utils/allocator.h:181 
fauxpilot-triton-1         | 
fauxpilot-triton-1         | [5f61fab36b85:00092] *** Process received signal ***
fauxpilot-triton-1         | [5f61fab36b85:00092] Signal: Aborted (6)
fauxpilot-triton-1         | [5f61fab36b85:00092] Signal code:  (-6)
fauxpilot-triton-1         | [5f61fab36b85:00092] [ 0] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f3a7ef7e420]

VSCode plugin

Hi, I tried the VSCode plugin with the supplied configuration, but the plugin throws the following error:

[INFO] [auth] [2022-08-03T06:47:46.784Z] Invalid copilot token: missing token: 403
[ERROR] [default] [2022-08-03T06:47:46.787Z] GitHub Copilot could not connect to server. Extension activation failed: "User not authorized"

Do we need a GitHub Copilot subscription to get a working token?

Log-probabilities should not be turned on by default

An (unintended?) consequence of #49 seems to be that the completion API now returns log probabilities for each token even if they are not requested (i.e. if logprobs=NULL in the request). I poked around at it a little bit but couldn't immediately track down why it's happening. I thought it might be due to these lines:

https://github.com/moyix/fauxpilot/blob/main/copilot_proxy/utils/codegen.py#L84-L87

But changing that doesn't seem to have stopped it from returning LPs. @fdegier, do you know offhand what might have introduced this?

Not high priority, just something to fix up when I get a chance.
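For reference, the intended behaviour could look roughly like this. It is a sketch of the desired outcome only, not the code in codegen.py; `build_choice` and its parameters are hypothetical names.

```python
def build_choice(text, token_logprobs, requested_logprobs):
    """Build one completion choice, attaching log-probs only when requested."""
    choice = {"text": text, "index": 0, "logprobs": None}
    if requested_logprobs is not None:
        # Only expose log probabilities when the request asked for them.
        choice["logprobs"] = {"token_logprobs": token_logprobs}
    return choice
```

With `requested_logprobs=None` the `logprobs` field stays null, which is what clients that never asked for log probabilities would expect.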

Will the codex openai library work with fauxpilot?

Thanks for this fantastic work. I will do a quick comparison of code completion between fauxpilot and Copilot.

I will be making calls with the following API invocation:

        response = openai.Completion.create(
            model="code-davinci-002",
            prompt=input_prompt,
            temperature=temperature,
            max_tokens=max_tokens,
            top_p=1,
            n=number_of_suggestions,
            frequency_penalty=frequency_penalty,
            presence_penalty=0,
            stop="###",
        )
        # Keep the text of the last returned choice (empty string if there are none).
        result = ""
        choices = response["choices"] if "choices" in response else []
        if choices:
            result = choices[-1]["text"]

        # are these metrics present?
        response_completion_tokens = response["usage"]["completion_tokens"]
        response_prompt_tokens = response["usage"]["prompt_tokens"]
        response_total_tokens = response["usage"]["total_tokens"]

Can you please let me know whether this API invocation would work for fauxpilot?
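
Independently of whether the server accepts this call, the post-processing above can be made tolerant of missing fields. The following is a minimal sketch, assuming only that the response behaves like a dictionary with an optional `choices` list and an optional `usage` mapping, as in the snippet above. The helper name `summarize_completion` is hypothetical, and whether fauxpilot actually returns the `usage` metrics is not confirmed here.

```python
def summarize_completion(response):
    """Collect suggestion texts and token counts from a completion response.

    Hypothetical helper: assumes `response` is dict-like, as in the snippet above.
    """
    choices = response.get("choices") or []
    # Keep every suggestion instead of overwriting `result` with the last one.
    texts = [choice.get("text", "") for choice in choices]

    # Missing metrics come back as None rather than raising KeyError.
    usage = response.get("usage") or {}
    counts = {
        key: usage.get(key)
        for key in ("completion_tokens", "prompt_tokens", "total_tokens")
    }
    return texts, counts
```

With this shape, a response that has no `usage` block yields `None` for each count, so the comparison script can tell "not reported" apart from zero.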

[ERROR] CUDA runtime error: the provided PTX was compiled with an unsupported toolchain.

I get the following error when I run launch.sh. How do I fix it? Thanks.

fauxpilot-triton-1         | [FT][WARNING] Custom All Reduce only supports 8 Ranks currently. Using NCCL as Comm.
fauxpilot-triton-1         | I0811 02:59:27.879335 94 libfastertransformer.cc:307] Before Loading Model:
fauxpilot-triton-1         | terminate called after throwing an instance of 'std::runtime_error'
fauxpilot-triton-1         |   what():  [FT][ERROR] CUDA runtime error: the provided PTX was compiled with an unsupported toolchain. /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/utils/cuda_utils.h:393

Could you show fauxpilot using GIF?

In my experience, fauxpilot often offers poor suggestions like the one below, even with the codegen-6B-mono model.

image

By the way, could someone provide a GIF showing fauxpilot in action? Thanks!

[Discussion] PowerShell-based support for Windows

I rewrote the setup and launch scripts (and a few others) from this repository in PowerShell so that fauxpilot can be started directly on Windows with Docker, and it works fine on my device (as shown below). Although I used Windows in the name, I've recently made the scripts general enough to work well on Linux too (if anyone likes using pwsh on Linux as much as I do😸).

Using fauxpilot-windows in VSCode

I wonder if you are interested in such a project. Should I file a merge request to add the same functionality to this repo, or should it remain an independent project?

[FT][ERROR] CUDA runtime error: the provided PTX was compiled with an unsupported toolchain

While spinning up the compose stack via the launch script, I am getting the following error:

triton_1         | [FT][WARNING] Custom All Reduce only supports 8 Ranks currently. Using NCCL as Comm.
triton_1         | terminate called after throwing an instance of 'std::runtime_error'
triton_1         |   what():  [FT][ERROR] CUDA runtime error: the provided PTX was compiled with an unsupported toolchain. /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/utils/cuda_utils.h:393 
triton_1         | 
triton_1         | [0674cc13c0f5:00095] *** Process received signal ***
triton_1         | [0674cc13c0f5:00095] Signal: Aborted (6)
triton_1         | [0674cc13c0f5:00095] Signal code:  (-6)

Cuda Information

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0
nvidia-smi
Tue Sep 20 07:53:30 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0  On |                  N/A |
|  0%   49C    P8    30W / 320W |    334MiB / 10240MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      4833      G                                      35MiB |
|    0   N/A  N/A      5856      G                                     179MiB |
|    0   N/A  N/A      5982      G                                      51MiB |
|    0   N/A  N/A     17941      G                                      13MiB |
|    0   N/A  N/A    179354      G                                      11MiB |
|    0   N/A  N/A    189500      G                                      26MiB |
+-----------------------------------------------------------------------------+

Any way of porting this to Colab?

Most of us don't have GPUs powerful enough to even run models with 6 billion parameters. Can we port this to Colab in any way so it would be more accessible?

16B 2gpu models on HF Hub are corrupt

Not sure how this happened, but currently the 16B 2gpu models fail in unzstd with Decoding error (36) : Corrupted block detected. I will re-convert and re-upload them. Steps to fix:

  • Double-check to see if any other models are affected.
  • Re-run the conversion script and recreate the .tar.zst files.
  • To prevent this in the future, add a SHASUMS file for each model and check it in ./setup.sh to prevent corrupted downloads.
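
The checksum step in the last item could be sketched as follows. This is a minimal sketch under stated assumptions, not the check that ./setup.sh implements: it assumes the SHASUMS file uses the `sha256sum` line format (hex digest, whitespace, file name), and the function names `sha256_of` and `verify_against_shasums` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large model archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_against_shasums(shasums_path, directory):
    """Return the names of files whose digest does not match the SHASUMS entry.

    Assumes sha256sum-style lines: "<hex digest>  <file name>".
    """
    mismatched = []
    for line in Path(shasums_path).read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # sha256sum marks binary mode with a leading '*'
        if sha256_of(Path(directory) / name) != expected.lower():
            mismatched.append(name)
    return mismatched
```

An empty return value means every listed archive matched. A listed file that is absent raises `FileNotFoundError`, which a setup script would probably want to treat as a failed download as well.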

./launch.sh compile error: /mnt/c/Program no such file or directory

I installed the requirements and selected choice 4 with 1 GPU in ./setup.sh. Then I ran ./launch.sh, which printed the output below. I think the server was not launched. Simply creating a Program folder on my C drive didn't solve the problem. How can I fix this?

(base) [email protected]:~/fauxpilot$ ls
LICENSE  README.md  config.env  converter  copilot_proxy  docker-compose.yaml  example.env  launch.sh  models  setup.sh

(base) [email protected]:~/fauxpilot$ ./launch.sh
./launch.sh: line 19: /mnt/c/Program: No such file or directory

(base) [email protected]:~/fauxpilot$ ./launch.sh
./launch.sh: line 19: /mnt/c/Program: Is a directory

(base) [email protected]:~/fauxpilot$

service "triton" refers to undefined volume

Hi there,

Amazing job on fauxpilot! Thank you. I just wanted to make you aware of this error I got when I tried to run the docker containers:

~/dev/fauxpilot$ docker compose up
service "triton" refers to undefined volume y/codegen-350M-multi-1gpu: invalid compose project

I believe the issue is the missing dot in front of the local path, as shown below. It seems Docker interprets y/codegen-350M-multi-1gpu as a volume identifier instead of a local path. I am using WSL2 on Windows with Docker 20.10.12.

Not sure if the change should be in docker-compose.yaml or launch.sh.

diff --git a/docker-compose.yaml b/docker-compose.yaml
index 7a0745b..0e37b03 100644
--- a/docker-compose.yaml
+++ b/docker-compose.yaml
@@ -5,7 +5,7 @@ services:
     command: bash -c "CUDA_VISIBLE_DEVICES=${GPUS} mpirun -n 1 --allow-run-as-root /opt/tritonserver/bin/tritonserver --model-repository=/model"
     shm_size: '2gb'
     volumes:
-      - ${MODEL_DIR}:/model
+      - ./${MODEL_DIR}:/model
     ports:
       - "8000:8000"
       - "8001:8001"
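
If the fix were made on the launch.sh side instead, the path could be normalized before it reaches Compose. The sketch below is illustrative only and rests on an assumption: that Compose's short volume syntax reads a source as a host path only when it begins with `.`, `/` or `~`, and as a named volume otherwise. The function name `as_bind_mount_source` is hypothetical.

```python
def as_bind_mount_source(model_dir):
    """Prefix a bare relative path with './' so Compose reads it as a host path.

    Assumption: sources starting with '.', '/' or '~' are already treated
    as host paths; anything else is taken to be a named volume.
    """
    if model_dir.startswith((".", "/", "~")):
        return model_dir
    return "./" + model_dir
```

Compared with hard-coding `./${MODEL_DIR}` in docker-compose.yaml, this leaves absolute paths untouched, whereas the unconditional prefix would turn an absolute MODEL_DIR into a path under the project directory.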
