smallcloudai / refact

WebUI for Fine-Tuning and Self-hosting of Open-Source Large Language Models for Coding

Home Page: https://refact.ai

License: BSD 3-Clause "New" or "Revised" License

Languages: Dockerfile 0.08%, Python 25.19%, HTML 2.66%, JavaScript 69.41%, CSS 2.60%, Shell 0.06%
Topics: ai, autocompletion, chat, refactoring, fine-tuning, self-hosted, llama2, llms, starchat, starcoder

refact's Introduction


This repo contains the Refact WebUI for fine-tuning and self-hosting of code models, which you can later use inside Refact plugins for code completion and chat.



  • Fine-tuning of open-source code models
  • Self-hosting of open-source code models
  • Download and upload LoRAs
  • Use models for code completion and chat inside Refact plugins
  • Model sharding
  • Host several small models on one GPU
  • Use OpenAI and Anthropic keys to connect GPT-models for chat


Running Refact Self-Hosted in a Docker Container

The easiest way to run the self-hosted server is with the pre-built Docker image.

Install Docker with NVIDIA GPU support. On Windows you need to install WSL 2 first; there is a guide for doing this.
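For reference, a minimal install sketch for the NVIDIA container runtime on Ubuntu, assuming the NVIDIA driver is already installed and the NVIDIA Container Toolkit apt repository has been added (other distributions differ; see NVIDIA's install guide):

# Assumes the nvidia-container-toolkit apt repository is already configured
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Sanity check: Docker should be able to see the GPU
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi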

Run the Docker container with the following command:

docker run -d --rm --gpus all --shm-size=256m -p 8008:8008 -v refact-perm-storage:/perm_storage smallcloud/refact_self_hosting:latest

refact-perm-storage is a volume that is mounted inside the container at /perm_storage. All the configuration files, downloaded weights, and logs are stored there.

To upgrade, stop the container with docker kill XXX (the refact-perm-storage volume will retain your data), run docker pull smallcloud/refact_self_hosting, and start it again.
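As a concrete sketch of that upgrade sequence (the container ID is illustrative; look it up with docker ps):

docker ps                          # find the running refact container's ID
docker kill <container-id>         # container is removed (--rm), but refact-perm-storage keeps your data
docker pull smallcloud/refact_self_hosting:latest
docker run -d --rm --gpus all --shm-size=256m -p 8008:8008 -v refact-perm-storage:/perm_storage smallcloud/refact_self_hosting:latest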

Now you can visit http://127.0.0.1:8008 to see the server Web GUI.

Docker commands super short refresher

Add yourself to the docker group to run docker without sudo (works for Linux):
sudo usermod -aG docker {your user}

List all containers:

docker ps -a

Start and stop existing containers (stop doesn't remove them):

docker start XXX
docker stop XXX

Show messages from a container:

docker logs -f XXX

Remove a container and all its data (except data inside a volume):

docker rm XXX

Check out or delete a docker volume:

docker volume inspect VVV
docker volume rm VVV

See CONTRIBUTING.md for installation without a docker container.

Setting Up Plugins

Download Refact for VS Code or JetBrains.

Go to the plugin settings and set a custom inference URL: http://127.0.0.1:8008

JetBrains Settings > Tools > Refact.ai > Advanced > Inference URL
VSCode Extensions > Refact.ai Assistant > Settings > Infurl
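Before pointing the plugins at the server, a quick way to check that the inference URL is reachable (assuming the container above is running on the same machine); any HTTP response confirms the server is up:

curl -I http://127.0.0.1:8008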

Supported models

Model                          | Completion | Chat | Fine-tuning
Refact/1.6B                    | +          |      | +
starcoder/1b/base              | +          |      | +
starcoder/3b/base              | +          |      | +
starcoder/7b/base              | +          |      | +
starcoder/15b/base             | +          |      |
starcoder/15b/plus             | +          |      |
starcoder2/3b/base             | +          |      | +
starcoder2/7b/base             | +          |      | +
starcoder2/15b/base            | +          |      | +
wizardcoder/15b                | +          |      |
codellama/7b                   | +          |      | +
starchat/15b/beta              |            | +    |
wizardlm/7b                    |            | +    |
wizardlm/13b                   |            | +    |
wizardlm/30b                   |            | +    |
llama2/7b                      |            | +    |
llama2/13b                     |            | +    |
deepseek-coder/1.3b/base       | +          |      | +
deepseek-coder/5.7b/mqa-base   | +          |      | +
magicoder/6.7b                 |            | +    |
mistral/7b/instruct-v0.1       |            | +    |
mixtral/8x7b/instruct-v0.1     |            | +    |
deepseek-coder/6.7b/instruct   |            | +    |
deepseek-coder/33b/instruct    |            | +    |
stable/3b/code                 | +          |      |

Usage

Refact is free to use for individuals and small teams under the BSD-3-Clause license. If you wish to use Refact for Enterprise, please contact us.

Custom installation

You can also install the refact repo without Docker:

pip install .

If you have a GPU with CUDA capability >= 8.0, you can also install it with flash-attention v2 support:

FLASH_ATTENTION_FORCE_BUILD=TRUE MAX_JOBS=4 INSTALL_OPTIONAL=TRUE pip install .
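To verify that the optional flash-attention build actually ended up in the environment, a simple check (flash_attn is the upstream package's import name):

python -c "import flash_attn; print(flash_attn.__version__)"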

FAQ

Q: Can I run a model on CPU?

A: It doesn't run on CPU yet, but it's certainly possible to implement this.

Community & Support

refact's People

Contributors

adam-weinberger, anwirs, fuzzyreason, jegernoutt, klink, marcmcintosh, miranamer, mitya52, olegklimov, oxyplay, reymondzzzz, valaises, worldemar


refact's Issues

[bounty] CPU inference support, Mac M1/M2 inference support

There are several projects aiming to make inference on CPU efficient.

The first part is research:

  • Which project works better,
  • And is compatible with the Refact license,
  • And doesn't bloat the Docker image too much,
  • And allows using scratchpads similar to how inference_hf.py does (it needs a callback that streams output and allows stopping),
  • And whether it includes Mac M1/M2 support, or whether it makes sense to address Mac separately.

Please finish the first part and get a "go-ahead" before starting the second part.

The second part is implementation:

  • A script similar to inference_hf.py,
  • Little code,
  • Few dependencies,
  • Demonstrate that it works with the Refact-1.6b model, as well as StarCoder (at least the smaller sizes),
  • Integration with UI and watchdog is a plus, but efficient inference is obviously the priority.

AMD GPU support

Hi, is there any plan to support AMD GPUs, or any alternatives?
AMD GPUs have excellent VRAM capacity and compute power at a reasonable price.

Self-hosted server can't be used and the WebUI's GPU status shows "loading model"

WEBUI log
-- 302580 -- 20230816 00:49:00 WEBUI 10.244.43.64:1940 - "POST /v1/contrast HTTP/1.1" 200
-- 302580 -- 20230816 00:49:00 WEBUI *** CANCEL *** cancelling comp-sCPvuzAjDUXL 488.0ms
-- 302580 -- 20230816 00:49:01 WEBUI comp-s7zqIh7A8aiY model resolve "CONTRASTcode" func "infill" -> "CONTRASTcode/3b/multi" from XXX
-- 302580 -- 20230816 00:49:01 WEBUI 10.244.43.64:27439 - "POST /v1/contrast HTTP/1.1" 200
-- 302580 -- 20230816 00:49:31 WEBUI TIMEOUT comp-s7zqIh7A8aiY

Crash on Docker startup

When I try to start the container using the command given in the README, it immediately crashes with the following message (I've created the volume perm-storage manually and it's the same result):

% docker run --rm -p 8008:8008 -v perm-storage:/perm_storage --gpus all smallcloud/refact_self_hosting 

==========
== CUDA ==
==========

CUDA Version 11.6.2

Container image Copyright (c) 2016-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/dist-packages/refact_self_hosting/watchdog.py", line 77, in <module>
    logdir.mkdir(exist_ok=True, parents=False)
  File "/usr/lib/python3.8/pathlib.py", line 1288, in mkdir
    self._accessor.mkdir(self, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/workdir/logs'

Document SERVER_API_TOKEN

The docker image doesn't work unless you set the SERVER_API_TOKEN environment variable. Please include this in the documentation.
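For reference, a sketch of how the token can be passed as a regular Docker environment variable, based on the run command from the README (the token value is just a placeholder):

docker run -d --rm --gpus all --shm-size=256m -p 8008:8008 \
  -e SERVER_API_TOKEN=your-token-here \
  -v refact-perm-storage:/perm_storage smallcloud/refact_self_hosting:latest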

OOM issues cause auto-completion and chat response to stop

Preconditions:
Run with a large context or little amount of memory

Steps:
Start to chat or trigger an auto-completion request in IDE
Wait for the system to respond

Expected result:
The system should respond correctly

Actual result:
The system may stop responding without indicating an out-of-memory issue in the IDE.
Auto-completion requests or chat responses may be incomplete.

(screenshot)

Consider integrating with atom

https://github.com/AppThreat/atom

We created the open-source atom project for the precise identification of usages and dataflows across large code bases. This approach is better at summarizing and identifying context than the primitive tokenization commonly found everywhere.

We have a dedicated community support channel should you require any help with using atom or code analysis in general.

Allow renaming of completed LoRA runs

By allowing users to rename their completed runs, they can quickly identify and distinguish between different runs.

  • Give names to LoRAs when training, e.g. "all files", "only python" - that's what I would like to see for my last two runs
  • Rename existing LoRAs
  • When exporting, the .zip needs to retain the same name

Support parallel access?

When I request code completion using the plugin, it is blocked if another request is generating code. Do I need to configure something on the server side to support parallel generation?

Plugin in PyCharm and local model in Windows.

Is it possible to connect the plugin via the oobabooga-webui or koboldcpp API to a locally running model (Refact-1.6b, starcoder, etc.)?
If possible, how? Or can local models only be used as described here?

[UI] Interface blocked when starting GPU filtering without waiting for file load

Steps:
Add a big file to the list without waiting for the addition to complete
Immediately scan sources and run the GPU filter
Expected result:
The interface should not allow GPU filtering to start until all files have finished loading
Actual result:
GPU filtering starts anyway
An error message "No train files have been provided for filtering" is displayed
The Upload files/repos buttons are blocked until the page is refreshed
(screenshots)

Notes:
Issue occurs regardless of file size, but it's easier to reproduce with a big file (~500 MB+)
Issue does not occur when waiting for addition to complete before running GPU filter

[UI] Issues in uploading large files and inconsistent UI behavior

Steps:
Add a big file to the list.
Wait for the upload progress to reach 100%.

When uploading a large file and the upload progress reaches 100%, the window may remain displayed for a few more minutes (3+ minutes for a 2 GB file). I would like to better understand what is happening and how much longer I need to wait:
(screenshot)

While uploading a large file the interface is not blocked. It's possible to hide the "Upload file" window from the previous screen, and then it's unclear how to figure out what is going on:
(screenshot)

If you then open the "Upload file" window again while the list displays a .tmp file, there is no progress bar or working status:
(screenshot)

If you wait a few minutes, the file will eventually be uploaded:
(screenshot)

Also, on all screenshots you can see the error "No train files..." left over from the previous run. The interface has not been updated.

[bug] minor typos and punctuation

I suggest marking the following typos/bugs as a "good first issue", so we could attract some newcomers with low-hanging fruit.

refactor_typo_20230731_220955

  • "Context-aware chat on entire codebase" checkbox must be selected
  • Period is missing at the end of "Download Refact for VS Code or JetBrains"

Failures on insufficient GPU memory

This is my local setup:

NVIDIA-SMI 535.103   Driver Version: 537.13   CUDA Version: 12.2
GPU 0: NVIDIA RTX A2000 Laptop GPU, 00000000:01:00.0, 49C, 11W / 40W, 3886MiB / 4096MiB, 2% utilization

The container seems to start and load the model:

PS C:\Users\z0034zpz> docker run -d --rm -p 8008:8008 --env SERVER_API_TOKEN=LocalTokenInDockerContainer -v perm-storage:/perm_storage --gpus all smallcloud/refact_self_hosting
a455ae67d1b7b829460361bec6d3530629f7d2c1577b5e5c495318d5f378223a
PS C:\Users\z0034zpz> docker logs -f a455

==========
== CUDA ==
==========

CUDA Version 11.8.0

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

20230906 12:56:00 adding job model-contrastcode-3b-multi-0.cfg
20230906 12:56:00 adding job enum_gpus.cfg
20230906 12:56:00 adding job filetune.cfg
20230906 12:56:00 adding job filetune_filter_only.cfg
20230906 12:56:00 adding job process_uploaded.cfg
20230906 12:56:00 adding job webgui.cfg
20230906 12:56:00 CVD=0 starting python -m self_hosting_machinery.inference.inference_worker --model CONTRASTcode/3b/multi --compile
 -> pid 30
20230906 12:56:00 CVD= starting python -m self_hosting_machinery.scripts.enum_gpus
 -> pid 31
20230906 12:56:00 CVD= starting python -m self_hosting_machinery.webgui.webgui
 -> pid 32
-- 32 -- 20230906 12:56:00 WEBUI Started server process [32]
-- 32 -- 20230906 12:56:00 WEBUI Waiting for application startup.
-- 32 -- 20230906 12:56:00 WEBUI Application startup complete.
-- 32 -- 20230906 12:56:00 WEBUI Uvicorn running on http://0.0.0.0:8008 (Press CTRL+C to quit)
-- 30 -- 20230906 12:56:06 MODEL STATUS loading model
-- 32 -- 20230906 12:56:23 WEBUI Invalid HTTP request received.
-- 30 -- 20230906 12:56:44 MODEL STATUS test batch
20230906 12:57:22 30 finished python -m self_hosting_machinery.inference.inference_worker --model CONTRASTcode/3b/multi @:gpu00, retcode 0
/finished compiling as recognized by watchdog
20230906 12:57:23 CVD=0 starting python -m self_hosting_machinery.inference.inference_worker --model CONTRASTcode/3b/multi
 -> pid 105
-- 105 -- 20230906 12:57:25 MODEL STATUS loading model
-- 105 -- 20230906 12:57:51 MODEL STATUS test batch
-- 32 -- 20230906 12:58:15 WEBUI Invalid HTTP request received.

I tried to run the VS Code extension with and without an API key:
(screenshot)
I tried to use the extension, but it stays in progress forever:
(screenshot)
The logs inside the container signal an issue, but do not specify what:

-- 32 -- 20230906 13:25:05 WEBUI 127.0.0.1:45898 - "POST /infengine-v1/completions-wait-batch HTTP/1.1" 200
-- 32 -- 20230906 13:25:08 WEBUI Invalid HTTP request received.

Only when I open the WebUI can I see the actual error:
(screenshot)
Required memory exceeds the GPU's memory.

Could you please improve the logs so that a more detailed message is visible, and provide a clear warning that there is not enough memory on the graphics card?

401 Unauthorized... any additional configs to be put in?

Hi, I'm trying to host this without Docker. Please suggest if I am missing something.

info: Uvicorn running on https://0.0.0.0:8008 (Press CTRL+C to quit)
INFO: 16.142.51.155:38015 - "GET / HTTP/1.1" 404 Not Found
INFO: 16.142.51.155:38015 - "GET /favicon.ico HTTP/1.1" 404 Not Found
INFO: 16.142.51.155:38015 - "GET / HTTP/1.1" 404 Not Found
INFO: 16.142.51.155:5217 - "GET /v1/login HTTP/1.1" 200 OK
INFO:root:running /v1/contrast function=infill
INFO: 16.142.51.155:4871 - "POST /v1/contrast HTTP/1.1" 401 Unauthorized
INFO:root:running /v1/contrast function=infill
INFO: 16.142.51.155:6519 - "POST /v1/contrast HTTP/1.1" 401 Unauthorized
INFO:root:running /v1/contrast function=infill

Finetune "Start now" button is available to press even if there are no training files

Steps:
Set a small file loss threshold so that all files are filtered out
Filter the files

Expected result:
"Start Now" button to be unavailable if there are no training files

Actual result:
Funetune "Start Now" button is available to press even though there are no training files.
Finetune starts and fails after several minutes.
In log window: RuntimeError: No train files have been provided

[UI] With all files filtered, error message is bad

  File "/home/user/code/refact/refact_data_pipeline/finetune/finetune_filter.py", line 353, in pre_filtering
    test_files = random.choices(train_files, k=fcfg["limit_test_files"])
  File "/usr/lib/python3.9/random.py", line 487, in choices
    return [population[floor(random() * n)] for i in _repeat(None, k)]
  File "/usr/lib/python3.9/random.py", line 487, in <listcomp>
    return [population[floor(random() * n)] for i in _repeat(None, k)]
IndexError: list index out of range

This should be something more friendly.

[Feature] LORA export / more LORA training capabilities

I think Refact's strongest selling points are:

  1. Its fast, clean web UI.
  2. Its ability to run 100% offline, locally.
  3. Its ability to perform training/fine-tuning.

For the third, I think it would be great to have some more configurability and options, for example being able to name or add notes to a LoRA and export it for backups, testing, other instances, etc.

It would also be great if it was possible to perform fine tuning / training on other models.

Keep up the great work folks!

[plugin settings in IntelliJ] global permanent rules are not consistent

I am not sure if it's a bug or a feature.

When "global defaults" are switched between levels 0-2, the "global permanent rules to override by default" remain untouched.

refactai_settings_20230731_215235

However, when a new project is opened, it uses the "global defaults" as the default value.
Personally, I would like the "global defaults" switch to affect the "global permanent rules to override by default" as well.

refactai_settings_20230731_215740

Running on Fedora 38 - Docs Update

For self-hosting we are told to visit https://refact.ai/docs/self-hosting/ and run docker run -d --rm -p 8008:8008 -v perm-storage:/perm_storage --gpus all smallcloud/refact_self_hosting after ensuring we have Docker with NVIDIA GPU support. Unfortunately these instructions do not work for me, although I was able to run the previous release of Refact before the recent significant changes. Here is what I was getting when following those instructions:

-- 26 -- WARNING:root:output was:
-- 26 -- - no output -
-- 26 -- WARNING:root:nvidia-smi does not work, that's especially bad for initial setup.
-- 26 -- WARNING:root:Traceback (most recent call last):
-- 26 --   File "/usr/local/lib/python3.8/dist-packages/self_hosting_machinery/scripts/enum_gpus.py", line 17, in query_nvidia_smi
-- 26 --     nvidia_smi_output = subprocess.check_output([
-- 26 --   File "/usr/lib/python3.8/subprocess.py", line 415, in check_output
-- 26 --     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
-- 26 --   File "/usr/lib/python3.8/subprocess.py", line 516, in run
-- 26 --     raise CalledProcessError(retcode, process.args,
-- 26 -- subprocess.CalledProcessError: Command '['nvidia-smi', '--query-gpu=pci.bus_id,name,memory.used,memory.total,temperature.gpu', '--format=csv']' returned non-zero exit status 4.
-- 26 -- 

I can confirm, however, that importing enum_gpus.py into Python (tested 3.8, which is in the Dockerfile, and 3.11) and calling query_nvidia_smi succeeds. Additionally, running the nvidia-smi command with the flags from enum_gpus succeeds:

(refact) [mrhillsman@workstation refact]$ python --version
Python 3.8.17
(refact) [mrhillsman@workstation refact]$ python
Python 3.8.17 (default, Jun  8 2023, 00:00:00) 
[GCC 13.1.1 20230511 (Red Hat 13.1.1-2)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import subprocess
>>> subprocess.check_output(["nvidia-smi", "--query-gpu=pci.bus_id,name,memory.used,memory.total,temperature.gpu", "--format=csv"])
b'pci.bus_id, name, memory.used [MiB], memory.total [MiB], temperature.gpu\n00000000:01:00.0, NVIDIA GeForce RTX 3080, 11 MiB, 10240 MiB, 29\n'
>>> import self_hosting_machinery.scripts.enum_gpus as gpuenum
>>> gpuenum.query_nvidia_smi()
{'gpus': [{'id': '00000000:01:00.0', 'name': 'NVIDIA GeForce RTX 3080', 'mem_used_mb': 11, 'mem_total_mb': 10240, 'temp_celsius': 29}]}
>>> exit()
(refact) [mrhillsman@workstation refact]$ nvidia-smi --query-gpu=pci.bus_id,name,memory.used,memory.total,temperature.gpu --format=csv
pci.bus_id, name, memory.used [MiB], memory.total [MiB], temperature.gpu
00000000:01:00.0, NVIDIA GeForce RTX 3080, 11 MiB, 10240 MiB, 29
(refact) [mrhillsman@workstation refact]$ nvidia-smi 
Sat Jul 22 15:46:59 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3080        Off | 00000000:01:00.0 Off |                  N/A |
|  0%   29C    P8              13W / 370W |     11MiB / 10240MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2727      G   /usr/bin/gnome-shell                          3MiB |
+---------------------------------------------------------------------------------------+
(refact) [mrhillsman@workstation refact]$ uname -a
Linux workstation 6.3.12-200.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Jul  6 04:05:18 UTC 2023 x86_64 GNU/Linux
(refact) [mrhillsman@workstation refact]$ cat /etc/os-release 
NAME="Fedora Linux"
VERSION="38 (Workstation Edition)"
ID=fedora
VERSION_ID=38
VERSION_CODENAME=""
PLATFORM_ID="platform:f38"
PRETTY_NAME="Fedora Linux 38 (Workstation Edition)"
ANSI_COLOR="0;38;2;60;110;180"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:38"
DEFAULT_HOSTNAME="fedora"
HOME_URL="https://fedoraproject.org/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora/f38/system-administrators-guide/"
SUPPORT_URL="https://ask.fedoraproject.org/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=38
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=38
SUPPORT_END=2024-05-14
VARIANT="Workstation Edition"
VARIANT_ID=workstation
[mrhillsman@workstation refact-ai]$ sestatus
SELinux status:                 enabled
SELinuxfs mount:                /sys/fs/selinux
SELinux root directory:         /etc/selinux
Loaded policy name:             targeted
Current mode:                   enforcing
Mode from config file:          enforcing
Policy MLS status:              enabled
Policy deny_unknown status:     allowed
Memory protection checking:     actual (secure)
Max kernel policy version:      33

I would have created a PR for the documentation change, but I do not see a repo for the site documentation. Here is what I was able to run successfully, which I recommend be added to the documentation, either under Fedora 38 specifically or RPM-based OSs in general:

podman run -d -it --gpus 0 --security-opt=label=disable -p 8008:8008 -v perm_storage:/perm_storage smallcloud/refact_self_hosting

Problem with --env as last part of command on powershell

Hi,

docker run -p 8008:8008 --gpus 0 --name refact_self_hosting smallcloud/refact_self_hosting --env SERVER_API_TOKEN=$token

doesn't work on Windows 11 PowerShell, but

docker run --env SERVER_API_TOKEN=$token -p 8008:8008 --gpus 0 --name refact_self_hosting smallcloud/refact_self_hosting

works just fine. Here is a reference explaining why this happens: https://stackoverflow.com/a/64159568 (anything after the image name is passed to the container's command instead of being parsed as docker run options).
It may be good to update the docs to include that.

Model memory usage / quantization

According to this Refact blog post:

Check out the docs on self-hosting to get your AI code assistant up and running.
To run StarCoder using 4-bit quantization, you’ll need a 12GB GPU, and for 8-bit you’ll need 24GB.
It’s currently available for VS Code, and JetBrains IDEs.

I am currently using a 12GB GPU (RTX 4070), so that sounds great.

However, the interface does not offer any options to select quantization:

(screenshot)

If I attempt to select codellama/7b or starcoder/15b/base, it claims that I don't have enough memory... which is strange, considering I've been running (quantized) 13B parameter llama2 models on my GPU just fine using other software.

(screenshot)


The memory usage of even the smallest models is rather weird.

According to this Refact blog post:

With the smaller size, running the model is much faster and affordable than ever: the model can be served on most of all modern GPUs requiring just 3Gb[sic, 0] RAM and works great for real-time code completion tasks.

I've noticed that the Refact/1.6B model starts at about 5.3GB of VRAM usage, but jumps up to the full 12GB of VRAM as soon as I start doing any completions, which seems confusingly inefficient for a 1.6B parameter model, and a stark contrast to the stated goal of running on GPUs with only 3GB of VRAM. When I kill the container, my VRAM usage drops back to around zero, so it's not some other program using all this VRAM.

[sic, 0]: 3Gb (Gigabits) == ~0.375GB (Gigabytes), I'm assuming this should be 3GB.

(screenshot)


I'm just running the container under Docker on Windows using WSL 2, and everything works fine; it's just the memory usage that is confusing and concerning compared to other LLM software I have used (also under WSL 2).

I'm not sure if there are plans to offer quantized model options through the GUI, if there are ways of selecting these quantized models without the GUI, or what other options exist.

No progress bar in web GUI when downloading layers

When selecting a new model, the web GUI shows the message "loading model" but there's no progress bar. Only in the container's logs can I see that the layers are being downloaded.

Downloading layers.021.ln_m.weight: 100%|██████████| 5.51k/5.51k [00:00<00:00, 2.49MB/s]
Downloading layers.021.ln_m.bias: 100%|██████████| 5.51k/5.51k [00:00<00:00, 2.08MB/s]
-- 33 -- 20230905 10:34:50 WEBUI 172.17.0.1:37680 - "GET /tab-host-have-gpus HTTP/1.1" 200
-- 31 -- 
Downloading layers.021.pw.W1:   0%|          | 0.00/52.4M [00:00<?, ?B/s]
-- 33 -- 20230905 10:34:52 WEBUI 172.17.0.1:37680 - "GET /tab-host-have-gpus HTTP/1.1" 200
Downloading layers.021.pw.W1:  20%|█▉        | 10.5M/52.4M [00:01<00:07, 5.34MB/s]
-- 33 -- 20230905 10:34:54 WEBUI 172.17.0.1:37680 - "GET /tab-host-have-gpus HTTP/1.1" 200
Downloading layers.021.pw.W1:  40%|███▉      | 21.0M/52.4M [00:03<00:05, 6.05MB/s]
Downloading layers.021.pw.W1:  60%|█████▉    | 31.5M/52.4M [00:05<00:03, 6.35MB/s]
-- 33 -- 20230905 10:34:56 WEBUI 172.17.0.1:37680 - "GET /tab-host-have-gpus HTTP/1.1" 200
-- 33 -- 20230905 10:34:58 WEBUI 172.17.0.1:37680 - "GET /tab-host-have-gpus HTTP/1.1" 200
Downloading layers.021.pw.W1:  80%|███████▉  | 41.9M/52.4M [00:06<00:01, 6.20MB/s]
Downloading layers.021.pw.W1: 100%|██████████| 52.4M/52.4M [00:08<00:00, 6.32MB/s]
-- 33 -- 20230905 10:35:00 WEBUI 172.17.0.1:37680 - "GET /tab-host-have-gpus HTTP/1.1" 200

Etc.

Also, the model in the command is different from the selected one, which creates confusion as to which model is actually being loaded.

(screenshot)
