smallcloudai / refact

WebUI for Fine-Tuning and Self-hosting of Open-Source Large Language Models for Coding

Home Page: https://refact.ai

License: BSD 3-Clause "New" or "Revised" License

Languages: Dockerfile 0.08%, Python 25.19%, HTML 2.66%, JavaScript 69.41%, CSS 2.60%, Shell 0.06%
Topics: ai, autocompletion, chat, refactoring, fine-tuning, self-hosted, llama2, llms, starchat, starcoder

refact's Introduction


This repo contains the Refact WebUI for fine-tuning and self-hosting of code models, which you can later use inside Refact plugins for code completion and chat.



  • Fine-tuning of open-source code models
  • Self-hosting of open-source code models
  • Download and upload LoRAs
  • Use models for code completion and chat inside Refact plugins
  • Model sharding
  • Host several small models on one GPU
  • Use OpenAI and Anthropic keys to connect GPT-models for chat


Running Refact Self-Hosted in a Docker Container

The easiest way to run the self-hosted server is with the pre-built Docker image.

Install Docker with NVIDIA GPU support. On Windows you need to install WSL 2 first; there is a guide for doing this.
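For reference, a minimal install sketch for the NVIDIA container runtime on Ubuntu, assuming the NVIDIA driver is already installed and the NVIDIA Container Toolkit apt repository has been added (other distributions differ; see NVIDIA's install guide):

# Assumes the nvidia-container-toolkit apt repository is already configured
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Sanity check: Docker should be able to see the GPU
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi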

Run the Docker container with the following command:

docker run -d --rm --gpus all --shm-size=256m -p 8008:8008 -v refact-perm-storage:/perm_storage smallcloud/refact_self_hosting:latest

refact-perm-storage is a volume that is mounted inside the container at /perm_storage. All the configuration files, downloaded weights, and logs are stored there.

To upgrade, stop the container with docker kill XXX (the refact-perm-storage volume will retain your data), run docker pull smallcloud/refact_self_hosting, and start it again.
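As a concrete sketch of that upgrade sequence (the container ID is illustrative; look it up with docker ps):

docker ps                          # find the running refact container's ID
docker kill <container-id>         # container is removed (--rm), but refact-perm-storage keeps your data
docker pull smallcloud/refact_self_hosting:latest
docker run -d --rm --gpus all --shm-size=256m -p 8008:8008 -v refact-perm-storage:/perm_storage smallcloud/refact_self_hosting:latest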

Now you can visit http://127.0.0.1:8008 to see the server Web GUI.

Docker commands super short refresher

Add yourself to the docker group to run docker without sudo (works for Linux):
sudo usermod -aG docker {your user}

List all containers:

docker ps -a

Start and stop existing containers (stop doesn't remove them):

docker start XXX
docker stop XXX

Show messages from a container:

docker logs -f XXX

Remove a container and all its data (except data inside a volume):

docker rm XXX

Check out or delete a docker volume:

docker volume inspect VVV
docker volume rm VVV

See CONTRIBUTING.md for installation without a docker container.

Setting Up Plugins

Download Refact for VS Code or JetBrains.

Go to the plugin settings and set a custom inference URL: http://127.0.0.1:8008

JetBrains Settings > Tools > Refact.ai > Advanced > Inference URL
VSCode Extensions > Refact.ai Assistant > Settings > Infurl
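Before pointing the plugins at the server, a quick way to check that the inference URL is reachable (assuming the container above is running on the same machine); any HTTP response confirms the server is up:

curl -I http://127.0.0.1:8008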

Supported models

Model                          | Completion | Chat | Fine-tuning
Refact/1.6B                    | +          |      | +
starcoder/1b/base              | +          |      | +
starcoder/3b/base              | +          |      | +
starcoder/7b/base              | +          |      | +
starcoder/15b/base             | +          |      |
starcoder/15b/plus             | +          |      |
starcoder2/3b/base             | +          |      | +
starcoder2/7b/base             | +          |      | +
starcoder2/15b/base            | +          |      | +
wizardcoder/15b                | +          |      |
codellama/7b                   | +          |      | +
starchat/15b/beta              |            | +    |
wizardlm/7b                    |            | +    |
wizardlm/13b                   |            | +    |
wizardlm/30b                   |            | +    |
llama2/7b                      |            | +    |
llama2/13b                     |            | +    |
deepseek-coder/1.3b/base       | +          |      | +
deepseek-coder/5.7b/mqa-base   | +          |      | +
magicoder/6.7b                 |            | +    |
mistral/7b/instruct-v0.1       |            | +    |
mixtral/8x7b/instruct-v0.1     |            | +    |
deepseek-coder/6.7b/instruct   |            | +    |
deepseek-coder/33b/instruct    |            | +    |
stable/3b/code                 | +          |      |

Usage

Refact is free to use for individuals and small teams under the BSD-3-Clause license. If you wish to use Refact for Enterprise, please contact us.

Custom installation

You can also install the refact repo without Docker:

pip install .

If you have a GPU with CUDA capability >= 8.0, you can also install it with flash-attention v2 support:

FLASH_ATTENTION_FORCE_BUILD=TRUE MAX_JOBS=4 INSTALL_OPTIONAL=TRUE pip install .
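To verify that the optional flash-attention build actually ended up in the environment, a simple check (flash_attn is the upstream package's import name):

python -c "import flash_attn; print(flash_attn.__version__)"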

FAQ

Q: Can I run a model on CPU?

A: It doesn't run on CPU yet, but it's certainly possible to implement this.

Community & Support

refact's People

Contributors

adam-weinberger, anwirs, fuzzyreason, jegernoutt, klink, marcmcintosh, miranamer, mitya52, olegklimov, oxyplay, reymondzzzz, valaises, worldemar


refact's Issues

[bounty] CPU inference support, Mac M1/M2 inference support

There are several projects aiming to make inference on CPU efficient.

The first part is research:

  • Which project works better,
  • And is compatible with the Refact license,
  • And doesn't bloat the Docker image too much,
  • And allows using scratchpads similar to how inference_hf.py does (it needs a callback that streams output and allows stopping),
  • And whether it includes Mac M1/M2 support, or whether it makes sense to address Mac separately.

Please finish the first part and get a "go-ahead" before starting the second part.

The second part is implementation:

  • A script similar to inference_hf.py,
  • Little code,
  • Few dependencies,
  • Demonstrate that it works with the Refact-1.6b model, as well as StarCoder (at least the smaller sizes),
  • Integration with UI and watchdog is a plus, but efficient inference is obviously the priority.

AMD GPU support

Hi, is there any plan to support AMD GPUs, or any alternatives?
AMD GPUs have excellent VRAM capacity and compute power at a reasonable price.

Self-hosted server can't be used and the WebUI's GPU status shows "loading model"

WEBUI log
-- 302580 -- 20230816 00:49:00 WEBUI 10.244.43.64:1940 - "POST /v1/contrast HTTP/1.1" 200
-- 302580 -- 20230816 00:49:00 WEBUI *** CANCEL *** cancelling comp-sCPvuzAjDUXL 488.0ms
-- 302580 -- 20230816 00:49:01 WEBUI comp-s7zqIh7A8aiY model resolve "CONTRASTcode" func "infill" -> "CONTRASTcode/3b/multi" from XXX
-- 302580 -- 20230816 00:49:01 WEBUI 10.244.43.64:27439 - "POST /v1/contrast HTTP/1.1" 200
-- 302580 -- 20230816 00:49:31 WEBUI TIMEOUT comp-s7zqIh7A8aiY

Crash on Docker startup

When I try to start the container using the command given in the README, it immediately crashes with the following message (I've created the volume perm-storage manually and it's the same result):

% docker run --rm -p 8008:8008 -v perm-storage:/perm_storage --gpus all smallcloud/refact_self_hosting 

==========
== CUDA ==
==========

CUDA Version 11.6.2

Container image Copyright (c) 2016-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/dist-packages/refact_self_hosting/watchdog.py", line 77, in <module>
    logdir.mkdir(exist_ok=True, parents=False)
  File "/usr/lib/python3.8/pathlib.py", line 1288, in mkdir
    self._accessor.mkdir(self, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/workdir/logs'

Document SERVER_API_TOKEN

The docker image doesn't work unless you set the SERVER_API_TOKEN environment variable. Please include this in the documentation.
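For reference, a sketch of how the token can be passed as a regular Docker environment variable, based on the run command from the README (the token value is just a placeholder):

docker run -d --rm --gpus all --shm-size=256m -p 8008:8008 \
  -e SERVER_API_TOKEN=your-token-here \
  -v refact-perm-storage:/perm_storage smallcloud/refact_self_hosting:latest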

OOM issues cause auto-completion and chat response to stop

Preconditions:
Run with a large context or little amount of memory

Steps:
Start to chat or trigger an auto-completion request in IDE
Wait for the system to respond

Expected result:
The system should respond correctly

Actual result:
The system may stop responding without indicating an out-of-memory issue in the IDE.
Auto-completion requests or chat responses may be incomplete.

(screenshot)

Consider integrating with atom

https://github.com/AppThreat/atom

We created the open-source atom project for the precise identification of usages and dataflows across large code bases. This approach is better at summarizing and identifying context than the primitive tokenization commonly found everywhere.

We have a dedicated community support channel should you require any help with using atom or code analysis in general.

Allow renaming of completed LoRA runs

By allowing users to rename their completed runs, they can quickly identify and distinguish between different runs.

  • Give names to LoRAs when training, e.g. "all files", "only python" - that's what I would like to see for my last two runs
  • Rename existing LoRAs
  • When exporting, the .zip needs to retain the same name

Support parallel access?

When I request code completion using the plugin, it is blocked if another request is generating code. Do I need to configure something on the server side to support parallel generation?

Plugin in PyCharm and local model in Windows.

Is it possible to connect the plugin via the oobabooga-webui or koboldcpp API to a locally running model (Refact-1.6b, starcoder, etc.)?
If possible, how? Or can local models only be used as described here?

[UI] Interface blocked when starting GPU filtering without waiting for file load

Steps:
Add a big file to the list without waiting for the addition to complete
Immediately scan sources and run the GPU filter
Expected result:
The interface should not allow GPU filtering to start until all files have finished loading
Actual result:
GPU filtering starts anyway
An error message "No train files have been provided for filtering" is displayed
The Upload files/repos buttons are blocked until the page is refreshed
(screenshots)

Notes:
Issue occurs regardless of file size, but it's easier to reproduce with a big file (~500 MB+)
Issue does not occur when waiting for addition to complete before running GPU filter

[UI] Issues in uploading large files and inconsistent UI behavior

Steps:
Add a big file to the list.
Wait for the upload progress to reach 100%.

When uploading a large file and the upload progress reaches 100%, the window may remain displayed for a few more minutes (3+ minutes for a 2 GB file). I would like to better understand what is happening and how much longer I need to wait:
(screenshot)

While uploading a large file the interface is not blocked. It's possible to hide the "Upload file" window from the previous screen, and then it's unclear how to figure out what is going on:
(screenshot)

If you then open the "Upload file" window again while the list displays a .tmp file, there is no progress bar or working status:
(screenshot)

If you wait a few minutes, the file will eventually be uploaded:
(screenshot)

Also, on all screenshots you can see the error "No train files..." left over from the previous run. The interface has not been updated.

[bug] minor typos and punctuation

I suggest marking the following typos/bugs as a "good first issue", so we could attract some newcomers with low-hanging fruit.

refactor_typo_20230731_220955

  • "Context-aware chat on entire codebase" checkbox must be selected
  • Period is missing at the end of "Download Refact for VS Code or JetBrains"

Failures on insufficient GPU memory

This is my local setup:

NVIDIA-SMI 535.103   Driver Version: 537.13   CUDA Version: 12.2
GPU 0: NVIDIA RTX A2000 Laptop GPU, 00000000:01:00.0, 49C, 11W / 40W, 3886MiB / 4096MiB, 2% utilization

The container seems to start and load the model:

PS C:\Users\z0034zpz> docker run -d --rm -p 8008:8008 --env SERVER_API_TOKEN=LocalTokenInDockerContainer -v perm-storage:/perm_storage --gpus all smallcloud/refact_self_hosting
a455ae67d1b7b829460361bec6d3530629f7d2c1577b5e5c495318d5f378223a
PS C:\Users\z0034zpz> docker logs -f a455

==========
== CUDA ==
==========

CUDA Version 11.8.0

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

20230906 12:56:00 adding job model-contrastcode-3b-multi-0.cfg
20230906 12:56:00 adding job enum_gpus.cfg
20230906 12:56:00 adding job filetune.cfg
20230906 12:56:00 adding job filetune_filter_only.cfg
20230906 12:56:00 adding job process_uploaded.cfg
20230906 12:56:00 adding job webgui.cfg
20230906 12:56:00 CVD=0 starting python -m self_hosting_machinery.inference.inference_worker --model CONTRASTcode/3b/multi --compile
 -> pid 30
20230906 12:56:00 CVD= starting python -m self_hosting_machinery.scripts.enum_gpus
 -> pid 31
20230906 12:56:00 CVD= starting python -m self_hosting_machinery.webgui.webgui
 -> pid 32
-- 32 -- 20230906 12:56:00 WEBUI Started server process [32]
-- 32 -- 20230906 12:56:00 WEBUI Waiting for application startup.
-- 32 -- 20230906 12:56:00 WEBUI Application startup complete.
-- 32 -- 20230906 12:56:00 WEBUI Uvicorn running on http://0.0.0.0:8008 (Press CTRL+C to quit)
-- 30 -- 20230906 12:56:06 MODEL STATUS loading model
-- 32 -- 20230906 12:56:23 WEBUI Invalid HTTP request received.
-- 30 -- 20230906 12:56:44 MODEL STATUS test batch
20230906 12:57:22 30 finished python -m self_hosting_machinery.inference.inference_worker --model CONTRASTcode/3b/multi @:gpu00, retcode 0
/finished compiling as recognized by watchdog
20230906 12:57:23 CVD=0 starting python -m self_hosting_machinery.inference.inference_worker --model CONTRASTcode/3b/multi
 -> pid 105
-- 105 -- 20230906 12:57:25 MODEL STATUS loading model
-- 105 -- 20230906 12:57:51 MODEL STATUS test batch
-- 32 -- 20230906 12:58:15 WEBUI Invalid HTTP request received.

I tried to run the VS Code extension with and without an API key:
(screenshot)
I tried to use the extension, but it stays in progress forever:
(screenshot)
The logs inside the container signal an issue, but do not specify what:

-- 32 -- 20230906 13:25:05 WEBUI 127.0.0.1:45898 - "POST /infengine-v1/completions-wait-batch HTTP/1.1" 200
-- 32 -- 20230906 13:25:08 WEBUI Invalid HTTP request received.

Only when I open the WebUI can I see the actual error:
(screenshot)
Required memory exceeds the GPU's memory.

Could you please improve the logs so that a more detailed message is visible, and provide a clear warning that there is not enough memory on the graphics card?

401 Unauthorized... any additional configs to be put in?

Hi, I'm trying to host this without Docker. Please suggest if I am missing something.

info: Uvicorn running on https://0.0.0.0:8008 (Press CTRL+C to quit)
INFO: 16.142.51.155:38015 - "GET / HTTP/1.1" 404 Not Found
INFO: 16.142.51.155:38015 - "GET /favicon.ico HTTP/1.1" 404 Not Found
INFO: 16.142.51.155:38015 - "GET / HTTP/1.1" 404 Not Found
INFO: 16.142.51.155:5217 - "GET /v1/login HTTP/1.1" 200 OK
INFO:root:running /v1/contrast function=infill
INFO: 16.142.51.155:4871 - "POST /v1/contrast HTTP/1.1" 401 Unauthorized
INFO:root:running /v1/contrast function=infill
INFO: 16.142.51.155:6519 - "POST /v1/contrast HTTP/1.1" 401 Unauthorized
INFO:root:running /v1/contrast function=infill

Finetune "Start now" button is available to press even if there are no training files

Steps:
Set a small file loss threshold so that all files are filtered out
Filter the files

Expected result:
"Start Now" button to be unavailable if there are no training files

Actual result:
Funetune "Start Now" button is available to press even though there are no training files.
Finetune starts and fails after several minutes.
In log window: RuntimeError: No train files have been provided

[UI] With all files filtered, error message is bad

  File "/home/user/code/refact/refact_data_pipeline/finetune/finetune_filter.py", line 353, in pre_filtering
    test_files = random.choices(train_files, k=fcfg["limit_test_files"])
  File "/usr/lib/python3.9/random.py", line 487, in choices
    return [population[floor(random() * n)] for i in _repeat(None, k)]
  File "/usr/lib/python3.9/random.py", line 487, in <listcomp>
    return [population[floor(random() * n)] for i in _repeat(None, k)]
IndexError: list index out of range

This should be something more friendly.

[Feature] LORA export / more LORA training capabilities

I think Refact's strongest selling points are:

  1. Its fast, clean web UI.
  2. Its ability to run 100% offline, locally.
  3. Its ability to perform training/fine-tuning.

For the third, I think it would be great to have some more configurability and options, for example being able to name or add notes to a LoRA and export it for backups, testing, other instances, etc.

It would also be great if it was possible to perform fine tuning / training on other models.

Keep up the great work folks!

[plugin settings in IntelliJ] global permanent rules are not consistent

I am not sure if it's a bug or a feature.

When "global defaults" are switched between levels 0-2, the "global permanent rules to override by default" remain untouched.

refactai_settings_20230731_215235

However, when a new project is opened, it uses the "global defaults" as the default value.
Personally, I would like the "global defaults" switch to affect the "global permanent rules to override by default" as well.

refactai_settings_20230731_215740

Running on Fedora 38 - Docs Update

For self-hosting we are told to visit https://refact.ai/docs/self-hosting/ and run docker run -d --rm -p 8008:8008 -v perm-storage:/perm_storage --gpus all smallcloud/refact_self_hosting after ensuring we have Docker with NVIDIA GPU support. Unfortunately these instructions do not work for me, although I was able to run the previous release of Refact before the recent significant changes. Here is what I was getting when following those instructions:

-- 26 -- WARNING:root:output was:
-- 26 -- - no output -
-- 26 -- WARNING:root:nvidia-smi does not work, that's especially bad for initial setup.
-- 26 -- WARNING:root:Traceback (most recent call last):
-- 26 --   File "/usr/local/lib/python3.8/dist-packages/self_hosting_machinery/scripts/enum_gpus.py", line 17, in query_nvidia_smi
-- 26 --     nvidia_smi_output = subprocess.check_output([
-- 26 --   File "/usr/lib/python3.8/subprocess.py", line 415, in check_output
-- 26 --     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
-- 26 --   File "/usr/lib/python3.8/subprocess.py", line 516, in run
-- 26 --     raise CalledProcessError(retcode, process.args,
-- 26 -- subprocess.CalledProcessError: Command '['nvidia-smi', '--query-gpu=pci.bus_id,name,memory.used,memory.total,temperature.gpu', '--format=csv']' returned non-zero exit status 4.
-- 26 -- 

I can confirm, however, that importing enum_gpus.py into Python (tested 3.8, which is in the Dockerfile, and 3.11) and calling query_nvidia_smi succeeds. Additionally, running the nvidia-smi command with the flags from enum_gpus succeeds:

(refact) [mrhillsman@workstation refact]$ python --version
Python 3.8.17
(refact) [mrhillsman@workstation refact]$ python
Python 3.8.17 (default, Jun  8 2023, 00:00:00) 
[GCC 13.1.1 20230511 (Red Hat 13.1.1-2)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import subprocess
>>> subprocess.check_output(["nvidia-smi", "--query-gpu=pci.bus_id,name,memory.used,memory.total,temperature.gpu", "--format=csv"])
b'pci.bus_id, name, memory.used [MiB], memory.total [MiB], temperature.gpu\n00000000:01:00.0, NVIDIA GeForce RTX 3080, 11 MiB, 10240 MiB, 29\n'
>>> import self_hosting_machinery.scripts.enum_gpus as gpuenum
>>> gpuenum.query_nvidia_smi()
{'gpus': [{'id': '00000000:01:00.0', 'name': 'NVIDIA GeForce RTX 3080', 'mem_used_mb': 11, 'mem_total_mb': 10240, 'temp_celsius': 29}]}
>>> exit()
(refact) [mrhillsman@workstation refact]$ nvidia-smi --query-gpu=pci.bus_id,name,memory.used,memory.total,temperature.gpu --format=csv
pci.bus_id, name, memory.used [MiB], memory.total [MiB], temperature.gpu
00000000:01:00.0, NVIDIA GeForce RTX 3080, 11 MiB, 10240 MiB, 29
(refact) [mrhillsman@workstation refact]$ nvidia-smi 
Sat Jul 22 15:46:59 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3080        Off | 00000000:01:00.0 Off |                  N/A |
|  0%   29C    P8              13W / 370W |     11MiB / 10240MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2727      G   /usr/bin/gnome-shell                          3MiB |
+---------------------------------------------------------------------------------------+
(refact) [mrhillsman@workstation refact]$ uname -a
Linux workstation 6.3.12-200.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Jul  6 04:05:18 UTC 2023 x86_64 GNU/Linux
(refact) [mrhillsman@workstation refact]$ cat /etc/os-release 
NAME="Fedora Linux"
VERSION="38 (Workstation Edition)"
ID=fedora
VERSION_ID=38
VERSION_CODENAME=""
PLATFORM_ID="platform:f38"
PRETTY_NAME="Fedora Linux 38 (Workstation Edition)"
ANSI_COLOR="0;38;2;60;110;180"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:38"
DEFAULT_HOSTNAME="fedora"
HOME_URL="https://fedoraproject.org/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora/f38/system-administrators-guide/"
SUPPORT_URL="https://ask.fedoraproject.org/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=38
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=38
SUPPORT_END=2024-05-14
VARIANT="Workstation Edition"
VARIANT_ID=workstation
[mrhillsman@workstation refact-ai]$ sestatus
SELinux status:                 enabled
SELinuxfs mount:                /sys/fs/selinux
SELinux root directory:         /etc/selinux
Loaded policy name:             targeted
Current mode:                   enforcing
Mode from config file:          enforcing
Policy MLS status:              enabled
Policy deny_unknown status:     allowed
Memory protection checking:     actual (secure)
Max kernel policy version:      33

I would have created a PR for the documentation change, but I do not see a repo for the site documentation. Here is what I was able to run successfully, which I recommend be added to the documentation, either under Fedora 38 specifically or RPM-based OSs in general:

podman run -d -it --gpus 0 --security-opt=label=disable -p 8008:8008 -v perm_storage:/perm_storage smallcloud/refact_self_hosting

Problem with --env as last part of command on powershell

Hi,

docker run -p 8008:8008 --gpus 0 --name refact_self_hosting smallcloud/refact_self_hosting --env SERVER_API_TOKEN=$token

doesn't work on Windows 11 PowerShell, but

docker run --env SERVER_API_TOKEN=$token -p 8008:8008 --gpus 0 --name refact_self_hosting smallcloud/refact_self_hosting

works just fine. Here is a reference explaining why this happens: https://stackoverflow.com/a/64159568 (anything after the image name is passed to the container's command instead of being parsed as docker run options).
It may be good to update the docs to include that.

Model memory usage / quantization

According to this Refact blog post:

Check out the docs on self-hosting to get your AI code assistant up and running.
To run StarCoder using 4-bit quantization, you’ll need a 12GB GPU, and for 8-bit you’ll need 24GB.
It’s currently available for VS Code, and JetBrains IDEs.

I am currently using a 12GB GPU (RTX 4070), so that sounds great.

However, the interface does not offer any options to select quantization:

(screenshot)

If I attempt to select codellama/7b or starcoder/15b/base, it claims that I don't have enough memory... which is strange, considering I've been running (quantized) 13B parameter llama2 models on my GPU just fine using other software.

(screenshot)


The memory usage of even the smallest models is rather weird.

According to this Refact blog post:

With the smaller size, running the model is much faster and affordable than ever: the model can be served on most of all modern GPUs requiring just 3Gb[sic, 0] RAM and works great for real-time code completion tasks.

I've noticed that the Refact/1.6B model starts at about 5.3GB of VRAM usage, but jumps up to the full 12GB of VRAM as soon as I start doing any completions, which seems confusingly inefficient for a 1.6B parameter model, and a stark contrast to the stated goal of running on GPUs with only 3GB of VRAM. When I kill the container, my VRAM usage drops back to around zero, so it's not some other program using all this VRAM.

[sic, 0]: 3Gb (Gigabits) == ~0.375GB (Gigabytes), I'm assuming this should be 3GB.

(screenshot)


I'm just running the container under Docker on Windows using WSL 2, and everything works fine; it's just the memory usage that is confusing and concerning compared to other LLM software I have used (also under WSL 2).

I'm not sure if there are plans to offer quantized model options through the GUI, if there are ways of selecting these quantized models without the GUI, or what other options exist.

No progress bar in web GUI when downloading layers

When selecting a new model, the web GUI shows the message "loading model" but there's no progress bar. Only in the container's logs can I see that the layers are being downloaded.

Downloading layers.021.ln_m.weight: 100%|██████████| 5.51k/5.51k [00:00<00:00, 2.49MB/s]
Downloading layers.021.ln_m.bias: 100%|██████████| 5.51k/5.51k [00:00<00:00, 2.08MB/s]
-- 33 -- 20230905 10:34:50 WEBUI 172.17.0.1:37680 - "GET /tab-host-have-gpus HTTP/1.1" 200
-- 31 -- 
Downloading layers.021.pw.W1:   0%|          | 0.00/52.4M [00:00<?, ?B/s]
-- 33 -- 20230905 10:34:52 WEBUI 172.17.0.1:37680 - "GET /tab-host-have-gpus HTTP/1.1" 200
Downloading layers.021.pw.W1:  20%|█▉        | 10.5M/52.4M [00:01<00:07, 5.34MB/s]
-- 33 -- 20230905 10:34:54 WEBUI 172.17.0.1:37680 - "GET /tab-host-have-gpus HTTP/1.1" 200
Downloading layers.021.pw.W1:  40%|███▉      | 21.0M/52.4M [00:03<00:05, 6.05MB/s]
Downloading layers.021.pw.W1:  60%|█████▉    | 31.5M/52.4M [00:05<00:03, 6.35MB/s]
-- 33 -- 20230905 10:34:56 WEBUI 172.17.0.1:37680 - "GET /tab-host-have-gpus HTTP/1.1" 200
-- 33 -- 20230905 10:34:58 WEBUI 172.17.0.1:37680 - "GET /tab-host-have-gpus HTTP/1.1" 200
Downloading layers.021.pw.W1:  80%|███████▉  | 41.9M/52.4M [00:06<00:01, 6.20MB/s]
Downloading layers.021.pw.W1: 100%|██████████| 52.4M/52.4M [00:08<00:00, 6.32MB/s]
-- 33 -- 20230905 10:35:00 WEBUI 172.17.0.1:37680 - "GET /tab-host-have-gpus HTTP/1.1" 200

Etc.

Also, the model in the command is different from the selected one, which creates confusion as to which model is actually being loaded.

(screenshot)
