codedotal / gpt-code-clippy
Full description can be found here: https://discuss.huggingface.co/t/pretrain-gpt-neo-for-open-source-github-copilot-model/7678?u=ncoop57
License: Apache License 2.0
Hi, I increased the number of layers for the file with the error, and found that the error is still reported. I want to ask two questions.
Thanks!
Originally posted by @BitcoinNLPer in #74 (comment)
When I run this train script, I encounter some errors. The error log is as follows:
Do you know how to solve it?
Furthermore, there are too many files in the code_clippy_data directory. Is there a script to download this dataset conveniently?
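For what it's worth, I don't believe there is an official bulk-download script in the repo. A minimal sketch of one, assuming the host still serves a plain HTML directory listing (the exact split URL and the href-matching pattern are assumptions):

# Hypothetical bulk downloader for the code_clippy_data shards.
# Assumes the-eye.eu serves a plain HTML directory listing with
# href links to the .jsonl.zst files; adjust `base` for the split you want.
import os
import re
import urllib.request

base = "https://the-eye.eu/public/AI/training_data/code_clippy_data/code_clippy_dedup_data/train/"
listing = urllib.request.urlopen(base).read().decode("utf-8", errors="ignore")
for name in sorted(set(re.findall(r'href="([^"]+\.jsonl\.zst)"', listing))):
    target = os.path.basename(name)
    if not os.path.exists(target):  # resume-friendly: skip files already fetched
        print("fetching", target)
        urllib.request.urlretrieve(base + name, target)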
Hi y'all, I'd like to make sure we do plenty of brainstorming on where things can go wrong in terms of ethical concerns. I don't want our field to have the same issues that have happened in other AI fields, such as biases and a lack of discussion of limitations. So please use this issue to discuss anything that could go wrong! (We also have an internal Discord channel where we discuss this in a less formal setting, which I will periodically synthesize here.)
Here are a few things that have already been discussed:
Was trying this Spaces example https://huggingface.co/spaces/flax-community/code-clippy-problem-solver and it seems to get stuck on the prompt "A function that prints prime numbers from 1 to 100".
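For reference, a correct completion for that prompt would be something like:

# A function that prints prime numbers from 1 to 100
def print_primes():
    for n in range(2, 101):
        if all(n % d != 0 for d in range(2, int(n ** 0.5) + 1)):
            print(n)

print_primes()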
I would like to fine-tune the model.
Can anyone please advise @harish-garg @neubig @mrm8488 @ncoop57?
Hi,
You guys are doing a great job with it.
I have tried your flax-community/gpt-neo-1.3B-apps-all model,
and the generated code is kinda hit or miss.
This is generated using flax-community/gpt-neo-1.3B-apps-all, and this is generated using EleutherAI/gpt-neo-1.3B.
As far as I know, EleutherAI/gpt-neo-1.3B is trained on more general text, which is not necessarily code.
Why, then, is flax-community/gpt-neo-1.3B-apps-all performing much worse than EleutherAI/gpt-neo-1.3B?
The dataset cannot be downloaded: https://the-eye.eu/public/AI/training_data/code_clippy_data/
Awesome work! I'll try to start developing a Vim plugin version of the project, if my knowledge allows it.
Hi there, your repository has been selected to be included in an effort
to train an open source version of GitHub and OpenAI's Copilot tool.
You can find more information on our project here.
If you are the owner/admin of this repository and would like to opt-out of this,
please reply to this issue before July 9th with "yes" and we will remove your
repository from our list.
Hi,
Thanks for the great work!
Firstly, I wanted to ask about the performance of the code-clippy models. It seems that the 125M-parameter models are quite weak and perform quite poorly on the HumanEval dataset (even lower than GPT-Neo-1.3B?). Any idea why this is happening?
Also, is there any update on the evaluation of the GPT-Neo-1.3B code-clippy model?
Finally, I would love to contribute to upcoming iterations of code-clippy. Should I join the Discord channel?
Are there any plans for a JetBrains extension?
Hi,
when I try to train a model from scratch, I am facing the following error.
The data_dir contains only a small amount of data, so I think a CPU should be sufficient in my case. What exactly could cause this?
@ncoop57, can you please check and help?
./run_clm_streaming_flax.py \
--output_dir $HOME/fhgw-gpt-neo-125M-code-clippy \
--dataset_name /home/fedora/explore/clippy/gpt-code-clippy/data_processing/code_clippy.py \
--data_dir /mnt/vol/FHGW/scm_fhgw/workspace_FHGW_21.000/FHGW-NW-CM \
--text_column_name="text" \
--do_train --do_eval \
--block_size="2048" \
--per_device_train_batch_size="8" \
--per_device_eval_batch_size="16" \
--preprocessing_num_workers="8" \
--learning_rate="1e-4" \
--max_steps 100000 \
--warmup_steps 2500 \
--decay_steps 25000 \
--adam_beta1="0.9" \
--adam_beta2="0.95" \
--weight_decay="0.1" \
--overwrite_output_dir \
--logging_steps="100" \
--eval_steps="500" \
--push_to_hub="False" \
--report_to="all" \
--dtype="bfloat16" \
--skip_memory_metrics="True" \
--save_steps="500" \
--save_total_limit 10 \
--gradient_accumulation_steps 16 \
--report_to="wandb" \
--run_name="125m_1e-4lr_1024bs" \
--max_eval_samples 2000 \
--save_optimizer true
2022-01-06 08:27:11.271076: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
INFO:absl:Unable to initialize backend 'tpu_driver': NOT_FOUND: Unable to find driver in registry given worker:
INFO:absl:Unable to initialize backend 'gpu': NOT_FOUND: Could not find registered platform with name: "cuda". Available platform names are: Interpreter Host
INFO:absl:Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not available.
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
INFO:__main__:Training/evaluation parameters TrainingArguments(
_n_gpu=0,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.95,
adam_epsilon=1e-08,
bf16=False,
bf16_full_eval=False,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_steps=500,
evaluation_strategy=IntervalStrategy.NO,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
gradient_accumulation_steps=16,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=0.0001,
length_column_name=length,
load_best_model_at_end=False,
local_rank=-1,
log_level=-1,
log_level_replica=-1,
log_on_each_node=True,
logging_dir=/home/fedora/fhgw-gpt-neo-125M-code-clippy/runs/Jan06_08-27-13_fedora.novalocal,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=100,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=100000,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=3.0,
output_dir=/home/fedora/fhgw-gpt-neo-125M-code-clippy,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=16,
per_device_train_batch_size=8,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
remove_unused_columns=True,
report_to=['wandb'],
resume_from_checkpoint=None,
run_name=125m_1e-4lr_1024bs,
save_on_each_node=False,
save_steps=500,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=10,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
tf32=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_legacy_prediction_loop=False,
warmup_ratio=0.0,
warmup_steps=2500,
weight_decay=0.1,
xpu_backend=None,
)
WARNING:datasets.builder:Using custom data configuration default-01c596fb6133304a
Traceback (most recent call last):
File "/usr/lib64/python3.7/pathlib.py", line 713, in __str__
return self._str
AttributeError: _str
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "./run_clm_streaming_flax.py", line 774, in <module>
main()
File "./run_clm_streaming_flax.py", line 392, in main
split="train"
File "/usr/local/lib/python3.7/site-packages/datasets/load.py", line 1686, in load_dataset
use_auth_token=use_auth_token,
File "/usr/local/lib/python3.7/site-packages/datasets/builder.py", line 897, in as_streaming_dataset
splits_generators = {sg.name: sg for sg in self._split_generators(dl_manager)}
File "/home/fedora/.cache/huggingface/modules/datasets_modules/datasets/code_clippy/86b09b4a623c1c39753a8ad165e05757d9a97daf132ac71d3b6eb791e7da16dd/code_clippy.py", line 111, in _split_generators
gen_kwargs={"filepaths": sorted([str(fp) for fp in Path(f"{data_dir}/train").glob("*.jsonl.zst")])}
File "/home/fedora/.cache/huggingface/modules/datasets_modules/datasets/code_clippy/86b09b4a623c1c39753a8ad165e05757d9a97daf132ac71d3b6eb791e7da16dd/code_clippy.py", line 111, in <listcomp>
gen_kwargs={"filepaths": sorted([str(fp) for fp in Path(f"{data_dir}/train").glob("*.jsonl.zst")])}
File "/usr/local/lib/python3.7/site-packages/datasets/utils/streaming_download_manager.py", line 384, in xpathglob
yield from Path(main_hop).glob(pattern)
File "/usr/local/lib/python3.7/site-packages/datasets/utils/streaming_download_manager.py", line 384, in xpathglob
yield from Path(main_hop).glob(pattern)
File "/usr/local/lib/python3.7/site-packages/datasets/utils/streaming_download_manager.py", line 384, in xpathglob
yield from Path(main_hop).glob(pattern)
[Previous line repeated 984 more times]
File "/usr/local/lib/python3.7/site-packages/datasets/utils/streaming_download_manager.py", line 381, in xpathglob
posix_path = _as_posix(path)
File "/usr/local/lib/python3.7/site-packages/datasets/utils/streaming_download_manager.py", line 172, in _as_posix
path_as_posix = path.as_posix()
File "/usr/lib64/python3.7/pathlib.py", line 726, in as_posix
return str(self).replace(f.sep, '/')
File "/usr/lib64/python3.7/pathlib.py", line 716, in __str__
self._parts) or '.'
File "/usr/lib64/python3.7/pathlib.py", line 695, in _format_parsed_parts
return drv + root + cls._flavour.join(parts[1:])
RecursionError: maximum recursion depth exceeded while calling a Python object
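The recursion happens inside the patched Path.glob (xpathglob) that datasets installs in streaming mode, as the traceback shows. One possible workaround, sketched below and not an official fix, is to enumerate the local shards yourself with the plain standard library and hand them to load_dataset as explicit data_files (the directory path is the one from the command above):

# Workaround sketch: avoid the patched Path.glob by listing shards manually.
# Requires the zstandard package so the json loader can read .jsonl.zst files.
import os
from datasets import load_dataset

data_dir = "/mnt/vol/FHGW/scm_fhgw/workspace_FHGW_21.000/FHGW-NW-CM"
train_dir = os.path.join(data_dir, "train")
train_files = sorted(
    os.path.join(train_dir, f)
    for f in os.listdir(train_dir)
    if f.endswith(".jsonl.zst")
)
dataset = load_dataset("json", data_files={"train": train_files}, streaming=True)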
In order to do distributed training across multiple TPUs, and for hosting the model once we lose access to the TPUs, we need to figure out how to set up a GCS bucket to store the model in. Any help on this task would be greatly appreciated!
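A minimal sketch of the bucket setup, assuming the google-cloud-storage client library and an existing GCP project with billing enabled (the project id, bucket name, and file paths are illustrative):

# Create a bucket and upload a saved checkpoint to it.
from google.cloud import storage

client = storage.Client(project="my-gcp-project")          # hypothetical project id
bucket = client.create_bucket("code-clippy-checkpoints",   # hypothetical bucket name
                              location="us-central1")
blob = bucket.blob("gpt-neo-125M/flax_model.msgpack")
blob.upload_from_filename("output/flax_model.msgpack")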
Hi,
The filenames in the code-clippy dedup dataset are wrong. In repos with multiple files, even though the various files are present, they all share a single random filename, which doesn't even have the correct file extension. While this might not be an issue for the gpt-code-clippy training effort, since only the content of the files may matter, it would be really great if this could be fixed, or at least mentioned clearly.
Sample code to reproduce the issue (prints the filenames in the first 100 rows of the jsonl):
import json
import subprocess
import zstandard

def loadJsonL(fname):
    # Read a .jsonl file into a list of dicts, one JSON object per line.
    data = []
    with open(fname) as fp:
        for line in fp.readlines():
            data.append(json.loads(line))
    return data

def processZSTLink(url):
    # Download one .jsonl.zst shard and decompress it next to the script.
    zstfile = url.split('/')[-1]
    print(url)
    subprocess.run(f"wget {url}", shell=True, stdout=subprocess.DEVNULL)
    jsonlfile = zstfile[:-4]
    with open(zstfile, 'rb') as compressed:
        decomp = zstandard.ZstdDecompressor()
        with open(jsonlfile, 'wb') as destination:
            decomp.copy_stream(compressed, destination)
    # Print repo_name/file_name for the first 100 rows; every file in a
    # multi-file repo turns out to share the same random file_name.
    data = loadJsonL(jsonlfile)
    for row in data[:100]:
        file_name = row['meta']['file_name']
        repo_name = row['meta']['repo_name']
        print(f"{repo_name}/{file_name}")

processZSTLink('https://the-eye.eu/public/AI/training_data/code_clippy_data//code_clippy_dedup_data/test/data_2814_time1626332048_default.jsonl.zst')
Is there an easier way to get started?
I tried to set up a machine and install all the requirements. I'll try to go further tomorrow, but maybe I am doing something wrong.
The error I am at currently is:
"""
2021-11-05 22:23:59.523515: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
Traceback (most recent call last):
File "run_clm_apps.py", line 800, in
main()
File "run_clm_apps.py", line 342, in main
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
File "/home/pankaj/.local/lib/python3.8/site-packages/transformers/hf_argparser.py", line 191, in parse_args_into_dataclasses
obj = dtype(**inputs)
File "", line 14, in init
File "run_clm_apps.py", line 174, in post_init
raise ValueError("Need either a dataset name or a training/validation file.")
ValueError: Need either a dataset name or a training/validation file.
"""
Also, getting the requirements to work was quite difficult on my machine. Wondering if I am doing something wrong.
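(For anyone hitting the same ValueError: it fires when neither a dataset name nor data files reach the script, so the fix is to pass --dataset_name, or whichever training/validation file flags the script exposes, along the lines of the run_clm_streaming_flax.py invocation earlier on this page.)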
Hi everyone. After I launch the extension in debug mode, when I try writing I get this error:
[ncoop57.code-clippy]: editor/inlineCompletions/actions is a proposed menu identifier. It requires 'package.json#enabledApiProposals: ["inlineCompletionsAdditions"]' and is only available when running out of dev or with the following command line switch: --enable-proposed-api ncoop57.code-clippy
Running --enable-proposed-api ncoop57.code-clippy on its own gives me a "Missing expression after unary operator '--'" error,
and code --enable-proposed-api ncoop57.code-clippy gets me out of Debug mode.
Does anyone have an idea how I can fix this?
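(Note for anyone else stuck here: --enable-proposed-api is a command-line switch for the code executable itself, not a standalone command, which is why running it bare in a shell trips the unary-operator error. The alternative route the message suggests is adding "enabledApiProposals": ["inlineCompletionsAdditions"] to the extension's package.json before launching the debug host.)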
I have only found Java. I wonder if someone can share the details without me having to process the whole dataset :)
Thank you for open sourcing it! Awesome stuff!
Hi,
Are there any recommended steps or resources available for fine-tuning a large language model such as GPT-J in an unsupervised manner using GPT-Code-Clippy, with the goal of teaching the model about a new domain?
Thanks
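I can't speak for the maintainers, but a minimal unsupervised (causal-LM) fine-tuning loop with the standard transformers Trainer looks roughly like the sketch below. The model id, data file, and hyperparameters are placeholders, and full-size GPT-J needs far more memory than this naive setup (people usually layer DeepSpeed or 8-bit tricks on top):

# Rough causal-LM fine-tuning sketch on a plain-text domain corpus.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_id = "EleutherAI/gpt-j-6B"            # start smaller to validate the pipeline
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token   # GPT-style tokenizers ship no pad token
model = AutoModelForCausalLM.from_pretrained(model_id)

raw = load_dataset("text", data_files={"train": "domain_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

train = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gptj-domain",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=16,
                           num_train_epochs=1),
    train_dataset=train,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()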
The following file doesn't compile due to an incomplete merge.
https://github.com/CodedotAl/gpt-code-clippy/blob/camera-ready/training/run_clm_streaming_flax.py
Hello, I'm attempting to run the starter code for flax-community/gpt-neo-125M-code-clippy.
For some reason, I cannot get anything other than blank characters and escape characters.
Would someone be able to assist?
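In case it helps debugging, a minimal generation sketch that should produce visible tokens, assuming the standard transformers API (the sampling settings are illustrative, and from_flax=True may be needed if only Flax weights are published for the model):

# Minimal sanity check for flax-community/gpt-neo-125M-code-clippy.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "flax-community/gpt-neo-125M-code-clippy"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)  # add from_flax=True if needed

prompt = "def greet(name):"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=48, do_sample=True,
                         temperature=0.8, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))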
Hi!
Many thanks for working towards an open-source version of GitHub Copilot.
I'm particularly interested in the VS Code extension -- could you please also publish it to OpenVSX, the open-source, vendor-neutral IDE extension repository?
This would allow users of non-Microsoft products to install this extension as well (for example users of VSCodium, Gitpod, Theia, etc.)
The process should be pretty easy, especially since you already have a VS Code extension (OpenVSX uses the same publishing tools). Please:
Generate a token
Run npx ovsx publish
(That's it.)
Does this only work for generating Python code?
Hello, I found this amazing repository today. I tried to run the example from the Hugging Face page on Google Colab, but it didn't output anything except "Setting pad_token_id to eos_token_id:50256 for open-end generation.". I want to know if there is anything I did wrong. Thanks! (Sorry for my poor English.)
This is the code I ran on Colab (I changed the variable device to "cpu"):
How does this implementation compare with the GitHub Copilot and Codex models?
I'm trying to fine-tune GPT-J to create a better version of code-clippy.
I have already created the fine-tuning script. However, it would require a beefy TPU (a v3-256 for about 6 weeks, I believe), and thus I cannot train it.
It would be great if this repository could be helpful in the long run of creating an open-source version of GitHub Copilot.
Hi! I was wondering whether a GPT Code Clippy model could generate embeddings instead of text output.
The purpose is to embed code in a semantic space, so that it can be used as a feature for another neural network. I have done the same with BERT (more as a baseline, since that model is not trained on code) and with the OpenAI Codex model (via a paid API), and would therefore love to use one of your models as well.
Thank you!
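Not an official recommendation of the project, but one common recipe is to take the decoder's hidden states and pool them. A sketch, assuming the standard transformers API (mean pooling over the last hidden layer is just one choice among several):

# Turn code into a fixed-size vector by mean-pooling the last hidden layer.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "flax-community/gpt-neo-125M-code-clippy"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)             # add from_flax=True if needed

code = "def add(a, b):\n    return a + b"
inputs = tokenizer(code, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state          # (1, seq_len, hidden_size)
embedding = hidden.mean(dim=1).squeeze(0)               # (hidden_size,)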
Need to update the HumanEval results due to a bug that was originally in our evaluation code and was fixed in this PR: #62
Not quite sure where to leave this, but I wanted to try out this project without training the model myself. I found this model (it's the only working link among the recommended models) and copied the "How to use" example, but it only produces a result with line breaks in it. Is this expected?