
codedotal / gpt-code-clippy


Full description can be found here: https://discuss.huggingface.co/t/pretrain-gpt-neo-for-open-source-github-copilot-model/7678?u=ncoop57

License: Apache License 2.0

Languages: Python 94.68%, Shell 0.36%, Jupyter Notebook 4.95%, Dockerfile 0.01%

gpt-code-clippy's People

Contributors

arampacha, arunraja-hub, bentrevett, hiroakimikami, mrinal18, ncoop57, ndrpnt, neubig, reshinthadithyan, shpotes, taisazero


gpt-code-clippy's Issues

Hi, I increased the number of layers for the file with the error, and found that the error is still reported. I want to ask two questions.

[screenshot: the error message]

  • Can I just clone this project, install the corresponding libraries, and then run run_clm_streaming_flax.py? Do I need to download the 200+ GB training corpus to my local server, or is it downloaded automatically? Could you provide a more detailed pre-training guide?
  • The server I use has a GPU, so why does run_clm_streaming_flax.py fall back to the CPU by default when I run it directly?
    [screenshot: log output]

Thanks!

Originally posted by @BitcoinNLPer in #74 (comment)

Things That Could Go Wrong

Hi y'all, I'd like to make sure we do plenty of brainstorming on where things could go wrong in terms of ethical concerns. I don't want our field to repeat the issues that have come up in other areas of AI, such as bias and a lack of discussion of limitations. So please use this issue to discuss anything that could go wrong (we also have an internal Discord channel where we discuss this in a less formal setting, which I will periodically synthesize here).
Here are a few things that have already been discussed:

  1. Vulnerabilities being inserted into completions
  2. Licensing issues
  3. Automating developers out of a job

**Training script**

  • add bf16 support
  • check if training with bf16 weights works fine
  • add resuming from ckpt
  • add wandb tracking
  • complete adafactor option
  • figure out how to best utilize profiler for training loop optimization
  • add gradient accumulation (see the sketch after this list)
  • support iterable datasets and max_steps argument
  • prefetch generator for dataloader
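
A minimal sketch of how the gradient-accumulation item could be approached with optax.MultiSteps (the function names here are hypothetical, not the project's actual training code):

# Hypothetical sketch: gradient accumulation via optax.MultiSteps, which
# accumulates gradients for k micro-batches and only applies the wrapped
# optimizer's update on every k-th call.
import jax
import optax

def make_optimizer(learning_rate=1e-4, accumulation_steps=16):
    inner = optax.adamw(learning_rate, b1=0.9, b2=0.95, weight_decay=0.1)
    return optax.MultiSteps(inner, every_k_schedule=accumulation_steps)

def train_step(params, opt_state, batch, loss_fn, optimizer):
    # loss_fn(params, batch) -> scalar loss; grads share the pytree structure of params.
    grads = jax.grad(loss_fn)(params, batch)
    updates, opt_state = optimizer.update(grads, opt_state, params)
    params = optax.apply_updates(params, updates)
    return params, opt_state

# usage: optimizer = make_optimizer(); opt_state = optimizer.init(params)

This keeps the per-device batch small while emulating a larger effective batch, similar to the --gradient_accumulation_steps flag used in the training commands elsewhere in this repo.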

EleutherAI/gpt-neo-1.3B Model works better than this.

Hi,
You guys are doing a great job with it.

I have tried your flax-community/gpt-neo-1.3B-apps-all model,
and the generated code is kinda hit or miss.

This is generated using
flax-community/gpt-neo-1.3B-apps-all
[screenshot: completion generated by flax-community/gpt-neo-1.3B-apps-all]

and this is generated using
EleutherAI/gpt-neo-1.3B
[screenshot: completion generated by EleutherAI/gpt-neo-1.3B]

As far as I know, EleutherAI/gpt-neo-1.3B is trained on more general text, not necessarily code.

So why is flax-community/gpt-neo-1.3B-apps-all performing much worse than EleutherAI/gpt-neo-1.3B?

Vim Plugin

Awesome work,
I'd like to try developing a Vim plugin version of the project (if my knowledge allows it).
BTW, awesome job.

Participation in an Open Source Language Modeling Dataset

Hi there, your repository has been selected to be included in an effort
to train an open source version of GitHub and OpenAI's Copilot tool.
You can find more information on our project here.

If you are the owner/admin of this repository and would like to opt-out of this,
please reply to this issue before July 9th with "yes" and we will remove your
repository from our list.

Low Pass@k

Hi,
Thanks for the great work!
Firstly, I wanted to ask about the performance of the code-clippy models. It seems that the 125M-parameter models are quite weak and perform quite poorly on the HumanEval dataset (even lower than GPT-Neo-1.3B?). Any idea why this is happening?

Also, is there any update on the evaluation of the GPT-Neo-1.3B code-clippy model?

Finally, I would love to contribute to upcoming iterations of code-clippy. Should I join the discord channel?
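
For reference, pass@k numbers like these are usually computed with the unbiased estimator from the Codex paper, where n is the number of generated samples per problem and c is the number that pass the tests; a small sketch:

# Unbiased pass@k estimator (Chen et al., "Evaluating Large Language Models
# Trained on Code"): pass@k = 1 - C(n-c, k) / C(n, k), in a numerically
# stable form.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """n: samples per problem, c: samples that pass, k: evaluation budget."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples per problem, 13 passing -> pass@1 is about 0.065.
print(pass_at_k(200, 13, 1))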

**Code Model**

  • What model will be used?
  • How will the model be trained?
  • What existing training script can be repurposed?
  • Modified/newly created training script that can feed into the rest of the pipeline

**Code Tokenization**

  • What sort of tokenization will be done?
  • Scripts/tutorials that can do the tokenization?
  • Modified/newly created tokenization script to feed into the rest of the pipeline (a rough sketch follows below)
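
One possible starting point (a hypothetical sketch, not the project's actual pipeline) is training a GPT-2-style byte-level BPE tokenizer on raw code files with the Hugging Face tokenizers library; the corpus path and vocabulary size below are assumptions:

# Hypothetical sketch: train a byte-level BPE tokenizer (GPT-2 style) on code files.
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

code_files = [str(p) for p in Path("data/code").rglob("*.py")]  # example corpus location

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=code_files,
    vocab_size=50257,                  # matching GPT-Neo's vocabulary size
    special_tokens=["<|endoftext|>"],  # GPT-2/GPT-Neo end-of-text token
)
tokenizer.save_model("tokenizer_out")  # writes vocab.json and merges.txt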

Unable to train with custom data

Hi,
when I try to train a model from scratch I am facing the following error.
The data_dir contains only a small amount of data, so I think the CPU should be sufficient in my case. What exactly could cause this?
@ncoop57 can you please check and help?

./run_clm_streaming_flax.py \
    --output_dir $HOME/fhgw-gpt-neo-125M-code-clippy \
    --dataset_name /home/fedora/explore/clippy/gpt-code-clippy/data_processing/code_clippy.py \
    --data_dir /mnt/vol/FHGW/scm_fhgw/workspace_FHGW_21.000/FHGW-NW-CM \
    --text_column_name="text" \
    --do_train --do_eval \
    --block_size="2048" \
    --per_device_train_batch_size="8" \
    --per_device_eval_batch_size="16" \
    --preprocessing_num_workers="8" \
    --learning_rate="1e-4" \
    --max_steps 100000 \
    --warmup_steps 2500 \
    --decay_steps 25000 \
    --adam_beta1="0.9" \
    --adam_beta2="0.95" \
    --weight_decay="0.1" \
    --overwrite_output_dir \
    --logging_steps="100" \
    --eval_steps="500" \
    --push_to_hub="False" \
    --report_to="all" \
    --dtype="bfloat16" \
    --skip_memory_metrics="True" \
    --save_steps="500" \
    --save_total_limit 10 \
    --gradient_accumulation_steps 16 \
    --report_to="wandb" \
    --run_name="125m_1e-4lr_1024bs" \
    --max_eval_samples 2000 \
    --save_optimizer true

2022-01-06 08:27:11.271076: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
INFO:absl:Unable to initialize backend 'tpu_driver': NOT_FOUND: Unable to find driver in registry given worker: 
INFO:absl:Unable to initialize backend 'gpu': NOT_FOUND: Could not find registered platform with name: "cuda". Available platform names are: Interpreter Host
INFO:absl:Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not available.
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
INFO:__main__:Training/evaluation parameters TrainingArguments(
_n_gpu=0,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.95,
adam_epsilon=1e-08,
bf16=False,
bf16_full_eval=False,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_steps=500,
evaluation_strategy=IntervalStrategy.NO,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
gradient_accumulation_steps=16,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=0.0001,
length_column_name=length,
load_best_model_at_end=False,
local_rank=-1,
log_level=-1,
log_level_replica=-1,
log_on_each_node=True,
logging_dir=/home/fedora/fhgw-gpt-neo-125M-code-clippy/runs/Jan06_08-27-13_fedora.novalocal,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=100,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=100000,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=3.0,
output_dir=/home/fedora/fhgw-gpt-neo-125M-code-clippy,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=16,
per_device_train_batch_size=8,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
remove_unused_columns=True,
report_to=['wandb'],
resume_from_checkpoint=None,
run_name=125m_1e-4lr_1024bs,
save_on_each_node=False,
save_steps=500,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=10,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
tf32=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_legacy_prediction_loop=False,
warmup_ratio=0.0,
warmup_steps=2500,
weight_decay=0.1,
xpu_backend=None,
)
WARNING:datasets.builder:Using custom data configuration default-01c596fb6133304a
Traceback (most recent call last):
  File "/usr/lib64/python3.7/pathlib.py", line 713, in __str__
    return self._str
AttributeError: _str

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./run_clm_streaming_flax.py", line 774, in <module>
    main()
  File "./run_clm_streaming_flax.py", line 392, in main
    split="train"
  File "/usr/local/lib/python3.7/site-packages/datasets/load.py", line 1686, in load_dataset
    use_auth_token=use_auth_token,
  File "/usr/local/lib/python3.7/site-packages/datasets/builder.py", line 897, in as_streaming_dataset
    splits_generators = {sg.name: sg for sg in self._split_generators(dl_manager)}
  File "/home/fedora/.cache/huggingface/modules/datasets_modules/datasets/code_clippy/86b09b4a623c1c39753a8ad165e05757d9a97daf132ac71d3b6eb791e7da16dd/code_clippy.py", line 111, in _split_generators
    gen_kwargs={"filepaths": sorted([str(fp) for fp in Path(f"{data_dir}/train").glob("*.jsonl.zst")])}
  File "/home/fedora/.cache/huggingface/modules/datasets_modules/datasets/code_clippy/86b09b4a623c1c39753a8ad165e05757d9a97daf132ac71d3b6eb791e7da16dd/code_clippy.py", line 111, in <listcomp>
    gen_kwargs={"filepaths": sorted([str(fp) for fp in Path(f"{data_dir}/train").glob("*.jsonl.zst")])}
  File "/usr/local/lib/python3.7/site-packages/datasets/utils/streaming_download_manager.py", line 384, in xpathglob
    yield from Path(main_hop).glob(pattern)
  File "/usr/local/lib/python3.7/site-packages/datasets/utils/streaming_download_manager.py", line 384, in xpathglob
    yield from Path(main_hop).glob(pattern)
  File "/usr/local/lib/python3.7/site-packages/datasets/utils/streaming_download_manager.py", line 384, in xpathglob
    yield from Path(main_hop).glob(pattern)
  [Previous line repeated 984 more times]
  File "/usr/local/lib/python3.7/site-packages/datasets/utils/streaming_download_manager.py", line 381, in xpathglob
    posix_path = _as_posix(path)
  File "/usr/local/lib/python3.7/site-packages/datasets/utils/streaming_download_manager.py", line 172, in _as_posix
    path_as_posix = path.as_posix()
  File "/usr/lib64/python3.7/pathlib.py", line 726, in as_posix
    return str(self).replace(f.sep, '/')
  File "/usr/lib64/python3.7/pathlib.py", line 716, in __str__
    self._parts) or '.'
  File "/usr/lib64/python3.7/pathlib.py", line 695, in _format_parsed_parts
    return drv + root + cls._flavour.join(parts[1:])
RecursionError: maximum recursion depth exceeded while calling a Python object
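
Judging from the traceback, the loading script globs {data_dir}/train/*.jsonl.zst. A quick sanity check of the directory layout before launching training might look like this (a hypothetical helper based only on the glob visible above):

# Hypothetical sanity check: code_clippy.py appears to expect compressed
# JSON-lines shards under <data_dir>/train/.
from pathlib import Path

data_dir = Path("/mnt/vol/FHGW/scm_fhgw/workspace_FHGW_21.000/FHGW-NW-CM")
shards = sorted((data_dir / "train").glob("*.jsonl.zst"))

if not shards:
    raise SystemExit(
        f"No *.jsonl.zst shards found under {data_dir / 'train'} - "
        "the streaming loader will have nothing to read."
    )
print(f"Found {len(shards)} shard(s), e.g. {shards[0].name}")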

Participation in an Open Source Language Modeling Dataset

Hi there, your repository has been selected to be included in an effort
to train an open source version of GitHub and OpenAI's Copilot tool.
You can find more information on our project here.

If you are the owner/admin of this repository and would like to opt-out of this,
please reply to this issue before July 9th with "yes" and we will remove your repository
from our list.

Wrong filenames in dataset

Hi,
The filenames in the code-clippy dedup dataset are wrong. In repos with multiple files, the various files are present, but they all share a single random filename that does not even have the correct file extension. For the gpt-code-clippy training effort this may not be an issue, since only the contents of the files matter, but it would be great if this could be fixed, or at least mentioned clearly.

Sample code to reproduce the issue (prints the filenames in the first 100 rows of the jsonl):

import json
import subprocess

import zstandard


def loadJsonL(fname):
    # Read a JSON-lines file into a list of dicts.
    data = []
    with open(fname) as fp:
        for line in fp:
            data.append(json.loads(line))
    return data


def processZSTLink(url):
    # Download the .jsonl.zst shard, decompress it, and print the
    # repo_name/file_name of the first 100 rows.
    zstfile = url.split('/')[-1]
    print(url)
    subprocess.run(f"wget {url}", shell=True, stdout=subprocess.DEVNULL)
    jsonlfile = zstfile[:-4]  # strip the ".zst" extension
    with open(zstfile, 'rb') as compressed:
        decomp = zstandard.ZstdDecompressor()
        with open(jsonlfile, 'wb') as destination:
            decomp.copy_stream(compressed, destination)

    data = loadJsonL(jsonlfile)
    for row in data[:100]:
        file_name = row['meta']['file_name']
        repo_name = row['meta']['repo_name']
        print(f"{repo_name}/{file_name}")


processZSTLink('https://the-eye.eu/public/AI/training_data/code_clippy_data//code_clippy_dedup_data/test/data_2814_time1626332048_default.jsonl.zst')

How to get started?

Is there an easier way to get started?

I tried to set up a machine and install all the requirements. I will try to go further tomorrow, but maybe I am doing something wrong.

The error I am currently stuck at is:
"""
2021-11-05 22:23:59.523515: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
Traceback (most recent call last):
File "run_clm_apps.py", line 800, in
main()
File "run_clm_apps.py", line 342, in main
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
File "/home/pankaj/.local/lib/python3.8/site-packages/transformers/hf_argparser.py", line 191, in parse_args_into_dataclasses
obj = dtype(**inputs)
File "", line 14, in init
File "run_clm_apps.py", line 174, in post_init
raise ValueError("Need either a dataset name or a training/validation file.")
ValueError: Need either a dataset name or a training/validation file.
"""
Also, getting the requirements to work was quite difficult on my machine. Wondering if I am doing something wrong.
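
The ValueError comes from the script's argument validation, so the run needs either --dataset_name or a --train_file/--validation_file. A hedged reconstruction of what that check typically looks like in Hugging Face example scripts (the exact code in run_clm_apps.py may differ):

# Hedged reconstruction of the check behind the error above; the real
# DataTrainingArguments in run_clm_apps.py may have more fields.
from dataclasses import dataclass
from typing import Optional

@dataclass
class DataTrainingArguments:
    dataset_name: Optional[str] = None
    train_file: Optional[str] = None
    validation_file: Optional[str] = None

    def __post_init__(self):
        if self.dataset_name is None and self.train_file is None and self.validation_file is None:
            raise ValueError("Need either a dataset name or a training/validation file.")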

--enable-proposed-api ncoop57.code-clippy

Hi everyone. After I launch the extension in debug mode, when I try writing I get this error:
[ncoop57.code-clippy]: editor/inlineCompletions/actions is a proposed menu identifier. It requires 'package.json#enabledApiProposals: ["inlineCompletionsAdditions"]' and is only available when running out of dev or with the following command line switch: --enable-proposed-api ncoop57.code-clippy

--enable-proposed-api ncoop57.code-clippy gives me a "Missing expression after unary operator '--'" error.
And code --enable-proposed-api ncoop57.code-clippy gets me out of Debug mode.

Does anyone have an idea how I can fix this?

Fine-tuning on GPT-J

Hi,

Are there any recommended steps or resources available for fine-tuning a large language model such as GPT-J in an unsupervised manner using GPT-Code-Clippy, with the goal of teaching the model about a new domain?

Thanks

Cannot seem to get good results

Hello, I'm attempting to run the starter code for the flax-community/gpt-neo-125M-code-clippy model.

For some reason, I cannot get anything other than blank characters and escape characters.

Would someone be able to assist?

Please publish the VS Code extension to OpenVSX as well

Hi!

Many thanks for working towards an open-source version of GitHub Copilot. 🙏

I'm particularly interested in the VS Code extension -- could you please also publish it to OpenVSX, the open-source, vendor-neutral IDE extension repository?

This would allow users of non-Microsoft products to install this extension as well (for example users of VSCodium, Gitpod, Theia, etc.)

The process should be pretty easy, especially since you already have a VS Code extension (OpenVSX uses the same publishing tools). Please:

  1. Generate a token

  2. Run npx ovsx publish

(That's it.)

Huggingface example

Hello, I found this amazing repository today. I tried to run the example from the Hugging Face model card on Google Colab, but it didn't output anything except "Setting pad_token_id to eos_token_id:50256 for open-end generation.". I want to know if I did anything wrong. Thanks! (Sorry for my poor English.)

This is the example:
[screenshot: example code from the model card]

This is the code I ran on Colab (I changed the device variable to "cpu"):
[screenshot: the Colab code]
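
A minimal generation sketch along the lines of the model-card example (a hedged reconstruction, not the exact snippet from the screenshots; the prompt and sampling parameters are assumptions, and from_flax=True may be needed if the checkpoint only ships Flax weights):

# Hedged sketch of running generation with one of the code-clippy checkpoints.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "flax-community/gpt-neo-125M-code-clippy"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)  # add from_flax=True if only Flax weights exist

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,  # silences the "Setting pad_token_id..." message
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))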

**Code Datasets**

  • Datasets to use?
  • How to collect the datasets?
  • How to store and organize the datasets?
  • What filtering/preprocessing/processing needs to be done to the datasets?
  • Merge data onto one TPU
  • Figure out how to deduplicate the dataset
  • Set up dataloading of the dataset using HF datasets (see the rough sketch after this list)
  • Talk with the owner of The Eye archive community about hosting our dataset, similar to the Pile
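
A rough sketch of what streaming dataloading of the compressed JSON-lines shards could look like (the paths and column name are assumptions; the repo's code_clippy.py loading script is the authoritative version, and reading .zst files requires the zstandard package):

# Hypothetical sketch: stream the jsonl.zst shards with Hugging Face datasets
# instead of materialising the full 200+ GB corpus on disk.
from datasets import load_dataset

dataset = load_dataset(
    "json",
    data_files={"train": "data/code_clippy/train/*.jsonl.zst"},  # example layout
    streaming=True,
    split="train",
)

for i, example in enumerate(dataset):
    print(example["text"][:80])  # assumes the text column is named "text"
    if i >= 2:
        break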

Training and fine-tuning on GPT-J

Trying to fine-tune GPT-J to create a better version of code-clippy.

I have already created a fine-tuning script. However, it would require a beefy TPU (a v3-256 would take about 6 weeks, I believe), and thus I cannot train it.

It would be great if this repository turns out to be helpful in the long run for creating an open-source version of GitHub Copilot.

Creating embeddings instead of output prediction

Hi! I was wondering whether a GPT Code Clippy model could produce embeddings instead of generated output.
The purpose is to embed code in a semantic space so that it can be used as a feature for another neural network. I have done the same with BERT (more as a baseline, since that model is not trained on code) and with the OpenAI Codex model (via the paid API), and would therefore love to use one of your models as well.
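
One common approach (a hedged sketch, not an official recipe for these checkpoints) is to mean-pool the last transformer layer's hidden states into a fixed-size vector; the model name below is an assumption:

# Hedged sketch: mean-pooled last-layer hidden states as a code embedding.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "flax-community/gpt-neo-125M-code-clippy"  # any of the checkpoints
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

code = "def add(a, b):\n    return a + b"
inputs = tokenizer(code, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

last_hidden = outputs.hidden_states[-1]         # (1, seq_len, hidden_size)
embedding = last_hidden.mean(dim=1).squeeze(0)  # (hidden_size,)
print(embedding.shape)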

Thank you!

Participation in an Open Source Language Modeling Dataset

Hi there, your repository has been selected to be included in an effort
to train an open source version of GitHub and OpenAI's Copilot tool.
You can find more information on our project here.

If you are the owner/admin of this repository and would like to opt-out of this, please reply to this issue
before July 9th with "yes" and we will remove your repository from our list.

**Code Model Evaluation**

  • How will we evaluate the model?
  • What metrics will we use?
  • What existing scripts could we repurpose?
  • Modified/newly created eval script created to feed into the rest of the pipeline
