
codedotal / gpt-code-clippy


Full description can be found here: https://discuss.huggingface.co/t/pretrain-gpt-neo-for-open-source-github-copilot-model/7678?u=ncoop57

License: Apache License 2.0

Languages: Python 94.68%, Shell 0.36%, Jupyter Notebook 4.95%, Dockerfile 0.01%

gpt-code-clippy's People

Contributors

arampacha, arunraja-hub, bentrevett, hiroakimikami, mrinal18, ncoop57, ndrpnt, neubig, reshinthadithyan, shpotes, taisazero


gpt-code-clippy's Issues

Hi, I increased the number of layers for the file with the error, and found that the error is still reported. I want to ask two questions.

[screenshot: the error message]

  • Can I just clone this project, install the corresponding libraries, and then run run_clm_streaming_flax.py? Do I need to download the 200+ GB training corpus to my local server, or is it downloaded automatically? Could you provide a more detailed pre-training guide?
  • The server I use has a GPU, so why does run_clm_streaming_flax.py fall back to the CPU by default when I run it directly?
    [screenshot: log output]

Thanks!

Originally posted by @BitcoinNLPer in #74 (comment)

Things That Could Go Wrong

Hi y'all, I'd like to make sure we do plenty of brainstorming on where things could go wrong in terms of ethical concerns. I don't want our field to repeat the issues that have come up in other areas of AI, such as bias and a lack of discussion of limitations. So please use this issue to discuss anything that could go wrong (we also have an internal Discord channel where we discuss this in a less formal setting, which I will periodically synthesize here).
Here are a few things that have already been discussed:

  1. Vulnerabilities being inserted into completions
  2. Licensing issues
  3. Automating developers out of a job

**Training script**

  • add bf16 support
  • check if training with bf16 weights works fine
  • add resuming from ckpt
  • add wandb tracking
  • complete adafactor option
  • figure out how to best utilize profiler for training loop optimization
  • add gradient accumulation (see the sketch after this list)
  • support iterable datasets and max_steps argument
  • prefetch generator for dataloader
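
A minimal sketch of how the gradient-accumulation item could be approached with optax.MultiSteps (the function names here are hypothetical, not the project's actual training code):

# Hypothetical sketch: gradient accumulation via optax.MultiSteps, which
# accumulates gradients for k micro-batches and only applies the wrapped
# optimizer's update on every k-th call.
import jax
import optax

def make_optimizer(learning_rate=1e-4, accumulation_steps=16):
    inner = optax.adamw(learning_rate, b1=0.9, b2=0.95, weight_decay=0.1)
    return optax.MultiSteps(inner, every_k_schedule=accumulation_steps)

def train_step(params, opt_state, batch, loss_fn, optimizer):
    # loss_fn(params, batch) -> scalar loss; grads share the pytree structure of params.
    grads = jax.grad(loss_fn)(params, batch)
    updates, opt_state = optimizer.update(grads, opt_state, params)
    params = optax.apply_updates(params, updates)
    return params, opt_state

# usage: optimizer = make_optimizer(); opt_state = optimizer.init(params)

This keeps the per-device batch small while emulating a larger effective batch, similar to the --gradient_accumulation_steps flag used in the training commands elsewhere in this repo.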

EleutherAI/gpt-neo-1.3B Model works better than this.

Hi,
You guys are doing a great job with it.

I have tried your flax-community/gpt-neo-1.3B-apps-all model,
and the generated code is kinda hit or miss.

This is generated using
flax-community/gpt-neo-1.3B-apps-all
[screenshot: completion generated by flax-community/gpt-neo-1.3B-apps-all]

and this is generated using
EleutherAI/gpt-neo-1.3B
[screenshot: completion generated by EleutherAI/gpt-neo-1.3B]

As far as I know, EleutherAI/gpt-neo-1.3B is trained on more general text, not necessarily code.

So why is flax-community/gpt-neo-1.3B-apps-all performing much worse than EleutherAI/gpt-neo-1.3B?

Vim Plugin

Awesome work,
I'd like to try developing a Vim plugin version of the project (if my knowledge allows it).
BTW, awesome job.

Participation in an Open Source Language Modeling Dataset

Hi there, your repository has been selected to be included in an effort
to train an open source version of GitHub and OpenAI's Copilot tool.
You can find more information on our project here.

If you are the owner/admin of this repository and would like to opt-out of this,
please reply to this issue before July 9th with "yes" and we will remove your
repository from our list.

Low Pass@k

Hi,
Thanks for the great work!
Firstly, I wanted to ask about the performance of the code-clippy models. It seems that the 125M-parameter models are quite weak and perform quite poorly on the HumanEval dataset (even lower than GPT-Neo-1.3B?). Any idea why this is happening?

Also, is there any update on the evaluation of the GPT-Neo-1.3B code-clippy model?

Finally, I would love to contribute to upcoming iterations of code-clippy. Should I join the discord channel?
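
For reference, pass@k numbers like these are usually computed with the unbiased estimator from the Codex paper, where n is the number of generated samples per problem and c is the number that pass the tests; a small sketch:

# Unbiased pass@k estimator (Chen et al., "Evaluating Large Language Models
# Trained on Code"): pass@k = 1 - C(n-c, k) / C(n, k), in a numerically
# stable form.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """n: samples per problem, c: samples that pass, k: evaluation budget."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples per problem, 13 passing -> pass@1 is about 0.065.
print(pass_at_k(200, 13, 1))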

**Code Model**

  • What model will be used?
  • How will the model be trained?
  • What existing training script can be repurposed?
  • Modified/newly created training script that can feed into the rest of the pipeline

**Code Tokenization**

  • What sort of tokenization will be done?
  • Scripts/tutorials that can do the tokenization?
  • Modified/newly created tokenization script to feed into the rest of the pipeline (a rough sketch follows below)
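
One possible starting point (a hypothetical sketch, not the project's actual pipeline) is training a GPT-2-style byte-level BPE tokenizer on raw code files with the Hugging Face tokenizers library; the corpus path and vocabulary size below are assumptions:

# Hypothetical sketch: train a byte-level BPE tokenizer (GPT-2 style) on code files.
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

code_files = [str(p) for p in Path("data/code").rglob("*.py")]  # example corpus location

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=code_files,
    vocab_size=50257,                  # matching GPT-Neo's vocabulary size
    special_tokens=["<|endoftext|>"],  # GPT-2/GPT-Neo end-of-text token
)
tokenizer.save_model("tokenizer_out")  # writes vocab.json and merges.txt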

Unable to train with custom data

Hi,
when I try to train a model from scratch I am facing the following error.
The data_dir contains only a small amount of data, so I think the CPU should be sufficient in my case. What exactly could cause this?
@ncoop57 can you please check and help?

./run_clm_streaming_flax.py \
    --output_dir $HOME/fhgw-gpt-neo-125M-code-clippy \
    --dataset_name /home/fedora/explore/clippy/gpt-code-clippy/data_processing/code_clippy.py \
    --data_dir /mnt/vol/FHGW/scm_fhgw/workspace_FHGW_21.000/FHGW-NW-CM \
    --text_column_name="text" \
    --do_train --do_eval \
    --block_size="2048" \
    --per_device_train_batch_size="8" \
    --per_device_eval_batch_size="16" \
    --preprocessing_num_workers="8" \
    --learning_rate="1e-4" \
    --max_steps 100000 \
    --warmup_steps 2500 \
    --decay_steps 25000 \
    --adam_beta1="0.9" \
    --adam_beta2="0.95" \
    --weight_decay="0.1" \
    --overwrite_output_dir \
    --logging_steps="100" \
    --eval_steps="500" \
    --push_to_hub="False" \
    --report_to="all" \
    --dtype="bfloat16" \
    --skip_memory_metrics="True" \
    --save_steps="500" \
    --save_total_limit 10 \
    --gradient_accumulation_steps 16 \
    --report_to="wandb" \
    --run_name="125m_1e-4lr_1024bs" \
    --max_eval_samples 2000 \
    --save_optimizer true

2022-01-06 08:27:11.271076: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
INFO:absl:Unable to initialize backend 'tpu_driver': NOT_FOUND: Unable to find driver in registry given worker: 
INFO:absl:Unable to initialize backend 'gpu': NOT_FOUND: Could not find registered platform with name: "cuda". Available platform names are: Interpreter Host
INFO:absl:Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not available.
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
INFO:__main__:Training/evaluation parameters TrainingArguments(
_n_gpu=0,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.95,
adam_epsilon=1e-08,
bf16=False,
bf16_full_eval=False,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_steps=500,
evaluation_strategy=IntervalStrategy.NO,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
gradient_accumulation_steps=16,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=0.0001,
length_column_name=length,
load_best_model_at_end=False,
local_rank=-1,
log_level=-1,
log_level_replica=-1,
log_on_each_node=True,
logging_dir=/home/fedora/fhgw-gpt-neo-125M-code-clippy/runs/Jan06_08-27-13_fedora.novalocal,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=100,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=100000,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=3.0,
output_dir=/home/fedora/fhgw-gpt-neo-125M-code-clippy,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=16,
per_device_train_batch_size=8,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
remove_unused_columns=True,
report_to=['wandb'],
resume_from_checkpoint=None,
run_name=125m_1e-4lr_1024bs,
save_on_each_node=False,
save_steps=500,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=10,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
tf32=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_legacy_prediction_loop=False,
warmup_ratio=0.0,
warmup_steps=2500,
weight_decay=0.1,
xpu_backend=None,
)
WARNING:datasets.builder:Using custom data configuration default-01c596fb6133304a
Traceback (most recent call last):
  File "/usr/lib64/python3.7/pathlib.py", line 713, in __str__
    return self._str
AttributeError: _str

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./run_clm_streaming_flax.py", line 774, in <module>
    main()
  File "./run_clm_streaming_flax.py", line 392, in main
    split="train"
  File "/usr/local/lib/python3.7/site-packages/datasets/load.py", line 1686, in load_dataset
    use_auth_token=use_auth_token,
  File "/usr/local/lib/python3.7/site-packages/datasets/builder.py", line 897, in as_streaming_dataset
    splits_generators = {sg.name: sg for sg in self._split_generators(dl_manager)}
  File "/home/fedora/.cache/huggingface/modules/datasets_modules/datasets/code_clippy/86b09b4a623c1c39753a8ad165e05757d9a97daf132ac71d3b6eb791e7da16dd/code_clippy.py", line 111, in _split_generators
    gen_kwargs={"filepaths": sorted([str(fp) for fp in Path(f"{data_dir}/train").glob("*.jsonl.zst")])}
  File "/home/fedora/.cache/huggingface/modules/datasets_modules/datasets/code_clippy/86b09b4a623c1c39753a8ad165e05757d9a97daf132ac71d3b6eb791e7da16dd/code_clippy.py", line 111, in <listcomp>
    gen_kwargs={"filepaths": sorted([str(fp) for fp in Path(f"{data_dir}/train").glob("*.jsonl.zst")])}
  File "/usr/local/lib/python3.7/site-packages/datasets/utils/streaming_download_manager.py", line 384, in xpathglob
    yield from Path(main_hop).glob(pattern)
  File "/usr/local/lib/python3.7/site-packages/datasets/utils/streaming_download_manager.py", line 384, in xpathglob
    yield from Path(main_hop).glob(pattern)
  File "/usr/local/lib/python3.7/site-packages/datasets/utils/streaming_download_manager.py", line 384, in xpathglob
    yield from Path(main_hop).glob(pattern)
  [Previous line repeated 984 more times]
  File "/usr/local/lib/python3.7/site-packages/datasets/utils/streaming_download_manager.py", line 381, in xpathglob
    posix_path = _as_posix(path)
  File "/usr/local/lib/python3.7/site-packages/datasets/utils/streaming_download_manager.py", line 172, in _as_posix
    path_as_posix = path.as_posix()
  File "/usr/lib64/python3.7/pathlib.py", line 726, in as_posix
    return str(self).replace(f.sep, '/')
  File "/usr/lib64/python3.7/pathlib.py", line 716, in __str__
    self._parts) or '.'
  File "/usr/lib64/python3.7/pathlib.py", line 695, in _format_parsed_parts
    return drv + root + cls._flavour.join(parts[1:])
RecursionError: maximum recursion depth exceeded while calling a Python object
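
Judging from the traceback, the loading script globs {data_dir}/train/*.jsonl.zst. A quick sanity check of the directory layout before launching training might look like this (a hypothetical helper based only on the glob visible above):

# Hypothetical sanity check: code_clippy.py appears to expect compressed
# JSON-lines shards under <data_dir>/train/.
from pathlib import Path

data_dir = Path("/mnt/vol/FHGW/scm_fhgw/workspace_FHGW_21.000/FHGW-NW-CM")
shards = sorted((data_dir / "train").glob("*.jsonl.zst"))

if not shards:
    raise SystemExit(
        f"No *.jsonl.zst shards found under {data_dir / 'train'} - "
        "the streaming loader will have nothing to read."
    )
print(f"Found {len(shards)} shard(s), e.g. {shards[0].name}")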

Participation in an Open Source Language Modeling Dataset

Hi there, your repository has been selected to be included in an effort
to train an open source version of GitHub and OpenAI's Copilot tool.
You can find more information on our project here.

If you are the owner/admin of this repository and would like to opt-out of this,
please reply to this issue before July 9th with "yes" and we will remove your repository
from our list.

Wrong filenames in dataset

Hi,
The filenames in the code-clippy dedup dataset are wrong. In repos with multiple files, the various files are present, but they all share a single random filename that does not even have the correct file extension. For the gpt-code-clippy training effort this may not be an issue, since only the contents of the files matter, but it would be great if this could be fixed, or at least mentioned clearly.

Sample code to reproduce the issue (prints the filenames in the first 100 rows of the jsonl):

import json
import subprocess

import zstandard


def loadJsonL(fname):
    # Read a JSON-lines file into a list of dicts.
    data = []
    with open(fname) as fp:
        for line in fp:
            data.append(json.loads(line))
    return data


def processZSTLink(url):
    # Download the .jsonl.zst shard, decompress it, and print the
    # repo_name/file_name of the first 100 rows.
    zstfile = url.split('/')[-1]
    print(url)
    subprocess.run(f"wget {url}", shell=True, stdout=subprocess.DEVNULL)
    jsonlfile = zstfile[:-4]  # strip the ".zst" extension
    with open(zstfile, 'rb') as compressed:
        decomp = zstandard.ZstdDecompressor()
        with open(jsonlfile, 'wb') as destination:
            decomp.copy_stream(compressed, destination)

    data = loadJsonL(jsonlfile)
    for row in data[:100]:
        file_name = row['meta']['file_name']
        repo_name = row['meta']['repo_name']
        print(f"{repo_name}/{file_name}")


processZSTLink('https://the-eye.eu/public/AI/training_data/code_clippy_data//code_clippy_dedup_data/test/data_2814_time1626332048_default.jsonl.zst')

How to get started?

Is there an easier way to get started?

I tried to set up a machine and install all the requirements. I will try to go further tomorrow, but maybe I am doing something wrong.

The error I am currently stuck at is:
"""
2021-11-05 22:23:59.523515: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
Traceback (most recent call last):
File "run_clm_apps.py", line 800, in
main()
File "run_clm_apps.py", line 342, in main
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
File "/home/pankaj/.local/lib/python3.8/site-packages/transformers/hf_argparser.py", line 191, in parse_args_into_dataclasses
obj = dtype(**inputs)
File "", line 14, in init
File "run_clm_apps.py", line 174, in post_init
raise ValueError("Need either a dataset name or a training/validation file.")
ValueError: Need either a dataset name or a training/validation file.
"""
Also, getting the requirements to work was quite difficult on my machine. Wondering if I am doing something wrong.
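
The ValueError comes from the script's argument validation, so the run needs either --dataset_name or a --train_file/--validation_file. A hedged reconstruction of what that check typically looks like in Hugging Face example scripts (the exact code in run_clm_apps.py may differ):

# Hedged reconstruction of the check behind the error above; the real
# DataTrainingArguments in run_clm_apps.py may have more fields.
from dataclasses import dataclass
from typing import Optional

@dataclass
class DataTrainingArguments:
    dataset_name: Optional[str] = None
    train_file: Optional[str] = None
    validation_file: Optional[str] = None

    def __post_init__(self):
        if self.dataset_name is None and self.train_file is None and self.validation_file is None:
            raise ValueError("Need either a dataset name or a training/validation file.")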

--enable-proposed-api ncoop57.code-clippy

Hi everyone. After I launch the extension in debug mode, when I try writing I get this error:
[ncoop57.code-clippy]: editor/inlineCompletions/actions is a proposed menu identifier. It requires 'package.json#enabledApiProposals: ["inlineCompletionsAdditions"]' and is only available when running out of dev or with the following command line switch: --enable-proposed-api ncoop57.code-clippy

--enable-proposed-api ncoop57.code-clippy gives me a "Missing expression after unary operator '--'" error.
And code --enable-proposed-api ncoop57.code-clippy gets me out of Debug mode.

Does anyone have an idea how I can fix this?

Fine-tuning on GPT-J

Hi,

Are there any recommended steps or resources available for fine-tuning a large language model such as GPT-J in an unsupervised manner using GPT-Code-Clippy, with the goal of teaching the model about a new domain?

Thanks

Cannot seem to get good results

Hello, I'm attempting to run the starter code for the flax-community/gpt-neo-125M-code-clippy model.

For some reason, I cannot get anything other than blank characters and escape characters.

Would someone be able to assist?

Please publish the VS Code extension to OpenVSX as well

Hi!

Many thanks for working towards an open-source version of GitHub Copilot. 🙏

I'm particularly interested in the VS Code extension -- could you please also publish it to OpenVSX, the open-source, vendor-neutral IDE extension repository?

This would allow users of non-Microsoft products to install this extension as well (for example users of VSCodium, Gitpod, Theia, etc.)

The process should be pretty easy, especially since you already have a VS Code extension (OpenVSX uses the same publishing tools). Please:

  1. Generate a token

  2. Run npx ovsx publish

(That's it.)

Huggingface example

Hello, I found this amazing repository today. I tried to run the example from the Hugging Face model card on Google Colab, but it didn't output anything except "Setting pad_token_id to eos_token_id:50256 for open-end generation.". I want to know if I did anything wrong. Thanks! (Sorry for my poor English.)

This is the example:
[screenshot: example code from the model card]

This is the code I ran on Colab (I changed the device variable to "cpu"):
[screenshot: the Colab code]
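
A minimal generation sketch along the lines of the model-card example (a hedged reconstruction, not the exact snippet from the screenshots; the prompt and sampling parameters are assumptions, and from_flax=True may be needed if the checkpoint only ships Flax weights):

# Hedged sketch of running generation with one of the code-clippy checkpoints.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "flax-community/gpt-neo-125M-code-clippy"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)  # add from_flax=True if only Flax weights exist

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,  # silences the "Setting pad_token_id..." message
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))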

**Code Datasets**

  • Datasets to use?
  • How to collect the datasets?
  • How to store and organize the datasets?
  • What filtering/preprocessing/processing needs to be done to the datasets?
  • Merge data onto one TPU
  • Figure out how to deduplicate the dataset
  • Set up dataloading of the dataset using HF datasets (see the rough sketch after this list)
  • Talk with the owner of The Eye archive community about hosting our dataset, similar to the Pile
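
A rough sketch of what streaming dataloading of the compressed JSON-lines shards could look like (the paths and column name are assumptions; the repo's code_clippy.py loading script is the authoritative version, and reading .zst files requires the zstandard package):

# Hypothetical sketch: stream the jsonl.zst shards with Hugging Face datasets
# instead of materialising the full 200+ GB corpus on disk.
from datasets import load_dataset

dataset = load_dataset(
    "json",
    data_files={"train": "data/code_clippy/train/*.jsonl.zst"},  # example layout
    streaming=True,
    split="train",
)

for i, example in enumerate(dataset):
    print(example["text"][:80])  # assumes the text column is named "text"
    if i >= 2:
        break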

Training and fine-tuning on GPT-J

Trying to fine-tune GPT-J to create a better version of code-clippy.

I have already created a fine-tuning script. However, it would require a beefy TPU (a v3-256 would take about 6 weeks, I believe), and thus I cannot train it.

It would be great if this repository turns out to be helpful in the long run for creating an open-source version of GitHub Copilot.

Creating embeddings instead of output prediction

Hi! I was wondering whether a GPT Code Clippy model could produce embeddings instead of generated output.
The purpose is to embed code in a semantic space so that it can be used as a feature for another neural network. I have done the same with BERT (more as a baseline, since that model is not trained on code) and with the OpenAI Codex model (via the paid API), and would therefore love to use one of your models as well.
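
One common approach (a hedged sketch, not an official recipe for these checkpoints) is to mean-pool the last transformer layer's hidden states into a fixed-size vector; the model name below is an assumption:

# Hedged sketch: mean-pooled last-layer hidden states as a code embedding.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "flax-community/gpt-neo-125M-code-clippy"  # any of the checkpoints
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

code = "def add(a, b):\n    return a + b"
inputs = tokenizer(code, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

last_hidden = outputs.hidden_states[-1]         # (1, seq_len, hidden_size)
embedding = last_hidden.mean(dim=1).squeeze(0)  # (hidden_size,)
print(embedding.shape)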

Thank you!

Participation in an Open Source Language Modeling Dataset

Hi there, your repository has been selected to be included in an effort
to train an open source version of GitHub and OpenAI's Copilot tool.
You can find more information on our project here.

If you are the owner/admin of this repository and would like to opt-out of this, please reply to this issue
before July 9th with "yes" and we will remove your repository from our list.

**Code Model Evaluation**

  • How will we evaluate the model?
  • What metrics will we use?
  • What existing scripts could we repurpose?
  • Modified/newly created eval script created to feed into the rest of the pipeline
