
reacc's Introduction

ReACC

Source code for the ACL 2022 paper "ReACC: A Retrieval-Augmented Code Completion Framework". ReACC combines a source code retriever with an auto-regressive language model for programming languages.

Dependencies

  • pytorch >= 1.7.0
  • transformers >= 4.10.0
  • tree_sitter
  • faiss-gpu
  • beir (for BM25)
    • Elasticsearch

Instructions

Here are the instructions for applying the ReACC framework to the code completion task on the PY150 dataset.

1. Pretrain a retriever

Use microsoft/reacc-py-retriever as a code-to-code retriever for Python source code.
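If you just want to embed a single snippet with this checkpoint outside of infer.py, a minimal sketch looks like the following. The mean pooling here is an illustrative assumption; infer.py defines the actual representation used in the paper (for example, the --num_vec option below produces multiple vectors per snippet).

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/reacc-py-retriever")
model = AutoModel.from_pretrained("microsoft/reacc-py-retriever")
model.eval()

code = "def add(a, b):\n    return a + b"
inputs = tokenizer(code, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state          # (1, seq_len, dim)
mask = inputs["attention_mask"].unsqueeze(-1).float()    # (1, seq_len, 1)
embedding = (hidden * mask).sum(1) / mask.sum(1)         # mean-pooled snippet vector (assumption)
embedding = torch.nn.functional.normalize(embedding, dim=-1)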

2. Build an index for search

First, prepare a codebase to retrieve from. It is recommended to split each file or function into small chunks (see utils/split_codes.py). Then run the following command to compute representations of all the code in the search corpus.

python -m torch.distributed.launch --nproc_per_node=${PER_NODE_GPU} infer.py \
        --data_path=data/train_split.txt \
        --save_name=save_vec \
        --lang=python \
        --pretrained_dir=microsoft/reacc-py-retriever \
        --num_vec=8 \
        --block_size=512 \
        --gpu_per_node ${PER_NODE_GPU} \
        --logging_steps=100 

You can modify InferDataset in infer.py to fit your own dataset. Our dataset is formatted as a JSONL file, where each line looks like

{
        "code": "def function()",
        "id": 0
}

or a plain text file, in which each line is a code snippet.
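With faiss-gpu installed, the saved representations can then be loaded into an index for dense retrieval. The sketch below is illustrative, not the repo's script: it assumes the vectors end up as a single float32 NumPy array of shape (num_chunks, dim) in a hypothetical save_vec.npy file, so adapt the loading step to whatever format your run of infer.py actually writes.

import faiss
import numpy as np

# Hypothetical file name/format; match the output of infer.py (--save_name=save_vec).
corpus_vecs = np.load("save_vec.npy").astype("float32")   # (num_chunks, dim)

faiss.normalize_L2(corpus_vecs)                  # L2-normalize so inner product = cosine similarity
index = faiss.IndexFlatIP(corpus_vecs.shape[1])  # exact inner-product search
index.add(corpus_vecs)
faiss.write_index(index, "corpus.index")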

3. Retrieve step

ReACC is a two-stage framework. The first stage retrieves code similar to a given query. Since the test set is fixed, we retrieve the similar code for all test queries in advance; in practice it would be better to merge step 3 into step 4.

First, compute the representations of the test queries as in step 2. Then run the script utils/search_dense.py to rank candidates by similarity and retrieve the most similar code.

If you would like to use the BM25 algorithm to retrieve similar code instead, run the script utils/search_bm25.py.

Finally, run utils/get_res.py to get the most similar code based on the BM25 results, the dense retrieval results, or both.
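For reference, the dense lookup boils down to a nearest-neighbor search over the index built in step 2. The sketch below illustrates that idea and is not a replacement for utils/search_dense.py; query_vec.npy is a hypothetical file holding the float32 query representations produced by infer.py.

import faiss
import numpy as np

index = faiss.read_index("corpus.index")
query_vecs = np.load("query_vec.npy").astype("float32")   # hypothetical query representations
faiss.normalize_L2(query_vecs)

scores, ids = index.search(query_vecs, 10)   # top-10 most similar chunks per query
for qid in range(len(ids)):
    print(qid, list(zip(ids[qid].tolist(), scores[qid].tolist()))[:3])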

4. Generation step

Please download the PY150 dataset first and use the preprocessing scripts in CodeXGLUE, then follow CodeXGLUE to fine-tune a model on it, such as CodeGPT.

The second stage of ReACC completes code based on both the context and the retrieved code. We simply put the retrieved code before the context and concatenate them as the model input.
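Conceptually, building the model input is just a prepend-and-truncate operation. The sketch below is a simplification, not the logic in gen/dataset.py: it keeps the rightmost tokens so the immediate context is never cut off, while the real script also handles details such as the literals file and length budgeting.

def build_input(tokenizer, retrieved_code, context, block_size=1024):
    """Prepend retrieved code to the completion context (illustrative only)."""
    retrieved_ids = tokenizer.encode(retrieved_code)
    context_ids = tokenizer.encode(context)
    input_ids = retrieved_ids + context_ids
    # Left-truncate: drop tokens from the retrieved part first, never the recent context.
    return input_ids[-block_size:]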

Navigate to the gen folder. We adapt the code completion scripts from CodeXGLUE and modify dataset.py to include the similar code in the input. Run run_lm.py to evaluate your fine-tuned model.

export CUDA_VISIBLE_DEVICES=0
LANG=python
DATADIR=dataset/py150
LITFILE=${DATADIR}/literals.json
OUTPUTDIR=save/py150
PRETRAINDIR=py150-ckpt

LOADFILE=${DATADIR}/train_split
RESFILE=search_res.pkl
SAVEFILE=prediction.txt

python -u run_lm.py \
        --data_dir=$DATADIR \
        --lit_file=$LITFILE \
        --langs=$LANG \
        --output_dir=$OUTPUTDIR \
        --pretrain_dir=$PRETRAINDIR \
        --load_file_name=$LOADFILE \
        --search_res=$RESFILE \
        --save_name=$SAVEFILE \
        --model_type=gpt2 \
        --block_size=1024 \
        --eval_line \
        --logging_steps=100 \
        --seed=42 

Zero-shot code clone detection

To evaluate the effectiveness of the code-to-code retrieval module in ReACC, we perform the code clone detection task, which aims to retrieve semantically equivalent programs.

We extract the evaluation dataset from CodeNet, the same as in the UniXcoder paper. The dataset can be downloaded from here.

Run codenet_test.py to reproduce this experiment.

DATADIR=CodeNet
PRETRAINDIR=microsoft/reacc-py-retriever
 
python -u codenet_test.py \
        --data_dir=$DATADIR \
        --pretrained_dir=$PRETRAINDIR \
        --lang=python \
        --num_vec=8 \
        --cut \
        --block_size=512 \
        --per_gpu_eval_batch_size=64 \
        --logging_steps=100 \
        --seed=614 
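At a high level, this evaluation embeds every program with the retriever and ranks candidates by cosine similarity, after which retrieval metrics are computed over the rankings. The sketch below shows only the ranking step on a hypothetical array of precomputed embeddings, not the full evaluation performed by codenet_test.py.

import numpy as np

# Hypothetical file of precomputed program embeddings, shape (num_programs, dim).
embs = np.load("codenet_embs.npy").astype("float32")
embs /= np.linalg.norm(embs, axis=1, keepdims=True)

sims = embs @ embs.T                 # cosine similarity between all pairs of programs
np.fill_diagonal(sims, -np.inf)      # never retrieve the query itself
ranking = np.argsort(-sims, axis=1)  # for each program, candidates sorted most-similar first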

Code of Conduct

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

License

Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT license.

Reference

If you use this code or ReACC, please consider citing us.

@article{lu2022reacc,
  title={ReACC: A Retrieval-Augmented Code Completion Framework},
  author={Lu, Shuai and Duan, Nan and Han, Hojae and Guo, Daya and Hwang, Seung-won and Svyatkovskiy, Alexey},
  journal={arXiv preprint arXiv:2203.07722},
  year={2022}
}

reacc's People

Contributors

celbree

reacc's Issues

Retriever training

Hello, two questions:
1. How can we obtain an incomplete code-to-code retriever for Java? The paper only provides the retriever for Python ([microsoft/reacc-py-retriever]). Do we need to train a Java retriever ourselves?
2. If we want to train the retriever in the ReACC framework ourselves, how should we obtain the dataset? Do we need to find a dataset on our own, apply the transformations (identifier renaming and dead code insertion), train on it, and then run codenet_test.py to evaluate our retriever?
Thank you!

train_split

If we use the PY150 dataset, do we need train_split.txt?

Search dataset

Hello, in step 2 of the README ("Build an index for search"), how should we obtain the dataset used to build the retrieval codebase? Thank you very much!

How many retrieved completions are prepended to the input?

The second stage of ReACC completes code based on the context and the retrieved code, and ReACC simply puts the retrieved code before the context and concatenates them as the input. How many retrieved code snippets are prepended?

Does the choice depend on any predefined length? Say the maximum allowed length is 1024 and the input is 256; then the maximum length of the retrieved document could be 768. Since completion sizes are less than 300, how is the number of retrieved documents decided?

Request for Implementation Code for Data Augmentation

I would like to use the data augmentation methods used by ReACC, i.e., identifier renaming and dead code insertion, as the baseline data augmentation in my research.
process_python.py seems to be the code for that, but I don't know how to use it. Could you explain how to use it or provide an example?
