This project forked from wasiahmad/plbart

Official code of our work, Unified Pre-training for Program Understanding and Generation [NAACL 2021].

License: MIT License

PLBART

Official code release of our NAACL 2021 work, Unified Pre-training for Program Understanding and Generation. We present PLBART, which is pre-trained on a large collection of Java and Python functions and natural language descriptions collected from GitHub and StackOverflow, respectively.

We present the file structure of this repository here.

What's New:

Setup (optional)

We can set up a conda environment to run PLBART experiments. The first step is to download the dependencies. We assume that Anaconda and Python 3.6 are installed. The additional requirements (as noted in requirements.txt) can be installed by running the following script:

bash install_tools.sh

Pre-training

Install apex for fp16 training. Then follow the steps below.

Step1. Download Github data

Go to the data/github directory and follow the instructions there.

Step2. Download StackOverflow data

Go to the data/stackoverflow directory and follow the instructions there.

Step3. Binarize the data and pre-train

cd pretrain
bash binarize.sh
bash pretrain.sh GPU_IDS

[Note] We pre-trained PLBART on 8 GeForce RTX 2080 (11 GB) GPUs, which took ~11.5 days. If you want to pre-train PLBART using more GPUs or GPUs with more memory, adjust MAX_SENTENCES, MAX_TOKENS, and UPDATE_FREQ accordingly to maintain an effective batch size of 2048. According to fairseq, the effective batch size is equal to:

PER_GPU_TRAIN_BATCH_SIZE * NUM_GPU * UPDATE_FREQ

Note that MAX_TOKENS refers to the size of each mini-batch in terms of the number of tokens. During our experiments, we noticed that an 11 GB GPU can accommodate at most 2048 tokens, which is equivalent to 4-5 examples. Therefore, we set UPDATE_FREQ to 60 so that, with 8 GPUs, the effective batch size is ~2048.
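The arithmetic above can be checked with a short Python sketch. This is only an illustration of the formula, not part of the released scripts; the function names effective_batch_size and update_freq_for_target are hypothetical, and the 16-GPU setup is an assumed example.

```python
import math


def effective_batch_size(per_gpu_batch_size: int, num_gpus: int, update_freq: int) -> int:
    """PER_GPU_TRAIN_BATCH_SIZE * NUM_GPU * UPDATE_FREQ, as in the formula above."""
    return per_gpu_batch_size * num_gpus * update_freq


def update_freq_for_target(target: int, per_gpu_batch_size: int, num_gpus: int) -> int:
    """Smallest UPDATE_FREQ whose effective batch size reaches `target` (hypothetical helper)."""
    return math.ceil(target / (per_gpu_batch_size * num_gpus))


if __name__ == "__main__":
    # Setup described above: 8 GPUs, 4-5 examples per GPU, UPDATE_FREQ = 60.
    for per_gpu in (4, 5):
        print(per_gpu, effective_batch_size(per_gpu, num_gpus=8, update_freq=60))

    # Assumed example: 16 GPUs that each still fit about 4 examples.
    print(update_freq_for_target(2048, per_gpu_batch_size=4, num_gpus=16))
```

With 4-5 examples per GPU, the setup above lands between 1920 and 2400 examples per update, which is why the effective batch size is stated as approximately 2048. The number of examples per GPU varies with sequence length, so treat any computed UPDATE_FREQ as a starting point.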

Fine-tuning on Downstream Tasks

We fine-tune and evaluate PLBART on three types of tasks.

Type Task Language(s) Data Scripts Checkpoints
Code to Text Code summarization Python, Java, Ruby, PHP, JavaScript, Go [LINK] [LINK] [LINK]
Text to Code Code generation Java [LINK] [LINK] [LINK]
Code to Code Code translation Java, C# [LINK] [LINK] [LINK]
Code refinement Java [LINK] [LINK]
Clone detection Java [LINK] [LINK]
Defect detection C/C++ [LINK] [LINK]

Step1. Download PLBART checkpoint

cd pretrain
bash download.sh
cd ..

Step2. Download the data

cd data/codeXglue
bash download.sh
cd ../..

Step3. Build parser for CodeBLEU evaluation

cd evaluation/CodeBLEU/parser
bash build.sh
cd ../../..

Step4. Prepare the data, train and evaluate PLBART

For example, to fine-tune PLBART on the Text-to-Code task:

cd scripts/text_to_code
bash prepare.sh
bash run.sh GPU_IDS
cd ../..

Note. We fine-tuned PLBART on a single GeForce RTX 2080 (11 GB) GPU.

Notes

Mismatch in performance reported in the paper and achieved using the released checkpoints.

The performance reported in the paper differs from the performance achieved with the released checkpoints. We noted the differences here. Note that the hyper-parameter settings are unchanged: the bash scripts contain exactly the values we used. The difference is perhaps due to running the experiments at different points in time. Although we did not do so ourselves, we recommend fine-tuning PLBART with multiple different seeds and reporting the average scores.
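A minimal sketch of averaging scores over seeds is given below. It assumes you have already collected one evaluation score per seed by running the fine-tuning script several times; the function name summarize_runs is hypothetical and not part of this repository.

```python
from statistics import mean, stdev
from typing import Dict, Tuple


def summarize_runs(scores_by_seed: Dict[int, float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of per-seed scores.

    `scores_by_seed` maps a random seed to the metric obtained with that
    seed. With a single run the spread is reported as 0.0.
    """
    values = list(scores_by_seed.values())
    if not values:
        raise ValueError("at least one run is required")
    spread = stdev(values) if len(values) > 1 else 0.0
    return mean(values), spread
```

Reporting the spread alongside the mean makes it easier to judge whether a gap between two results is larger than seed-to-seed variation.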

mbart_base task is not present in fairseq==0.9.0 official release.

Although we used fairseq==0.9.0, we installed it from a different commit that includes the mbart_base task. The following should work.

git clone https://github.com/pytorch/fairseq
cd fairseq
git checkout 698e3b91ffa832c286c48035bdff78238b0de8ae
pip install .

Otherwise, you may consider installing fairseq==0.10.0. Please refer to this issue to make other adjustments.

What can be the maximum input and output lengths for PLBART?

The maximum length is 512.
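Longer inputs therefore need to be shortened before they reach the model. The sketch below shows one simple option, keeping the first 512 tokens. It assumes the limit counts subword tokens and that the sequence is already tokenized; whether special tokens count toward the limit is not stated here, and the helper name truncate is hypothetical.

```python
from typing import List, Sequence

MAX_LENGTH = 512  # maximum input/output length stated above


def truncate(tokens: Sequence[str], max_length: int = MAX_LENGTH) -> List[str]:
    """Keep at most `max_length` tokens from the start of the sequence."""
    return list(tokens[:max_length])
```

Truncating from the front is only one choice; for some tasks it may be preferable to drop overly long examples instead.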

Acknowledgement

PLBART uses Fairseq, codeXglue, and TransCoder and thanks the authors of these works for their contribution.

Citation

@inproceedings{ahmad-etal-2021-unified,
    title = "Unified Pre-training for Program Understanding and Generation",
    author = "Ahmad, Wasi  and
      Chakraborty, Saikat  and
      Ray, Baishakhi  and
      Chang, Kai-Wei",
    booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jun,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.naacl-main.211",
    pages = "2655--2668"
}
