parasj / contracode

Contrastive Code Representation Learning: functionality-based JavaScript embeddings through self-supervised learning

Home Page: https://parasj.github.io/contracode/

License: Apache License 2.0

deep-learning contrastive-learning momentum-contrast compiler programming-language machine-learning pytorch

contracode's Introduction

Contrastive Code Representation Learning

By Paras Jain, Ajay Jain, Tianjun Zhang, Pieter Abbeel, Joseph E. Gonzalez and Ion Stoica (website)

Learning functionality-based representations of programs

Machine-aided programming tools such as type predictors and code summarizers are increasingly learning-based. However, most code representation learning approaches rely on supervised learning with task-specific annotated datasets. We propose Contrastive Code Representation Learning (ContraCode), a self-supervised algorithm for learning task-agnostic semantic representations of programs via contrastive learning.

Our approach uses no human-provided labels, relying only on the raw text of programs. In particular, we design an unsupervised pretext task by generating textually divergent copies of source functions via automated source-to-source compiler transforms that preserve semantics. We train a neural model to identify variants of an anchor program within a large batch of negatives. To solve this task, the network must extract program features representing the functionality, not form, of the program. To our knowledge, this is the first application of instance discrimination to code representation learning. We pre-train ContraCode over 1.8M unannotated JavaScript methods mined from GitHub. ContraCode pre-training improves code summarization accuracy by 7.9% over supervised approaches and 4.8% over BERT pre-training. Moreover, our approach is agnostic to model architecture; for a type prediction task, contrastive pre-training consistently improves the accuracy of existing baselines.
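
For intuition, here is a minimal PyTorch sketch of the instance-discrimination objective described above, using an InfoNCE-style loss with in-batch negatives (the actual implementation uses MoCo, i.e. a momentum encoder and a queue of negatives; the function and tensor names below are illustrative assumptions, not the repository's API):

import torch
import torch.nn.functional as F

def instance_discrimination_loss(anchor_emb, variant_emb, temperature=0.07):
    # anchor_emb, variant_emb: (batch, dim) embeddings of a batch of programs
    # and of their compiler-transformed variants, in matching order.
    q = F.normalize(anchor_emb, dim=1)
    k = F.normalize(variant_emb, dim=1)
    logits = q @ k.t() / temperature         # pairwise cosine similarities
    targets = torch.arange(q.size(0))        # each anchor's positive lies on the diagonal
    return F.cross_entropy(logits, targets)  # other programs in the batch act as negatives

# Toy usage with random embeddings for 8 programs:
loss = instance_discrimination_loss(torch.randn(8, 128), torch.randn(8, 128))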

This repository contains code to augment JavaScript programs with code transformations, to pre-train LSTM and Transformer models with ContraCode, and to finetune the models on downstream tasks.

Installation

Dependencies: Python 3.7, NodeJS, NPM

$ npm install
$ pip install -e "."
$ python scripts/download_data.py

Data and checkpoints

Download the data subfolder from this Google Drive link and place it at the root of the repository. This folder contains training and evaluation data, vocabularies, and model checkpoints.

Pretraining models with ContraCode

Pretrain a Bidirectional LSTM with ContraCode (port 10001 must be available; change it if the port is in use):

python representjs/pretrain_distributed.py pretrain_lstm2l_hidden \
  --num_epochs=200 --batch_size=512 --lr=1e-4 --num_workers=4 \
  --subword_regularization_alpha 0.1 --program_mode contrastive --label_mode contrastive --save_every 5000 \
  --train_filepath=data/codesearchnet_javascript/javascript_augmented.pickle.gz \
  --spm_filepath=data/codesearchnet_javascript/csnjs_8k_9995p_unigram_url.model \
  --min_alternatives 2 --dist_url tcp://localhost:10001 --rank 0 \
  --encoder_type lstm --lstm_project_mode hidden --n_encoder_layers 2
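
If you are unsure whether port 10001 is free, here is a quick check (a minimal sketch using Python's standard socket module; any free port can be substituted into --dist_url):

import socket

def port_is_free(port, host="localhost"):
    # connect_ex returns 0 only if something is already listening on host:port
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex((host, port)) != 0

print(port_is_free(10001))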

Pretrain Transformer with ContraCode:

python representjs/pretrain_distributed.py pretrain_transformer \
  --num_epochs=200 --batch_size=96 --lr=1e-4 --num_workers=6 \
  --subword_regularization_alpha 0.1 --program_mode contrastive --label_mode contrastive --save_every 5000 \
  --train_filepath=/dev/shm/codesearchnet_javascript/javascript_augmented.pickle.gz \
  --spm_filepath=/dev/shm/codesearchnet_javascript/csnjs_8k_9995p_unigram_url.model \
  --min_alternatives 1 --dist_url tcp://localhost:10001 --rank 0

Pretrain Transformer with hybrid MLM + ContraCode objective:

python representjs/pretrain_distributed.py pretrain_transformer_hybrid \
  --num_epochs=200 --batch_size=96 --lr=4e-4 --num_workers=8 \
  --subword_regularization_alpha 0. --program_mode contrastive --loss_mode hybrid --save_every 5000 \
  --train_filepath=data/codesearchnet_javascript/javascript_augmented.pickle.gz \
  --spm_filepath=data/codesearchnet_javascript/csnjs_8k_9995p_unigram_url.model \
  --min_alternatives 1 --dist_url "tcp://localhost:10001" --rank 0

Finetuning and evaluating on downstream type prediction task

Commands to reproduce key type prediction results are provided below. If you are using the released pretraining checkpoints from the Google Drive folder, these commands should work without modification. However, if you pretrained a model from scratch, you will need to update the --resume_path argument.

Checkpoint paths if you pre-trained a model from scratch:
  • data/ft/ckpt_lstm_ft_types.pth becomes data/runs/types_contracode/ckpt_best.pth
  • data/pretrain/ckpt_transformer_ft_types.pth becomes data/runs/types_contracode_transformer/ckpt_best.pth
  • data/ft/ckpt_transformer_hybrid_ft_types.pth becomes data/runs/types_hybrid_transformer/ckpt_best.pth
  • data/ft/ckpt_transformer_ft_names.pth becomes data/runs/names_ft/ckpt_best.pth

Type prediction with an LSTM (pretrained with ContraCode)

Evaluate our finetuned Bidirectional LSTM (Table 2, DeepTyper with ContraCode pre-training):

python representjs/type_prediction.py eval \
  --eval_filepath data/types/test_projects_gold_filtered.json \
  --type_vocab_filepath data/types/target_wl \
  --spm_filepath data/codesearchnet_javascript/csnjs_8k_9995p_unigram_url.model \
  --num_workers 4 --batch_size 1 --max_seq_len -1 \
  --no_output_attention True --encoder_type lstm --n_encoder_layers 2 \
  --resume_path data/ft/ckpt_lstm_ft_types.pth

Finetune Bidirectional LSTM pretrained with ContraCode:

python representjs/type_prediction.py train --run_name types_contracode \
  --train_filepath data/types/train_nounk.txt --eval_filepath data/types/valid_nounk.txt \
  --type_vocab_filepath data/types/target_wl \
  --spm_filepath data/codesearchnet_javascript/csnjs_8k_9995p_unigram_url.model \
  --num_workers 4 --batch_size 16 --max_seq_len 2048 --max_eval_seq_len 2048 --lr 1e-3 \
  --no_output_attention True --encoder_type lstm --n_encoder_layers 2 --warmup_steps 10000 \
  --pretrain_resume_path data/pretrain/ckpt_lstm_pretrain_20k.pth \
  --pretrain_resume_encoder_name encoder_q
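
As a rough illustration of what the --pretrain_resume_path and --pretrain_resume_encoder_name encoder_q flags do, the sketch below restores query-encoder weights from a MoCo-style pretraining checkpoint into a downstream model. The key layout (a model_state_dict whose encoder keys are prefixed with encoder_q.) and the model.encoder attribute are assumptions for illustration, not a guarantee of the repository's exact checkpoint format:

import torch

def load_pretrained_encoder(model, ckpt_path, encoder_name="encoder_q"):
    # Assumed layout: the checkpoint stores a state dict (possibly nested under
    # "model_state_dict") whose query-encoder keys start with "encoder_q.".
    ckpt = torch.load(ckpt_path, map_location="cpu")
    state = ckpt.get("model_state_dict", ckpt)
    prefix = encoder_name + "."
    encoder_state = {k[len(prefix):]: v for k, v in state.items() if k.startswith(prefix)}
    model.encoder.load_state_dict(encoder_state, strict=False)  # assumes model.encoder exists
    return model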

Type prediction with a Transformer (pretrained with ContraCode)

Evaluate our finetuned Transformer (Table 2, Transformer with ContraCode pre-training):

python representjs/type_prediction.py eval \
  --eval_filepath data/types/test_projects_gold_filtered.json \
  --type_vocab_filepath data/types/target_wl \
  --spm_filepath data/codesearchnet_javascript/csnjs_8k_9995p_unigram_url.model \
  --num_workers 4 --batch_size 1 --max_seq_len -1 \
  --resume_path data/pretrain/ckpt_transformer_ft_types.pth

Finetune Transformer pretrained with ContraCode:

python representjs/type_prediction.py train --run_name types_contracode_transformer \
  --train_filepath data/types/train_nounk.txt --eval_filepath data/types/valid_nounk.txt \
  --type_vocab_filepath data/types/target_wl \
  --spm_filepath data/codesearchnet_javascript/csnjs_8k_9995p_unigram_url.model \
  --num_workers 4 --batch_size 16 --max_seq_len 2048 --max_eval_seq_len 2048 \
  --pretrain_resume_path data/pretrain/ckpt_transformer_pretrain_240k.pth \
  --pretrain_resume_encoder_name encoder_q --lr 1e-4

Type prediction with a hybrid Transformer (pretraining with both MLM and ContraCode)

Evaluate our finetuned hybrid Transformer (Table 2, Transformer (RoBERTa MLM pre-training) with ContraCode pre-training):

python representjs/type_prediction.py eval \
  --eval_filepath data/types/test_projects_gold_filtered.json \
  --type_vocab_filepath data/types/target_wl \
  --spm_filepath data/codesearchnet_javascript/csnjs_8k_9995p_unigram_url.model \
  --num_workers 4 --batch_size 1 --max_seq_len -1 \
  --resume_path data/ft/ckpt_transformer_hybrid_ft_types.pth

Finetune Transformer after hybrid pretraining:

python representjs/type_prediction.py train --run_name types_hybrid_transformer \
  --train_filepath data/types/train_nounk.txt --eval_filepath data/types/valid_nounk.txt \
  --type_vocab_filepath data/types/target_wl \
  --spm_filepath data/codesearchnet_javascript/csnjs_8k_9995p_unigram_url.model \
  --num_workers 4 --batch_size 16 --max_seq_len 2048 --max_eval_seq_len 2048 \
  --pretrain_resume_path data/pretrain/ckpt_transformer_hybrid_pretrain_240k.pth \
  --pretrain_resume_encoder_name encoder_q --lr 1e-4

Finetuning and evaluating on downstream method naming task

Evaluate (Table 3, Transformer + ContraCode + augmentation):

python representjs/main.py test --batch_size 64 --num_workers 8 --n_decoder_layers 4 \
  --checkpoint_file data/ft/ckpt_transformer_ft_names.pth \
  --test_filepath data/codesearchnet_javascript/javascript_test_0.jsonl.gz \
  --spm_filepath data/codesearchnet_javascript/csnjs_8k_9995p_unigram_url.model

Finetune:

python representjs/main.py train --run_name names_ft \
  --program_mode identity --label_mode identifier --n_decoder_layers=4 --subword_regularization_alpha 0 \
  --num_epochs 100 --save_every 5 --batch_size 32 --num_workers 4 --lr 1e-4 \
  --train_filepath data/codesearchnet_javascript/javascript_train_supervised.jsonl.gz \
  --eval_filepath data/codesearchnet_javascript/javascript_valid_0.jsonl.gz \
  --resume_path data/pretrain/ckpt_transformer_pretrain_20k.pth

Citation

If you find this code or our paper relevant to your work, please cite our arXiv paper:

@article{jain2020contrastive,
  title={Contrastive Code Representation Learning},
  author={Paras Jain and Ajay Jain and Tianjun Zhang
  and Pieter Abbeel and Joseph E. Gonzalez and Ion Stoica},
  year={2020},
  journal={arXiv preprint}
}

contracode's People

Contributors

ajayjain, parasj, tianjunz


contracode's Issues

data.zip

The data.zip file downloaded from the cloud drive fails to decompress. Is there any solution?

Proper PyTorch version

First of all, thanks for sharing the impressive work!

We were trying to get finetuning on a downstream task working, but hit the following issue:

torch.nn.modules.module.ModuleAttributeError: 'DataParallel' object has no attribute 'encoder'

It's quite likely that we were using an incorrect version of PyTorch (or perhaps of other dependencies).

Would you kindly share the dependency versions required to run the project?

Cheers

Cannot obtain the checkpoint

I followed the README instruction:

Download the data subfolder from [this Google Drive link](https://drive.google.com/drive/folders/153pZfKPcr1-l8VaDPys29b1ElGLuoq3M?usp=sharing) and place at the root of the repository. This folder contains training and evaluation data, vocabularies and model checkpoints.

However, I cannot find the checkpoints, only data.zip.

How can I obtain the checkpoint?

Thanks

Memory requirements for ContraCode

Hi @parasj, thanks for publishing your code. Could you share the specs of the machines used in your experiments? I found that 16 GB of RAM is not enough when using the javascript_augmented.pickle.gz file.

Originally posted by @QZH-eng in #6 (comment)

How to generate the augmented JS file

Hi @parasj, thanks so much for your great work and the released code. I notice that in the pre-training step the model is trained on the javascript_augmented.pickle.gz file. Could we generate such an augmented file for our own JS code? If so, how? Thanks for your response and guidance, best regards.

Memory explosion when pretraining the Bidirectional LSTM

Hi,

Thanks for the wonderful work. May I ask a question: when I pretrain the LSTM model with the default settings, memory overflows. My server has 180 GB of RAM, so how much RAM is needed for pretraining?

Thanks and best regards.

Python functions extension

@parasj Is your code applicable to Python functions? More precisely, can the automated source-to-source compiler transformations be used for Python in addition to JavaScript?

Help with the code clone dataset

Great work! I need some help with your code clone dataset. I downloaded it via scripts/download_data.py in your repo (codeclone/full_data.json.gz), but I do not know whether it is the dataset used in "4.1 Evaluating Functionality and Robustness: Zero-shot Code Clone Detection" in your paper. I see the split function in representjs/clone_detection.py, so I'm confused. Also, are the 2065 pairs mentioned in Section 4.1 from the same dataset, and how can I obtain them? If you could spare a little time to help me figure this out, I would greatly appreciate it.

What kind of GPU environment did you use to train the model?

I tried to run both pretraining and fine-tuning on a single RTX 2080 Ti, but it takes a lot of time. What kind of training environment did you use? I would appreciate it if you could tell me the specs and number of GPUs you used for model pretraining and fine-tuning.
