Probing Pretrained Models of Source Code

This project contains the source code to replicate the experiments from the paper "Probing Pretrained Models of Source Code", accepted to the BlackboxNLP Workshop at EMNLP 2022.

Overview

The code structure is as follows:

  • scripts for running experiments
  • src:
    • models: pretrained models
    • struct_probing:
      • code_augs: synthetic changes to code
      • probings: utils for probings
  • CodeAnalysis: directory with processed data and results

Repository Environment

  • git clone https://github.com/serjtroshin/probings4code
  • cd probings4code
  • go to link and download the data
  • unzip CodeAnalysis.zip

The scripts were run on CentOS Linux 7 with Python 3.9.2. Create a conda environment for the project and install the requirements:

  • conda create -n probings4code python=3.9.12
  • conda activate probings4code
  • conda install pytorch==1.11.0 torchvision==0.12.0 cudatoolkit=11.6 numpy=1.22.3 -c pytorch -y
  • pip install -r requirements.txt

Install the tree-sitter parsers for Python and Java by running:

  • bash build.sh

Data preprocessing

To prepare the data for the tasks and create the train/test splits, run:

  • bash generate_synt_data.sh

The script writes an all.json file, together with the train/test splits train.json and test.json, to each of the following subfolders of the CodeAnalysis directory: identity, undeclared, dfg, identname, varmisuse, readability, algo.
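After preprocessing, it can help to confirm that every task folder received its files. The following is a minimal sketch, not part of the repository: the helper name missing_outputs is hypothetical, and it assumes the task subfolders sit directly under CodeAnalysis with the three file names listed above.

```python
from pathlib import Path

# Task subfolders and file names as listed in this README.
TASKS = ["identity", "undeclared", "dfg", "identname", "varmisuse", "readability", "algo"]
EXPECTED_FILES = ["all.json", "train.json", "test.json"]


def missing_outputs(root="CodeAnalysis"):
    """Return the expected preprocessing outputs that are not present under root.

    Assumes the layout <root>/<task>/<file>, which is an assumption of this sketch.
    """
    root = Path(root)
    return [
        root / task / name
        for task in TASKS
        for name in EXPECTED_FILES
        if not (root / task / name).is_file()
    ]


if __name__ == "__main__":
    missing = missing_outputs()
    if missing:
        print("Missing files:")
        for path in missing:
            print(f"  {path}")
    else:
        print("All expected files are present.")
```

Run it from the repository root after generate_synt_data.sh finishes; an empty result means all 21 files were found.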

To prepare the data for the ablation study (Appendix), run:

  • bash prepare_ablation.sh

Downloading models

The src/models directory contains a folder for each pretrained model.

To download the BERT, CodeBERT, GraphCodeBERT, CodeT5, and GPT Hugging Face checkpoints and tokenizers, run:

  • bash download_models.sh

To run the experiments with PLBART models, download the pretrained plbart_base and plbart_large checkpoints from the official PLBART repository and put them in the src/models/plbart folder. Finetuned checkpoints are also available in the official PLBART repository; they are needed to reproduce the experiments comparing finetuned models (Figure 5). Use src/models/available_models.py to provide the relevant paths to the checkpoints.
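A quick way to check that the PLBART checkpoints landed in the expected folder is sketched below. It is not part of the repository: the helper name find_checkpoints is hypothetical, and because this README does not state the checkpoint file extensions, entries are matched by name prefix only.

```python
from pathlib import Path

# Checkpoint names mentioned in this README; file extensions are not stated,
# so entries are matched by prefix (an assumption of this sketch).
CHECKPOINT_PREFIXES = ["plbart_base", "plbart_large"]


def find_checkpoints(folder="src/models/plbart"):
    """Map each checkpoint prefix to the matching entry names found in folder."""
    folder = Path(folder)
    entries = sorted(p.name for p in folder.iterdir()) if folder.is_dir() else []
    return {
        prefix: [name for name in entries if name.startswith(prefix)]
        for prefix in CHECKPOINT_PREFIXES
    }


if __name__ == "__main__":
    for prefix, matches in find_checkpoints().items():
        status = ", ".join(matches) if matches else "not found"
        print(f"{prefix}: {status}")
```

The paths themselves still have to be registered in src/models/available_models.py as described above; this check only confirms that the files are on disk.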

Saving embedding to disk

To save embeddings from all layers for all tasks, run:

  • bash save_embeddings.sh

The script saves the embeddings to data_all.pkz in the CodeAnalysis subfolders.

  • NOTE: saving all embeddings requires up to 10TB of disk space.
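Given the storage requirement, it may be worth checking free disk space before starting. The sketch below uses only the Python standard library; the helper name has_enough_space is hypothetical, and it treats 10TB as 10 decimal terabytes, which is an assumption since the note does not specify the unit.

```python
import shutil

TB = 10**12  # decimal terabyte; whether TB or TiB is meant is not stated in the README


def has_enough_space(path=".", required_bytes=10 * TB):
    """Return True if the filesystem holding path has at least required_bytes free."""
    return shutil.disk_usage(path).free >= required_bytes


if __name__ == "__main__":
    free_tb = shutil.disk_usage(".").free / TB
    print(f"Free space: {free_tb:.2f} TB")
    if not has_enough_space():
        print("Less than 10 TB free; saving all embeddings may not fit.")
```

Pass the path of the volume that holds the CodeAnalysis directory if it differs from the current working directory.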

Running probing experiments

The run_parallel.py script runs the probing experiments for all models on all tasks and saves the results in CSV format, one file per model-probing pair, in the CodeAnalysis directory.

To replicate the experiments with the linear probing model (Figures 3 and 4), run:

  • python3 run_parallel.py

To run the probing experiments with a 3-layer MLP, run:

  • python3 run_parallel.py --probing_model mlp

To run the experiments for the ablation study (Appendix), run:

  • bash run_ablation.sh

Note that you can pass the --model <model_name> and --probing <probing_task_name> flags to run_parallel.py to run a particular model on a particular task.
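When sweeping over several model and task combinations, the command line can be assembled programmatically. This is a minimal sketch that uses only the flags named in this README (--model, --probing, --probing_model); the helper build_command is hypothetical, and the values "codebert" and "varmisuse" in the example are placeholders, since the exact names accepted by run_parallel.py are not listed here.

```python
import shlex
import sys


def build_command(model=None, probing=None, probing_model=None):
    """Build the argument list for run_parallel.py from the flags named in this README.

    Flags left as None are omitted, so the script falls back to its own defaults.
    """
    cmd = [sys.executable, "run_parallel.py"]
    if model is not None:
        cmd += ["--model", model]
    if probing is not None:
        cmd += ["--probing", probing]
    if probing_model is not None:
        cmd += ["--probing_model", probing_model]
    return cmd


if __name__ == "__main__":
    # Placeholder values: check the repository for the accepted model and task names.
    cmd = build_command(model="codebert", probing="varmisuse", probing_model="mlp")
    print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) from the repository root to launch the run.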

Acknowledgements

We use the following projects in our work:

Citation

If you found this code useful, please cite our work:

@misc{https://doi.org/10.48550/arxiv.2202.08975,
    doi = {10.48550/ARXIV.2202.08975},
    url = {https://arxiv.org/abs/2202.08975},
    author = {Troshin, Sergey and Chirkova, Nadezhda},
    keywords = {Software Engineering (cs.SE), Computation and Language (cs.CL), Machine Learning (cs.LG), FOS: Computer and information sciences, FOS: Computer and information sciences},
    title = {Probing Pretrained Models of Source Code},
    publisher = {arXiv},
    year = {2022},
    copyright = {arXiv.org perpetual, non-exclusive license}
}
