Boosting Prompt-Based Self-Training With Mapping-Free Automatic Verbalizer for Multi-Class Classification (EMNLP 2023 Findings)


Mapping-free Automatic Verbalizer (MAV)

Overview

This repository contains the source code for the paper "Boosting Prompt-Based Self-Training With Mapping-Free Automatic Verbalizer for Multi-Class Classification" (EMNLP 2023 Findings).


Overall Structure

MAV
├── docker # A directory for building Docker environments
│   ├── create_container.sh
│   ├── create_image.sh
│   ├── Dockerfile
│   └── requirements.txt
├── tools # A directory for generating train data
│   ├── augmentation_trec.yaml
│   ├── check_dataset.ipynb
│   ├── generate_augmented_data.py
│   └── generate_data.py
├── data # Data directory (e.g. TREC dataset)
│   ├── few-shot
│   │   └── trec
│   │       ├── 12-4-100
│   │       ├── 12-4-13
│   │       ├── 12-4-21
│   │       ├── 12-4-42
│   │       └── 12-4-87
│   └── original
│       └── trec
│           └── preprocess.py
├── src # Code directory
│   ├── augmentation
│   │   ├── aug_utils.py
│   │   ├── functional.py
│   │   ├── operations.py
│   │   └── policy.py
│   ├── dataset.py
│   ├── models.py
│   ├── model_utils.py
│   ├── trainer.py
│   ├── processors.py
│   └── utils.py
├── script # Script files to run training and analytics code
│   ├── analysis_trec.sh
│   └── run_trec.sh
├── run.py # Main code
├── calculate_result.py # Code for aggregating results of 5 seeds
├── analysis.py # Code for further analysis (SHAP, t-SNE)
└── exp_result # A directory for saving experimental results
    ├── mav-full_sup-trec # full supervised
    ├── mav-small_sup-trec # small supervised
    └── mav-ssl-singleaug_mask-trec # semi-supervised

Detailed structure of data directory

A data directory is created for each seed, and the directory name follows the format k-mu-seed, where k is the number of labeled examples per class and mu is the ratio between labeled and unlabeled data.

Each data directory contains the train, unlabeled, dev, and test data in csv format and the augmentation data in npy format.

Below is an example of the data directory structure for seed 13.

12-4-13
├── train.csv
├── dev.csv
├── test.csv
├── unlabeled.csv
├── unlabeled_backtranslation.npy
├── unlabeled_bertaug.npy
├── unlabeled_worddelete.npy
├── unlabeled_worddelete*wordswap.npy
└── unlabeled_wordswap.npy
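The naming convention above can be read back programmatically. The following is a minimal sketch that assumes only the k-mu-seed format described in this section; the names `FewShotSplit` and `parse_split_dir` are hypothetical and are not part of the repository.

```python
from pathlib import Path
from typing import NamedTuple


class FewShotSplit(NamedTuple):
    """Hypothetical container for the fields encoded in a directory name."""
    k: int     # number of labeled examples per class
    mu: int    # ratio between labeled and unlabeled data
    seed: int  # random seed used for sampling


def parse_split_dir(path: str) -> FewShotSplit:
    """Parse a directory whose name has the form k-mu-seed, e.g. '12-4-13'."""
    name = Path(path).name  # only the last path component is inspected
    parts = name.split("-")
    if len(parts) != 3:
        raise ValueError(f"expected 'k-mu-seed', got {name!r}")
    k, mu, seed = (int(p) for p in parts)
    return FewShotSplit(k, mu, seed)


split = parse_split_dir("data/few-shot/trec/12-4-13")
print(split.k, split.mu, split.seed)
```

Only the final path component is parsed, so the hyphen in `few-shot` does not interfere with the split.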

Detailed structure of exp_result directory

All output files from training, inference, and further analysis are stored in the exp_result directory.

mav-ssl-singleaug_mask-trec
├── seed13
├── seed21
│   ├── shap_trec_s21
│   │   ├── label00_shap_bar_131.png
│   │   ├── label01_shap_bar_53.png
│   │   ├── label02_shap_bar_8.png
│   │   ├── label03_shap_bar_58.png
│   │   ├── label04_shap_bar_75.png
│   │   └── label05_shap_bar_111.png
│   ├── tsne_trec_s21
│   │   └── tsne_mask_rep_test.png
│   ├── eval_results_trec.txt
│   ├── test_results_trec.txt
│   ├── data_args.bin
│   ├── model_args.bin
│   ├── training_args.bin
│   ├── pytorch_model.bin
│   ├── merges.txt
│   ├── special_tokens_map.json
│   ├── tokenizer_config.json
│   ├── config.json
│   └── vocab.json
├── seed42
├── seed87
├── seed100
└── total_results.txt
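As a rough illustration of how per-seed results might be combined into a single summary, the sketch below computes the mean and sample standard deviation over seeds. It is not the logic of calculate_result.py: the format of the per-seed result files is not described here, so the scores are passed in directly, and the helper names `seed_dirs` and `aggregate` are hypothetical.

```python
from statistics import mean, stdev

# Seeds that appear in the directory listing above.
SEEDS = [13, 21, 42, 87, 100]


def seed_dirs(exp_dir: str) -> list:
    """Build the per-seed output directory paths, e.g. '<exp_dir>/seed13'."""
    return [f"{exp_dir}/seed{seed}" for seed in SEEDS]


def aggregate(scores: dict) -> tuple:
    """Return (mean, sample standard deviation) of a seed -> score mapping."""
    values = list(scores.values())
    if len(values) < 2:
        raise ValueError("need scores from at least two seeds")
    return mean(values), stdev(values)


# Placeholder numbers for illustration only, not results from the paper.
avg, std = aggregate({13: 1.0, 21: 2.0, 42: 3.0})
print(seed_dirs("exp_result/mav-ssl-singleaug_mask-trec")[0], avg, std)
```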

Requirements

cd docker

bash create_image.sh
bash create_container.sh

Our experimental environment is built on Docker (pytorch/pytorch:1.7.1-cuda11.0-cudnn8-devel image). Detailed dependencies are described in docker/requirements.txt.


How to Get Few-shot Data

0. Download & Preprocessing

The five datasets used in the experiment were downloaded from the sources below and preprocessed in the same way.

The source files for each dataset are stored in the path data/original/{data_name}.
Each dataset is preprocessed into the same format by running data/original/{data_name}/preprocess.py.

1. Sampling Few-shot Data

With the preprocessed data, sampling is performed for a given k, mu, and seed. This sampling is done via tools/generate_fewshot_data.py, setting the arguments as shown below.
The result is stored in the path data/few-shot/{data_name}/{k}-{mu}-{seed}.

python tools/generate_fewshot_data.py --k 16 --mu 4 --task trec --data_dir data/original --output_dir data/few-shot
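To make the roles of k, mu, and seed concrete, here is a minimal sketch of per-class few-shot sampling. It is not the repository's implementation: it assumes that k labeled and k * mu unlabeled examples are drawn per class without overlap, which is one common reading of "ratio between labeled and unlabeled data", and the function name `sample_few_shot` is hypothetical.

```python
import random
from collections import defaultdict


def sample_few_shot(examples, k, mu, seed):
    """Return (labeled, unlabeled) lists sampled per class.

    examples: list of (text, label) pairs.
    Assumption: k labeled and k * mu unlabeled examples are drawn per class
    without overlap; the actual script may sample differently.
    """
    rng = random.Random(seed)  # a local RNG keeps each seed reproducible
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    labeled, unlabeled = [], []
    for label in sorted(by_label):
        pool = by_label[label][:]
        rng.shuffle(pool)
        labeled.extend(pool[:k])                # first k go to the labeled set
        unlabeled.extend(pool[k:k + k * mu])    # next k * mu go to the unlabeled set
    return labeled, unlabeled


# Toy data: two classes with 20 examples each.
toy = [(f"question {i}", i % 2) for i in range(40)]
labeled, unlabeled = sample_few_shot(toy, k=2, mu=4, seed=13)
print(len(labeled), len(unlabeled))
```

Using a dedicated `random.Random(seed)` instance rather than the global generator means that the same seed always yields the same split, which matches the one-directory-per-seed layout.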

2. Preprocessing for Augmentation

Augmented data are generated and stored in advance for the augmentation experiments. Augmentation is defined via tools/augmentation_{data_name}.yaml, and the results are stored as npy files in the path data/few-shot/{data_name}/{k}-{mu}-{seed}. To perform the augmentation, refer to the bash code below. The augmentation pool that can be saved in advance and the actual application key are as follows:

python tools/generate_augmented_data.py --config_dir tools/augmentation_trec.yaml

How to Train

# Train, Inference
bash script/run_trec.sh

# Further analysis (SHAP, t-SNE)
bash script/analysis_trec.sh
