
minnesotanlp / infoverse


Jaehyung Kim et al.'s ACL 2023 paper, "infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information"

License: MIT License

Python 69.99% Shell 0.05% Jupyter Notebook 29.95%
active-learning data-annotation data-centric dpp nlp data-pruning


infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information

This repository provides the datasets, demo, and code for the following paper:

infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information
Jaehyung Kim, Yekyung Kim, Karin de Langis, Jinwoo Shin, Dongyeop Kang
ACL 2023 (main track, long paper)

Installation

The following command installs all necessary packages:

pip install -r requirements.txt

The project was tested using Python 3.7.

Construction of infoVerse

Constructing infoVerse takes two steps. First, train the vanilla classifiers. Then use the trained classifiers to extract the pre-defined meta-information (defined in ./src/scores_src). The constructed infoVerse is released on Google Drive. See also run.sh.

  1. Train the classifiers used for gathering meta-information
python train.py --train_type 0000_base --save_ckpt --epochs 10 --dataset sst2 --seed 1234 --backbone roberta_large
  2. Construct infoVerse
python construct_infoverse.py --train_type 0000_base --seed_list "1234 2345 3456" --epochs 10 --dataset sst2 --seed 1234 --backbone roberta_large
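
The two steps can be chained from a small driver script. The sketch below is not part of the repository. It only reuses the script names and flags from the commands above, and it assumes that each seed passed to --seed_list needs its own training run first (the README example trains only seed 1234). The helper names (train_command, construct_command, build_pipeline, run_pipeline) are hypothetical.

```python
import shlex
import subprocess

# Seeds copied from the --seed_list value in the example command above.
SEEDS = [1234, 2345, 3456]

# Flags shared by both scripts in the example commands above.
COMMON = ["--train_type", "0000_base", "--epochs", "10",
          "--dataset", "sst2", "--backbone", "roberta_large"]


def train_command(seed):
    """Step 1: train one classifier and save its checkpoint."""
    return ["python", "train.py", *COMMON, "--save_ckpt", "--seed", str(seed)]


def construct_command(seeds):
    """Step 2: extract meta-information using the trained classifiers."""
    seed_list = " ".join(str(s) for s in seeds)
    return ["python", "construct_infoverse.py", *COMMON,
            "--seed_list", seed_list, "--seed", str(seeds[0])]


def build_pipeline(seeds=SEEDS):
    # Assumption: one training run per seed precedes construction.
    return [train_command(s) for s in seeds] + [construct_command(seeds)]


def run_pipeline(seeds=SEEDS, dry_run=True):
    for cmd in build_pipeline(seeds):
        # Print a copy-pasteable version of each command.
        print(" ".join(shlex.quote(part) for part in cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing step


if __name__ == "__main__":
    run_pipeline(dry_run=True)
```

With dry_run=True the script only prints the commands, which makes it easy to compare them with run.sh before launching any training.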

In addition, visualize.ipynb can be used to visualize the constructed infoVerse and analyze the given dataset. For example, we provide code that generates an interactive HTML file, as shown in the figure below. Pre-constructed t-SNE and HTML files can be downloaded from Google Drive.

Real-world Application #1: Data Pruning

Please see the directory ./data_pruning.

Real-world Application #2: Active Learning

Please see the directory ./active_learning.

Real-world Application #3: Data Annotation

Please see the directory ./data_annotation.

Citation

If you find this work useful for your research, please cite our paper:

@article{kim2023infoverse,
  title={infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information},
  author={Kim, Jaehyung and Kim, Yekyung and de Langis, Karin and Shin, Jinwoo and Kang, Dongyeop},
  journal={The 61st Annual Meeting of the Association for Computational Linguistics (ACL)},
  year={2023}
}
