Renee: End-to-end training of extreme classification models

Official PyTorch implementation for the paper: "Renee: End-to-end training of extreme classification models" accepted at MLSys 2023.

👉 You can find the camera-ready paper here.

Abstract

The goal of Extreme Multi-label Classification (XC) is to learn representations that enable mapping input texts to the most relevant subset of labels selected from an extremely large label set, potentially numbering in the hundreds of millions.

We identify challenges in the end-to-end training of XC models and devise novel optimizations that improve training speed by over an order of magnitude, making end-to-end XC model training practical. Renee delivers state-of-the-art accuracy on a wide variety of XC benchmark datasets.

Requirements

Run the commands below to create a new conda environment with all the dependencies required to run Renee.

bash install1.sh
conda activate renee
bash install2.sh

Data Preparation

You can download the datasets from the XML repo.

A dataset folder should have the following directory structure, shown here for the LF-AmazonTitles-131K dataset:

๐Ÿ“ LF-AmazonTitles-131K/
    ๐Ÿ“„ trn_X_Y.txt # contains mappings from train IDs to label IDs
    ๐Ÿ“„ trn_filter_labels.txt # this contains train reciprocal pairs to be ignored in evaluation
    ๐Ÿ“„ tst_X_Y.txt # contains mappings from test IDs to label IDs
    ๐Ÿ“„ tst_filter_labels.txt # this contains test reciprocal pairs to be ignored in evaluation
    ๐Ÿ“„ trn_X.txt # each line contains the raw input train text, this needs to be tokenized
    ๐Ÿ“„ tst_X.txt # each line contains the raw input test text, this needs to be tokenized
    ๐Ÿ“„ Y.txt # each line contains the raw label text, this needs to be tokenized
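
Before tokenizing, it can help to confirm that a downloaded folder matches this layout. The following is a minimal sketch, not part of the Renee codebase: `EXPECTED_FILES` and `missing_dataset_files` are hypothetical names, and the file list is copied from the listing above.

```python
from pathlib import Path

# File names copied from the directory listing above.
EXPECTED_FILES = [
    "trn_X_Y.txt",
    "trn_filter_labels.txt",
    "tst_X_Y.txt",
    "tst_filter_labels.txt",
    "trn_X.txt",
    "tst_X.txt",
    "Y.txt",
]


def missing_dataset_files(data_dir):
    """Return the expected file names that are absent from data_dir.

    Hypothetical helper for illustration; it only checks that the files
    exist, not that their contents are well-formed.
    """
    root = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Example: report anything missing from the example dataset folder.
missing = missing_dataset_files("xc/Datasets/LF-AmazonTitles-131K")
print("Missing files:", ", ".join(missing) if missing else "none")
```

Change the path passed to `missing_dataset_files` to point at your own dataset folder.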

To tokenize the raw train, test and label texts, we can use the following command (change the path of the dataset folder accordingly):

python -W ignore -u utils/CreateTokenizedFiles.py \
--data-dir xc/Datasets/LF-AmazonTitles-131K \
--max-length 32 \
--tokenizer-type bert-base-uncased \
--tokenize-label-texts

To create a dataset having label-text augmentation, we can use the following command:

python utils/CreateAugData.py \
--data-dir xc/Datasets/LF-AmazonTitles-131K \
--tokenization-folder bert-base-uncased-32 \
--max-len 32

The above command will create a folder named xc/Datasets/LF-AmazonTitles-131K-Aug. Refer to this dataset directory in the training script to train with label-text augmentation.
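
Since training with augmentation depends on this folder existing, a small guard can catch a skipped step early. This is a sketch under an assumption: it takes the "-Aug" suffix seen in the example above to be the general naming rule, which this README only confirms for LF-AmazonTitles-131K. Both function names are hypothetical.

```python
from pathlib import Path


def augmented_dir(data_dir):
    """Return the sibling folder carrying the "-Aug" suffix.

    Assumption: the suffix rule is inferred from the single example
    in this README (LF-AmazonTitles-131K -> LF-AmazonTitles-131K-Aug).
    """
    path = Path(data_dir)
    return path.with_name(path.name + "-Aug")


def require_augmented_dir(data_dir):
    """Return the augmented folder, or raise if it has not been created yet."""
    aug = augmented_dir(data_dir)
    if not aug.is_dir():
        raise FileNotFoundError(f"{aug} not found; run utils/CreateAugData.py first")
    return aug


print(augmented_dir("xc/Datasets/LF-AmazonTitles-131K").as_posix())
```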

Training

To train Renee on the LF-AmazonTitles-131K dataset using label-text augmentation, use the following command. Make sure you modify the data-dir, use-ngame-encoder, and expname arguments accordingly. You also need to generate the label-text augmentation dataset folder first (see the Data Preparation section).

python main.py \
--epochs 100 \
--batch-size 32 \
--lr1 0.05 \
--lr2 1e-5 \
--warmup 5000 \
--data-dir xc/Datasets/LF-AmazonTitles-131K-Aug \
--maxlen 32 \
--tf sentence-transformers/msmarco-distilbert-base-v4 \
--dropout 0.85 \
--pre-tok \
--wd1 1e-4 \
--noloss \
--fp16xfc \
--use-ngame-encoder xc/ngame_pretrained_models/LF-AmazonTitles-131K/state_dict.pt \
--expname lfat-131k-aug-1.0
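
If you launch several runs that differ only in the three arguments mentioned above, it may be convenient to assemble the command programmatically. The sketch below is not part of the repository: `build_train_command` and `launch` are hypothetical helpers, and every flag and value is copied verbatim from the command above.

```python
import shlex
import subprocess
import sys


def build_train_command(data_dir, ngame_encoder, expname):
    """Build the training command as an argument list (hypothetical helper).

    Only data_dir, ngame_encoder and expname vary; all other flags and
    values are copied from the example command in this README.
    """
    return [
        sys.executable, "main.py",
        "--epochs", "100",
        "--batch-size", "32",
        "--lr1", "0.05",
        "--lr2", "1e-5",
        "--warmup", "5000",
        "--data-dir", data_dir,
        "--maxlen", "32",
        "--tf", "sentence-transformers/msmarco-distilbert-base-v4",
        "--dropout", "0.85",
        "--pre-tok",
        "--wd1", "1e-4",
        "--noloss",
        "--fp16xfc",
        "--use-ngame-encoder", ngame_encoder,
        "--expname", expname,
    ]


def launch(cmd):
    """Run the command from the repository root; raises if training exits non-zero."""
    subprocess.run(cmd, check=True)


cmd = build_train_command(
    "xc/Datasets/LF-AmazonTitles-131K-Aug",
    "xc/ngame_pretrained_models/LF-AmazonTitles-131K/state_dict.pt",
    "lfat-131k-aug-1.0",
)
# Print the command for inspection; call launch(cmd) to start training.
print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting issues if a path contains spaces.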

To change hyperparameters, refer to the arguments defined in main.py, or run python main.py --help to list all the arguments.

Training commands for other datasets are provided in scripts/train_commands.md.

License

This project is licensed under the Microsoft Research License.

Citation

If you find our work/code useful in your research, please cite the following:

@article{renee_2023,
  title={Renee: End-to-end training of extreme classification models},
  author={Jain, Vidit and Prakash, Jatin and Saini, Deepak and Jiao, Jian and Ramjee, Ramachandran and Varma, Manik},
  journal={Proceedings of Machine Learning and Systems},
  year={2023}
}
