xpu_text_classifier: Custom Text Classification on Intel dGPUs

xpu_text_classifier lets you fine-tune transformer models on custom datasets for multi-class or multi-label classification tasks. Supported models include popular transformer architectures such as BERT, BART, and DistilBERT. It uses the Hugging Face Trainer to handle training and Intel Extension for PyTorch to run on Intel dGPUs.

Installation

Before you start, ensure you have PyTorch and Intel Extension for PyTorch installed.

To install xpu_text_classifier:

  1. Clone the transformers_xpu repository from GitHub:

    git clone https://github.com/rahulunair/transformers_xpu.git
    cd transformers_xpu
  2. Install the package:

    python setup.py install
  3. Install the required dependencies:

    pip install datasets scikit-learn
  4. Optionally, install Weights & Biases to monitor your training process:

    pip install wandb

Preparing Your Dataset

The dataset should be in a format compatible with Hugging Face's load_dataset function, such as CSV or JSON. It should have two columns, 'text' and 'label'. For multi-class classification tasks, each label is a single integer. For multi-label classification tasks, each label is a list of integers.

Multi-Class Classification Example:

text            label
This is text 1  0
This is text 2  2
This is text 3  1

Multi-Label Classification Example:

text            label
This is text 1  [0, 1]
This is text 2  [1, 2]
This is text 3  [0, 2]
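
The two label shapes above can be checked before training starts. The following is a minimal sketch and not part of xpu_text_classifier: `check_rows` is a hypothetical helper, it assumes the rows are already loaded as a list of dictionaries with 'text' and 'label' keys, and the task name "multi_label" is an assumption (only "multi_class" appears in this README).

```python
def check_rows(rows, task_type):
    """Return a list of problems found in rows; an empty list means none were found.

    task_type "multi_class" expects one int per label; any other value
    (for example the assumed name "multi_label") expects a list of ints.
    """
    problems = []
    for i, row in enumerate(rows):
        text, label = row.get("text"), row.get("label")
        if not isinstance(text, str) or not text.strip():
            problems.append(f"row {i}: 'text' must be a non-empty string")
        if task_type == "multi_class":
            # bool is a subclass of int in Python, so reject it explicitly
            ok = isinstance(label, int) and not isinstance(label, bool)
        else:
            ok = isinstance(label, list) and all(
                isinstance(x, int) and not isinstance(x, bool) for x in label
            )
        if not ok:
            problems.append(f"row {i}: bad label {label!r} for {task_type}")
    return problems


multi_class_rows = [
    {"text": "This is text 1", "label": 0},
    {"text": "This is text 2", "label": 2},
]
multi_label_rows = [
    {"text": "This is text 1", "label": [0, 1]},
    {"text": "This is text 2", "label": "1,2"},  # wrong: a string, not a list
]

print(check_rows(multi_class_rows, "multi_class"))
print(check_rows(multi_label_rows, "multi_label"))
```

The second call reports one problem, for the row whose label is a string rather than a list of integers.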

After preparing your dataset, save it in a format such as JSON or CSV in a directory. The name of this directory will be used as the dataset_name parameter when using the TextClassifier.
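
As one way to do this, the sketch below writes a multi-class dataset as CSV and a multi-label dataset as JSON Lines using only the standard library. The helper names and the file names `train.csv` and `train.jsonl` are hypothetical; this README does not say which file names TextClassifier looks for, so check classifier.py before relying on them.

```python
import csv
import json
import tempfile
from pathlib import Path


def write_multi_class_csv(rows, dataset_dir, filename="train.csv"):
    """Write rows with an integer label to <dataset_dir>/<filename> as CSV."""
    out_dir = Path(dataset_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / filename
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["text", "label"])
        writer.writeheader()
        writer.writerows(rows)
    return path


def write_multi_label_jsonl(rows, dataset_dir, filename="train.jsonl"):
    """Write rows with list-of-int labels as one JSON object per line."""
    out_dir = Path(dataset_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / filename
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    return path


# A throwaway location for the demo; use your own dataset directory instead.
dataset_dir = Path(tempfile.mkdtemp()) / "my_dataset"

csv_path = write_multi_class_csv(
    [{"text": "This is text 1", "label": 0}, {"text": "This is text 2", "label": 2}],
    dataset_dir,
)
jsonl_path = write_multi_label_jsonl(
    [{"text": "This is text 1", "label": [0, 1]}, {"text": "This is text 2", "label": [1, 2]}],
    dataset_dir,
)
print(csv_path)
print(jsonl_path)
```

JSON keeps the label lists as real lists, whereas a CSV cell would store them as strings that need parsing on load.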

Usage

The script custom_finetune.py in the root directory is your entry point for training a model. By default, it uses the 'distilbert-base-uncased' model and Gutenberg dataset with 30 labels.

You can either edit custom_finetune.py or create a new Python file with the following steps:

Import TextClassifier from the classifier module:

import torch
import intel_extension_for_pytorch  # imported for its side effect: enables the XPU backend

from classifier import TextClassifier

Instantiate the classifier:

classifier = TextClassifier(
    model_name="distilbert-base-uncased",
    dataset_name="path/to/your/dataset_directory",  # use the name of the directory where you saved your dataset
    num_labels=2,
    task_type="multi_class",
)

Start Training:

classifier.train(epochs=10, batch=16, use_bf16=False)

The model name, number of labels (classes), and task type are set when you instantiate TextClassifier. The number of epochs, the batch size, and whether to use BF16 precision are passed to the train function, as shown in custom_finetune.py.
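
If you create your own script, one option is to read these settings from the command line and split them into constructor arguments and train arguments. The sketch below is an assumption about how such a wrapper could look: the flag names, `build_parser`, and `split_args` are hypothetical, and custom_finetune.py itself is not stated to accept any of these flags. Only the keyword names passed to TextClassifier and train are taken from the example above.

```python
import argparse


def build_parser():
    """Command-line flags (hypothetical names) for a fine-tuning script."""
    parser = argparse.ArgumentParser(description="Fine-tune a text classifier")
    parser.add_argument("--model-name", default="distilbert-base-uncased")
    parser.add_argument("--dataset-name", required=True)
    parser.add_argument("--num-labels", type=int, required=True)
    parser.add_argument("--task-type", default="multi_class")
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--batch", type=int, default=16)
    parser.add_argument("--use-bf16", action="store_true")
    return parser


def split_args(args):
    """Split parsed flags into TextClassifier(...) and train(...) keyword arguments."""
    init_kwargs = {
        "model_name": args.model_name,
        "dataset_name": args.dataset_name,
        "num_labels": args.num_labels,
        "task_type": args.task_type,
    }
    train_kwargs = {
        "epochs": args.epochs,
        "batch": args.batch,
        "use_bf16": args.use_bf16,
    }
    return init_kwargs, train_kwargs


# Demo with an explicit argument list; a real script would call parse_args()
# with no arguments so that the flags are read from the command line.
args = build_parser().parse_args(
    ["--dataset-name", "path/to/your/dataset_directory",
     "--num-labels", "2", "--epochs", "3", "--use-bf16"]
)
init_kwargs, train_kwargs = split_args(args)
print(init_kwargs)
print(train_kwargs)

# On a machine with the XPU stack installed, the two dictionaries would be used as:
#   classifier = TextClassifier(**init_kwargs)
#   classifier.train(**train_kwargs)
```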

To train on a single GPU:

python custom_finetune.py

To train using all available GPUs:

export MASTER_ADDR=127.0.0.1
source /home/orange/pytorch_xpu_orange/lib/python3.10/site-packages/oneccl_bindings_for_pytorch/env/setvars.sh
mpirun -n 4 python custom_finetune.py

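Replace 4 with the number of GPUs available in your system, and point the source command at the setvars.sh file of the oneccl_bindings_for_pytorch package in your own Python environment.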

Monitoring GPU Usage

To monitor the GPU usage:

xpu-smi dump -m5,18  # VRAM utilization

Additional Details

The custom_finetune.py script fetches an e-book from Gutenberg and prepares a dataset for the training task. The dataset is stored in the directory specified by dataset_name as a CSV file with two columns: text and label.

Note that transformer models expect the labels to be integers. If your labels are strings, encode them as integers before passing them to the TextClassifier:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
labels = ['cat', 'mat', 'bat', 'cat', 'bat']
# Classes are sorted alphabetically, so bat -> 0, cat -> 1, mat -> 2
encoded_labels = le.fit_transform(labels)  # array([1, 2, 0, 1, 0])
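
LabelEncoder handles the case of one string label per example. For multi-label data, where each example carries a list of string labels, a similar mapping can be built in plain Python. This is a minimal sketch under that assumption; `build_label_map` and `encode_multi_label` are hypothetical helper names, and the alphabetical ordering is chosen to mirror how LabelEncoder orders its classes.

```python
def build_label_map(label_lists):
    """Map every distinct label name to an integer, in alphabetical order."""
    names = sorted({name for labels in label_lists for name in labels})
    return {name: idx for idx, name in enumerate(names)}


def encode_multi_label(label_lists, label_map):
    """Replace each label name with its integer id, keeping the list structure."""
    return [[label_map[name] for name in labels] for labels in label_lists]


raw = [["cat", "mat"], ["bat"], ["bat", "cat"]]
label_map = build_label_map(raw)
encoded = encode_multi_label(raw, label_map)

print(label_map)       # {'bat': 0, 'cat': 1, 'mat': 2}
print(encoded)         # [[1, 2], [0], [0, 1]]
print(len(label_map))  # 3, the value to pass as num_labels
```

Keep the label map alongside the dataset so that predicted integers can be turned back into label names later.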

For more details on the TextClassifier, refer to classifier.py.

Remember to check the script and adjust the parameters (model type, dataset, epochs, batch size, etc.) according to your needs.

Happy fine-tuning!
