MASSIVE

πŸ‘‰ Join the MMNLU-22 slack workspace here πŸ‘ˆ

News

  • Nov 28: We are pleased to announce the release of MASSIVE 1.1, which includes Catalan data. For instructions on using the dataset, please see below. Data for languages other than Catalan are unchanged versus MASSIVE 1.0, and the leaderboards on eval.ai still use MASSIVE 1.0 (for now, at least). We hope that you will leverage the new Catalan data for your work!
  • Please join us at the Massively Multilingual NLU 2022 workshop, co-located with EMNLP, on Dec 7th. Registration details are here.

Introduction

MASSIVE is a parallel dataset of > 1M utterances across 52 languages with annotations for the Natural Language Understanding tasks of intent prediction and slot annotation. Utterances span 60 intents and include 55 slot types. MASSIVE was created by localizing the SLURP dataset, composed of general Intelligent Voice Assistant single-shot interactions.

Accessing and Processing the Data

MASSIVE 1.0, the dataset used in the paper, can be downloaded here. MASSIVE 1.1, which includes Catalan in addition to the 51 languages of MASSIVE 1.0, can be downloaded here.

The unlabeled MMNLU-22 eval data can be downloaded here.

$ curl https://amazon-massive-nlu-dataset.s3.amazonaws.com/amazon-massive-dataset-1.0.tar.gz --output amazon-massive-dataset-1.0.tar.gz
$ tar -xzvf amazon-massive-dataset-1.0.tar.gz
$ tree 1.0
1.0
β”œβ”€β”€ LICENSE
└── data
    β”œβ”€β”€ af-ZA.jsonl
    β”œβ”€β”€ am-ET.jsonl
    β”œβ”€β”€ ar-SA.jsonl
    ...

The dataset is organized into files of JSON lines. Each locale (according to ISO-639-1 and ISO-3166 conventions) has its own file containing all dataset partitions. An example JSON line for de-DE is shown below:

{
  "id": "0",
  "locale": "de-DE",
  "partition": "test",
  "scenario": "alarm",
  "intent": "alarm_set",
  "utt": "weck mich diese woche um fΓΌnf uhr morgens auf",
  "annot_utt": "weck mich [date : diese woche] um [time : fΓΌnf uhr morgens] auf",
  "worker_id": "8",
  "slot_method": [
    {
      "slot": "time",
      "method": "translation"
    },
    {
      "slot": "date",
      "method": "translation"
    }
  ],
  "judgments": [
    {
      "worker_id": "32",
      "intent_score": 1,
      "slots_score": 0,
      "grammar_score": 4,
      "spelling_score": 2,
      "language_identification": "target"
    },
    {
      "worker_id": "8",
      "intent_score": 1,
      "slots_score": 1,
      "grammar_score": 4,
      "spelling_score": 2,
      "language_identification": "target"
    },
    {
      "worker_id": "28",
      "intent_score": 1,
      "slots_score": 1,
      "grammar_score": 4,
      "spelling_score": 2,
      "language_identification": "target"
    }
  ]
}

id: maps to the original ID in the SLURP collection. The corresponding SLURP en-US utterance served as the basis for this localization.

locale: is the language and country code according to ISO-639-1 and ISO-3166.

partition: is either train, dev, or test, according to the original split in SLURP.

scenario: is the general domain, aka "scenario" in SLURP terminology, of an utterance.

intent: is the specific intent of an utterance within a domain, formatted as {scenario}_{intent}.

utt: the raw utterance text without annotations.

annot_utt: the text from utt with slot annotations formatted as [{label} : {entity}].

worker_id: The obfuscated worker ID from MTurk of the worker completing the localization of the utterance. Worker IDs are specific to a locale and do not map across locales.

slot_method: for each slot in the utterance, whether that slot was a translation (i.e., same expression just in the target language), localization (i.e., not the same expression but a different expression was chosen more suitable to the phrase in that locale), or unchanged (i.e., the original en-US slot value was copied over without modification).

judgments: Each judgment collected for the localized utterance has 6 keys. worker_id is the obfuscated worker ID from MTurk of the worker completing the judgment. Worker IDs are specific to a locale and do not map across locales, but are consistent across the localization tasks and the judgment tasks, e.g., judgment worker ID 32 in the example above may appear as the localization worker ID for the localization of a different de-DE utterance, in which case it would be the same worker.

intent_score : "Does the sentence match the intent?"
  0: No
  1: Yes
  2: It is a reasonable interpretation of the goal

slots_score : "Do all these terms match the categories in square brackets?"
  0: No
  1: Yes
  2: There are no words in square brackets (utterance without a slot)

grammar_score : "Read the sentence out loud. Ignore any spelling, punctuation, or capitalization errors. Does it sound natural?"
  0: Completely unnatural (nonsensical, cannot be understood at all)
  1: Severe errors (the meaning cannot be understood and doesn't sound natural in your language)
  2: Some errors (the meaning can be understood but it doesn't sound natural in your language)
  3: Good enough (easily understood and sounds almost natural in your language)
  4: Perfect (sounds natural in your language)

spelling_score : "Are all words spelled correctly? Ignore any spelling variances that may be due to differences in dialect. Missing spaces should be marked as a spelling error."
  0: There are more than 2 spelling errors
  1: There are 1-2 spelling errors
  2: All words are spelled correctly

language_identification : "The following sentence contains words in the following languages (check all that apply)"
  1: target
  2: english
  3: other
  4: target & english
  5: target & other
  6: english & other
  7: target & english & other

Note that the en-US JSON lines will not have the slot_method or judgments keys, as no localization was performed. The worker_id key in the en-US file corresponds to the worker ID from SLURP.

{
  "id": "0",
  "locale": "en-US",
  "partition": "test",
  "scenario": "alarm",
  "intent": "alarm_set",
  "utt": "wake me up at five am this week",
  "annot_utt": "wake me up at [time : five am] [date : this week]",
  "worker_id": "1"
}
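
For quick inspection of the raw files, the following sketch (illustrative only; the provided scripts handle data preparation for you) groups the JSON lines of a locale file by partition and pulls (slot, entity) pairs out of the [{label} : {entity}] markup in annot_utt. The file path assumes the extracted 1.0 archive layout shown earlier.

import json
import re
from collections import defaultdict

SLOT_PATTERN = re.compile(r"\[(.+?) : (.+?)\]")

def load_partitions(path):
    """Group examples by partition and parse slot annotations from annot_utt."""
    partitions = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            example = json.loads(line)
            example["slots"] = SLOT_PATTERN.findall(example["annot_utt"])
            partitions[example["partition"]].append(example)
    return partitions

parts = load_partitions("1.0/data/en-US.jsonl")
print({name: len(rows) for name, rows in parts.items()})
print(parts["test"][0]["slots"])  # e.g. [('time', 'five am'), ('date', 'this week')]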

Preparing the Data in datasets format (Apache Arrow)

The data can be prepared in the datasets Apache Arrow format using our script:

python scripts/create_hf_dataset.py -d /path/to/jsonl/files -o /output/path/and/prefix

If you already have number-to-intent and number-to-slot mappings, those can be used when creating the datasets-style dataset:

python scripts/create_hf_dataset.py \
    -d /path/to/jsonl/files \
    -o /output/path/and/prefix \
    --intent-map /path/to/intentmap \
    --slot-map /path/to/slotmap
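
If, as we assume here, the script writes standard datasets Arrow directories named <prefix>.train, <prefix>.dev, and <prefix>.test (matching the paths used in the example configs), they can be reloaded with load_from_disk. Check the script's actual output naming before relying on this sketch.

from datasets import load_from_disk

# Assumption: the output prefix passed via -o yields Arrow datasets at
# <prefix>.train / <prefix>.dev / <prefix>.test (cf. the example configs,
# which point at massive_datasets/.train etc.). Adjust the paths as needed.
train_ds = load_from_disk("/output/path/and/prefix.train")
print(train_ds)     # features and number of rows
print(train_ds[0])  # one example, including numeric intent and slot labels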

Training an Encoder Model

We have included intent classification and slot-filling models based on the pretrained XLM-R Base or mT5 encoders coupled with JointBERT-style classification heads. Training can be conducted using the Trainer from transformers.

We have provided some helper functions in massive.utils.training_utils, described below:

  • create_compute_metrics creates the compute_metrics function used to calculate evaluation metrics.
  • init_model initializes one of our provided models.
  • init_tokenizer initializes one of the pretrained tokenizers.
  • prepare_collator prepares a collator with a user-specified max length and padding strategy.
  • prepare_train_dev_datasets loads the datasets prepared as described above.
  • output_predictions outputs the final predictions when running test.

Training is configured in a yaml file. Examples are given in examples/. A given yaml file fully describes its respective experiment.

Once an experiment configuration file is created, training can be performed using our provided training script. We also have provided a conda environment configuration file with the necessary dependencies that you may choose to use.

conda env create -f conda_env.yml
conda activate massive

Set the PYTHONPATH if needed:

export PYTHONPATH=${PYTHONPATH}:/PATH/TO/massive/src/

Then run training:

python scripts/train.py -c YOUR/CONFIG/FILE.yml

Distributed training can be run using torchrun for PyTorch v1.10 or later or torch.distributed.launch for earlier PyTorch versions. For example:

torchrun --nproc_per_node=8 scripts/train.py -c YOUR/CONFIG/FILE.yml

or

python -m torch.distributed.launch --nproc_per_node=8 scripts/train.py -c YOUR/CONFIG/FILE.yml

Seq2Seq Model Training

Sequence-to-sequence (Seq2Seq) model training is performed using the MASSIVESeq2SeqTrainer class. This class inherits from Seq2SeqTrainer from transformers. The primary difference with this class is that autoregressive generation is performed during validation, which is turned on using the predict_with_generate training argument. Seq2Seq models use teacher forcing during training.

For text-to-text modeling, we have included the following functions in massive.utils.training_utils:

  • convert_input_to_t2t
  • convert_intents_slots_to_t2t
  • convert_t2t_batch_to_intents_slots
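
To make the text-to-text setup concrete, here is one plausible serialization of an example into an input/target string pair. It is illustrative only and is not claimed to match the exact format produced by the helpers above.

def to_t2t_example(example):
    """Illustrative serialization; the real convert_* helpers may use a different format."""
    source = "Annotate: " + example["utt"]
    target = example["intent"] + " " + example["annot_utt"]
    return source, target

src, tgt = to_t2t_example({
    "utt": "wake me up at five am this week",
    "intent": "alarm_set",
    "annot_utt": "wake me up at [time : five am] [date : this week]",
})
print(src)  # Annotate: wake me up at five am this week
print(tgt)  # alarm_set wake me up at [time : five am] [date : this week]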

For example, mT5 Base can be trained on an 8-GPU instance as follows:

For PyTorch v1.10 or later:

torchrun --nproc_per_node=8 scripts/train.py -c examples/mt5_base_t2t_20220411.yml 2>&1 | tee /PATH/TO/LOG/FILE

Or on older PyTorch versions:

python -m torch.distributed.launch --nproc_per_node=8 scripts/train.py -c examples/mt5_base_t2t_20220411.yml 2>&1 | tee /PATH/TO/LOG/FILE

Performing Inference on the Test Set

Test inference requires a test block in the configuration. See examples/xlmr_base_test_20220411.yml for an example. Test inference, including evaluation and output of all predictions, can be executed using the scripts/test.py script. For example:

For PyTorch v1.10 or later:

torchrun --nproc_per_node=8 scripts/test.py -c examples/xlmr_base_test_20220411.yml 2>&1 | tee /PATH/TO/LOG/FILE

Or on older PyTorch versions:

python -m torch.distributed.launch --nproc_per_node=8 scripts/test.py -c examples/xlmr_base_test_20220411.yml 2>&1 | tee /PATH/TO/LOG/FILE

Be sure to include a test.predictions_file in the config to output the predictions.

For official test results, please upload your predictions to the eval.ai leaderboard.

MMNLU-22 Eval

To create predictions for the Massively Multilingual NLU 2022 competition on eval.ai, you can follow these example steps using the model you've already trained. An example config is given at examples/mt5_base_t2t_mmnlu_20220720.yml.

Download and untar:

curl https://amazon-massive-nlu-dataset.s3.amazonaws.com/amazon-massive-dataset-heldout-MMNLU-1.0.tar.gz --output amazon-massive-dataset-heldout-MMNLU-1.0.tar.gz

tar -xzvf amazon-massive-dataset-heldout-MMNLU-1.0.tar.gz

Create the huggingface version of the dataset using the mapping files used when training the model.

python scripts/create_hf_dataset.py \
    -d /PATH/TO/mmnlu-eval/data \
    -o /PATH/TO/hf-mmnlu-eval \
    --intent-map /PATH/TO/massive_1.0_hf_format/massive_1.0.intents \
    --slot-map /PATH/TO/massive_1.0_hf_format/massive_1.0.slots

Create a config file similar to examples/mt5_base_t2t_mmnlu_20220720.yml.

Kick off inference from within your environment with dependencies loaded, etc:

For PyTorch v1.10 or later:

torchrun --nproc_per_node=8 scripts/predict.py -c PATH/TO/YOUR/CONFIG.yml 2>&1 | tee PATH/TO/LOG

Or on older PyTorch versions:

python -m torch.distributed.launch --nproc_per_node=8 scripts/predict.py -c PATH/TO/YOUR/CONFIG.yml 2>&1 | tee PATH/TO/LOG

Upload results to the MMNLU-22 Phase on eval.ai.

Hyperparameter Tuning

Hyperparameter tuning can be performed using the Trainer from transformers. As with training, we combine all configurations into a single yaml file. An example is given in examples/xlmr_base_hptuning_20220411.yml.

Once a configuration file has been made, the hyperparameter tuning run can be initiated using our provided scripts/run_hpo.py script. Relative to train.py, this script uses an additional function called prepare_hp_search_args, which converts the hyperparameter search space provided in the configuration into an instantiated ray search space.
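
For orientation, hyperparameter search with the transformers Trainer and the Ray backend generally follows the pattern sketched below. This is a generic sketch, not the contents of scripts/run_hpo.py; the search-space values are placeholders, and the trainer argument is assumed to be a transformers Trainer constructed with a model_init function.

from ray import tune

def run_search(trainer, n_trials=20):
    """Generic Trainer.hyperparameter_search usage with the Ray backend."""
    def hp_space(trial):
        # Placeholder search space; run_hpo.py builds the equivalent pieces
        # from the yaml configuration via prepare_hp_search_args.
        return {
            "learning_rate": tune.loguniform(1e-6, 1e-4),
            "per_device_train_batch_size": tune.choice([16, 32, 64]),
        }

    best_run = trainer.hyperparameter_search(
        hp_space=hp_space,
        backend="ray",
        n_trials=n_trials,
        direction="maximize",
    )
    return best_run.hyperparameters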

Licenses

See LICENSE.txt, NOTICE.md, and THIRD-PARTY.md.

Citation

We ask that you cite both our MASSIVE paper and the paper for SLURP, given that MASSIVE used English data from SLURP as seed data.

MASSIVE paper:

@misc{fitzgerald2022massive,
      title={MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages}, 
      author={Jack FitzGerald and Christopher Hench and Charith Peris and Scott Mackie and Kay Rottmann and Ana Sanchez and Aaron Nash and Liam Urbach and Vishesh Kakarala and Richa Singh and Swetha Ranganath and Laurie Crist and Misha Britan and Wouter Leeuwis and Gokhan Tur and Prem Natarajan},
      year={2022},
      eprint={2204.08582},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

SLURP paper:

@inproceedings{bastianelli-etal-2020-slurp,
    title = "{SLURP}: A Spoken Language Understanding Resource Package",
    author = "Bastianelli, Emanuele  and
      Vanzo, Andrea  and
      Swietojanski, Pawel  and
      Rieser, Verena",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.emnlp-main.588",
    doi = "10.18653/v1/2020.emnlp-main.588",
    pages = "7252--7262",
    abstract = "Spoken Language Understanding infers semantic meaning directly from audio data, and thus promises to reduce error propagation and misunderstandings in end-user applications. However, publicly available SLU resources are limited. In this paper, we release SLURP, a new SLU package containing the following: (1) A new challenging dataset in English spanning 18 domains, which is substantially bigger and linguistically more diverse than existing datasets; (2) Competitive baselines based on state-of-the-art NLU and ASR systems; (3) A new transparent metric for entity labelling which enables a detailed error analysis for identifying potential areas of improvement. SLURP is available at https://github.com/pswietojanski/slurp."
}

Old News

  • 26 Oct: We are pleased to declare Maxime De Bruyn, Ehsan Lotfi, Jeska Buhmann, and Walter Daelemans of the bolleke team as the winners of the Organizers' Choice Award! Please come to our workshop to hear more about their model and their associated paper, Machine Translation for Multilingual Intent Detection and Slots Filling.
  • 12 Aug: We welcome submissions until Sep 2nd for the MMNLU-22 Organizers’ Choice Award, as well as direct paper submissions until Sep 7th. The Organizers’ Choice Award is based primarily on our assessment of the promise of an approach, not only on the evaluation scores. To be eligible, please (a) make a submission on eval.ai to either MMNLU-22 task and (b) send a brief (<1 page) writeup of your approach to [email protected] describing the following:
    • Your architecture,
    • Any changes to training data, use of non-public data, or use of public data,
    • How dev data was used and what hyperparameter tuning was performed,
    • Model input and output formats,
    • What tools and libraries you used, and
    • Any additional training techniques you used, such as knowledge distillation.
  • 12 Aug: We are pleased to declare the HIT-SCIR team as the winner of the MMNLU-22 Competition Full Dataset Task. Congratulations to Bo Zheng, Zhuoyang Li, Fuxuan Wei, Qiguang Chen, Libo Qin, and Wanxiang Che from the Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology. The team has been invited to speak at the MMNLU-22 workshop on Dec 7th, where you can learn more about their approach.
  • 12 Aug: We are pleased to declare the FabT5 team as the winner of the MMNLU-22 Competition Zero-Shot Task. Congratulations to Massimo Nicosia and Francesco Piccinno from Google. They have been invited to speak at the MMNLU-22 workshop on Dec 7th, where you can learn more about their approach.
  • 30 Jul: Based on compelling feedback, we have updated our rules as follows: Contestants for the top-scoring model awards must submit their predictions on the evaluation set by the original deadline of Aug 8th. Contestants for the "organizers' choice award" can submit their predictions until Sep 2nd. The organizers' choice award will be based primarily on the promise of the approach, but we will also consider evaluation scores.
  • 29 Jul 2022: (Outdated -- see above) We have extended the deadline for MMNLU-22 evaluation to Sep 2nd. Additionally, besides the winners of the β€œfull dataset” and β€œzero-shot” categories, we plan to select one team (β€œorganizer’s choice award”) to present their findings at the workshop. This choice will be made based on the promise of the approach, not just on model evaluation scores.
  • 25 Jul 2022: The unlabeled evaluation set for the Massively Multilingual NLU 2022 Competition has been released. Please note that (1) the eval data is unlabeled, meaning that the keys scenario, intent, and annot_utt are not present, as well as any judgment data, and (2) the intent and slot maps from your previous training run should be used when creating a new huggingface-style dataset using create_hf_dataset.py. More details can be found in the section with heading "MMNLU-22 Eval" below.
  • 7 Jul 2022: Get ready! The unlabeled evaluation data for the Massively Multilingual NLU 2022 Competition will be released on July 25th. Scores can be submitted to the MMNLU-22 leaderboard until Aug 8th. Winners will be invited to speak at the workshop, colocated with EMNLP.
  • 30 Jun 2022: (CFP) Paper submissions for Massively Multilingual NLU 2022, a workshop at EMNLP 2022, are now being accepted. MASSIVE is the shared task for the workshop.
  • 22 Jun 2022: We updated the evaluation code to fix bugs identified by @yichaopku and @bozheng-hit (Issues 13 and 21, PRs 14 and 22). Please pull commit 3932705 or later to use the remedied evaluation code. The baseline results on the leaderboard have been updated, as well as the preprint paper on arXiv.
  • 20 Apr 2022: Launch and release of the MASSIVE dataset, this repo, the MASSIVE paper, the leaderboard, and the Massively Multilingual NLU 2022 workshop and competition.

Issues

About calculating the slot f1 metric

When calculating the slot metric, the parameter "labels_ignore" of the function
def eval_preds(pred_intents=None, lab_intents=None, pred_slots=None, lab_slots=None,
eval_metrics='all', labels_ignore='Other', labels_merge=None, pad='Other',
slot_level_combination=True)

is set as "Other". This result in the case, eg. label: what is the weather [datetime: today] prediction: what [datetime: is the weather today] treated as a correct prediction.
Whether this is by design or a mistake? If this is a mistake, could someone please update the compute_metrics code used for online evaluation and the baseline metric values in the competition webpage and the leaderboard?

Add export python path in the readme?

I really appreciate the work and the clear code organization.
When I tried to train a model, I found that we need to export the Python path first before running:

python scripts/train.py -c CONFIG

I think adding this to the README could help others rerun your code.

Again, thanks for the great work!

Critical errors in translated utterances

I have checked the de-DE file and found various utterances that have not been translated properly. No German speaker would use these words.

Examples
id 286: "die lichter in der kΓΌche anzΓΌnden"
id 6160: "outlet an"

What is missing is a workflow for correcting such errors.

Furthermore, some of the utterances in the data are pure garbage.

Examples
id 511 is labeled as "weather_query" and contains the utterance "richtiger formel bitte" - meaning "correct formula please". This is syntactically as well as semantically incorrect.

id 523 is also labeled as "weather_query" and contains the utterance "es ist ein sehr gefΓ€hrlicher" - meaning "it is a very dangerous one". Just absurd, this entry.

After checking several other domains, I found that at least the de-DE data contains a critical number of such errors.
I therefore raise the issue that the data quality is far from sufficient for meaningful tests.

Regards

Peter Henning
Professor for Computer Science

Compatibility for Python 3.6.x

str.isascii was added in Python 3.7. For compatibility with Python 3.6.x, could you consider using:

def isascii(s: str) -> bool:
    try:
        return s.isascii()
    except AttributeError:
        return all([ord(c) < 128 for c in s])

GPU memory usage keeps growing when Performing Inference on the Test Set.

Hi,

I successfully trained an xlmr-base-20220411 encoder model following the README. When I tried to run inference on the test set, the GPU memory usage kept growing and eventually caused a CUDA out-of-memory issue.

Even with the batch size set to 1, the GPU memory usage still creeps up and causes a CUDA out-of-memory error. Could you help me figure it out?

Here is the command I use to run inference on the test set:
torchrun --nproc_per_node=4 scripts/test.py -c examples/xlmr_base_test_20220411.yml

Here is the content of the config file xlmr_base_test_20220411.yml

run_name: &run_name xlmr_base_20220411_test
max_length: &max_length 512

model:
  type: xlmr intent classification slot filling
  checkpoint: checkpoints/xlmr_base_20220411/checkpoint-229400/

tokenizer:
  type: xlmr base
  tok_args:
    vocab_file: checkpoints/xlmr_base_20220411/checkpoint-229400/sentencepiece.bpe.model
    max_len: *max_length

collator:
  type: massive intent class slot fill
  args:
    max_length: *max_length
    padding: longest

test:
  test_dataset: massive_datasets/.test
  intent_labels: massive_datasets/.intents
  slot_labels: massive_datasets/.slots
  massive_path: ~/massive_0614/massive
  slot_labels_ignore:
    - Other
  eval_metrics: all
  predictions_file: logs/xlmr_base_20220411/preds.jsonl
  trainer_args:
    output_dir: checkpoints/xlmr_base_20220411/
    per_device_eval_batch_size: 4
    remove_unused_columns: false
    label_names:
      - intent_num
      - slots_num
    log_level: info
    logging_strategy: no
    locale_eval_strategy: all only
    #locale_eval_strategy: all and each
    disable_tqdm: false

Note: I have 15 GB of GPU memory on my machine.

About the competition rules of zero-shot task

The competition introduction webpage says, "For zero-shot training, only English data from the MASSIVE training split may be used."

Is it allowed to use Google Translate to translate the English training data into other languages and then use the translated data to train the zero-shot model? In my understanding, translation can be regarded as a data augmentation strategy, so the rule mentioned above would not be violated, since only English data from the MASSIVE training split is used.

Cannot reproduce xlmr-base zero-shot results

Hi,

I can reproduce the mt5-base-enc and mt5-base-t2t models, including training on the full dataset and zero-shot training. As for the xlmr-base model, I can reproduce the results when training on the full dataset, but not the zero-shot results. Does anyone face the same problem?

Here is my xlmr-base-zero config:

run_name: &run_name xlmr_base_zero_20220411
max_length: &max_length 512

model:
  type: xlmr intent classification slot filling
  size: base
  pretrained_weights: ../pretrained_models/xlmr.base/model.pt
  pretrained_weight_substring_transform: ['roberta', 'xlmr']
  strict_load_pretrained_weights: false
  model_config_args:
    attention_probs_dropout_prob: 0.35
    bos_token_id: 0
    eos_token_id: 2
    hidden_act: gelu
    hidden_dropout_prob: 0.25
    hidden_size: 768
    initializer_range: 0.02
    intermediate_size: 3072
    layer_norm_eps: 1e-05
    max_position_embeddings: 514
    num_attention_heads: 12
    num_hidden_layers: 12
    output_past: true
    pad_token_id: 1
    type_vocab_size: 1
    vocab_size: 250002
    use_crf: false
    slot_loss_coef: 2.0
    hidden_layer_for_class: 10
    head_num_layers: 2
    head_layer_dim: 8192
    head_intent_pooling: max
    freeze_layers: xlmr.embeddings.word_embeddings.weight

tokenizer:
  type: xlmr base
  tok_args:
    vocab_file: ../pretrained_models/xlmr.base/sentencepiece.bpe.model
    max_len: *max_length

collator:
  type: massive intent class slot fill
  args:
    max_length: *max_length
    padding: longest

train_val:
  train_dataset: massive_datasets/.train
  train_locales: en-US
  dev_dataset: massive_datasets/.dev
  intent_labels: massive_datasets/.intents
  slot_labels: massive_datasets/.slots
  slot_labels_ignore:
    - Other
  eval_metrics: all
  trainer_args:
    output_dir: checkpoints/xlmr_base_zero_20220411/
    logging_dir: logs/xlmr_base_zero_20220411/
    save_strategy: steps
    save_steps: 500
    evaluation_strategy: steps
    eval_steps: 500
    learning_rate: 4.7e-06
    lr_scheduler_type: constant_with_warmup
    warmup_steps: 500
    adam_beta1: 0.99
    adam_beta2: 0.9999
    adam_epsilon: 1.0e-09
    weight_decay: 0.11
    gradient_accumulation_steps: 1
    per_device_train_batch_size: 64
    per_device_eval_batch_size: 16
    eval_accumulation_steps: 4
    num_train_epochs: 850
    remove_unused_columns: false
    label_names:
      - intent_num
      - slots_num
    logging_steps: 100
    log_level: info
    locale_eval_strategy: all and each
    disable_tqdm: false

Issue for converting bio sequence

Hi @jgmf-amazon, thanks for your reply. However, the current evaluation code contains an issue:

def convert_to_bio(seq_tags, outside='Other', labels_merge=None):
    """
    Converts a sequence of tags into BIO format. EX:

        ['city', 'city', 'Other', 'country', -100, 'Other']
        to
        ['B-city', 'I-city', 'O', 'B-country', 'I-country', 'O']
        where outside = 'Other' and labels_merge = [-100]

    :param seq_tags: the sequence of tags that should be converted
    :type seq_tags: list
    :param outside: The label(s) to put outside (ignore). Default: 'Other'
    :type outside: str or list
    :param labels_merge: The labels to merge leftward (i.e. for tokenized inputs)
    :type labels_merge: str or list
    :return: a BIO-tagged sequence
    :rtype: list
    """

    seq_tags = [str(x) for x in seq_tags]

    outside = [outside] if type(outside) != list else outside
    outside = [str(x) for x in outside]

    if labels_merge:
        labels_merge = [labels_merge] if type(labels_merge) != list else labels_merge
        labels_merge = [str(x) for x in labels_merge]
    else:
        labels_merge = []

    bio_tagged = []
    prev_tag = None
    for tag in seq_tags:
        if tag in outside:
            bio_tagged.append('O')
            prev_tag = tag
            continue
        if tag != prev_tag and tag not in labels_merge:
            bio_tagged.append('B-' + tag)
            prev_tag = tag
            continue
        if tag == prev_tag or tag in labels_merge:
            if prev_tag in outside:
                bio_tagged.append('O')
            else:
                bio_tagged.append('I-' + prev_tag)

    return bio_tagged

The current code hits a bug when prev_tag is None and tag is -100.
This will not happen if the model is properly trained, but at the beginning of training the model might output this combination.

A potential solution might be to initialize prev_tag with 'O' instead of None.

Originally posted by @ihungalexhsu in #13 (comment)

Training duration for Zero Shot Training?

Hi MASSIVE team,

I'm working to replicate the zero-shot results trained on en-US and I'm not sure of the best option for the number of training epochs to run. Table 5 from the paper suggests that 26 epochs is sufficient but the example config sets 850 epochs. This is obviously a big difference and 850 epochs of training suggests some kind of grokking phenomenon.

I tried 100 epochs and got very poor results (en-US EM: 6.2), but I might have had the same issue as #27; I'm currently re-running with their config to check at 850 epochs. Can you clarify the epoch setting for training only on en-US? Thanks!

Error while creating predictions on heldout dataset

Steps to reproduce:

  1. Create new dataset using create_hf_dataset.py script
  2. In the config, point to your finetuned model and new dataset. We are using XLMR model.

Running
torchrun --nproc_per_node=1 scripts/predict.py -c examples/xlmr_base_test_20220411.yml

throws the below error.

Traceback (most recent call last):
  File "/local/home/desktop/Experiments/massive/scripts/predict.py", line 112, in <module>
    main()
  File "/local/home/desktop/Experiments/massive/scripts/predict.py", line 102, in main
    outputs = trainer.predict(test_ds, tokenizer=tokenizer)
  File "/home/desktop/Experiments/massive/src/massive/utils/trainer.py", line 188, in predict
    output = self.evaluate(
  File "/home/desktop/Experiments/massive/src/massive/utils/trainer.py", line 142, in evaluate
    output = eval_loop(
  File "/home/desktop/anaconda3/envs/massive/lib/python3.9/site-packages/transformers/trainer.py", line 2314, in evaluation_loop
    for step, inputs in enumerate(dataloader):
  File "/home/desktop/anaconda3/envs/massive/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 652, in __next__
    data = self._next_data()
  File "/home/desktop/anaconda3/envs/massive/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 692, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/desktop/anaconda3/envs/massive/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    return self.collate_fn(data)
  File "/home/desktop/Experiments/massive/src/massive/loaders/collator_ic_sf.py", line 64, in __call__
    label = entry['slots_num']
KeyError: 'slots_num'

Unable to prepare data

Preparing the dataset using the command mentioned in the README throws a TypeError.

python scripts/create_hf_dataset.py -d /path/to/jsonl/files -o /output/path/and/prefix

Traceback of the error:

...
Adding numeric intent and slot labels to the datasets
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 587214/587214 [01:05<00:00, 8946.88ex/s]
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 103683/103683 [00:11<00:00, 8882.30ex/s]
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 151674/151674 [00:17<00:00, 8832.28ex/s]
Traceback (most recent call last):
  File "/path/to/massive/scripts/create_hf_dataset.py", line 331, in <module>
    main()
  File "/path/to/massive/scripts/create_hf_dataset.py", line 325, in main
    ds_creator.add_numeric_labels()
  File "/path/to/massive/scripts/create_hf_dataset.py", line 259, in add_numeric_labels
    if self.hidden_eval[0]['intent_str']:
TypeError: 'NoneType' object is not subscriptable

jsonl files used to prepare dataset contain the MASSIVE dataset downloaded from this link: https://amazon-massive-nlu-dataset.s3.amazonaws.com/amazon-massive-dataset-1.0.tar.gz

The error arises because the dataset mentioned above does not contain the MMNLU-22 eval data. Either the README should be updated to clarify that the jsonl path passed to the create_hf_dataset.py script should contain the MMNLU eval data as well, or alternatively lines 259-260 in scripts/create_hf_dataset.py can be modified to the following:

if self.hidden_eval and self.hidden_eval[0]['intent_str']:
    self.hidden_eval = self.hidden_eval.map(create_numeric_labels)

instead of the current

if self.hidden_eval[0]['intent_str']:
    self.hidden_eval = self.hidden_eval.map(create_numeric_labels)

19,521 vs. 16,521 utterances

Why do the paper and e.g. the blog post say that there are 19,521 utterances per language? Downloading the files, there are only 16,521 per language, which is also the number of utterances in the SLURP dataset. Is this a typo, or am I missing something?

Will the rule "dev splits of the MASSIVE dataset may not be used for model training" make the competition unfair for individual contestants?

Though the dev splits of the MASSIVE dataset may not be used for model training, they can be used for hyperparameter tuning. Tuning hyperparameters with effective search algorithms is, to some extent, training with the dev splits, especially for those with many GPUs. So, for the full dataset competition, contestants with more GPUs can effectively use more training data (the dev split). For the zero-shot competition, contestants with more GPUs can indirectly use non-English labelled data (the dev split). This may be unfair to individual contestants (those without enough GPU resources) compared with those representing a lab or company (who usually have more GPU resources). Maybe it would be better and fairer for everyone to merge the train and dev splits into a single train split and let contestants split it into train and dev themselves for hyperparameter tuning.

Preparing data

I prepared the MASSIVE dataset using the following command

python scripts/create_hf_dataset.py -d /path/to/jsonl/files -o /output/path/and/prefix

which results in the following error:

Traceback (most recent call last):
  File "scripts/create_hf_dataset.py", line 310, in <module>
    main()
  File "scripts/create_hf_dataset.py", line 301, in main
    ds_creator.create_datasets(args.massive_data_paths)
  File "scripts/create_hf_dataset.py", line 84, in create_datasets
    for line in f:
  File "/scratch/hle/conda_env/py3_env/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 37: invalid start byte

Also, the jsonl files seem to be encoded and cannot be viewed as shown in the README.

Any suggestion to resolve the issue is greatly appreciated!

About the official evaluation script

Is there an evaluation script that directly compares a prediction file against the gold labels, i.e., an official evaluation script?

Does Label Truncation Influence the Results?

Hi,

It seems that labels for testing are also truncated to a fixed max length:

tok_labels = self.tokenizer(labels, max_length=self.max_length, truncation=True)

If the input sequence for encoder-only models or the label sequence for generative models is longer than the max length, the extra labels will be ignored during evaluation. (Although samples in this dataset are relatively short sentences, it is still a risk.)
I think a better way could be to maintain an independent evaluation tool taking the predictions and raw labels as inputs.

TypeError when training a xlmr zero-shot encoder

Hi,

When I try to train an xlmr-base zero-shot encoder, a TypeError occurs. Here is the error message:

Traceback (most recent call last):
  File "/home/massive/scripts/train.py", line 99, in <module>
    main()
  File "/home/massive/scripts/train.py", line 96, in main
    trainer.train()
  File "/home/miniconda3/envs/massive/lib/python3.9/site-packages/transformers/trainer.py", line 1392, in train
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/home/miniconda3/envs/massive/lib/python3.9/site-packages/transformers/trainer.py", line 1514, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/home/massive/src/massive/utils/trainer.py", line 110, in evaluate
    output = eval_loop(
  File "/home/miniconda3/envs/massive/lib/python3.9/site-packages/transformers/trainer.py", line 2392, in evaluation_loop
    metrics = self.compute_metrics(EvalPrediction(predictions=all_preds, label_ids=all_labels))
  File "/home/massive/src/massive/utils/training_utils.py", line 324, in compute_metrics
    return eval_preds(
  File "/home/massive/src/massive/utils/training_utils.py", line 475, in eval_preds
    convert_to_bio(lab, outside=labels_ignore, labels_merge=labels_merge)
  File "/home/massive/src/massive/utils/training_utils.py", line 423, in convert_to_bio
    bio_tagged.append('I-' + prev_tag)
TypeError: can only concatenate str (not "NoneType") to str

I traced the code and found that prev_tag is None when it is concatenated with the string 'I-', which happens when tag is -100 and labels_merge is '-100'. Could you help check this error?

Error in create_hf_dataset.py while using existing intent and slot mappings

When I use existing intent and slot mappings to create dataset using create_hf_dataset.py, and then try to use the dataset in my model, I get the following error:

  File "scripts/test.py", line 131, in <module>
    main()
  File "scripts/test.py", line 111, in main
    outputs = trainer.predict(test_ds, tokenizer=tokenizer)
  File "/path/to/massive/massive/src/massive/utils/trainer.py", line 158, in predict
    output = self.evaluate(
  File "/path/to/massive/massive/src/massive/utils/trainer.py", line 112, in evaluate
    output = eval_loop(
  File "/path/to/env/lib/python3.8/site-packages/transformers/trainer.py", line 2926, in evaluation_loop
    for step, inputs in enumerate(dataloader):
  File "/path/to/env/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 681, in __next__
    data = self._next_data()
  File "/path/to/env/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 721, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/path/to/env/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    return self.collate_fn(data)
  File "/path/to/massive/massive/src/massive/loaders/collator_ic_sf.py", line 121, in __call__
    return {k: torch.tensor(v, dtype=torch.int64) for k, v in pad_tok_inputs.items()}
  File "/path/to/massive/massive/src/massive/loaders/collator_ic_sf.py", line 121, in <dictcomp>
    return {k: torch.tensor(v, dtype=torch.int64) for k, v in pad_tok_inputs.items()}
ValueError: too many dimensions 'str'

This error does not occur when I create new mappings from the create_hf_dataset.py script itself.

Adding datasets and models to the Hugging Face Hub

Hi there!

Congratulations on releasing the MASSIVE dataset and benchmark - this is really exciting! πŸŽ‰

Seeing that you already have a script to load the dataset with datasets, I was wondering if you would like to directly host the dataset on the Hugging Face Hub? That would allow loading the dataset in a single line: ds = load_dataset("amazon/massive").

There are already two organizations where you could host it:
https://huggingface.co/AmazonScience
https://huggingface.co/amazon

In addition, we could also add some baseline models, e.g., the ones from the example scripts.

What do you think? Happy to help adding them!

Number of parameters limit

Thank you for organizing this interesting shared task!

What is the parameter limit for decoder-only models? They also use the text-to-text approach, but follow a decoder-only architecture...

Thank you,

Maxime.

infer a statement

I used the mT5-base pretrained model to train my MASSIVE slot and intent model and saved the checkpoints in a directory.
I want to run inference on a single statement without creating a test dataset or running evaluation.
Is there sample code for this?
