OFA-Sys / OFA
Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

License: Apache License 2.0


Introduction




ModelScope  |  Checkpoints  |  Colab  |  Demo  |  Paper   |  Blog



OFA is a unified sequence-to-sequence pretrained model (supporting English and Chinese) that unifies modalities (i.e., cross-modality, vision, and language) and tasks (both finetuning and prompt tuning are supported): image captioning (1st on the MSCOCO Leaderboard), VQA, visual grounding, text-to-image generation, text classification, text generation, image classification, etc. We provide step-by-step instructions for pretraining and finetuning, along with the corresponding checkpoints (see the official ckpt [EN|CN] or the Hugging Face ckpt).

We sincerely welcome contributions to our project. Feel free to contact us or send us issues / PRs!

Online Demos

We provide online demos via Hugging Face Spaces for you to interact with our pretrained and finetuned models.

We also provide Colab notebooks to walk you through the procedures. Click here to check them out!

Use in Hugging Face Transformers

We support OFA inference in Hugging Face Transformers. Check the README and Colab notebook for more information. The code is released in the branch https://github.com/OFA-Sys/OFA/tree/feature/add_transformers

News

  • 2023.5.11: Two papers (OFA-OCR and OFA-prompt) were accepted by ACL. The evaluation scripts and checkpoints of OFA-OCR have been released.
  • 2023.1.11: Released MuE (https://arxiv.org/abs/2211.11152), which significantly accelerates OFA with little performance degradation. Many thanks to the first author, Shengkun Tang (@Tangshengku). See the branch feature/MuE and PR for more information.
  • 2022.12.20: Released OFA-OCR, a model for Chinese text recognition based on OFA. Check our paper and demo.
  • 2022.12.7: Released MMSpeech, an ASR pretraining method based on OFA. Check our paper here! See README_mmspeech.md for further details.
  • 2022.8.16: Released the Chinese version of OFA. To use OFA-CN, simply switch to bpe_dir=../../utils/BERT_CN_dict and bpe=bert and use our provided Chinese checkpoints in checkpoints_cn.md. For now, we only provide base-size and large-size pretrained checkpoints, as well as finetuned checkpoints on MUGE Caption and the Chinese version of RefCOCO(-/+/g) (to be released soon).
  • 2022.8.5: Released support for prompt tuning of OFA. Check our paper here! See prompt_tuning.md for further details.
  • 2022.7.7: Updated support of OFA on Hugging Face Transformers (fixed bugs in forward, added the sequence generator from Fairseq to ensure performance, etc.). Refer to the doc transformers.md and the branch feature/add_transformers.
  • 2022.6.17: Released the pretrained checkpoint of OFA-Huge. To use it, set --arch=ofa_huge in the script.
  • 2022.5.15: OFA was accepted by ICML 2022.
More News

  • 2022.4.28: Added support for inference on Hugging Face Transformers. For how to use it, please refer to the doc transformers.md and our Hugging Face models (https://huggingface.co/OFA-Sys).
  • 2022.4.16: Released the lightweight pretrained models OFA-Medium (~93M params) and OFA-Tiny (~33M params) in checkpoints.md. To use them, you just need to load the corresponding checkpoint and set --arch=ofa_medium or --arch=ofa_tiny in the scripts.
  • 2022.3.23: Added Encouraging Loss (https://arxiv.org/pdf/2110.06537.pdf) as a feature. See README_EncouragingLoss.md. Leveraging this feature, OFA-Large has achieved improved results in both VQA (test-std acc: 80.67) and Image Classification (test acc: 85.6) recently.
  • 2022.3.21: Released codes for pretraining OFA.
  • 2022.3.18: Released the finetuned OFA-Base (~180M parameters) checkpoints and running scripts for vision & language tasks, including: Caption (146.4 CIDEr), VQA (78.07 on test-std), SNLI-VE (89.3 on dev), RefCOCO (90.67 on testA), RefCOCO+ (87.15 on testA) and RefCOCOg (82.31 on test-u).
  • 2022.3.11: Released the finetuning & inference code/checkpoints for Gigaword.
  • 2022.3.08: Released the pretrained checkpoint of OFA-Base in checkpoints.md. To use OFA-Base, you just need to load ofa_base.pt and change --arch=ofa_large to --arch=ofa_base in the training scripts.
  • 2022.3.07: Released the finetuning & inference code/checkpoints for Image Classification, which achieves 85.0 accuracy on ImageNet-1K, slightly better than reported in the OFA paper.
  • 2022.3.04: Released the finetuning & inference code/checkpoints for Text-to-Image Generation.
  • 2022.3.03: Released the finetuning & inference code/checkpoints for SNLI-VE and GLUE.
  • 2022.2.22: Released the finetuning & inference code/checkpoints for Visual Question Answering, which can reproduce the VQA accuracy reported in the OFA paper (80.02 on test-std). Check our results on the VQA Challenge.
  • 2022.2.15: Released the finetuning & inference code/checkpoints for Referring Expression Comprehension.
  • 2022.2.10: Released the inference code & finetuned checkpoint for Image Captioning, which can reproduce the results on the COCO Karpathy test split (149.6 CIDEr). OFA also achieves No.1 on the COCO image captioning online leaderboard Link (marked as M6-Team).



Model Card

We list the parameters and pretrained checkpoints of OFAs below. For finetuned checkpoints, please refer to checkpoints.md.

| Model | Ckpt | Params | Backbone | Hidden size | Intermediate size | Num. of heads | Enc. layers | Dec. layers |
|---|---|---|---|---|---|---|---|---|
| OFA-Tiny | Download | 33M | ResNet50 | 256 | 1024 | 4 | 4 | 4 |
| OFA-Medium | Download | 93M | ResNet101 | 512 | 2048 | 8 | 4 | 4 |
| OFA-Base | Download | 180M | ResNet101 | 768 | 3072 | 12 | 6 | 6 |
| OFA-Large | Download | 470M | ResNet152 | 1024 | 4096 | 16 | 12 | 12 |
| OFA-Huge | Download | 930M | ResNet152 | 1280 | 5120 | 16 | 24 | 12 |


Results

Below we demonstrate the results of OFAs on cross-modal understanding and generation.

| Task | Image Captioning | VQA | Visual Entailment | Referring Expression Comprehension | | |
|---|---|---|---|---|---|---|
| Dataset | COCO | VQA v2 | SNLI-VE | RefCOCO | RefCOCO+ | RefCOCOg |
| Split | Karpathy test (CE/CIDEr) | test-dev / test-std | val / test | val / test-a / test-b | val / test-a / test-b | val-u / test-u |
| Metric | CIDEr | Acc. | Acc. | Acc. | Acc. | Acc. |
| OFA-Tiny | 119.0 / 128.7 | 70.3 / 70.4 | 85.3 / 85.2 | 80.20 / 84.07 / 75.00 | 68.22 / 75.13 / 57.66 | 72.02 / 69.74 |
| OFA-Medium | 130.4 / 140.3 | 75.4 / 75.5 | 86.6 / 87.0 | 85.34 / 87.68 / 77.92 | 76.09 / 83.04 / 66.25 | 78.76 / 78.58 |
| OFA-Base | 138.2 / 146.7 | 78.0 / 78.1 | 89.3 / 89.2 | 88.48 / 90.67 / 83.30 | 81.39 / 87.15 / 74.29 | 82.29 / 82.31 |
| OFA-Large | 142.2 / 150.7 | 80.4 / 80.7 | 90.3 / 90.2 | 90.05 / 92.93 / 85.26 | 85.80 / 89.87 / 79.22 | 85.89 / 86.55 |
| OFA-Huge | 145.3 / 154.9 | 82.0 / 82.0 | 91.0 / 91.2 | 92.04 / 94.03 / 88.44 | 87.86 / 91.70 / 80.71 | 88.07 / 88.78 |


Requirements

  • python 3.7.4
  • pytorch 1.8.1
  • torchvision 0.9.1
  • JAVA 1.8 (for COCO evaluation)

Installation

git clone https://github.com/OFA-Sys/OFA
cd OFA
pip install -r requirements.txt



Datasets and Checkpoints

See datasets.md and checkpoints.md.

Training & Inference

Below we provide methods for training and inference on different tasks. We provide both pretrained OFA-Large and OFA-Base in checkpoints.md. The scripts mentioned in this section are prepared for OFA-Large. For reproducing the downstream results of OFA-Base, we have also provided the corresponding finetuning and inference scripts for OFA-Base in the run_scripts/ folder.

We recommend organizing your workspace directory like this:

OFA/
├── checkpoints/
│   ├── ofa_base.pt
│   ├── ofa_large.pt
│   ├── caption_large_best_clean.pt
│   └── ...
├── criterions/
├── data/
├── dataset/
│   ├── caption_data/
│   ├── gigaword_data/
│   └── ...
├── fairseq/
├── models/
├── run_scripts/
├── tasks/
├── train.py
├── trainer.py
└── utils/

Image Processing

To process data efficiently, we do not store images as separate small files; instead, we encode them into base64 strings. Converting an image file to a base64 string is simple. Run the following code:

from PIL import Image
from io import BytesIO
import base64

img = Image.open(file_name) # path to file
img_buffer = BytesIO()
img.save(img_buffer, format=img.format)
byte_data = img_buffer.getvalue()
base64_str = base64.b64encode(byte_data) # bytes
base64_str = base64_str.decode("utf-8") # str
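
If you want to sanity-check a converted sample, the base64 string can be decoded back into an image. Below is a minimal sketch of the reverse direction (our own addition; it continues the snippet above, so the same imports and the base64_str variable are reused):

byte_data = base64.b64decode(base64_str)
decoded_img = Image.open(BytesIO(byte_data))
decoded_img.load()  # force full decoding so a corrupted string fails here
print(decoded_img.size, decoded_img.mode)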

Pretraining

Below we provide methods for pretraining OFA.

1. Prepare the Dataset

To pretrain OFA, you should first download the dataset we provide (pretrain_data_examples.zip, a small subset of the original pretraining data). For your custom pretraining datasets, please prepare your training samples in the same format. pretrain_data_examples.zip contains 4 TSV files: vision_language_examples.tsv, text_examples.tsv, image_examples.tsv and detection_examples.tsv. Details of these files are as follows:

  • vision_language_examples.tsv: Each line contains uniq-id, image (base64 string), caption, question, answer, ground-truth objects (objects appearing in the caption or question), dataset name (source of the data) and task type (caption, qa or visual grounding). Prepared for the pretraining tasks of visual grounding, grounded captioning, image-text matching, image captioning and visual question answering.
  • text_examples.tsv: Each line contains uniq-id and text. Prepared for the pretraining task of text infilling.
  • image_examples.tsv: Each line contains uniq-id, image (base64 string, which should be resized to 256x256 resolution) and image-code (sparse codes for the central part of the image, generated by VQ-GAN). Prepared for the pretraining task of image infilling.
  • detection_examples.tsv: Each line contains uniq-id, image (base64 string) and bounding box annotations (the top-left and bottom-right coordinates of the bounding box, object_id and object_name, separated by commas). Prepared for the pretraining task of detection.
In addition, the folder negative_sample in pretrain_data_examples.zip contains three files: all_captions.txt, object.txt and type2ans.json. The data in these files are used as negative samples for the image-text matching (ITM) task.
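
For reference, the snippet below is a minimal sketch (our own illustration, not part of the release) of how such a file can be read in Python. It assumes each line of vision_language_examples.tsv stores the eight fields listed above as tab-separated columns:

import base64
from io import BytesIO

from PIL import Image

with open("vision_language_examples.tsv", encoding="utf-8") as f:
    for line in f:
        # Unpack the tab-separated columns described above.
        (uniq_id, image_b64, caption, question, answer,
         gt_objects, dataset_name, task_type) = line.rstrip("\n").split("\t")
        image = Image.open(BytesIO(base64.b64decode(image_b64)))
        # ... hand the fields to your own preprocessing here
        break  # this sketch only inspects the first sample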

2. Pretraining

By default, the pretraining script will attempt to restore the released pretrained checkpoints of OFA-Base or OFA-Large and perform continuous pretraining. We recommend continuous pretraining, as it achieves much better results than pretraining from scratch. For continuous pretraining, please download the pretrained weights in advance (see checkpoints.md) and put them in the correct directory OFA/checkpoints/. Otherwise, pretraining will start from scratch.

cd run_scripts/pretraining
bash pretrain_ofa_large.sh # Pretrain OFA-Large. For OFA-Base, use pretrain_ofa_base.sh

If the pretrained OFA checkpoint is restored successfully, you will see the following information in the log:

INFO: Loaded checkpoint ../../checkpoints/ofa_large.pt

Image Captioning

Below we provide the procedures for reproducing the image captioning results in our paper.

1. Prepare the Dataset & Checkpoints

Download data (see datasets.md) and models (see checkpoints.md) and put them in the correct directory. The dataset zipfile caption_data.zip contains caption_stage1_train.tsv, caption_stage2_train.tsv, caption_val.tsv and caption_test.tsv. Each image corresponds to only 1 caption in caption_stage1_train.tsv and to multiple captions in the other TSV files (about 5 captions per image). Each line of the dataset represents a caption sample in the following format: the fields uniq-id, image-id, caption, predicted object labels (taken from VinVL, not used) and image base64 string are separated by tabs.

162365  12455   the sun sets over the trees beyond some docks.  sky&&water&&dock&&pole  /9j/4AAQSkZJ....UCP/2Q==
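
If you build such a TSV for your own images, a minimal sketch (our own helper; the file name and image path are hypothetical) that writes one row in this five-column format, reusing the base64 recipe from the Image Processing section, could look like this:

import base64
from io import BytesIO

from PIL import Image

def image_to_base64(path):
    # Encode an image file into a base64 string, as in the Image Processing section.
    img = Image.open(path)
    buf = BytesIO()
    img.save(buf, format=img.format)
    return base64.b64encode(buf.getvalue()).decode("utf-8")

# One row: uniq-id, image-id, caption, predicted object labels, image base64 string.
row = [
    "162365",
    "12455",
    "the sun sets over the trees beyond some docks.",
    "sky&&water&&dock&&pole",
    image_to_base64("example.jpg"),  # hypothetical image path
]
with open("caption_custom_train.tsv", "w", encoding="utf-8") as f:
    f.write("\t".join(row) + "\n")
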
2. Finetuning

Following previous standard practice, we divide the finetuning process of image captioning into two stages. In stage 1, we finetune OFA with cross-entropy loss on 4 NVIDIA V100 GPUs with 32GB memory (expected to obtain ~139.5 CIDEr on the validation set at this stage). In stage 2, we select the best checkpoint of stage 1 and train with CIDEr optimization on 8 NVIDIA V100 GPUs. Note that CIDEr optimization is very unstable and requires careful hyperparameter tuning. If you encounter training errors in the stage-2 finetuning, you can increase the batch size or reduce the learning rate. If neither of these works, you can directly set --freeze-resnet to freeze the inner states of batch normalization.

cd run_scripts/caption
nohup sh train_caption_stage1.sh > train_stage1.out &  # stage 1, train with cross-entropy loss
nohup sh train_caption_stage2.sh > train_stage2.out &  # stage 2, load the best ckpt of stage1 and train with CIDEr optimization 
3. Inference

Run the following commands to get your results and evaluate your model.

cd run_scripts/caption ; sh evaluate_caption.sh  # inference & evaluate

Text-to-Image Generation

This section provides the procedures for finetuning and inference on text-to-image generation.

1. Prepare the Dataset & Checkpoints

Download data (see datasets.md) and models (see checkpoints.md) and put them in the correct directory. The dataset zipfile coco_image_gen.zip contains coco_vqgan_train.tsv, coco_vqgan_dev.tsv and coco_vqgan_full_test.tsv. Each line of the dataset represents a sample in the following format: the fields uniq-id, image-code (produced by VQ-GAN; a list of integers separated by single whitespaces) and lowercased caption are separated by tabs.

1	6674 4336 4532 5334 3251 5461 3615 2469 ...4965 4190 1846	the people are posing for a group photo.

The checkpoint zipfile image_gen_large_best.zip contains image_gen_large_best.pt, vqgan/last.ckpt, vqgan/model.yaml and clip/Vit-B-16.pt.

2. Shuffle the Training Data

(Optional, but achieves better results.) If disk storage is sufficient, we recommend preparing the shuffled training data for each epoch in advance.

cd dataset/image_gen
ln coco_vqgan_train.tsv coco_vqgan_train_1.tsv
for idx in `seq 1 9`;do shuf coco_vqgan_train_${idx}.tsv > coco_vqgan_train_$[${idx}+1].tsv;done # each file is used for an epoch
3. Finetuning

Following previous practice, we divide the finetuning process of image generation into two stages. In stage 1, we finetune OFA with cross-entropy loss on 4 servers with 8 V100-32G GPUs each (expected to obtain ~32.5+ CLIP Score on the validation set at this stage). In stage 2, we select the last checkpoint of stage 1 and train with CLIP Score optimization on 4 servers with 8 V100-32G GPUs each (expected to obtain ~34.0+ CLIP Score on the validation set at this stage). During validation, the generated images will be dumped into _GEN_IMAGE_PATH_.

# run on each worker after the distributed and data configs have been correctly set following the guide in train_image_gen_stage1_distributed.sh 
cd run_scripts/image_gen
nohup sh train_image_gen_stage1_distributed.sh # stage 1, train with cross-entropy loss
nohup sh train_image_gen_stage2_distributed.sh # stage 2, load the last ckpt of stage1 and train with CLIP Score optimization 
4. Inference

Run the command below to generate your images.

cd run_scripts/image_gen ; sh evaluate_image_gen.sh  # inference & evaluate (FID, IS and CLIP Score)

Visual Question Answering

Here we provide the finetuning and inference code to reproduce the VQAv2 result reported in our paper (test-std 80.02). We believe that further accuracy improvements can still be achieved with this codebase :)

1. Prepare the Dataset & Checkpoints

Download data (see datasets.md) and models (see checkpoints.md) and put them in the correct directory. The dataset zipfile vqa_data.zip is around 100G, and the decompressed data takes around 135G of disk space; it contains the training, validation and test samples together with other necessary data resources. (Since vqa_data.zip is large, we have also provided chunked parts of the dataset files for more convenient and stable downloading. Please refer to issue #68.) Following common practice, VG-QA samples are also included in the training data. To adapt to the seq2seq paradigm of OFA, we transform the original VQA training questions with multiple golden answers into multiple training samples. For the original VQA validation set, we keep around 10k samples for our validation and use the remaining samples for training. Each line of the dataset represents a VQA sample in the following format: the fields question-id, image-id, question, answer (with confidence), predicted object labels (taken from VinVL, which brings around +0.1 accuracy improvement) and image base64 string are separated by tabs.

79459   79459   is this person wearing shorts?  0.6|!+no    house&&short&&...&&sky  /9j/4AAQS...tigZ/9k=

For fine-tuning on custom VQA-formulated tasks, please refer to issues #76, #105 and #73 for more information.
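
As a reading aid, note that the answer field in the sample above packs a confidence score and the answer text into one string (0.6|!+no). A minimal parsing sketch, assuming the |!+ separator shown in the sample, is:

def parse_answer(field):
    # Split "confidence|!+answer", e.g. "0.6|!+no" -> (0.6, "no").
    conf, answer = field.split("|!+", 1)
    return float(conf), answer

print(parse_answer("0.6|!+no"))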

2. Shuffle the Training Data

(Optional, but achieves better finetuning accuracy.) If disk storage is sufficient, we recommend preparing the shuffled training data for each epoch in advance. In our experiments, shuffling brings around +0.3 improvement on VQA accuracy.

cd dataset/vqa_data
ln vqa_train.tsv vqa_train_1.tsv
for idx in `seq 1 9`;do shuf vqa_train_${idx}.tsv > vqa_train_$[${idx}+1].tsv;done # each file is used for an epoch
3. Finetuning

In our experiments, the VQA finetuning is performed on 4 servers with 8 A100 GPUs each (with RDMA). We provide the finetuning script train_vqa_distributed.sh, which supports multi-server distributed training (as well as single-server training). Please refer to the comments at the beginning of the script and set the configs correctly according to your distributed environment. If you have shuffled the training data in the previous step, please specify the training data path correctly following the guide in the script comments. The command should be run on each worker.

# run on each worker after the distributed and data configs have been correctly set following the guide in train_vqa_distributed.sh 
cd run_scripts/vqa
bash train_vqa_distributed.sh 

In our experiments, the finetuning costs around 36 hours (for 12 epochs). After each epoch, an evaluation on validation set is performed. The best validation accuracy during finetuning will be around 80.8. The log is saved in ${log_dir}.

(Update on validation time cost) As mentioned in the 4. Inference section below, we provide 2 types of inference: beam-search and all-candidate inference. By default, all-candidate inference is used for validation during fine-tuning, which achieves better accuracy but takes much more time. We have now added a new option to the training scripts called --val-inference-type to switch the validation inference type during fine-tuning. If you feel the validation takes too long, you can refer to PR #79 to activate beam-search validation, which takes significantly less time, with around 0.5-0.6 validation score degradation compared with all-candidate validation.

4. Inference

We provide 2 types of inference, beam-search (much faster but gets sub-optimal accuracy) and all-candidate evaluation (slower but best accuracy).

For beam-search inference, use the script evaluate_vqa_beam.sh. Refer to the command below. The inference on test set costs around 16 GPU hours. After inference on test set, the result JSON file will be dumped in the ${result_path} defined in the shell script. You can submit the result test_predict.json to EvalAI. Using our released finetuned checkpoint, beam-search inference will get 80.15 validation accuracy, 79.36 test-dev accuracy and 79.48 test-std accuracy (around 0.6 lower than all-candidate evaluation).

cd run_scripts/vqa
bash evaluate_vqa_beam.sh val # specify 'val' or 'test'

For all-candidate evaluation, we recommend using the distributed script evaluate_vqa_allcand_distributed.sh. Please refer to the guide in the script to set the distributed configs before running. The result JSON file will be dumped in the ${result_path} defined in the shell script of the rank-0 server. All-candidate evaluation computes scores on all the candidate answers in the VQA dataset, which achieves 80.82 validation accuracy, 79.87 test-dev accuracy and 80.02 test-std accuracy, reproducing our reported results in the paper. However, the inference on the test set costs around 1k GPU hours, which is much slower.

# run on each worker after the distributed configs have been correctly set following the guide in evaluate_vqa_allcand_distributed.sh
cd run_scripts/vqa
bash evaluate_vqa_allcand_distributed.sh val # specify 'val' or 'test'

Visual Grounding (Referring Expression Comprehension)

This section provides the procedures for preparing the data and training and evaluating your model on visual grounding.

1. Prepare the Dataset & Checkpoints

Download data (see datasets.md) and models (see checkpoints.md) and put them in the correct directory. We provide the RefCOCO (split by UNC), RefCOCO+ (split by UNC) and RefCOCOg (split by UMD) datasets. See RefCOCO and Refer for more details. Note that in the original dataset, each region-coord (or bounding box) may correspond to multiple descriptive texts. We split these texts into multiple samples so that the region-coord in each sample corresponds to only one text. Each line of the processed dataset represents a sample in the following format: the fields uniq-id, image-id, text, region-coord (separated by commas) and image base64 string are separated by tabs.

79_1    237367  A woman in a white blouse holding a glass of wine.  230.79,121.75,423.66,463.06 9j/4AAQ...1pAz/9k=
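
For clarity, the region-coord field holds the top-left and bottom-right corners of the box as four comma-separated floats. A minimal sketch (our own code) for reading it:

# Parse the region-coord field: top-left (x0, y0) and bottom-right (x1, y1).
region_coord = "230.79,121.75,423.66,463.06"
x0, y0, x1, y1 = map(float, region_coord.split(","))
box_width, box_height = x1 - x0, y1 - y0
print((x0, y0, x1, y1), box_width, box_height)
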
2. Finetuning

Unlike the original paper, we finetune OFA with a drop-path rate of 0.2, and found that training with this hyper-parameter achieves better results. We will update the reported results of the paper later.

cd run_scripts/refcoco
nohup sh train_refcoco.sh > train_refcoco.out &  # finetune for refcoco
nohup sh train_refcocoplus.sh > train_refcocoplus.out &  # finetune for refcoco+
nohup sh train_refcocog.sh > train_refcocog.out &  # finetune for refcocog
3. Inference

Run the following commands for the evaluation.

cd run_scripts/refcoco ; sh evaluate_refcoco.sh  # inference & evaluate for refcoco/refcoco+/refcocog

Visual Entailment

We provide steps for you to reproduce our results in visual entailment. See the details below.

1. Prepare the Dataset & Checkpoints

Download data (see datasets.md) and models (see checkpoints.md) and put them in the correct directory. Each line of the processed dataset represents a sample in the following format: the fields uniq-id, image-id, image base64 string, hypothesis, caption (or text premise) and label are separated by tabs.

252244149.jpg#1r1n  252244149   /9j/4AAQ...MD/2Q==   a man in pink and gold is chewing on a wooden toothpick.   a man in pink is chewing a toothpick on the subway.   neutral 
2. Finetuning

In our experiments, the SNLI-VE finetuning is performed on 8 NVIDIA-V100 GPUs with 32GB memory. In this task, we experimented with only a few sets of hyperparameters. We believe that proper hyperparameter tuning can lead to further accuracy improvement.

cd run_scripts/snli_ve
nohup sh train_snli_ve.sh > train_snli_ve.out &  # finetune for snli_ve
3. Inference

Run the following command to obtain the results.

cd run_scripts/snli_ve ; sh evaluate_snli_ve.sh dev  # specify 'dev' or 'test'

GLUE

Here we provide steps for you to finetune and evaluate our model on language understanding tasks. We demonstrate our practice for the GLUE benchmark.

1. Prepare the Dataset & Checkpoints

Download data (see datasets.md) and models (see checkpoints.md) and put them in the correct directory. We provide 7 language understanding datasets from the GLUE benchmark, including COLA, MNLI, MRPC, QNLI, QQP, RTE and SST2. More details about these datasets can be found in this link.

2. Finetuning

For each task, we have tried multiple sets of hyperparameters (including learning rate, batch size, training epochs). The results under different sets of hyperparameters can be found in ${log_dir}.

cd run_scripts/glue
nohup sh train_cola.sh > train_cola.out &  # finetune for cola
nohup sh train_mnli.sh > train_mnli.out &  # finetune for mnli
nohup sh train_mrpc.sh > train_mrpc.out &  # finetune for mrpc
nohup sh train_qnli.sh > train_qnli.out &  # finetune for qnli
nohup sh train_qqp.sh > train_qqp.out &  # finetune for qqp
nohup sh train_rte.sh > train_rte.out &  # finetune for rte
nohup sh train_sst2.sh > train_sst2.out &  # finetune for sst2

Image Classification on ImageNet-1K

We provide the finetuning and inference code that reproduces 85.0 ImageNet-1K accuracy, slightly better than reported in our paper.

1. Prepare the Dataset & Checkpoints

Download data (see datasets.md) and models (see checkpoints.md) and put them in the correct directory. Our provided data is derived from the original ImageNet-1K (ILSVRC2012 train & validation) dataset and shares the same data split with it. To cast the classification task into the seq2seq paradigm, we use the synset words provided by Caffe as the generation target for each image class. Each line of the processed dataset represents a sample in the following format: the fields image base64 string, classification label (1-indexed, conforming to the order in synset_words.txt) and synset words of the label are separated by tabs.

_9j_4AAQS...fzX__Z  769 rugby ball
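
Since the label is a 1-indexed position in synset_words.txt, the generation target can be looked up as sketched below (our own snippet; it assumes the Caffe file format of one "<wnid> <synset words>" entry per line):

# Map a 1-indexed label to its synset words (the seq2seq generation target).
with open("synset_words.txt", encoding="utf-8") as f:
    synset_words = [line.rstrip("\n").split(" ", 1)[1] for line in f]

label = 769                     # 1-indexed, as in the sample line above
print(synset_words[label - 1])  # e.g. "rugby ball"
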
2. Shuffle the Training Data

(Optional, but achieves better finetuning accuracy.) If disk storage is sufficient, we recommend preparing the shuffled training data for each epoch in advance. In our experiments, shuffling brings around +0.2 improvement on ImageNet-1K accuracy.

cd dataset/imagenet_1k_data
ln imagenet_1k_train.tsv imagenet_1k_train_1.tsv
for idx in `seq 1 9`;do shuf imagenet_1k_train_${idx}.tsv > imagenet_1k_train_$[${idx}+1].tsv;done # each file is used for an epoch one by one
3. Finetuning

In our experiments, the ImageNet-1K finetuning is performed on 2 servers with 8 A100 GPUs each (with RDMA). We provide the finetuning script train_imagenet_distributed.sh, which supports multi-server distributed training (as well as single-server training). Please refer to the comments at the beginning of the script and set the configs correctly according to your distributed environment. If you have shuffled the training data in the previous step, please specify the training data path correctly following the guide in the script comments. The command should be run on each worker. For quick evaluation during finetuning, by default we sample 20% of the original validation split and report accuracy on this subset after each epoch. The accuracy on the validation subset is generally within ±0.1 of the accuracy on the whole validation split.

# run on each worker after the distributed and data configs have been correctly set following the guide in train_imagenet_distributed.sh
cd run_scripts/image_classify
bash train_imagenet_distributed.sh

In our experiments, the finetuning costs around 80 hours (for 32 epochs). The best accuracy on validation subset during finetuning will be around 85.0. The log is saved in ${log_dir}.

4. Inference

To get the validation accuracy on the whole ImageNet-1K validation set, run the following command. The evaluation costs around 10 GPU hours. The accuracy will be reported in the stdout (expected to be around 85.0).

cd run_scripts/image_classify ; sh evaluate_imagenet.sh  # inference & evaluate for imagenet-1k

Gigaword

We provide steps for you to reproduce our results in Gigaword. See the details below.

1. Prepare the Dataset & Checkpoints

Download data (see datasets.md) and models (see checkpoints.md) and put them in the correct directory. The original dataset is taken from UniLM and we organized the data into TSV format. Each line of the processed dataset represents a sample in the following format: the source and target texts are separated by a tab.

factory orders for manufactured goods rose #.# percent in september...  us september factory orders up #.# percent
2. Finetuning

Run the following command to train the model.

cd run_scripts/gigaword
nohup sh train_gigaword.sh > train_gigaword.out &  # finetune for gigaword
3. Inference

Run the following command to obtain the results (~36.43 ROUGE-L).

cd run_scripts/gigaword ; sh evaluate_gigaword.sh  # inference & evaluate for gigaword



Gallery

Below we provide examples of OFA in text-to-image generation and open-ended VQA. We also demonstrate its performance on an unseen task (grounded QA) and an unseen domain (visual grounding on images from unseen domains).

Text-to-Image Generation


Open-Ended VQA


Grounded QA (unseen task)


Visual Grounding (unseen domain)


Related Codebase

Getting Involved

Feel free to submit Github issues or pull requests. Welcome to contribute to our project!

To contact us, never hesitate to send an email to [email protected] or [email protected]!

Citation

Please cite our papers if you find them helpful :)

@article{wang2022ofa,
  author    = {Peng Wang and
               An Yang and
               Rui Men and
               Junyang Lin and
               Shuai Bai and
               Zhikang Li and
               Jianxin Ma and
               Chang Zhou and
               Jingren Zhou and
               Hongxia Yang},
  title     = {OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence
               Learning Framework},
  journal   = {CoRR},
  volume    = {abs/2202.03052},
  year      = {2022}
}



@article{ofa_ocr,
  author       = {Junyang Lin and
                  Xuancheng Ren and
                  Yichang Zhang and
                  Gao Liu and
                  Peng Wang and
                  An Yang and
                  Chang Zhou},
  title        = {Transferring General Multimodal Pretrained Models to Text Recognition},
  journal      = {CoRR},
  volume       = {abs/2212.09297},
  year         = {2022}
}



@article{ofa_prompt,
  author       = {Hao Yang and
                  Junyang Lin and
                  An Yang and
                  Peng Wang and
                  Chang Zhou and
                  Hongxia Yang},
  title        = {Prompt Tuning for Generative Multimodal Pretrained Models},
  journal      = {CoRR},
  volume       = {abs/2208.02532},
  year         = {2022}
}



@article{mmspeech,
  author    = {Xiaohuan Zhou and
               Jiaming Wang and
               Zeyu Cui and
               Shiliang Zhang and
               Zhijie Yan and
               Jingren Zhou and
               Chang Zhou},
  title     = {MMSpeech: Multi-modal Multi-task Encoder-Decoder Pre-training for Speech Recognition},
  journal   = {CoRR},
  volume    = {abs/2212.00500},
  year      = {2022}
}

Contributors

eltociear, faychu, justinlin610, jxst539246, logicwong, maggione, simonjjj, xcvil, yangapku, zhaoguangxiang

Issues

Problems with VQA finetuning

Hello! I am trying to finetune OFA-Large on VQA using the Visual Genome dataset, following the finetuning instructions in the repo. Unfortunately, I have encountered a bug that I have some difficulty identifying. I preprocessed the data exactly as in the example, but during training my gradients overflow and the model does not train.

slice_id 0 seek offset 0
2022-03-28 02:29:07 - trainer.py[line:703] - INFO: begin training epoch 1
2022-03-28 02:29:07 - train.py[line:296] - INFO: Start iterating over samples
2022-03-28 02:29:09 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 64.0
2022-03-28 02:29:11 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 32.0
2022-03-28 02:29:14 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 16.0
2022-03-28 02:29:15 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 8.0
2022-03-28 02:29:17 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 4.0
2022-03-28 02:29:19 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 2.0
2022-03-28 02:29:22 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 1.0
2022-03-28 02:29:23 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.5
2022-03-28 02:29:26 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.25
2022-03-28 02:29:28 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.125
2022-03-28 02:29:28 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.0625

I narrowed the issue down to the answers column. If I replace this column in my dataset with the column from the dataset provided in the repo, everything works fine. However, if I change the answers in the column, or even modify them in any way, I get the same issue. I suspected that my procedure for changing the column could be the problem, but if I "modify" the column with an empty string, it still works. Any other symbol added to the column again results in an overflow. I also tried modifying not the whole column but single elements, and found that changing certain answers does not lead to an overflow, while changing others does. I was unable to further narrow down the issue or find any pattern in it.

I train on a single server with 1 GPU.

run_scripts/caption/evaluate_caption.sh ?

First of all, thanks for your amazing work. 👍
I'm very impressed by the results you've achieved.
I wonder how to get the last line (../../results/caption/test_predict.json) of the file run_scripts/caption/evaluate_caption.sh.
I want to know how to get ../../results/caption/test_predict.json.
I cannot generate this JSON file.
I'll be waiting for the reply.
Thank you! @logicwong

model release

Thanks for releasing this amazing work!
Are there any plans to release the finetuned checkpoint for VQAv2 trained with Encouraging Loss?

How to reduce the size of dataset in Image Caption

Hi, may I ask which configuration in train_caption_stage1.sh I should adjust to reduce the data size in the image captioning task? (The data is too big for a GPU like the 3090.)
I would really appreciate any help!

Checkpoint name attribute may need updating for OFA_large

I downloaded the pre-trained OFA-large checkpoint here: https://ofa-beijing.oss-cn-beijing.aliyuncs.com/checkpoints/ofa_large.pt

And try to load the model by

from fairseq import checkpoint_utils, tasks

task = tasks.setup_task(cfg.task)
models, cfg = checkpoint_utils.load_model_ensemble(
    [pre_trained_model_path],  # load_model_ensemble expects a list of checkpoint paths
    task=task
)

But it doesn't work; it reports that it cannot find the corresponding model name.

Then I checked the checkpoint and found that the name is m6_unify_large.

Then I modified this line https://github.com/OFA-Sys/OFA/blob/main/models/ofa/ofa.py#L320 to replace ofa_large with m6_unify_large, and then everything works fine.

I guess the checkpoint state may need to be updated a bit and uploaded again.

unrecognized arguments: --warmup-ratio

Hi,
I am trying to train your model for the caption task. To do that, I cloned your latest repository and followed your instructions. First of all, I faced a max_epoch error caused by the shell version. After that, when I tried to train the model, it gave me unrecognized arguments: --warmup-ratio=0.06. I looked through your training code and could not find the warmup-ratio variable; did you remove it? How should I solve this issue?

/home/XXXX/.conda/envs/ofa/lib/python3.7/site-packages/torch/distributed/launch.py:186: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  FutureWarning,
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
usage: train.py [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL]
                [--log-format {json,none,simple,tqdm}] [--log-file LOG_FILE]
                [--aim-repo AIM_REPO] [--aim-run-hash AIM_RUN_HASH]
                [--tensorboard-logdir TENSORBOARD_LOGDIR]
                [--wandb-project WANDB_PROJECT] [--azureml-logging]
                [--seed SEED] [--cpu] [--tpu] [--bf16]
                [--memory-efficient-bf16] [--fp16] [--memory-efficient-fp16]
                [--fp16-no-flatten-grads] [--fp16-init-scale FP16_INIT_SCALE]
                [--fp16-scale-window FP16_SCALE_WINDOW]
                [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
                [--on-cpu-convert-precision] [--min-loss-scale MIN_LOSS_SCALE]
                [--threshold-loss-scale THRESHOLD_LOSS_SCALE] [--amp]
                [--amp-batch-retries AMP_BATCH_RETRIES]
                [--amp-init-scale AMP_INIT_SCALE]
                [--amp-scale-window AMP_SCALE_WINDOW] [--user-dir USER_DIR]
                [--empty-cache-freq EMPTY_CACHE_FREQ]
                [--all-gather-list-size ALL_GATHER_LIST_SIZE]
                [--model-parallel-size MODEL_PARALLEL_SIZE]
                [--quantization-config-path QUANTIZATION_CONFIG_PATH]
                [--profile] [--reset-logging] [--suppress-crashes]
                [--use-plasma-view] [--plasma-path PLASMA_PATH]
                [--criterion {adaptive_loss,composite_loss,cross_entropy,ctc,fastspeech2,hubert,label_smoothed_cross_entropy,latency_augmented_label_smoothed_cross_entropy,label_smoothed_cross_entropy_with_alignment,label_smoothed_cross_entropy_with_ctc,legacy_masked_lm_loss,masked_lm,model,nat_loss,sentence_prediction,sentence_ranking,tacotron2,speech_to_unit,speech_to_spectrogram,speech_unit_lm_criterion,wav2vec,vocab_parallel_cross_entropy,scst_reward_criterion,adjust_label_smoothed_cross_entropy,clip_scst_reward_criterion,adjust_label_smoothed_encouraging_loss}]
                [--tokenizer {moses,nltk,space}]
                [--bpe {byte_bpe,bytes,characters,fastbpe,gpt2,bert,hf_byte_bpe,sentencepiece,subword_nmt}]
                [--optimizer {adadelta,adafactor,adagrad,adam,adamax,composite,cpu_adam,lamb,nag,sgd}]
                [--lr-scheduler {cosine,fixed,inverse_sqrt,manual,pass_through,polynomial_decay,reduce_lr_on_plateau,step,tri_stage,triangular}]
                [--scoring {bert_score,sacrebleu,bleu,chrf,meteor,wer}]
                [--task TASK] [--num-workers NUM_WORKERS]
                [--skip-invalid-size-inputs-valid-test]
                [--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE]
                [--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE]
                [--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE]
                [--dataset-impl {raw,lazy,cached,mmap,fasta,huffman}]
                [--data-buffer-size DATA_BUFFER_SIZE]
                [--train-subset TRAIN_SUBSET] [--valid-subset VALID_SUBSET]
                [--combine-valid-subsets] [--ignore-unused-valid-subsets]
                [--validate-interval VALIDATE_INTERVAL]
                [--validate-interval-updates VALIDATE_INTERVAL_UPDATES]
                [--validate-after-updates VALIDATE_AFTER_UPDATES]
                [--fixed-validation-seed FIXED_VALIDATION_SEED]
                [--disable-validation] [--max-tokens-valid MAX_TOKENS_VALID]
                [--batch-size-valid BATCH_SIZE_VALID]
                [--max-valid-steps MAX_VALID_STEPS] [--curriculum CURRICULUM]
                [--gen-subset GEN_SUBSET] [--num-shards NUM_SHARDS]
                [--shard-id SHARD_ID] [--grouped-shuffling]
                [--update-epoch-batch-itr UPDATE_EPOCH_BATCH_ITR]
                [--update-ordered-indices-seed]
                [--distributed-world-size DISTRIBUTED_WORLD_SIZE]
                [--distributed-num-procs DISTRIBUTED_NUM_PROCS]
                [--distributed-rank DISTRIBUTED_RANK]
                [--distributed-backend DISTRIBUTED_BACKEND]
                [--distributed-init-method DISTRIBUTED_INIT_METHOD]
                [--distributed-port DISTRIBUTED_PORT] [--device-id DEVICE_ID]
                [--distributed-no-spawn]
                [--ddp-backend {c10d,fully_sharded,legacy_ddp,no_c10d,pytorch_ddp,slowmo}]
                [--ddp-comm-hook {none,fp16}] [--bucket-cap-mb BUCKET_CAP_MB]
                [--fix-batches-to-gpus] [--find-unused-parameters]
                [--gradient-as-bucket-view] [--fast-stat-sync]
                [--heartbeat-timeout HEARTBEAT_TIMEOUT] [--broadcast-buffers]
                [--slowmo-momentum SLOWMO_MOMENTUM]
                [--slowmo-base-algorithm SLOWMO_BASE_ALGORITHM]
                [--localsgd-frequency LOCALSGD_FREQUENCY]
                [--nprocs-per-node NPROCS_PER_NODE]
                [--pipeline-model-parallel]
                [--pipeline-balance PIPELINE_BALANCE]
                [--pipeline-devices PIPELINE_DEVICES]
                [--pipeline-chunks PIPELINE_CHUNKS]
                [--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE]
                [--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES]
                [--pipeline-decoder-balance PIPELINE_DECODER_BALANCE]
                [--pipeline-decoder-devices PIPELINE_DECODER_DEVICES]
                [--pipeline-checkpoint {always,never,except_last}]
                [--zero-sharding {none,os}] [--no-reshard-after-forward]
                [--fp32-reduce-scatter] [--cpu-offload] [--use-sharded-state]
                [--not-fsdp-flatten-parameters] [--arch ARCH]
                [--max-epoch MAX_EPOCH] [--max-update MAX_UPDATE]
                [--stop-time-hours STOP_TIME_HOURS] [--clip-norm CLIP_NORM]
                [--sentence-avg] [--update-freq UPDATE_FREQ] [--lr LR]
                [--stop-min-lr STOP_MIN_LR] [--use-bmuf]
                [--skip-remainder-batch] [--save-dir SAVE_DIR]
                [--restore-file RESTORE_FILE] [--continue-once CONTINUE_ONCE]
                [--finetune-from-model FINETUNE_FROM_MODEL]
                [--reset-dataloader] [--reset-lr-scheduler] [--reset-meters]
                [--reset-optimizer]
                [--optimizer-overrides OPTIMIZER_OVERRIDES]
                [--save-interval SAVE_INTERVAL]
                [--save-interval-updates SAVE_INTERVAL_UPDATES]
                [--keep-interval-updates KEEP_INTERVAL_UPDATES]
                [--keep-interval-updates-pattern KEEP_INTERVAL_UPDATES_PATTERN]
                [--keep-last-epochs KEEP_LAST_EPOCHS]
                [--keep-best-checkpoints KEEP_BEST_CHECKPOINTS] [--no-save]
                [--no-epoch-checkpoints] [--no-last-checkpoints]
                [--no-save-optimizer-state]
                [--best-checkpoint-metric BEST_CHECKPOINT_METRIC]
                [--maximize-best-checkpoint-metric] [--patience PATIENCE]
                [--checkpoint-suffix CHECKPOINT_SUFFIX]
                [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
                [--load-checkpoint-on-all-dp-ranks]
                [--write-checkpoints-asynchronously] [--store-ema]
                [--ema-decay EMA_DECAY] [--ema-start-update EMA_START_UPDATE]
                [--ema-seed-model EMA_SEED_MODEL]
                [--ema-update-freq EMA_UPDATE_FREQ] [--ema-fp32]
                [--activation-fn {relu,gelu,gelu_fast,gelu_accurate,tanh,linear}]
                [--dropout D] [--attention-dropout D] [--activation-dropout D]
                [--encoder-embed-path STR] [--encoder-embed-dim N]
                [--encoder-ffn-embed-dim N] [--encoder-layers N]
                [--encoder-attention-heads N] [--encoder-normalize-before]
                [--encoder-learned-pos] [--decoder-embed-path STR]
                [--decoder-embed-dim N] [--decoder-ffn-embed-dim N]
                [--decoder-layers N] [--decoder-attention-heads N]
                [--decoder-learned-pos] [--decoder-normalize-before]
                [--decoder-output-dim N] [--share-decoder-input-output-embed]
                [--share-all-embeddings] [--no-token-positional-embeddings]
                [--adaptive-softmax-cutoff EXPR]
                [--adaptive-softmax-dropout D] [--layernorm-embedding]
                [--no-scale-embedding] [--checkpoint-activations]
                [--offload-activations] [--no-cross-attention]
                [--cross-self-attention] [--encoder-layerdrop D]
                [--decoder-layerdrop D]
                [--encoder-layers-to-keep ENCODER_LAYERS_TO_KEEP]
                [--decoder-layers-to-keep DECODER_LAYERS_TO_KEEP]
                [--quant-noise-pq D] [--quant-noise-pq-block-size D]
                [--quant-noise-scalar D] [--min-params-to-wrap D]
                [--resnet-drop-path-rate RESNET_DROP_PATH_RATE]
                [--encoder-drop-path-rate ENCODER_DROP_PATH_RATE]
                [--decoder-drop-path-rate DECODER_DROP_PATH_RATE]
                [--token-bucket-size TOKEN_BUCKET_SIZE]
                [--image-bucket-size IMAGE_BUCKET_SIZE]
                [--attn-scale-factor ATTN_SCALE_FACTOR] [--freeze-resnet]
                [--freeze-encoder-embedding] [--freeze-decoder-embedding]
                [--add-type-embedding]
                [--resnet-type {resnet50,resnet101,resnet152}]
                [--resnet-model-path STR] [--code-image-size CODE_IMAGE_SIZE]
                [--patch-layernorm-embedding] [--code-layernorm-embedding]
                [--entangle-position-embedding] [--disable-entangle]
                [--sync-bn] [--scale-attn] [--scale-fc] [--scale-heads]
                [--scale-resids] [--pooler-dropout D]
                [--pooler-classifier {mlp,linear}]
                [--pooler-activation-fn {relu,gelu,gelu_fast,gelu_accurate,tanh,linear}]
                [--spectral-norm-classification-head]
                [--selected-cols SELECTED_COLS] [--bpe-dir BPE_DIR]
                [--max-source-positions MAX_SOURCE_POSITIONS]
                [--max-target-positions MAX_TARGET_POSITIONS]
                [--max-src-length MAX_SRC_LENGTH]
                [--max-tgt-length MAX_TGT_LENGTH]
                [--code-dict-size CODE_DICT_SIZE]
                [--patch-image-size PATCH_IMAGE_SIZE] [--num-bins NUM_BINS]
                [--imagenet-default-mean-and-std]
                [--constraint-range CONSTRAINT_RANGE] [--eval-bleu]
                [--eval-cider] [--eval-args EVAL_ARGS] [--eval-print-samples]
                [--eval-cider-cached-tokens EVAL_CIDER_CACHED_TOKENS] [--scst]
                [--scst-args SCST_ARGS] [--label-smoothing LABEL_SMOOTHING]
                [--report-accuracy] [--ignore-prefix-size IGNORE_PREFIX_SIZE]
                [--ignore-eos] [--drop-worst-ratio DROP_WORST_RATIO]
                [--drop-worst-after DROP_WORST_AFTER] [--use-rdrop]
                [--reg-alpha REG_ALPHA] [--sample-patch-num SAMPLE_PATCH_NUM]
                [--adam-betas ADAM_BETAS] [--adam-eps ADAM_EPS]
                [--weight-decay WEIGHT_DECAY] [--use-old-adam]
                [--fp16-adam-stats] [--warmup-updates WARMUP_UPDATES]
                [--force-anneal FORCE_ANNEAL]
                [--end-learning-rate END_LEARNING_RATE] [--power POWER]
                [--total-num-update TOTAL_NUM_UPDATE] [--pad PAD] [--eos EOS]
                [--unk UNK]
                data
usage: train.py [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL]
                [--log-format {json,none,simple,tqdm}] [--log-file LOG_FILE]
                [--aim-repo AIM_REPO] [--aim-run-hash AIM_RUN_HASH]
                [--tensorboard-logdir TENSORBOARD_LOGDIR]
                [--wandb-project WANDB_PROJECT] [--azureml-logging]
                [--seed SEED] [--cpu] [--tpu] [--bf16]
                [--memory-efficient-bf16] [--fp16] [--memory-efficient-fp16]
                [--fp16-no-flatten-grads] [--fp16-init-scale FP16_INIT_SCALE]
                [--fp16-scale-window FP16_SCALE_WINDOW]
                [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
                [--on-cpu-convert-precision] [--min-loss-scale MIN_LOSS_SCALE]
                [--threshold-loss-scale THRESHOLD_LOSS_SCALE] [--amp]
                [--amp-batch-retries AMP_BATCH_RETRIES]
                [--amp-init-scale AMP_INIT_SCALE]
                [--amp-scale-window AMP_SCALE_WINDOW] [--user-dir USER_DIR]
                [--empty-cache-freq EMPTY_CACHE_FREQ]
                [--all-gather-list-size ALL_GATHER_LIST_SIZE]
                [--model-parallel-size MODEL_PARALLEL_SIZE]
                [--quantization-config-path QUANTIZATION_CONFIG_PATH]
                [--profile] [--reset-logging] [--suppress-crashes]
                [--use-plasma-view] [--plasma-path PLASMA_PATH]
                [--criterion {adaptive_loss,composite_loss,cross_entropy,ctc,fastspeech2,hubert,label_smoothed_cross_entropy,latency_augmented_label_smoothed_cross_entropy,label_smoothed_cross_entropy_with_alignment,label_smoothed_cross_entropy_with_ctc,legacy_masked_lm_loss,masked_lm,model,nat_loss,sentence_prediction,sentence_ranking,tacotron2,speech_to_unit,speech_to_spectrogram,speech_unit_lm_criterion,wav2vec,vocab_parallel_cross_entropy,scst_reward_criterion,adjust_label_smoothed_cross_entropy,clip_scst_reward_criterion,adjust_label_smoothed_encouraging_loss}]
                [--tokenizer {moses,nltk,space}]
                [--bpe {byte_bpe,bytes,characters,fastbpe,gpt2,bert,hf_byte_bpe,sentencepiece,subword_nmt}]
                [--optimizer {adadelta,adafactor,adagrad,adam,adamax,composite,cpu_adam,lamb,nag,sgd}]
                [--lr-scheduler {cosine,fixed,inverse_sqrt,manual,pass_through,polynomial_decay,reduce_lr_on_plateau,step,tri_stage,triangular}]
                [--scoring {bert_score,sacrebleu,bleu,chrf,meteor,wer}]
                [--task TASK] [--num-workers NUM_WORKERS]
                [--skip-invalid-size-inputs-valid-test]
                [--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE]
                [--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE]
                [--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE]
                [--dataset-impl {raw,lazy,cached,mmap,fasta,huffman}]
                [--data-buffer-size DATA_BUFFER_SIZE]
                [--train-subset TRAIN_SUBSET] [--valid-subset VALID_SUBSET]
                [--combine-valid-subsets] [--ignore-unused-valid-subsets]
                [--validate-interval VALIDATE_INTERVAL]
                [--validate-interval-updates VALIDATE_INTERVAL_UPDATES]
                [--validate-after-updates VALIDATE_AFTER_UPDATES]
                [--fixed-validation-seed FIXED_VALIDATION_SEED]
                [--disable-validation] [--max-tokens-valid MAX_TOKENS_VALID]
                [--batch-size-valid BATCH_SIZE_VALID]
                [--max-valid-steps MAX_VALID_STEPS] [--curriculum CURRICULUM]
                [--gen-subset GEN_SUBSET] [--num-shards NUM_SHARDS]
                [--shard-id SHARD_ID] [--grouped-shuffling]
                [--update-epoch-batch-itr UPDATE_EPOCH_BATCH_ITR]
                [--update-ordered-indices-seed]
                [--distributed-world-size DISTRIBUTED_WORLD_SIZE]
                [--distributed-num-procs DISTRIBUTED_NUM_PROCS]
                [--distributed-rank DISTRIBUTED_RANK]
                [--distributed-backend DISTRIBUTED_BACKEND]
                [--distributed-init-method DISTRIBUTED_INIT_METHOD]
                [--distributed-port DISTRIBUTED_PORT] [--device-id DEVICE_ID]
                [--distributed-no-spawn]
                [--ddp-backend {c10d,fully_sharded,legacy_ddp,no_c10d,pytorch_ddp,slowmo}]
                [--ddp-comm-hook {none,fp16}] [--bucket-cap-mb BUCKET_CAP_MB]
                [--fix-batches-to-gpus] [--find-unused-parameters]
                [--gradient-as-bucket-view] [--fast-stat-sync]
                [--heartbeat-timeout HEARTBEAT_TIMEOUT] [--broadcast-buffers]
                [--slowmo-momentum SLOWMO_MOMENTUM]
                [--slowmo-base-algorithm SLOWMO_BASE_ALGORITHM]
                [--localsgd-frequency LOCALSGD_FREQUENCY]
                [--nprocs-per-node NPROCS_PER_NODE]
                [--pipeline-model-parallel]
                [--pipeline-balance PIPELINE_BALANCE]
                [--pipeline-devices PIPELINE_DEVICES]
                [--pipeline-chunks PIPELINE_CHUNKS]
                [--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE]
                [--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES]
                [--pipeline-decoder-balance PIPELINE_DECODER_BALANCE]
                [--pipeline-decoder-devices PIPELINE_DECODER_DEVICES]
                [--pipeline-checkpoint {always,never,except_last}]
                [--zero-sharding {none,os}] [--no-reshard-after-forward]
                [--fp32-reduce-scatter] [--cpu-offload] [--use-sharded-state]
                [--not-fsdp-flatten-parameters] [--arch ARCH]
                [--max-epoch MAX_EPOCH] [--max-update MAX_UPDATE]
                [--stop-time-hours STOP_TIME_HOURS] [--clip-norm CLIP_NORM]
                [--sentence-avg] [--update-freq UPDATE_FREQ] [--lr LR]
                [--stop-min-lr STOP_MIN_LR] [--use-bmuf]
                [--skip-remainder-batch] [--save-dir SAVE_DIR]
                [--restore-file RESTORE_FILE] [--continue-once CONTINUE_ONCE]
                [--finetune-from-model FINETUNE_FROM_MODEL]
                [--reset-dataloader] [--reset-lr-scheduler] [--reset-meters]
                [--reset-optimizer]
                [--optimizer-overrides OPTIMIZER_OVERRIDES]
                [--save-interval SAVE_INTERVAL]
                [--save-interval-updates SAVE_INTERVAL_UPDATES]
                [--keep-interval-updates KEEP_INTERVAL_UPDATES]
                [--keep-interval-updates-pattern KEEP_INTERVAL_UPDATES_PATTERN]
                [--keep-last-epochs KEEP_LAST_EPOCHS]
                [--keep-best-checkpoints KEEP_BEST_CHECKPOINTS] [--no-save]
                [--no-epoch-checkpoints] [--no-last-checkpoints]
                [--no-save-optimizer-state]
                [--best-checkpoint-metric BEST_CHECKPOINT_METRIC]
                [--maximize-best-checkpoint-metric] [--patience PATIENCE]
                [--checkpoint-suffix CHECKPOINT_SUFFIX]
                [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
                [--load-checkpoint-on-all-dp-ranks]
                [--write-checkpoints-asynchronously] [--store-ema]
                [--ema-decay EMA_DECAY] [--ema-start-update EMA_START_UPDATE]
                [--ema-seed-model EMA_SEED_MODEL]
                [--ema-update-freq EMA_UPDATE_FREQ] [--ema-fp32]
                [--activation-fn {relu,gelu,gelu_fast,gelu_accurate,tanh,linear}]
                [--dropout D] [--attention-dropout D] [--activation-dropout D]
                [--encoder-embed-path STR] [--encoder-embed-dim N]
                [--encoder-ffn-embed-dim N] [--encoder-layers N]
                [--encoder-attention-heads N] [--encoder-normalize-before]
                [--encoder-learned-pos] [--decoder-embed-path STR]
                [--decoder-embed-dim N] [--decoder-ffn-embed-dim N]
                [--decoder-layers N] [--decoder-attention-heads N]
                [--decoder-learned-pos] [--decoder-normalize-before]
                [--decoder-output-dim N] [--share-decoder-input-output-embed]
                [--share-all-embeddings] [--no-token-positional-embeddings]
                [--adaptive-softmax-cutoff EXPR]
                [--adaptive-softmax-dropout D] [--layernorm-embedding]
                [--no-scale-embedding] [--checkpoint-activations]
                [--offload-activations] [--no-cross-attention]
                [--cross-self-attention] [--encoder-layerdrop D]
                [--decoder-layerdrop D]
                [--encoder-layers-to-keep ENCODER_LAYERS_TO_KEEP]
                [--decoder-layers-to-keep DECODER_LAYERS_TO_KEEP]
                [--quant-noise-pq D] [--quant-noise-pq-block-size D]
                [--quant-noise-scalar D] [--min-params-to-wrap D]
                [--resnet-drop-path-rate RESNET_DROP_PATH_RATE]
                [--encoder-drop-path-rate ENCODER_DROP_PATH_RATE]
                [--decoder-drop-path-rate DECODER_DROP_PATH_RATE]
                [--token-bucket-size TOKEN_BUCKET_SIZE]
                [--image-bucket-size IMAGE_BUCKET_SIZE]
                [--attn-scale-factor ATTN_SCALE_FACTOR] [--freeze-resnet]
                [--freeze-encoder-embedding] [--freeze-decoder-embedding]
                [--add-type-embedding]
                [--resnet-type {resnet50,resnet101,resnet152}]
                [--resnet-model-path STR] [--code-image-size CODE_IMAGE_SIZE]
                [--patch-layernorm-embedding] [--code-layernorm-embedding]
                [--entangle-position-embedding] [--disable-entangle]
                [--sync-bn] [--scale-attn] [--scale-fc] [--scale-heads]
                [--scale-resids] [--pooler-dropout D]
                [--pooler-classifier {mlp,linear}]
                [--pooler-activation-fn {relu,gelu,gelu_fast,gelu_accurate,tanh,linear}]
                [--spectral-norm-classification-head]
                [--selected-cols SELECTED_COLS] [--bpe-dir BPE_DIR]
                [--max-source-positions MAX_SOURCE_POSITIONS]
                [--max-target-positions MAX_TARGET_POSITIONS]
                [--max-src-length MAX_SRC_LENGTH]
                [--max-tgt-length MAX_TGT_LENGTH]
                [--code-dict-size CODE_DICT_SIZE]
                [--patch-image-size PATCH_IMAGE_SIZE] [--num-bins NUM_BINS]
                [--imagenet-default-mean-and-std]
                [--constraint-range CONSTRAINT_RANGE] [--eval-bleu]
                [--eval-cider] [--eval-args EVAL_ARGS] [--eval-print-samples]
                [--eval-cider-cached-tokens EVAL_CIDER_CACHED_TOKENS] [--scst]
                [--scst-args SCST_ARGS] [--label-smoothing LABEL_SMOOTHING]
                [--report-accuracy] [--ignore-prefix-size IGNORE_PREFIX_SIZE]
                [--ignore-eos] [--drop-worst-ratio DROP_WORST_RATIO]
                [--drop-worst-after DROP_WORST_AFTER] [--use-rdrop]
                [--reg-alpha REG_ALPHA] [--sample-patch-num SAMPLE_PATCH_NUM]
                [--adam-betas ADAM_BETAS] [--adam-eps ADAM_EPS]
                [--weight-decay WEIGHT_DECAY] [--use-old-adam]
                [--fp16-adam-stats] [--warmup-updates WARMUP_UPDATES]
                [--force-anneal FORCE_ANNEAL]
                [--end-learning-rate END_LEARNING_RATE] [--power POWER]
                [--total-num-update TOTAL_NUM_UPDATE] [--pad PAD] [--eos EOS]
                [--unk UNK]
                data
train.py: error: unrecognized arguments: --warmup-ratio=0.06
(the same usage message and "unrecognized arguments: --warmup-ratio=0.06" error are repeated by the other distributed ranks)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 31799) of binary: /home/XXXX/.conda/envs/ofa/bin/python
Traceback (most recent call last):
  File "/home/XXXX/.conda/envs/ofa/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/XXXX/.conda/envs/ofa/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/XXXX/.conda/envs/ofa/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/XXXX/.conda/envs/ofa/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/XXXX/.conda/envs/ofa/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/XXXX/.conda/envs/ofa/lib/python3.7/site-packages/torch/distributed/run.py", line 718, in run
    )(*cmd_args)
  File "/home/XXXX/.conda/envs/ofa/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/XXXX/.conda/envs/ofa/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
../../train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-04-30_22:19:49
  host      : hartley
  rank      : 1 (local_rank: 1)
  exitcode  : 2 (pid: 31800)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2022-04-30_22:19:49
  host      : hartley
  rank      : 2 (local_rank: 2)
  exitcode  : 2 (pid: 31801)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2022-04-30_22:19:49
  host      : hartley
  rank      : 3 (local_rank: 3)
  exitcode  : 2 (pid: 31802)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-04-30_22:19:49
  host      : hartley
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 31799)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
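The usage text above does list --warmup-updates and --total-num-update, so when --warmup-ratio is rejected like this, one possible stop-gap (an assumption, not the repo's official fix) is to convert the ratio into an explicit number of warmup updates and pass that instead. A minimal sketch with placeholder values:

# Convert a ratio-style warmup into the explicit flags listed in the usage above.
# The numbers are placeholders; take them from your own run script.
total_num_updates = 40000   # value passed as --total-num-update
warmup_ratio = 0.06         # value the script tried to pass as --warmup-ratio

warmup_updates = int(total_num_updates * warmup_ratio)
print(f"--warmup-updates={warmup_updates} --total-num-update={total_num_updates}")
# -> --warmup-updates=2400 --total-num-update=40000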

original_id of items in datasets

Thanks for providing the preprocessed datasets for finetuning. Each dataset only contains a uniq_id, without the original_id from its source dataset. For example, I cannot get the caption_id of items in either the image_gen datasets or the caption datasets. Could you please add the original_id to each item in the datasets, or provide a mapping file between the original_id and the uniq_id?

Better result for VQA

Would you mind sharing some tricks for getting the ~0.4 improvement (80.04 vs. 80.45) with the current codebase on VQA? It is probably not due to shuffling the training data.

custom training code

I just want to write custom training code with PyTorch Lightning, but I can't find the model declaration and the model config.

Where can I find them?

How to train VQA on my custom data?

Hello! I am trying to finetune OFA-large on VQA with a custom dataset, following the finetuning instructions in the repo. I have checked my .tsv and .pkl files several times and they match the provided samples. But after running "bash train_vqa_distributed.sh", the terminal only prints:

total_num_updates 40000
warmup_updates 1000
lr 5e-5
patch_image_size 480

The GPU usage rises to a certain value, then suddenly drops back to zero, and the program ends. I am training on a single server with 2 GPUs. Looking forward to your reply, and thanks for sharing your work!
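For what it's worth, here is a small sanity-check sketch for a custom .tsv (the layout is an assumption: one of the tab-separated columns is expected to hold a base64-encoded image, as in the provided samples; the file name is a placeholder):

import base64, io, sys
from PIL import Image

path = sys.argv[1] if len(sys.argv) > 1 else "vqa_train.tsv"  # placeholder path
with open(path) as f:
    for i, line in enumerate(f):
        cols = line.rstrip("\n").split("\t")
        image_ok = False
        for col in cols:  # find whichever column decodes as an image
            try:
                Image.open(io.BytesIO(base64.urlsafe_b64decode(col))).verify()
                image_ok = True
                break
            except Exception:
                continue
        print(f"row {i}: {len(cols)} columns, decodable image: {image_ok}")
        if i >= 4:  # only inspect the first few rows
            break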

RuntimeError: CUDA error: device-side assert triggered

I wanted to run the given pretraining data examples, so I downloaded pretrain_data_examples.zip, unzipped it into the directory ofa/dataset/, and renamed the folder pretrain_data (without _examples). I also downloaded ofa_base.pt and put it in the directory ofa/checkpoints/.
When I cd into ofa/run_scripts/pretraining/ and run the command bash pretrain_ofa_base.sh, I get the error below. I searched on Google and the suggestion was to check the type and value range of the target involved in the cross-entropy calculation, but I am not familiar with this repo. Could anybody tell me how to solve it? I would appreciate it very much.

Traceback (most recent call last):
 File "../../train.py", line 528, in <module>
  cli_main()
 File "../../train.py", line 521, in cli_main
  distributed_utils.call_main(cfg, main)
 File "/media/wuzhaoji/Data/OFA/fairseq/distributed/utils.py", line 374, in call_main
  distributed_main(cfg.distributed_training.device_id, main, cfg, kwargs)
 File "/media/wuzhaoji/Data/OFA/fairseq/distributed/utils.py", line 348, in distributed_main
  main(cfg, **kwargs)
 File "../../train.py", line 190, in main
  valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
 File "/home/wuzhaoji/anaconda3/envs/pytorch/lib/python3.7/contextlib.py", line 74, in inner
  return func(*args, **kwds)
 File "../../train.py", line 301, in train
  log_output = trainer.train_step(samples)
 File "/home/wuzhaoji/anaconda3/envs/pytorch/lib/python3.7/contextlib.py", line 74, in inner
  return func(*args, **kwds)
 File "/media/wuzhaoji/Data/OFA/trainer.py", line 806, in train_step
  raise e
 File "/media/wuzhaoji/Data/OFA/trainer.py", line 780, in train_step
  **extra_kwargs
 File "/media/wuzhaoji/Data/OFA/tasks/ofa_task.py", line 319, in train_step
  loss, sample_size, logging_output = criterion(model, sample, update_num=update_num)
 File "/home/wuzhaoji/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
  return forward_call(*input, **kwargs)
 File "/media/wuzhaoji/Data/OFA/criterions/label_smoothed_cross_entropy.py", line 179, in forward
  loss_v1, sample_size_v1, logging_output_v1 = self.forward(model, sample[0], update_num, reduce)
 File "/media/wuzhaoji/Data/OFA/criterions/label_smoothed_cross_entropy.py" line 199, in forward
  net_output = model(**sample["net_input"])
 File "/home/wuzhaoji/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
  return forward_call(*input, **kwargs)
 File "/media/wuzhaoji/Data/OFA/models/ofa/ofa.py", line 97, in forward
  sample_patch_num=sample_patch_num
 File "/home/wuzhaoji/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
  return forward_call(*input, **kwargs)
 File "/media/wuzhaoji/Data/OFA/models/ofa/unify_transformer.py", line 670, in forward
  sample_patch_num)
 File "/media/wuzhaoji/Data/OFA/models/ofa/unify_transformer.py", line 737, in forward_scriptable
  if has_pads:
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. 
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: device-side assert triggered....
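As a sketch of the "check the type and value range of the target" suggestion mentioned above (placeholder names, not OFA code): device-side asserts often come from label ids that fall outside the vocabulary/embedding size, which can be checked on CPU before training.

import torch

vocab_size = 59457                               # placeholder: len(task.target_dictionary)
targets = torch.randint(0, vocab_size, (4, 16))  # placeholder batch of target token ids

assert targets.dtype == torch.long, "targets should be integer (long) ids"
assert targets.min() >= 0, "negative target id"
assert targets.max() < vocab_size, "target id >= vocabulary size"
print("targets look valid")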

ConfigAttributeError when load the checkpoint

Hi, thanks for the great work! I run into a problem when loading the pre-trained checkpoint (refcocog_large_best.pt). I load the model with:

overrides={"bpe_dir":"utils/BPE"}
models, cfg, task = checkpoint_utils.load_model_ensemble_and_task(
        utils.split_paths('checkpoints/refcocog.pt'),
        arg_overrides=overrides
    )

The following error occurs:

Traceback (most recent call last):
  File "eval_refcoco.py", line 22, in <module>
    arg_overrides=overrides
  File "/home/tiger/.local/lib/python3.7/site-packages/fairseq-1.0.0a0+4095baa-py3.7-linux-x86_64.egg/fairseq/checkpoint_utils.py", line 457, in load_model_ensemble_and_task
    model = task.build_model(cfg.model)
  File "/opt/tiger/OFA_offical/tasks/mm_tasks/refcoco.py", line 79, in build_model
    if self.cfg.scst:
  File "/home/tiger/.local/lib/python3.7/site-packages/omegaconf/dictconfig.py", line 305, in __getattr__
    self._format_and_raise(key=key, value=None, cause=e)
  File "/home/tiger/.local/lib/python3.7/site-packages/omegaconf/base.py", line 101, in _format_and_raise
    type_override=type_override,
  File "/home/tiger/.local/lib/python3.7/site-packages/omegaconf/_utils.py", line 629, in format_and_raise
    _raise(ex, cause)
  File "/home/tiger/.local/lib/python3.7/site-packages/omegaconf/_utils.py", line 610, in _raise
    raise ex  # set end OC_CAUSE=1 for full backtrace
  File "/home/tiger/.local/lib/python3.7/site-packages/omegaconf/dictconfig.py", line 303, in __getattr__
    return self._get_impl(key=key, default_value=DEFAULT_VALUE_MARKER)
  File "/home/tiger/.local/lib/python3.7/site-packages/omegaconf/dictconfig.py", line 361, in _get_impl
    node = self._get_node(key=key)
  File "/home/tiger/.local/lib/python3.7/site-packages/omegaconf/dictconfig.py", line 383, in _get_node
    self._validate_get(key)
  File "/home/tiger/.local/lib/python3.7/site-packages/omegaconf/dictconfig.py", line 136, in _validate_get
    key=key, value=value, cause=ConfigAttributeError(msg)
  File "/home/tiger/.local/lib/python3.7/site-packages/omegaconf/base.py", line 101, in _format_and_raise
    type_override=type_override,
  File "/home/tiger/.local/lib/python3.7/site-packages/omegaconf/_utils.py", line 694, in format_and_raise
    _raise(ex, cause)
  File "/home/tiger/.local/lib/python3.7/site-packages/omegaconf/_utils.py", line 610, in _raise
    raise ex  # set end OC_CAUSE=1 for full backtrace
omegaconf.errors.ConfigAttributeError: Key 'scst' not in 'RefcocoConfig'
        full_key: scst
        reference_type=Optional[RefcocoConfig]
        object_type=RefcocoConfig

I would appreciate your help!
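A minimal illustration (not OFA code) of where this omegaconf error comes from: a struct config raises ConfigAttributeError for keys it does not define, and open_dict is one way to add the missing key before it is read. Whether inserting scst this way is the right fix for RefcocoConfig is an assumption, not something the traceback confirms.

from omegaconf import OmegaConf, open_dict

cfg = OmegaConf.create({"bpe_dir": "utils/BPE"})
OmegaConf.set_struct(cfg, True)        # struct configs reject unknown keys

try:
    _ = cfg.scst                       # missing key -> ConfigAttributeError
except Exception as e:
    print(type(e).__name__, ":", e)

with open_dict(cfg):                   # temporarily allow adding new keys
    cfg.scst = False
print(cfg.scst)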

model initialize part

Where is the OFAModel class initialized?

I want to use the model like

"model = OFAModel(args, encoder, decoder)"

but I can't find the initialization code for args, encoder, and decoder. Where can I find it?
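For reference, one way (a sketch, not the only option) to get an initialized OFAModel without hand-building args, encoder, and decoder is to load a released checkpoint through fairseq's checkpoint utilities, with this repo on the Python path so its tasks and models are registered. The checkpoint path and bpe_dir below are placeholders.

from fairseq import checkpoint_utils, utils

overrides = {"bpe_dir": "utils/BPE"}                # placeholder override
models, cfg, task = checkpoint_utils.load_model_ensemble_and_task(
    utils.split_paths("checkpoints/ofa_base.pt"),   # placeholder checkpoint path
    arg_overrides=overrides,
)
model = models[0]   # the task builds the full model (encoder + decoder) internally
print(type(model).__name__, sum(p.numel() for p in model.parameters()))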

Unsuccessful VQA checkpoint evaluation

Greetings! I tried to run the test evaluation on a VQA checkpoint with beam search, but I never got the final .json result. Here is my log:

2022-02-24 14:24:56 | INFO | fairseq.logging.progress_bar | :   6961 / 6997 sentences=16
2022-02-24 14:25:14 | INFO | fairseq.logging.progress_bar | :   6971 / 6997 sentences=16
2022-02-24 14:25:31 | INFO | fairseq.logging.progress_bar | :   6981 / 6997 sentences=16
2022-02-24 14:25:48 | INFO | fairseq.logging.progress_bar | :   6991 / 6997 sentences=16
2022-02-24 14:26:00 | INFO | ofa.evaluate | score_sum: tensor([95820.], device='cuda:1'), score_cnt: tensor([447793.], device='cuda:1'), score: 0.214
2022-02-24 14:26:00 | INFO | ofa.evaluate | score_sum: tensor([95820.], device='cuda:0'), score_cnt: tensor([447793.], device='cuda:0'), score: 0.214
2022-02-24 14:26:00 | INFO | ofa.evaluate | score_sum: tensor([95820.], device='cuda:3'), score_cnt: tensor([447793.], device='cuda:3'), score: 0.214
2022-02-24 14:26:00 | INFO | ofa.evaluate | score_sum: tensor([95820.], device='cuda:2'), score_cnt: tensor([447793.], device='cuda:2'), score: 0.214
2022-02-24 14:26:00 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:2 to store for rank: 1
2022-02-24 14:26:00 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:2 to store for rank: 2
2022-02-24 14:26:00 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:2 to store for rank: 3
2022-02-24 14:26:02 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:2 to store for rank: 0
2022-02-24 14:26:02 | INFO | torch.distributed.distributed_c10d | Rank 0: Completed store-based barrier for 4 nodes.
[W ProcessGroupNCCL.cpp:1569] Rank 0 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
2022-02-24 14:26:02 | INFO | torch.distributed.distributed_c10d | Rank 1: Completed store-based barrier for 4 nodes.
[W ProcessGroupNCCL.cpp:1569] Rank 1 using best-guess GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
2022-02-24 14:26:02 | INFO | torch.distributed.distributed_c10d | Rank 2: Completed store-based barrier for 4 nodes.
[W ProcessGroupNCCL.cpp:1569] Rank 2 using best-guess GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
2022-02-24 14:26:02 | INFO | torch.distributed.distributed_c10d | Rank 3: Completed store-based barrier for 4 nodes.
[W ProcessGroupNCCL.cpp:1569] Rank 3 using best-guess GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
rudolph-experiments-0:2658:3465 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/64
rudolph-experiments-0:2660:3467 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/64
rudolph-experiments-0:2658:3465 [1] NCCL INFO Trees [0] 2/-1/-1->1->0|0->1->2/-1/-1 [1] 2/-1/-1->1->0|0->1->2/-1/-1 [2] 2/-1/-1->1->0|0->1->2/-1/-1 [3] 2/-1/-1->1->0|0->1->2/-1/-1 [4] 2/-1/-1->1->-1|-1->1->2/-1/-1 [5] -1/-1/-1->1->0|0->1->-1/-1/-1 [6] 2/-1/-1->1->0|0->1->2/-1/-1 [7] 2/-1/-1->1->0|0->1->2/-1/-1 [8] 2/-1/-1->1->0|0->1->2/-1/-1 [9] 2/-1/-1->1->0|0->1->2/-1/-1 [10] 2/-1/-1->1->0|0->1->2/-1/-1 [11] 2/-1/-1->1->-1|-1->1->2/-1/-1 [12] 2/-1/-1->1->0|0->1->2/-1/-1 [13] 2/-1/-1->1->0|0->1->2/-1/-1 [14] 2/-1/-1->1->0|0->1->2/-1/-1 [15] 2/-1/-1->1->0|0->1->2/-1/-1 [16] 2/-1/-1->1->-1|-1->1->2/-1/-1 [17] -1/-1/-1->1->0|0->1->-1/-1/-1 [18] 2/-1/-1->1->0|0->1->2/-1/-1 [19] 2/-1/-1->1->0|0->1->2/-1/-1 [20] 2/-1/-1->1->0|0->1->2/-1/-1 [21] 2/-1/-1->1->0|0->1->2/-1/-1 [22] 2/-1/-1->1->0|0->1->2/-1/-1 [23] 2/-1/-1->1->-1|-1->1->2/-1/-1
rudolph-experiments-0:2660:3467 [3] NCCL INFO Trees [0] -1/-1/-1->3->2|2->3->-1/-1/-1 [1] -1/-1/-1->3->2|2->3->-1/-1/-1 [2] -1/-1/-1->3->2|2->3->-1/-1/-1 [3] -1/-1/-1->3->2|2->3->-1/-1/-1 [4] 0/-1/-1->3->2|2->3->0/-1/-1 [5] 0/-1/-1->3->2|2->3->0/-1/-1 [6] 0/-1/-1->3->-1|-1->3->0/-1/-1 [7] -1/-1/-1->3->2|2->3->-1/-1/-1 [8] -1/-1/-1->3->2|2->3->-1/-1/-1 [9] -1/-1/-1->3->2|2->3->-1/-1/-1 [10] -1/-1/-1->3->2|2->3->-1/-1/-1 [11] 0/-1/-1->3->2|2->3->0/-1/-1 [12] -1/-1/-1->3->2|2->3->-1/-1/-1 [13] -1/-1/-1->3->2|2->3->-1/-1/-1 [14] -1/-1/-1->3->2|2->3->-1/-1/-1 [15] -1/-1/-1->3->2|2->3->-1/-1/-1 [16] 0/-1/-1->3->2|2->3->0/-1/-1 [17] 0/-1/-1->3->2|2->3->0/-1/-1 [18] 0/-1/-1->3->-1|-1->3->0/-1/-1 [19] -1/-1/-1->3->2|2->3->-1/-1/-1 [20] -1/-1/-1->3->2|2->3->-1/-1/-1 [21] -1/-1/-1->3->2|2->3->-1/-1/-1 [22] -1/-1/-1->3->2|2->3->-1/-1/-1 [23] 0/-1/-1->3->2|2->3->0/-1/-1
rudolph-experiments-0:2657:3464 [0] NCCL INFO Channel 00/24 :    0   1   2   3
rudolph-experiments-0:2659:3466 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/64
rudolph-experiments-0:2660:3467 [3] NCCL INFO Setting affinity for GPU 3 to ffff0000,00000000,00000000,00000000,ffff0000,00000000,00000000
rudolph-experiments-0:2658:3465 [1] NCCL INFO Setting affinity for GPU 1 to ffff0000,00000000,00000000,00000000,ffff0000,00000000,00000000,00000000
rudolph-experiments-0:2657:3464 [0] NCCL INFO Channel 01/24 :    0   1   2   3
rudolph-experiments-0:2659:3466 [2] NCCL INFO Trees [0] 3/-1/-1->2->1|1->2->3/-1/-1 [1] 3/-1/-1->2->1|1->2->3/-1/-1 [2] 3/-1/-1->2->1|1->2->3/-1/-1 [3] 3/-1/-1->2->1|1->2->3/-1/-1 [4] 3/-1/-1->2->1|1->2->3/-1/-1 [5] 3/-1/-1->2->-1|-1->2->3/-1/-1 [6] -1/-1/-1->2->1|1->2->-1/-1/-1 [7] 3/-1/-1->2->1|1->2->3/-1/-1 [8] 3/-1/-1->2->1|1->2->3/-1/-1 [9] 3/-1/-1->2->1|1->2->3/-1/-1 [10] 3/-1/-1->2->1|1->2->3/-1/-1 [11] 3/-1/-1->2->1|1->2->3/-1/-1 [12] 3/-1/-1->2->1|1->2->3/-1/-1 [13] 3/-1/-1->2->1|1->2->3/-1/-1 [14] 3/-1/-1->2->1|1->2->3/-1/-1 [15] 3/-1/-1->2->1|1->2->3/-1/-1 [16] 3/-1/-1->2->1|1->2->3/-1/-1 [17] 3/-1/-1->2->-1|-1->2->3/-1/-1 [18] -1/-1/-1->2->1|1->2->-1/-1/-1 [19] 3/-1/-1->2->1|1->2->3/-1/-1 [20] 3/-1/-1->2->1|1->2->3/-1/-1 [21] 3/-1/-1->2->1|1->2->3/-1/-1 [22] 3/-1/-1->2->1|1->2->3/-1/-1 [23] 3/-1/-1->2->1|1->2->3/-1/-1
rudolph-experiments-0:2657:3464 [0] NCCL INFO Channel 02/24 :    0   1   2   3
rudolph-experiments-0:2657:3464 [0] NCCL INFO Channel 03/24 :    0   1   2   3
rudolph-experiments-0:2657:3464 [0] NCCL INFO Channel 04/24 :    0   1   2   3
rudolph-experiments-0:2659:3466 [2] NCCL INFO Setting affinity for GPU 2 to ffff0000,00000000,00000000,00000000,ffff0000,00000000,00000000
rudolph-experiments-0:2657:3464 [0] NCCL INFO Channel 05/24 :    0   1   2   3
rudolph-experiments-0:2657:3464 [0] NCCL INFO Channel 06/24 :    0   1   2   3
rudolph-experiments-0:2657:3464 [0] NCCL INFO Channel 07/24 :    0   1   2   3
rudolph-experiments-0:2657:3464 [0] NCCL INFO Channel 08/24 :    0   1   2   3
rudolph-experiments-0:2657:3464 [0] NCCL INFO Channel 09/24 :    0   1   2   3
rudolph-experiments-0:2657:3464 [0] NCCL INFO Channel 10/24 :    0   1   2   3
rudolph-experiments-0:2657:3464 [0] NCCL INFO Channel 11/24 :    0   1   2   3
rudolph-experiments-0:2657:3464 [0] NCCL INFO Channel 12/24 :    0   1   2   3
rudolph-experiments-0:2657:3464 [0] NCCL INFO Channel 13/24 :    0   1   2   3
rudolph-experiments-0:2657:3464 [0] NCCL INFO Channel 14/24 :    0   1   2   3
rudolph-experiments-0:2657:3464 [0] NCCL INFO Channel 15/24 :    0   1   2   3
rudolph-experiments-0:2657:3464 [0] NCCL INFO Channel 16/24 :    0   1   2   3
rudolph-experiments-0:2657:3464 [0] NCCL INFO Channel 17/24 :    0   1   2   3
rudolph-experiments-0:2657:3464 [0] NCCL INFO Channel 18/24 :    0   1   2   3
rudolph-experiments-0:2657:3464 [0] NCCL INFO Channel 19/24 :    0   1   2   3
rudolph-experiments-0:2657:3464 [0] NCCL INFO Channel 20/24 :    0   1   2   3
rudolph-experiments-0:2657:3464 [0] NCCL INFO Channel 21/24 :    0   1   2   3
rudolph-experiments-0:2657:3464 [0] NCCL INFO Channel 22/24 :    0   1   2   3
rudolph-experiments-0:2657:3464 [0] NCCL INFO Channel 23/24 :    0   1   2   3
rudolph-experiments-0:2657:3464 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/64
rudolph-experiments-0:2657:3464 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] 1/-1/-1->0->-1|-1->0->1/-1/-1 [2] 1/-1/-1->0->-1|-1->0->1/-1/-1 [3] 1/-1/-1->0->-1|-1->0->1/-1/-1 [4] -1/-1/-1->0->3|3->0->-1/-1/-1 [5] 1/-1/-1->0->3|3->0->1/-1/-1 [6] 1/-1/-1->0->3|3->0->1/-1/-1 [7] 1/-1/-1->0->-1|-1->0->1/-1/-1 [8] 1/-1/-1->0->-1|-1->0->1/-1/-1 [9] 1/-1/-1->0->-1|-1->0->1/-1/-1 [10] 1/-1/-1->0->-1|-1->0->1/-1/-1 [11] -1/-1/-1->0->3|3->0->-1/-1/-1 [12] 1/-1/-1->0->-1|-1->0->1/-1/-1 [13] 1/-1/-1->0->-1|-1->0->1/-1/-1 [14] 1/-1/-1->0->-1|-1->0->1/-1/-1 [15] 1/-1/-1->0->-1|-1->0->1/-1/-1 [16] -1/-1/-1->0->3|3->0->-1/-1/-1 [17] 1/-1/-1->0->3|3->0->1/-1/-1 [18] 1/-1/-1->0->3|3->0->1/-1/-1 [19] 1/-1/-1->0->-1|-1->0->1/-1/-1 [20] 1/-1/-1->0->-1|-1->0->1/-1/-1 [21] 1/-1/-1->0->-1|-1->0->1/-1/-1 [22] 1/-1/-1->0->-1|-1->0->1/-1/-1 [23] -1/-1/-1->0->3|3->0->-1/-1/-1
rudolph-experiments-0:2657:3464 [0] NCCL INFO Setting affinity for GPU 0 to ffff0000,00000000,00000000,00000000,ffff0000,00000000,00000000,00000000
rudolph-experiments-0:2658:3465 [1] NCCL INFO Channel 00 : 1[90000] -> 2[b7000] via P2P/IPC/read
rudolph-experiments-0:2657:3464 [0] NCCL INFO Channel 00 : 0[87000] -> 1[90000] via P2P/IPC/read
rudolph-experiments-0:2660:3467 [3] NCCL INFO Channel 00 : 3[bd000] -> 0[87000] via P2P/IPC/read
rudolph-experiments-0:2659:3466 [2] NCCL INFO Channel 00 : 2[b7000] -> 3[bd000] via P2P/IPC/read
rudolph-experiments-0:2660:3467 [3] NCCL INFO Channel 00 : 3[bd000] -> 2[b7000] via P2P/IPC/read
rudolph-experiments-0:2658:3465 [1] NCCL INFO Channel 00 : 1[90000] -> 0[87000] via P2P/IPC/read
rudolph-experiments-0:2659:3466 [2] NCCL INFO Channel 00 : 2[b7000] -> 1[90000] via P2P/IPC/read
rudolph-experiments-0:2660:3467 [3] NCCL INFO Channel 01 : 3[bd000] -> 0[87000] via P2P/IPC/read
rudolph-experiments-0:2657:3464 [0] NCCL INFO Channel 01 : 0[87000] -> 1[90000] via P2P/IPC/read
rudolph-experiments-0:2658:3465 [1] NCCL INFO Channel 01 : 1[90000] -> 2[b7000] via P2P/IPC/read
rudolph-experiments-0:2659:3466 [2] NCCL INFO Channel 01 : 2[b7000] -> 3[bd000] via P2P/IPC/read
rudolph-experiments-0:2660:3467 [3] NCCL INFO Channel 01 : 3[bd000] -> 2[b7000] via P2P/IPC/read
rudolph-experiments-0:2658:3465 [1] NCCL INFO Channel 01 : 1[90000] -> 0[87000] via P2P/IPC/read
rudolph-experiments-0:2659:3466 [2] NCCL INFO Channel 01 : 2[b7000] -> 1[90000] via P2P/IPC/read
rudolph-experiments-0:2660:3467 [3] NCCL INFO Channel 02 : 3[bd000] -> 0[87000] via P2P/IPC/read
rudolph-experiments-0:2657:3464 [0] NCCL INFO Channel 02 : 0[87000] -> 1[90000] via P2P/IPC/read
rudolph-experiments-0:2658:3465 [1] NCCL INFO Channel 02 : 1[90000] -> 2[b7000] via P2P/IPC/read
rudolph-experiments-0:2659:3466 [2] NCCL INFO Channel 02 : 2[b7000] -> 3[bd000] via P2P/IPC/read
rudolph-experiments-0:2660:3467 [3] NCCL INFO Channel 02 : 3[bd000] -> 2[b7000] via P2P/IPC/read
rudolph-experiments-0:2658:3465 [1] NCCL INFO Channel 02 : 1[90000] -> 0[87000] via P2P/IPC/read
rudolph-experiments-0:2659:3466 [2] NCCL INFO Channel 02 : 2[b7000] -> 1[90000] via P2P/IPC/read
rudolph-experiments-0:2660:3467 [3] NCCL INFO Channel 03 : 3[bd000] -> 0[87000] via P2P/IPC/read
rudolph-experiments-0:2657:3464 [0] NCCL INFO Channel 03 : 0[87000] -> 1[90000] via P2P/IPC/read
rudolph-experiments-0:2658:3465 [1] NCCL INFO Channel 03 : 1[90000] -> 2[b7000] via P2P/IPC/read
rudolph-experiments-0:2659:3466 [2] NCCL INFO Channel 03 : 2[b7000] -> 3[bd000] via P2P/IPC/read
rudolph-experiments-0:2660:3467 [3] NCCL INFO Channel 03 : 3[bd000] -> 2[b7000] via P2P/IPC/read
rudolph-experiments-0:2658:3465 [1] NCCL INFO Channel 03 : 1[90000] -> 0[87000] via P2P/IPC/read
rudolph-experiments-0:2659:3466 [2] NCCL INFO Channel 03 : 2[b7000] -> 1[90000] via P2P/IPC/read
rudolph-experiments-0:2660:3467 [3] NCCL INFO Channel 04 : 3[bd000] -> 0[87000] via P2P/IPC/read
rudolph-experiments-0:2657:3464 [0] NCCL INFO Channel 04 : 0[87000] -> 1[90000] via P2P/IPC/read
rudolph-experiments-0:2658:3465 [1] NCCL INFO Channel 04 : 1[90000] -> 2[b7000] via P2P/IPC/read
rudolph-experiments-0:2659:3466 [2] NCCL INFO Channel 04 : 2[b7000] -> 3[bd000] via P2P/IPC/read
rudolph-experiments-0:2657:3464 [0] NCCL INFO Channel 04 : 0[87000] -> 3[bd000] via P2P/IPC/read
rudolph-experiments-0:2660:3467 [3] NCCL INFO Channel 04 : 3[bd000] -> 2[b7000] via P2P/IPC/read
rudolph-experiments-0:2659:3466 [2] NCCL INFO Channel 04 : 2[b7000] -> 1[90000] via P2P/IPC/read
rudolph-experiments-0:2657:3464 [0] NCCL INFO Channel 05 : 0[87000] -> 1[90000] via P2P/IPC/read
rudolph-experiments-0:2660:3467 [3] NCCL INFO Channel 05 : 3[bd000] -> 0[87000] via P2P/IPC/read
rudolph-experiments-0:2658:3465 [1] NCCL INFO Channel 05 : 1[90000] -> 2[b7000] via P2P/IPC/read
rudolph-experiments-0:2659:3466 [2] NCCL INFO Channel 05 : 2[b7000] -> 3[bd000] via P2P/IPC/read
rudolph-experiments-0:2658:3465 [1] NCCL INFO Channel 05 : 1[90000] -> 0[87000] via P2P/IPC/read
rudolph-experiments-0:2657:3464 [0] NCCL INFO Channel 05 : 0[87000] -> 3[bd000] via P2P/IPC/read
rudolph-experiments-0:2660:3467 [3] NCCL INFO Channel 05 : 3[bd000] -> 2[b7000] via P2P/IPC/read
rudolph-experiments-0:2658:3465 [1] NCCL INFO Channel 06 : 1[90000] -> 2[b7000] via P2P/IPC/read
rudolph-experiments-0:2657:3464 [0] NCCL INFO Channel 06 : 0[87000] -> 1[90000] via P2P/IPC/read
rudolph-experiments-0:2659:3466 [2] NCCL INFO Channel 06 : 2[b7000] -> 3[bd000] via P2P/IPC/read
rudolph-experiments-0:2660:3467 [3] NCCL INFO Channel 06 : 3[bd000] -> 0[87000] via P2P/IPC/read
rudolph-experiments-0:2659:3466 [2] NCCL INFO Channel 06 : 2[b7000] -> 1[90000] via P2P/IPC/read
rudolph-experiments-0:2658:3465 [1] NCCL INFO Channel 06 : 1[90000] -> 0[87000] via P2P/IPC/read
rudolph-experiments-0:2657:3464 [0] NCCL INFO Channel 06 : 0[87000] -> 3[bd000] via P2P/IPC/read
rudolph-experiments-0:2659:3466 [2] NCCL INFO Channel 07 : 2[b7000] -> 3[bd000] via P2P/IPC/read
rudolph-experiments-0:2658:3465 [1] NCCL INFO Channel 07 : 1[90000] -> 2[b7000] via P2P/IPC/read
rudolph-experiments-0:2660:3467 [3] NCCL INFO Channel 07 : 3[bd000] -> 0[87000] via P2P/IPC/read
rudolph-experiments-0:2657:3464 [0] NCCL INFO Channel 07 : 0[87000] -> 1[90000] via P2P/IPC/read
rudolph-experiments-0:2660:3467 [3] NCCL INFO Channel 07 : 3[bd000] -> 2[b7000] via P2P/IPC/read
rudolph-experiments-0:2659:3466 [2] NCCL INFO Channel 07 : 2[b7000] -> 1[90000] via P2P/IPC/read
rudolph-experiments-0:2658:3465 [1] NCCL INFO Channel 07 : 1[90000] -> 0[87000] via P2P/IPC/read
rudolph-experiments-0:2660:3467 [3] NCCL INFO Channel 08 : 3[bd000] -> 0[87000] via P2P/IPC/read
rudolph-experiments-0:2659:3466 [2] NCCL INFO Channel 08 : 2[b7000] -> 3[bd000] via P2P/IPC/read
rudolph-experiments-0:2657:3464 [0] NCCL INFO Channel 08 : 0[87000] -> 1[90000] via P2P/IPC/read
rudolph-experiments-0:2658:3465 [1] NCCL INFO Channel 08 : 1[90000] -> 2[b7000] via P2P/IPC/read
rudolph-experiments-0:2660:3467 [3] NCCL INFO Channel 08 : 3[bd000] -> 2[b7000] via P2P/IPC/read
rudolph-experiments-0:2659:3466 [2] NCCL INFO Channel 08 : 2[b7000] -> 1[90000] via P2P/IPC/read
rudolph-experiments-0:2658:3465 [1] NCCL INFO Channel 08 : 1[90000] -> 0[87000] via P2P/IPC/read
rudolph-experiments-0:2660:3467 [3] NCCL INFO Channel 09 : 3[bd000] -> 0[87000] via P2P/IPC/read
rudolph-experiments-0:2659:3466 [2] NCCL INFO Channel 09 : 2[b7000] -> 3[bd000] via P2P/IPC/read
rudolph-experiments-0:2657:3464 [0] NCCL INFO Channel 09 : 0[87000] -> 1[90000] via P2P/IPC/read
rudolph-experiments-0:2658:3465 [1] NCCL INFO Channel 09 : 1[90000] -> 2[b7000] via P2P/IPC/read
rudolph-experiments-0:2660:3467 [3] NCCL INFO Channel 09 : 3[bd000] -> 2[b7000] via P2P/IPC/read
rudolph-experiments-0:2659:3466 [2] NCCL INFO Channel 09 : 2[b7000] -> 1[90000] via P2P/IPC/read
rudolph-experiments-0:2658:3465 [1] NCCL INFO Channel 09 : 1[90000] -> 0[87000] via P2P/IPC/read
rudolph-experiments-0:2660:3467 [3] NCCL INFO Channel 10 : 3[bd000] -> 0[87000] via P2P/IPC/read
rudolph-experiments-0:2659:3466 [2] NCCL INFO Channel 10 : 2[b7000] -> 3[bd000] via P2P/IPC/read
rudolph-experiments-0:2657:3464 [0] NCCL INFO Channel 10 : 0[87000] -> 1[90000] via P2P/IPC/read
rudolph-experiments-0:2658:3465 [1] NCCL INFO Channel 10 : 1[90000] -> 2[b7000] via P2P/IPC/read
rudolph-experiments-0:2660:3467 [3] NCCL INFO Channel 10 : 3[bd000] -> 2[b7000] via P2P/IPC/read
rudolph-experiments-0:2659:3466 [2] NCCL INFO Channel 10 : 2[b7000] -> 1[90000] via P2P/IPC/read
rudolph-experiments-0:2658:3465 [1] NCCL INFO Channel 10 : 1[90000] -> 0[87000] via P2P/IPC/read
[... repeated NCCL INFO lines omitted: channels 11–23 are set up between ranks 0–3 (busIds 87000, 90000, b7000, bd000) via P2P/IPC/read ...]
rudolph-experiments-0:2657:3464 [0] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
rudolph-experiments-0:2657:3464 [0] NCCL INFO comm 0x7f7608002dd0 rank 0 nranks 4 cudaDev 0 busId 87000 - Init COMPLETE
rudolph-experiments-0:2657:2657 [0] NCCL INFO Launch mode Parallel
rudolph-experiments-0:2660:3467 [3] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
rudolph-experiments-0:2658:3465 [1] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
rudolph-experiments-0:2660:3467 [3] NCCL INFO comm 0x7fdd6c002dd0 rank 3 nranks 4 cudaDev 3 busId bd000 - Init COMPLETE
rudolph-experiments-0:2658:3465 [1] NCCL INFO comm 0x7f7cdc002dd0 rank 1 nranks 4 cudaDev 1 busId 90000 - Init COMPLETE
rudolph-experiments-0:2659:3466 [2] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
rudolph-experiments-0:2659:3466 [2] NCCL INFO comm 0x7f6878002dd0 rank 2 nranks 4 cudaDev 2 busId b7000 - Init COMPLETE

Setup: 4 A100 GPUs, single server
CUDA: 11.1
PyTorch: 1.9.1+cu111

What is the meaning of `train_caption_stage1_el.sh`?

I found that there are some files under the caption directory, but I do not know when I should use them.
What are the meaning and function of train_caption_stage1_el.sh and train_caption_stage1_el_db.sh?
Thanks!

Custom image captioning model in Chinese

Hello, I am working on image captioning in Chinese. Is there any way to switch the tokenization and word-embedding method to my own during training, so that I can adapt the OFA captioning model to Chinese?

OFA on customised task e.g. OK-VQA

Hi, thanks for the awesome work!

I'd like to fine-tune OFA on OK-VQA. I have been trying to follow the VQA instructions, assuming the two tasks are similar, but I have raw image (and question) input. How do I convert it into what OFA understands? Do I need to follow the format of the example TSV file? Is trainval_ans2label.pkl required (and if so, how do I generate it)?
What are the steps to take to extend OFA on OK-VQA?

Thank you in advance for your help!

How can I output multiple captions using beam search?

Hi, thanks for the wonderful work. I want to output 5 captions when I set "beam" to 5, but it seems only one caption is produced by default. Since this repository is rich in functionality, I could not figure out which part to modify to output more than one caption per image. Can you help me?
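
A minimal sketch of one possible direction (an editor's assumption based on the fairseq-style generator used in models/sequence_generator.py, not a confirmed answer): the generator already returns up to `beam` hypotheses per sample, and the caption task only decodes the first one, so iterating over all hypotheses yields multiple captions. The `decode` helper below is a placeholder for the task's own detokenization.

# gen_out is what task.inference_step(generator, [model], sample) returns:
# one list of hypothesis dicts per input image, sorted by score.
gen_out = task.inference_step(generator, [model], sample)

all_captions = []
for hypotheses in gen_out:                   # one entry per image in the batch
    captions = []
    for hypo in hypotheses:                  # up to `beam` hypotheses per image
        tokens = hypo["tokens"].int().cpu()  # generated token ids
        captions.append(decode(tokens))      # placeholder detokenization helper
    all_captions.append(captions)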

How is the confidence score used in VQA training?

Hi, I have a question about how the confidence score associated with the annotated answer is used for VQA. I don't think it is mentioned in the paper. Could you explain, for example, how the confidences in 0.3!|+ans1 and 1.0!|+ans2 are processed, and which part of the code is related to this? I do not entirely understand the mechanism of training with the confidence and then simply assigning '1.0!|+' for inference.

Also, is it possible to generate top K answers instead of just one?

Thank you!
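
A minimal sketch of one plausible mechanism (an editor's assumption, not a confirmed description of OFA's criterion): the per-sample confidence parsed from the answer field is carried through the batch and multiplied into the token-level loss, so low-confidence answers contribute proportionally less gradient.

import torch.nn.functional as F

def confidence_weighted_nll(lprobs, target, conf, ignore_index=1):
    # lprobs: (B, T, V) log-probabilities; target: (B, T) token ids;
    # conf: (B,) per-sample confidence, e.g. 0.3 parsed from '0.3!|+ans1'.
    nll = F.nll_loss(
        lprobs.transpose(1, 2), target,
        ignore_index=ignore_index, reduction="none",
    )                              # (B, T) per-token negative log-likelihood
    nll = nll * conf.unsqueeze(1)  # scale every token of a sample by its confidence
    return nll.sum()

At inference time there is only one hypothesis being decoded, so a constant '1.0!|+' prefix would make the weight a no-op, which is consistent with what the question observes.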

When I ran run_scripts/caption/caption_train_stage1, I had some problems

@logicwong Hi, thanks for the wonderful work. When I ran run_scripts/caption/caption_train_stage1, I had some problems.
This seems to be the problem:
File "/data1/cxy/OFA/models/sequence_generator.py", line 480, in _generate
assert step < max_len, f"{step} < {max_len}"
AssertionError: 16 < 16
How should I handle it?

2022-04-18 21:46:53 - progress_bar.py[line:274] - INFO: epoch 001: 491 / 70844 loss=3.702, loss_v1=0, loss_v2=0, nll_loss=1.589, ntokens=99.2, nsentences=8, sample_size=99.2, sample_size_v1=0, sample_size_v2=0, ppl=3.01, wps=92, ups=0.93, wpb=99.2, bsz=8, num_updates=490, lr=2.88201e-07, gnorm=8.633, clip=100, loss_scale=64, train_wall=11, gb_free=12.7, wall=604
2022-04-18 21:47:03 - progress_bar.py[line:274] - INFO: epoch 001: 501 / 70844 loss=3.913, loss_v1=0, loss_v2=0, nll_loss=1.869, ntokens=99.3, nsentences=8, sample_size=99.3, sample_size_v1=0, sample_size_v2=0, ppl=3.65, wps=92.4, ups=0.93, wpb=99.3, bsz=8, num_updates=500, lr=2.94083e-07, gnorm=10.402, clip=100, loss_scale=64, train_wall=11, gb_free=12.7, wall=615
2022-04-18 21:47:03 - train.py[line:436] - INFO: begin validation on "valid" subset
slice_id 0 seek offset 0
slice_id 1 seek offset 2500
slice_id 1 seek offset 2500
slice_id 0 seek offset 0
Traceback (most recent call last):
  File "../../train.py", line 528, in <module>
    cli_main()
  File "../../train.py", line 521, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/data1/cxy/OFA/fairseq/fairseq/distributed/utils.py", line 374, in call_main
    distributed_main(cfg.distributed_training.device_id, main, cfg, kwargs)
  File "/data1/cxy/OFA/fairseq/fairseq/distributed/utils.py", line 348, in distributed_main
    main(cfg, **kwargs)
  File "../../train.py", line 190, in main
    valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
  File "/home/ubuntu/miniconda3/envs/pytorch-zme/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "../../train.py", line 316, in train
    cfg, trainer, task, epoch_itr, valid_subsets, end_of_epoch
  File "../../train.py", line 402, in validate_and_save
    valid_losses = validate(cfg, trainer, task, epoch_itr, valid_subsets)
  File "../../train.py", line 472, in validate
    trainer.valid_step(sample)
  File "/home/ubuntu/miniconda3/envs/pytorch-zme/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/data1/cxy/OFA/trainer.py", line 1059, in valid_step
    sample, self.model, self.criterion, **extra_kwargs
  File "/data1/cxy/OFA/tasks/mm_tasks/caption.py", line 139, in valid_step
    hyps, refs = self._inference(self.sequence_generator, sample, model)
  File "/data1/cxy/OFA/tasks/mm_tasks/caption.py", line 230, in _inference
    gen_out = self.inference_step(generator, [model], sample)
  File "/data1/cxy/OFA/fairseq/fairseq/tasks/fairseq_task.py", line 518, in inference_step
    models, sample, prefix_tokens=prefix_tokens, constraints=constraints
  File "/home/ubuntu/miniconda3/envs/pytorch-zme/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/data1/cxy/OFA/models/sequence_generator.py", line 207, in generate
    return self._generate(models, sample, **kwargs)
  File "/data1/cxy/OFA/models/sequence_generator.py", line 480, in _generate
    assert step < max_len, f"{step} < {max_len}"
AssertionError: 16 < 16
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/pytorch-zme/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ubuntu/miniconda3/envs/pytorch-zme/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/miniconda3/envs/pytorch-zme/lib/python3.7/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/home/ubuntu/miniconda3/envs/pytorch-zme/lib/python3.7/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None) # not coming back
  File "/home/ubuntu/miniconda3/envs/pytorch-zme/lib/python3.7/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/ubuntu/miniconda3/envs/pytorch-zme/bin/python3', '-u', '../../train.py', '--local_rank=1', '../../dataset/caption_data/caption_stage1_train.tsv,../../dataset/caption_data/caption_val.tsv', '--selected-cols=0,4,2', '--bpe-dir=../../utils/BPE', '--user-dir=../../ofa_module', '--restore-file=../../checkpoints/ofa_large.pt', '--reset-optimizer', '--reset-dataloader', '--reset-meters', '--save-dir=./stage1_checkpoints/4_0.06
', '--task=caption', '--arch=ofa_large', '--criterion=adjust_label_smoothed_cross_entropy', '--label-smoothing=0.1', '--batch-size=1', '--update-freq=4', '--encoder-normalize-before', '--decoder-normalize-before', '--share-decoder-input-output-embed', '--share-all-embeddings', '--layernorm-embedding', '--patch-layernorm-embedding', '--code-layernorm-embedding', '--resnet-drop-path-rate=0.0', '--encoder-drop-path-rate=0.1', '--decoder-drop-path-rate=0.1', '--dropout=0.1', '--attention-dropout=0.0', '--weight-decay=0.01', '--optimizer=adam', '--adam-betas=(0.9,0.999)', '--adam-eps=1e-08', '--clip-norm=1.0', '--lr-scheduler=polynomial_decay', '--lr=1e-5', '--max-epoch=4', '--warmup-ratio=0.06', '--log-format=simple', '--log-interval=10', '--fixed-validation-seed=7', '--no-epoch-checkpoints', '--keep-best-checkpoints=1', '--save-interval=1', '--validate-interval=1', '--save-interval-updates=500', '--validate-interval-updates=500', '--eval-cider', '--eval-cider-cached-tokens=../../dataset/caption_data/cider_cached_tokens/coco-valid-words.p', '--eval-args={"beam":5,"max_len_b":16,"no_repeat_ngram_size":3}', '--best-checkpoint-metric=cider', '--maximize-best-checkpoint-metric', '--max-src-length=80', '--max-tgt-length=20', '--find-unused-parameters', '--freeze-encoder-embedding', '--freeze-decoder-embedding', '--add-type-embedding', '--scale-attn', '--scale-fc', '--scale-heads', '--disable-entangle', '--num-bins=1000', '--patch-image-size=480', '--drop-worst-ratio=0.2', '--drop-worst-after=2500', '--fp16', '--fp16-scale-window=512', '--num-workers=0']' returned non-zero exit status 1.

Visualization of the image token

Thanks for your awesome work! I wonder how I could get the patch pixels of an image token. I know it requires the dVAE decoder
to decode the image tokens. Does the code contain the decoding process? Or could you recommend any pretrained models with which I could quickly decode the image tokens? I appreciate your reply.

Discussion: Experiments with SwinTransformer?

Just a discussion

I wonder if the authors have tried replacing the ResNet with Swin Transformers in such a SimVLM-style architecture.

Under the SimVLM architecture, using Swin Transformers performs considerably worse than ResNet in my experiments. I wonder if the authors have similar observations or any thoughts on this.

My replacement simply swaps out the image-embedding part.

When I run run_scripts/caption/train_caption_stage1.sh, I encounter the following problem.

@logicwong
Thanks for releasing the amazing work!
When I run run_scripts/caption/train_caption_stage1.sh, I encounter the following problem. Can you help me?

train.py: error: argument --max-epoch: invalid int value: '{2,}'

train.py: error: argument --max-epoch: invalid int value: '{2,}'
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/pytorch-zme/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ubuntu/miniconda3/envs/pytorch-zme/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/miniconda3/envs/pytorch-zme/lib/python3.7/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/home/ubuntu/miniconda3/envs/pytorch-zme/lib/python3.7/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None) # not coming back
  File "/home/ubuntu/miniconda3/envs/pytorch-zme/lib/python3.7/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/ubuntu/miniconda3/envs/pytorch-zme/bin/python3', '-u', '../../

Finetuned BASE checkpoints

Could you please release the finetuned BASE checkpoints for the different downstream tasks? It seems you have only released the finetuned LARGE checkpoints.
Thanks!

How can I handle this in a modified model?

Hi, I added another layer to the model, but a problem occurs after several steps.

2022-03-21 23:16:50 - progress_bar.py[line:272] - INFO: epoch 001:     41 / 24544 loss=1.825, loss_v1=0, loss_v2=0, nll_loss=1.825, ntokens=16, nsentences=16, sample_size=16, sample_size_v1=0, sample_size_v2=0, ppl=3.54, wps=11.3, ups=0.7, wpb=16, bsz=16, num_updates=41, lr=5.56838e-07, gnorm=32.218, clip=100, loss_scale=16, train_wall=1, gb_free=14.5, wall=67
2022-03-21 23:16:51 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 8.0
2022-03-21 23:16:53 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 4.0
2022-03-21 23:16:54 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 2.0
2022-03-21 23:16:55 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 1.0
2022-03-21 23:16:56 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.5
2022-03-21 23:16:57 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.25
2022-03-21 23:16:58 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.125
2022-03-21 23:16:59 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.0625
2022-03-21 23:17:01 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.03125
2022-03-21 23:17:02 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.015625
2022-03-21 23:17:02 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.0078125
2022-03-21 23:17:03 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.00390625
2022-03-21 23:17:04 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.001953125
2022-03-21 23:17:05 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.0009765625
2022-03-21 23:17:06 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.00048828125
2022-03-21 23:17:07 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.000244140625
2022-03-21 23:17:08 - trainer.py[line:922] - INFO: NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.0001220703125
/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py:787: UserWarning: Using a non-full backward hook when the forward contains multiple autograd Nodes is deprecated and will be removed in future versions. This hook will be missing some grad_input. Please use register_full_backward_hook to get the documented behavior.
  warnings.warn("Using a non-full backward hook when the forward contains multiple autograd Nodes "
/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py:752: UserWarning: Using non-full backward hooks on a Module that does not return a single Tensor or a tuple of Tensors is deprecated and will be removed in future versions. This hook will be missing some of the grad_output. Please use register_full_backward_hook to get the documented behavior.
  warnings.warn("Using non-full backward hooks on a Module that does not return a "
/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py:762: UserWarning: Using non-full backward hooks on a Module that does not take as input a single Tensor or a tuple of Tensors is deprecated and will be removed in future versions. This hook will be missing some of the grad_input. Please use register_full_backward_hook to get the documented behavior.
  warnings.warn("Using non-full backward hooks on a Module that does not take as input a "
/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py:777: UserWarning: Using a non-full backward hook when outputs are generated by different autograd Nodes is deprecated and will be removed in future versions. This hook will be missing some grad_output. Please use register_full_backward_hook to get the documented behavior.
  warnings.warn("Using a non-full backward hook when outputs are generated by different autograd Nodes "
[... the same four UserWarnings, repeated by the other worker processes, are omitted ...]
2022-03-21 23:17:09 - nan_detector.py[line:89] - WARNING: NaN detected in output of encoder.layers.2.moe.moe_layer, shape: torch.Size([60, 1, 768]), forward input max: 3.67578125, input min: -7.75
Traceback (most recent call last):
  File "/workspace/OFA/trainer.py", line 871, in train_step
    grad_norm = self.clip_grad_norm(self.cfg.optimization.clip_norm)
  File "/workspace/OFA/trainer.py", line 1208, in clip_grad_norm
    return self.optimizer.clip_grad_norm(
  File "/workspace/OFA/fairseq/fairseq/optim/fp16_optimizer.py", line 200, in clip_grad_norm
    self.scaler.check_overflow(grad_norm)
  File "/workspace/OFA/fairseq/fairseq/optim/dynamic_loss_scaler.py", line 61, in check_overflow
    raise FloatingPointError(
FloatingPointError: Minimum loss scale reached (0.0001). Your loss is probably exploding. Try lowering the learning rate, using gradient clipping or increasing the batch size.

Then the training broke down. How can I fix this problem? Hyperparameter tuning, or something else I need to pay attention to?
I would really appreciate it if you can help me!
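
One common mitigation to try (an editor's sketch, not a confirmed fix for this setup): the NaN is first reported inside the newly added encoder.layers.2.moe.moe_layer, and fp16 training is very sensitive to how a freshly initialized layer perturbs pretrained activations, so initializing the new layer conservatively (small weights, zero bias) sometimes keeps the loss scale from collapsing. The module path in the usage comment is a placeholder for however the added layer is attached.

import torch.nn as nn

def init_new_layer(layer: nn.Module):
    # Small-weight initialization so the new layer's initial output stays close
    # to zero and does not blow up the pretrained activations under fp16.
    for m in layer.modules():
        if isinstance(m, nn.Linear):
            nn.init.normal_(m.weight, mean=0.0, std=0.02)
            if m.bias is not None:
                nn.init.zeros_(m.bias)

# hypothetical usage, assuming the added module sits at encoder.layers[2].moe:
# init_new_layer(model.encoder.layers[2].moe)

The error message's own suggestions (lower learning rate, stronger gradient clipping, larger batch size) are the other knobs worth trying first.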

VQA input construction

Hi, guys:
Thank you for your diligent work. I'm trying to prepare VQA input for single-sample inference.
I'm not sure about the architecture of the VQA model, such as the "decoder_prompts" and "prefix_tokens" in the manually constructed "sample",
and the following sentence in the README description of VQA is vague to me:
"we transform original VQA training questions with multiple golden answers into multiple training samples. "
Do you have any suggestions?
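
A minimal sketch of one way to read that sentence (an editor's assumption about the data preparation, not the repository's exact code): a question annotated with several golden answers is flattened into one training row per answer, each row carrying that answer's confidence in the notation quoted in the question above.

# Hypothetical expansion of one annotated VQA question into several training rows;
# the field names and the exact column layout are assumptions for illustration.
def expand_question(question_id, image_id, question, answers):
    # answers: list of (confidence, answer) pairs, e.g. [(1.0, "two"), (0.3, "three")]
    rows = []
    for conf, ans in answers:
        rows.append((question_id, image_id, question, f"{conf}!|+{ans}"))
    return rows

rows = expand_question(79459, 79459, "how many people are there?",
                       [(1.0, "two"), (0.3, "three")])
# -> one (id, image-id, question, answer-with-confidence) row per golden answer

Under this reading, "decoder_prompts"/"prefix_tokens" would only matter at inference time, when the decoder is forced to start generation from a fixed prompt.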

Pretraining data release

Thanks for releasing the amazing work!
Will the pretraining data be released? There is only a very small number of examples available right now.

PIL.UnidentifiedImageError: cannot identify image file

I made a TSV builder for custom image inference,

but an error occurs when I run it with my TSV file:

PIL.UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x7f8897ba1e50>

I found that strings are converted into float64 automatically when I save the TSV file with pandas.

Here is my custom make_tsv function:


import os
import base64
from io import BytesIO

import pandas as pd
from PIL import Image


def make_tsv(datapath, img_list, start_index):
    columns = ["unique_id", "img_id", "caption", "predicted_object_labels", "base64_str"]
    rows = []
    for i, img_name in enumerate(img_list):
        print("index:", i, "img_name:", img_name)
        img = Image.open(os.path.join(datapath, img_name))
        img_buffer = BytesIO()
        img.save(img_buffer, format=img.format)
        byte_data = img_buffer.getvalue()
        unique_id = start_index + i
        img_id = img_name.split(".")[-2]
        caption = ""
        predicted_object_labels = ""
        # decode() yields a plain base64 string; str(bytes) would keep the b'...'
        # wrapper and make the image unreadable when the TSV is loaded again
        base64_str = base64.b64encode(byte_data).decode("utf-8")
        rows.append([unique_id, img_id, caption, predicted_object_labels, base64_str])

    # Build the frame once from the collected rows so unique_id stays an integer,
    # instead of concatenating onto an empty frame whose columns default to float64.
    return pd.DataFrame(rows, columns=columns)

I can't find a solution yet.

Can you give me advice?
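
A small usage sketch (an editor's addition; the directory, file names, and the header/index choices are assumptions): write the frame as a plain tab-separated file, then check that one stored image decodes the same way a dataset loader would decode it.

# hypothetical paths and file names, for illustration only
data = make_tsv("./images", ["cat.jpg", "dog.jpg"], start_index=0)
data.to_csv("custom_inference.tsv", sep="\t", index=False, header=False)

# round-trip check: this is roughly the operation that raises
# PIL.UnidentifiedImageError when the stored string still carries the b'...' wrapper
img = Image.open(BytesIO(base64.b64decode(data.loc[0, "base64_str"])))
img.verify()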

Cannot unzip vqa_data.zip file

[screenshot of the failed unzip omitted]

The provided vqa_data.zip file is too large and the unzip failed (as shown in the figure above).

We have tried to fix it with zip -F file.zip --out file-large.zip, but it doesn't work.

Could you please help to provide a better way to process data files? Thanks!

questions about "extract_features_scriptable" method

Regarding the return value of "extract_features_scriptable" in unify_transformer.py:

I think inner_states[1:] holds the forwarded outputs of the decoder layers,

and attn is the attention value.

Are they right?

Sorry for bothering you with trivial questions.

Unsuccessful GLUE fine-tuning

Hi, when I follow the GLUE fine-tuning procedure (let's say MNLI, for example), a RuntimeError is triggered, which goes like:

2022-03-04 13:14:31 - train.py[line:295] - INFO: Start iterating over samples
slice_id 1 seek offset 98176
Total steps 30680, warmup steps 1840, warmup_factor 0.0005434782608695652
Traceback (most recent call last):
  File "../../train.py", line 527, in <module>
    cli_main()
  File "../../train.py", line 520, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/workspace/OFA/fairseq/fairseq/distributed/utils.py", line 374, in call_main
    distributed_main(cfg.distributed_training.device_id, main, cfg, kwargs)
  File "/workspace/OFA/fairseq/fairseq/distributed/utils.py", line 348, in distributed_main
    main(cfg, **kwargs)
  File "../../train.py", line 189, in main
    valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
  File "/opt/conda/envs/OFA/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "../../train.py", line 300, in train
    log_output = trainer.train_step(samples)
  File "/opt/conda/envs/OFA/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/workspace/OFA/trainer.py", line 806, in train_step
    raise e
  File "/workspace/OFA/trainer.py", line 780, in train_step
    **extra_kwargs,
  File "/workspace/OFA/tasks/ofa_task.py", line 319, in train_step
    loss, sample_size, logging_output = criterion(model, sample, update_num=update_num)
  File "/opt/conda/envs/OFA/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/OFA/criterions/label_smoothed_cross_entropy.py", line 200, in forward
    loss, nll_loss, ntokens = self.compute_loss(model, net_output, sample, update_num, reduce=reduce)
  File "/workspace/OFA/criterions/label_smoothed_cross_entropy.py", line 245, in compute_loss
    lprobs, target, constraint_masks = self.get_lprobs_and_target(model, net_output, sample)
  File "/workspace/OFA/criterions/label_smoothed_cross_entropy.py", line 222, in get_lprobs_and_target
    net_output[0].masked_fill_(~constraint_masks, -math.inf)
RuntimeError: Output 0 of _DDPSinkBackward is a view and is being modified inplace. This view is the output of a function that returns multiple views. Such functions do not allow the output views to be modified inplace. You should replace the inplace operation by an out-of-place one.

How can I solve it?
I will really appreciate it!
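
A minimal sketch of the direction the error message itself points in (an editor's assumption about the surrounding code in criterions/label_smoothed_cross_entropy.py, not a tested patch): replace the in-place masked_fill_ on the DDP output view with an out-of-place masked_fill and keep working with the new tensor.

import math

# before (fails under DDP, because net_output[0] is a view of the model output):
#   net_output[0].masked_fill_(~constraint_masks, -math.inf)

# after: out-of-place version that leaves the view untouched
masked_logits = net_output[0].masked_fill(~constraint_masks, -math.inf)
net_output = (masked_logits,) + tuple(net_output[1:])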

custom image inference

I just want to test my custom image with the code.

Can you let me know which model is used for the image (object-label) extraction?

I mean, I want to know the label set and checkpoint info of VinVL.

The visualization problem

Hello, thanks for sharing your code.
My question is how to get the caption of an input image, like Figure 1 in your article.
Could you share the code?

Unsuccessful Image Caption step 2 fine-tuning

Hi, when running train_caption_stage2.sh, it triggers an AssertionError like:

Traceback (most recent call last):
  File "../../train.py", line 527, in <module>
    cli_main()
  File "../../train.py", line 520, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/workspace/OFA/fairseq/fairseq/distributed/utils.py", line 374, in call_main
    distributed_main(cfg.distributed_training.device_id, main, cfg, kwargs)
  File "/workspace/OFA/fairseq/fairseq/distributed/utils.py", line 348, in distributed_main
    main(cfg, **kwargs)
  File "../../train.py", line 189, in main
    valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
  File "/opt/conda/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "../../train.py", line 300, in train
    log_output = trainer.train_step(samples)
  File "/opt/conda/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/workspace/OFA/trainer.py", line 773, in train_step
    loss, sample_size_i, logging_output = self.task.train_step(
  File "/workspace/OFA/tasks/ofa_task.py", line 319, in train_step
    loss, sample_size, logging_output = criterion(model, sample, update_num=update_num)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 881, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/workspace/OFA/criterions/scst_loss.py", line 88, in forward
    loss, score, ntokens, nsentences = self.compute_loss(model, sample, reduce=reduce)
  File "/workspace/OFA/criterions/scst_loss.py", line 239, in compute_loss
    gen_target, gen_res, gt_res = self.get_generator_out(model, sample)
  File "/workspace/OFA/criterions/scst_loss.py", line 149, in get_generator_out
    gen_out = self.task.scst_generator.generate([model], sample)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/OFA/models/sequence_generator.py", line 207, in generate
    return self._generate(models, sample, **kwargs)
  File "/workspace/OFA/models/sequence_generator.py", line 480, in _generate
    assert step < max_len, f"{step} < {max_len}"
AssertionError: 16 < 16

How can I fix this problem? Thanks!

Inference model using Huggingface library

First of all, thanks for your amazing work. 👍
I'm very surprised by the results you've achieved.
But I have a question: is it possible to use this model with the transformers library, using the model's checkpoint?
You made it possible to run inference with the model in Hugging Face Spaces, so are you planning to upload a checkpoint to the transformers library and support that library for inference?
The Colab you posted only shows how to use fairseq.
I'll be waiting for the reply.
Once again, thank you for the amazing results!

Text-to-Image Generation Code Example

Thank you for your awesome project. I looked at your repository, but I could not find any example for text-to-image generation (I mean like what you have provided in the Google Colab for other tasks).

Could you please provide a code example that just generates an image from the input text?

Where should I make modifications if I want to change the architecture based on the OFA model?

Hi, I am trying some minor changes to the OFA base model and running the same experiments as OFA. I have modified models/ofa/unified_transformer_layer.py for the bottom-layer design and registered the model as ('ofa', 'tuning ofa') in models/ofa/ofa.py. Is there anything I need to pay attention to?
Also, if I need to load the weights from the ofa_base checkpoint directly into the modified OFA ("tuning ofa"), where should I implement the special loading logic? I am a little confused when searching through the code.
I will really appreciate it if you can help me!
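
One generic pattern for the second question (an editor's sketch in plain PyTorch, not the repository's own checkpoint-loading path): load the pretrained state dict with strict=False so that parameters added by the modified architecture simply keep their initialization, and print what was skipped. The checkpoint path mirrors the finetuning scripts; whether the weights sit under a "model" key is an assumption about the fairseq checkpoint layout.

import torch

# `model` is assumed to be the already-built modified ("tuning ofa") architecture
state = torch.load("../../checkpoints/ofa_base.pt", map_location="cpu")
pretrained = state.get("model", state)

missing, unexpected = model.load_state_dict(pretrained, strict=False)
print("parameters left at their initialization (new layers):", missing)
print("checkpoint parameters with no destination:", unexpected)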
