jerinphilip / ilmulti

Tooling to play around with multilingual machine translation for Indian Languages.

Home Page: http://preon.iiit.ac.in/~jerin/bhasha

License: MIT License

Python 82.07% Shell 1.75% OCaml 3.92% UrWeb 12.27%

machine-translation machine-translation-models multilingual-translation tokenizer indian-languages multilingual-translations pytorch wrappers

ilmulti's Introduction

ilmulti

This repository houses the tooling used to create the models on the leaderboard of WAT-Tasks. We provide wrappers around models trained via pytorch/fairseq so they can be used to translate. Installation and usage instructions are provided below.

Training: We use a separate fork of pytorch/fairseq at jerinphilip/fairseq-ilmt for training to optimize for our cluster and to plug and play data easily.
Pretrained Models and Other Resources: preon.iiit.ac.in/~jerin/bhasha

Installation

The code is tested to work with the fairseq-fork which is branched from v0.8.0 and torch version 1.0.0.

# --user is optional

# Check requirements.txt, packages for translation:
# fairseq-ilmt@lrec-2020 and torch  are not enabled by default.
python3 -m pip install -r requirements.txt --user  

# Once requirements are installed, you can install ilmulti into library.

python3 setup.py install --user
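
Several of the issues further down trace back to a different torch or fairseq being installed than the versions named above. As a minimal sketch (not part of ilmulti), the check below compares installed versions against the ones mentioned in this section. It assumes Python 3.8+ for importlib.metadata and assumes the fork installs under the distribution name "fairseq"; the helper name find_version_mismatches is hypothetical.

```python
from importlib import metadata  # standard library, Python 3.8+

# Versions named in the Installation text; the distribution names are assumptions.
EXPECTED = {"torch": "1.0.0", "fairseq": "0.8.0"}


def find_version_mismatches(expected):
    """Return {package: (wanted, installed_or_None)} for every mismatch."""
    mismatches = {}
    for package, wanted in expected.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            installed = None  # the package is not installed at all
        # Prefix match so that local build suffixes still count as a match.
        if installed is None or not installed.startswith(wanted):
            mismatches[package] = (wanted, installed)
    return mismatches


for package, (wanted, installed) in find_version_mismatches(EXPECTED).items():
    print(f"{package}: expected {wanted}, found {installed or 'not installed'}")
```

An empty result means both packages match the versions the code was tested with.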

Downloading Models: The script scripts/download-and-setup-models.sh downloads the model and dictionary files required for running examples/mm_all.py. Which models to download can be configured in the script.

A working example using the wrappers in this code can be found in this colab notebook. Thanks @Nimishasri.

Usage

from ilmulti.translator import from_pretrained

translator = from_pretrained(tag='mm-all')
sample = translator("The quick brown fox jumps over the lazy dog", tgt_lang='hi')

The code works with three main components:

1. Segmenter

Also known as a sentence tokenizer. It segments a block of text into sentences, accounting for some Indian-language delimiters.

PatternSegmenter: A somewhat crude, rule-based implementation contributed by Binu Jasim.
PunktSegmenter: The segmenter was later changed to this one, an unsupervised, learned PunktTokenizer.
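
To illustrate what rule-based segmentation with Indian-language delimiters can look like, here is a minimal sketch. It is not the PatternSegmenter implementation: the function name split_sentences and the delimiter set (ASCII punctuation plus the Devanagari danda and double danda) are assumptions chosen for illustration.

```python
import re

# Split on whitespace that follows a sentence-final delimiter.
# Delimiters: ., ?, ! plus the danda (।) and double danda (॥).
_SENTENCE_END = re.compile(r"(?<=[.?!।॥])\s+")


def split_sentences(text):
    """Return the sentences in text, keeping each closing delimiter."""
    return [part.strip() for part in _SENTENCE_END.split(text) if part.strip()]


print(split_sentences("यह एक वाक्य है। यह दूसरा है। Is this the third? Yes."))
```

A rule like this breaks on abbreviations such as "Dr.", which is one reason a learned segmenter like Punkt can be preferable.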

2. Tokenization

We use SentencePiece as an unsupervised tokenizer for Indian languages, which works surprisingly well in our experiments. Models trained on whatever corpora we could find for the specific languages are available in sentencepiece/models, with 4000 and 8000 vocabulary units.

Training a joint SentencePiece model over all languages led to character-level tokenization for under-represented languages. Since the scripts differ and there is little to gain from sharing, we use an individual tokenizer for each language. The combined vocabulary is nevertheless smaller than 4000 x |#languages|, because some common English code-mixed tokens appear in several of the per-language vocabularies. This also makes the MT system somewhat robust to code-mixed inputs.
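
The vocabulary-size argument can be checked with a toy example. The sketch below uses made-up vocabularies of four pieces per language instead of 4000, and the function name combined_vocab_size is hypothetical. Tokens shared across languages are counted once, so the combined size falls below the sum of the individual sizes.

```python
def combined_vocab_size(vocabularies):
    """Count distinct tokens across all per-language vocabularies."""
    combined = set()
    for tokens in vocabularies.values():
        combined.update(tokens)
    return len(combined)


# Toy data: the Hindi and Tamil vocabularies contain two code-mixed English pieces.
vocabularies = {
    "en": {"▁the", "▁phone", "▁ok", "ing"},
    "hi": {"▁है", "▁और", "▁phone", "▁ok"},
    "ta": {"▁இது", "▁ஒரு", "▁phone", "▁ok"},
}

upper_bound = sum(len(tokens) for tokens in vocabularies.values())
print(combined_vocab_size(vocabularies), "distinct tokens, upper bound", upper_bound)
```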

3. Translator

Translator is a wrapper around fairseq that we have reused for some web interfaces and demos.
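
How the three components fit together can be sketched with stand-in callables. This is an illustration of the flow only, not the library's MTEngine: the class ToyEngine and all three stand-ins are hypothetical, and the "translator" here merely upper-cases tokens.

```python
class ToyEngine:
    """Chains segmenter -> tokenizer -> translator, one sentence at a time."""

    def __init__(self, segmenter, tokenizer, translator):
        self.segmenter = segmenter
        self.tokenizer = tokenizer
        self.translator = translator

    def __call__(self, text, tgt_lang):
        outputs = []
        for sentence in self.segmenter(text):
            tokens = self.tokenizer(sentence)
            translated = self.translator(tokens, tgt_lang)
            outputs.append(" ".join(translated))  # naive detokenization
        return outputs


# Stand-ins so that the sketch runs without any model files.
engine = ToyEngine(
    segmenter=lambda text: text.split(". "),
    tokenizer=str.split,
    translator=lambda tokens, tgt_lang: [token.upper() for token in tokens],
)
print(engine("a b. c d", tgt_lang="hi"))
```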


ilmulti's Issues

[Improvement] Over the current way of adding new datasets

The "code adaptation" for training the model on datasets and languages that aren't already present in the library is:

Present:
Changing corpora.py in fairseq-ilmt (this has been moved to ilmulti in this commit)

BUT

Proposed:
A. Could we change this to work the way fairseq handles user-defined tasks/models, where we pass in a module containing the user-defined "things", like this
OR
B. All the information except the path is already present in the configuration file, so we could add a path key to each dataset entry that gives the path to the dataset folder. If this route is taken, we have to ensure a standard format for the data files.

Reason for this suggestion: It is cleaner (in my opinion), and the library code doesn't have to change, i.e., it becomes a "plug and play" approach.
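
Option B could look roughly like the sketch below. Everything in it is hypothetical: the shape of the configuration, the dataset names, the "path" key, and the assumed file layout of <path>/<split>.<lang>. It only shows that a per-dataset path plus a standard file naming scheme is enough to locate the data without touching library code.

```python
import os

# Hypothetical configuration: each dataset entry carries its own "path" key.
config = {
    "datasets": [
        {"name": "corpus-a", "pairs": [["en", "hi"]], "path": "data/corpus-a"},
        {"name": "corpus-b", "pairs": [["en", "ta"]], "path": "data/corpus-b"},
    ]
}


def resolve_files(dataset, split="train"):
    """Map each language to '<path>/<split>.<lang>' (assumed standard layout)."""
    files = {}
    for pair in dataset["pairs"]:
        for lang in pair:
            files[lang] = os.path.join(dataset["path"], f"{split}.{lang}")
    return files


for dataset in config["datasets"]:
    print(dataset["name"], resolve_files(dataset))
```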

Exception: Please define ILMULTI_CORPUS_ROOT in environment variable

System:
Description: Ubuntu 18.04.4 LTS
Release: 18.04
Codename: bionic

Python 3.7.6

Created a folder Research/MT on Home. Setup and activated Virtual environment in Research/MT
Cloned ilmulti in Research/MT
Ran the following:

python3 -m pip install -r requirements.txt --user  
python3 setup.py install

Downloaded the models by running scripts/download-and-setup-models.sh
No errors/warnings up to this point
Opened up a Jupyter notebook and ran the following:

from ilmulti.translator import from_pretrained

translator = from_pretrained(tag='mm-all')
sample = translator("The quick brown fox jumps over the lazy dog", tgt_lang='hi')

Exception: Please define ILMULTI_CORPUS_ROOT in environment variable
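
The exception asks for an environment variable, and a notebook kernel does not see variables exported in a shell after the kernel has started. A minimal sketch of setting it from inside the notebook follows. The directory is a placeholder, and whether the library needs real corpora there for inference is not stated in this thread, so treat that as an open assumption.

```python
import os

# Placeholder directory; point this at wherever your corpora actually live.
corpus_root = os.path.expanduser("~/Research/MT/corpora")

# setdefault keeps any value that was already exported in the shell.
os.environ.setdefault("ILMULTI_CORPUS_ROOT", corpus_root)

# Import ilmulti only after this point, for example:
# from ilmulti.translator import from_pretrained
print(os.environ["ILMULTI_CORPUS_ROOT"])
```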

Can you provide the method to train on our own corpora using your version of fairseq?

I normally use indicnlp to tokenize and moses to train the MT system, but your model gives better accuracy. Can you give some insight into the amount of corpus used to train the model? Thank you.

KeyError: 'shared-multilingual-translation'

translator = from_pretrained(tag='mm-all-iter1')

| [src] dictionary: 40897 types
| [tgt] dictionary: 40897 types
/content/ilmulti/ilmulti/translator/translator.py:37: UserWarning: utils.load_ensemble_for_inference is deprecated. Please use checkpoint_utils.load_model_ensemble instead.
  self.models, model_args = fairseq.utils.load_ensemble_for_inference(model_paths, self.task, model_arg_overrides=eval(args.model_overrides))
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-22-f8e48d2129df> in <module>()
----> 1 translator = from_pretrained(tag='mm-to-en-iter1')

7 frames
/content/ilmulti/ilmulti/translator/pretrained.py in from_pretrained(tag, use_cuda)
     60     from .mt_engine import MTEngine
     61 
---> 62     translator = build_translator(config['model'], use_cuda=use_cuda)
     63     segmenter = build_segmenter(config['segmenter'])
     64     tokenizer = build_tokenizer(config['tokenizer'])

/content/ilmulti/ilmulti/translator/translator.py in build_translator(model, use_cuda)
    169     args.enhance(**keyword_arguments)
    170 
--> 171     fseq_translator = FairseqTranslator(args, use_cuda=use_cuda)
    172     return fseq_translator
    173 

/content/ilmulti/ilmulti/translator/translator.py in __init__(self, args, use_cuda)
     35         # print('| loading model(s) from {}'.format(args.path))
     36         model_paths = args.path.split(':')
---> 37         self.models, model_args = fairseq.utils.load_ensemble_for_inference(model_paths, self.task, model_arg_overrides=eval(args.model_overrides))
     38         self.tgt_dict = self.task.target_dictionary
     39 

/usr/local/lib/python3.6/dist-packages/fairseq/utils.py in load_ensemble_for_inference(filenames, task, model_arg_overrides)
     28     )
     29     return checkpoint_utils.load_model_ensemble(
---> 30         filenames, arg_overrides=model_arg_overrides, task=task,
     31     )
     32 

/usr/local/lib/python3.6/dist-packages/fairseq/checkpoint_utils.py in load_model_ensemble(filenames, arg_overrides, task)
    154         task (fairseq.tasks.FairseqTask, optional): task to use for loading
    155     """
--> 156     ensemble, args, _task = _load_model_ensemble(filenames, arg_overrides, task)
    157     return ensemble, args
    158 

/usr/local/lib/python3.6/dist-packages/fairseq/checkpoint_utils.py in _load_model_ensemble(filenames, arg_overrides, task)
    165         if not os.path.exists(filename):
    166             raise IOError('Model file not found: {}'.format(filename))
--> 167         state = load_checkpoint_to_cpu(filename, arg_overrides)
    168 
    169         args = state['args']

/usr/local/lib/python3.6/dist-packages/fairseq/checkpoint_utils.py in load_checkpoint_to_cpu(path, arg_overrides)
    141         for arg_name, arg_val in arg_overrides.items():
    142             setattr(args, arg_name, arg_val)
--> 143     state = _upgrade_state_dict(state)
    144     return state
    145 

/usr/local/lib/python3.6/dist-packages/fairseq/checkpoint_utils.py in _upgrade_state_dict(state)
    319 
    320     # set any missing default values in the task, model or other registries
--> 321     set_defaults(tasks.TASK_REGISTRY[state['args'].task])
    322     set_defaults(models.ARCH_MODEL_REGISTRY[state['args'].arch])
    323     for registry_name, REGISTRY in registry.REGISTRIES.items():

KeyError: 'shared-multilingual-translation'
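
The last frame of the traceback looks up the checkpoint's task name in a registry dictionary. One plausible reading is that the fairseq installed under dist-packages is not the fork that defines 'shared-multilingual-translation', so the name was never registered. The toy registry below is an illustration of that pattern, not fairseq's code; all names in it are hypothetical.

```python
TASK_REGISTRY = {}


def register_task(name):
    """Decorator that records a task class under a string name."""
    def wrapper(cls):
        TASK_REGISTRY[name] = cls
        return cls
    return wrapper


@register_task("translation")
class TranslationTask:
    """Stand-in for a task that the installed package does define."""


def lookup(name):
    """Look up a task and fail with a message that lists what is registered."""
    try:
        return TASK_REGISTRY[name]
    except KeyError:
        known = ", ".join(sorted(TASK_REGISTRY))
        raise KeyError(f"task {name!r} is not registered; known tasks: {known}") from None


print(lookup("translation").__name__)
```

Looking up a name that only a different fork registers fails in the same way as the traceback above.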

Error(s) in loading state_dict for TransformerModel:

Downloaded all the models (mm-all was commented out in scripts/download-and-setup-models.sh).
Ran the test translation script as mentioned in the Readme.

Error(s) in loading state_dict for TransformerModel:
	size mismatch for encoder.embed_tokens.weight: copying a param with shape torch.Size([28168, 512]) from checkpoint, the shape in current model is torch.Size([26346, 512]).
	size mismatch for decoder.embed_out: copying a param with shape torch.Size([28160, 512]) from checkpoint, the shape in current model is torch.Size([26346, 512]).
	size mismatch for decoder.embed_tokens.weight: copying a param with shape torch.Size([28160, 512]) from checkpoint, the shape in current model is torch.Size([26346, 512]).
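
The first dimension of an embedding matrix is the vocabulary size, so the message says the dictionary used to build the current model (26346 entries) differs from the one the checkpoint was trained with (28168 and 28160). A rough way to see which dictionary file is in use is to count its entries. The sketch below assumes one entry per non-empty line and assumes four special symbols are added on top of the file's entries; both the helper name and the default of 4 are assumptions to verify against your fairseq version.

```python
def expected_embedding_rows(lines, num_special_symbols=4):
    """Estimate embedding rows from dictionary lines plus special symbols."""
    entries = sum(1 for line in lines if line.strip())
    return entries + num_special_symbols


# Toy dictionary in "token count" form; a real file would be opened with
# open(path, encoding="utf-8") and passed in the same way.
toy_dictionary = ["▁the 120\n", "▁phone 45\n", "ing 30\n", "\n"]
print(expected_embedding_rows(toy_dictionary))
```

If the count for the downloaded dictionary gives 26346, the dictionary and the checkpoint likely come from different model releases.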

Sample Colab Notebook seems to have a bug

Hi Jerin,

Is the first cell needed for anything other than storage/ vscode linkage?

If not, the sample Colab Notebook seems to have some bug. Have a look at this when you get the time : https://colab.research.google.com/gist/rahulraj80/2f45c7ab1b44c616b12917f5211c51d3/ilmulti-sample-run-notebook.ipynb

It complains:

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
/content/ilmulti/ilmulti/translator/translator.py in <module>()
      5 try:
----> 6     import fairseq
      7     import torch

ModuleNotFoundError: No module named 'fairseq'

During handling of the above exception, another exception occurred:

NameError                                 Traceback (most recent call last)
2 frames
<ipython-input-8-e2b51881e661> in <module>()
----> 1 from ilmulti.translator import from_pretrained
      2 
      3 translator = from_pretrained(tag='mm-all-iter0')
      4 
      5 sample = translator("The quick brown fox jumps over the lazy dog", tgt_lang='hi')

/content/ilmulti/ilmulti/translator/__init__.py in <module>()
      1 
----> 2 from .translator import FairseqTranslator
      3 from .mt_engine import MTEngine
      4 from .pretrained import from_pretrained, mm_all

/content/ilmulti/ilmulti/translator/translator.py in <module>()
      9     from fairseq import data, options, tasks, tokenizer, utils
     10 except ImportError:
---> 11     warnings.warn(
     12     """
     13     Please check if you have installed specified versions of torch,

NameError: name 'warnings' is not defined

The last torch version error seems to be erroneous, as a few cells up it said:

Requirement already satisfied: torch==1.0.0 in /usr/local/lib/python3.6/dist-packages (1.0.0)
Requirement already satisfied: torchvision==0.2.1 in /usr/local/lib/python3.6/dist-packages (0.2.1)
Requirement already satisfied: pillow>=4.1.1 in /usr/local/lib/python3.6/dist-packages (from torchvision==0.2.1) (7.0.0)
Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from torchvision==0.2.1) (1.16.0)
Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from torchvision==0.2.1) (1.15.0)

Cheers,
Rahul
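
The NameError in the traceback above hides the real problem. The except branch calls warnings.warn, but warnings was never imported, so the intended hint about a missing fairseq is replaced by a second exception. A minimal sketch of the optional-import pattern with the missing import added is shown below; the flag HAVE_FAIRSEQ and the message wording are hypothetical.

```python
import warnings  # must be imported before the except branch can use it

try:
    import fairseq  # noqa: F401  (optional dependency)
    HAVE_FAIRSEQ = True
except ImportError:
    HAVE_FAIRSEQ = False
    warnings.warn(
        "fairseq could not be imported; install the versions of torch and "
        "fairseq listed in requirements.txt to enable translation."
    )

print("fairseq available:", HAVE_FAIRSEQ)
```

With this in place the pip output quoted above is consistent with the traceback: torch is installed, and the module that cannot be found is fairseq.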

Facing errors in basic test run

Unable to run either examples/mm_all.py or the basic test script provided.

Steps:

cd ~
git clone https://github.com/jerinphilip/fairseq-ilmt.git
cd fairseq-ilmt
pip3 install --editable .
cd ~
git clone https://github.com/jerinphilip/ilmulti.git
cd ilmulti
pwd
python3 -m pip install -r requirements.txt 
python3 setup.py install

No errors reported.

Downloaded models via:

cd ~
bash ilmulti/scripts/download-and-setup-models.sh

Test Code:

from ilmulti.translator import from_pretrained

translator = from_pretrained(tag='mm-all-iter0')
sample = translator("The quick brown fox jumps over the lazy dog", tgt_lang='hi')

Error log:

| [src] dictionary: 40897 types
| [tgt] dictionary: 40897 types
./ilmulti/translator/translator.py:23: UserWarning: utils.load_ensemble_for_inference is deprecated. Please use checkpoint_utils.load_model_ensemble instead.
  self.models, model_args = fairseq.utils.load_ensemble_for_inference(model_paths, self.task, model_arg_overrides=eval(args.model_overrides))
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-19-d75d9cb5891a> in <module>()
      2 
      3 translator = from_pretrained(tag='mm-all-iter0')
----> 4 sample = translator("The quick brown fox jumps over the lazy dog", tgt_lang='hi')

5 frames
/content/ilmulti/ilmulti/translator/mt_engine.py in __call__(self, source, tgt_lang, src_lang, detokenize)
     22             sources.append(content)
     23 
---> 24         export = self.translator(sources)
     25         export = self._handle_empty_lines_noise(export)
     26         if detokenize:

/content/ilmulti/ilmulti/translator/translator.py in __call__(self, lines, attention)
     65                 },
     66             }
---> 67             translations = self.task.inference_step(self.generator, self.models, sample)
     68             for i, (id, hypos) in enumerate(zip(batch.ids.tolist(), translations)):
     69                 src_tokens_i = utils.strip_pad(src_tokens[i], tgt_dict.pad())

/usr/local/lib/python3.6/dist-packages/fairseq/tasks/fairseq_task.py in inference_step(self, generator, models, sample, prefix_tokens)
    242     def inference_step(self, generator, models, sample, prefix_tokens=None):
    243         with torch.no_grad():
--> 244             return generator.generate(models, sample, prefix_tokens=prefix_tokens)
    245 
    246     def update_step(self, num_updates):

/usr/local/lib/python3.6/dist-packages/torch/autograd/grad_mode.py in decorate_context(*args, **kwargs)
     13         def decorate_context(*args, **kwargs):
     14             with self:
---> 15                 return func(*args, **kwargs)
     16         return decorate_context
     17 

/usr/local/lib/python3.6/dist-packages/fairseq/sequence_generator.py in generate(self, models, sample, prefix_tokens, bos_token, **kwargs)
    374                 step,
    375                 lprobs.view(bsz, -1, self.vocab_size),
--> 376                 scores.view(bsz, beam_size, -1)[:, :, :step],
    377             )
    378 

/usr/local/lib/python3.6/dist-packages/fairseq/search.py in step(self, step, lprobs, scores)
     79             out=(self.scores_buf, self.indices_buf),
     80         )
---> 81         torch.div(self.indices_buf, vocab_size, out=self.beams_buf)
     82         self.indices_buf.fmod_(vocab_size)
     83         return self.scores_buf, self.indices_buf, self.beams_buf

RuntimeError: Integer division of tensors using div or / is no longer supported, and in a future release div will perform true division as in Python 3. Use true_divide or floor_divide (// in Python) instead.
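
The failing line splits a flat index over (beam, vocabulary) positions into a beam index and a token index, which needs floor division rather than true division. The error message itself points to floor_divide or //. The pure-Python sketch below shows the arithmetic only; the function name is hypothetical, and this is not a patch for fairseq's search.py. The error also suggests that the installed torch is newer than the 1.0.0 the code was tested with.

```python
def split_flat_indices(flat_indices, vocab_size):
    """Split indices into a (beam, token) pair using floor division and modulo."""
    beams = [index // vocab_size for index in flat_indices]
    tokens = [index % vocab_size for index in flat_indices]
    return beams, tokens


# With a vocabulary of 5, flat index 12 is token 2 on beam 2.
print(split_flat_indices([0, 5, 12], vocab_size=5))
```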