Calculates BLEU scores for Firefox Translations models using bergamot-translator and compares them to other translation systems.
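As background, BLEU combines modified n-gram precision with a brevity penalty. A toy single-pair sketch in Python (illustration only; the actual evaluation relies on sacrebleu's tokenization and smoothing, not this simplified function):

```python
import math
from collections import Counter

def bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Toy BLEU for one hypothesis/reference pair (whitespace tokenization,
    no smoothing) -- for intuition only, not for real evaluation."""
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped n-gram matches
        precisions.append(overlap / max(sum(hyp_ngrams.values()), 1))
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty: punish hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return 100 * bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```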
git clone https://github.com/mozilla/firefox-translations-evaluation.git
cd firefox-translations-evaluation
Use install/download-models.sh
to get Firefox Translations models, or use your own.
Recommended memory size for Docker is 8 GB.
export MODELS=<absolute path to a local directory with models>
# Specify Azure key and location if you want to add Azure Translator API for comparison
export AZURE_TRANSLATOR_KEY=<Azure translator resource API key>
# Optional: specify if it differs from the default 'global'
export AZURE_LOCATION=<location>
# Specify the GCP credentials JSON path if you want to add Google Translator API for comparison
export GCP_CREDS_PATH=<absolute path to .json>
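For example (all values below are hypothetical placeholders; substitute your own paths and keys):

```shell
# Hypothetical example values -- replace with your own.
export MODELS=/home/user/firefox-translations-models
# Optional: only needed for the Azure comparison.
export AZURE_TRANSLATOR_KEY=0123456789abcdef
export AZURE_LOCATION=westeurope
# Optional: only needed for the Google comparison.
export GCP_CREDS_PATH=/home/user/gcp-service-account.json
```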
# Build and run docker container
bash start_docker.sh
On completion, your terminal should be attached to the launched container.
From inside the Docker container, run:
python3 eval/evaluate.py --translators=bergamot,microsoft,google --pairs=all --skip-existing --models-dir=/models/models/prod --results-dir=/models/evaluation/prod
More options:
python3 eval/evaluate.py --help
install/install-bergamot-translator.sh
- clones and compiles bergamot-translator and marian (runs inside the Docker image).
install/download-models.sh
- downloads current Mozilla production models.
- bergamot - uses compiled bergamot-translator in wasm mode
- marian - uses compiled marian
- google - uses Google Translation API
- microsoft - uses Azure Cognitive Services Translator API
Use the --skip-existing
option to reuse already calculated scores saved as results/xx-xx/*.bleu
files.
It is useful for continuing an interrupted evaluation,
or for rebuilding a full report while re-evaluating only selected translators.
SacreBLEU - all available datasets for a language pair are used for evaluation.
Flores - a parallel evaluation dataset that covers 101 languages.
With the option --pairs=all
, language pairs will be discovered in the specified models folder (option --models-dir
) and evaluation will run for all of them.
Results will be written to the specified directory (option --results-dir
).
Evaluation results for the models used in Firefox Translations can be found in firefox-translations-models/evaluation.