Experiment setup for evaluating the effects of quality-estimation filtering on machine translation models
- PyTorch version >= 1.10.0
- Python version >= 3.8 (experiments were run with Python 3.9.12)
- fairseq: `pip install fairseq`
- TransQuest: `pip install transquest`
The German-English IWSLT17 dataset can be found here.
- Download `2017-01-trnmted.tgz`.
- Extract the files with `tar -xzvf 2017-01-trnmted.tgz`.
- From the extracted folder, also extract `texts/DeEnItNlRo/DeEnItNlRo/DeEnItNlRo.tgz`.
- Run `prep-iwslt17.sh` to prepare the train, test, and valid sets.
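The two-stage extraction above can also be scripted. A minimal sketch using only the standard library; the nested-archive layout (`texts/DeEnItNlRo/DeEnItNlRo/DeEnItNlRo.tgz` directly under the output directory) is assumed from the steps above, so adjust the path if your extracted layout differs:

```python
import tarfile
from pathlib import Path

def extract_iwslt17(archive="2017-01-trnmted.tgz", out_dir="iwslt17"):
    """Extract the outer IWSLT17 archive, then the nested DeEnItNlRo archive."""
    out = Path(out_dir)
    # Outer archive: contains the texts/ tree with a nested .tgz inside.
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(out)
    # Nested archive holding the actual multilingual text files
    # (path assumed from the step above; adjust to your layout).
    inner = out / "texts" / "DeEnItNlRo" / "DeEnItNlRo" / "DeEnItNlRo.tgz"
    with tarfile.open(inner, "r:gz") as tar:
        tar.extractall(inner.parent)
```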
- Split the `train.de` and `train.en` files using `Data_Preprocessing/split-data.sh`. This creates all 7 dataset splits at once.
- Preprocess all split datasets using `Data_Preprocessing/preprocess.sh`. Change the paths to point to your train, valid, and test set locations.
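The splitting step can be pictured as cutting the aligned corpus into 7 pieces. The sketch below is only an illustration of that idea and assumes equal contiguous chunks; the actual split scheme is defined in `Data_Preprocessing/split-data.sh` and may differ:

```python
def split_parallel(src_lines, tgt_lines, n_splits=7):
    """Split an aligned parallel corpus into n contiguous, roughly equal parts.

    Illustrative only: the real splits come from split-data.sh.
    """
    assert len(src_lines) == len(tgt_lines), "corpus sides must stay aligned"
    size = len(src_lines)
    # Boundary indices for n roughly equal chunks.
    bounds = [round(i * size / n_splits) for i in range(n_splits + 1)]
    return [
        (src_lines[a:b], tgt_lines[a:b])
        for a, b in zip(bounds, bounds[1:])
    ]
```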
- Run `fs_tq_v1.py` for each split. Change the source and target data to `test.de` and `test.en`, respectively. Change the checkpoint and databin paths for all 7 splits. In line 76, save sentences that are rated lower than 0.70: `if (pred < 0.70)`.
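The step above (and the later variants with threshold 0.712 or a `>` comparison) all apply the same quality-estimation gate: each (source, translation) pair gets a QE score, and only pairs on the chosen side of the threshold are kept. A minimal sketch of that filtering logic, with scores passed in directly (in the actual scripts they come from a TransQuest sentence-level model; all names here are illustrative, not the scripts' own):

```python
def filter_by_qe(pairs, scores, threshold=0.70, keep_low=True):
    """Keep (src, tgt) pairs whose QE score falls on the chosen side of the threshold.

    keep_low=True mirrors `if (pred < 0.70)`; keep_low=False mirrors the
    `pred > threshold` variants used in the later experiments.
    """
    kept = []
    for (src, tgt), pred in zip(pairs, scores):
        keep = pred < threshold if keep_low else pred > threshold
        if keep:
            kept.append((src, tgt))
    return kept
```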
- Preprocess the saved sentences using `Data_Preprocessing/preprocess.sh`.
- Fine-tune the original models for each split with `finetune.sh`. Change the paths to the tokenized data, `checkpoint_best.pt`, and the save directory.
- Run `fs_tq_v1.py` for each split. Change the source and target data to `train.de` and `train.en`, respectively. Change the checkpoint and databin paths for all 7 splits. In line 76, save sentences that are rated lower than 0.712: `if (pred < 0.712)`.
- Preprocess the saved sentences using `Data_Preprocessing/preprocess.sh`.
- Fine-tune the original models for each split with `finetune.sh`. Change the paths to the tokenized data, `checkpoint_best.pt`, and the save directory.
- Run `fs_tq_v1.py` for each split. Change the source and target data to `train.de` and `train.en`, respectively. Change the checkpoint and databin paths for all 7 splits. In line 76, save sentences that are rated higher than 0.712: `if (pred > 0.712)`.
- Preprocess the saved sentences using `Data_Preprocessing/preprocess.sh`.
- Fine-tune the original models for each split with `finetune.sh`. Change the paths to the tokenized data, `checkpoint_best.pt`, and the save directory.
- Run `tq_iwslt17.py` for each original dataset split. Change the source and target data to `train.de` and `train.en`, respectively. In line 76, save sentences that are rated higher than 0.712: `if (pred > 0.712)`.
- Preprocess the saved sentences using `Data_Preprocessing/preprocess.sh`.
- Train a new model for each split with `finetune.sh`, with a learning rate of `--lr 5e-4`. Omit line 18, `--finetune-from-model checkpoints-v4/checkpoint7/checkpoint_best.pt`. Change the paths to the tokenized data and the save directory.
- Run `fs_tq_v4.py` for each split. Change the source and target data to `train.de` and `train.en`, respectively. Change the checkpoint to point to the models trained in experiment 4, and change the databin path for all 7 splits. In line 79, save sentences that are rated higher than 0.712: `if (pred > 0.712)`.
- Preprocess the saved sentences using `Data_Preprocessing/preprocess.sh`.
- Train a new model for each split with `finetune.sh`. Change the paths to the tokenized data and the save directory.
- Run `fs_tq_v4.py` for each split. Change the source and target data to `train.de` and `train.en`, respectively. Change the checkpoint to point to the models trained in experiment 4, and change the databin path for all 7 splits. In line 79, save sentences that are rated lower than 0.712: `if (pred < 0.712)`.
- Preprocess the saved sentences using `Data_Preprocessing/preprocess.sh`.
- Train a new model for each split with `finetune.sh`. Change the paths to the tokenized data and the save directory.