- Does it follow this approach? A brief summary of it can be found here.
Not exactly that approach, but a similar one: see Sec. 3.3 of this paper. An unknown (target) word is replaced using alignment information. To do that, we assume that the attention mechanism acts as an alignment model. So, when we generate an unknown word (let's call it `unk`), we select the source word with the highest attention (let's call it `src_candidate`). Then we apply one of the following heuristics to replace it:

- 0: Replace the unknown word with the aligned source word (`unk` -> `src_candidate`).
- 1: Replace the unknown word with the translation of the aligned source word (`unk` -> `translation(src_candidate)`). The translation here is given by a statistical dictionary (e.g. built with fast_align).
- 2: Apply heuristic 1 if `src_candidate` starts with a lower-case letter; otherwise, apply heuristic 0. The rationale behind this is that proper nouns (starting with a capital letter) should appear as they are in the translation.
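As a rough illustration of what happens at decoding time, here is a minimal sketch of the three heuristics; the function and variable names are hypothetical, not nmt-keras internals:

```python
# Hypothetical sketch of attention-based unknown-word replacement.
import numpy as np

def replace_unknowns(target_tokens, source_tokens, attention,
                     dictionary, heuristic=0, unk_token='<unk>'):
    """attention: (target_len, source_len) attention weights.
    dictionary: source word -> most likely translation (e.g. from fast_align)."""
    output = []
    for t, token in enumerate(target_tokens):
        if token != unk_token:
            output.append(token)
            continue
        # Treat the most-attended source word as the aligned word.
        src_candidate = source_tokens[int(np.argmax(attention[t]))]
        if heuristic == 0:
            # Heuristic 0: copy the aligned source word.
            output.append(src_candidate)
        elif heuristic == 1:
            # Heuristic 1: translate it; copy it when it's not in the dictionary.
            output.append(dictionary.get(src_candidate, src_candidate))
        else:
            # Heuristic 2: copy capitalized words (likely proper nouns),
            # translate the rest. Caseless scripts never look capitalized,
            # so this degenerates to heuristic 1, as noted below.
            if src_candidate[:1].isupper():
                output.append(src_candidate)
            else:
                output.append(dictionary.get(src_candidate, src_candidate))
    return output
```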
- How does heuristic 2 handle languages other than English, i.e. ones without lower-casing?
All heuristics are language-agnostic. If the source language has no casing information, heuristic 2 falls back to heuristic 1.
- What happens if `POS_UNK` is set to `False`?
Then your model can generate unknown words (see the discussion in the papers above).
An alternative to all these tricks is to use subwords instead of words. This is standard practice in NMT and I recommend doing that.
You can use the `utils/build_mapping_file.sh` script to obtain it. You'll need to install fast_align and change the path to the executable in that script. It will create a `.pkl` file containing the alignments. You then need to set the `MAPPING` variable in the config file to point to this file (line 82 in 3f97677).
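For reference, the relevant part of the config might then look like this (variable names as in nmt-keras's config.py; the mapping file name is just an example):

```python
# config.py excerpt (illustrative values)
POS_UNK = True                             # replace unknown words when decoding
HEURISTIC = 1                              # 0, 1 or 2, as described above
MAPPING = DATA_ROOT_PATH + '/mapping.pkl'  # .pkl created by utils/build_mapping_file.sh
```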
Okay! Will try that right now.
Finally, is it possible to run subword-based NMT with this?
If the files set in the config (lines 16 to 18 in 3f97677) are already segmented into subwords, you shouldn't set `TOKENIZATION_METHOD = 'tokenize_bpe'`, because it would apply the segmentation twice. In that case you should set `TOKENIZATION_METHOD = 'tokenize_none'`.
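That is, as a config excerpt:

```python
# config.py excerpt: the data files were already segmented offline,
# so nmt-keras must not apply BPE again.
TOKENIZATION_METHOD = 'tokenize_none'
# TOKENIZATION_METHOD = 'tokenize_bpe'  # only for raw, unsegmented files
```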
Maybe add a note about this script in the README?
Yes, feel free to open a PR describing how you did this. I can review it.
- For heuristic 1, how is the alignment calculated for this toolkit?
Yes, they are compatible. But if you use subwords, the unk problem is unlikely to happen (at least with the Latin writing system or similar ones), because if the segmenter (say, BPE) finds an unknown word, it will segment it into known subwords. The extreme case of this are characters (e.g. `Word` -> `W@@ o@@ r@@ d`). So you end up with effectively no unknown words.
If you want to prevent this behavior, you can set up words that shouldn't be broken up. In subword-nmt you can do this with the `--glossaries` option.
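For example, with subword-nmt's Python API (the glossary terms below are made up, and `training_codes.joint` follows the file naming used later in this thread):

```python
# Sketch: keep selected tokens unsegmented via subword-nmt's
# `glossaries` argument.
import codecs
from subword_nmt.apply_bpe import BPE

with codecs.open('training_codes.joint', encoding='utf-8') as codes:
    bpe = BPE(codes, glossaries=['nmt-keras', 'fast_align'])  # hypothetical glossary

# Glossary entries pass through intact; other words may be split into subwords.
print(bpe.process_line('we align the data with fast_align'))
```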
Finally, note that a (standard) NMT system doesn't consider these linguistic features. It only models sequences. The elements of the sequence are encoded as indices, independently of their linguistic meaning (words, chars or subwords).
P.S.: When using subwords you may still have unknown words: an unseen character would still be considered an unknown word.
Hey,
After learning BPE and re-applying it with a vocabulary filter from subword-nmt, I'm not sure
- how to incorporate it into the configuration for the train/validation/test files?

It's a bit ambiguous regarding `BPE_CODES_PATH = DATA_ROOT_PATH + '/training_codes.joint'`. Assume the files `train.BPE.L1` and `train.BPE.L2` are obtained with subword-nmt from `train.L1` and `train.L2`.
Currently, only joint BPE is supported (see its section in subword-nmt). This generates a single BPE codes file, which should be set as `BPE_CODES_PATH`. If you want to use this file to segment your sentences, you should also set `TOKENIZATION_METHOD = 'tokenize_bpe'`.
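With the file name from your message, the config would contain something like:

```python
# config.py excerpt: point nmt-keras at the joint BPE codes.
BPE_CODES_PATH = DATA_ROOT_PATH + '/training_codes.joint'
TOKENIZATION_METHOD = 'tokenize_bpe'  # only if nmt-keras should apply the codes itself
```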
In addition, if these are your first steps using subword techniques, I recommend making them explicit, so they are not obscured by other processes. I would:

- Learn a BPE model from the training data.
- Apply it to the training data (source and target).
- Apply it to the source validation/test data, leaving the target as it is. This is because we don't want to evaluate MT quality on segmented data.
- In config.py, set the detokenization options to revert BPE tokenization (`DETOKENIZATION_METHOD = 'detokenize_bpe'` and `APPLY_DETOKENIZATION = True`; see lines 89 to 91 in 3f97677 and the excerpt after this list).
- Train and evaluate as usual. So you'll train with BPE, but you'll generate and evaluate sentences without the segmentation.
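The options from the fourth step, as an illustrative excerpt of config.py (cf. lines 89 to 91 in 3f97677):

```python
# config.py excerpt: undo BPE segmentation in the generated hypotheses,
# so evaluation runs on plain, desegmented text.
DETOKENIZATION_METHOD = 'detokenize_bpe'
APPLY_DETOKENIZATION = True
```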
You can check how to do the first 3 steps in this script and you can find config examples under the examples directory.
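If you'd rather do the first three steps from Python than with that script, here is a minimal sketch using subword-nmt's API (the concatenated training file and the number of merge operations are assumptions; output names follow this thread):

```python
# Sketch of the three BPE steps with subword-nmt's Python API.
import codecs
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# 1) Learn joint BPE codes from the concatenated training data
#    ('train.L1L2' is assumed to be train.L1 + train.L2).
with codecs.open('train.L1L2', encoding='utf-8') as infile, \
     codecs.open('training_codes.joint', 'w', encoding='utf-8') as outfile:
    learn_bpe(infile, outfile, num_symbols=32000)

with codecs.open('training_codes.joint', encoding='utf-8') as codes:
    bpe = BPE(codes)

# 2) Segment both sides of the training data.
# 3) Segment only the source side of validation/test; the target
#    references stay untouched for evaluation.
for name in ['train.L1', 'train.L2', 'dev.L1', 'test.L1']:
    with codecs.open(name, encoding='utf-8') as inp, \
         codecs.open(name.replace('.', '.BPE.', 1), 'w', encoding='utf-8') as out:
        for line in inp:
            out.write(bpe.process_line(line))
```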
Thanks! After following the above steps I can obtain `Train.L1` and `Train.L2`, plus `dev.L1` and `test.L1`, BPE-processed.
So, while training, `TOKENIZATION_METHOD = 'tokenize_bpe'` should be set, and while decoding, both `DETOKENIZATION_METHOD = 'detokenize_bpe'` and `APPLY_DETOKENIZATION = True` must be enabled, in addition to the above?
Maybe add a note about this script in the README?