
Comments (9)

lvapeab commented on June 5, 2024

Not exactly that approach, but a similar one: see Sec. 3.3 of this paper. An unknown (target) word is replaced using alignment information. To do that, we assume that the attention mechanism acts as an alignment model. So, when we generate an unknown word (let's call it unk), we select the source word (let's call it src_candidate) with the highest attention weight. Then we apply one of these heuristics to replace it:

  • 0: Replace the unknown word with the aligned source word (unk -> src_candidate).
  • 1: Replace the unknown word with the translation of the aligned source word (unk -> translation(src_candidate)). The translation here is given by a statistical dictionary (e.g. from fast_align).
  • 2: Apply heuristic 1 if src_candidate starts with a lowercase letter; otherwise, apply heuristic 0. The rationale is that proper nouns (which start with a capital letter) should appear as-is in the translation.
  • How does heuristic 2 handle languages other than English, i.e., with respect to lowercasing?

All heuristics are language agnostic. If the source language has no casing information, heuristic 2 falls back to heuristic 1.
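The three heuristics can be sketched as follows. This is an illustration of the idea, not the toolkit's actual implementation; the names (replace_unk, translation_dict) are mine:

```python
def replace_unk(src_words, attention_weights, translation_dict, heuristic=2):
    """Pick the most-attended source word and replace the generated unk."""
    # Treat attention as a soft alignment: the source position with the
    # highest weight is assumed to be aligned to the unknown target word.
    idx = max(range(len(attention_weights)), key=lambda i: attention_weights[i])
    src_candidate = src_words[idx]
    if heuristic == 0:
        return src_candidate                                # copy source word
    translated = translation_dict.get(src_candidate, src_candidate)
    if heuristic == 1:
        return translated                                   # dictionary translation
    # Heuristic 2: copy words starting with a capital (likely proper nouns).
    # Caseless scripts never satisfy isupper(), so they fall back to heuristic 1.
    return src_candidate if src_candidate[:1].isupper() else translated
```

For example, with attention peaking on 'chat' and {'chat': 'cat'} as the dictionary, heuristic 1 yields 'cat', while heuristic 2 would copy a capitalized word such as 'Paris' verbatim.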

  • What happens if POS_UNK is set to False?

Then your model can generate unknown words (see the discussion in the papers above).


An alternative to all these tricks is to use subwords instead of words. This is standard practice in NMT and I recommend doing that.

from nmt-keras.

lvapeab commented on June 5, 2024

You can use the utils/build_mapping_file.sh script to obtain it. You'll need to install fast_align and change the path to the executable in that script. It will create a .pkl file containing the alignments. You then need to set the MAPPING variable in the config file to point to this file:

MAPPING = DATA_ROOT_PATH + '/mapping.%s_%s.pkl' % (SRC_LAN, TRG_LAN)
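For illustration, the mapping file can be read back with pickle. I'm assuming here that it stores a plain dict from source words to target words; the exact structure produced by build_mapping_file.sh may differ:

```python
import os
import pickle
import tempfile

# Toy stand-in for the fast_align-derived statistical dictionary.
mapping = {'gato': 'cat', 'perro': 'dog'}

path = os.path.join(tempfile.gettempdir(), 'mapping.es_en.pkl')
with open(path, 'wb') as f:
    pickle.dump(mapping, f)

with open(path, 'rb') as f:
    loaded = pickle.load(f)

# Heuristic 1 lookup: translate if known, otherwise copy (heuristic 0).
print(loaded.get('gato', 'gato'))   # cat
print(loaded.get('mesa', 'mesa'))   # mesa
```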


VP007-py commented on June 5, 2024

Okay! Will try that right now.

Finally, is it possible to run subword-based NMT with this?


lvapeab commented on June 5, 2024

If the files set in the config (nmt-keras/config.py, lines 16 to 18 at 3f97677):

TEXT_FILES = {'train': 'training.',  # Data files.
              'val': 'dev.',
              'test': 'test.'}

have already been processed by BPE, you don't want to set TOKENIZATION_METHOD = 'tokenize_bpe', because it would apply the segmentation twice. In that case you should set TOKENIZATION_METHOD = 'tokenize_none'.
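So, for corpora segmented offline, the relevant lines in config.py would look like this (a sketch; all other options omitted):

```python
# config.py (excerpt): data files were already segmented with BPE offline,
# so no tokenization is applied at training time.
TEXT_FILES = {'train': 'training.',   # e.g. training.en / training.de
              'val': 'dev.',
              'test': 'test.'}
TOKENIZATION_METHOD = 'tokenize_none'  # do NOT re-apply BPE
```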

Maybe an update about this script in the README?

Yes, feel free to open a PR describing how you did this. I can review it.


VP007-py commented on June 5, 2024

For heuristic 1, how is the alignment calculated in this toolkit?


lvapeab commented on June 5, 2024

Yes, they are compatible. But if you use subwords, the unk problem is unlikely to happen (at least with the Latin writing system or similar ones), because if the segmenter (say, BPE) finds an unknown word, it will segment it into known subwords. The extreme case of this is characters (e.g. Word -> W@@ o@@ r@@ d). So you end up with effectively no unknown words.

If you want to prevent this behavior, you can specify words that shouldn't be broken up. In subword-nmt you can do this with the --glossaries option.

Finally, note that a (standard) NMT system doesn't consider these linguistic features. It only models sequences. The elements of the sequence are encoded as indices, independently of their linguistic meaning (words, chars, or subwords).

P.S.: When using subwords you may still encounter unknown words: an unseen character would still be treated as an unknown word.
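A toy illustration of why subwords remove most unks: segment a word greedily against a known subword inventory, falling back to single characters, with subword-nmt's @@ continuation marker. This is a simplified stand-in for the merge-based BPE algorithm, not the real thing:

```python
def segment(word, vocab):
    """Greedy longest-match segmentation; '@@' marks non-final pieces,
    mimicking subword-nmt's output format. Single characters always
    succeed, so every word gets segmented into 'known' pieces."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab or end - start == 1:
                pieces.append(word[start:end])
                start = end
                break
    return ' '.join([p + '@@' for p in pieces[:-1]] + [pieces[-1]])

print(segment('Word', {'Wo', 'rd'}))  # Wo@@ rd
print(segment('Xyz', set()))          # X@@ y@@ z
```

With an empty inventory the segmentation degrades all the way to characters, which is the extreme case mentioned above.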


VP007-py commented on June 5, 2024

Hey,
After learning BPE and reapplying it with the vocabulary filter from subword-nmt, I'm not sure about:

  • how to incorporate it into the configuration for the train/validation/test files.

It's a bit ambiguous what BPE_CODES_PATH = DATA_ROOT_PATH + '/training_codes.joint' should point to.

Assume the files train.BPE.L1 and train.BPE.L2 are obtained with subword-nmt from train.L1 and train.L2.


lvapeab commented on June 5, 2024

Currently, only joint BPE is supported (see its section in subword-nmt). This generates a single BPE codes file, which should be set as BPE_CODES_PATH. If you want to use this file to segment your sentences, you should also set TOKENIZATION_METHOD = 'tokenize_bpe'.

In addition, if these are your first steps using subword techniques, I recommend making them explicit, so they are not obscured by other processes. I would:

  1. Learn a BPE from the training data.
  2. Apply it to my training data (source and target).
  3. Apply it to the source validation/test data, leaving the target as it is. This is because we don't want to evaluate MT quality on segmented data.
  4. In config.py, set the detokenization options to revert the BPE tokenization (DETOKENIZATION_METHOD = 'detokenize_bpe' and APPLY_DETOKENIZATION = True). The defaults are (nmt-keras/config.py, lines 89 to 91 at 3f97677):

     DETOKENIZATION_METHOD = 'detokenize_none' # Select which de-tokenization method we'll apply.
     APPLY_DETOKENIZATION = False # Whether we apply a detokenization method.
  5. Train and evaluate as usual. So, you'll train with BPE but you'll generate and evaluate sentences without the segmentation.

You can check how to do the first 3 steps in this script and you can find config examples under the examples directory.
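Detokenizing BPE output amounts to removing the "@@ " continuation markers, as in subword-nmt's own postprocessing. A minimal version of what a detokenize_bpe step does (a sketch; the toolkit's implementation may differ):

```python
import re

def detokenize_bpe(text):
    """Join BPE pieces back into words by stripping '@@' markers,
    whether followed by a space or at the end of the line."""
    return re.sub(r'@@( |$)', '', text)

print(detokenize_bpe('Wo@@ rd emb@@ edd@@ ings'))  # Word embeddings
```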


VP007-py commented on June 5, 2024

Thanks! Following the above steps, I can obtain train.L1 and train.L2, as well as dev.L1 and test.L1, all BPE-processed.

So while training, TOKENIZATION_METHOD = 'tokenize_bpe' should be set, and while decoding, both DETOKENIZATION_METHOD = 'detokenize_bpe' and APPLY_DETOKENIZATION = True must be enabled in addition to the above?

Maybe an update about this script in the README?

