
Comments (9)

lvapeab commented on June 5, 2024

Not exactly that approach, but a similar one: see Sec. 3.3 of this paper. An unknown (target) word is replaced using alignment information. To do that, we assume that the attention mechanism acts as an alignment model. So, when we generate an unknown word (let's call it unk), we select the source word (let's call it src_candidate) with the highest attention weight. Then we apply one of these heuristics to replace it:

  • 0: Replace the unknown word with the aligned source word (unk -> src_candidate).
  • 1: Replace the unknown word with the translation of the aligned source word (unk -> translation(src_candidate)). The translation here is given by a statistical dictionary (e.g. from fast_align).
  • 2: Apply heuristic 1 if src_candidate starts with a lowercase letter; otherwise, apply heuristic 0. The rationale is that proper nouns (which start with a capital letter) should appear as-is in the translation.
  • How does heuristic 2 handle languages other than English, i.e., with respect to lowercasing?

All heuristics are language agnostic. If the source language has no casing information, heuristic 2 falls back to heuristic 1.
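The three heuristics can be sketched as follows. This is an illustration of the idea, not the toolkit's actual implementation; the names (replace_unk, translation_dict) are mine:

```python
def replace_unk(src_words, attention_weights, translation_dict, heuristic=2):
    """Pick the most-attended source word and replace the generated unk."""
    # Treat attention as a soft alignment: the source position with the
    # highest weight is assumed to be aligned to the unknown target word.
    idx = max(range(len(attention_weights)), key=lambda i: attention_weights[i])
    src_candidate = src_words[idx]
    if heuristic == 0:
        return src_candidate                                # copy source word
    translated = translation_dict.get(src_candidate, src_candidate)
    if heuristic == 1:
        return translated                                   # dictionary translation
    # Heuristic 2: copy words starting with a capital (likely proper nouns).
    # Caseless scripts never satisfy isupper(), so they fall back to heuristic 1.
    return src_candidate if src_candidate[:1].isupper() else translated
```

For example, with attention peaking on 'chat' and {'chat': 'cat'} as the dictionary, heuristic 1 yields 'cat', while heuristic 2 would copy a capitalized word such as 'Paris' verbatim.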

  • What happens if POS_UNK is set to False?

Then your model can generate unknown words (see the discussion in the papers above).


An alternative to all these tricks is to use subwords instead of words. This is standard practice in NMT and I recommend doing that.

from nmt-keras.

lvapeab commented on June 5, 2024

You can use the utils/build_mapping_file.sh script to obtain it. You'll need to install fast_align and change the path to the executable in that script. It will create a .pkl file containing the alignments. You then need to set the MAPPING variable in the config file to point to this file:

MAPPING = DATA_ROOT_PATH + '/mapping.%s_%s.pkl' % (SRC_LAN, TRG_LAN)
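For illustration, the mapping file can be read back with pickle. I'm assuming here that it stores a plain dict from source words to target words; the exact structure produced by build_mapping_file.sh may differ:

```python
import os
import pickle
import tempfile

# Toy stand-in for the fast_align-derived statistical dictionary.
mapping = {'gato': 'cat', 'perro': 'dog'}

path = os.path.join(tempfile.gettempdir(), 'mapping.es_en.pkl')
with open(path, 'wb') as f:
    pickle.dump(mapping, f)

with open(path, 'rb') as f:
    loaded = pickle.load(f)

# Heuristic 1 lookup: translate if known, otherwise copy (heuristic 0).
print(loaded.get('gato', 'gato'))   # cat
print(loaded.get('mesa', 'mesa'))   # mesa
```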


VP007-py commented on June 5, 2024

Okay! Will try that right now.

Finally, is it possible to run subword-based NMT with this?


lvapeab commented on June 5, 2024

If the files set in the config (nmt-keras/config.py, lines 16 to 18 at 3f97677):

TEXT_FILES = {'train': 'training.',  # Data files.
              'val': 'dev.',
              'test': 'test.'}

have already been processed by BPE, you don't want to set TOKENIZATION_METHOD = 'tokenize_bpe', because it would apply the segmentation twice. In that case you should set TOKENIZATION_METHOD = 'tokenize_none'.
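So, for corpora segmented offline, the relevant lines in config.py would look like this (a sketch; all other options omitted):

```python
# config.py (excerpt): data files were already segmented with BPE offline,
# so no tokenization is applied at training time.
TEXT_FILES = {'train': 'training.',   # e.g. training.en / training.de
              'val': 'dev.',
              'test': 'test.'}
TOKENIZATION_METHOD = 'tokenize_none'  # do NOT re-apply BPE
```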

Maybe an update about this script in the README?

Yes, feel free to open a PR describing how you did this. I can review it.


VP007-py commented on June 5, 2024

For heuristic 1, how is the alignment calculated in this toolkit?


lvapeab commented on June 5, 2024

Yes, they are compatible. But if you use subwords, the unk problem is unlikely to happen (at least with the Latin writing system or similar ones), because if the segmenter (say, BPE) finds an unknown word, it will segment it into known subwords. The extreme case of this is characters (e.g. Word -> W@@ o@@ r@@ d). So you end up with effectively no unknown words.

If you want to prevent this behavior, you can specify words that shouldn't be broken up. In subword-nmt you can do this with the --glossaries option.

Finally, note that a (standard) NMT system doesn't consider these linguistic features. It only models sequences. The elements of the sequence are encoded as indices, independently of their linguistic meaning (words, chars, or subwords).

P.S.: When using subwords you may still encounter unknown words: an unseen character would still be treated as an unknown word.
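A toy illustration of why subwords remove most unks: segment a word greedily against a known subword inventory, falling back to single characters, with subword-nmt's @@ continuation marker. This is a simplified stand-in for the merge-based BPE algorithm, not the real thing:

```python
def segment(word, vocab):
    """Greedy longest-match segmentation; '@@' marks non-final pieces,
    mimicking subword-nmt's output format. Single characters always
    succeed, so every word gets segmented into 'known' pieces."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab or end - start == 1:
                pieces.append(word[start:end])
                start = end
                break
    return ' '.join([p + '@@' for p in pieces[:-1]] + [pieces[-1]])

print(segment('Word', {'Wo', 'rd'}))  # Wo@@ rd
print(segment('Xyz', set()))          # X@@ y@@ z
```

With an empty inventory the segmentation degrades all the way to characters, which is the extreme case mentioned above.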


VP007-py commented on June 5, 2024

Hey,
After learning BPE and reapplying it with the vocabulary filter from subword-nmt, I'm not sure about:

  • how to incorporate it into the configuration for the train/validation/test files.

It's a bit ambiguous what BPE_CODES_PATH = DATA_ROOT_PATH + '/training_codes.joint' should point to.

Assume the files train.BPE.L1 and train.BPE.L2 are obtained with subword-nmt from train.L1 and train.L2.


lvapeab commented on June 5, 2024

Currently, only joint BPE is supported (see its section in subword-nmt). This generates a single BPE codes file, which should be set as BPE_CODES_PATH. If you want to use this file to segment your sentences, you should also set TOKENIZATION_METHOD = 'tokenize_bpe'.

In addition, if these are your first steps using subword techniques, I recommend making them explicit, so they are not obscured by other processes. I would:

  1. Learn a BPE from the training data.
  2. Apply it to my training data (source and target).
  3. Apply it to the source validation/test data, leaving the target as it is. This is because we don't want to evaluate MT quality on segmented data.
  4. In config.py, set the detokenization options to revert the BPE tokenization (DETOKENIZATION_METHOD = 'detokenize_bpe' and APPLY_DETOKENIZATION = True). The defaults are (nmt-keras/config.py, lines 89 to 91 at 3f97677):

     DETOKENIZATION_METHOD = 'detokenize_none' # Select which de-tokenization method we'll apply.
     APPLY_DETOKENIZATION = False # Whether we apply a detokenization method.
  5. Train and evaluate as usual. So, you'll train with BPE but you'll generate and evaluate sentences without the segmentation.

You can check how to do the first 3 steps in this script and you can find config examples under the examples directory.
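Detokenizing BPE output amounts to removing the "@@ " continuation markers, as in subword-nmt's own postprocessing. A minimal version of what a detokenize_bpe step does (a sketch; the toolkit's implementation may differ):

```python
import re

def detokenize_bpe(text):
    """Join BPE pieces back into words by stripping '@@' markers,
    whether followed by a space or at the end of the line."""
    return re.sub(r'@@( |$)', '', text)

print(detokenize_bpe('Wo@@ rd emb@@ edd@@ ings'))  # Word embeddings
```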


VP007-py commented on June 5, 2024

Thanks! Following the above steps, I can obtain train.L1 and train.L2, as well as dev.L1 and test.L1, all BPE-processed.

So while training, TOKENIZATION_METHOD = 'tokenize_bpe' should be set, and while decoding, both DETOKENIZATION_METHOD = 'detokenize_bpe' and APPLY_DETOKENIZATION = True must be enabled in addition to the above?

Maybe an update about this script in the README?

