edinburghnlp / nematus
Open-Source Neural Machine Translation in TensorFlow
License: BSD 3-Clause "New" or "Revised" License
When training has been interrupted and is later resumed, the number of iterations is displayed correctly; however, the epoch number is lost and does not seem to increase later during training.
When I use translate.py the results are quite weird. If I train my model for a low number of iterations I end up with blank translations (each sentence from the test set is translated to eos right away). If I train it for a high number of iterations I end up with sentences that contain a single word with repetitions.
The validation examples during training look OK (--sampleFreq param).
Do you have any thoughts?
I translated from stdin to stdout. This garbage appeared in my stdout along with the translated text.
['nvcc', '-shared', '-O3', '-m64', '-Xcompiler', '-DCUDA_NDARRAY_CUH=c72d035fdf91890f3b36710688069b2e,-DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION,-fPIC,-fvisibility=hidden', '-Xlinker', '-rpath,/home/heafield/.theano/compiledir_Linux-4.4--generic-x86_64-with-Ubuntu-16.04-xenial-x86_64-2.7.12-64/cuda_ndarray', '-I/usr/local/lib/python2.7/dist-packages/theano/sandbox/cuda', '-I/usr/local/lib/python2.7/dist-packages/numpy/core/include', '-I/usr/include/python2.7', '-I/usr/local/lib/python2.7/dist-packages/theano/gof', '-o', '/home/heafield/.theano/compiledir_Linux-4.4--generic-x86_64-with-Ubuntu-16.04-xenial-x86_64-2.7.12-64/cuda_ndarray/cuda_ndarray.so', 'mod.cu', '-L/usr/lib', '-lcublas', '-lpython2.7', '-lcudart']
ESC]0;IPython: ro-en/docs^G
(Note the garbage has been postprocessed)
Ran this (which I assume is a slightly out of date model):
THEANO_FLAGS=mode=FAST_RUN,floatX=float32,device=gpu,on_unused_input=warn python /fs/magni0/heafield/ro-en/nematus/nematus/translate.py -m /mnt/baldur0/rsennrich/wmt16_neural/ro-en/exp6/model.npz -k 12 -n -p 1 --suppress-unk
I get this error message:
Process Process-1:
Traceback (most recent call last):
File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/fs/magni0/heafield/ro-en/nematus/nematus/translate.py", line 42, in translate_model
f_init, f_next = build_sampler(tparams, option, use_noise, trng, return_alignment=return_alignment)
File "/mnt/magni0/heafield/ro-en/nematus/nematus/nmt.py", line 393, in build_sampler
x, ctx = build_encoder(tparams, options, trng, use_noise, x_mask=None, sampling=True)
File "/mnt/magni0/heafield/ro-en/nematus/nematus/nmt.py", line 229, in build_encoder
truncate_gradient=options['encoder_truncate_gradient'],
KeyError: 'encoder_truncate_gradient'
Note that this config does not mention encoder_truncate_gradient at all.
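This kind of KeyError usually means the loaded model was saved before the option existed. Nematus backfills defaults for older models in compat.py's fill_options; a minimal sketch of that pattern (the option names come from the traceback; the default of -1, i.e. no gradient truncation, is an assumption on my part):

def fill_options(options):
    # backfill options that models trained with older code do not have
    options.setdefault('encoder_truncate_gradient', -1)
    options.setdefault('decoder_truncate_gradient', -1)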
Hi there,
I've been testing score.py as in master commit b5469b4 with the following script.
#!/bin/sh
# theano device, in case you do not want to compute on gpu, change it to cpu
# device=gpu
device=cpu
# path to nematus ( https://www.github.com/rsennrich/nematus )
nematus=~/Research/Resources/nematus
## Path to the directory to save corpus data
DATA=..
# path to source files
ST=$DATA/alignments/sentence/mbitexts/word/en_ceb
# SL
SL=en
# TL
TL=es
# path to the target files
TT=$DATA/alignments/sentence/mbitexts/word/es
# path to the output directory
OUTDIR=$DATA/alignments/sentence/nmt_cbe_output
# mkdir OUTDIR
mkdir -p $OUTDIR
## model
MODEL=~/CORPORA/nmt-cristina/model_L1L2w_v80k.npz
for i in $ST/*.txt
do
echo ${i##*/}
THEANO_FLAGS=mode=FAST_RUN,floatX=float32,device=$device,on_unused_input=warn python $nematus/nematus/score.py \
-b 80 \
-v \
-m $MODEL \
-s $i \
-t $TT/${i##*/} \
-o $OUTDIR/${i##*/}
done
And I got an error whose traceback is as follows:
Traceback (most recent call last):
File "/Users/jmmmac/Research/Resources/nematus/nematus/score.py", line 132, in <module>
args.output, b=args.b, normalization_alpha=args.n, verbose=args.v, alignweights=args.walign)
File "/Users/jmmmac/Research/Resources/nematus/nematus/score.py", line 106, in main
rescore_model(source_file, nbest_file, saveto, models, options, b, normalization_alpha, verbose, alignweights)
File "/Users/jmmmac/Research/Resources/nematus/nematus/score.py", line 35, in rescore_model
params = load_params(model, param_list)
File "/Users/jmmmac/Research/Resources/nematus/nematus/theano_util.py", line 72, in load_params
new_params[with_prefix+kk] = pp[kk].astype(floatX, copy=False)
TypeError: float() argument must be a string or a number
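One workaround sketch (my own guess, not the Nematus code): the error suggests the .npz contains an entry that is not a numeric array, e.g. pickled metadata, which then fails in .astype(floatX). Skipping non-numeric entries avoids the crash:

import numpy

def load_params_safe(path, param_list, with_prefix='', floatX='float32'):
    # skip entries that are not numeric arrays (dtype kind 'O' = object)
    pp = numpy.load(path)
    new_params = {}
    for kk in param_list:
        if pp[kk].dtype.kind not in 'fiu':
            continue
        new_params[with_prefix + kk] = pp[kk].astype(floatX, copy=False)
    return new_params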
Best!
Hi Rico,
We are trying to use the MRT feature of Nematus but are somehow unable to train it properly. Please find attached the configuration of the model, and please suggest if there is any issue with it.
The trained model is not even able to translate properly (the translation output is only ".").
Hi,
I'm getting some strange IOError exceptions while running Nematus training (both on baldur and meili).
Traceback (most recent call last):
File "train.py", line 60, in
main()
File "train.py", line 56, in main
external_validation_script=WDIR + '/scripts/validate.sh')
File "/fs/meili0/amiceli/nematus-dev/nematus/nmt.py", line 964, in train
numpy.savez(saveto, history_errs=history_errs, uidx=uidx, **params)
File "/mnt/meili0/rsennrich/tools/virtual_environment/local/lib/python2.7/site-packages/numpy/lib/npyio.py", line 574, in savez
_savez(file, args, kwds, False)
File "/mnt/meili0/rsennrich/tools/virtual_environment/local/lib/python2.7/site-packages/numpy/lib/npyio.py", line 642, in _savez
zipf.write(tmpfile, arcname=fname)
File "/usr/lib/python2.7/zipfile.py", line 1184, in write
self.fp.write(buf)
IOError: [Errno 5] Input/output error
Python documentation says that Errno 5 is a generic I/O error.
I've also got some IOErrors of the same kind while reading the corpus in data_iterator.py but I kinda fixed them just by catching the exception and reshuffling and reopening the files.
Any ideas on what is going on?
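A possible mitigation for the savez failures, in the same spirit as the catch-and-reshuffle fix mentioned above (a sketch of my own, not part of Nematus):

import time
import numpy

def savez_with_retry(path, retries=3, delay=5.0, **arrays):
    # retry transient I/O errors (Errno 5) when writing a checkpoint
    for attempt in range(retries):
        try:
            numpy.savez(path, **arrays)
            return
        except IOError:
            if attempt == retries - 1:
                raise
            time.sleep(delay)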
Could you please provide steps for running on CPU?
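For the Theano version used throughout this thread, the device is selected via THEANO_FLAGS, so CPU operation should only require device=cpu (the model path and beam settings below are placeholders):

THEANO_FLAGS=mode=FAST_RUN,floatX=float32,device=cpu python nematus/translate.py -m model.npz -k 12 -n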
Dear nematus community,
It has been a couple of months since I started working on neural machine translation. I have a question about the Nematus code, and I am sorry if it is too basic for you.
At translation time, in the function gen_sample() in nmt.py, the next_w predictions are not single values; in my experiments so far there have been up to 5 of them. Why is that?
So basically, I am getting alignment vector, context vector and next word predictions at translation time under the following loop:
# x is a sequence of word ids followed by 0, eos id
for ii in xrange(maxlen):
But the number of values at each time step in the alignment vector, context vector and next-word vector is more than one, up to 5. Why?
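A likely explanation (my reading of the -k beam-search option, not an authoritative answer): gen_sample keeps up to k live hypotheses per step and returns one prediction, alignment and context entry per live hypothesis. A toy sketch of that bookkeeping:

import numpy

# with beam size k, each step keeps up to k live hypotheses, so the
# "next word" is a vector with one entry per hypothesis
k = 5
vocab = 10
live_scores = numpy.zeros(1)                 # one live hypothesis at t=0
probs = numpy.random.rand(1, vocab)          # (live_k, vocab) next-word probs
probs /= probs.sum(axis=1, keepdims=True)
cand_scores = live_scores[:, None] - numpy.log(probs)
best = cand_scores.flatten().argsort()[:k]   # keep the k best continuations
next_w = best % vocab                        # one next word per hypothesis
print(len(next_w))                           # 5, matching the observation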
Thanks for your time and answer.
Kind Regards,
Hey, if I want to use Nematus for a code2doc task, how do I do it? Can you please guide me through the steps? Thank you so much.
I have already done all the preprocessing as you suggested; now I cannot figure out how to use the vocabulary .json file and the x.train and y.train files.
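A minimal sketch of how those files are typically wired together in the TensorFlow Nematus (flag names as I remember them; check nmt.py --help; x.train and y.train are kept from the question, the vocabulary and model names are placeholders):

python nematus/nmt.py \
    --source_dataset x.train \
    --target_dataset y.train \
    --dictionaries vocab.src.json vocab.tgt.json \
    --saveto model/model.npz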
Hi Rico,
Thank you for Nematus, it is such a great tool. We've run into a slight problem while using reranking with the -w option to get the attention matrix, e.g.:
THEANO_FLAGS=mode=FAST_RUN,floatX=float32,device=$device,on_unused_input=warn,lib.cnmem=0.8 python $nematus/nematus/rescore.py \
-m model.r2l.npz \
-s preprocessed.tok \
-i reversed.tok \
-o rescored.tok \
-b 80 -n -w
File "/home/current_nematus/nematus/nematus/rescore.py", line 97, in <module>
main(source_file, nbest_file, output_file, rescorer_settings)
File "/home/current_nematus/nematus/nematus/rescore.py", line 88, in main
rescore_model(source_file, nbest_file, output_file, rescorer_settings, options)
File "/home/current_nematus/nematus/nematus/rescore.py", line 77, in rescore_model
align_OUT.write(line + "\n")
TypeError: can only concatenate list (not "str") to list
As you can see, the variable line is a list, so I made a simple modification to the code: I added a type check on the line before the one where the error occurs (line 76 in the current rescore.py), changing:
if rescorer_settings.alignweights:
    for line in alignments:
        align_OUT.write(line + "\n")
to:
if rescorer_settings.alignweights:
    for line in alignments:
        if type(line) == list:
            for l in line:
                align_OUT.write(l + "\n")
        else:
            align_OUT.write(line + "\n")
I'm not sure whether this is the correct way to handle it, but it works fine for us; I hope it helps if somebody else comes across this issue. If there is a better way to solve this, please let me know if I can be of any help.
Thanks, Josef.
Just pulled the latest version and got:
nmt.py: error: unrecognized arguments: --alpha_c 0.0
Yet it is still shown in the README documentation. Was it renamed?
I train the model on a K40 with GPUs 0-4, but the rate is about 40 sents/s per card. That rate is very low compared to training with a single card.
batch_size=128
maxlen=50
Hi,
I am training a translation model on a parallel corpus of around 15 million sentence pairs. After around 15 epochs, my training speed went down from around 70 sents/s to around 20 sents/s.
I use a Tesla K40 and cuDNN 5.1.
I have checked CPU and GPU usage: the GPU is allocated, and only 1 CPU is allocated.
What might be the problem?
Thanks,
Hello, I would like to collaborate on the Spanish translation; is it done already?
The server backend paste causes multithreading issues in Docker containers. It should be replaced with a more modern, Docker-compatible backend.
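A hedged sketch of one possible replacement, assuming the server ultimately exposes a standard WSGI application (an assumption; I have not checked how nematus/server.py is wired internally): serve it with waitress, which is threaded and behaves well in containers. The app below is a placeholder, not the Nematus application object:

def app(environ, start_response):
    # placeholder WSGI app standing in for the Nematus server application
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'ok']

if __name__ == '__main__':
    from waitress import serve
    serve(app, host='0.0.0.0', port=8080)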
Hi Nematus Team,
Apologies in advance if this is a known issue or if I am misunderstanding something.
For reference, our hardware is an Intel Xeon 1620v3 and a GeForce 1070.
We are able to use nematus quite easily under Debian 8.7, Theano 0.8.2, running:
THEANO_FLAGS=mode=FAST_RUN,floatX=float32,device=gpu ./test_train.sh
and
THEANO_FLAGS=mode=FAST_RUN,floatX=float32,device=gpu ./test_translate.sh
When moving to Theano 0.9.0dev5.dev, this command:
THEANO_FLAGS=mode=FAST_RUN,floatX=float32,device=gpu ./test_train.sh
works fine but results in a message about a deprecated device interface:
WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release. Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29
So we use this instead:
THEANO_FLAGS=mode=FAST_RUN,floatX=float32,device=cuda ./test_train.sh
So far, so good. We see speed improvement from 147 sentences/sec to 204 sentences/sec with device=cuda instead of device=gpu.
However, when we run:
THEANO_FLAGS=mode=FAST_RUN,floatX=float32,device=cuda ./test_translate.sh
we receive the following output:
Translating ../../en-de/in ...
Using cuDNN version 5105 on context None
Mapped name None to device cuda0: GeForce GTX 1070 (0000:02:00.0)
Building f_init... Done
Building f_next.. Done
Process Process-1:
Traceback (most recent call last):
File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/home/cmc/nmt/nematus/nematus/translate.py", line 72, in translate_model
seq = _translate(x)
File "/home/cmc/nmt/nematus/nematus/translate.py", line 52, in _translate
suppress_unk=suppress_unk, return_hyp_graph=return_hyp_graph)
File "/home/cmc/nmt/nematus/nematus/nmt.py", line 489, in gen_sample
ret = f_init[i](x)
File "/usr/local/lib/python2.7/dist-packages/theano/compile/function_module.py", line 886, in __call__
storage_map=getattr(self.fn, 'storage_map', None))
File "/usr/local/lib/python2.7/dist-packages/theano/gof/link.py", line 325, in raise_with_op
reraise(exc_type, exc_value, exc_trace)
File "/usr/local/lib/python2.7/dist-packages/theano/compile/function_module.py", line 873, in __call__
self.fn() if output_subset is None else\
RuntimeError: Invalid value or operation
Apply node that caused the error: GpuAdvancedSubtensor1(Wemb, GpuReshape{1}.0)
Toposort index: 73
Inputs types: [GpuArrayType<None>(float32, (False, False)), GpuArrayType<None>(int64, (False,))]
Inputs shapes: [(85000, 500), (10,)]
Inputs strides: [(2000, 4), (-8,)]
Inputs values: ['not shown', 'not shown']
Outputs clients: [[GpuIncSubtensor{InplaceSet;::, int64:int64:}(GpuAlloc<None>{memset_0=True}.0, GpuAdvancedSubtensor1.0, Constant{0}, ScalarFromTensor.0)]]
But, it is still possible to run it with device=gpu (albeit with the deprecation warning message).
We appreciate any advice you can give us!
In the translate_single.sh script, when I set the number of processes (-p) to 2 or more, I get the following output.
$model_dir/preprocess.sh | \
THEANO_FLAGS=mode=FAST_RUN,floatX=float32,device=$device python $nematus_home/nematus/translate.py \
-m $model_dir/model.l2r.ens1.npz --suppress-unk \
-k 5 -n -p 2 | \
$model_dir/postprocess.sh
Output:
Detokenizer Version $Revision: 4134 $
Language: en
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.138 seconds.
Prefix dict has been built succesfully.
Using cuDNN version 6021 on context None
Mapped name None to device cuda: GeForce GTX TITAN X (0000:02:00.0)
Using cuDNN version 6021 on context None
Mapped name None to device cuda: GeForce GTX TITAN X (0000:02:00.0)
INFO: Waiting for existing lock by process '14569' (I am process '14570')
INFO: To manually release the lock, delete /home/himanshu/.theano/compiledir_Linux-4.4--generic-x86_64-with-debian-jessie-sid-x86_64-2.7.15-64/lock_dir
INFO: Waiting for existing lock by process '14570' (I am process '14569')
INFO: To manually release the lock, delete /home/himanshu/.theano/compiledir_Linux-4.4--generic-x86_64-with-debian-jessie-sid-x86_64-2.7.15-64/lock_dir
INFO: Waiting for existing lock by process '14570' (I am process '14569')
INFO: To manually release the lock, delete /home/himanshu/.theano/compiledir_Linux-4.4--generic-x86_64-with-debian-jessie-sid-x86_64-2.7.15-64/lock_dir
INFO: Waiting for existing lock by process '14570' (I am process '14569')
INFO: To manually release the lock, delete /home/himanshu/.theano/compiledir_Linux-4.4--generic-x86_64-with-debian-jessie-sid-x86_64-2.7.15-64/lock_dir
And these two processes keep repeating these messages. It looks like a deadlock.
Let me know if I am missing something; also, how can I fix this if it is an issue?
Hi!
I'm running the latest version of Nematus on a machine with Ubuntu 16.04, CUDA 8.0 and Theano 0.9, but when I run test_train.sh the cost of each iteration is 0.0.
The same thing happens when running an actual training on parallel data (IWSLT 2016 en-fr). The objective function is cross-entropy.
I have used previous versions of Nematus and never encountered this problem.
Hi Rico,
Have you considered outputting the original word instead of the UNK symbol? I found that the script copy_unknown_words.py in the utils folder cannot replace the unknown words in target sentences with their aligned words in the source sentences.
Do you have any suggestions?
Hi, can you please tell me what the algorithm of your neural machine translator is, and what steps you have taken to make it so different from other neural machine translators? I request you to please describe the algorithm.
Best,
Bhagat
Hi, I am experimenting with the pre-trained Nematus models for WMT'17, zh-en language pair.
I converted the pre-trained model to Tensorflow using this command: python nematus/theano_tf_convert.py --from_theano --in ../wmt17_systems/zh-en/model.l2r.ens1.npz --out ../wmt17_systems/zh-en/model-tf.l2r.ens1.npz.
And ran a single-model translation using this command:
$model_dir/preprocess.sh | \
CUDA_VISIBLE_DEVICES=0 python $nematus_tf_home/nematus/translate.py \
-m $model_dir/model-tf.l2r.ens1.npz \
-k 12 -n -p 1 | \
$model_dir/postprocess.sh
I observed a quite significant drop in BLEU:
Theano version:
BLEU = 22.84, 56.7/29.1/17.0/10.4 (BP=0.982, ratio=0.982, hyp_len=52856, ref_len=53827)
Tensorflow version:
BLEU = 21.26, 53.8/26.8/15.4/9.2 (BP=1.000, ratio=1.044, hyp_len=56182, ref_len=53827)
By the way, when I ran the conversion code, the following was printed:
Not saving decoder_c_tt because no TF equivalent
The following TF variables were not assigned (excluding Adam vars):
You should see only 'beta1_power', 'beta2_power' and 'time' variable listed
time:0
beta1_power:0
beta2_power:0
I also noticed that --suppress-unk is no longer available when calling translate.py. Anything I've missed here? Thanks :)
There is an error message in the console:
Loading data
Building model
Traceback (most recent call last):
File "../nematus/nmt.py", line 1208, in
train(**vars(args))
File "../nematus/nmt.py", line 795, in train
build_model(tparams, model_options)
File "../nematus/nmt.py", line 237, in build_model
x, ctx = build_encoder(tparams, options, trng, use_noise, x_mask, sampling=False)
File "../nematus/nmt.py", line 194, in build_encoder
profile=profile)
File "/Users/Mr.Wu/wup/nematus/nematus/layers.py", line 171, in gru_layer
strict=True)
File "/Users/Mr.Wu/anaconda/lib/python2.7/site-packages/theano/scan_module/scan.py", line 1041, in scan
scan_outs = local_op(*scan_inputs)
File "/Users/Mr.Wu/anaconda/lib/python2.7/site-packages/theano/gof/op.py", line 611, in call
node = self.make_node(*inputs, **kwargs)
File "/Users/Mr.Wu/anaconda/lib/python2.7/site-packages/theano/scan_module/scan_op.py", line 538, in make_node
inner_sitsot_out.type.dtype))
ValueError: When compiling the inner function of scan the following error has been encountered: The initial state (outputs_info in scan nomenclature) of variable IncSubtensor{Set;:int64:}.0 (argument number 3) has dtype float32, while the result of the inner function (fn) has dtype float64. This can happen if the inner function of scan results in an upcast or downcast.
If you suspect this is an IPython bug, please report it at:
https://github.com/ipython/ipython/issues
or send an email to the mailing list at [email protected]
You can print a more detailed traceback right now with "%tb", or use "%debug"
to interactively debug it.
Extra-detailed tracebacks for bug-reporting purposes can be enabled via:
%config Application.verbose_crash=True
My environment is:
Python 2.7.3
numpy 1.11.3
theano 0.8.2
Could you help me solve it? Thanks.
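One likely cause, judging from the float32/float64 upcast in the error (an inference, not a confirmed fix): Theano's floatX defaults to float64 unless set explicitly, while the model's initial states are float32. Every working command quoted elsewhere in this thread sets it, e.g.:

THEANO_FLAGS=mode=FAST_RUN,floatX=float32,device=cpu python ../nematus/nmt.py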
Hi, I want to know about the following: I ran my training for 350 epochs and got these model files:
model.json
model.npz
model.npz.data-00000-of-00001
model.npz.index
model.npz.json
model.npz.meta
model.npz.progress.json
model.npz-30000.data-00000-of-00001
model.npz-30000.index
model.npz-30000.meta
model.npz-30000.progress.json
Then, from ~/data/nematus-master, I ran the command python nematus/score.py --models model model.npz model.npz.progress model.npz-30000.meta model.npz-30000.progress --source /root/data/nematus-master/data2/test/decldesc_test_bpe --target /root/data/nematus-master/data2/test/bodies_test_bpe --output /root/data/nematus-master/data2/test/output2.txt
and I got an error:
Traceback (most recent call last):
File "nematus/score.py", line 82, in
main(source_file, target_file, output_file, scorer_settings)
File "nematus/score.py", line 68, in main
fill_options(options[-1])
File "/root/data/nematus-master/nematus/compat.py", line 19, in fill_options
first_factor_size = options['n_words_src']
KeyError: u'n_words_src'
Can you please tell me why this error occurs and where I am going wrong?
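A guess at the cause (not a confirmed fix): --models expects the checkpoint prefix, here model.npz, whose options are read from model.npz.json, rather than the individual index/meta/progress files. Something like:

python nematus/score.py --models model.npz --source /root/data/nematus-master/data2/test/decldesc_test_bpe --target /root/data/nematus-master/data2/test/bodies_test_bpe --output /root/data/nematus-master/data2/test/output2.txt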
Traceback (most recent call last):
File "/fs/meili0/amiceli/nematus-crelu/nematus/score.py", line 132, in
args.output, b=args.b, normalization_alpha=args.n, verbose=args.v, alignweights=args.walign)
File "/fs/meili0/amiceli/nematus-crelu/nematus/score.py", line 106, in main
rescore_model(source_file, nbest_file, saveto, models, options, b, normalization_alpha, verbose, alignweights)
File "/fs/meili0/amiceli/nematus-crelu/nematus/score.py", line 91, in rescore_model
for line in all_alignments:
NameError: global name 'all_alignments' is not defined
Hi,
We are looking into training speed using the test_train.sh script. Comparing our numbers to the benchmarks currently reported in the README, our "words/s" numbers are in the range of the reported "sentences/s". So either our training is really slow, or the benchmark numbers are actually words per second.
Comparing a commit from November (when the benchmarks were added) to the current version supports the latter hypothesis. Could you confirm and adapt the benchmark numbers?
Thank you!
Hi all,
There are 2 potential conditions that can raise a list index out of range exception (see below): one in nmt.py (condition 1) and one in data_iterator.py (condition 2). As far as I can tell, both are triggered by malformed training data, such as an empty line or a stray "|" (factor separator) symbol; I used a try/except to capture such exceptions.
The exception is like:
Traceback (most recent call last):
File "../../nematus/nmt_worker.py", line 666, in <module>
update_algorithm=args.update_algorithm
File "../../nematus/nmt_worker.py", line 443, in train_on_multi_gpu
epoch, x, x_mask, y, y_mask = get_one_mini_batch(train_it)
File "../../nematus/nmt_worker.py", line 422, in get_one_mini_batch
n_words=n_words)
File "/nematus/nmt.py", line 68, in prepare_data
n_factors = len(seqs_x[0][0])
IndexError: list index out of range
If you suspect this is an IPython bug, please report it at:
https://github.com/ipython/ipython/issues
or send an email to the mailing list at [email protected]
You can print a more detailed traceback right now with "%tb", or use "%debug"
to interactively debug it.
Extra-detailed tracebacks for bug-reporting purposes can be enabled via:
%config Application.verbose_crash=True
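A defensive sketch of such a guard for condition 1 (my own, matching the try/except idea above; not the committed Nematus code):

def prepare_batch_safe(seqs_x, seqs_y):
    # drop pairs where either side is empty instead of crashing on
    # seqs_x[0][0] with an IndexError
    pairs = [(sx, sy) for sx, sy in zip(seqs_x, seqs_y) if sx and sy]
    if not pairs:
        return None  # signal the caller to skip this mini-batch
    seqs_x, seqs_y = zip(*pairs)
    n_factors = len(seqs_x[0][0])
    return seqs_x, seqs_y, n_factors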
Hi, I think the translate.py script (at line 111) should print something more meaningful when both the .pkl and .json files are missing. I'm currently missing both files; I think it's because saveFreq was not set properly.
Line: https://github.com/rsennrich/nematus/blob/master/nematus/nmt.py#L958
sample_score += next_p[0][0, nw]
Should be
sample_score -= np.log(next_p[0][0, nw])
In this branch, I removed all hardcoded references to float32 and I tried to train with float16, but it does not work:
Using cuDNN version 5105 on context None
Mapped name None to device cuda0: TITAN X (Pascal) (0000:02:00.0)
Loading data
Building model
Building sampler
Building f_init... Done
Building f_next.. Done
Building f_log_probs... Done
Computing gradient... Done
Building optimizers...Disabling C code for Elemwise{Cast{float32}} due to unsupported float16
Done
Total compilation time: 198.4s
Optimization
Seen 846 samples
NaN detected
I've also tried increasing the epsilon in the Adam optimizer, but it doesn't solve the issue.
Hi,
I was reading your paper "A Parallel Corpus of Python Functions and Documentation Strings for Automated Code Documentation and Code Generation", where you say that you trained "Neural Machine Translation (NMT) models in both directions using Nematus".
My question is how to preprocess the dataset, since it is a parallel corpus of Python functions and docstrings.
For example, when we use a parallel corpus of plain text, such as eng-deu, we use word embeddings; what do we use in this case?
Thank you so much
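One common recipe (my suggestion, not a pipeline prescribed by the paper): treat functions and docstrings as plain token sequences, learn a shared subword vocabulary with BPE, apply it to both sides, and build the Nematus dictionaries from the BPE output. The file names below are placeholders, and the path of build_dictionary.py may differ in your checkout:

cat train.code train.docstring | subword-nmt learn-bpe -s 10000 > bpe.codes
subword-nmt apply-bpe -c bpe.codes < train.code > train.code.bpe
subword-nmt apply-bpe -c bpe.codes < train.docstring > train.docstring.bpe
python nematus/data/build_dictionary.py train.code.bpe train.docstring.bpe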
Ideally a container with a pre-trained model in it could be available so that we can easily try the system without having to run numerous setup and training steps manually.
I have been trying to get Nematus to work using the WMT16 model, but I think there is something wrong in the code. I tried to fix it, but no matter what I guessed I should change, it keeps crashing in different places. Here's the root problem:
Warning: No built-in rules for language de.
Detokenizer Version $Revision: 4134 $
Language: de
Tokenizer Version 1.1
Language: en
Number of threads: 1
Translating <stdin> ...
Using cuDNN version 5110 on context None
Mapped name None to device cuda: GeForce GTX 960M (0000:01:00.0)
Process Process-1:
Traceback (most recent call last):
File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/media/data/translation/nematus/nematus/translate.py", line 54, in translate_model
f_init, f_next = build_sampler(tparams, option, use_noise, trng, return_alignment=return_alignment)
File "/media/data/translation/nematus/nematus/nmt.py", line 359, in build_sampler
x, ctx = build_encoder(tparams, options, trng, use_noise, x_mask=None, sampling=True)
File "/media/data/translation/nematus/nematus/nmt.py", line 186, in build_encoder
emb = get_layer_constr('embedding')(tparams, x, suffix='', factors= options['factors'])
TypeError: param_init_embedding_layer() missing 2 required positional arguments: 'n_words' and 'dims'
Error: translate worker process 10737 crashed with exitcode 1
If you look at layers.py, line 76, the arguments do seem to be required. But where should n_words and dims come from?
Hi,
I met this problem during decoding: there is one sentence at line 146 of WMT newstest13.fr,
"Soins palliatifs - La meilleure façon de mourir . . . | Le Devoir"
and I suspect the " | " in it is being read as the factor separator. Could you confirm that? Thanks.
(See line 388 in c4adba6.)
After the 9b1ebb5 merge, on a fresh install, nematus/server.py crashes, since these two lines (lines 15 to 16 in 5727727) refer to the now-deleted server directory. I've figured out a way to revert that using PyCharm (I'm not enough of a git wizard to do it "by hand", I guess 😄) that preserves the files' git history (verified with git log --full-history; my test commit is the only one I see via the usual git log nematus/server), and I can submit it as a PR if you'd like.
Also, 🎉 for TF and Python 3 compatibility!
I just wanted to upload a Python 3.6 version of the build_dictionary.py file for anyone who would like to use it.
I used this Stack Overflow suggestion as the reasoning behind my changes:
https://stackoverflow.com/questions/39284842/order-dictionary-index-in-python
#!/usr/bin/python
import numpy
import json
import sys
from collections import OrderedDict

def main():
    for filename in sys.argv[1:]:
        print('Processing', filename)
        word_freqs = OrderedDict()
        with open(filename, 'r') as f:
            for line in f:
                words_in = line.strip().split(' ')
                for w in words_in:
                    if w not in word_freqs:
                        word_freqs[w] = 0
                    word_freqs[w] += 1
        words = list(word_freqs.keys())
        freqs = list(word_freqs.values())
        sorted_idx = numpy.argsort(freqs)
        sorted_words = [words[ii] for ii in sorted_idx[::-1]]
        worddict = OrderedDict()
        worddict['eos'] = 0
        worddict['UNK'] = 1
        for ii, ww in enumerate(sorted_words):
            worddict[ww] = ii + 2
        with open('%s.json' % filename, 'w', encoding="utf-8") as f:
            json.dump(worddict, f, indent=2, ensure_ascii=False)
        print('Done')

if __name__ == '__main__':
    main()
INFO: Validation loss (AVG/SUM/N_SENT): 212.140906401 93129.8579102 439
2018-06-30 02:08:31.421233: W tensorflow/core/framework/op_kernel.cc:1318] OP_REQUIRES failed at save_restore_v2_ops.cc:109 : Not found: ; No such file or directory
Traceback (most recent call last):
File "nematus/nmt.py", line 692, in
train(config, sess)
File "nematus/nmt.py", line 313, in train
saver.save(sess, save_path=config.saveto)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1720, in save
raise exc
ValueError: Parent directory of model doesn't exist, can't save.
Can you please tell me why I am getting this error when running the Nematus model for a code generation task?
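The ValueError suggests TensorFlow cannot resolve the parent directory of the saveto path. Two things worth trying (inferences from the message, not confirmed fixes): give --saveto an explicit directory component (e.g. ./model instead of a bare model), and create that directory before training:

import os

saveto = 'models/model'                      # placeholder for your --saveto value
save_dir = os.path.dirname(saveto)
if save_dir and not os.path.isdir(save_dir):
    os.makedirs(save_dir)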
python $nematus/translate.py \
    -m $prefix.dev.npz \
    -i $file_base.$src -o $file_base.$src.output.dev -k 1 -n -p 5 --suppress-unk --print-word-probabilities
results in something like:
ein Kampf der Republikaner gegen die Wiederwahl Obamas
1.98620128632 0.375202327967 0.935490012169 0.990142166615 5.79434633255 0.540984451771 0.961822271347 1.74049687386 0.97704654932
Any idea why this happens? Is it related to the length normalization?
Have you implemented, or do you plan to implement, character-based translation?
In domain_iterator.py and domain_interpolation_data_iterator.py there is this fragment of code:
if len(ss) > self.maxlen and len(tt) > self.maxlen:
    continue
which skips a sentence pair only if both the source sentence and the target sentence exceed the maximum allowed length. Is this behavior correct? Shouldn't we skip the pair if either of the sentences exceeds the maximum length?
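If the intent is indeed to discard a pair when either side is too long, the check would become (a suggested change, not the committed fix):

if len(ss) > self.maxlen or len(tt) > self.maxlen:
    continue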
Hi,
when using the WMT de-en model to initialize a training run on a single-sentence corpus, I get very incorrect cross-entropy results. I am using only default settings, no dropout or anything. I verified that the model is being loaded correctly. The values in the final layer, before it goes into the cross-entropy, are also correct.
For instance for the pair
das ist ein Test .
this is a test .
The cost should be around 0.61868, as verified by scoring with Amun. Nematus in its current version from master produces 102 for the first forward step. When repeating this sentence 20 times and increasing the batch size accordingly, the cost becomes 111.15, which makes no sense, as it should at least average to the same cost as for the single sentence.
After inspecting the cost vector manually, it seems only the first value is being calculated correctly. Can anyone confirm this? Or is something wrong with my setup?
File "/nematus/nematus/nmt.py", line 609, in build_sampler
if options['deep_fusion_lm']:
KeyError: 'deep_fusion_lm'
Does anyone know how to solve the problem? Thank you:)
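This looks like the same class of error as the encoder_truncate_gradient issue above: the loaded model was saved before the option existed. A one-line guard in the fill_options style (the None default is an assumption on my part):

options.setdefault('deep_fusion_lm', None)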
Hi,
I am wondering why the last layer is a full softmax and not an approximated version such as hierarchical softmax or noise-contrastive estimation.
Maybe the improvement in time performance wouldn't be significant?
Thanks,
Mattia
For the code generation task, you mention using Nematus for NMT with the declaration and docstring as input and the body as output; how, then, is the code generated?
In translation, say English to German, we know that "hi" means "hallo", but here how does the translation actually work, and how has the tokenization been done?
I am new to code generation, so I need some advice on implementing this model; I have read how Nematus works, but nothing is described about the translation of declaration+docstring to body.
I will be very thankful if you can guide me on how to use Nematus for code generation.
Hi,
Are there any plans to add support for multiple GRU layers?
They have proven effective for other seq2seq tasks.
Thanks
Hi folks,
I have trained a translation model with a dataset. After the 510000th iteration, I killed the training and started a new training run with a new dataset, using the last model from iteration 510000. To do this, I created a new models folder and copied model.iter510000.npz as model.npz, and model.iter510000.npz.gradinfo.npz as model.npz.gradinfo.npz. But I forgot to copy model.iter510000.progress.json to my new models folder.
Theoretically, that shouldn't affect the fact that I am continuing training from the 510000th iteration, right? It is just that, since I did not copy the progress.json file, the output of the code looks as if I had started from 0.
At translation time I would like to get the context vector (with soft attention) and the word embeddings for each word wi, along with all the alphas and the hidden states; for example (English to French):
h1 h2 h3 h4 h5 h6
| | | | | |
the boy played -> Le garçon jouait
| | | | | |
x1 x2 x3 x4 x5 x6
if we look at the word "garçon" at translation time, I would like to get the sum alpha1*h1 + alpha2*h2 + alpha3*h3, plus h1, h2, h3 and alpha1, alpha2, alpha3 for this word.
I would also like to get the source word embeddings (x1, x2, x3).
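A small numpy sketch of the quantities described, with made-up shapes (the names are mine, not the Nematus API): for one target word, the context vector is the attention-weighted sum of the encoder annotations:

import numpy

h = numpy.random.rand(3, 1024)        # h1..h3: encoder annotations (dim arbitrary)
alpha = numpy.array([0.1, 0.7, 0.2])  # alpha1..alpha3 for the target word "garçon"
context = (alpha[:, None] * h).sum(axis=0)  # = alpha1*h1 + alpha2*h2 + alpha3*h3
print(context.shape)                  # (1024,)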