
headlines's Introduction

Automatically generate headlines for short articles


This project attempts to reproduce the results in the paper: Generating News Headlines with Recurrent Neural Networks

How to run

Software

Data

It is assumed that you already have training and test data. The data consists of many examples (I'm using 684K); each example is made of the text from the start of the article, which I call the description (or desc), and the text of the original headline (or head). The texts should already be tokenized, with the tokens separated by spaces.

Once you have the data ready, save it in a Python pickle file as a tuple: (heads, descs, keywords), where heads is a list of all the headline strings and descs is a list of all the article strings, in the same order and of the same length (number of elements) as heads. The keywords information is ignored, so you can place None there.
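A minimal sketch of building that pickle file (the path data/tokens.pkl and the toy strings are only placeholders for whatever the vocabulary-embedding notebook is configured to read):

    import pickle

    # parallel lists: heads[i] is the headline for descs[i]; texts are already
    # tokenized, with the tokens separated by spaces
    heads = ['example headline one', 'example headline two']
    descs = ['first tokens of article one ...', 'first tokens of article two ...']
    keywords = None  # ignored by the notebooks

    with open('data/tokens.pkl', 'wb') as fp:
        # protocol 2 keeps the file readable from Python 2, which the notebooks use
        pickle.dump((heads, descs, keywords), fp, protocol=2)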

Build a vocabulary of words

The vocabulary-embedding notebook describes how a dictionary is built for the tokens and how an initial embedding matrix is built from GloVe

Train a model

The train notebook describes how a model is trained on the data using Keras

Use model to generate new headlines

The predict notebook generates headlines with the trained model and shows the attention weights used to pick words from the description. The text generation includes a feature which was not described in the original paper: it allows words that are outside the training vocabulary to be copied from the description into the generated headline.

Examples of headlines generated

Good (cherry-picked) examples of generated headlines

Examples of attention weights

attention weights

headlines's People

Contributors

ashishrana160796, udibr


headlines's Issues

Running OOM in cell 30 when using GPU

First off, thanks for sharing this. This NLP stuff is really cool, but it can be overwhelming, and this project has helped me immensely in understanding it better. Unfortunately, it seems troubleshooting these models and implementations is almost an art in itself.

So I am currently using TensorFlow, and I wrote a scraper to pull data from BuzzFeed to act as my training set, since the Reuters data didn't seem to be enough given that GloVe has a 40k vocabulary. When I was running against the CPU, everything worked fine, but in about 7 hours I only made it through about 4 iterations out of 500. So I have attempted to jump over to using the GPU. I am running on a Mac with an Nvidia 650M and 1 GB of VRAM. I am aware that this isn't the best hardware, but I believe it should still be doable, right?

So when I run the train.py file (converted from the ipynb), I get the OOM error below. I know you have been using Theano, so if you aren't sure, just disregard; however, if you know how I might be able to overcome this issue, I'd love to hear it. It seems to be erroring out around cell 30. What is baffling to me is that it says total memory is 1023 MiB but free is 49 MiB, and the free amount shrinks each time I run. I saw in the TF forums that TensorFlow allocates the whole GPU and that you can't really tell how much is free because it is managed internally... so maybe this is nothing. I was trying to figure out how to flush the GPU memory, but I haven't had any luck with that either. Even after restarting the Mac, I still see about the same thing.

I have tried adjusting the training sample size and some of the other variables to see if I could get it to run through even once, but it still ends up crashing with an OOM error. The next thing I am going to try is getting the TF summarizer working, so perhaps I can get a bit more insight via TensorBoard. I do have a much better GPU on my PC, but I will need to set this all up in a Docker VM or something if I take that approach. However, if it works, I will be happy.

If you have any insight, please send it my way. Thanks again!!

`Using TensorFlow backend.
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.dylib locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.dylib locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.dylib locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.dylib locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.dylib locally
1.0.7
number of examples 49372 49372
dimension of embedding space for words 100
vocabulary size 40000 the last 10 words can be used as place holders for unknown/oov words
total number of different words 74477 74477
number of words outside vocabulary which we can substitue using glove similarity 12523
number of words that will be regarded as unknonw(unk)/out-of-vocabulary(oov) 21954
46372 46372 3000 3000
H: oops building
D: there’s something different about this building can you guess what
H: mathematical formula for beer goggles
D: british scientists discover the exact equation so-called beer goggles
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:883] OS X does not support NUMA - returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:
name: GeForce GT 650M
major: 3 minor: 0 memoryClockRate (GHz) 0.9
pciBusID 0000:01:00.0
Total memory: 1023.69MiB
Free memory: 49.59MiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:755] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 650M, pci bus id: 0000:01:00.0)
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (256): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.

.....

I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x703a04a00 of size 1048576
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x703b04a00 of size 1048576
I tensorflow/core/common_runtime/bfc_allocator.cc:683] Free at 0x700b8dc00 of size 149504
I tensorflow/core/common_runtime/bfc_allocator.cc:683] Free at 0x700de4400 of size 204800
I tensorflow/core/common_runtime/bfc_allocator.cc:683] Free at 0x703c04a00 of size 1123840
I tensorflow/core/common_runtime/bfc_allocator.cc:689] Summary of in-use Chunks by size:
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 28 Chunks of size 256 totalling 7.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 24 Chunks of size 2048 totalling 48.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 4 Chunks of size 204800 totalling 800.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 31 Chunks of size 1048576 totalling 31.00MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 1139200 totalling 1.09MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 16000000 totalling 15.26MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] Sum Total of in-use chunks: 48.18MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:698] Stats:
Limit: 51998720
InUse: 50520576
MaxInUse: 50520576
NumAllocs: 129
MaxAllocSize: 16000000

W tensorflow/core/common_runtime/bfc_allocator.cc:270] **************************************************************************************************__
W tensorflow/core/common_runtime/bfc_allocator.cc:271] Ran out of memory trying to allocate 144.04MiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:907] Resource exhausted: OOM when allocating tensor with shape[944,40000]
Traceback (most recent call last):
File "train.py", line 275, in
name = 'timedistributed_1')))
File "/usr/local/lib/python2.7/site-packages/keras/models.py", line 307, in add
output_tensor = layer(self.outputs[0])
File "/usr/local/lib/python2.7/site-packages/keras/engine/topology.py", line 484, in call
self.build(input_shapes[0])
File "/usr/local/lib/python2.7/site-packages/keras/layers/wrappers.py", line 102, in build
self.layer.build(child_input_shape)
File "/usr/local/lib/python2.7/site-packages/keras/layers/core.py", line 604, in build
name='{}_W'.format(self.name))
File "/usr/local/lib/python2.7/site-packages/keras/initializations.py", line 59, in glorot_uniform
return uniform(shape, s, name=name)
File "/usr/local/lib/python2.7/site-packages/keras/initializations.py", line 32, in uniform
return K.random_uniform_variable(shape, -scale, scale, name=name)
File "/usr/local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 248, in random_uniform_variable
return variable(value, dtype=dtype, name=name)
File "/usr/local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 132, in variable
get_session().run(v.initializer)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 343, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 567, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 640, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 662, in _do_call
e.code)
tensorflow.python.framework.errors.ResourceExhaustedError: OOM when allocating tensor with shape[944,40000]
[[Node: random_uniform_13/RandomUniform = RandomUniformT=DT_INT32, dtype=DT_FLOAT, seed=0, seed2=0, _device="/job:localhost/replica:0/task:0/gpu:0"]]
Caused by op u'random_uniform_13/RandomUniform', defined at:
File "train.py", line 275, in
name = 'timedistributed_1')))
File "/usr/local/lib/python2.7/site-packages/keras/models.py", line 307, in add
output_tensor = layer(self.outputs[0])
File "/usr/local/lib/python2.7/site-packages/keras/engine/topology.py", line 484, in call
self.build(input_shapes[0])
File "/usr/local/lib/python2.7/site-packages/keras/layers/wrappers.py", line 102, in build
self.layer.build(child_input_shape)
File "/usr/local/lib/python2.7/site-packages/keras/layers/core.py", line 604, in build
name='{}_W'.format(self.name))
File "/usr/local/lib/python2.7/site-packages/keras/initializations.py", line 59, in glorot_uniform
return uniform(shape, s, name=name)
File "/usr/local/lib/python2.7/site-packages/keras/initializations.py", line 32, in uniform
return K.random_uniform_variable(shape, -scale, scale, name=name)
File "/usr/local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 247, in random_uniform_variable
value = tf.random_uniform_initializer(low, high, dtype=tf_dtype)(shape)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/ops/init_ops.py", line 98, in _initializer
return random_ops.random_uniform(shape, minval, maxval, dtype, seed=seed)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/ops/random_ops.py", line 182, in random_uniform
seed2=seed2)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_random_ops.py", line 96, in _random_uniform
seed=seed, seed2=seed2, name=name)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/ops/op_def_library.py", line 694, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2154, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1154, in init
self._traceback = _extract_stack()
`
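Two things stand out in this log. The 144.04 MiB allocation that fails is exactly the 944x40000 float32 weight matrix of the timedistributed_1 output layer (944 * 40000 * 4 bytes), and only ~49 MiB of the 1 GB card is reported free before TensorFlow allocates anything, which usually means another process is already holding most of the GPU. A possible workaround, assuming the old TF 0.x/1.x session API this code base runs on, is to stop TensorFlow from grabbing the remaining memory up front:

    # hedged sketch, not from the repo: make TensorFlow allocate GPU memory on demand
    import tensorflow as tf
    from keras.backend.tensorflow_backend import set_session

    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    # or cap it explicitly, e.g. config.gpu_options.per_process_gpu_memory_fraction = 0.8
    set_session(tf.Session(config=config))

Even then, a 40000-word softmax on top of three 512-unit LSTMs may simply not fit in 1 GB of VRAM, so reducing vocab_size or rnn_size may be unavoidable on this card.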

Prediction is not working

I'm running the same code on test data and get strange weights back.

import h5py
with h5py.File('data/%s.hdf5'%FN1, mode='r') as f:
    if 'layer_names' not in f.attrs and 'model_weights' in f:
        f = f['model_weights']
    weights = [np.copy(v) for v in f['timedistributed_1'].itervalues()]

and
map(lambda x: x.shape, weights)
is giving me back:
[(2,)]

I also ran the code with Keras 2.0.0 and with the current version. Could it be due to different versions?

Thanks in advance!
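One way to see what is actually stored in the weight file is to walk the HDF5 hierarchy and print every dataset's shape; a small sketch (not from the repo), which should show whether timedistributed_1 holds the expected 944x40000 kernel plus 40000 bias or something else entirely:

    import h5py

    with h5py.File('data/%s.hdf5' % FN1, mode='r') as f:
        g = f['model_weights'] if 'model_weights' in f else f
        def show(name, obj):
            shape = getattr(obj, 'shape', None)  # groups have no shape, datasets do
            print('%s %s' % (name, shape))
        g.visititems(show)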

Is GPU processing a prerequisite?

Hi. First of all, thanks a lot for your work.
I am about to try your scripts and I wanted to know if I could run everything without using GPUs. I have a server with 40 CPUs. Will it run?

Why use our own softmax instead of the built-in one?

# our very own softmax
def output2probs(output):
    output = np.dot(output, weights[0]) + weights[1]
    output -= output.max()
    output = np.exp(output)
    output /= output.sum()
    return output

I tried an example:

x = np.array([0.5,.3,.2])
x -= x.max()    #array([ 0. , -0.2, -0.3])
x = np.exp(x)   #array([ 1.        ,  0.81873075,  0.74081822])
x /= x.sum()    #array([ 0.39069383,  0.31987306,  0.28943311])

It seems it smooths out big gaps in the probabilities? Why do we want this? Is it because we do sampling afterwards? Why are we not simply taking the top-k most probable words provided by the predict_proba function?
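For what it's worth, subtracting the max before exponentiating does not smooth anything; it is the standard numerical-stability trick and leaves the resulting probabilities unchanged. A quick check (not from the repo):

    import numpy as np

    def softmax_plain(x):
        e = np.exp(x)
        return e / e.sum()

    def softmax_stable(x):
        e = np.exp(x - x.max())   # shifting by a constant cancels out in the ratio
        return e / e.sum()

    x = np.array([0.5, 0.3, 0.2])
    print(softmax_plain(x))    # [ 0.39069383  0.31987306  0.28943311]
    print(softmax_stable(x))   # identical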

Data Set

Does this kind of model support a data set in .csv format?

I mean, as an example: news_headings.csv

Please help me!
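A CSV works fine as a source; it just has to be converted into the (heads, descs, keywords) pickle described in the Data section of the README. A hedged sketch, assuming the file has headline and article columns (adjust the names to your CSV) and that the texts are already tokenized with spaces:

    import pickle
    import pandas as pd

    df = pd.read_csv('news_headings.csv')       # hypothetical file and column names
    heads = df['headline'].astype(str).tolist()
    descs = df['article'].astype(str).tolist()

    with open('data/tokens.pkl', 'wb') as fp:   # path is a placeholder, see the README
        pickle.dump((heads, descs, None), fp, protocol=2)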

RuntimeWarning: invalid value encountered in log

I am getting this error when I call gensamples() for prediction

RuntimeWarning: invalid value encountered in log cand_scores = np.array(live_scores)[:,None] - np.log(probs)

Also, the output shows HEAD: 3.14...
but no words (headline) are shown along with it.

Sometimes the output is nan.

probs contains negative values. What could be the possible reason?

When I replace np.log(probs) with np.log(np.absolute(probs)),
I get an output sentence with no meaning. The words seem random.

Any help would be greatly appreciated!
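A softmax output should never contain negative values, so if probs does, something upstream is already producing invalid numbers (for example, output2probs being applied with the wrong weights, as in the "failed to find layer timedistributed_1 in model" issue below); np.absolute only hides that. If you just want to silence the warning while debugging, a hedged sketch is to floor the probabilities before taking the log:

    import numpy as np

    def safe_log(probs, eps=1e-12):
        # clip to a tiny positive floor so log() never sees zero or negative values
        return np.log(np.clip(probs, eps, 1.0))

    # cand_scores = np.array(live_scores)[:, None] - safe_log(probs)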

Suggestions to tackle this MemoryError

Hello, I am using ami-125b2c72 (g2.2xlarge) with a spot price, as you suggested in another issue (thanks a lot). After struggling a bit with the CUDA drivers I finally got to run some epochs, and I am able to save and load all the training files from S3. Now, I have 1441135 examples. I trained one epoch, saved the weights, stopped the AMI, re-ran the script, loaded the weights, trained one more epoch, and then it crashed with the output below. I wonder if you, @udibr, could give me some ideas or intuitions about what my problem is. One of my questions is: is the memory error about regular RAM or GPU memory? (Maybe I could use another AMI.) I also got the warning about "Epoch comprised more than samples_per_epoch samples", but I am not sure if I should do anything about it.

`ubuntu@ip-xxxxxx:~/auris$ python train2.py
Using Theano backend.
Using gpu device 0: GRID K520 (CNMeM is enabled with initial size: 95.0% of memory, cuDNN 4007)
READING WORD EMBEDDING
('/home/ubuntu/auris//en3_vocabulary-embedding.pkl', ' already downloaded')
('/home/ubuntu/auris//en3_vocabulary-embedding.data.pkl', ' already downloaded')
number of examples 1441135 1441135
dimension of embedding space for words 100
vocabulary size 40000 the last 10 words can be used as place holders for unknown/oov words
total number of different words 1481094 1481094
number of words outside vocabulary which we can substitue using glove similarity 208580
number of words that will be regarded as unknonw(unk)/out-of-vocabulary(oov) 1232514
H: Vuwani schools damage prompts call for new law to punish vandals
D: Department officials on Tuesday briefed MPs on the recovery plans for the protest-ravaged^ Vuwani area in Limpopo . Earlier in May Vuwani residents protested against the creation of a new municipality which will include Malamulele^ residents . The violent protests led to the torching of several schools and other public amenities in the area which disrupted classes .
H: Kathmandu Post- Mitsubishi Motors admits cheating fuel tests since 1991
D: Mitsubishi 's eK^ Wagon^ was one of the models affected Reuters Apr 26 , 2016- Mitsubishi Motors has admitted to falsifying some fuel consumption tests since 1991 . The admission follows last week 's revelation that it had falsified fuel economy data for more than 600,000 vehicles sold in Japan . 'For the domestic market , we have been using that method since 1991 , ' said vice-president Ryugo^ Nakao^ at a press conference in Tokyo on Tuesday .
MODEL
0 cls=Embedding name=embedding_1
40000x100
1 cls=LSTM name=lstm_1
100x512 512x512 512 100x512 512x512 512 100x512 512x512 512 100x512 512x512 512
2 cls=Dropout name=dropout_1

3 cls=LSTM name=lstm_2
512x512 512x512 512 512x512 512x512 512 512x512 512x512 512 512x512 512x512 512
4 cls=Dropout name=dropout_2

5 cls=LSTM name=lstm_3
512x512 512x512 512 512x512 512x512 512 512x512 512x512 512 512x512 512x512 512
6 cls=Dropout name=dropout_3

7 cls=SimpleContext name=simplecontext_1

8 cls=TimeDistributed name=timedistributed_1
944x40000 40000
9 cls=Activation name=activation_1

LOAD
downloading train.hdf5 to /home/ubuntu/auris/train.hdf5:
downloaded /home/ubuntu/auris/train.hdf5
Weights downloaded
TEST
....
....


H: ~ Kate Beckinsale and teenage daughter text naked pictures of Michael Sheen to each other to <0>^ themselves up ' _ _ _ _ _
D: the Port Talbot actor to each other . The underworld star , who lives in the States with 17-year-old Lily , made the odd revelation
TRAIN
Iteration 0
Epoch 1/1
29952/30000 [============================>.] - ETA: 1s - loss: 7.7543/home/ubuntu/anaconda2/lib/python2.7/site-packages/keras/engine/training.py:1403: UserWarning: Epoch comprised more than 'samples_per_epoch' samples, which might affect learning results. Set 'samples_per_epoch' correctly to avoid this warning.
30016/30000 [==============================] - 746s - loss: 7.7548 - val_loss: 7.8132
('Uploaded ', '/home/ubuntu/auris/train.history.pkl', ' succesfully')
('Uploaded ', '/home/ubuntu/auris/train.hdf5', ' succesfully')
HEAD: A Python^ Bit^ A Man 's P
DESC: The man fought to remove
HEADS:
34.5337700502 Syrian , Attaporn^ Attaporn^ at , to
43.0280796466 Former wife.She^ : for hour.Eventually^ wife.She^ to in wife.She^ wife.She^
Iteration 1
Epoch 1/1
29952/30000 [============================>.] - ETA: 1s - loss: 7.7520Exception in thread Thread-4:
Traceback (most recent call last):
File "/home/ubuntu/anaconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/home/ubuntu/anaconda2/lib/python2.7/threading.py", line 754, in run
self.__target(_self.__args, *_self.__kwargs)
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/keras/engine/training.py", line 404, in data_generator_task
generator_output = next(generator)
File "train2.py", line 498, in gen
yield conv_seq_labels(xds, xhs, nflips=nflips, model=model, debug=debug)
File "train2.py", line 459, in conv_seq_labels
y = np.zeros((batch_size, maxlenh, vocab_size))
MemoryError

Traceback (most recent call last):
File "train2.py", line 538, in
nb_epoch=1, validation_data=valgen, nb_val_samples=nb_val_samples
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/keras/models.py", line 656, in fit_generator
max_q_size=max_q_size)
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/keras/engine/training.py", line 1412, in fit_generator
max_q_size=max_q_size)
File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/keras/engine/training.py", line 1474, in evaluate_generator
'or (x, y). Found: ' + str(generator_output))
Exception: output of generator should be a tuple (x, y, sample_weight) or (x, y). Found: None
`
And as always, thanks for giving us the opportunity to use state-of-the-art machine learning techniques in our own projects :)
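The traceback points at y = np.zeros((batch_size, maxlenh, vocab_size)) inside the data generator, which is allocated in ordinary host RAM, not GPU memory. A quick back-of-the-envelope check shows why that array is easy to underestimate (batch_size here is an assumption; substitute your own value):

    import numpy as np

    batch_size, maxlenh, vocab_size = 64, 25, 40000   # assumed values; use your own
    nbytes = batch_size * maxlenh * vocab_size * np.dtype('float64').itemsize
    print('one one-hot label batch: ~%.0f MB of host RAM' % (nbytes / 1e6))   # ~512 MB

Several of these can be alive at once because fit_generator keeps a queue of prepared batches, so reducing batch_size (or the generator queue size) is a likely way to stay within the instance's RAM.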

Rank and Dimension errors

I am trying to run the code on Colab with Python 2 and Keras, but I am getting this error:

----> wmodel.add(WSimpleContext())

ValueError: Shape must be rank 0 but is rank 3 for 'cond_208/Switch' (op: 'Switch') with input shapes: [?,1,50], [?,1,50].

at this line
activation_energies = K.switch(mask[:, None, :maxlend], activation_energies, -1e20)

When I comment out that line, I then get this error:
ValueError: shapes (512,) and (944,40000) not aligned: 512 (dim 0) != 944 (dim 0)

at this line
output = np.dot(output, weights[0]) + weights[1]

Any help would be greatly appreciated!

Some doubts

Hello @udibr

Thank you for the awesome project!

However, I am stuck at training the model as done in train.ipynb. I have an 8-core CPU and an Nvidia GTX 660M (2 GB), which is pretty standard. I created the vocabulary embedding using a 100K dataset (news articles). Using TensorFlow, I encountered a resource error (out of memory); using Theano and reducing the batch size resolved it.

But completing iteration 0 alone takes around 3-4 hrs.

Here is what I don't understand:

  1. Are we training on only nb_train_samples examples?
  2. How much data does one iteration train on?

I know it might sound noob, but I am really interested in it. I am an undergrad, trying my best to learn.

Waiting for your early reply!

About "avoid" in beamsearch in predict.ipynb

Hello @udibr , I'm confused about the parameter "avoid" in beamsearch and gensamples in predict.ipynb; it doesn't exist in train.ipynb. Also, what role does the parameter "short" play in gensamples? And what is the role of "codes" in gensamples?
I'd appreciate it if you could answer these "simple" questions for me.

Questions for a different use case

Will the model give good results if it is trained to generate an abstractive summary of a large piece of text?

Will the model give good results if it is trained to generate an abstractive summary of 3-4 sentences?

If not, what changes would you suggest for good results?

failed to find layer timedistributed_1 in model

Hi, I am currently trying to execute the following line of code inside predict.ipynb...
weights = [np.copy(v) for v in f['timedistributed_1'].itervalues()]
but I keep getting the following error, and I am not sure what might be causing it.

Perhaps the issue is with the following code inside train.ipynb...

model.add(TimeDistributed(Dense(vocab_size, kernel_regularizer=regularizer, bias_regularizer=regularizer, name = 'timedistributed_1')))

but it seems ok to me

PS: I am using TensorFlow, if that helps.

Error:

KeyError Traceback (most recent call last)
in ()
3 if 'layer_names' not in f.attrs and 'model_weights' in f:
4 f = f['model_weights']
----> 5 weights = [np.copy(v) for v in f['timedistributed_1'].itervalues()]

h5py/_objects.pyx in h5py._objects.with_phil.wrapper (/private/var/folders/my/m6ynh3bn6tq06h7xr3js0z7r0000gn/T/pip-wdzlRM-build/h5py/_objects.c:2840)()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper (/private/var/folders/my/m6ynh3bn6tq06h7xr3js0z7r0000gn/T/pip-wdzlRM-build/h5py/_objects.c:2798)()

/usr/local/lib/python2.7/site-packages/h5py/_hl/group.pyc in getitem(self, name)
167 raise ValueError("Invalid HDF5 object reference")
168 else:
--> 169 oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
170
171 otype = h5i.get_type(oid)

h5py/_objects.pyx in h5py._objects.with_phil.wrapper (/private/var/folders/my/m6ynh3bn6tq06h7xr3js0z7r0000gn/T/pip-wdzlRM-build/h5py/_objects.c:2840)()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper (/private/var/folders/my/m6ynh3bn6tq06h7xr3js0z7r0000gn/T/pip-wdzlRM-build/h5py/_objects.c:2798)()

h5py/h5o.pyx in h5py.h5o.open (/private/var/folders/my/m6ynh3bn6tq06h7xr3js0z7r0000gn/T/pip-wdzlRM-build/h5py/h5o.c:3734)()

KeyError: "Unable to open object (Object 'timedistributed_1' doesn't exist)"

epoch loss and accuracy

Epoch training accuracy is not going above 60 percent.
I tried different optimizers such as Adam and SGD, learning rates such as 0.1, 0.001, 1e-4, and 0.5, batch sizes of 16 and 8, and 3, 6, and 12 RNN layers.

image

Please let me know how to improve the accuracy.

Prediction support for custom text

@udibr, I tried the prediction model and it works well with the dataset I trained it on. However, I am not able to figure out a way to use it on a custom text/article.
Kindly provide support for that as well.

Issue Generating Predictions

When executing the command

samples = gensamples(X=X, skips=2, batch_size=batch_size, k=10, temperature=1.)

If X is the same size as maxlend, nothing gets output. If X is larger or smaller than maxlend, then I get the following error:

'HEADS:

IndexError Traceback (most recent call last)
in ()
----> 1 samples = gensamples(X=X, skips=2, batch_size=batch_size, k=10, temperature=1.)

in gensamples(X, X_test, Y_test, avoid, avoid_score, skips, k, batch_size, short, temperature, use_unk)
38 fold_start = vocab_fold(start)
39 sample, score = beamsearch(predict=keras_rnn_predict, start=fold_start, avoid=avoid, avoid_score=avoid_score,
---> 40 k=k, temperature=temperature, use_unk=use_unk)
41 assert all(s[maxlend] == eos for s in sample)
42 samples += [(s,start,scr) for s,scr in zip(sample,score)]

in beamsearch(predict, start, avoid, avoid_score, k, maxsample, use_unk, oov, empty, eos, temperature)
27 # for every possible live sample calc prob for every possible label
28 probs = predict(live_samples, empty=empty)
---> 29 assert vocab_size == probs.shape[1]
30
31 # total score for every sample is sum of -log of word prb

IndexError: tuple index out of range'

ValueError: generator already executing

Hello @udibr

I have a variant of this code. While running it on my local machine I am getting a "ValueError: generator already executing" error. I am not sure what I am doing wrong.

I don't see any thread locking in your code, so how did you avoid the threading issue?

Thanks,
Dhruven Vora

some questions for training model

Hi, udibr,
I was running your code with the Reuters news data; vocabulary-embedding.ipynb is OK for everything.
train.ipynb is also OK, apart from a warning, but it is too slow to train the model.
I have some questions:
1. How do I read the file train.history.pkl? I want to check the training history.
2. I ran predict.ipynb with a model trained for only 5 iterations, but got the message "failed to find layer timedistributed_1 in model" when loading the model. Why do I get this message?
3. I find the prediction results are not good with a model trained for only 5 iterations. How many iterations are needed for the results to be just OK? I have more than 3K news articles for training the model on a CPU, and 500 iterations is too slow.

Thank you.

Provide Trained weights

I am trying to train the model but it is taking a lot of time since I am doing this on CPU, so can anyone provide the trained weights (if anyone has successfully trained and saved the weights)? @udibr

Edit1: Providing the trained weights would be great for someone who is just trying to play around with the model and test it out with different datasets.

Cheers

Keyerror when running gensamples in predict in Tensorflow

I was wondering if you had any issues running predict in TensorFlow, as I was having a few. Once I get these all resolved, I will go about submitting a pull request. One of the first errors I received stated that sizes 25 and 50 were not compatible; I was able to resolve this by setting maxlend to 25.
Then I ran it and got some issues with the K.switch call in the assignment to activation_energies. I noticed that this was similar logic to what used to be in training's simple_context, and since it was changed in training, I assumed that making it the same here would resolve the problem, and it did. The modification in predict's simple_context was to change the previous entry to:

activation_energies = activation_energies + -1e20*K.expand_dims(1.-K.cast(mask[:, :maxlend],'float32'),1)

I then received a KeyError: '*' when running gensamples against the Billy Joel entry. After looking through other issues on GitHub I saw that this was due to the value not being in the dictionary. So I modified the else at the top of gensamples to:

else:
        for w in X.split():
            w = w.rstrip('^')
            if not w in word2idx:
                word2idx[w] = word2idx.get(w, len(word2idx))

        x = [word2idx[w.rstrip('^')] for w in X.split()]

Now when I run gensamples in cell [43] I get the error below, and I don't quite understand why the implementation would give me an index-out-of-range error. I can get around it by also checking in the for loop whether w is out of the range of idx2word, but this seems hacky and totally incorrect. For now, this is where I have been trying to track down the source of the problem. Should you have any enlightenment for me, I'd love to hear it. Thanks, and I will let you know if I get something worked out.

for w in sample:
    if w == eos or w >= len(idx2word):
        break

ERROR I AM GETTING:

HEADS:
17.3501874208 analysts kopparbergs
27.0937678814 cello better” owners
31.910461247 firefox fisher your adhesives


KeyError Traceback (most recent call last)
in ()
----> 1 samples = gensamples(X=X, skips=2, batch_size=batch_size, k=10, temperature=1.)

in gensamples(X, X_test, Y_test, avoid, avoid_score, skips, k, batch_size, short, temperature, use_unk)
52 if w == eos:
53 break
---> 54 words.append(idx2word[w])
55 code += chr(w//(256*256)) + chr((w//256)%256) + chr(w%256)
56 if short:

KeyError: 74477


This is the output of cell [12] for me:

dimension of embedding space for words 100
vocabulary size 40000 the last 10 words can be used as place holders for unknown/oov words
total number of different words 74477 74477
number of words outside vocabulary which we can substitue using glove similarity 12519
number of words that will be regarded as unknonw(unk)/out-of-vocabulary(oov) 21958

A question about maxlenh and maxlend

I understand that if a description has more than maxlend words it will be clipped, so some of the "interesting" information could potentially be lost (same with headlines and maxlenh, although headlines are shorter so I guess it's not that bad). Is there any reason why you set both of them to 25 instead of, let's say, 25 (h) and 70 (d)?
I understand that it may be due to limited computing resources... it's just that my intuition tells me it is a high price to pay.

After cell [4] you say:

I've started with maxlend=0 (see below) in which the description was ignored.

Why would you want to ignore the description? What is the purpose of feeding just headlines without descriptions?

Thanks @udibr :)

ValueError: too many values to unpack error while trying to pickle.load in vocabulary-embedding

Hi udibr,

Merry Christmas and a Happy New Year to you! I am trying to experiment with your code to reproduce the results. I created the dataset from the DUC2003 dataset. I have two questions:

  1. I see "descs is a list of all the article strings in the same order and length as heads" as an instruction on the readme page. Does it mean the number of words in each descs entry should be the same as the number of words in the head of the corresponding record?
    For example: the head has 7 words in the row-1 title, so I should restrict the desc of row 1 to 7 words; 10 words exist in the row-2 title, so I should restrict the desc of row 2 to 10 words.
    Is my understanding correct in this regard?

  2. I created the dataset as instructed: created a list of all titles, created a list of all article strings, created a list of keywords (None). I created a tuple out of the three lists and then a pkl, but when I try to load the pkl in vocabulary-embedding it results in "ValueError: too many values to unpack" during pickle.load.

Here are the example tuples prior to creating a pkl -

[('Cashew shortage causes Vietnam to seek outside sources to continue processing', 'HANOI Vietnam AP Vietnam has floated the price of cashewsdue to a shortage of nuts to keep processing plants running anexecutive of the Vietnam Cashew Nuts Association said Friday Cashews have been trading at 8 000 dong to 10 000 dong 62 80cents per kilogram 2 2 pounds well up from the ceiling price of6 500 dong 50 cents set by the government the executive said The executive speaking on condition he not be identified saidforeign sellers have raised their offer for raw nuts to more thandlrs 700 per ton from dlrs 650 earlier He said the price has been pushed up because there aren t enoughcashews available to run the 60 nut processing plants which have acapacity of more than 200 000 tons per year Vietnam produced about 110 000 tons in the February May crop down from the average of 150 000 180 000 tons largely because ofthe country s worst drought in a century Cashews are planted only in southern Vietnam which wasparticularly parched Vietnam never has imported cashews for processing for re exportbut plans to import 30 000 tons this year to fill the shortfall Vietnam exported 33 000 processed cashew nuts last year fetching dlrs 133 million About 4 5 kilograms of raw cashews canproduce one kilogram 2 2 pounds of refined nuts The cashew association has 50 members', 'None'), ('Vietnam cashew production down shortage leads to price float', 'HANOI Vietnam AP Vietnam has floated the price of cashewsdue to a shortage of nuts to keep processing plants running anexecutive of the Vietnam Cashew Nuts Association said Friday Cashews have been trading at 8 000 dong to 10 000 dong 62 80cents per kilogram 2 2 pounds well up from the ceiling price of6 500 dong 50 cents set by the government the executive said The executive speaking on condition he not be identified saidforeign sellers have raised their offer for raw nuts to more thandlrs 700 per ton from dlrs 650 earlier He said the price has been pushed up because there aren t enoughcashews available to run the 60 nut processing plants which have acapacity of more than 200 000 tons per year Vietnam produced about 110 000 tons in the February May crop down from the average of 150 000 180 000 tons largely because ofthe country s worst drought in a century Cashews are planted only in southern Vietnam which wasparticularly parched Vietnam never has imported cashews for processing for re exportbut plans to import 30 000 tons this year to fill the shortfall Vietnam exported 33 000 processed cashew nuts last year fetching dlrs 133 million About 4 5 kilograms of raw cashews canproduce one kilogram 2 2 pounds of refined nuts The cashew association has 50 members', 'None')]

Any help here is greatly appreciated. Thanks much in advance for your time and help!

Sunil.
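For what it's worth, the README's "same order and length as heads" refers to the length of the lists (one desc per head), not to word counts, and the example above is a list of per-record tuples rather than the expected tuple of three lists, which is the likely cause of the unpacking error. A hedged sketch of the conversion (records stands for the list shown above):

    import pickle

    # records: the list of (head, desc, keywords) tuples shown above
    heads, descs, keywords = zip(*records)

    with open('data/tokens.pkl', 'wb') as fp:   # path is a placeholder
        # the notebooks expect a tuple of three lists, not a list of tuples
        pickle.dump((list(heads), list(descs), None), fp, protocol=2)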

Where is the model.fit function?

I want to use an encoder-decoder model for some other data, and I am trying to understand this code, but I couldn't find the fit method in train.ipynb. After padding the descriptions and headings, how are these vectors used for training the model? What are the dimensions of X and Y in model.fit? The dimension of X may be #descriptions x 50 and the dimension of Y may be #headings x 50, where #descriptions equals #headings.

Below is the command I used to fit the model.
model_fit = model.fit(nxTrain, nyTrain, nb_epoch=1, batch_size=64, verbose=2)
These are the dimensions of the X and Y passed to model.fit:
xTrain.shape
(17853, 50)

yTrain.shape
(17853, 25)

But I got the error below.
Exception: Error when checking model target: expected activation_1 to have 3 dimensions, but got array with shape (17853, 25)

Please check the model summary.
print(model.summary())


Layer (type)                      Output Shape        Param #     Connected to
embedding_1 (Embedding)           (None, 50, 100)      4000000     embedding_input_1[0][0]
lstm_1 (LSTM)                     (None, 50, 512)      1255424     embedding_1[0][0]
dropout_1 (Dropout)               (None, 50, 512)      0           lstm_1[0][0]
lstm_2 (LSTM)                     (None, 50, 512)      2099200     dropout_1[0][0]
dropout_2 (Dropout)               (None, 50, 512)      0           lstm_2[0][0]
lstm_3 (LSTM)                     (None, 50, 512)      2099200     dropout_2[0][0]
dropout_3 (Dropout)               (None, 50, 512)      0           lstm_3[0][0]
simplecontext_1 (SimpleContext)   (None, 25, 944)      0           dropout_3[0][0]
timedistributed_1 (TimeDistribut  (None, 25, 40000)    37800000    simplecontext_1[0][0]
activation_1 (Activation)         (None, 25, 40000)    0           timedistributed_1[0][0]

Total params: 47253824

I used the same model as explained in train.ipynb. I don't understand what's wrong here.
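The error message and the summary point at the mismatch: activation_1 outputs (None, 25, 40000), so the target Y must be a 3-D one-hot array of shape (samples, maxlenh=25, vocab_size=40000), not a (17853, 25) index matrix. In train.ipynb the model is fed through fit_generator (the gen/conv_seq_labels generator seen in the MemoryError traceback above) rather than a plain model.fit, presumably because materializing that one-hot array for the whole dataset would be enormous. A hedged sketch of the shape the target needs per batch:

    import numpy as np

    def one_hot_targets(y_idx, maxlenh=25, vocab_size=40000):
        # y_idx: a (batch, maxlenh) matrix of word indices, like yTrain above
        y = np.zeros((len(y_idx), maxlenh, vocab_size), dtype=np.float32)
        for i, seq in enumerate(y_idx):
            for t, w in enumerate(seq[:maxlenh]):
                y[i, t, int(w)] = 1.0
        return y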

Getting issue on running In [69]

Hi,

Thanks for this project. I am trying these scripts and vocabulary-embedding and train processed fine. When I tried predict, I ran into an error in cell In [69]. The error is as follows:


TypeError Traceback (most recent call last)
in ()
----> 1 samples = gensamples(X, avoid=avoid, avoid_score=.1, skips=2, batch_size=batch_size, k=10, temperature=1.)

in gensamples(X, X_test, Y_test, avoid, avoid_score, skips, k, batch_size, short, temperature, use_unk)
21 avoid = [a.split() if isinstance(a,str) else a for a in avoid]
22 avoid = [vocab_fold([w if isinstance(w,int) else word2idx[w] for w in a])
---> 23 for a in avoid]
24
25 print 'HEADS:'

TypeError: 'numpy.int64' object is not iterable

Please let me know if you need any information on this to debug.

'unzip' is not recognized as an internal or external command, operable program or batch file.

Can someone help me with this issue?

When I try to run this block:

fname = 'glove.6B.%dd.txt'%embedding_dim
import os
datadir_base = os.path.expanduser(os.path.join('~', '.keras'))
if not os.access(datadir_base, os.W_OK):
    datadir_base = os.path.join('/tmp', '.keras')
datadir = os.path.join(datadir_base, 'datasets')
glove_name = os.path.join(datadir, fname)
if not os.path.exists(glove_name):
    path = 'glove.6B.zip'
    path = get_file(path, origin="http://nlp.stanford.edu/data/glove.6B.zip")
    !unzip {datadir}/{path}

I am getting this error:

'unzip' is not recognized as an internal or external command,
operable program or batch file.
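The !unzip line shells out to the unzip command, which is typically not available on Windows. A hedged, cross-platform alternative is to extract the archive with Python's standard zipfile module instead of the shell call (datadir and get_file as in the cell above):

    import zipfile
    from keras.utils.data_utils import get_file   # same helper the notebook already uses

    path = get_file('glove.6B.zip', origin='http://nlp.stanford.edu/data/glove.6B.zip')
    with zipfile.ZipFile(path) as zf:
        zf.extractall(datadir)                    # unpacks glove.6B.*.txt into datadir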

RNN SIZE and RNN LAYERS

How do I determine the RNN size and the number of RNN layers for my own task?
I am using RNN size = 512 and 3 RNN layers.
Will changing these help in reducing the epoch loss?
The epoch loss for now is around 6, which I guess is bad.
How do I reduce the epoch loss if it can't be reduced by changing the above parameters?

ValueError: Shape (?, 50, 512) must have rank 2

Hey Udibr! I am receiving the above error when training and was wondering if you have run into this by chance, or perhaps have an idea as to what may be causing it. First off, I am using the latest TensorFlow. I found someone getting the same issue in another project at the link below, which he said he wasn't seeing in Theano but was seeing in TensorFlow: jocicmarko/ultrasound-nerve-segmentation@e143399

I figured I would post here just in case you had any more insight into what might be happening. I have pasted the error output below. I am still tracing this down, but if you have any ideas, please let me know. Thanks!

H: mathematical formula for beer goggles
D: british scientists discover the exact equation so-called beer goggles
Traceback (most recent call last):
File "train.py", line 267, in
model.add(SimpleContext())
File "/usr/local/lib/python2.7/site-packages/keras/models.py", line 307, in add
output_tensor = layer(self.outputs[0])
File "/usr/local/lib/python2.7/site-packages/keras/engine/topology.py", line 511, in call
self.add_inbound_node(inbound_layers, node_indices, tensor_indices)
File "/usr/local/lib/python2.7/site-packages/keras/engine/topology.py", line 569, in add_inbound_node
Node.create_node(self, inbound_layers, node_indices, tensor_indices)
File "/usr/local/lib/python2.7/site-packages/keras/engine/topology.py", line 150, in create_node
output_tensors = to_list(outbound_layer.call(input_tensors[0], mask=input_masks[0]))
File "/usr/local/lib/python2.7/site-packages/keras/layers/core.py", line 455, in call
return self.function(x, **arguments)
File "train.py", line 227, in simple_context
desc, head = X[:,:maxlend], X[:,maxlend:]
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/ops/array_ops.py", line 167, in _SliceHelper
sliced = slice(tensor, indices, sizes)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/ops/array_ops.py", line 217, in slice
return gen_array_ops.slice(input, begin, size, name=name)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 1318, in _slice
name=name)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/ops/op_def_library.py", line 655, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2156, in create_op
set_shapes_for_outputs(ret)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1612, in set_shapes_for_outputs
shapes = shape_func(op)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/ops/array_ops.py", line 865, in _SliceShape
input_shape.assert_has_rank(ndims)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/framework/tensor_shape.py", line 605, in assert_has_rank
raise ValueError("Shape %s must have rank %d" % (self, rank))
ValueError: Shape (?, 50, 512) must have rank 2

A question in function simple_context

Hi,
in train.ipynb I'm confused about this line:
activation_energies = activation_energies + -1e20*K.expand_dims(1.-K.cast(mask[:, :maxlend],'float32'),1)
I think this line is unnecessary (I may be wrong); please explain this line for me in detail.
Also, when computing the attention weights, I think we should only use the current word's ht (ht being time step t's hidden state) during decoding, but the function simple_context uses all headline words' ht at every time step?
What's more, can you show me the paper or other references about how to implement the attention layer? I think I am not particularly familiar with it. Thank you.
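For intuition, that line acts as an additive mask rather than a no-op: adding -1e20 to the activation energies of padded description positions drives their softmax weight to (numerically) zero, so the attention cannot select padding. A small numpy illustration (not the repo's code):

    import numpy as np

    energies = np.array([2.0, 1.0, 0.5, 0.0])
    mask = np.array([1., 1., 0., 0.])          # 1 = real description token, 0 = padding

    masked = energies + -1e20 * (1. - mask)    # same trick as the Keras line above
    weights = np.exp(masked - masked.max())
    weights /= weights.sum()
    print(weights)                             # padded positions get ~0 attention weight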

rnn size

What should the RNN size be?
Here it says it should be equal to 16030 word gen? What is meant by that?

beamsearch(): index 0 is out of bounds for axis 1 with size 0?

The network can train smoothly by calling
gensamples(batch_size=batch_size)
, but I am just wondering why the function beamsearch() doesn't work when it is called from this line

gensamples(skips=2, batch_size=batch_size, k=10, temperature=1.)

def beamsearch():
...
cand_scores[:,empty] = 1e20
...

The error is IndexError: index 0 is out of bounds for axis 1 with size 0. I am new to ML; could you please give me some advice on how to fix it? Thank you.

Bidirectional LSTM

I tried to use a bidirectional LSTM with merge_mode='sum' for encoding, but when I try to predict headlines the model barely generates anything; however, the loss is lower than when I use the simple LSTM. This is the only change that I made. Do you know why this happens?

Do we need to learn beam search for testing our model?

As you mentioned, for training we don't need to take beam search into consideration, since we have (desc + headline). For testing, do we need to implement beam search?

The training model's input is (25 desc + 25 headline words); what about testing, where we only have 25 headline words and the model input size is 50?

The paper talks about generating the next word from the given 25 words and feeding it back in, so we start at the 26th input and end at the 50th word (a 25-word headline). For example, generating the 28th word takes words 0-27 as input.

Are we using beam search to keep the top-k probable words at step i, feed each of them individually to the next step (i+1), and select the most probable candidate?
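A generic beam-search sketch of that top-k expansion idea (not the repo's beamsearch(); predict_proba here is a hypothetical function returning next-word probabilities for a given token prefix):

    import numpy as np

    def beam_search(predict_proba, start, eos, k=10, max_len=25):
        beams = [(0.0, list(start))]              # (cumulative -log prob, token sequence)
        finished = []
        for _ in range(max_len):
            candidates = []
            for score, seq in beams:
                probs = predict_proba(seq)
                for w in np.argsort(probs)[-k:]:  # expand only the k most probable words
                    candidates.append((score - np.log(probs[w]), seq + [int(w)]))
            candidates.sort(key=lambda c: c[0])
            beams = []
            for score, seq in candidates[:k]:     # keep the k best partial headlines
                (finished if seq[-1] == eos else beams).append((score, seq))
            if not beams:
                break
        return sorted(finished + beams, key=lambda c: c[0])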

About skips and k in gensamples

Hello @udibr , I was wondering how I could optimize the speed of predictions, because I don't have a GPU to run them on. I am trying to understand gensamples and beamsearch:

  • Is it true that higher values of k tend to provide better prediction results?
  • What exactly is the skips parameter for?

Thanks!

Check Efficiency.

Hey, do you have trained models which I can use to check the efficiency?
