
deepspell's People

Contributors

majortal


deepspell's Issues

continue training after crash

When I train the seq2seq TensorFlow example (or an RNN in darknet), I save a backup between epochs.

But I'm not sure whether keras_spell.py saves the model from time to time. How can I do this properly?
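
For reference: the later version of keras_spell.py quoted further down this page already saves the model at the end of every epoch through its OnEpochEndCallback. A minimal alternative sketch using Keras' built-in checkpoint callback (the file-name pattern below is an assumption, not the repo's own naming):

# Minimal sketch: save the full model after every epoch so training can resume.
# The filepath pattern is an assumption, not the one keras_spell.py uses.
from keras.callbacks import ModelCheckpoint

checkpoint = ModelCheckpoint("keras_spell_checkpoint_e{epoch:02d}.h5",
                             save_weights_only=False, verbose=1)
# model.fit(..., callbacks=[checkpoint])
# later: model = keras.models.load_model("keras_spell_checkpoint_e42.h5")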

using a training dataset with errors

What about using a big dataset that contains errors?

I have a specific domain (user slang, etc.), and I don't have clean data, but I expect that in most cases users write their messages correctly rather than making mistakes.

What do you think about this? Can we build a model on such a dataset?

preprocesses_split_lines4(): gensim Deprecation Warning + Error

Hi everyone,

I ran into a Deprecation Warning in the function preprocesses_split_lines4(): "DeprecationWarning: Deprecated. Use gensim.models.KeyedVectors.load_word2vec_format instead."
The gensim word2vec import seems to be out of date. I tried to fix it with KeyedVectors, but then the (seemingly pre-trained?) model "fw2v.bin" cannot be found. I also could not find this model anywhere on the web...
Any suggestions?

Thanks a lot in advance!
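
A minimal sketch of the KeyedVectors replacement, assuming a reasonably recent gensim (attribute names vary between gensim versions); note that "fw2v.bin" itself does not ship with the repo, so the path remains a placeholder:

# Updated gensim API for preprocesses_split_lines4(); "fw2v.bin" is the missing
# pre-trained word2vec model and has to be provided separately.
from gensim.models import KeyedVectors

FILTERED_W2V = "fw2v.bin"  # placeholder path
model = KeyedVectors.load_word2vec_format(FILTERED_W2V, binary=True)
print(len(model.index2word))  # vocabulary size (index_to_key in gensim 4.x)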

char_frequency.json is missing

Hi,
I tried to rerun the code but found that char_frequency.json was missing.
May I ask about the availability of this JSON file?
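
char_frequency.json is not checked into the repo; it is written by the preprocessing step that counts characters in the cleaned corpus (preprocesses_data_analyze_chars() in the script quoted later on this page), so it can be regenerated locally. A standalone sketch of that step, assuming the cleaned corpus already exists (the path is an assumption):

# Rebuild char_frequency.json from an already-cleaned corpus, mirroring
# preprocesses_data_analyze_chars() in keras_spell.py.
import json
from collections import Counter

counter = Counter()
with open("news.2013.en.clean", encoding="utf-8") as corpus:  # assumed path
    for line in corpus:
        counter.update(line)

with open("char_frequency.json", "w", encoding="utf-8") as output_file:
    json.dump(counter, output_file)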

MemoryError

I have 64 GB of memory plus 64 GB of swap, but...

Is this expected? (A rough memory estimate follows the traceback below.)

47590536
answer:   'To listen to the audio turn off the.....'
question: 'To listen to the audio turn off the.....'

47590536
answer:   'leader, who sent out fund-raising.......'
question: 'leader, who sent out fund-raising.......'

Vectorization...
X = np_zeros
Traceback (most recent call last):
  File "keras_spell.py", line 302, in <module>
    main_news()
  File "keras_spell.py", line 296, in main_news
    X_train, X_val, y_train, y_val, y_maxlen, ctable = vectorize(questions, answers, chars)
  File "keras_spell.py", line 97, in vectorize
    X = np_zeros((len_of_questions, x_maxlen, len(chars)), dtype=np.bool)
MemoryError
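
A rough estimate of why the one-shot vectorization blows up, assuming the default MAX_INPUT_LEN of 40 and roughly 100 allowed characters: the one-hot array X needs one byte per (line, position, character) cell, and y needs the same again.

# Back-of-the-envelope memory estimate for vectorize(); numpy bool uses 1 byte/element.
num_lines = 47590536   # corpus size printed in the log above
max_len = 40           # MAX_INPUT_LEN (assumed default)
num_chars = 100        # NUMBER_OF_CHARS (assumed default)

bytes_needed = num_lines * max_len * num_chars
print(bytes_needed / 2**30, "GiB")   # ~177 GiB for X alone, plus the same again for y

So 128 GB of RAM plus swap cannot hold the fully vectorized dataset; the generator-based path in the newer script (train_speller() with fit_generator) vectorizes one batch at a time and avoids this.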

optimal params for training

I am trying to build a model on the data from your code, but I have limited memory on my GPU (8 GB on a GTX 1080).

What are the best options for this device?

For now I have limited the dataset from 15,437,674 to 5,000,000 lines.
My ~/.theanorc:

[global]
floatX = float32
device = gpu0

[nvcc]
fastmath = True

[lib]
cnmem = .9

But what about the program options? Currently I have these:

NUMBER_OF_ITERATIONS = 10000 # 20000
EPOCHS_PER_ITERATION = 1 # 5
RNN = recurrent.LSTM
INPUT_LAYERS = 2
OUTPUT_LAYERS = 2
AMOUNT_OF_DROPOUT = 0.3
BATCH_SIZE = 500
HIDDEN_SIZE = 700
INITIALIZATION = "he_normal" # : Gaussian initialization scaled by fan_in (He et al., 2014)
MAX_INPUT_LEN = 40
MIN_INPUT_LEN = 3
INVERTED = True
AMOUNT_OF_NOISE = 0.2 / MAX_INPUT_LEN
NUMBER_OF_CHARS = 100 # 75
CHARS = list("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ .")

What can you advise?

Thank you for any help.
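
For what it's worth, a hedged starting point for an 8 GB card (assumptions to tune from, not values recommended by the author): the 700-unit LSTM activations at batch size 500 dominate GPU memory, so shrinking HIDDEN_SIZE and BATCH_SIZE first usually buys the most headroom, and the generator-based training path keeps the full 15M-line dataset usable without loading it all at once.

# Assumed, more conservative settings for an 8 GB GPU - tune upward from here.
BATCH_SIZE = 128          # activation memory scales roughly linearly with batch size
HIDDEN_SIZE = 512         # 512-unit LSTMs fit more comfortably than 700
MAX_INPUT_LEN = 40        # unchanged
INPUT_LAYERS = 2          # unchanged
OUTPUT_LAYERS = 2         # unchanged
AMOUNT_OF_DROPOUT = 0.3   # unchanged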

getting wrong output

Hi, I have made some changes in the code for my dataset, but I'm getting output like this:

crbinerrnnnerrnnrr
spectronnnnnnnnnnnnn
pvcleipsiczessiiggss
syphoniccccccccccccc
phillipsiiliissiiiii
spllshggngggggggggg
embrittlementlementl
tensionnnnnnnnnnnnnn
victulictliccricc
unitronichhnncchhccc
hrttttttttttttttttt
spllshggngggggggggg
solinonennoocnnnnenn
siloxeneneeeeeeeeeee
serrtionnnnnnnnnnnn
prozesssszsssssssssz
plstisollggllllnnll
peligroooooooooooooo
plletizinggnginggng
otimllyyllyyyllyyy
welettessssessessess
flinkoink
lutronichonnechhnnec
kohleroolleeroeioero
kevlronirrrirrrrrrr
illumtechhhcchhhhhh
dynbrdedeyedeeeeee
drivbilityyiityytit
dezincifictiontion
ruggedizeddizddizedd
stnnnenneneeneeeee
bommniblemenibemeni
bommniblemenibemeni
bhrctersserrsserss
brdyctersssssssssss
rmidizeddidddddddd
siloxeneneeeeeeeeeee
furnollloolooooooo
dimineeicteonten
krbinerrnnnerrnerr
zirconi
blngnteesngteenggne
ccumultorrrorrrroo
plteisollgesollggoo
pipeeiizinggioinggin
universlllslllllll
brsiversseessserss
wheelleseeeeeeeeeeee
kitlrnerrrinerrnerr
solinonennoocnnnnenn

Can you please help me overcome this error?

How does the prediction work?

Hello Major,
Thanks for writing the great tutorial. For model training I am using single words, and I am unable to understand how the prediction comes out of the "print_random_prediction" function.
Also, after 200 iterations I am getting output like this:
Q eahc
A each
☒ eachhh.......................................!!!!!!!!!!!!!!!

Q Henxry
A Henry
☒ Lenyyyyyyyyyyyyy????????????????????????????????????????????

Kindly explain, as I am unable to understand it.
Thanks in advance.
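
A hedged reading of what the prediction function does: it picks random validation rows, runs the model on them, and decodes the per-timestep argmax back into characters through the CharacterTable. Because the decoder always emits MAX_INPUT_LEN characters, a short answer such as "each" leaves tail positions the network still has to fill, which is probably why early in training you see guesses like "eachhh...!!!". A self-contained toy illustration of the decoding step:

# Toy illustration of the decode inside print_random_predictions(): the model
# outputs one probability vector per character position, and the printed guess
# is the argmax at each position mapped back through the character table.
import numpy as np

chars = sorted(set("abce h"))                       # toy alphabet
indices_char = {i: c for i, c in enumerate(chars)}

probs = np.random.rand(8, len(chars))               # stand-in for model.predict(rowX)[0]
guess = "".join(indices_char[i] for i in probs.argmax(axis=-1))
print(guess)                                        # 8 characters, one per timestep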

use for two languages!

hi.
I want to train this model for Persian, but sometimes there are English words in my sentences, and there are not enough of them for the model to learn them. How should I deal with this problem?

train_speller()

If I run train_speller(), I get the below error.

generator_output = next(output_generator)

StopIteration

Do I have to run train_speller(os.path.join(DATA_FILES_FULL_PATH, "keras_spell_e15.h5"))
before train_speller()?

I get the below error if I run train_speller(os.path.join(DATA_FILES_FULL_PATH, "keras_spell_e15.h5"))

OSError: Unable to open file (Unable to open file: name = '/users/harish/desktop/testing/deepspell/downloads/data/keras_spell_e15.h5', errno = 2, error message = 'no such file or directory', flags = 0, o_flags = 0)
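
A hedged explanation, based on the script quoted later on this page: the StopIteration means the training generator produced nothing, and the generator reads the news.2013.en.train / news.2013.en.validate files, which only exist after the preprocessing steps have been run. keras_spell_e15.h5, on the other hand, is only written by the epoch-end callback once training has reached epoch 15, so it is not a prerequisite. A sketch of the expected call order, assuming the helpers from keras_spell.py are in scope:

# Run the preprocessing pipeline before train_speller(); the generator needs
# the train/validate split files on disk.
download_the_news_data()
uncompress_data()
preprocesses_data_clean()
preprocesses_data_analyze_chars()   # writes char_frequency.json
preprocesses_data_filter()
preprocesses_split_lines2()         # or preprocesses_split_lines()
preprocess_partition_data()         # writes the *.train and *.validate files
train_speller()                     # starts from a fresh model when no file is given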

Trailing Periods

I've noticed that the model's predictions all have trailing periods. For example:

Q Possibly even for good reasons.
A Possibly even for good reasons.
☒ Possibly even for good rensuss..............................

Is this possibly a bug?
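
Probably not a bug: the answers are padded to the fixed MAX_INPUT_LEN before vectorization, and in the version that produced this output the padding character appears to be '.', so the model learns to emit it out to the end of every sequence (the newer script quoted below switches to a dedicated '☕' padding symbol, which avoids colliding with real sentence-final periods). If the padding really is '.', the trailing run can simply be stripped before the prediction is used:

# Strip trailing padding; assumption: '.' only appears as end padding here, which
# also removes a genuine sentence-final period - a dedicated padding character
# sidesteps that ambiguity.
prediction = "Possibly even for good rensuss.............................."
print(prediction.rstrip("."))   # -> "Possibly even for good rensuss"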

Running on Tensorflow Backend

Hi, I tried to run it on the TensorFlow backend, limiting the number of lines read to 10k, but the process failed with the error
"ValueError: Cannot take the length of Shape with unknown rank."

root@8ff7b76ac7d2:~# python sharedfolder/keras/DeepSpell/keras_spell.py
Using TensorFlow backend.
reading news
read news
NORMALIZE_WHITESPACE_REGEX 0.0619759559631
RE_DASH_FILTER 0.0761659145355
RE_APOSTROPHE_FILTER 0.100703954697
RE_LEFT_PARENTH_FILTER
RE_RIGHT_PARENTH_FILTER
RE_BASIC_CLEANER
cleaned text

 !"$%&'()*,-./0123456789:;?@ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz£½ÂÃàáâèéíñóôüˈλο€
Read 10000 lines of input corpus
Left with 9984 lines of input corpus
Generating Data
suffle Done
Vectorization...
X = np_zeros
for i, sentence in enumerate(questions):
y = np_zeros
for i, sentence in enumerate(answers):
(33229, 40, 96)
(33229, 40, 96)
y_maxlen, chars 40  £$(,048@ÃDHLPTXdhlpótx'€/37;?CGKOSW_àcgèkosôwü"&*.26:½BFJNRVZábféjínñrvz!%)-159AÂEIMQUYaâeimquy
Build model...
Traceback (most recent call last):
  File "sharedfolder/keras/DeepSpell/keras_spell.py", line 302, in <module>
    main_news()
  File "sharedfolder/keras/DeepSpell/keras_spell.py", line 298, in main_news
    model = generate_model(y_maxlen, chars)
  File "sharedfolder/keras/DeepSpell/keras_spell.py", line 132, in generate_model
    return_sequences=layer_number + 1 < INPUT_LAYERS))
  File "/usr/local/lib/python2.7/dist-packages/Keras-1.2.0-py2.7.egg/keras/models.py", line 298, in add
    layer.create_input_layer(batch_input_shape, input_dtype)
  File "/usr/local/lib/python2.7/dist-packages/Keras-1.2.0-py2.7.egg/keras/engine/topology.py", line 398, in create_input_layer
    self(x)
  File "/usr/local/lib/python2.7/dist-packages/Keras-1.2.0-py2.7.egg/keras/engine/topology.py", line 569, in __call__
    self.add_inbound_node(inbound_layers, node_indices, tensor_indices)
  File "/usr/local/lib/python2.7/dist-packages/Keras-1.2.0-py2.7.egg/keras/engine/topology.py", line 632, in add_inbound_node
    Node.create_node(self, inbound_layers, node_indices, tensor_indices)
  File "/usr/local/lib/python2.7/dist-packages/Keras-1.2.0-py2.7.egg/keras/engine/topology.py", line 164, in create_node
    output_tensors = to_list(outbound_layer.call(input_tensors[0], mask=input_masks[0]))
  File "/usr/local/lib/python2.7/dist-packages/Keras-1.2.0-py2.7.egg/keras/layers/recurrent.py", line 227, in call
    input_length=input_shape[1])
  File "/usr/local/lib/python2.7/dist-packages/Keras-1.2.0-py2.7.egg/keras/backend/tensorflow_backend.py", line 1836, in rnn
    axes = [1, 0] + list(range(2, len(outputs.get_shape())))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/tensor_shape.py", line 462, in __len__
    raise ValueError("Cannot take the length of Shape with unknown rank.")
ValueError: Cannot take the length of Shape with unknown rank.

The problem may be here:

for layer_number in range(INPUT_LAYERS):
    model.add(recurrent.LSTM(HIDDEN_SIZE, input_shape=(None, len(chars)), init=INITIALIZATION,
                            return_sequences=layer_number + 1 < INPUT_LAYERS))

Any clue?
Thank you!
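
One thing worth trying (an assumption, not a confirmed fix for this Keras 1.2 / TensorFlow combination): give the first LSTM a fixed number of timesteps instead of None, since the error comes from the TensorFlow backend failing to infer the sequence length.

# Hedged workaround: use the known padded length (MAX_INPUT_LEN) as the time
# dimension instead of None; the other arguments are unchanged from the snippet above.
for layer_number in range(INPUT_LAYERS):
    model.add(recurrent.LSTM(HIDDEN_SIZE,
                             input_shape=(MAX_INPUT_LEN, len(chars)),
                             init=INITIALIZATION,
                             return_sequences=layer_number + 1 < INPUT_LAYERS))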

what went wrong?

I have a loss of 0.05 at iteration 165:

--------------------------------------------------
Iteration 165
Train on 1580553 samples, validate on 175618 samples
Epoch 1/1
1580553/1580553 [==============================] - 2413s - loss: 0.0580 - acc: 0.9845 - val_loss: 0.2363 - val_acc: 0.9613
Q rubber bullet fired by police...........
A rubber bullet fired by police...........
☑ rubber bullet fired by police...........
---
Q other members of Congress campaigning...
A other members of Congress campaigning...
☑ other members of Congress campaigning...
---
Q The Lakers were looking to break a......
A The Lakers were looking to break a......
☑ The Lakers were looking to break a......
---
Q Francisco/New York Times as a quaint....
A Francisco/New York Times as a quaint....
☑ Francisco/New York Times as a quaint....
---
Q final night of the season...............
A final night of the season...............
☑ final night of the season...............
---
Q already half-completed 2011 fiscal year.
A already half-completed 2011 fiscal year.
☑ already half-completed 2011 fiscal year.
---
Q governments Run-After-The-Smugglers.....
A government's Run-After-The-Smugglers....
☒ governments Run-After-The-Smugglers.....
---
Q demonstrations in Cairo and several.....
A demonstrations in Cairo and several.....
☑ demonstrations in Cairo and several.....
---
Q military action, even in limited........
A military action, even in limited........
☑ military action, even in limited........
---
Q healthcare."............................
A healthcare."............................
☑ healthcare."............................

but after iteration 165 the loss jumps to 2.56.

What happened? (A hedged guess is sketched after the logs.)

--------------------------------------------------
Iteration 166
Train on 1580553 samples, validate on 175618 samples
Epoch 1/1
1580553/1580553 [==============================] - 2412s - loss: 2.5680 - acc: 0.3418 - val_loss: 2.5517 - val_acc: 0.3255
Q imported commodities, can no longer be..
A imported commodities, can no longer be..
☒ in                              ........
---
Q hesitate to take ction" should it.......
A hesitate to take action" should it......
☒ hee                        .............
---
Q When the two agets grabbed him, he......
A When the two agents grabbed him, he.....
☒ The                             ........
---
Q The omst difficult thing about  food....
A The most difficult thing about a food...
☒ The  e                         .........
---
Q sincerity...............................
A sincerity...............................
☒ so  ....................................
---
Q trot onut his old New England accent....
A trot out his old New England accent.....
☒ to                         .............
---
Q publishers and rehtailers over the sale.
A publishers and retailers over the sale..
☒ pus  e                           .......
---
Q invited down to see them, but I'm.......
A invited down to see them, but I'm.......
☒ in e                        ............
---
Q during a typically slow season for......
A during a typically slow season for......
☒ doee                       .............
---
Q the national and omIal values of the....
A the national and moral values of the....
☒ the                            .........

and the loss is still 2.85 at iteration 181:

--------------------------------------------------
Iteration 181
Train on 1580553 samples, validate on 175618 samples
Epoch 1/1
1580553/1580553 [==============================] - 2412s - loss: 2.8511 - acc: 0.2510 - val_loss: 2.8094 - val_acc: 0.2583
Q Medicare scam...........................
A Medicare scam...........................
☒ tee                       ..............
---
Q pleaded guilty to one counZ each of.....
A pleaded guilty to one count each of.....
☒ to            ..........................
---
Q they cannot repay.......................
A they cannot repay.......................
☒ te                       ...............
---
Q the governmnet-run Russian State Circus.
A the government-run Russian State Circus.
☒ tee                        .............
---
Q Israeli officials now talk of a.........
A Israeli officials now talk of a.........
☒ te                      ................
---
Q "My biggest concern is whether there....
A "My biggest concern is whether there....
☒ Tee                        .............
---
Q Crapo and Coburn - all fiscal...........
A Crapo and Coburn - all fiscal...........
☒ aee                    .................
---
Q statemnt in Arabic, which she...........
A statement in Arabic, which she..........
☒ ae                                      
---
Q consiering tighter controls on animal...
A considering tighter controls on animal..
☒ tee                       ..............
---
Q FAA'so peating authority through........
A FAA's operating authority through.......
☒ tee                         ............
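
A hedged guess at the cause: a sudden jump from a 0.05 loss to ~2.5 midway through training, followed by degenerate outputs, often points to exploding gradients or an optimizer step that blew the weights up, rather than anything in the data. Two things that are commonly tried (assumptions to experiment with, not a confirmed diagnosis): resume from the last good saved model and recompile with gradient clipping on the optimizer.

# Recompile with a clipped Adam before resuming from the last good checkpoint;
# the clipnorm value is an assumption to tune, and the path assumes a save
# from iteration 165 exists.
from keras.models import load_model
from keras.optimizers import Adam

model = load_model("keras_spell_e165.h5")   # assumed path of the last good save
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(clipnorm=1.0),
              metrics=['accuracy'])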

Memory Issue

Hi,
Thank you for sharing your code publicly, but I'm having some memory issues when running it on AWS.

I'm spinning up a g2.2xlarge instance on AWS and trying to run your code on only the first 1000 lines of news.2011.en.shuffled.

Have you ever gotten an error message like this one (see below)? If so, is there a way to change the parameters to avoid it, or should I select another type of AWS instance? (A rough sizing note follows the traceback.)

Just for completeness, these are the parameters I was trying to test:

NUMBER_OF_ITERATIONS = 20000
EPOCHS_PER_ITERATION = 5
RNN = recurrent.LSTM
INPUT_LAYERS = 2
OUTPUT_LAYERS = 2
AMOUNT_OF_DROPOUT = 0.3
BATCH_SIZE = 500
HIDDEN_SIZE = 700
INITIALIZATION = "he_normal" # : Gaussian initialization scaled by fan_in (He et al., 2014)
MAX_INPUT_LEN = 40
MIN_INPUT_LEN = 3
INVERTED = True
AMOUNT_OF_NOISE = 0.2 / MAX_INPUT_LEN
NUMBER_OF_CHARS = 100 # 75

And this is the error that I'm getting:

Iteration 1
Train on 3376 samples, validate on 376 samples
Epoch 1/5
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/theano/compile/function_module.py", line 884, in __call__
    self.fn() if output_subset is None else\
RuntimeError: Cuda error: GpuElemwise node_m71c627ae87c918771aac75471af66509_0 Add: out of memory.
    n_blocks=30 threads_per_block=256
   Call: kernel_Add_node_m71c627ae87c918771aac75471af66509_0_Ccontiguous<<<n_blocks, threads_per_block>>>(numEls, local_dims[0], local_dims[1], i0_data, local_str[0][0], local_str[0][1], i1_data, local_str[1][0], local_str[1][1], o0_data, local_ostr[0][0], local_ostr[0][1])


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 10, in main_news
  File "<stdin>", line 8, in iterate_training
  File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/keras/models.py", line 672, in fit
    initial_epoch=initial_epoch)
  File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/keras/engine/training.py", line 1196, in fit
    initial_epoch=initial_epoch)
  File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/keras/engine/training.py", line 891, in _fit_loop
    outs = f(ins_batch)
  File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/keras/backend/theano_backend.py", line 959, in __call__
    return self.function(*inputs)
  File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/theano/compile/function_module.py", line 898, in __call__
    storage_map=getattr(self.fn, 'storage_map', None))
  File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/theano/gof/link.py", line 325, in raise_with_op
    reraise(exc_type, exc_value, exc_trace)
  File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/six.py", line 685, in reraise
    raise value.with_traceback(tb)
  File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/theano/compile/function_module.py", line 884, in __call__
    self.fn() if output_subset is None else\
RuntimeError: Cuda error: GpuElemwise node_m71c627ae87c918771aac75471af66509_0 Add: out of memory.
    n_blocks=30 threads_per_block=256
   Call: kernel_Add_node_m71c627ae87c918771aac75471af66509_0_Ccontiguous<<<n_blocks, threads_per_block>>>(numEls, local_dims[0], local_dims[1], i0_data, local_str[0][0], local_str[0][1], i1_data, local_str[1][0], local_str[1][1], o0_data, local_ostr[0][0], local_ostr[0][1])

Apply node that caused the error: GpuElemwise{add,no_inplace}(GpuDot22.0, GpuDimShuffle{x,0}.0)
Toposort index: 207
Inputs types: [CudaNdarrayType(float32, matrix), CudaNdarrayType(float32, row)]
Inputs shapes: [(20000, 700), (1, 700)]
Inputs strides: [(700, 1), (0, 1)]
Inputs values: ['not shown', 'not shown']
Outputs clients: [[GpuReshape{3}(GpuElemwise{add,no_inplace}.0, MakeVector{dtype='int64'}.0)]]

HINT: Re-running with most Theano optimization disabled could give you a back-trace of when this node was created. This can be done with by setting the Theano flag 'optimizer=fast_compile'. If that does not work, Theano optimizations can be disabled with 'optimizer=None'.
HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.
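
A hedged sizing note: the node that fails has an output of shape (20000, 700), which is BATCH_SIZE (500) times MAX_INPUT_LEN (40) rows by HIDDEN_SIZE (700) columns, and many activations of that size are alive at once; the g2.2xlarge GPU only has roughly 4 GB of memory, so the model as configured does not fit regardless of how few input lines are read. Shrinking the two knobs that drive that tensor is the usual first step (assumed starting values, not tuned recommendations):

# Reduce the activation footprint on a ~4 GB GPU before anything else.
BATCH_SIZE = 64      # 500 -> 64: activation memory scales roughly linearly with batch size
HIDDEN_SIZE = 256    # 700 -> 256: weights and activations both scale with the hidden size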

Not reaching quoted accuracy

Hello!

I have been playing around with this model for a few days now and I am unable to reach the accuracy you quoted in the original blog post [1]. I used the 2013 dataset with the preprocessing step preprocesses_split_lines2().

My issue is that after ~24 hours on the same AWS instance (g2.2xlarge) I'm not seeing accuracy levels close to what you quoted (e.g. 90% after 12 hours). I was wondering whether you did some different preprocessing, or whether, after you switched to batch learning, you didn't update what to expect. Any comments will likely save others some time in the future.

Below I'm also attaching accuracy and loss figures.

[accuracy figure]

[loss figure]

[1] https://medium.com/@majortal/deep-spelling-9ffef96a24f6

time out error

Here is my code:

# encoding: utf-8
'''
Created on Nov 26, 2015

@author: tal

Based in part on:
Learn math - https://github.com/fchollet/keras/blob/master/examples/addition_rnn.py

See https://medium.com/@majortal/deep-spelling-9ffef96a24f6#.2c9pu8nlm
'''

from __future__ import print_function, division, unicode_literals

import os
import errno
from collections import Counter
from hashlib import sha256
import re
import json
import itertools
import logging
import requests
import numpy as np
from numpy.random import choice as random_choice, randint as random_randint, shuffle as random_shuffle, seed as random_seed, rand
from numpy import zeros as np_zeros # pylint:disable=no-name-in-module


from keras.models import Sequential, load_model
from keras.layers import Activation, TimeDistributed, Dense, RepeatVector, Dropout, recurrent
from keras.callbacks import Callback

# Set a logger for the module
LOGGER = logging.getLogger(__name__) # Every log will use the module name
LOGGER.addHandler(logging.StreamHandler())
LOGGER.setLevel(logging.DEBUG)

random_seed(123) # Reproducibility

class Configuration(object):
    """Dump stuff here"""

CONFIG = Configuration()
#pylint:disable=attribute-defined-outside-init
# Parameters for the model:
CONFIG.input_layers = 2
CONFIG.output_layers = 2
CONFIG.amount_of_dropout = 0.2
CONFIG.hidden_size = 500
CONFIG.initialization = "he_normal" # : Gaussian initialization scaled by fan-in (He et al., 2014)
CONFIG.number_of_chars = 100
CONFIG.max_input_len = 60
CONFIG.inverted = True

# parameters for the training:
CONFIG.batch_size = 100 # As the model changes in size, play with the batch size to best fit the process in memory
CONFIG.epochs = 500 # due to mini-epochs.
CONFIG.steps_per_epoch = 1000 # This is a mini-epoch. Using News 2013 an epoch would need to be ~60K.
CONFIG.validation_steps = 10
CONFIG.number_of_iterations = 10
#pylint:enable=attribute-defined-outside-init

DIGEST = sha256(json.dumps(CONFIG.__dict__,sort_keys=True).encode('utf8')).hexdigest()


# Parameters for the dataset
MIN_INPUT_LEN = 5
AMOUNT_OF_NOISE = 0.2 / CONFIG.max_input_len
CHARS = list("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ .")
PADDING = "☕"

DATA_FILES_PATH = "/content/sample_data"
DATA_FILES_FULL_PATH = os.path.expanduser(DATA_FILES_PATH)
DATA_FILES_URL = "http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz"
NEWS_FILE_NAME_COMPRESSED = os.path.join(DATA_FILES_FULL_PATH, "news.2013.en.shuffled.gz") # 1.1 GB
NEWS_FILE_NAME_ENGLISH = "news.2013.en.shuffled"
NEWS_FILE_NAME = os.path.join(DATA_FILES_FULL_PATH, NEWS_FILE_NAME_ENGLISH)
NEWS_FILE_NAME_CLEAN = os.path.join(DATA_FILES_FULL_PATH, "news.2013.en.clean")
NEWS_FILE_NAME_FILTERED = os.path.join(DATA_FILES_FULL_PATH, "news.2013.en.filtered")
NEWS_FILE_NAME_SPLIT = os.path.join(DATA_FILES_FULL_PATH, "news.2013.en.split")
NEWS_FILE_NAME_TRAIN = os.path.join(DATA_FILES_FULL_PATH, "news.2013.en.train")
NEWS_FILE_NAME_VALIDATE = os.path.join(DATA_FILES_FULL_PATH, "news.2013.en.validate")
CHAR_FREQUENCY_FILE_NAME = os.path.join(DATA_FILES_FULL_PATH, "char_frequency.json")
SAVED_MODEL_FILE_NAME = os.path.join(DATA_FILES_FULL_PATH, "keras_spell_e{}.h5") # an HDF5 file

# Some cleanup:
NORMALIZE_WHITESPACE_REGEX = re.compile(r'[^\S\n]+', re.UNICODE) # match all whitespace except newlines
RE_DASH_FILTER = re.compile(r'[\-\˗\֊\‐\‑\‒\–\—\⁻\₋\−\﹣\-]', re.UNICODE)
RE_APOSTROPHE_FILTER = re.compile(r'&#39;|[ʼ՚'‘’‛❛❜ߴߵ`‵´ˊˋ{}{}{}{}{}{}{}{}{}]'.format(chr(768), chr(769), chr(832),
                                                                                      chr(833), chr(2387), chr(5151),
                                                                                      chr(5152), chr(65344), chr(8242)),
                                  re.UNICODE)
RE_LEFT_PARENTH_FILTER = re.compile(r'[\(\[\{\⁽\₍\❨\❪\﹙\(]', re.UNICODE)
RE_RIGHT_PARENTH_FILTER = re.compile(r'[\)\]\}\⁾\₎\❩\❫\﹚\)]', re.UNICODE)
ALLOWED_CURRENCIES = """¥£₪$€฿₨"""
ALLOWED_PUNCTUATION = """-!?/;"'%&<>.()[]{}@#:,|=*"""
RE_BASIC_CLEANER = re.compile(r'[^\w\s{}{}]'.format(re.escape(ALLOWED_CURRENCIES), re.escape(ALLOWED_PUNCTUATION)), re.UNICODE)

# pylint:disable=invalid-name

def download_the_news_data():
    """Download the news data"""
    LOGGER.info("Downloading")
   
#if os.path.isfile(DATA_FILES_FULL_PATH +"/news.2013.en.shuffled.gz") == False:          
    try:
        os.makedirs(os.path.dirname(NEWS_FILE_NAME_COMPRESSED))
    except OSError as exception:
        if exception.errno != errno.EEXIST:
            raise
    with open(NEWS_FILE_NAME_COMPRESSED, "wb") as output_file:
        response = requests.get(DATA_FILES_URL, stream=True)
        total_length = response.headers.get('content-length')
        downloaded = percentage = 0
        print("»"*100)
        total_length = int(total_length)
        for data in response.iter_content(chunk_size=4096):
            downloaded += len(data)
            output_file.write(data)
            new_percentage = 100 * downloaded // total_length
            if new_percentage > percentage:
                print("☑", end="")
                percentage = new_percentage
    print()


def uncompress_data():
    """Uncompress the data files"""
    import gzip
    with gzip.open(NEWS_FILE_NAME_COMPRESSED, 'rb') as compressed_file:
        with open(NEWS_FILE_NAME_COMPRESSED[:-3], 'wb') as outfile:
            outfile.write(compressed_file.read())

def add_noise_to_string(a_string, amount_of_noise):
    """Add some artificial spelling mistakes to the string"""
    if rand() < amount_of_noise * len(a_string):
        # Replace a character with a random character
        random_char_position = random_randint(len(a_string))
        a_string = a_string[:random_char_position] + random_choice(CHARS[:-1]) + a_string[random_char_position + 1:]
    if rand() < amount_of_noise * len(a_string):
        # Delete a character
        random_char_position = random_randint(len(a_string))
        a_string = a_string[:random_char_position] + a_string[random_char_position + 1:]
    if len(a_string) < CONFIG.max_input_len and rand() < amount_of_noise * len(a_string):
        # Add a random character
        random_char_position = random_randint(len(a_string))
        a_string = a_string[:random_char_position] + random_choice(CHARS[:-1]) + a_string[random_char_position:]
    if rand() < amount_of_noise * len(a_string):
        # Transpose 2 characters
        random_char_position = random_randint(len(a_string) - 1)
        a_string = (a_string[:random_char_position] + a_string[random_char_position + 1] + a_string[random_char_position] +
                    a_string[random_char_position + 2:])
    return a_string

def _vectorize(questions, answers, ctable):
    """Vectorize the data as numpy arrays"""
    len_of_questions = len(questions)
    X = np_zeros((len_of_questions, CONFIG.max_input_len, ctable.size), dtype=np.bool)
    for i in xrange(len(questions)):
        sentence = questions.pop()
        for j, c in enumerate(sentence):
            try:
                X[i, j, ctable.char_indices[c]] = 1
            except KeyError:
                pass # Padding
    y = np_zeros((len_of_questions, CONFIG.max_input_len, ctable.size), dtype=np.bool)
    for i in xrange(len(answers)):
        sentence = answers.pop()
        for j, c in enumerate(sentence):
            try:
                y[i, j, ctable.char_indices[c]] = 1
            except KeyError:
                pass # Padding
    return X, y

def slice_X(X, start=None, stop=None):
    """This takes an array-like, or a list of
    array-likes, and outputs:
        - X[start:stop] if X is an array-like
        - [x[start:stop] for x in X] if X in a list
    Can also work on list/array of indices: `slice_X(x, indices)`
    # Arguments
        start: can be an integer index (start index)
            or a list/array of indices
        stop: integer (stop index); should be None if
            `start` was a list.
    """
    if isinstance(X, list):
        if hasattr(start, '__len__'):
            # hdf5 datasets only support list objects as indices
            if hasattr(start, 'shape'):
                start = start.tolist()
            return [x[start] for x in X]
        else:
            return [x[start:stop] for x in X]
    else:
        if hasattr(start, '__len__'):
            if hasattr(start, 'shape'):
                start = start.tolist()
            return X[start]
        else:
            return X[start:stop]

def vectorize(questions, answers, chars=None):
    """Vectorize the questions and expected answers"""
    print('Vectorization...')
    chars = chars or CHARS
    ctable = CharacterTable(chars)
    X, y = _vectorize(questions, answers, ctable)
    # Explicitly set apart 10% for validation data that we never train over
    split_at = int(len(X) - len(X) / 10)
    (X_train, X_val) = (slice_X(X, 0, split_at), slice_X(X, split_at))
    (y_train, y_val) = (y[:split_at], y[split_at:])

    print(X_train.shape)
    print(y_train.shape)

    return X_train, X_val, y_train, y_val, CONFIG.max_input_len, ctable


def generate_model(output_len, chars=None):
    """Generate the model"""
    print('Build model...')
    chars = chars or CHARS
    model = Sequential()
    # "Encode" the input sequence using an RNN, producing an output of hidden_size
    # note: in a situation where your input sequences have a variable length,
    # use input_shape=(None, nb_feature).
    for layer_number in range(CONFIG.input_layers):
        model.add(recurrent.LSTM(CONFIG.hidden_size, input_shape=(None, len(chars)), kernel_initializer=CONFIG.initialization,
                                 return_sequences=layer_number + 1 < CONFIG.input_layers))
        model.add(Dropout(CONFIG.amount_of_dropout))
    # For the decoder's input, we repeat the encoded input for each time step
    model.add(RepeatVector(output_len))
    # The decoder RNN could be multiple layers stacked or a single layer
    for _ in range(CONFIG.output_layers):
        model.add(recurrent.LSTM(CONFIG.hidden_size, return_sequences=True, kernel_initializer=CONFIG.initialization))
        model.add(Dropout(CONFIG.amount_of_dropout))

    # For each of step of the output sequence, decide which character should be chosen
    model.add(TimeDistributed(Dense(len(chars), kernel_initializer=CONFIG.initialization)))
    model.add(Activation('softmax'))

    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model


class Colors(object):
    """For nicer printouts"""
    green = '\033[92m'
    red = '\033[91m'
    close = '\033[0m'


class CharacterTable(object):
    """
    Given a set of characters:
    + Encode them to a one hot integer representation
    + Decode the one hot integer representation to their character output
    + Decode a vector of probabilities to their character output
    """
    def __init__(self, chars):
        self.chars = sorted(set(chars))
        self.char_indices = dict((c, i) for i, c in enumerate(self.chars))
        self.indices_char = dict((i, c) for i, c in enumerate(self.chars))

    @property
    def size(self):
        """The number of chars"""
        return len(self.chars)

    def encode(self, C, maxlen):
        """Encode as one-hot"""
        X = np_zeros((maxlen, len(self.chars)), dtype=np.bool) # pylint:disable=no-member
        for i, c in enumerate(C):
            X[i, self.char_indices[c]] = 1
        return X

    def decode(self, X, calc_argmax=True):
        """Decode from one-hot"""
        if calc_argmax:
            X = X.argmax(axis=-1)
        return ''.join(self.indices_char[x] for x in X if x)

def generator(file_name):
    """Returns a tuple (inputs, targets)
    All arrays should contain the same number of samples.
    The generator is expected to loop over its data indefinitely.
    An epoch finishes when  samples_per_epoch samples have been seen by the model.
    """
    ctable = CharacterTable(read_top_chars())
    batch_of_answers = []
    while True:
        with open(file_name) as answers:
            for answer in answers:
                batch_of_answers.append(answer.strip().decode('utf-8'))
                if len(batch_of_answers) == CONFIG.batch_size:
                    random_shuffle(batch_of_answers)
                    batch_of_questions = []
                    for answer_index, answer in enumerate(batch_of_answers):
                        question, answer = generate_question(answer)
                        batch_of_answers[answer_index] = answer
                        assert len(answer) == CONFIG.max_input_len
                        question = question[::-1] if CONFIG.inverted else question
                        batch_of_questions.append(question)
                    X, y = _vectorize(batch_of_questions, batch_of_answers, ctable)
                    yield X, y
                    batch_of_answers = []

def print_random_predictions(model, ctable, X_val, y_val):
    """Select 10 samples from the validation set at random so we can visualize errors"""
    print()
    for _ in range(10):
        ind = random_randint(0, len(X_val))
        rowX, rowy = X_val[np.array([ind])], y_val[np.array([ind])] # pylint:disable=no-member
        preds = model.predict_classes(rowX, verbose=0)
        q = ctable.decode(rowX[0])
        correct = ctable.decode(rowy[0])
        guess = ctable.decode(preds[0], calc_argmax=False)
        if CONFIG.inverted:
            print('Q', q[::-1]) # inverted back!
        else:
            print('Q', q)
        print('A', correct)
        print(Colors.green + '☑' + Colors.close if correct == guess else Colors.red + '☒' + Colors.close, guess)
        print('---')
    print()


class OnEpochEndCallback(Callback):
    """Execute this every end of epoch"""

    def on_epoch_end(self, epoch, logs=None):
        """On Epoch end - do some stats"""
        ctable = CharacterTable(read_top_chars())
        X_val, y_val = next(generator(NEWS_FILE_NAME_VALIDATE))
        print_random_predictions(self.model, ctable, X_val, y_val)
        self.model.save(SAVED_MODEL_FILE_NAME.format(epoch))

ON_EPOCH_END_CALLBACK = OnEpochEndCallback()

def itarative_train(model):
    """
    Iterative training of the model
     - To allow for finite RAM...
     - To allow infinite training data as the training noise is injected in runtime
    """
    model.fit_generator(generator(NEWS_FILE_NAME_TRAIN), steps_per_epoch=CONFIG.steps_per_epoch,
                        epochs=CONFIG.epochs,
                        verbose=1, callbacks=[ON_EPOCH_END_CALLBACK, ], validation_data=generator(NEWS_FILE_NAME_VALIDATE),
                        validation_steps=CONFIG.validation_steps,
                        class_weight=None, max_q_size=10, workers=1,
                        pickle_safe=False, initial_epoch=0)


def iterate_training(model, X_train, y_train, X_val, y_val, ctable):
    """Iterative Training"""
    # Train the model each generation and show predictions against the validation dataset
    for iteration in range(1, CONFIG.number_of_iterations):
        print()
        print('-' * 50)
        print('Iteration', iteration)
        model.fit(X_train, y_train, batch_size=CONFIG.batch_size, epochs=CONFIG.epochs,
                  validation_data=(X_val, y_val))
        print_random_predictions(model, ctable, X_val, y_val)

def clean_text(text):
    """Clean the text - remove unwanted chars, fold punctuation etc."""
    result = NORMALIZE_WHITESPACE_REGEX.sub(' ', text.strip())
    result = RE_DASH_FILTER.sub('-', result)
    result = RE_APOSTROPHE_FILTER.sub("'", result)
    result = RE_LEFT_PARENTH_FILTER.sub("(", result)
    result = RE_RIGHT_PARENTH_FILTER.sub(")", result)
    result = RE_BASIC_CLEANER.sub('', result)
    return result

def preprocesses_data_clean():
    """Pre-process the data - step 1 - cleanup"""
    with open(NEWS_FILE_NAME_CLEAN, "wb") as clean_data:
        for line in open(NEWS_FILE_NAME):
            decoded_line = line.decode("utf-8")
            cleaned_line = clean_text(decoded_line)
            encoded_line = cleaned_line.encode("utf-8")
            clean_data.write(encoded_line + b"\n")

def preprocesses_data_analyze_chars():
    """Pre-process the data - step 2 - analyze the characters"""
    counter = Counter()
    LOGGER.info("Reading data:")
    for line in open(NEWS_FILE_NAME_CLEAN):
        decoded_line = line.decode('utf-8')
        counter.update(decoded_line)
#     data = open(NEWS_FILE_NAME_CLEAN).read().decode('utf-8')
#     LOGGER.info("Read.\nCounting characters:")
#     counter = Counter(data.replace("\n", ""))
    LOGGER.info("Done.\nWriting to file:")
    with open(CHAR_FREQUENCY_FILE_NAME, 'wb') as output_file:
            output_file.write(json.dumps(counter).encode('utf-8'))
    most_popular_chars = {key for key, _value in counter.most_common(CONFIG.number_of_chars)}
    LOGGER.info("The top %s chars are:", CONFIG.number_of_chars)
    LOGGER.info("".join(sorted(most_popular_chars)))

def read_top_chars():
    """Read the top chars we saved to file"""
    chars = json.loads(open(CHAR_FREQUENCY_FILE_NAME).read())
    counter = Counter(chars)
    most_popular_chars = {key for key, _value in counter.most_common(CONFIG.number_of_chars)}
    return most_popular_chars

def preprocesses_data_filter():
    """Pre-process the data - step 3 - filter only sentences with the right chars"""
    most_popular_chars = read_top_chars()
    LOGGER.info("Reading and filtering data:")
    with open(NEWS_FILE_NAME_FILTERED, "wb") as output_file:
        for line in open(NEWS_FILE_NAME_CLEAN):
            decoded_line = line.decode('utf-8')
            if decoded_line and not bool(set(decoded_line) - most_popular_chars):
                output_file.write(line)
    LOGGER.info("Done.")

def read_filtered_data():
    """Read the filtered data corpus"""
    LOGGER.info("Reading filtered data:")
    lines = open(NEWS_FILE_NAME_FILTERED).read().decode('utf-8').split("\n")
    LOGGER.info("Read filtered data - %s lines", len(lines))
    return lines

def preprocesses_split_lines():
    """Preprocess the text by splitting the lines between min-length and max_length
    I don't like this step:
      I think the start-of-sentence is important.
      I think the end-of-sentence is important.
      Sometimes the stripped down sub-sentence is missing crucial context.
      Important NGRAMs are cut (though given enough data, that might be moot).
    I do this to enable batch-learning by padding to a fixed length.
    """
    LOGGER.info("Reading filtered data:")
    answers = set()
    with open(NEWS_FILE_NAME_SPLIT, "wb") as output_file:
        for _line in open(NEWS_FILE_NAME_FILTERED):
            line = _line.decode('utf-8')
            while len(line) > MIN_INPUT_LEN:
                if len(line) <= CONFIG.max_input_len:
                    answer = line
                    line = ""
                else:
                    space_location = line.rfind(" ", MIN_INPUT_LEN, CONFIG.max_input_len - 1)
                    if space_location > -1:
                        answer = line[:space_location]
                        line = line[len(answer) + 1:]
                    else:
                        space_location = line.rfind(" ") # no limits this time
                        if space_location == -1:
                            break # we are done with this line
                        else:
                            line = line[space_location + 1:]
                            continue
                answers.add(answer)
                output_file.write(answer.encode('utf-8') + b"\n")

def preprocesses_split_lines2():
    """Preprocess the text by splitting the lines between min-length and max_length
    Alternative split.
    """
    LOGGER.info("Reading filtered data:")
    answers = set()
    for encoded_line in open(NEWS_FILE_NAME_FILTERED):
        line = encoded_line.decode('utf-8')
        if CONFIG.max_input_len >= len(line) > MIN_INPUT_LEN:
            answers.add(line)
    LOGGER.info("There are %s 'answers' (sub-sentences)", len(answers))
    LOGGER.info("Here are some examples:")
    for answer in itertools.islice(answers, 10):
        LOGGER.info(answer)
    with open(NEWS_FILE_NAME_SPLIT, "wb") as output_file:
        output_file.write("".join(answers).encode('utf-8'))

def preprocesses_split_lines3():
    """Preprocess the text by selecting only max n-grams
    Alternative split.
    """
    LOGGER.info("Reading filtered data:")
    answers = set()
    for encoded_line in open(NEWS_FILE_NAME_FILTERED):
        line = encoded_line.decode('utf-8')
        if line.count(" ") < 5:
            answers.add(line)
    LOGGER.info("There are %s 'answers' (sub-sentences)", len(answers))
    LOGGER.info("Here are some examples:")
    for answer in itertools.islice(answers, 10):
        LOGGER.info(answer)
    with open(NEWS_FILE_NAME_SPLIT, "wb") as output_file:
        output_file.write("".join(answers).encode('utf-8'))

def preprocesses_split_lines4():
    """Preprocess the text by selecting only sentences with most-common words AND not too long
    Alternative split.
    """
    LOGGER.info("Reading filtered data:")
    from gensim.models.word2vec import Word2Vec
    FILTERED_W2V = "fw2v.bin"
    model = Word2Vec.load_word2vec_format(FILTERED_W2V, binary=True) # C text format
    print(len(model.wv.index2word))
#     answers = set()
#     for encoded_line in open(NEWS_FILE_NAME_FILTERED):
#         line = encoded_line.decode('utf-8')
#         if line.count(" ") < 5:
#             answers.add(line)
#     LOGGER.info("There are %s 'answers' (sub-sentences)", len(answers))
#     LOGGER.info("Here are some examples:")
#     for answer in itertools.islice(answers, 10):
#         LOGGER.info(answer)
#     with open(NEWS_FILE_NAME_SPLIT, "wb") as output_file:
#         output_file.write("".join(answers).encode('utf-8'))

def preprocess_partition_data():
    """Set asside data for validation"""
    answers = open(NEWS_FILE_NAME_SPLIT).read().decode('utf-8').split("\n")
    print('shuffle', end=" ")
    random_shuffle(answers)
    print("Done")
    # Explicitly set apart 10% for validation data that we never train over
    split_at = len(answers) - len(answers) // 10
    with open(NEWS_FILE_NAME_TRAIN, "wb") as output_file:
        output_file.write("\n".join(answers[:split_at]).encode('utf-8'))
    with open(NEWS_FILE_NAME_VALIDATE, "wb") as output_file:
        output_file.write("\n".join(answers[split_at:]).encode('utf-8'))


def generate_question(answer):
    """Generate a question by adding noise"""
    question = add_noise_to_string(answer, AMOUNT_OF_NOISE)
    # Add padding:
    question += PADDING * (CONFIG.max_input_len - len(question))
    answer += PADDING * (CONFIG.max_input_len - len(answer))
    return question, answer

def generate_news_data():
    """Generate some news data"""
    print ("Generating Data")
    answers = open(NEWS_FILE_NAME_SPLIT).read().decode('utf-8').split("\n")
    questions = []
    print('shuffle', end=" ")
    random_shuffle(answers)
    print("Done")
    for answer_index, answer in enumerate(answers):
        question, answer = generate_question(answer)
        answers[answer_index] = answer
        assert len(answer) == CONFIG.max_input_len
        if random_randint(100000) == 8: # Show some progress
            print (len(answers))
            print ("answer:   '{}'".format(answer))
            print ("question: '{}'".format(question))
            print ()
        question = question[::-1] if CONFIG.inverted else question
        questions.append(question)

    return questions, answers

def train_speller_w_all_data():
    """Train the speller if all data fits into RAM"""
    questions, answers = generate_news_data()
    chars_answer = set.union(*(set(answer) for answer in answers))
    chars_question = set.union(*(set(question) for question in questions))
    chars = list(set.union(chars_answer, chars_question))
    X_train, X_val, y_train, y_val, y_maxlen, ctable = vectorize(questions, answers, chars)
    print ("y_maxlen, chars", y_maxlen, "".join(chars))
    model = generate_model(y_maxlen, chars)
    iterate_training(model, X_train, y_train, X_val, y_val, ctable)

def train_speller(from_file=None):
    """Train the speller"""
    if from_file:
        model = load_model(from_file)
    else:
        model = generate_model(CONFIG.max_input_len, chars=read_top_chars())
    itarative_train(model)


#--- Choose this step or:
#if __name__ == '__main__':
#download_the_news_data()
#uncompress_data()
#preprocesses_data_clean()
#preprocesses_data_analyze_chars()
#preprocesses_data_filter()
#preprocesses_split_lines() 
#preprocesses_split_lines2()
#preprocesses_split_lines4()
#preprocess_partition_data()
train_speller()

Here is the error, and a link to the Keras code that emits the warning:

Build model...
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:352: UserWarning: Update your `fit_generator` call to the Keras 2 API: `fit_generator(<generator..., steps_per_epoch=1000, epochs=500, verbose=1, callbacks=[<__main__..., validation_data=<generator..., validation_steps=10, class_weight=None, workers=1, initial_epoch=0, use_multiprocessing=False, max_queue_size=10)`
Epoch 1/500
/usr/local/lib/python3.6/dist-packages/keras/utils/data_utils.py:709: UserWarning: An input could not be retrieved. It could be because a worker has died.We do not have any information on the lost sample.
  UserWarning)

https://github.com/keras-team/keras/blob/7a39b6c62d43c25472b2c2476bd2a8983ae4f682/keras/utils/data_utils.py#L708
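
A hedged guess at the cause: the traceback paths show Python 3.6, but generator() still uses the Python 2 idiom answer.strip().decode('utf-8'); under Python 3 the lines read from open(file_name) are already str, so .decode() raises inside the background worker, and Keras only surfaces that as "An input could not be retrieved ... a worker has died". If that is what is happening, opening the file with an explicit encoding and dropping the decode() call should keep the generator alive:

# Inside generator(file_name) - Python 3 version of the file-reading loop.
with open(file_name, encoding='utf-8') as answers:
    for answer in answers:
        batch_of_answers.append(answer.strip())   # already str, no .decode() needed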
