
seq2vec's Introduction

seq2vec 0.4.0

Turn a sequence of words into a fixed-length representation vector

This version refactors all the seq2vec structures and uses customized layers from yklz.

Install

pip install seq2vec

or clone the repo, then install:

git clone --recursive https://github.com/Yoctol/seq2vec.git
python setup.py install

Usage

Simple hash:

from seq2vec import Seq2VecHash

transformer = Seq2VecHash(vector_length=100)
seqs = [
    ['我', '有', '一個', '蘋果'],
    ['我', '有', 'pineapple'],
]
result = transformer.transform(seqs)
print(result)
'''
array([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  1.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]])
'''
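
Conceptually, each token is hashed to an index in the output vector and that bucket is incremented. A minimal sketch of the idea (the actual hash function used by Seq2VecHash may differ):

import numpy as np

def hash_vector(seq, vector_length=100):
    # Hash each token into one of vector_length buckets and count it.
    vec = np.zeros(vector_length)
    for token in seq:
        vec[hash(token) % vector_length] += 1.0
    return vec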

Sequence-to-sequence auto-encoder:

  • LSTM to LSTM auto-encoder with word embedding (RNN to RNN architecture)

    from seq2vec.word2vec import GensimWord2vec
    from seq2vec import Seq2VecR2RWord
    
    # load Gensim word2vec from word2vec_model_path
    word2vec = GensimWord2vec(word2vec_model_path)
    
    transformer = Seq2VecR2RWord(
          word2vec_model=word2vec,
          max_length=20,
          latent_size=300,
          encoding_size=300,
          learning_rate=0.05
    )
    
    train_seq = [
      ['我', '有', '一個', '蘋果'],
      ['我', '有', '筆'],
      ['一個', '鳳梨'],
    ]
    test_seq = [
      ['我', '愛', '吃', '鳳梨'],
    ]
    transformer.fit(train_seq)
    result = transformer.transform(test_seq)
  • CNN to LSTM auto-encoder with word embedding (CNN to RNN architecture)

    from seq2vec.word2vec import GensimWord2vec
    from seq2vec import Seq2VecC2RWord
    
    # load Gensim word2vec from word2vec_model_path
    word2vec = GensimWord2vec(word2vec_model_path)
    
    transformer = Seq2VecC2RWord(
          word2vec_model=word2vec,
          max_length=20,
          latent_size=300,
          conv_size=5,
          channel_size=10,
          learning_rate=0.05,
    )
    
    train_seq = [
      ['我', '有', '一個', '蘋果'],
      ['我', '有', '筆'],
      ['一個', '鳳梨'],
    ]
    test_seq = [
      ['我', '愛', '吃', '鳳梨'],
    ]
    transformer.fit(train_seq)
    result = transformer.transform(test_seq)
  • CNN to LSTM auto-encoder with char embedding (CNN to RNN architecture)

    from seq2vec.word2vec import GensimWord2vec
    from seq2vec import Seq2VecC2RChar
    
    # load Gensim word2vec from word2vec_model_path
    word2vec = GensimWord2vec(word2vec_model_path)
    
    transformer = Seq2VecC2RChar(
          word2vec_model=word2vec,
          max_index=1000,
          max_length=20,
          embedding_size=200,
          latent_size=200,
          learning_rate=0.05,
          channel_size=10,
          conv_size=5
    )
    
    train_seq = [
      ['我', '有', '一個', '蘋果'],
      ['我', '有', '筆'],
      ['一個', '鳳梨'],
    ]
    test_seq = [
      ['我', '愛', '吃', '鳳梨'],
    ]
    transformer.fit(train_seq)
    result = transformer.transform(test_seq)
  • LSTM to LSTM auto-encoder with hash word embedding (RNN to RNN architecture)

from seq2vec import Seq2VecR2RHash

transformer = Seq2VecR2RHash(
    max_index=1000,
    max_length=10,
    latent_size=20,
    embedding_size=200,
    encoding_size=300,
    learning_rate=0.05
)

train_seq = [
    ['我', '有', '一個', '蘋果'],
    ['我', '有', '筆'],
    ['一個', '鳳梨'], 
]
test_seq = [
    ['我', '愛', '吃', '鳳梨'],
]
transformer.fit(train_seq)
result = transformer.transform(test_seq)
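
Each of these transform calls returns one fixed-length encoding per input sequence. A common follow-up, not part of seq2vec itself, is to compare encodings with cosine similarity (a minimal sketch reusing the variables from the examples above):

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two encoded sequence vectors.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

train_vectors = transformer.transform(train_seq)
test_vectors = transformer.transform(test_seq)
# '我 愛 吃 鳳梨' should land closer to '一個 鳳梨' than to the other sequences.
print(cosine_similarity(test_vectors[0], train_vectors[2]))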

Training with generator on file

We provide an example with an LSTM to LSTM auto-encoder (word embedding).

Use the following training method when memory is a constraint for you.

The file should be a tokenized txt file, whitespace-delimited, with one sequence per line.
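
For example, a (hypothetical) corpus file could look like:

我 有 一個 蘋果
我 有 筆
一個 鳳梨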

from seq2vec.word2vec import GensimWord2vec

from seq2vec.model import Seq2VecR2RWord
from seq2vec.transformer import WordEmbeddingTransformer
from seq2vec.util import DataGenterator

word2vec = GensimWord2vec(word2vec_model_path)
max_length = 20

transformer = Seq2VecR2RWord(
    word2vec_model=word2vec,
    max_length=max_length,
    latent_size=200,
    encoding_size=300,
    learning_rate=0.05
)

train_data = DataGenterator(
    corpus_for_training_path, 
    transformer.input_transformer,
    transformer.output_transformer, 
    batch_size=128
)
test_data = DataGenterator(
    corpus_for_validation_path, 
    transformer.input_transformer,
    transformer.output_transformer, 
    batch_size=128
)

transformer.fit_generator(
    train_data,
    test_data,
    epochs=10,
    batch_number=1250  # number of batches per epoch
)

transformer.save_model(model_path) # save your model

# You can reload your model and retrain it.
transformer.load_model(model_path)
transformer.fit_generator(
    train_data,
    test_data,
    epochs=10,
    batch_number=1250  # number of batches per epoch
)
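
batch_number should roughly cover the corpus once per epoch. With hypothetical numbers for illustration:

batch_size = 128
corpus_lines = 160000  # hypothetical corpus size
batch_number = corpus_lines // batch_size  # 1250 batches per epoch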

Customize your seq2vec model with our auto-encoder framework

You can customize your seq2vec model easily with our framework.

import keras
from seq2vec.model import TrainableSeq2VecBase

class YourSeq2Vec(TrainableSeq2VecBase):

   def __init__(self,
      max_length,
      latent_size,
      learning_rate
   ):
      # Initialize your settings and set input_transformer
      # and output_transformer.
      # Input and output transformers transform data from
      # raw sequences into Keras layer input format.
      # See seq2vec.transformer for more detail.

      self.input_transformer = YourInputTransformer()
      self.output_transformer = YourOutputTransformer()

      # add your customized layer
      self.custom_objects = {}
      self.custom_objects[customized_class_name] = customized_class

      super(YourSeq2Vec, self).__init__(
         max_length,
         latent_size,
         learning_rate
      )

   def create_model(self):
      # Create and compile your model in this function.
      # Return both the compiled model and the encoder;
      # the encoder is the part that encodes input sequences.

      model.compile(optimizer=optimizer, loss=loss)
      return model, encoder

   def load_model(self, file_path):
      # load your seq2vec model here and set its attribute values
      self.model = self.load_customed_model(file_path)
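
For illustration, here is a hypothetical create_model for a plain LSTM auto-encoder. The layer choices and the feature_size attribute are assumptions for this sketch, not library defaults:

from keras.layers import Input, LSTM, RepeatVector
from keras.models import Model
from keras.optimizers import RMSprop

def create_model(self):
    # Hypothetical LSTM auto-encoder: encode each padded sequence of
    # feature_size-dimensional vectors into latent_size, then decode back.
    inputs = Input(shape=(self.max_length, self.feature_size))
    encoded = LSTM(self.latent_size)(inputs)
    repeated = RepeatVector(self.max_length)(encoded)
    decoded = LSTM(self.feature_size, return_sequences=True)(repeated)

    model = Model(inputs, decoded)
    encoder = Model(inputs, encoded)
    model.compile(loss='mse', optimizer=RMSprop(lr=self.learning_rate))
    return model, encoder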

Lint

pylint --rcfile=./yoctol-pylintrc/.pylintrc seq2vec

Test

python -m unittest

seq2vec's People

Contributors

solumilken, stegben


seq2vec's Issues

requirement

Not sure whether we should also add
tensorflow-gpu??
(Not sure what happens if you pip install both tensorflow and tensorflow-gpu at the same time??)

Not a single example in README is running properly.

I tried to execute every single example from the initial README page, but everything throws different errors.

transformer = Seq2VecR2RWord(
    word2vec_model=word2vec,
    max_length=max_length,
    latent_size=200,
    encoding_size=300,
    learning_rate=0.05
)
Traceback (most recent call last):

  File "<ipython-input-42-34cf056a2e62>", line 6, in <module>
    learning_rate=0.05

  File "/home/estathop/anaconda2/envs/tensorflow/lib/python2.7/site-packages/seq2vec/model/seq2vec_R2R_word.py", line 55, in __init__
    learning_rate=learning_rate

  File "/home/estathop/anaconda2/envs/tensorflow/lib/python2.7/site-packages/seq2vec/model/seq2vec_base.py", line 68, in __init__
    self.model, self.encoder = self.create_model()

  File "/home/estathop/anaconda2/envs/tensorflow/lib/python2.7/site-packages/seq2vec/model/seq2vec_R2R_word.py", line 89, in create_model
    )(masked_inputs)

  File "/home/estathop/anaconda2/envs/tensorflow/lib/python2.7/site-packages/keras/layers/wrappers.py", line 325, in __call__
    return super(Bidirectional, self).__call__(inputs, **kwargs)

  File "/home/estathop/anaconda2/envs/tensorflow/lib/python2.7/site-packages/keras/engine/topology.py", line 619, in __call__
    output = self.call(inputs, **kwargs)

  File "/home/estathop/anaconda2/envs/tensorflow/lib/python2.7/site-packages/yklz/recurrent/bidirectional_rnn_encoder.py", line 31, in call
    func_args = inspect.getfullargspec(self.layer.call).args

AttributeError: 'module' object has no attribute 'getfullargspec'

from seq2vec import Seq2VecR2RHash

transformer = Seq2VecR2RHash(
    max_index=1000,
    max_length=10,
    latent_size=20,
    embedding_size=200,
    encoding_size=300,
    learning_rate=0.05
)

Traceback (most recent call last):

  File "<ipython-input-38-0274b84666d1>", line 9, in <module>
    learning_rate=0.05

  File "/home/estathop/anaconda2/envs/tensorflow/lib/python2.7/site-packages/seq2vec/model/seq2vec_R2R_hash.py", line 58, in __init__
    learning_rate

  File "/home/estathop/anaconda2/envs/tensorflow/lib/python2.7/site-packages/seq2vec/model/seq2vec_base.py", line 68, in __init__
    self.model, self.encoder = self.create_model()

  File "/home/estathop/anaconda2/envs/tensorflow/lib/python2.7/site-packages/seq2vec/model/seq2vec_R2R_hash.py", line 96, in create_model
    )(char_embedding)

  File "/home/estathop/anaconda2/envs/tensorflow/lib/python2.7/site-packages/keras/layers/wrappers.py", line 325, in __call__
    return super(Bidirectional, self).__call__(inputs, **kwargs)

  File "/home/estathop/anaconda2/envs/tensorflow/lib/python2.7/site-packages/keras/engine/topology.py", line 619, in __call__
    output = self.call(inputs, **kwargs)

  File "/home/estathop/anaconda2/envs/tensorflow/lib/python2.7/site-packages/yklz/recurrent/bidirectional_rnn_encoder.py", line 31, in call
    func_args = inspect.getfullargspec(self.layer.call).args

AttributeError: 'module' object has no attribute 'getfullargspec'

In [37]: transformer = Seq2VecC2RChar(
      word2vec_model=word2vec,
      max_index=1000,
      max_length=20,
      embedding_size=200,
      latent_size=200,
      learning_rate=0.05,
      channel_size=10,
      conv_size=5
)

Traceback (most recent call last):

  File "<ipython-input-37-1396dd8326f9>", line 9, in <module>
    conv_size=5

  File "/home/estathop/anaconda2/envs/tensorflow/lib/python2.7/site-packages/seq2vec/model/seq2vec_C2R_char.py", line 69, in __init__
    learning_rate

  File "/home/estathop/anaconda2/envs/tensorflow/lib/python2.7/site-packages/seq2vec/model/seq2vec_base.py", line 68, in __init__
    self.model, self.encoder = self.create_model()

  File "/home/estathop/anaconda2/envs/tensorflow/lib/python2.7/site-packages/seq2vec/model/seq2vec_C2R_char.py", line 153, in create_model
    )(encoded_feature)

  File "/home/estathop/anaconda2/envs/tensorflow/lib/python2.7/site-packages/keras/engine/topology.py", line 619, in __call__
    output = self.call(inputs, **kwargs)

  File "/home/estathop/anaconda2/envs/tensorflow/lib/python2.7/site-packages/yklz/recurrent/rnn_decoder.py", line 95, in call
    constants = self.layer.get_constants(inputs, training=None)

  File "/home/estathop/anaconda2/envs/tensorflow/lib/python2.7/site-packages/yklz/recurrent/rnn_cell.py", line 277, in get_constants
    constants = self.recurrent_layer.get_constants(

AttributeError: 'LSTM' object has no attribute 'get_constants'

version in requirements.txt

Hi,
can you kindly update the requirements.txt file with the correct versions of all modules?
Thanks a lot

Candidates 0.1.0

  • Hash (for any size)
  • TFIDF
  • BM25
  • doc2vec
  • seq2seq (hash version)
  • seq2seq (word2vec)

Need new release !!!

  • Update the README.
  • The version on pip is 0.6.0, but the released version is 0.4.0.
  • Publish a new pip version and a new release with a pinned Keras version.

can't load word2vec model when running example code

I am trying to execute the LSTM to LSTM auto-encoder with word embedding (RNN to RNN architecture) example. I have already trained my own word2vec model via gensim and saved it with the command

model.save('/home/estathop/Documents/word2vecmodel/w2v1model')  # save model

When trying to run

# load Gensim word2vec from word2vec_model_path
word2vec = GensimWord2vec('/home/estathop/Documents/word2vecmodel/w2v1model')

the following error occurs:

Traceback (most recent call last):

  File "", line 5, in <module>
    word2vec = GensimWord2vec('/home/estathop/Documents/word2vecmodel/w2v1model')

  File "/home/estathop/anaconda2/envs/tensorflow/lib/python2.7/site-packages/seq2vec/word2vec/gensim_word2vec.py", line 9, in __init__
    model_path, binary=True

  File "/home/estathop/anaconda2/envs/tensorflow/lib/python2.7/site-packages/gensim/models/keyedvectors.py", line 1120, in load_word2vec_format
    limit=limit, datatype=datatype)

  File "/home/estathop/anaconda2/envs/tensorflow/lib/python2.7/site-packages/gensim/models/utils_any2vec.py", line 174, in _load_word2vec_format
    header = utils.to_unicode(fin.readline(), encoding=encoding)

  File "/home/estathop/anaconda2/envs/tensorflow/lib/python2.7/site-packages/gensim/utils.py", line 359, in any2unicode
    return unicode(text, encoding, errors=errors)

  File "/home/estathop/anaconda2/envs/tensorflow/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)

UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

Any ideas on how to fix or bypass this?
