
passage's Introduction

Passage

A little library for text analysis with RNNs.

Warning: very alpha, work in progress.

Install

via GitHub (version under active development)

git clone http://github.com/IndicoDataSolutions/passage.git
cd passage
python setup.py develop

or via pip

sudo pip install passage

Example

Using Passage to do binary classification of text, this example:

  • Tokenizes some training text, converting it to a format Passage can use.
  • Defines the model's structure as a list of layers.
  • Creates the model with that structure and a cost to be optimized.
  • Trains the model for one iteration over the training text.
  • Uses the model and tokenizer to predict on new text.
  • Saves and loads the model.

from passage.preprocessing import Tokenizer
from passage.layers import Embedding, GatedRecurrent, Dense
from passage.models import RNN
from passage.utils import save, load

tokenizer = Tokenizer()
train_tokens = tokenizer.fit_transform(train_text)

layers = [
	Embedding(size=128, n_features=tokenizer.n_features),
	GatedRecurrent(size=128),
	Dense(size=1, activation='sigmoid')
]

model = RNN(layers=layers, cost='BinaryCrossEntropy')
model.fit(train_tokens, train_labels)

predictions = model.predict(tokenizer.transform(test_text))
save(model, 'save_test.pkl')
model = load('save_test.pkl')

Where:

  • train_text is a list of strings, e.g. ['hello world', 'foo bar']
  • train_labels is a list of labels, e.g. [0, 1]
  • test_text is another list of strings

Datasets

Without sizeable datasets, RNNs have difficulty achieving results better than traditional sparse linear models. Below are a few appropriately sized datasets that are useful for experimentation. Hopefully this list will grow over time; please feel free to propose new datasets for inclusion through either an issue or a pull request.

Note: None of these datasets were created by indico, and their inclusion here does not indicate any kind of endorsement.

Blogger Dataset: http://www.cs.biu.ac.il/~koppel/blogs/blogs.zip (Age and gender data)

passage's People

Contributors

gchrupala, gwulfs, madisonmay, newmu, sihrc, slater-victoroff


passage's Issues

Problem with Multiclass Classification

My data has three classes, and I used cce as the cost function. Here is the code for the layers and model.

layers = [
    Embedding(size=256, n_features=tokenizer.n_features),
    GatedRecurrent(size=512, seq_output=False, p_drop=0.75),  # or LstmRecurrent(size=512, seq_output=False, p_drop=0.75)
    Dense(size=1, activation='sigmoid')
]
model = RNN(layers=layers, cost='cce', updater=Adadelta(lr=0.5))

This is the error I receive. Any idea?

/usr/local/lib/python2.7/dist-packages/theano/scan_module/scan_perform_ext.py:133: RuntimeWarning: numpy.ndarray size changed, may indicate binary incompatibility
from scan_perform.scan_perform import *
Traceback (most recent call last):
  File "myRNN.py", line 44, in <module>
    model.fit(train_tokens, train_labels, n_epochs=9)
  File "/home/naeemul/passage/passage/models.py", line 82, in fit
    c = self._train(xmb, ymb)
  File "/usr/local/lib/python2.7/dist-packages/theano/compile/function_module.py", line 606, in __call__
    storage_map=self.fn.storage_map)
  File "/usr/local/lib/python2.7/dist-packages/theano/compile/function_module.py", line 595, in __call__
    outputs = self.fn()
ValueError: Input dimension mis-match. (input[0].shape[1] = 2, input[1].shape[1] = 1)
Apply node that caused the error: Elemwise{Composite{(i0 * scalar_softplus((-i1)))}}(<TensorType(float64, matrix)>, Elemwise{Add}[(0, 0)].0)
Inputs types: [TensorType(float64, matrix), TensorType(float64, matrix)]
Inputs shapes: [(64, 2), (64, 1)]
Inputs strides: [(16, 8), (8, 8)]
Inputs values: ['not shown', 'not shown']

HINT: Re-running with most Theano optimization disabled could give you a back-trace of when this node was created. This can be done by setting the Theano flag 'optimizer=fast_compile'. If that does not work, Theano optimizations can be disabled with 'optimizer=None'.
HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.

Thanks!

EDIT: Found the problem. I set the Dense size wrong. Closing the issue.
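
For reference, a sketch of the corrected layers (assuming three classes): with a 'cce' cost the output layer needs one unit per class, and softmax is the usual pairing, as in the mnist example further down this page.

layers = [
    Embedding(size=256, n_features=tokenizer.n_features),
    GatedRecurrent(size=512, seq_output=False, p_drop=0.75),
    Dense(size=3, activation='softmax')  # one unit per class to match 'cce'
]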

Understanding Multiclass classification problem

Here is the thing.

I have 3 class labels in my data. The predict function returns an array of shape (N, 3), where N is the size of the test set. Here is an example:

>>> prediction = model.predict(tokenizer.transform(test_text))
>>> prediction
    array([[ 0.99999995,  0.99998585,  0.99999268],
       [ 0.99999993,  0.99998351,  0.99999069],
       [ 0.9999999 ,  0.99997884,  0.99998756],
       [ 0.99999992,  0.99998125,  0.99998932],
       [ 0.99999988,  0.99997679,  0.99998638],
       [ 0.99999996,  0.99998655,  0.99999323],
       [ 0.99999942,  0.99992762,  0.99995597],
       [ 0.99999987,  0.99997481,  0.99998479],
       [ 0.99999995,  0.99998605,  0.99999269],
       [ 0.99999996,  0.99998733,  0.99999372]])

How do I interpret the prediction? I have 3 class labels, but the array gives values close to 1 for all classes. The values don't even sum to 1. I must be doing something wrong. Any suggestion is highly appreciated.

Thanks!
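
A hedged follow-up: if the output layer is switched to Dense(size=3, activation='softmax') as in the previous issue, each prediction row becomes a probability distribution over the three classes (rows summing to 1), and class labels fall out via argmax:

import numpy as np

prediction = model.predict(tokenizer.transform(test_text))
labels = np.argmax(prediction, axis=1)  # most probable class index per row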

Storing a trained model

Hello,

What is the proper way to store and restore a trained model?
I tried to pickle the model variable but got stuck with the following error:

*** PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed

I bet you don't retrain the whole network on each run. How do you store it?
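
For what it's worth, the README example above already shows the intended route via passage.utils rather than pickling the model object yourself:

from passage.utils import save, load

save(model, 'model.pkl')
model = load('model.pkl')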

Setting batch_size in fit()

While using real-valued inputs with the Generic layer, I noticed that setting batch_size in fit() has no effect; it always uses the default value from iterators.py/Linear.
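
A possible workaround in the meantime (hedged: this assumes RNN accepts an iterator instance and that Linear takes its batch size as the size argument, both inferred from iterators.py rather than documented):

from passage.iterators import Linear
from passage.layers import Generic, GatedRecurrent, Dense
from passage.models import RNN

layers = [
    Generic(size=28),
    GatedRecurrent(size=128),
    Dense(size=10, activation='softmax')
]
model = RNN(layers=layers, cost='cce', iterator=Linear(size=256))  # hypothetical batch size of 256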

RNN.predict should return labels, not probabilities

There's a slight incompatibility with sklearn in the RNN.predict method: in sklearn, predict should return predicted class labels, and predict_proba is the name of the method that returns probabilities. In Passage's case predict_proba would be like the existing predict, except that for binary classification tasks sklearn expects an (n, 2) matrix with one column each for the negative and positive probabilities.

Here are the two methods (a hack) that I use in a subclass to implement predict and predict_proba to work with sklearn, on top of the existing RNN.predict. As it is, it only works with binary classification:

    import numpy as np
    import passage.models

    class RNN(passage.models.RNN):  # shadows Passage's RNN so super() reaches the original

        def predict_proba(self, X):
            proba_pos = super(RNN, self).predict(X)
            proba_neg = 1 - proba_pos
            return np.hstack([proba_neg, proba_pos])

        def predict(self, X):
            return self.predict_proba(X).argmax(1)

As I'm not sure what else the current predict can return (i.e., when it's not doing binary classification), I'm also not sure of the right way to change the original code so that it still works with all the tasks it was designed for.

Theano Error: 'TensorVariable' object has no attribute 'cast'

Hi,
I'm trying to run the "example" on my own data.

After fitting the tokenizer, I get a big, long list of error messages from Theano.
Background: Windows 8.1. Anaconda + Python 2.7 64 bit. Latest Theano version, runs on GPU.
Passage installed from Pip.

Code:
tokenizer = Tokenizer(min_df=1, lowercase=False, character=False, max_features=10000)
X_train = tokenizer.fit_transform(X_train)
X_test = tokenizer.transform(X_test)
layers = [
    Embedding(size=128, n_features=tokenizer.n_features),
    GatedRecurrent(size=256, activation='tanh', gate_activation='steeper_sigmoid', init='orthogonal', seq_output=False),
    Dense(size=1, activation='sigmoid', init='orthogonal')  # sigmoid for binary classification
]
model = RNN(layers=layers, cost='bce')

Output:

ERROR (theano.gof.opt): Optimization failure due to: local_gpu_advanced_subtensor1
ERROR:theano.gof.opt:Optimization failure due to: local_gpu_advanced_subtensor1
ERROR (theano.gof.opt): TRACEBACK:
ERROR:theano.gof.opt:TRACEBACK:
ERROR (theano.gof.opt): Traceback (most recent call last):
  File "d:\git\theano\theano\gof\opt.py", line 1485, in process_node
    replacements = lopt.transform(node)
  File "d:\git\theano\theano\sandbox\cuda\opt.py", line 980, in local_gpu_advanced_subtensor1
    return [host_from_gpu(GpuAdvancedSubtensor1()(gpu_x, coords))]
  File "d:\git\theano\theano\gof\op.py", line 507, in __call__
    node = self.make_node(*inputs, **kwargs)
  File "d:\git\theano\theano\sandbox\cuda\basic_ops.py", line 2570, in make_node
    ilist_ = ilist_.cast('int64')
AttributeError: 'TensorVariable' object has no attribute 'cast'

ERROR:theano.gof.opt:Traceback (most recent call last):
  File "d:\git\theano\theano\gof\opt.py", line 1485, in process_node
    replacements = lopt.transform(node)
  File "d:\git\theano\theano\sandbox\cuda\opt.py", line 980, in local_gpu_advanced_subtensor1
    return [host_from_gpu(GpuAdvancedSubtensor1()(gpu_x, coords))]
  File "d:\git\theano\theano\gof\op.py", line 507, in __call__
    node = self.make_node(*inputs, **kwargs)
  File "d:\git\theano\theano\sandbox\cuda\basic_ops.py", line 2570, in make_node
    ilist_ = ilist_.cast('int64')
AttributeError: 'TensorVariable' object has no attribute 'cast'

ERROR (theano.gof.opt): Optimization failure due to: local_gpu_advanced_subtensor1
ERROR:theano.gof.opt:Optimization failure due to: local_gpu_advanced_subtensor1
ERROR (theano.gof.opt): TRACEBACK:
ERROR:theano.gof.opt:TRACEBACK:
ERROR (theano.gof.opt): Traceback (most recent call last):
  File "d:\git\theano\theano\gof\opt.py", line 1485, in process_node
    replacements = lopt.transform(node)
  File "d:\git\theano\theano\sandbox\cuda\opt.py", line 974, in local_gpu_advanced_subtensor1
    return [GpuAdvancedSubtensor1()(as_cuda_ndarray_variable(x), coords)]
  File "d:\git\theano\theano\gof\op.py", line 507, in __call__
    node = self.make_node(*inputs, **kwargs)
  File "d:\git\theano\theano\sandbox\cuda\basic_ops.py", line 2570, in make_node
    ilist_ = ilist_.cast('int64')
AttributeError: 'TensorVariable' object has no attribute 'cast'

ERROR:theano.gof.opt:Traceback (most recent call last):
  File "d:\git\theano\theano\gof\opt.py", line 1485, in process_node
    replacements = lopt.transform(node)
  File "d:\git\theano\theano\sandbox\cuda\opt.py", line 974, in local_gpu_advanced_subtensor1
    return [GpuAdvancedSubtensor1()(as_cuda_ndarray_variable(x), coords)]
  File "d:\git\theano\theano\gof\op.py", line 507, in __call__
    node = self.make_node(*inputs, **kwargs)
  File "d:\git\theano\theano\sandbox\cuda\basic_ops.py", line 2570, in make_node
    ilist_ = ilist_.cast('int64')
AttributeError: 'TensorVariable' object has no attribute 'cast'


This continues for hundreds of cases, repeating the error.

Thanks!

Invalid memory access of location 0x0 rip=0x0

I get "Invalid memory access of location 0x0 rip=0x0" this when I try to import any passage code.

Running on a Mac with theano and cuda installed. Here are my theano settings:

[global]
floatX = float32
device = gpu0
mode = FAST_RUN

[nvcc]
fastmath = True

[cuda]
root = /usr/local/cuda

If I don't specify the CUDA location, then I get:

ERROR (theano.sandbox.cuda): nvcc compiler not found on $PATH. Check your nvcc installation and try again.
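
A hedged note on that last error: Theano shells out to the nvcc binary, so it must be on $PATH. If CUDA lives in /usr/local/cuda as in the config above, adding its bin directory is the usual remedy:

export PATH=/usr/local/cuda/bin:$PATH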

Generic input layer

While I understand that the library's main purpose is text processing, it would be great to have a generic input layer as well for sequences of real-valued input vectors.

Model instantiation is very slow

While trying out the example provided in the Readme, I noticed that this line executes very slowly:

model = RNN(layers=layers, cost='BinaryCrossEntropy')

Profiling shows that this line takes around 20 seconds to execute on both CPU and GPU. I am observing similar performance for the load function in utils.py as well. Am I missing something here? If not, can this be sped up somehow? (Such performance poses a serious obstacle to loading trained models in real-time applications.)

Error while building RNN model

model = RNN(layers=layers, cost='CategoricalCrossEntropy')

I received this error while building the RNN model. Any ideas? The complete error message is very long.

Thanks!

/home/naeemul/anaconda/lib/python2.7/site-packages/theano/tensor/subtensor.py:110: FutureWarning: comparison to None will result in an elementwise object comparison in the future.
start in [None, 0] or
/home/naeemul/anaconda/lib/python2.7/site-packages/theano/tensor/subtensor.py:114: FutureWarning: comparison to None will result in an elementwise object comparison in the future.
stop in [None, length, maxsize] or
/home/naeemul/anaconda/lib/python2.7/site-packages/theano/tensor/opt.py:2165: FutureWarning: comparison to None will result in an elementwise object comparison in the future.
if (replace_x == replace_y and
WARNING (theano.tensor.blas): We did not found a dynamic library into the library_dir of the library we use for blas. If you use ATLAS, make sure to compile it with dynamics library.

WARNING:theano.tensor.blas:We did not found a dynamic library into the library_dir of the library we use for blas. If you use ATLAS, make sure to compile it with dynamics library.

/usr/bin/ld: cannot find -lf77blas
/usr/bin/ld: cannot find -lcblas
/usr/bin/ld: cannot find -latlas
collect2: error: ld returned 1 exit status


Exception Traceback (most recent call last)
in ()
----> 1 model = RNN(layers=layers, cost='CategoricalCrossEntropy')
2 model.fit(train_tokens, train.verdict)
3
4 model.predict(tokenizer.transform(test.text))

/home/naeemul/anaconda/lib/python2.7/site-packages/passage/models.pyc in __init__(self, layers, cost, updater, verbose, Y, iterator)
49 self.updates = self.updater.get_updates(self.params, cost)
50
---> 51 self._train = theano.function([self.X, self.Y], cost, updates=self.updates)
52 self._cost = theano.function([self.X, self.Y], cost)
53 self._predict = theano.function([self.X], self.y_te)

/home/naeemul/anaconda/lib/python2.7/site-packages/theano/compile/function.pyc in function(inputs, outputs, mode, updates, givens, no_default_updates, accept_inplace, name, rebuild_strict, allow_input_downcast, profile, on_unused_input)
221 allow_input_downcast=allow_input_downcast,
222 on_unused_input=on_unused_input,
--> 223 profile=profile)
224 # We need to add the flag check_aliased inputs if we have any mutable or
225 # borrowed used defined inputs

/home/naeemul/anaconda/lib/python2.7/site-packages/theano/compile/pfunc.pyc in pfunc(params, outputs, mode, updates, givens, no_default_updates, accept_inplace, name, rebuild_strict, allow_input_downcast, profile, on_unused_input)
510 return orig_function(inputs, cloned_outputs, mode,
511 accept_inplace=accept_inplace, name=name, profile=profile,
--> 512 on_unused_input=on_unused_input)
513
514

/home/naeemul/anaconda/lib/python2.7/site-packages/theano/compile/function_module.pyc in orig_function(inputs, outputs, mode, accept_inplace, name, profile, on_unused_input)
1310 profile=profile,
1311 on_unused_input=on_unused_input).create(
-> 1312 defaults)
1313
1314 t2 = time.time()

/home/naeemul/anaconda/lib/python2.7/site-packages/theano/compile/function_module.pyc in create(self, input_storage, trustme)
1179 # Get a function instance
1180 start_linker = time.time()
-> 1181 _fn, _i, _o = self.linker.make_thunk(input_storage=input_storage_lists)
1182 end_linker = time.time()
1183

/home/naeemul/anaconda/lib/python2.7/site-packages/theano/gof/link.pyc in make_thunk(self, profiler, input_storage, output_storage)
432 return self.make_all(profiler=profiler,
433 input_storage=input_storage,
--> 434 output_storage=output_storage)[:3]
435
436 def make_all(self, profiler, input_storage, output_storage):

/home/naeemul/anaconda/lib/python2.7/site-packages/theano/gof/vm.pyc in make_all(self, profiler, input_storage, output_storage)
845 storage_map,
846 compute_map,
--> 847 no_recycling))
848 except Exception, e:
849 e.args = ("The following error happened while"

/home/naeemul/anaconda/lib/python2.7/site-packages/theano/gof/op.pyc in make_thunk(self, node, storage_map, compute_map, no_recycling)
604 logger.debug('Trying CLinker.make_thunk')
605 outputs = cl.make_thunk(input_storage=node_input_storage,
--> 606 output_storage=node_output_storage)
607 fill_storage, node_input_filters, node_output_filters = outputs
608

/home/naeemul/anaconda/lib/python2.7/site-packages/theano/gof/cc.pyc in make_thunk(self, input_storage, output_storage, keep_lock)
946 cthunk, in_storage, out_storage, error_storage = self.compile(
947 input_storage, output_storage,
--> 948 keep_lock=keep_lock)
949
950 res = _CThunk(cthunk, init_tasks, tasks, error_storage)

/home/naeemul/anaconda/lib/python2.7/site-packages/theano/gof/cc.pyc in compile(self, input_storage, output_storage, keep_lock)
889 input_storage,
890 output_storage,
--> 891 keep_lock=keep_lock)
892 return (thunk,
893 [link.Container(input, storage) for input, storage in

/home/naeemul/anaconda/lib/python2.7/site-packages/theano/gof/cc.pyc in cthunk_factory(self, error_storage, in_storage, out_storage, keep_lock)
1320 else:
1321 module = get_module_cache().module_from_key(
-> 1322 key=key, fn=self.compile_cmodule_by_step, keep_lock=keep_lock)
1323
1324 vars = self.inputs + self.outputs + self.orphans

/home/naeemul/anaconda/lib/python2.7/site-packages/theano/gof/cmodule.pyc in module_from_key(self, key, fn, keep_lock, key_data)
994 # The module should be returned by the last
995 # step of the compilation.
--> 996 module = next(compile_steps)
997 except StopIteration:
998 break

/home/naeemul/anaconda/lib/python2.7/site-packages/theano/gof/cc.pyc in compile_cmodule_by_step(self, location)
1235 lib_dirs=self.lib_dirs(),
1236 libs=libs,
-> 1237 preargs=preargs)
1238 except Exception, e:
1239 e.args += (str(self.fgraph),)

/home/naeemul/anaconda/lib/python2.7/site-packages/theano/gof/cmodule.pyc in compile_str(module_name, src_code, location, include_dirs, lib_dirs, libs, preargs, py_module)
1969 # difficult to read.
1970 raise Exception('Compilation failed (return status=%s): %s' %
-> 1971 (status, compile_stderr.replace('\n', '. ')))
1972 elif config.cmodule.compilation_warning and compile_stderr:
1973 # Print errors just below the command line.

Exception: ('The following error happened while compiling the node', Dot22(Reshape{2}.0, Reshape{2}.0), '\n', 'Compilation failed (return status=1): /usr/bin/ld: cannot find -lf77blas. /usr/bin/ld: cannot find -lcblas. /usr/bin/ld: cannot find -latlas. collect2: error: ld returned 1 exit status. ', '[Dot22(<TensorType(float64, matrix)>, <TensorType(float64, matrix)>)]')
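
The linker lines (cannot find -lf77blas / -lcblas / -latlas) indicate that Theano is configured to link against ATLAS but the ATLAS libraries are not installed. On Debian/Ubuntu (an assumption about the reporter's platform), installing the ATLAS development package usually resolves this:

sudo apt-get install libatlas-base-dev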

Tokeniser issue

It seems like the Tokenizer is broken, since the following code snippet:
train_text = ['hello world', 'foo bar']

tokenizer = Tokenizer()
train_tokens = tokenizer.fit_transform(train_text)

results in:

[[2, 2], [2, 2]]

Stacking recurrent layers

I get a dimension mismatch error when trying to stack recurrent layers (LSTM or Gated) to create a deep recurrent network. Multiple dense layers seem to work fine, though.
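
A hedged sketch of a stack that should avoid the mismatch, based on the multi-layer example further down this page: every recurrent layer except the last needs seq_output=True, so that it feeds a full sequence rather than only its final state to the next recurrent layer (tokenizer as in the README example):

layers = [
    Embedding(size=128, n_features=tokenizer.n_features),
    GatedRecurrent(size=128, seq_output=True),   # emit the whole sequence for the next layer
    GatedRecurrent(size=128, seq_output=False),  # the last recurrent layer emits only its final state
    Dense(size=1, activation='sigmoid')
]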

fit_transform returns unorderable types: dict_values() >= int() error

I just installed passage through git and tried this small piece of code.

example_text = ['This. is.', 'Example TEXT', 'is text']
tokenizer = Tokenizer(min_df=1, lowercase=True, character=False)
tokenized = tokenizer.fit_transform(example_text)

It returns:

TypeError Traceback (most recent call last)
in ()
5 example_text = ['This. is.', 'Example TEXT', 'is text']
6 tokenizer = Tokenizer(min_df=1, lowercase=True, character=False)
----> 7 tokenized = tokenizer.fit_transform(example_text)
8 tokenized

/home/naeemul/anaconda3/lib/python3.4/site-packages/passage/preprocessing.py in fit_transform(self, texts)
128
129 def fit_transform(self, texts):
--> 130 self.fit(texts)
131 tokens = self.transform(texts)
132 return tokens

/home/naeemul/anaconda3/lib/python3.4/site-packages/passage/preprocessing.py in fit(self, texts)
109 else:
110 tokens = [tokenize(text) for text in texts]
--> 111 self.encoder = token_encoder(tokens, max_features=self.max_features-3, min_df=self.min_df)
112 self.encoder['PAD'] = 0
113 self.encoder['END'] = 1

/home/naeemul/anaconda3/lib/python3.4/site-packages/passage/preprocessing.py in token_encoder(texts, max_features, min_df)
54 df[token] = 1
55 k, v = df.keys(), np.asarray(df.values())
---> 56 valid = v >= min_df
57 k = lbf(k, valid)
58 v = v[valid]

TypeError: unorderable types: dict_values() >= int()

I installed again through pip and got the same error. Any idea how to solve it? Thanks!
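
The traceback points at a Python 2 idiom: under Python 3, df.keys() and df.values() are views, so np.asarray(df.values()) wraps the whole view as a single object and the >= comparison fails. Running under Python 2.7 (as in the library's other tracebacks) sidesteps the issue; alternatively, a minimal local patch to token_encoder in passage/preprocessing.py, assuming this is the only incompatibility on that path, is to materialize the views:

k, v = list(df.keys()), np.asarray(list(df.values()))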

tokenizer confusion

I was a little confused by the tokenizer, since if I follow the example in the readme I get something like:

train_text = ['hello world', 'foo bar']
tokenizer = Tokenizer()
train_tokens = tokenizer.fit_transform(train_text)

and then train_tokens is just a list of 2s: train_tokens = [[2, 2], [2, 2]]. It was explained to me that the default minimum frequency is 10, and that one can change this with tokenizer = Tokenizer(min_df=1), as shown below. Maybe this should be a bit clearer in the readme example... Thanks!
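
A minimal adjustment of the README example along those lines:

from passage.preprocessing import Tokenizer

train_text = ['hello world', 'foo bar']
tokenizer = Tokenizer(min_df=1)  # keep tokens that occur at least once
train_tokens = tokenizer.fit_transform(train_text)
# each word now gets its own id instead of collapsing to the unknown token id 2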

Use as a Tagging Model?

This is more of a feature request.

I'd love to try to use an LSTM model as a tagging model. I have tagged words for my training data (not POS tags or any common NLP tagging problem). The previous tags can influence the current word's tags. Is it possible to use this library as a word tagger? Right now it looks like it trains on an entire document, sequentially, but with one target label per document.

How to avoid overfitting

Dear contributors,

Thanks a ton for putting this library together.

I based my classification model on the sentiment.py example. I find that my (Embedding + GatedRecurrent / LstmRecurrent + Dense) combination of layers overfits the data. Training error goes down steadily, but at the same time, prediction error on some completely unseen data goes up steadily.

Do you have any suggestions? p_drop is 0.75, updater=Adadelta(lr=0.5). I have already tried removing the last dense layer, reducing the size of each layer, and increasing the size of the training data.

Thanks in advance!
Uma

Sequential output

It would be helpful to have an example of a network configuration where a label is predicted for each element in the sequence. This is a common scenario in NLP (e.g. named entity recognition).

LSTM example raises a `dtype` difference error

Hello,
I used exactly the example in Passage/mnist.py; the only modification is changing GatedRecurrent into LstmRecurrent:

import ...
...

trX, teX, trY, teY = load_mnist()

#Use generic layer - RNN processes a size 28 vector at a time scanning from left to right
layers = [
    Generic(size=28),
    LstmRecurrent(size=512, p_drop=0.2),
    Dense(size=10, activation='softmax', p_drop=0.5)
]

#A bit of l2 helps with generalization, higher momentum helps convergence
updater = NAG(momentum=0.95, regularizer=Regularizer(l2=1e-4))

#Linear iterator for real valued data, cce cost for softmax
model = RNN(layers=layers, updater=updater, iterator='linear', cost='cce')
model.fit(trX, trY, n_epochs=20)

tr_preds = model.predict(trX[:len(teY)])
te_preds = model.predict(teX)

tr_acc = np.mean(trY[:len(teY)] == np.argmax(tr_preds, axis=1))
te_acc = np.mean(teY == np.argmax(te_preds, axis=1))

# Test accuracy should be between 98.9% and 99.3%
print 'train accuracy', tr_acc, 'test accuracy', te_acc

However, this error arose:

Traceback (most recent call last):
  File "/.../ex2.py", line 24, in <module>
    model = RNN(layers=layers, updater=updater, iterator='linear', cost='cce')
  File "/.../models.py", line 44, in __init__
    self.y_tr = self.layers[-1].output(dropout_active=True)
  File "/.../layers.py", line 297, in output
    X = self.l_in.output(dropout_active=dropout_active)
  File "/.../layers.py", line 190, in output
    truncate_gradient=self.truncate_gradient
  File "/.../theano/scan_module/scan.py", line 1042, in scan
    scan_outs = local_op(*scan_inputs)
  File "/.../theano/gof/op.py", line 507, in __call__
    node = self.make_node(*inputs, **kwargs)
  File "/.../theano/scan_module/scan_op.py", line 374, in make_node
    inner_sitsot_out.type.dtype))
ValueError: When compiling the inner function of scan the following error has been encountered: The initial state (`outputs_info` in scan nomenclature) of variable IncSubtensor{Set;:int64:}.0 (argument number 4) has dtype float32, while the result of the inner function (`fn`) has dtype float64. This can happen if the inner function of scan results in an upcast or downcast.

How could I fix this? Or is there anything I can do to make the program run smoothly?
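
A hedged guess: this kind of float32/float64 mismatch usually traces back to Theano's floatX setting, with the initial scan state created as float32 while the inner computation upcasts to float64. Forcing a consistent floatX in .theanorc (float64 here, to match the inner result; casting the data to float32 instead is the other direction) may avoid the error, though this is an assumption rather than a confirmed fix:

[global]
floatX = float64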

Implement learning rate schedule

I was able to get better results with this library on my own dataset by using SGD and halving the learning rate after each epoch than by using Eddy without such a learning rate schedule.

It would be nice if this were an official feature. I would make a pull request, but I am not sure of the best way to implement it.
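
In the meantime, here is a sketch of the idea only (not passage API): since an updater presumably bakes the learning rate into the compiled update graph, a schedule would keep the rate in a Theano shared variable that the update expressions reference, then change it between epochs.

import numpy as np
import theano

# Keep the learning rate in a shared variable so that update expressions
# built from it see new values without recompiling the graph.
lr = theano.shared(np.float32(0.1))

def train_one_epoch():
    # hypothetical stand-in for one pass of model.fit over the data
    pass

for epoch in range(10):
    train_one_epoch()
    lr.set_value(lr.get_value() / 2.0)  # halve the rate after each epoch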

TypeError: not all arguments converted during string formatting

Hi,
I'm getting this error.

Traceback (most recent call last):
  File "dummy.py", line 15, in <module>
    model = RNN(layers=layers, cost='BinaryCrossEntropy')
  File "c:\users\user1\documents\github\passage\passage\models.py", line 44, in __init__
    self.y_tr = self.layers[-1].output(dropout_active=True)
  File "c:\users\user1\documents\github\passage\passage\layers.py", line 275, in output
    X = self.l_in.output(dropout_active=dropout_active)
  File "c:\users\user1\documents\github\passage\passage\layers.py", line 239, in output
    outputs_info=[repeat(self.h0, x_h.shape[1], axis=0)],
  File "C:\Users\user1\AppData\Local\Enthought\Canopy\User\lib\site-packages\theano\tensor\extra_ops.py", line 360, in repeat
    return RepeatOp(axis=axis)(x, repeats)
  File "C:\Users\user1\AppData\Local\Enthought\Canopy\User\lib\site-packages\theano\gof\op.py", line 399, in __call__
    node = self.make_node(*inputs, **kwargs)
  File "C:\Users\user1\AppData\Local\Enthought\Canopy\User\lib\site-packages\theano\tensor\extra_ops.py", line 259, in make_node
    % numpy_unsupported_dtypes), repeats.dtype)
TypeError: not all arguments converted during string formatting

Contents of dummy.py:

from passage.preprocessing import Tokenizer
from passage.layers import Embedding, GatedRecurrent, Dense
from passage.models import RNN
from passage.utils import save, load

train_text = ['hello world', 'foo bar']
train_labels = [0, 1]
test_text = ['foo world']  # added so the script runs; any list of strings works
tokenizer = Tokenizer()
train_tokens = tokenizer.fit_transform(train_text)

layers = [
    Embedding(size=128, n_features=tokenizer.n_features),
    GatedRecurrent(size=128),
    Dense(size=1, activation='sigmoid')
]

model = RNN(layers=layers, cost='BinaryCrossEntropy')
model.fit(train_tokens, train_labels)

model.predict(tokenizer.transform(test_text))
save(model, 'save_test.pkl')
model = load('save_test.pkl')

Go directly into RNN layers without using Embeddings and preprocessing methods.

Hello,
We are currently working on an NLP research project.
We have already settled on our own preprocessing, which gives each long document a fixed-length vector representation. So the training input is now a fixed-length decimal vector, and the output should be one of several labels.
However, as far as I know, Passage has its own preprocessing methods, which are more or less required for the training input.
Is there any way to bypass Passage's own preprocessing steps, such as the tokenizer and the Embedding layer?

For example, the layers look like this:

layers = [
            Embedding(size=256, n_features=tokenizer.n_features),
            GatedRecurrent(size=256, seq_output=True),
            GatedRecurrent(size=256, seq_output=False), # activation='t_rectify',
            Dense(size=1, activation='sigmoid')
        ]

So, is there any way to remove the first layer, so that the training inputs can go directly into the RNN layers?

layers = [
            # Embedding(size=256, n_features=tokenizer.n_features),
            GatedRecurrent(size=256, seq_output=True),
            GatedRecurrent(size=256, seq_output=False), # activation='t_rectify',
            Dense(size=1, activation='sigmoid')
        ]

Well, if I do so, it will, of course, give this error message:

line 41, in __init__
    self.params = flatten([l.params for l in layers])
AttributeError: 'GatedRecurrent' object has no attribute 'params'

Thank you.
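
A hedged sketch of a possible route, borrowing the Generic layer and the 'linear' iterator from the mnist example in the LSTM issue above; whether Generic accepts arbitrary real-valued vectors in place of an Embedding for this use case is an assumption:

from passage.layers import Generic, GatedRecurrent, Dense
from passage.models import RNN

layers = [
    Generic(size=256),                          # real-valued input vectors, no tokenizer/Embedding
    GatedRecurrent(size=256, seq_output=True),
    GatedRecurrent(size=256, seq_output=False),
    Dense(size=1, activation='sigmoid')
]
model = RNN(layers=layers, cost='bce', iterator='linear')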

Trying to apply Passage to a regression problem

Hello,

It's more a question than a bug report, because I'm not sure that Passage is meant to work for my problem.
It's definitely a regression problem, but the values are very chaotic and a regular regression approach doesn't work well. So I hoped that I could apply Passage to this regression problem just by passing 'linear' to the activation functions of each layer.

So I have a time series, and the data looks like this (I simplified the data to make the example more obvious):

step 1: [34, 53, 10]
step 2: [23, 14, 77]
step 3: [12, 43, 90]
step 4: [93, 22, 31]
step 5: [1, 10, 53]

I'm trying to predict the next step's values using the data from the previous two steps.
So X and Y look like this:

X = [
    [34, 53, 10] + [23, 14, 77],
    [23, 14, 77] + [12, 43, 90],
    [12, 43, 90] + [93, 22, 31]
]

Y = [
    [12, 43, 90],
    [93, 22, 31],
    [1, 10, 53]
]

When I call .fit(), I see that Passage tries to treat the vectors from X as sparse vectors. It assumes that, for example, the number 34 (the value of X[0][0]) is an index and not a value.

So I have two questions.

  1. Is it possible to apply Passage to regression problems?
  2. How to tell Passage to treat data as usual vectors and not sparse vectors?

Thank you
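
A heavily hedged sketch of one direction, again borrowing the Generic layer and 'linear' iterator from the mnist example above so that inputs are treated as dense vectors rather than sparse token indices; the 'linear' output activation and the mean-squared-error cost name ('mse') are assumptions, not confirmed passage API:

from passage.layers import Generic, GatedRecurrent, Dense
from passage.models import RNN

layers = [
    Generic(size=3),                     # one 3-dimensional real-valued vector per time step
    GatedRecurrent(size=64),
    Dense(size=3, activation='linear')   # three real-valued outputs
]
model = RNN(layers=layers, iterator='linear', cost='mse')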
