Comments (3)
All lines in the entire set of HTML documents would be one big matrix X. Each row in this matrix is a sample (line). All the labels of all the lines are a single target vector y of the same length (len(y) == X.shape[0]).
The lengths of the actual sequences need to be an array lengths that contains the length of each sequence (document).
So, suppose you have a function that computes the features of a single line as a vector (1-d NumPy array):
import numpy as np

def features(line):
    # feature1 and feature2 stand for your own per-line feature functions
    return np.array([feature1(line), feature2(line)])
... then you should be able to construct the input as follows:
X, y, lengths = [], [], []
for doc, labels in training_set:  # labels: one label per line of doc
    lines = doc.splitlines()
    lengths.append(len(lines))
    for line, label in zip(lines, labels):
        X.append(features(line))
        y.append(label)
X, y, lengths = map(np.asarray, [X, y, lengths])
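To make the layout concrete, here is a minimal sketch on toy data. The two documents, their per-line labels, and this particular features function (character count and number of '<' characters) are made up for illustration; only the X / y / lengths layout comes from the description above.

```python
import numpy as np

def features(line):
    # Hypothetical features: character count and number of '<' characters.
    return np.array([len(line), line.count("<")], dtype=float)

# Toy training set: (document, one label per line). Labels are invented.
training_set = [
    ("<h1>Title</h1>\nSome text", ["header", "body"]),
    ("<p>a</p>\n<p>b</p>\nfooter", ["body", "body", "footer"]),
]

X, y, lengths = [], [], []
for doc, labels in training_set:
    lines = doc.splitlines()
    assert len(lines) == len(labels)  # one label per line
    lengths.append(len(lines))
    for line, label in zip(lines, labels):
        X.append(features(line))
        y.append(label)
X, y, lengths = map(np.asarray, [X, y, lengths])

# Recover each document's rows from the flat matrix using the lengths array.
ends = np.cumsum(lengths)
starts = ends - lengths
doc_slices = [X[s:e] for s, e in zip(starts, ends)]
```

Here X has one row per line across all documents, y has one label per row, and lengths sums to X.shape[0]. With seqlearn installed, these three arrays are what you would pass to a model's fit(X, y, lengths).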
Does that answer your question?
from seqlearn.
Thank you, that was very helpful. So the X matrix can be a dense matrix, where each row is the feature vector of one sample and each column is a different feature? And the feature vectors can be float valued with multiple features being nonzero?
I was confused by trying to deconstruct the included conll.py file in the example. I wasn't sure if features had to be translated to a list of strings (such as ["feature1:val1", "feature2:val2"]) and then encoded into a sparse matrix using the FeatureHasher from sklearn. Also, from your documentation, I wasn't sure what this line, "Make sure the training set (X) is one-hot encoded; if more than one feature in X is on, the emission probabilities will be multiplied.", meant in terms of having my feature vectors as floats, many of which are nonzero.
Finally, I wasn't sure (from deconstructing conll.py) whether features from the previous and subsequent sample in the sequence need to be included in the current sample (as is shown in the conll.py example).
Thanks again for all your help!
X may be either a dense array or a sparse matrix. It follows scikit-learn conventions.
Re: one-hot encoding, that's because the HMM is meant to deal with categorical data and each feature should represent the identity of an event as a boolean. I think you should be using a StructuredPerceptron if your data is anything different (sorry, hadn't thought about this earlier, I very seldom use HMMs).
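As a rough illustration of what "one-hot encoded" means here: each line is described by exactly one categorical event, and that event becomes a single 1 in an otherwise all-zero row. The sketch below uses a made-up categorical feature (the first HTML tag on a line) and plain NumPy; the tag names and the one_hot helper are hypothetical, not part of seqlearn.

```python
import numpy as np

def one_hot(values, vocabulary):
    # Map each categorical value to a row with a single 1 in its column.
    index = {v: i for i, v in enumerate(vocabulary)}
    X = np.zeros((len(values), len(vocabulary)), dtype=int)
    for row, v in enumerate(values):
        X[row, index[v]] = 1
    return X

vocabulary = ["h1", "p", "none"]        # hypothetical event identities
line_tags = ["h1", "p", "p", "none"]    # one event per line
X_onehot = one_hot(line_tags, vocabulary)

# Every row has exactly one active feature.
assert (X_onehot.sum(axis=1) == 1).all()

# By contrast, a float matrix with several nonzero entries per row
# is not one-hot encoded.
X_float = np.array([[14.0, 2.0], [9.0, 0.0]])
```

A matrix like X_float, with real-valued and possibly multiple nonzero entries per row, does not fit that description, which is the case where the comment above suggests StructuredPerceptron instead.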