I'm trying to train an HMM that classifies lines in an HTML document as belonging to a

Dataset loading (seqlearn issue, closed)

danintheory commented on July 26, 2024

Dataset loading

Comments (3)

larsmans commented on July 26, 2024

All lines in the entire set of HTML documents would be one big matrix X. Each row in this matrix is a sample (line). All the labels of all the lines are a single target vector y of the same length (len(y) == X.shape[0]).

The lengths of the actual sequences go in a separate array, lengths, with one entry per sequence (document).

So, suppose you have a function that computes the features of a single line as a vector (1-d NumPy array):

import numpy as np

def features(line):
    return np.array([feature1(line), feature2(line)])

... then you should be able to construct the input as follows:

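X, y, lengths = [], [], []

# Assumes training_set yields (doc, labels) pairs, with one label per line of doc
for doc, labels in training_set:
    lines = doc.splitlines()
    lengths.append(len(lines))
    for line, label in zip(lines, labels):
        X.append(features(line))
        y.append(label)

X, y, lengths = map(np.asarray, [X, y, lengths])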
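
To make the layout concrete, here is a small sketch with made-up data (two documents of 3 and 2 lines, two features per line; the label names are hypothetical). It checks the shape constraints described above and shows how lengths can be used to recover the rows that belong to each document:

```python
import numpy as np

# Hypothetical toy data: 5 lines in total, 2 features per line.
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0],   # document 1 (3 lines)
              [0.0, 0.0], [1.0, 0.0]])              # document 2 (2 lines)
y = np.array(["header", "body", "body", "header", "body"])
lengths = np.array([3, 2])

# One label per row, and the sequence lengths account for every row.
assert len(y) == X.shape[0]
assert lengths.sum() == X.shape[0]

# Recover the (X, y) slice belonging to each document.
ends = np.cumsum(lengths)
starts = ends - lengths
docs = [(X[s:e], y[s:e]) for s, e in zip(starts, ends)]
```

Here docs[0] holds the first three rows and labels, and docs[1] holds the remaining two.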

Does that answer your question?

danintheory commented on July 26, 2024

Thank you, that was very helpful. So the X matrix can be a dense matrix, where each row is the feature vector of one sample and each column is a different feature? And can the feature vectors be float-valued, with multiple features being nonzero?

I was confused by trying to deconstruct the included conll.py file in the example. I wasn't sure if features had to be translated to a list of strings (such as ["feature1:val1", "feature2:val2"]) and then encoded into a sparse matrix using the FeatureHasher from sklearn. Also, from your documentation, I wasn't sure what this line

"Make sure the training set (X) is one-hot encoded; if more than one feature in X is on, the emission probabilities will be multiplied."

meant in terms of having my feature vectors as floats, many of which are nonzero.

Finally, I wasn't sure (from deconstructing conll.py) whether features from the previous and subsequent sample in the sequence need to be included in the current sample (as is shown in the conll.py example).

Thanks again for all your help!

larsmans commented on July 26, 2024

X may be either a dense array or a sparse matrix. It follows scikit-learn conventions.

Re: one-hot encoding, that's because the HMM is meant to deal with categorical data and each feature should represent the identity of an event as a boolean. I think you should be using a StructuredPerceptron if your data is anything different (sorry, hadn't thought about this earlier, I very seldom use HMMs).
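
As a rough illustration of the one-hot requirement, the sketch below turns a single categorical feature per line into boolean indicator columns, so that exactly one feature is on in each row. The one_hot helper and the line_types values are hypothetical and are not part of seqlearn:

```python
import numpy as np

def one_hot(values):
    """Map a sequence of categorical values to a boolean indicator matrix."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    X = np.zeros((len(values), len(categories)), dtype=bool)
    # Switch on exactly one column per row: the column of that row's category.
    X[np.arange(len(values)), [index[v] for v in values]] = True
    return X, categories

# Hypothetical categorical feature: the kind of content on each line.
line_types = ["tag", "text", "tag", "blank"]
X, categories = one_hot(line_types)
```

Float-valued vectors with several nonzero entries do not fit this shape, which is the case where the StructuredPerceptron suggestion applies.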
