Dataset loading (seqlearn issue, 3 comments, closed)

danintheory avatar danintheory commented on July 26, 2024
Dataset loading

from seqlearn.

Comments (3)

larsmans avatar larsmans commented on July 26, 2024

All lines in the entire set of HTML documents would be one big matrix X. Each row in this matrix is a sample (line). All the labels of all the lines are a single target vector y of the same length (len(y) == X.shape[0]).

The sequence boundaries are passed separately, as an array lengths that contains the length (number of lines) of each sequence (document). Its entries should add up to the number of rows in X.

So, suppose you have a function that computes the features of a single line as a vector (1-d NumPy array):

import numpy as np

def features(line):
    return np.array([feature1(line), feature2(line)])

... then you should be able to construct the input as follows:

X, y, lengths = [], [], []

for doc, labels in training_set:  # labels: one label per line of doc
    lines = doc.splitlines()
    lengths.append(len(lines))
    X.extend(features(line) for line in lines)  # one row per line
    y.extend(labels)

X, y, lengths = map(np.asarray, [X, y, lengths])
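To make the shapes concrete, here is a small self-contained sketch of the same construction on toy data. The two features (line length, presence of a tag) and the two toy documents are made up for illustration, and it assumes each training item carries one label per line.

```python
import numpy as np

# Hypothetical per-line features: character count and whether the line has a tag.
def features(line):
    return np.array([len(line), float("<" in line)])

# Toy training set (an assumption): each item is (document text, one label per line).
training_set = [
    ("<h1>Title</h1>\nSome body text", ["header", "body"]),
    ("<p>a</p>\nplain\n<p>b</p>", ["body", "body", "body"]),
]

X, y, lengths = [], [], []
for doc, labels in training_set:
    lines = doc.splitlines()
    assert len(lines) == len(labels)  # one label per line
    lengths.append(len(lines))
    X.extend(features(line) for line in lines)
    y.extend(labels)

X, y, lengths = map(np.asarray, [X, y, lengths])

# Row offset in X at which each document starts.
starts = np.cumsum(lengths) - lengths
```

Here X has shape (5, 2), y has 5 entries, and lengths is [2, 3]. The three arrays are then what gets passed together to an estimator's fit, along the lines of clf.fit(X, y, lengths).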

Does that answer your question?

danintheory avatar danintheory commented on July 26, 2024

Thank you, that was very helpful. So the X matrix can be a dense matrix, where each row is the feature vector of one sample and each column is a different feature? And the feature vectors can be float-valued, with multiple features being nonzero?

I was confused by trying to deconstruct the included conll.py file in the example. I wasn't sure if features had to be translated to a list of strings (such as ["feature1:val1", "feature2:val2"]) and then encoded into a sparse matrix using the FeatureHasher from sklearn. Also, from your documentation, I wasn't sure what this line

"Make sure the training set (X) is one-hot encoded; if more than one feature in X is on, the emission probabilities will be multiplied."

meant in terms of having my feature vectors as floats, many of which are nonzero.

Finally, I wasn't sure (from deconstructing conll.py) whether features from the previous and subsequent sample in the sequence need to be included in the current sample (as is shown in the conll.py example).

Thanks again for all your help!

larsmans avatar larsmans commented on July 26, 2024

X may be either a dense array or a sparse matrix. It follows scikit-learn conventions.

Re: one-hot encoding, that's because the HMM is meant to deal with categorical data and each feature should represent the identity of an event as a boolean. I think you should be using a StructuredPerceptron if your data is anything different (sorry, hadn't thought about this earlier, I very seldom use HMMs).
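As a rough illustration of what "one-hot" means here, the sketch below turns a column of categorical event names into a boolean matrix with exactly one feature on per row. The event names are invented, and this is plain NumPy rather than anything specific to seqlearn.

```python
import numpy as np

# Hypothetical categorical observations, one event per sample.
events = np.array(["a", "b", "a", "c"])

# Map each distinct event to a column index (vocab is sorted).
vocab, codes = np.unique(events, return_inverse=True)

# One-hot matrix: row i is True only in the column of its event.
X_onehot = codes[:, None] == np.arange(len(vocab))
```

Each row of X_onehot has a single True entry, which is the form the documentation's note about the HMM is asking for. Float-valued rows with several nonzero entries do not fit this pattern, which is why the StructuredPerceptron is suggested for that kind of data.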
