jmschrei / yahmm Goto Github PK

Yet Another Hidden Markov Model repository.

License: MIT License

Python 100.00%

yahmm's Introduction

yahmm

Yet Another Hidden Markov Model library

NOTE: While yahmm is still fully functional, active development has moved over to pomegranate. Please switch over at your convenience.

This module implements Hidden Markov Models (HMMs) with a compositional, graph- based interface. Models can be constructed node by node and edge by edge, built up from smaller models, loaded from files, baked (into a form that can be used to calculate probabilities efficiently), trained on data, and saved.

Implements the forwards, backwards, forward-backward, and Viterbi algorithms, and training by both Baum-Welch and Viterbi algorithms.

Silent states are accounted for, but loops containing all silent states are prohibited. Tied states are also implemented, and handled appropriately in the training of models.

Installation

Since yahmm is on PyPi, installation is as easy as running

pip install yahmm

Contributing

If you would like to contribute a feature then fork the master branch (fork the release if you are fixing a bug). Be sure to run the tests before changing any code. You'll need to have nosetests installed. The following command will run all the tests:

nosetests -w tests/

Let us know what you want to do just in case we're already working on an implementation of something similar. This way we can avoid any needless duplication of effort. Also, please don't forget to add tests for any new functions.

Documentation

See the wiki for documentation of yahmm's functions and design. For real-world usage check out the examples.

Tutorial

For our examples here we're going to make the random number generator deterministic:

>>> random.seed(0)

To use this module, first create a Model, which is the main HMM class:

>>> model = Model(name="ExampleModel")

You then need to populate the Model with State objects. States are constructed from emission distributions; right now a few continuous distributions over floats are available, but new Distribution classes are simple to write. For our example, we will use the UniformDistribution:

>>> distribution = UniformDistribution(0.0, 1.0)

And then construct a state that emits from the distribution:

>>> state = State(distribution, name="uniform")

And another state, emitting from a normal distribution with mean 0 and standard deviation 2:

>>> state2 = State(NormalDistribution(0, 2), name="normal")

If None is used as the distribution when creating a state, that state is a "silent state". Silent states don't emit anything, but are useful for wiring together complex HMMs. By default, a model has two special silent states: a start state Model.start, and an end state Model.end.

Topologies which include cycles of only silent states are prohibited; most HMM algorithms cannot process them.

>>> silent = State(None, name="silent")

We then add states to the HMM with the Model.add_state method:

>>> model.add_state(state)
>>> model.add_state(state2)

You can then add transitions between states, with associated probabilities. Out-edge probabilities are normalized to 1. for every state when the model is baked, not before.

>>> model.add_transition(state, state, 0.4)
>>> model.add_transition(state, state2, 0.4)
>>> model.add_transition(state2, state2, 0.4)
>>> model.add_transition(state2, state, 0.4)

Don't forget transitions in from the start state and out to the end state:

>>> model.add_transition(model.start, state, 0.5)
>>> model.add_transition(model.start, state2, 0.5)
>>> model.add_transition(state, model.end, 0.2)
>>> model.add_transition(state2, model.end, 0.2)

If you want to look at your model, try Model.draw(). Note that this unfortunately cannot plot self loops. If you want to do a better job of drawing the model, the underlying HMM graph is accessible as the graph attribute of the model object.

If you want to compose two Models together, use the Model.add_model() method. Note that you should not try to use the added model separately once you do this. You can also make use of the Model.concatenate_model() method, which will assume you simply want to connect model_a.end to model_b.start with a 1. probability edge.

Once we've finished building our model, we have to bake it. Internally, baking the model generates the transition log probability matrix, and imposes a numerical ordering on the states. If you add more states to the model, you will have to bake it again. Baking is also where edge normalization occurs to ensure that the out-edges for all nodes (except Model.end) sum to 1. Lastly, a simplification of the graph occurs here, merging any silent states which are connected simply by a 1.0 probability edge, as they cannot add value to the graph. You may toggle 'verbose=True' in the bake method to get a log of when either change occurs to your graph.

>>> model.bake()

Now that our model is complete, we can generate an example sequence from it:

>>> sequence = model.sample()
>>> sequence
[0.7579544029403025, 0.25891675029296335, 0.4049341374504143, \
0.30331272607892745, 0.5833820394550312]

And another:

>>> model.sample()
[0.28183784439970383, 0.6183689966753316, -2.411068768608379]

And another:

>>> model.sample()
[0.47214271545271336, -0.5804485412450214]

We can calculate the log probability of the sequence given the model (the log likelihood), summing over all possible paths, using both the forward and backward algorithms. Log probability is reported in nats (i.e. it is natural log probability). Both algorithms return the full table of size len( observations ) x len( states ). For the forward algorithm, the entry at position i, j represents the log probability of beginning at the start of the sequence, and summing over all paths to align observation i to hidden state j. This state can be recovered by pulling it from model.states[j].

>>> model.forward(sequence)
[[       -inf        -inf        -inf  0.        ]
 [-2.37704475 -0.69314718 -2.1322948         -inf]
 [-3.05961307 -1.43914762 -2.86809348        -inf]
 [-3.80752847 -2.1749463  -3.60588302        -inf]
 [-4.53632138 -2.91273584 -4.34219628        -inf]
 [-5.30367664 -3.6490491  -5.08355666        -inf]]

In order to get the log probability of the full sequence given the model, you can write the following:

>>> model.forward(sequence)[ len(sequence), model.end_index ]
-5.0835566645

Or, use a wrapper to get that value by default:

>>> model.log_probability(sequence)
-5.0835566645

The same paradigm is used for the backward algorithm. Indices i, j represent the probability of having aligned observation i to state j and continued aligning the remainder of the sequence till the end.

>>> model.backward(sequence)
[[-5.30670022 -5.30670022        -inf -5.08355666]
 [-4.56069977 -4.56069977        -inf -4.33755622]
 [-3.8249011  -3.8249011         -inf -3.60175755]
 [-3.08711156 -3.08711156        -inf -2.863968  ]
 [-2.3507983  -2.3507983         -inf -2.12765475]
 [-1.60943791 -1.60943791  0.                -inf]]

>>> model.backward(sequence)[ 0, model.start_index ]
-5.0835566645

The forward-backward algorithm is also implemented in a similar manner. It will return a tuple of the estimated transition probabilities given with that sequence and the table of log probabilities of the sum of all paths of the alignment of observation i with state j. Indices i, j represent having started at the beginning of the sequence, aligned observation i to state j, and then continued on to align the remainder of the sequence to the model.

>>> model.forward_backward(sequence)
(array([[-2.03205947, -0.39913252, -1.61932212,        -inf],
       [-2.03481952, -0.40209763, -1.60753724,        -inf],
       [       -inf,        -inf,        -inf,        -inf],
       [-1.85418786, -0.17029029,        -inf,        -inf]]), 
array([[-1.85418786, -0.17029029,  0.        ,  0.        ],
       [-1.80095751, -0.18049206,  0.        ,  0.        ],
       [-1.81108336, -0.17850119,  0.        ,  0.        ],
       [-1.80356301, -0.17997747,  0.        ,  0.        ],
       [-1.82955788, -0.17493035,  0.        ,  0.        ]]))

We can also find the most likely path, and the probability thereof, using the Viterbi algorithm. This returns a tuple of the likelihood under the ML path and the ML path itself. The ML path is in turn a list of tuples of State objects and the number of items in the sequence that had been generated by that point in the path (to account for the presence of silent states).

>>> model.viterbi(sequence)
(-5.9677480204906654, \
[(0, State(ExampleModel-start, None)), \
(1, State(uniform, UniformDistribution(0.0, 1.0))), \
(2, State(uniform, UniformDistribution(0.0, 1.0))), \
(3, State(uniform, UniformDistribution(0.0, 1.0))), \
(4, State(uniform, UniformDistribution(0.0, 1.0))), \
(5, State(uniform, UniformDistribution(0.0, 1.0))), \
(5, State(ExampleModel-end, None))])

Given a list of sequences, we can train our HMM by calling Model.train(). This returns the final log score: the log of the sum of the probabilities of all training sequences. It also prints the improvement in log score on each training iteration, and stops if the improvement gets too small or actually goes negative.

>>> sequences = [sequence]
>>> model.log_probability(sequence)
-5.0835566644993735

>>> log_score = model.train(sequences)
Training improvement: 5.81315226327
Training improvement: 0.156159401683
Training improvement: 0.0806734819188
Training improvement: 0.0506679952827
Training improvement: 0.142593661095
Training improvement: 0.305806209012
Training improvement: 0.301331333752
Training improvement: 0.380117757466
Training improvement: 0.773814416569
Training improvement: 1.58660096053
Training improvement: 0.439182120777
Training improvement: 0.0067603436265
Training improvement: 5.5971526649e-06
Training improvement: 3.75166564481e-12

>>> model.log_probability(sequence)
-4.9533088776424528

In addition to the Baum-Welch algorithm, viterbi training is also included. This training is quicker, but less exact than the Baum-Welch algorithm. It makes the probability of a transition equal to the frequency of seeing that transition in the viterbi path of all the training sequences, and emissions to be the distribution retrained on all obervations tagged with that state in the viterbi path.

Model.train is a wrapper for both the Viterbi and Baum-Welch algorithms, which can be specified with "algorithm='Baum-Welch'" or "algorithm='Viterbi'". The Baum-Welch algorithm can also take min_iterations to do at least that any iterations of Baum-Welch training, and stop_threshold to indicate the log score improvement at which to stop at-- currently set at 1e-9. Viterbi training takes no arguments.

Lastly, tied states are supported in both training algorithms. This is useful if many states are supposed to represent the same underlying distribution, which should be kept the same even upon being retrained. When not tied, these states may diverge slightly from each other. Tying them both keeps them all the same, and increases the amount of training data each distribution gets, to hopefully get a better result.

In order to use a tied state, simply pass the same distribution object into multiple states. See the following example.

# NOT TIED STATES
>>> a = State( NormalDistribution( 5, 2 ), name="A" )
>>> b = State( UniformDistribution( 2, 7 ), name="B" )
>>> c = State( NormalDistribution( 5, 2 ), name="C" )

# A AND C TIED STATES
>>> d = NormalDistribution( 5, 2 )
>>>
>>> a = State( d, name="A" )
>>> b = State( UniformDistribution( 2, 7 ), name="B" )
>>> c = State( d, name="C" )

Once you're done working with your model, you can write it out to a stream with Model.write(), to be read back in later with Model.read().

>>> model.write(sys.stdout)
ExampleModel 4
302687936 ExampleModel-end 1.0 None
302688008 ExampleModel-start 1.0 None
302688080 normal 1.0 NormalDistribution(0.281114738186, 0.022197987893)
302688152 uniform 1.0 UniformDistribution(0.258916750293, 0.75795440294)
uniform uniform 6.02182522366e-25 0.4 302688152 302688152
uniform ExampleModel-end 0.333333333333 0.2 302688152 302687936
uniform normal 0.666666666667 0.4 302688152 302688080
normal uniform 1.0 0.4 302688080 302688152
normal ExampleModel-end 9.71474187173e-184 0.2 302688080 302687936
normal normal 2.59561866186e-45 0.4 302688080 302688080
ExampleModel-start uniform 1.0 0.5 302688008 302688152
ExampleModel-start normal 0.0 0.5 302688008 302688080

This file contains states, and then transitions. The first line is the name of the model and the number of states present. Then, each line contains a single state containing a unique ID, the name, the state weight, and the distribution that the state contains. For the start and end state, this value is None, as they are silent states. Then, the remaining lines contain transitions in the model, formatted by from_state_name, to_state_name, probability, pseudocount, from_state_id, and to_state_id. The IDs are unique tags generated from the memory address of the state, and are needed in case the user names two states with the same name. As an example, the first transition is from the state named uniform to the state named uniform with a very low probability, and the IDs are the same meaning that it is a self loop.

Lets explore the bake method a little more. In addition to finalizing the internal structure of the model, it will normalize out-edge weights, and also merge silent states with a probability 1. edge between them to simplify the model. Lets see this in action.

	model_a = Model( "model_a" )
	s1 = State( NormalDistribution( 25., 1. ), name="S1" )
	s2 = State( NormalDistribution( 13., 1. ), name="S2" )

	model_a.add_state( s1 )
	model_a.add_state( s2 )
	model_a.add_transition( model.start, s1, 0.95 )
	model_a.add_transition( s1, s1, 0.40 )
	model_a.add_transition( s1, s2, 0.50 )
	model_a.add_transition( s2, s1, 0.50 )
	model_a.add_transition( s2, s2, 0.40 )
	model_a.add_transition( s1, model.end, 0.1 )
	model_a.add_transition( s2, model.end, 0.1 )

	model_b = Model( "model_b" )
	s3 = State( NormalDistribution( 34., 1. ), name="S3" )
	s4 = State( NormalDistribution( 45., 1. ), name="S4" )

	model_b.add_state( s3 )
	model_b.add_state( s4 )
	model_b.add_transition( model.start, s3, 1.0 )
	model_b.add_transition( s3, s3, 0.50 )
	model_b.add_transition( s3, s4, 0.30 )
	model_b.add_transition( s4, s4, 0.20 )
	model_b.add_transition( s4, s3, 0.30 )
	model_b.add_transition( s4, model.end, 1.0 )

If at this point we baked model_a and ran it, we'd get the following:

>>> sequence = [ 24.57, 23.10, 11.56, 14.3, 36.4, 33.2, 44.2, 46.7 ]
>>> model_a.bake( verbose=True )
model_a : model_a-start summed to 0.95, normalized to 1.0
>>>
>>> print model_a.forward( sequence )
[[         -inf          -inf          -inf    0.        ]
 [         -inf   -1.01138853   -3.31397363          -inf]
 [ -53.62847425   -4.6516178    -6.95420289          -inf]
 [  -7.30050351  -96.80364706   -9.60308861          -inf]
 [  -9.98073278  -66.15758923  -12.28331787          -inf]
 [-285.59596204  -76.57281849  -78.87540358          -inf]
 [-282.2049042  -112.02804776 -114.33063285          -inf]
 [-600.36013347 -298.18327702 -300.48586211          -inf]
 [-867.64036273 -535.46350629 -537.76609138          -inf]]
>>> 
>>> print model_a.log_probability( sequence )
-537.766091379

By setting verbose=True, we get a log that the out-edges from model.start have been normalized to 1.0. This forward log probability matrix is the same as if we had originally set the transition to 1.0

If instead of the above, we concatenated the models and ran the code, we'd get the following:

>>> sequence = [ 24.57, 23.10, 11.56, 14.3, 36.4, 33.2, 44.2, 46.7 ]
>>> model_a.concatenate_model( model_b )
>>> model_a.bake( verbose=True )
model_a : model_a-end (silent) - model_b-start (silent) merged
model_a : model_a-start summed to 0.95, normalized to 1.0
model_a : S3 summed to 0.8, normalized to 1.0
>>> 
>>> print model_a.forward( sequence )
[[         -inf          -inf          -inf          -inf          -inf		
 -inf            0.]
 [         -inf   -1.01138853          -inf          -inf   -3.31397363	
 	 -inf          -inf]
 [ -63.63791216   -4.6516178           -inf  -53.62847425   -6.95420289	
 	 -inf          -inf]
 [-259.64994142  -96.80364706 -624.65447995   -7.30050351   -9.60308861 
 -624.65447995          -inf]
 [-204.56702714  -66.15758923 -732.79470921   -9.98073278  -12.28331787	
  	 -inf          -inf]
 [ -16.0822564   -76.57281849 -243.44679492 -285.59596204  -78.87540358	
 -243.44679492          -inf]
 [ -17.79119857 -112.02804776  -87.60202419 -282.2049042  -114.33063285 
  -87.60202419          -inf]
 [ -71.20014073 -298.18327702  -20.01096635 -600.36013347 -300.48586211 
  -20.01096635          -inf]
 [-102.77887769 -535.46350629  -23.9843428  -867.64036273 -537.76609138 
   -23.9843428          -inf]]

>>>
>>> print model_a.log_probability( sequence )
-23.9843427976

We see both bake processing operations in effect. Both model_a.start and S3 did not have properly summed out-edges, and needed to have them normalized. But now there was a useless edge between model_a.end and model_b.start due to the concatenate method. This allowed those two states to be merged, speeding up later algorithms. We can also see that the addition of model_b made the sequence significantly more likely given the model, as a sanity check that concatenate_model really did work.

As said above, this module provides a few distributions over floats by default:

UniformDistribution( start, end )

NormalDistribution( mean, std )

ExponentialDistribution( rate )

GammaDistribution( shape, rate ) (Note that this differs from the parameterization used in the random module, even though the parameters have the same names.

InverseGammaDistribution( shape, rate )

GaussianKernelDensity( points, bandwidth, weights=None )

UniformKernelDensity( points, bandwidth, weights=None )

TriangleKernelDensity( points, bandwidth, weights=None )

MixtureDistribution( distributions, weights=None )

The module also provides two other distributions:

DiscreteDistribution( characters ) ( Allows you to pass in a dictionary of key: probability pairs )

LambdaDistribution( lambda_funct ) ( Allows you to pass in an arbitrary function that returns a log probability for a given symbol )

To add a new Distribution, with full serialization and deserialization support, you have to make a new class that inherits from Distribution. That class must have:

* A class-level name attribute that is unique amoung all distributions, and 
  is used for serialization.
* An __init__ method that stores all constructor arguments into 
  self.parameters as a list.
* A log_probability method that returns the log of the probability of the 
  given value under the distribution, and which reads its parameters from 
  the self.parameters list. This module's log() and exp() functions can be 
  used instead of the default Python ones; they handle numpy arrays and 
  "properly" consider the log of 0 to be negative infinity.
* A from_sample method, which takes a Numpy array of samples and an optional
  Numpy array of weights, and re-estimates self.parameters to maximize the 
  likelihood of the samples weighted by the weights. Note that weighted 
  maximum likelihood estimation can be somewhat complicated for some 
  distributions (see, for example, the GammaDistribution here).
* A sample method, which returns a randomly sampled value from the 
  distribution.

The easiest way to define a new distribution is to just copy-paste the UniformDistribution from the module and replace all its method bodies. Any distribution you define can be easily plugged in with other distributions, assuming that it has the correct methods. However, if you write the model and give it to someone else, they might not have the custom distribution.

Here is an example discrete distribution over {True, False}:

>>> class BernoulliDistribution(Distribution):
...     name = "BernoulliDistribution"
...     def __init__(self, p):
...         self.parameters = [p]
...     def log_probability(self, sample):
...         if sample:
...             return log(self.parameters[0])
...         else:
...             return log(1 - self.parameters[0])
...     def from_sample(self, items, weights=None):
...         if weights is None:
...             weights = numpy.ones_like(items, dtype=float)
...         self.parameters = [float(numpy.dot(items, weights)) / len(items)]
...     def sample(self):
...         return random.random() < self.parameters[0]
>>> bernoulli = BernoulliDistribution(0.5)
>>> exp(bernoulli.log_probability(True))
0.5
>>> sample = [bernoulli.sample() for i in xrange(10)]
>>> sample
[False, True, False, True, False, False, True, False, True, False]
>>> bernoulli.from_sample(sample)
>>> bernoulli.write(sys.stdout)
BernoulliDistribution(0.4)

	# Test HMMS
	
	model_a = Model(name="A")
	model_b = Model(name="B")
	
	s1 = State(UniformDistribution(0.0, 1.0), name="S1")
	s2 = State(UniformDistribution(0.5, 1.5), name="S2")
	s3 = State(UniformDistribution(-1.0, 1.0), name="S3")
	
	# Make a simple 2-state model
	model_a.add_state(s1)
	model_a.add_state(s2)
	model_a.add_transition(s1, s1, 0.70)
	model_a.add_transition(s1, s2, 0.25)
	model_a.add_transition(s1, model_a.end, 0.05)
	model_a.add_transition(s2, s2, 0.70)
	model_a.add_transition(s2, s1, 0.25)
	model_a.add_transition(s2, model_a.end, 0.05)
	model_a.add_transition(model_a.start, s1, 0.5)
	model_a.add_transition(model_a.start, s2, 0.5)
	
	# Make another model with that model as a component
	model_b.add_state(s3)
	model_b.add_transition(model_b.start, s3, 1.0)
	model_b.add_model(model_a)
	model_b.add_transition(s3, model_a.start, 1.0)
	model_b.add_transition(model_a.end, model_b.end, 1.0) 
	
	model_b.bake()
	
	print "HMM:"
	print model_b
	
	print "HMM serialization:"
	model_b.write(sys.stdout)
	
	print "A sample from the HMM:"
	print model_b.sample()
	
	print "Forward algorithm:"
	print model_b.forward([]) # Impossible
	print model_b.forward([-0.5, 0.2, 0.2]) # Possible
	print model_b.forward([-0.5, 0.2, 0.2 -0.5]) # Impossible
	print model_b.forward([-0.5, 0.2, 1.2, 0.8]) # Possible
	
	print "Backward algorithm:"
	print model_b.backward([]) # Impossible
	print model_b.backward([-0.5, 0.2, 0.2]) # Possible
	print model_b.backward([-0.5, 0.2, 0.2 -0.5]) # Impossible
	print model_b.backward([-0.5, 0.2, 1.2, 0.8]) # Possible
	
	print "Viterbi:"
	print model_b.viterbi([]) # Impossible
	print model_b.viterbi([-0.5, 0.2, 0.2]) # Possible
	print model_b.viterbi([-0.5, 0.2, 0.2 -0.5]) # Impossible
	print model_b.viterbi([-0.5, 0.2, 1.2, 0.8]) # Possible
	
	# Train on some of the possible data
	print "Training..."
	model_b.train([[-0.5, 0.2, 0.2], [-0.5, 0.2, 1.2, 0.8]], 
		transition_pseudocount=1)
	print "HMM after training:"
	print model_b
	
	print "HMM serialization:"
	model_b.write(sys.stdout)
	
	print "Probabilities after training:"
	print model_b.forward([]) # Impossible
	print model_b.forward([-0.5, 0.2, 0.2]) # Possible
	print model_b.forward([-0.5, 0.2, 0.2 -0.5]) # Impossible
	print model_b.forward([-0.5, 0.2, 1.2, 0.8]) # Possible

yahmm's People

Contributors

Stargazers

Watchers

yahmm's Issues

Code refactoring

Listing out few things. This issue can be updated to make the code more legible and maintainable.

Make the code less coupled. eg. Distributions should be in a folder of their own
Following pep8

pip install does not install dependencies

Attempted pip install yahmm on a Red Hat 64-bit system. Crashed because numpy was not installed, despite numpy being a requirement in the setup file. Problem also seen on a Windows system. Problem not seen on Mac system, where pip install yahmm successfully installed dependencies.

Error installing with pip3 on Mac OS X 10.9.4

Hi,
I got an error when trying to install yahmm with pip3 (Python 3.4.1) on Mavericks.
It looks like Cython could not process the yahmm.pyx file.

Error compiling Cython file:
    ------------------------------------------------------------
    ...
                # list of tuples of sequence, path pairs, not a list of sequences.
                # The log probability sum is the log-sum-exp of the log
                # probabilities of all members of the sequence. In this case,
                # sequences is made up of ( sequence, path ) tuples, instead of
                # just sequences.
                log_probability_sum = reduce( lambda x, y: pair_lse( x, y ),
                                  ^
    ------------------------------------------------------------

    yahmm/yahmm.pyx:3154:31: undeclared name not builtin: reduce
    error: command 'clang' failed with exit status 1

I tried with Cython version 0.20.2 and 0.21.
The version of the required packages are as follow:
Python 3.4.1
pip (1.5.6)
Cython (0.21)
numpy (1.9.0)
scipy (0.14.0)
networkx (1.9)
matplotlib (1.3.1)

Do you have any idea how this could be fixed?
Thanks in advance for your help!

Marc

JSON representation of objects

I think it would be better if we change the underlying representation of all the objects to be just a JSON. For example, a distribution might be something like this:

{
    name: "NormalDistribution",
    parameters: [ 5, 2 ],
    summaries: []
}

and a state might look something like this:

{
    name: "s1",
    distribution: {
            name: "NormalDistribution",
            parameters: [ 5, 2 ],
            summaries: []
        },
    weight: 1.0
}

If we use the default JSON parser, then we can add more parameters at will without changing the read or write functions at all. The only problem is that it means that the text file is less readable.

Tied Edges

When the edges leaving a state should have the same respective probabilities as the edges leaving another state after training, the edges are 'tied'. This can be useful if you're training a repeated structure like a global sequence alignment HMM, and want all of the transitions in each repeated subunit to be the same as the respective transitions of the others, because the probability of deleting or inserting at each position should be the same.

I am unsure how we want to specify that an edge is tied together. Distributions were easy to tie together using the same underlying distribution object.

Length Distribution on States

We should add an ability for states to handle multiple emissions in a conditional manner. Each state should take in a number of observations that it can model, and a distribution on that number of observations.

An example of this is if you're trying to model a gene, with the possibilities of insertions after a codon or deletions of the last nucleotide of the codon. The model would be linear, but with three states at every position; one which matched the codon, one which matched the codon except for last codon, and one which matched the codon plus another nucleotide. This allows you to identify where nucleotides have been added or deleted in a gene, but requires each state looks at 2, 3, or 4 nucleotides respectively.

While this may be simple with DiscreteDistribution, it becomes difficult with continuous distributions, because you're creating a conditional distribution. A solution Adam and I talked about was to implement an appropriate DiscreteDistribution, and a MultivariateGaussian which takes in a covariance matrix, and just have users define their own custom distributions if they need it for anything else.

yahmm Model can't be pickled

IPython QtConsole 3.1.0
Python 2.7.9 |Anaconda 2.1.0 (64-bit)| (default, Mar  9 2015, 16:20:48) 
Type "copyright", "credits" or "license" for more information.

IPython 3.1.0 -- An enhanced Interactive Python.
Anaconda is brought to you by Continuum Analytics.

In [1]: import cPickle as pickle

In [2]: import yahmm as ym

In [3]: m=ym.Model()

In [4]: m.bake()

In [5]: m
Out[5]: <yahmm.yahmm.Model at 0xd8df30>

In [6]: pickle.dump(m,open("ttt.p","wb"))

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-a10217961e3f> in <module>()
----> 1 pickle.dump(m,open("ttt.p","wb"))

/opt/Anaconda/lib/python2.7/copy_reg.pyc in _reduce_ex(self, proto)
     68     else:
     69         if base is self.__class__:
---> 70             raise TypeError, "can't pickle %s objects" % base.__name__
     71         state = base(self)
     72     args = (self.__class__, base, state)

TypeError: can't pickle Model objects

My understanding is that cdef classes can not be automatically pickled by pickle. This can apparently be solved by adding the appropriate __reduce__() function to the class. So perhaps this issue should be labeled as an enhancement request.

Remove dist/* from github

import models from other packages

We should add in functionality to import models from other packages, such as HMMer or HMMoC. This will involve figuring out what format we want to read from these modules, and implement our own parsers to handle their features. I will look into this.

Cannot read models with Multivariate Distributions.

I don't know if the problem is in reading the files or writing. The sample file after writing looks something like this:

zero 5
139840366077272 10611008 1.0 MultivariateDistribution(<map object at 0x7f2f1f1f1a20>)
139840366077416 10611008 1.0 MultivariateDistribution(<map object at 0x7f2f1f1f1828>)
139840366077560 10611008 1.0 MultivariateDistribution(<map object at 0x7f2f1f1f1a20>)
139840366026400 zero-end 1.0 None
139840366024888 zero-start 1.0 None
10611008 10611008 0.9894736842105722 0.3 139840366077272 139840366077272
10611008 10611008 0.010526315789427733 0.7 139840366077272 139840366077416
10611008 10611008 1.8680545612364728e-20 0.4 139840366077416 139840366077416
10611008 10611008 1.0 0.6 139840366077416 139840366077560
10611008 zero-end 0.016666666666564085 0.55 139840366077560 139840366026400
10611008 10611008 0.9833333333334359 0.45 139840366077560 139840366077560
zero-start 10611008 1.0 1.0 139840366024888 139840366077272

And the error I get while using model = Model.read(<file_stream>) is

Traceback (most recent call last):
  File "test_recording.py", line 197, in <module>
    model[i].read(filestream)
  File "yahmm/yahmm.pyx", line 3837, in yahmm.yahmm.Model.read (yahmm/yahmm.c:60161)
  File "yahmm/yahmm.pyx", line 2018, in yahmm.yahmm.State.read (yahmm/yahmm.c:32384)
  File "<string>", line 1
    State( MultivariateDistribution(<map object at 0x7f2f1f1f1a20>), name='10611008', weight=1.0, identity='139840366077272' )
                                    ^

Create examples of commonly used HMMs

Would be good to have some complete code snippets for commonly used HMM examples, such as the following:

Discrete emission HMMs

Rainy-sunny example from Wikipedia

Continuous emission HMMs

Density mixtures

Maybe even an illustrative example of each type from Bayes Net toolbox

MixtureDistribution log_probability slightly wrong

Pointed out by Hwanchul Yoo:

def log_probability( self, symbol ):
        """
        What's the probability of a given float under this mixture? It's
        the log-sum-exp of the distances from the symbol calculated under all
        distributions. Currently in python, not cython, to allow for ducktyping
        of both numeric and not-necessarily-numeric distributions. 
        """

        (d, w), n = self.parameters, len(self.parameters)
        return _log( numpy.sum([ cexp( d[i].log_probability(symbol) ) \
            * w[i] for i in xrange(n) ]) )

This code should be (d, w), n = self.parameters, len(self.parameters[0]). Currently it will ignore any distributions past the second. A trivial fix, but I'm leaving a note for myself so I don't forget.

Training fails on 1000 samples from sunny-rainy example

Notebook here

NB: I had modified sample as per #8 to sample 1000 sequence length.

Enhancements for sample method

Currently no way to sample 'x' samples from the model. Moreover, if we keep the transition to end states very low, the program would take very long to terminate.

An easy solution could be to modify the definition of sample method to include a parameter nsamples. For convenience, it can be set default to np.inf. The while loop could be modified ever so slightly.

while (state != self.end_index) and (sample_count < nsamples):

where sample_count is the number of samples so far generated.

Secondly, the sample method only gives the emitted state sequence. It is easy to also give the hidden state sequence, which can be very useful! Maybe that should be added to the output.

The forward-backward algorithm output order is random

I was in the process of reproducing Sean Eddy's toy 5' splice site recognition model to add the example list when I noticed that the order of the columns in the DP matrix returned by the forward-backward algorithm is random with no way of differentiating the columns. Here's a schematic of Eddy's example:

Here's what I have so far:

import yahmm as hmm, numpy as np, seaborn as sns


# Using the example from Eddy 2004
# http://www.nature.com/nbt/journal/v22/n10/abs/nbt1004-1315.html

# Construct the states
state_e = hmm.State(hmm.DiscreteDistribution({'A':0.25, 
                                              'C':0.25,
                                              'G':0.25,
                                              'T':0.25}), name='E')
state_5 = hmm.State(hmm.DiscreteDistribution({'A':0.05, 
                                              'C':0,
                                              'G':0.95,
                                              'T':0}), name='5')
state_i = hmm.State(hmm.DiscreteDistribution({'A':0.4, 
                                              'C':0.1,
                                              'G':0.1,
                                              'T':0.4}), name='I')

model = hmm.Model("naive 5' ss recognition")

# Add transition information
model.add_transition(model.start, state_e, 1)
model.add_transition(state_e, state_e, 0.9)
model.add_transition(state_e, state_5, 0.1)
model.add_transition(state_5, state_i, 1)
model.add_transition(state_i, state_i, 0.9)
model.add_transition(state_i, model.end, 0.1)

# Finalize the model
model.bake()
seq = 'CTTCATGTGAAAGCAGACGTAAGTCA'

print(seq)
# ML path
print(''.join(state[1].name for state in model.viterbi(list(seq))[1][1:-1]))

# Get transition probability of entering the second state at each index
probs = [np.exp(prob[1]) for prob in model.forward_backward(list(seq))[1]]

This last line is the problem. I had originally assumed that the second column would always be the transition probabilities of entering the second state, but this was naive. The order of the output can be variable as long as this is documented and if there is a way of distinguishing the columns. As far as I'm aware, there is no way of distinguishing the differences between the columns in the DP matrix and, as a result, no way of determining the transition probabilities of entering a specific state. Is this correct?

The desired output is to get the transition probability into State 5 at each position in the sequence so I can plot something like the bar graph at the bottom of the schematic.

Distribution Inertia

Currently, the parameters of distributions are overwritten when trained, ignoring the initial values of the distributions. In the same way that we have distribution inertia, we need to implement distribution inertia. This seems somewhat trivial to do, though. In the from_summaries case, we simply add in the current parameters as a summary and appropriately reweight the samples so that the inertia value works. For the from_sample method, we do somewhat the same thing, where we end up with a summary, and then have the parameters be some percentage the old parameters and some percentage the new ones.

I can probably get this working soon.

Numpy import error in _log_probability()

When I call log_probability() on a yahmm.Model, I get the following error:

Exception NameError: "global name 'np' is not defined" in 'yahmm.yahmm.Model._log_probability' ignored

Please let me know if there's any more info I could send your way.

Using setup.py develop

Why: http://stackoverflow.com/questions/19048732/python-setup-py-develop-vs-install
How : http://stackoverflow.com/questions/20996639/why-does-setup-py-develop-not-work

Forward-Backward handling on impossible sequences

As discovered in #10 , forward-backward will return -INF when given an impossible sequence, as opposed to returning a matrix. Is this okay, or should it return something of the matrix form so that it doesn't crash someones program?

Split into Multiple Files

yahmm.pyx is getting to be pretty big. We should split this into multiple files. I need to figure out how to import a cython file into another cython file without duplicating the code we have in __init__.py.

Factoring Out Distributions

As my research has changed, I've begun using Bayesian networks more and more. In order to fully understand and appreciate them, I'm working on a rough implementation of linear Bayesian networks named Yet Another Bayesian Network (take that, reviewers). However, there are many similarities, including all of the distributions we implement. I suggest that we remove all the distributions from YAHMM and put them in their own library (Yet Another Statistics Library?). That way, not just our projects, but anyone's projects, can use the fully-featured distributions we've made. It would also significantly reduce the size of the code. What are your thoughts, @adamnovak ?

Set up Travis build system?

Fyi, the build status indicator in the README is broken. Have you setup Travis to build yahmm? I set the link to point to jmschrei/yahmm so it should work.

Retrieving the posterior probabilities?

It is often informative to compute the posterior probability of being in a specific state at each position within a series of observations. The implemented _maximum_a_posteriori method does almost exactly this, but it returns the most probable state at observation k instead of the probability of being in state S_i at observation k where S_1,S_2,...,S_i,...,S_n are the n states of the HMM (Rabiner discusses both on p. 263, Eq. 26-29).

Sometimes having something like this:

provides new insight into the model's behavior and is quite commonly used in sequence analysis. I think having both would be useful and they could use mostly the same logic internally.

Reference:
Lawrence R. Rabiner: A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE 77(2) p.257-286, 1989. http://www.cs.ubc.ca/~murphyk/Bayes/rabiner.pdf

force state.name to be unique

I had added state.identity as a way for people to use repeated names for their states, but I think this overcomplicates the issue. We should remove this, and force people to use uniquely named states. This will change I/O, and so break back-compatibility again. Perhaps adding a reader of old files would be warranted, given this.

Rename bin folder as examples

summary statistics for training

The current training involves creating a matrix with all observed symbols on one axis and the character generating states on the other axis. This can be extremely slow with either very long sequences, or many sequences, or both. This can be made better if, for every sequence, summary statistics were calculated for each of the states. The new emissions would be calculated for each state, with a weight based on how likely they are. Then, when all sequences are analyzed, the distributions would be trained on these summary statistics. The problem is that each distribution class now has to have a summary statistics method which will return the summary statistic, and a separate train method which will take in a matrix of summary statistics and train on them. This can't be implemented until all distributions have an appropriate summary statistic method. I'm going to take a look at this next week.

Infinite HMMs

The next update will include infinite HMMs. I have already updated forward, backward, and sampling to allow for infinite HMMs. A HMM is classified as an infinite HMM simply by not having any edges to the end state, and you can check the type of the HMM through Model.is_infinite().

Parameters not Normalized

A suggestion to the package was to not force parameters to sum to 1. Currently people can implement their own distributions, so I think maybe adding an option to make edges not normalize could be added.

support distributions not having a training function

Currently, ExtremeValueDistribution does not have a training function. This is because I haven't found a way to train generalized extreme value distributions--mostly due to a lack of looking.

I am implementing inertia in #27 and it seems trivial to do. I am proposing that instead of simply raising a NotImplementedError by default, by default the distribution acts as if it has an inertia=1.0 if from_sample is not implemented, which is to ignore any possible training data. This could raise a warning as well, to let people know that a distribution was not trained.

@adamnovak what do you think?

Random number weirdness

I've been having this "bug" for a little bit, wanted to see if anyone else knew about it.

When I write test code, I seed random.seed(0). I will then randomly generate a sequence to test, with the assumption that the sequence will be the same each time, since I set the seed.

Occasionally, what will happen is that the first time I run a program, I will get sequence A, then every other time I will get sequence B, just by rerunning the code. All yahmm operations function appropriately, it's just that the random seed different. If I modify yahmm.pyx in any way (even to add comments), I will get sequence A again, then sequence B every other time.

Any thoughts?

Bake fails with `ValueError: Buffer dtype mismatch, expected 'int' but got 'long'`

Trace

In [1]: run example.py 
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/home/nipun/git/yahmm/bin/example.py in <module>()
     32 model.add_transition(state2, model.end, 0.2)
     33 
---> 34 model.bake()
     35 sequence = model.sample()
     36 print sequence

/home/nipun/miniconda/lib/python2.7/site-packages/yahmm/yahmm.so in yahmm.yahmm.Model.bake (yahmm/yahmm.c:19319)()

ValueError: Buffer dtype mismatch, expected 'int' but got 'long'

yahmm not importing correctly on python 3.4

I can successfully install yahmm using pip3, but none of the modules are loaded on import. The install works for python2.

osx 10.10.1
Cython (0.21.1)
numpy (1.9.1)
scipy (0.14.0)
networkx (1.9.1)
matplotlib (1.4.0)
yahmm (1.1.1)

Add support for Python 3?

Python 3 support would be really nice. Almost my entire tool set has transitioned to Python 3 except for yahmm. I will see if I can start putting something together this weekend.

Use git-tags to delimit releases?

Just an idea, but I think it would be nice to use git tags for releases. It would make it easy to quickly see what has happened since the last release. Github has a nice interface for it too: https://github.com/jmschrei/yahmm/tags

Very slow training with multivariate normal distributions

I'm trying to use yahmm to classify motions. Each motion has approx. 12.000 samples, where each sample has 43 features. I use 10 states for the HMM, each with a multivariate Gaussian distribution. The code that I use to create states looks something like this:

# Add states
std = np.sqrt(np.abs(covar.diagonal()))
normals = [[yahmm.NormalDistribution(mean[i], std[i]) for i in range(n_features)] for _ in range(n_states)]
distributions = [yahmm.MultivariateDistribution(normals[i]) for i in range(n_states)]
states = [yahmm.State(distributions[i]) for i in range(n_states)]
model.add_states(states)

However, once I try to train the model with baum-welch, the performance tanks. After having a look at the implementation, this is not really a surprise since every distribution is trained separately. Is there anything I can do to speed this up? I have even debated to implement my own MultivariateNormalDistribution class to make the training more efficient, but wanted to ask here first for help.

Any input is greatly appreciated!

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.