This problem is of Natural Language Processing (NLP) variant in which the task is part-of-speech (POS) tagging. By implementing two separate Bayes Nets (simplified, and HMM), the program aims to identify each word in a given sentence with its corresponding part of speech (verb, adjective, noun, determiner, adposition, etc.). Here, I will point out that a variety of words can be assigned different parts of speech depending on how the word is used in context. Thus the program built works to make sense of the meaning of words in a given sentence, and this necessarily implies it also understand the relationship between pairs/sets of words. A training corpus as well as small and larger test corpus is provided. Each line holds a sentence. The sentence is made up of two tuples: the first, a tuple of words, and the second, a tuple of corresponding POS tags. There are 12 POS tags: adjective, adverb, adposition, conjunction , determiner, noun, number, pronoun, particle, verb, x (foreign word), and punctuation mark.
posterior
calculates the log of the posterior probability of a given sentence with a given POS tag for both simple and HMM models. For the HMM the probability consists of two parts: (1) the probability of word emitting a POS; (2) the probability of transitioning from one state to another. For the transition from one state to another, a calculation from start to current state, and a calculation for the end state.
Training is done in train
where the emission and transition default dictionaries are created. Here, the program loops through all (word, POS) pairs, and outputs the two dictionaries. For the emissions, a given word is appended to its given POS; and for the transitions, POS is appended to a given state. Dictionaries are also created to hold emission counts and transition counts.
The simplified model is created in simplified
with sentence as the parameter. For each word from the input, the program determines what POS tag is most likely. Looping through each sentence I predefine a default best_pos as "noun", and set best_prob = 0 to start. Then loop through all possible POS, and find the probability by dividing the number of times a word appears into the number of times the POS appeared. If the probability is greater than the best probability so far, then the best probability is updated; and the associated POS. The function returns a list of results holding best POS. The HMM model implements Viterbi algorithm. hmm_viterbi
function holds two lists: the Viterbi probabilities assigned to viterbi_prob
, and the Viterbi path assigned to viterbi_path
. Dictionaries for probabilities: prob_dict
, and path dictionary path_dict
are created. For each word in a given sentence, the program loops through possible POS’s, and calculates emission probabilities similar to the simplified model, and adds in the transition probabilities calculation.
For every (word, POS) combination the program loops through the previous word “choices” (viterbi_prob[-1].keys
), and calculates all emission and transition probabilities. The log probability of going into any possible state is emissions (a given word came from a given POS) plus transitions (a given POS came from a previous POS). This is a “local” calculation. Where Viterbi comes in is the next log_prob calculation that acknowledges that the program could’ve come from many different states. There is a check to identify which one of the given paths is doing best in its own right. This is the cumulative probability of a word and a given state coming from its history. If the new log_prob is better than the best probability (initialized as -inf), then lob_prob is used to update the best probability. Similarly, best_state
(initialized as None) is updated with the current state. Then the program appends the best probability to prob_dict
(the cumulative probability of having come to the current POS overall). The same is done for best_state
, and appended to path_dict
. These dictionaries are appended to viterbi_prob
and viterbi_path
respectively. This provides the forward run through Viterbi.
The last piece of Viterbi is calculating the probability of transitioning from a given state to the end state. Recall, the program knows there is an end sate, so it takes the Viterbi probabilities plus the log probabilities of going from a given state to the end, and then takes the best one. That is the correct answer for the last state. And finally, the program reverses the Viterbi path.