
pos_hmm

Problem Type

This is a Natural Language Processing (NLP) problem in which the task is part-of-speech (POS) tagging. By implementing two separate Bayes nets (a simplified model and an HMM), the program aims to label each word in a given sentence with its corresponding part of speech (verb, adjective, noun, determiner, adposition, etc.). Note that many words can take different parts of speech depending on how they are used in context, so the program works to make sense of the meaning of words in a given sentence, which necessarily implies it also understands the relationships between pairs or sets of words.

A training corpus as well as a small and a larger test corpus are provided. Each line holds one sentence, represented as two tuples: the first a tuple of words, and the second a tuple of the corresponding POS tags. There are 12 POS tags: adjective, adverb, adposition, conjunction, determiner, noun, number, pronoun, particle, verb, x (foreign word), and punctuation mark.

posterior calculates the log of the posterior probability of a given sentence with a given POS tagging, for both the simple and HMM models. For the HMM the probability consists of two parts: (1) the emission probability, i.e. the probability of a word given its POS tag; and (2) the probability of transitioning from one state to the next. The transition part includes a calculation from the start state into the current state and a calculation for the end state.
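To make the two terms concrete, the HMM log posterior could be computed roughly as follows (a minimal sketch, not the repository's exact code; the dictionary layout emission_prob[tag][word] / transition_prob[prev][cur], the "start"/"end" pseudo-states, and the 1e-10 floor for unseen events are all assumptions):

```python
import math

def hmm_log_posterior(sentence, tags, emission_prob, transition_prob):
    """Log P(words, tags) under the HMM: emission terms plus transition terms,
    including start -> first tag and last tag -> end."""
    log_p = math.log(transition_prob["start"].get(tags[0], 1e-10))            # start transition
    for i, (word, tag) in enumerate(zip(sentence, tags)):
        log_p += math.log(emission_prob[tag].get(word, 1e-10))                # emission term
        if i > 0:
            log_p += math.log(transition_prob[tags[i - 1]].get(tag, 1e-10))   # tag-to-tag transition
    log_p += math.log(transition_prob[tags[-1]].get("end", 1e-10))            # end-state transition
    return log_p
```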

Algorithm used

Training is done in train, where the emission and transition default dictionaries are created. The program loops through all (word, POS) pairs and fills the two dictionaries: for the emissions, a given word is recorded under its POS; for the transitions, each POS is recorded under the state that precedes it. Dictionaries are also created to hold emission counts and transition counts.

The simplified model is implemented in simplified, which takes a sentence as its parameter. For each word in the input, the program determines which POS tag is most likely. Looping through the sentence, it starts with a default best_pos of "noun" and best_prob = 0, then loops through all possible POS tags and computes each probability by dividing the number of times the word appears with that POS by the number of times the POS appears overall. If the probability is greater than the best probability so far, the best probability and the associated POS are updated. The function returns a list of results holding the best POS for each word (see the sketch below).

The HMM model implements the Viterbi algorithm. The hmm_viterbi function holds two lists: the Viterbi probabilities in viterbi_prob and the Viterbi path in viterbi_path. For each word it creates a probability dictionary, prob_dict, and a path dictionary, path_dict. For each word in a given sentence, the program loops through the possible POS tags, calculates emission probabilities as in the simplified model, and adds the transition probability calculation. For every (word, POS) combination the program loops through the previous word's "choices" (viterbi_prob[-1].keys()) and calculates all emission and transition probabilities. The log probability of going into any possible state is the emission term (a given word came from a given POS) plus the transition term (a given POS came from the previous POS). This is a "local" calculation. Where Viterbi comes in is the next log_prob calculation, which acknowledges that the program could have come from many different states: there is a check to identify which of the candidate paths is doing best in its own right, i.e. the cumulative probability of a word and a given state coming from its history. If the new log_prob is better than the best probability (initialized as -inf), then log_prob is used to update the best probability; similarly, best_state (initialized as None) is updated with the current state. The program then stores the best probability in prob_dict (the cumulative probability of having reached the current POS overall) and best_state in path_dict, and appends these dictionaries to viterbi_prob and viterbi_path respectively. This provides the forward run through Viterbi.

The last piece of Viterbi is calculating the probability of transitioning from a given state to the end state. Since the program knows there is an end state, it takes the Viterbi probabilities plus the log probabilities of going from each state to the end, and keeps the best one; that is the correct answer for the last state. Finally, the program reverses the Viterbi path to recover the tag sequence in order.
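As a concrete illustration, the simplified tagger described above might look like the following sketch (the count dictionaries emission_count[pos][word] and pos_count[pos] are assumed names, not necessarily those used in the repository):

```python
def simplified(sentence, tags, emission_count, pos_count):
    """Pick, for each word independently, the POS tag with the highest
    estimated P(word | POS); default to "noun" for unseen words."""
    results = []
    for word in sentence:
        best_pos, best_prob = "noun", 0.0
        for pos in tags:
            # Times `word` was seen with `pos`, divided by times `pos` was seen at all.
            prob = emission_count[pos].get(word, 0) / pos_count[pos]
            if prob > best_prob:
                best_prob, best_pos = prob, pos
        results.append(best_pos)
    return results
```

And a sketch of the Viterbi forward pass, the end-state step, and the final backtrack, under the same assumptions about the probability dictionaries (emission_prob[tag][word], transition_prob[prev][cur], "start"/"end" pseudo-states, and a 1e-10 floor standing in for whatever smoothing the actual code uses):

```python
import math

def hmm_viterbi(sentence, tags, emission_prob, transition_prob):
    viterbi_prob, viterbi_path = [], []

    for i, word in enumerate(sentence):
        prob_dict, path_dict = {}, {}
        for pos in tags:
            emission = math.log(emission_prob[pos].get(word, 1e-10))
            if i == 0:
                # First word: only the transition out of the start state matters.
                prob_dict[pos] = emission + math.log(transition_prob["start"].get(pos, 1e-10))
                path_dict[pos] = None
            else:
                best_prob, best_state = float("-inf"), None
                for prev in viterbi_prob[-1]:
                    # Cumulative probability of reaching `pos` here via `prev`.
                    log_prob = (viterbi_prob[-1][prev]
                                + math.log(transition_prob[prev].get(pos, 1e-10))
                                + emission)
                    if log_prob > best_prob:
                        best_prob, best_state = log_prob, prev
                prob_dict[pos] = best_prob
                path_dict[pos] = best_state
        viterbi_prob.append(prob_dict)
        viterbi_path.append(path_dict)

    # Add the transition from each candidate final tag into the end state, keep the best.
    last = {pos: p + math.log(transition_prob[pos].get("end", 1e-10))
            for pos, p in viterbi_prob[-1].items()}
    best_last = max(last, key=last.get)

    # Backtrack through the stored path dictionaries, then reverse.
    result = [best_last]
    for path_dict in reversed(viterbi_path[1:]):
        result.append(path_dict[result[-1]])
    return list(reversed(result))
```

With emission_prob and transition_prob estimated from the training counts, this returns one tag per word, matching the forward run plus reversed path described above.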

