Coder Social home page Coder Social logo

triptych / probabilistic-earley-parser-javascript Goto Github PK

View Code? Open in Web Editor NEW

This project forked from digitalheir/probabilistic-earley-parser-javascript

0.0 2.0 0.0 154 KB

๐ŸŽฒ An efficient implementation of a probabilistic Context Free Grammar parser in Javascript

License: MIT License

TypeScript 98.54% JavaScript 1.46%

probabilistic-earley-parser-javascript's Introduction

Build Status npm version License

Probabilistic Earley parser

This is a library for parsing a sequence of tokens (like words) into tree structures, along with the probability that the particular sequence generates that tree structure. This is mainly useful for linguistic purposes, such as morphological parsing, speech recognition and generally information extraction. It also finds applications in computational biology.

For example:

tokens parse tree
[i, want, british, food] i want british food
tokens parse tree
GGGC``UAUU``AGCU``CAGU
UGGU``UAGA``GCGC``ACCC
CUGA``UAAG``GGUG``AGGU
CGCU``GAUU``CGAA``UUCA
GCAU``AGCC``CA
rna secondary structure

This library allows you to do these things efficiently, as long as you can describe the rules as a Context-free Grammar (CFG).

The innovation of this library with respect to the many other parsing libraries is that this one allows the production rules in your grammar to have a probability attached to them. That is: it parses Stochastic Context-free Grammars. This allows us to make better choices in case of ambiguous sentences: we can order them by probability. If you do not need probabilities attached to your parse trees, you are probably better off using nearley instead.

For a theoretical grounding of this work, refer to Stolcke; An Efficient Probabilistic Context-Free Parsing Algorithm that Computes Prefix Probabilities.

Motivation

While libraries for nondeterministic grammars abound, I could not find an existing JavaScript implementation of the Probabilistic Earley Parser. I have made a stochastic CYK parser before, but I wanted something more top down that makes it easier to intervene in the parsing process, for instance when an unexpected token is encountered. In many cases Earley also parses faster than CYK (sparse grammars) and it doesn't require the grammar to be rewritten in any normal form.

Usage

Get the most likely parse tree (the Viterbi parse) for the sentence "the man chases the man with a stick":

import {getViterbiParse, Grammar} from 'probabilistic-earley-parser';
import treeify from 'treeify';

// Nonterminals are string
const S = "S"; // : NonTerminal 
const NP = "NP"; // : NonTerminal 
const VP = "VP"; // : NonTerminal 
const TV = "TV"; // : NonTerminal 
const Det = "Det"; // : NonTerminal 
const N = "N"; // : NonTerminal 
const Mod = "Mod"; // : NonTerminal 

// Terminals are functions that should return true when the parameter is of given type
const transitiveVerb = (token) => !!token.match(/(hit|chased)/); // : Terminal<string>
const the = (token) => !!token.match(/the/i);// : Terminal<string> 
const a = (token) => !!token.match(/a/i);// : Terminal<string> 
const man = (token) => !!token.match(/man/);// : Terminal<string> 
const stick = (token) => !!token.match(/stick/);// : Terminal<string> 
const with_ = (token) => !!token.match(/with/);// : Terminal<string> 

const grammar = Grammar.builder("test") //: Grammar<string,number> 
    .addNewRule(
        1.0,   // Probability between 0.0 and 1.0, defaults to 1.0. The builder takes care of converting it to the semiring element
        S,     // Left hand side of the rule
        [NP, VP] // Right hand side of the rule
    )
    // NP -> Det N (1.0)
    .addNewRule(
        1.0,
        NP,
        [Det, N] // eg. The man
    )
    // NP -> Det N Mod (1.0)
    .addNewRule(
        1.0,
        NP,
        [Det, N, Mod] // eg. The man (with a stick)
    )
    // VP -> TV NP Mod (0.4)
    .addNewRule(
        0.4,
        VP,
        [TV, NP, Mod] // eg. (chased) (the man) (with a stick)
    )
    // VP -> TV NP (0.6)
    .addNewRule(
        0.6,
        VP,
        [TV, NP] // eg. (chased) (the man with a stick)
    )
    .addNewRule(1.0, Det, [a])
    .addNewRule(1.0, Det, [the])
    .addNewRule(1.0, N, [man])
    .addNewRule(1.0, N, [stick])
    .addNewRule(1.0, TV, [transitiveVerb])
    .addNewRule(1.0, Mod, [with_, NP]) // eg. with a stick
    .build();

const tokens = ["The", "man", "chased", "the", "man", "with", "a", "stick"];
const viterbi = getViterbiParse(
    S,
    grammar,
    tokens
); // : ParseTreeWithScore<string>

console.log(viterbi.probability); // 0.6

/*
0.6
โ””โ”€ S
   โ”œโ”€ NP
   โ”‚  โ”œโ”€ Det
   โ”‚  โ”‚  โ””โ”€ The
   โ”‚  โ””โ”€ N
   โ”‚     โ””โ”€ man
   โ””โ”€ VP
      โ”œโ”€ TV
      โ”‚  โ””โ”€ chased
      โ””โ”€ NP
         โ”œโ”€ Det
         โ”‚  โ””โ”€ the
         โ”œโ”€ N
         โ”‚  โ””โ”€ man
         โ””โ”€ Mod
            โ”œโ”€ with
            โ””โ”€ NP
               โ”œโ”€ Det
               โ”‚  โ””โ”€ a
               โ””โ”€ N
                  โ””โ”€ stick
*/
function printTree(tree) {
  function makeTree(o){if(o.children && o.children.length > 0){const obj = {};
        for(var i=0;i<o.children.length;i++){
            const name = o.children[i].token?o.children[i].token:o.children[i].category;
            obj[name] = makeTree(o.children[i]);
        }
        return obj;
    }else {if(o.token) {return o.token;}
    else {return o.category;}}
  }
  console.log(treeify.asTree(makeTree(tree)));
}

printTree(viterbi.parseTree);

You may pass a function to the parser with an addition probability multiplier for parsed tokens for additional logic that is hard to capture in a grammar. It is also possible to define predict, scan and complete callbacks, but not currently implemented. (Pull requests welcome!)

Some notes on implementation

Written in TypeScript, published as a commonjs module on NPM (ES6; use --harmony_collections flag if your Node version is < 6) and a single-file minified UMD module on Github in vulgar ES5.

This is an implementation of a probabilistic Earley parsing algorithm, which can parse any Probabilistic Context Free Grammar (PCFG) (also known as Stochastic Context Free Grammar (SCFG)), or equivalently any language described in Backus-Naur Form (BNF). In these grammars, rewrite rules may be non-deterministic and have a probability attached to them.

The probability of a parse is defined as the product of the probalities all the applied rules. Usually, we define probability as a number between 0 and 1 inclusive, and use common algebraic notions of addition and multiplication.

This code makes it possible to use any semiring for computing scores. My use for this is to avoid arithmetic underflow: imagine a computation like 0.1 * 0.1 * ... * 0.1. At some point, floating point arithmetic will be unable to represent a number so small. To counter, we use the Log semiring which holds the minus log of the probability. So that maps the numbers 0 and 1 to the numbers between infinity and zero, skewed towards lower probabilities:

Graph plot of f(x) = -log(x)

Graph for f(x) = -log x

Runtime complexity

The Earley algorithm has nice complexity properties. In particular, it can parse:

  • any CFG in O(nยณ),
  • unambiguous CFGs in O(nยฒ)
  • left-recursive unambiguous grammars in O(n)

Note that this implementation does not apply innovations such as Joop Leo's improvement to run linearly on on right-recursive grammars as well. It might be complicated to implement this: making the parser stochastic is not as easy for Earley as it is for CYK.

For a faster parser that work on non-probabilistic grammars, look into nearley.

Limitations

  • I have not provisioned for ฮต-rules (rules with an empty right hand side)
  • Rule probability estimation may be performed using the inside-outside algorithm, but is not currently implemented
  • Higher level concepts such as wildcards, * and + are not implemented
  • Viterbi parsing (querying the most likely parse tree) only returns one single parse. In the case of an ambiguous sentence in which multiple dervation have the highest probability, the returned parse is not guaranteed the left-most parse (I think).

License

This software is licensed under a permissive MIT license.

References

Stolcke, Andreas. "An efficient probabilistic context-free parsing algorithm that computes prefix probabilities." Computational linguistics 21.2 (1995): 165-201. APA

Contact

Inquiries go to [email protected]

probabilistic-earley-parser-javascript's People

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.