danielkornev / relevance-based-on-parse-trees

This project forked from bgalitsky/relevance-based-on-parse-trees

Sentence- and paragraph-level relevance and applications

License: Other

Java 99.55% HTML 0.45%

relevance-based-on-parse-trees's Introduction

OpenNLP.Similarity Component

It is a project under Apache OpenNLP which subjects the results of parsing, part-of-speech tagging and rhetoric parsing to machine learning. It is leveraged in search, content generation & enrichment, chat bots and other text processing domains where relevance assessment is key.

What is new?

You can try a chatbot with the virtual dialogue feature

via UI

Or in a command-line mode:

ssh [email protected]

pw = chatbot123

cd home

java -jar cb_virtDialog.jar

Then you can chat with the bot from the Terminal

A brief 2 min intro

5 min video on Virtual Chatbot in financial domain

10 min video on Virtual Chatbot in financial domain

A longer video where users solve a nontrivial problem

What is OpenNLP.Similarity?

OpenNLP.Similarity is an NLP engine which solves a number of text processing and search tasks based on OpenNLP and Stanford NLP parsers. It is designed to be used by a non-linguist software engineer to build linguistically-enabled:

  • search engines
  • chat bots
  • recommendation systems
  • dialogue systems
  • text analysis and semantic processing engines
  • data-loss prevention systems
  • content & document generation tools
  • text writing style, authenticity, sentiment, sensitivity to sharing recognizers
  • general-purpose deterministic inductive learner equipped with abductive, deductive and analogical reasoning which also embraces concept learning and tree kernel learning.

OpenNLP.Similarity provides a series of techniques to support the overall content pipeline, from text collection to cleaning, classification, personalization and distribution. The technology and implementation of the content pipeline developed at eBay are described here.

Installation

  1. Run git clone to set up the environment, including resources. Besides what you get from git, the /resources directory requires some additional work, described in the steps below.

  2. Download the main jar.

  3. Place all necessary jars in the /lib folder. Larger jars are not in git, so please download them from the Stanford NLP site:

  • edu.mit.jverbnet-1.2.0.jar
  • ejml-0.23.jar
  • joda-time.jar
  • jollyday.jar
  • stanford-corenlp-3.5.2-models.jar
  • xom.jar
  The rest of the jars are available via Maven.

  4. Set up the src/test/resources directory:

    • new_vn.zip needs to be unzipped
    • OpenNLP models need to be downloaded into the directory 'models' from here

    As a result, the following folders should be in /resources.

    As obtained from git:

  • /new_vn (VerbNet)
  • /maps (some lookup files such as products, brands, first names etc.)
  • /external_rst (examples of import of rhetoric parses from other systems)
  • /fca (Formal Concept Analysis learning)
  • /taxonomies (for search support, taxonomies are auto-mined from the web)
  • /tree_kernel (for tree kernel learning, representation of parse trees, thickets and trained models)
    Manual downloading is also required for:

  • /new_vn
  • /w2v (where the word2vector model needs to be downloaded, if desired)

  5. Try running the tests, which will give you a hint on how to integrate OpenNLP.Similarity functionality into your application. You can start with the Matcher test and observe how long paragraphs can be linguistically matched (you can compare this with just an intersection of keywords).

  6. Look at the example POMs for how to better integrate into your existing project.

    Creating a simple project

    Create a project from MyMatcher.java.

    Running a chat bot

    First you need to set the resource directory. The simplest way is to download and unzip it from here. Then get a chat bot jar from here. To run it (once the resource directory is set):

    java -jar cbjar

    It will take you to the prompt ">" to type your query. An example session is in [examples](https://github.com/bgalitsky/relevance-based-on-parse-trees/tree/master/examples/botSessionExample.txt). The entry point for chat bot integration is here.

    Engines and Systems of OpenNLP.Similarity

    Main relevance assessment function

    It takes two texts and returns the cardinality of the maximum common subgraph of the representations of these texts. This measure is supposed to be much more accurate than keyword statistics, compositional semantic models, or word2vec, because linguistic structure is taken into account, not just co-occurrences of keywords. The Matcher class in the [matching package](https://github.com/bgalitsky/relevance-based-on-parse-trees/tree/master/src/main/java/opennlp/tools/parse_thicket/matching) has the

    List<List<ParseTreeChunk>> assessRelevance(String para1, String para2)

    function, which returns the list of common phrases between these paragraphs.

    To avoid re-parsing the same strings and improve the speed, use

    List<List<ParseTreeChunk>> assessRelevanceCache(String para1, String para2)

    It operates on the level of sentences (giving maximal common subtree) and paragraphs (giving maximal common sub-parse thicket). Maximal common sub-parse thicket is also represented as a list of common phrases.
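To make the notion of "a list of common phrases" more concrete, here is a toy, self-contained sketch of generalizing two POS-tagged phrases. It is not the library's algorithm: the `PhraseGeneralizer` class, the `Token` type, the `POS-word` input notation and the `*` wildcard are all assumptions made for illustration. Tokens are aligned by a longest common subsequence over POS tags; an aligned pair keeps its word when both words agree and becomes `*` otherwise.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy phrase generalization; NOT the OpenNLP.Similarity implementation. */
final class PhraseGeneralizer {

    /** A word with its part-of-speech tag (hypothetical helper type). */
    static final class Token {
        final String pos;
        final String word;

        Token(String pos, String word) {
            this.pos = pos;
            this.word = word;
        }

        @Override
        public String toString() {
            return pos + "-" + word;
        }
    }

    /** Parses "DT-the NN-camera"; assumes every item contains a '-' after the tag. */
    static List<Token> parse(String tagged) {
        List<Token> tokens = new ArrayList<>();
        for (String item : tagged.trim().split("\\s+")) {
            int dash = item.indexOf('-');
            tokens.add(new Token(item.substring(0, dash), item.substring(dash + 1)));
        }
        return tokens;
    }

    /** Longest common subsequence by POS tag; differing words are replaced by "*". */
    static List<Token> generalize(List<Token> a, List<Token> b) {
        int n = a.size();
        int m = b.size();
        // lcs[i][j] = length of the best POS alignment of a[i..] and b[j..]
        int[][] lcs = new int[n + 1][m + 1];
        for (int i = n - 1; i >= 0; i--) {
            for (int j = m - 1; j >= 0; j--) {
                if (a.get(i).pos.equals(b.get(j).pos)) {
                    lcs[i][j] = 1 + lcs[i + 1][j + 1];
                } else {
                    lcs[i][j] = Math.max(lcs[i + 1][j], lcs[i][j + 1]);
                }
            }
        }
        // Walk the table from the start to recover one optimal alignment.
        List<Token> result = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < n && j < m) {
            Token x = a.get(i);
            Token y = b.get(j);
            if (x.pos.equals(y.pos)) {
                String word = x.word.equalsIgnoreCase(y.word) ? x.word : "*";
                result.add(new Token(x.pos, word));
                i++;
                j++;
            } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
                i++;
            } else {
                j++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Token> common = generalize(
                parse("DT-the JJ-digital NN-camera IN-with NN-zoom"),
                parse("DT-a NN-camera IN-with JJ-fast NN-lens"));
        System.out.println(common); // prints [DT-*, NN-camera, IN-with, NN-*]
    }
}
```

The sketch works on flat phrases only; the actual component operates on parse trees and parse thickets, so treat this purely as an intuition aid.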

    Search engine

    The following set of functionalities is available to enable search with linguistic features. It is desirable when the query is long (more than 4 keywords), logically complex, or ambiguous.

  • Search results re-ranker based on linguistic similarity
  • Request handler for SOLR which uses parse tree similarity
  • Taxonomy builder via learning from the web
  • Answer verifier based on the rhetoric map of an answer: if parts of the answer are located in distinct discourse units, the answer might be irrelevant even if all keywords are mapped
  • Tree kernel learning re-ranker to improve search relevance within a given domain with a pre-trained model

    SOLR request handlers are available here.

    The taxonomy builder is here. Examples of pre-built taxonomies are available in this directory. Note the taxonomies built for languages other than English. A music taxonomy is an example of the seed data for taxonomy building, and this taxonomy hashmap dump is a good example of what can be automatically constructed. A paper on taxonomy learning is here.

    Search results re-ranker

    Re-ranking scores the similarity between the question and each answer in a given orderedListOfAnswers:

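    Matcher m = new Matcher();
    // Assumption: the scorer is instantiated directly; adapt this to your setup.
    ParseTreeChunkListScorer parseTreeChunkListScorer = new ParseTreeChunkListScorer();

    List<Pair<String,Double>> pairList = new ArrayList<Pair<String,Double>>();

    for (String ans: orderedListOfAnswers) {
        // Common phrases between the question and this answer (parses are cached)
        List<List<ParseTreeChunk>> similarityResult = m.assessRelevanceCache(question, ans);
        // Aggregate the common phrases into a single relevance score
        double score = parseTreeChunkListScorer.getParseTreeChunkListScoreAggregPhraseType(similarityResult);
        Pair<String,Double> p = new Pair<String, Double>(ans, score);
        pairList.add(p);
    }

    // Sorts the pairs in ascending order of score
    Collections.sort(pairList, Comparator.comparing(p -> p.getSecond()));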

    pairList is then ranked according to the linguistic relevance score. This score can be combined with other sources such as popularity, geo-proximity and others.
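One possible way to combine the linguistic score with another signal is a weighted sum. The sketch below is only an illustration under stated assumptions: the `CombinedRanker` and `Candidate` names, the linear weighting and the idea that popularity is normalized to [0, 1] are all hypothetical and not part of OpenNLP.Similarity.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical blend of a linguistic relevance score with a popularity signal. */
final class CombinedRanker {

    static final class Candidate {
        final String answer;
        final double linguisticScore; // e.g. the parse tree similarity score
        final double popularity;      // assumed to be normalized to [0, 1]

        Candidate(String answer, double linguisticScore, double popularity) {
            this.answer = answer;
            this.linguisticScore = linguisticScore;
            this.popularity = popularity;
        }
    }

    /** Weighted sum of the two signals; the weights are tuning parameters. */
    static double combined(Candidate c, double wLinguistic, double wPopularity) {
        return wLinguistic * c.linguisticScore + wPopularity * c.popularity;
    }

    /** Returns the answers ordered from the highest to the lowest combined score. */
    static List<String> rank(List<Candidate> candidates, double wLinguistic, double wPopularity) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble(
                (Candidate c) -> combined(c, wLinguistic, wPopularity)).reversed());
        List<String> answers = new ArrayList<>();
        for (Candidate c : sorted) {
            answers.add(c.answer);
        }
        return answers;
    }
}
```

Because the linguistic score is not bounded in this sketch, the weights would need to be tuned (or the score normalized) before the two signals are comparable.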

    Content generator

    It takes a topic, builds a taxonomy for it and forms a table of contents. It then mines the web for documents for each table of contents item, finds relevant sentences and paragraphs and merges them into a document package. The resultant document has a TOC, sections, figures & captions and also a reference section. We attempt to reproduce how humans cut-and-paste content from the web while writing on a topic. Content generation has a demo; to run it from an IDE, start here. Examples of written documents are here.

    Another content generation option is about opinion data. Reviews are mined, cross-bred and made "original" for search engines. This and general content generation are done for SEO purposes. The review builder composes fake reviews, which in turn should be recognized by a fake review detector.

    Text classifier / feature detector in text

    The classifier code is the same but the model files vary for the applications below:

  • detect security leaks
  • detect argumentation
  • detect low cohesiveness in text
  • detect authors’ doubt and low confidence
  • detect fake reviews

    Document classification into six major classes {finance, business, legal, computing, engineering, health} is available via a nearest-neighbor model. A Lucene training model (1G file) is obtained from the Wikipedia corpus. This classifier can be trained for arbitrary classes once the respective Wiki pages are selected and the respective Lucene index is built. Once proper training documents are selected from Wikipedia with adequate coverage, the accuracy is usually higher than what can be achieved by word2vec classification models.
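The nearest-neighbor idea can be sketched without Lucene. The example below swaps the Lucene index for an in-memory bag-of-words model with cosine similarity, so it illustrates the principle only; the `NearestNeighborClassifier` class and its methods are hypothetical names, and the tiny training texts are made up.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy 1-nearest-neighbor text classifier (in-memory stand-in for a Lucene index). */
final class NearestNeighborClassifier {

    private final List<Map<String, Integer>> vectors = new ArrayList<>();
    private final List<String> labels = new ArrayList<>();

    /** Stores one labeled training document as a term-count vector. */
    void train(String label, String text) {
        vectors.add(termCounts(text));
        labels.add(label);
    }

    /** Returns the label of the most similar training document, or null if untrained. */
    String classify(String text) {
        Map<String, Integer> query = termCounts(text);
        String best = null;
        double bestSimilarity = -1.0;
        for (int i = 0; i < vectors.size(); i++) {
            double similarity = cosine(query, vectors.get(i));
            if (similarity > bestSimilarity) {
                bestSimilarity = similarity;
                best = labels.get(i);
            }
        }
        return best;
    }

    /** Lowercases the text and counts alphanumeric tokens. */
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Cosine similarity of two sparse term-count vectors. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            normA += e.getValue() * e.getValue();
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }
}
```

In the real component the training documents come from Wikipedia pages per class, so coverage of the training set matters far more than in this toy setup.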

    The general-purpose deterministic inductive learner implements J.S. Mill's method of induction and abduction (deduction is also partially implemented).

    Inductive learning implemented as a base for syntactic tree-based learning is similar to the family of approaches such as Explanation-based Learning and Inductive Logic Programming.

    Tree-kernel learning

    Tree-kernel learning is integrated to allow application of SVM learning to sentence-level and paragraph-level linguistic data, including discourse. Unlike learning in numerical space, each dimension in tree kernel learning is an occurrence of a particular subtree. Similarity is not a numerical distance but a count of common subtrees. A set of parse trees for individual sentences that represents a paragraph is called a parse thicket. Its representation as a graph is coded in a tree representation via parentheses, such as [model*.txt](https://github.com/bgalitsky/relevance-based-on-parse-trees/blob/master/src/test/resources/tree_kernel/model_pos_neg_sentiment.txt). To do model building and predictions, C modules are run in this directory, so the proper choice needs to be made: {svm_classify.linux, svm_classify.max, svm_classify.exe, svm_learn.*}. Also, proper run permissions need to be set for these files.
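The "count of common subtrees" can be illustrated with a simplified kernel. The sketch below is an assumption-laden toy, not the kernel used by the bundled C modules: it only considers complete subtrees (a node together with all of its descendants) and computes the dot product of the two subtree-occurrence vectors. The `CommonSubtreeKernel` class and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy kernel over complete subtrees of bracketed parse trees. */
final class CommonSubtreeKernel {

    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();

        Node(String label) {
            this.label = label;
        }
    }

    /** Parses a bracketed tree such as "(NP (DT the) (NN cat))". */
    static Node parse(String tree) {
        String[] tokens = tree.replace("(", " ( ").replace(")", " ) ").trim().split("\\s+");
        return parse(tokens, new int[] {0});
    }

    private static Node parse(String[] tokens, int[] pos) {
        if (tokens[pos[0]].equals("(")) {
            pos[0]++;                                   // consume "("
            Node node = new Node(tokens[pos[0]++]);     // constituent label
            while (!tokens[pos[0]].equals(")")) {
                node.children.add(parse(tokens, pos));
            }
            pos[0]++;                                   // consume ")"
            return node;
        }
        return new Node(tokens[pos[0]++]);              // leaf word
    }

    /** Serializes each internal node's complete subtree and counts occurrences. */
    private static String collect(Node node, Map<String, Integer> counts) {
        if (node.children.isEmpty()) {
            return node.label; // bare words are not counted as subtrees
        }
        StringBuilder sb = new StringBuilder("(").append(node.label);
        for (Node child : node.children) {
            sb.append(' ').append(collect(child, counts));
        }
        String key = sb.append(')').toString();
        counts.merge(key, 1, Integer::sum);
        return key;
    }

    /** Dot product of the complete-subtree occurrence vectors of two trees. */
    static int kernel(String treeA, String treeB) {
        Map<String, Integer> a = new HashMap<>();
        Map<String, Integer> b = new HashMap<>();
        collect(parse(treeA), a);
        collect(parse(treeB), b);
        int value = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            value += e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        return value;
    }
}
```

For example, "(S (NP (DT the) (NN cat)) (VP (VBZ sleeps)))" and "(S (NP (DT the) (NN cat)) (VP (VBZ runs)))" share the three subtrees rooted at DT, NN and NP, so the kernel value is 3. Kernels used in practice also count partial subtrees, which this sketch deliberately leaves out.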

    Concept learning

    Concept learning is a branch of deterministic learning which is applied to attribute-value pairs and, unlike statistical and deep learning, is explainable. It is useful for data exploration and visualization since all interesting relations can be visualized. Concept learning covers inductive and abductive learning and also some cases of deduction. Explore this package for the concept learning-related features.
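Since the resources include a Formal Concept Analysis directory, a small sketch of the two FCA derivation operators may help to see why concept learning over attribute-value data is explainable. The `FormalContext` class and its methods are hypothetical names chosen for this illustration and do not mirror the package's API.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy formal context: objects described by sets of attributes. */
final class FormalContext {

    private final Map<String, Set<String>> attributesOf = new LinkedHashMap<>();

    void add(String object, String... attributes) {
        attributesOf.put(object, new TreeSet<>(Arrays.asList(attributes)));
    }

    /** Intent: the attributes shared by all of the given objects. */
    Set<String> intent(Collection<String> objects) {
        Set<String> common = null;
        for (String object : objects) {
            Set<String> attrs = attributesOf.getOrDefault(object, Collections.<String>emptySet());
            if (common == null) {
                common = new TreeSet<>(attrs);
            } else {
                common.retainAll(attrs);
            }
        }
        if (common == null) {
            // Empty object set: every attribute is shared vacuously.
            common = new TreeSet<>();
            for (Set<String> attrs : attributesOf.values()) {
                common.addAll(attrs);
            }
        }
        return common;
    }

    /** Extent: the objects that have all of the given attributes. */
    Set<String> extent(Collection<String> attributes) {
        Set<String> objects = new TreeSet<>();
        for (Map.Entry<String, Set<String>> e : attributesOf.entrySet()) {
            if (e.getValue().containsAll(attributes)) {
                objects.add(e.getKey());
            }
        }
        return objects;
    }
}
```

A pair (objects, attributes) is a formal concept when `intent(objects)` equals the attributes and `extent(attributes)` equals the objects. Each learned concept is therefore directly readable as "these objects are exactly the ones sharing these attributes".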

    Filtering results for Speech Recognition based on semantic meaningfulness

    It takes results from a speech-to-text system and subjects them to [filtering](https://github.com/bgalitsky/relevance-based-on-parse-trees/blob/master/src/main/java/opennlp/tools/similarity/apps/SpeechRecognitionResultsProcessor.java). Those recognized candidate words which do not make sense together are filtered out, based on the frequency of co-occurrences found on the web.
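The filtering idea can be sketched as follows. This is a hypothetical illustration rather than the logic of SpeechRecognitionResultsProcessor: the `CooccurrenceFilter` class is invented here, an in-memory map of word-pair counts stands in for co-occurrence frequencies obtained from the web, and the rule (every adjacent word pair must reach a minimum count) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Toy filter for speech recognition candidates based on word-pair frequencies. */
final class CooccurrenceFilter {

    private final Map<String, Long> pairCounts; // stand-in for web co-occurrence counts
    private final long minCount;

    CooccurrenceFilter(Map<String, Long> pairCounts, long minCount) {
        this.pairCounts = pairCounts;
        this.minCount = minCount;
    }

    /** A candidate passes if every adjacent word pair is frequent enough. */
    boolean isMeaningful(String candidate) {
        String[] words = candidate.toLowerCase().trim().split("\\s+");
        for (int i = 0; i + 1 < words.length; i++) {
            long count = pairCounts.getOrDefault(words[i] + " " + words[i + 1], 0L);
            if (count < minCount) {
                return false;
            }
        }
        return true; // single-word candidates have no pairs and pass trivially
    }

    /** Keeps only the candidates that pass the co-occurrence check, in input order. */
    List<String> filter(List<String> candidates) {
        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            if (isMeaningful(candidate)) {
                kept.add(candidate);
            }
        }
        return kept;
    }
}
```

With counts such as "buy milk" = 500 and "by milk" = 2 and a threshold of 10, the candidate "remember to by milk" would be dropped while "remember to buy milk" would be kept, provided the other pairs are frequent enough.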

    Related Research

    Here's the link to the book on question-answering

    and research papers.

    Also the recent book related to reasoning and linguistics in humans & machines

    Configuring OpenNLP.Similarity component

    The VerbNet model is included by default, so that the hand-coded meanings of verbs are used when similarity between verb phrases is computed.

    To include word2vector model, download it and make sure the following path is valid: resourceDir + "/w2v/GoogleNews-vectors-negative300.bin.gz"
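Before enabling word2vector, it can be worth checking that the model file is where the component expects it. The helper below is a minimal sketch: the `Word2VecPathCheck` class and its methods are hypothetical, and only the relative path is taken from the text above.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper that checks for the word2vector model under a resource directory. */
final class Word2VecPathCheck {

    /** Relative location of the model, as stated in the configuration notes. */
    static final String W2V_RELATIVE = "/w2v/GoogleNews-vectors-negative300.bin.gz";

    /** Builds the expected model path as resourceDir + relative location. */
    static Path modelPath(String resourceDir) {
        return Paths.get(resourceDir + W2V_RELATIVE);
    }

    /** True only if the expected path exists and is a regular file. */
    static boolean modelAvailable(String resourceDir) {
        return Files.isRegularFile(modelPath(resourceDir));
    }

    public static void main(String[] args) {
        // Pass your own resource directory as the first argument.
        String resourceDir = args.length > 0 ? args[0] : "src/test/resources";
        System.out.println(modelPath(resourceDir) + " available: " + modelAvailable(resourceDir));
    }
}
```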

    Contributors

    bgalitsky
