spencermountain / compromise

modest natural-language processing

Home Page: http://compromise.cool

License: MIT License

JavaScript 99.75% TypeScript 0.19% HTML 0.06%
nlp part-of-speech named-entity-recognition

compromise's Introduction

compromise
modest natural language processing
npm install compromise

don't you find it strange,
    how easy text is to make,

     and how hard it is to actually parse and use?

compromise tries its best to turn text into data.
it makes limited and sensible decisions.
it's not as smart as you'd think.

import nlp from 'compromise'

let doc = nlp('she sells seashells by the seashore.')
doc.verbs().toPastTense()
doc.text()
// 'she sold seashells by the seashore.'

don't be fancy, at all:
if (doc.has('simon says #Verb')) {
  return true
}

grab parts of the text:
let doc = nlp(entireNovel)
doc.match('the #Adjective of times').text()
// "the blurst of times?"

and get data:

import plg from 'compromise-speech'
nlp.extend(plg)

let doc = nlp('Milwaukee has certainly had its share of visitors..')
doc.compute('syllables')
doc.places().json()
/*
[{
  "text": "Milwaukee",
  "terms": [{ 
    "normal": "milwaukee",
    "syllables": ["mil", "wau", "kee"]
  }]
}]
*/

avoid the problems of brittle parsers:

let doc = nlp("we're not gonna take it..")

doc.has('gonna') // true
doc.has('going to') // true (implicit)

// transform
doc.contractions().expand()
doc.text()
// 'we are not going to take it..'

and whip stuff around like it's data:

let doc = nlp('ninety five thousand and fifty two')
doc.numbers().add(20)
doc.text()
// 'ninety five thousand and seventy two'

-because it actually is-

let doc = nlp('the purple dinosaur')
doc.nouns().toPlural()
doc.text()
// 'the purple dinosaurs'

Use it on the client-side:

<script src="https://unpkg.com/compromise"></script>
<script>
  var doc = nlp('two bottles of beer')
  doc.numbers().minus(1)
  document.body.innerHTML = doc.text()
  // 'one bottle of beer'
</script>

or likewise:

import nlp from 'compromise'

var doc = nlp('London is calling')
doc.verbs().toNegative()
// 'London is not calling'

compromise is ~250kb (minified).

it's pretty fast. It can run on keypress.

it works mainly by conjugating all forms of a basic word list.

The final lexicon is ~14,000 words.

you can read more about how it works, here. it's weird.

okay -

compromise/one

A tokenizer of words, sentences, and punctuation.

import nlp from 'compromise/one'

let doc = nlp("Wayne's World, party time")
let data = doc.json()
/* [{ 
  normal:"wayne's world party time",
    terms:[{ text: "Wayne's", normal: "wayne" }, 
      ...
      ] 
  }]
*/

compromise/one splits your text up, wraps it in a handy API,

    and does nothing else -

/one is quick - most sentences take a 10th of a millisecond.

It can do ~1mb of text a second - or 10 wikipedia pages.

Infinite Jest takes 3s.

You can also parallelize, or stream text to it with compromise-speed.

compromise/two

A part-of-speech tagger, and grammar-interpreter.

import nlp from 'compromise/two'

let doc = nlp("Wayne's World, party time")
let str = doc.match('#Possessive #Noun').text()
// "Wayne's World"

compromise/two automatically calculates the very basic grammar of each word.

this is more useful than people sometimes realize.

Light grammar helps you write cleaner templates, and get closer to the information.

compromise has 83 tags, arranged in a handsome graph.

#FirstName → #Person → #ProperNoun → #Noun

you can see the grammar of each word by running doc.debug()

you can see the reasoning for each tag with nlp.verbose('tagger').

if you prefer Penn tags, you can derive them with:

let doc = nlp('welcome thrillho')
doc.compute('penn')
doc.json()

compromise/three

Phrase and sentence tooling.

import nlp from 'compromise/three'

let doc = nlp("Wayne's World, party time")
let str = doc.people().normalize().text()
// "wayne"

compromise/three is a set of tooling to zoom into and operate on parts of a text.

.numbers() grabs all the numbers in a document, for example - and extends it with new methods, like .subtract().

When you have a phrase, or group of words, you can see additional metadata about it with .json()

let doc = nlp('four out of five dentists')
console.log(doc.fractions().json())
/*[{
    text: 'four out of five',
    terms: [ [Object], [Object], [Object], [Object] ],
    fraction: { numerator: 4, denominator: 5, decimal: 0.8 }
  }
]*/
let doc = nlp('$4.09CAD')
doc.money().json()
/*[{
    text: '$4.09CAD',
    terms: [ [Object] ],
    number: { prefix: '$', num: 4.09, suffix: 'cad'}
  }
]*/

API

compromise/one

Output
  • .text() - return the document as text
  • .json() - return the document as data
  • .debug() - pretty-print the interpreted document
  • .out() - a named or custom output
  • .html({}) - output custom html tags for matches
  • .wrap({}) - produce custom output for document matches
Utils
  • .found [getter] - is this document empty?
  • .docs [getter] - get term objects as json
  • .length [getter] - count the # of characters in the document (string length)
  • .isView [getter] - identify a compromise object
  • .compute() - run a named analysis on the document
  • .clone() - deep-copy the document, so that no references remain
  • .termList() - return a flat list of all Term objects in match
  • .cache({}) - freeze the current state of the document, for speed-purposes
  • .uncache() - un-freezes the current state of the document, so it may be transformed
  • .freeze({}) - prevent any tags from being removed, in these terms
  • .unfreeze({}) - allow tags to change again, as default
Accessors
Match

(match methods use the match-syntax.)

  • .match('') - return a new Doc, with this one as a parent
  • .not('') - return all results except for this
  • .matchOne('') - return only the first match
  • .if('') - return each current phrase, only if it contains this match ('only')
  • .ifNo('') - Filter-out any current phrases that have this match ('notIf')
  • .has('') - Return a boolean if this match exists
  • .before('') - return all terms before a match, in each phrase
  • .after('') - return all terms after a match, in each phrase
  • .union() - return combined matches without duplicates
  • .intersection() - return only duplicate matches
  • .complement() - get everything not in another match
  • .settle() - remove overlaps from matches
  • .growRight('') - add any matching terms immediately after each match
  • .growLeft('') - add any matching terms immediately before each match
  • .grow('') - add any matching terms before or after each match
  • .sweep(net) - apply a series of match objects to the document
  • .splitOn('') - return a Document with three parts for every match ('splitOn')
  • .splitBefore('') - partition a phrase before each matching segment
  • .splitAfter('') - partition a phrase after each matching segment
  • .join() - merge any neighbouring terms in each match
  • .joinIf(leftMatch, rightMatch) - merge any neighbouring terms under given conditions
  • .lookup([]) - quick find for an array of string matches
  • .autoFill() - create type-ahead assumptions on the document
Tag
  • .tag('') - Give all terms the given tag
  • .tagSafe('') - Only apply tag to terms if it is consistent with current tags
  • .unTag('') - Remove this tag from the given terms
  • .canBe('') - return only the terms that can be this tag
Case
Whitespace
  • .pre('') - add this punctuation or whitespace before each match
  • .post('') - add this punctuation or whitespace after each match
  • .trim() - remove start and end whitespace
  • .hyphenate() - connect words with hyphen, and remove whitespace
  • .dehyphenate() - remove hyphens between words, and set whitespace
  • .toQuotations() - add quotation marks around these matches
  • .toParentheses() - add brackets around these matches
Loops
  • .map(fn) - run each phrase through a function, and create a new document
  • .forEach(fn) - run a function on each phrase, as an individual document
  • .filter(fn) - return only the phrases that return true
  • .find(fn) - return a document with only the first phrase that matches
  • .some(fn) - return true or false if there is one matching phrase
  • .random(fn) - sample a subset of the results
Insert
Transform
Lib

(these methods are on the main nlp object)

compromise/two:

Contractions

compromise/three:

Nouns
Verbs
Numbers
Sentences
Adjectives
Misc selections

.extend():

This library comes with a considerate, common-sense baseline for English grammar.

You're free to change, or lay waste to, any settings - which is the fun part, actually.

the easiest change is just to suggest tags for any given words:

let myWords = {
  kermit: 'FirstName',
  fozzie: 'FirstName',
}
let doc = nlp(muppetText, myWords)

or make heavier changes with a compromise-plugin.

import nlp from 'compromise'
nlp.extend({
  // add new tags
  tags: {
    Character: {
      isA: 'Person',
      notA: 'Adjective',
    },
  },
  // add or change words in the lexicon
  words: {
    kermit: 'Character',
    gonzo: 'Character',
  },
  // change inflections
  irregulars: {
    get: {
      pastTense: 'gotten',
      gerund: 'gettin',
    },
  },
  // add new methods to compromise
  api: View => {
    View.prototype.kermitVoice = function () {
      this.sentences().prepend('well,')
      this.match('i [(am|was)]').prepend('um,')
      return this
    }
  },
})

Docs:

gentle introduction:
Documentation:
| Concepts | API | Plugins |
|---|---|---|
| Accuracy | Accessors | Adjectives |
| Caching | Constructor-methods | Dates |
| Case | Contractions | Export |
| Filesize | Insert | Hash |
| Internals | Json | Html |
| Justification | Character Offsets | Keypress |
| Lexicon | Loops | Ngrams |
| Match-syntax | Match | Numbers |
| Performance | Nouns | Paragraphs |
| Plugins | Output | Scan |
| Projects | Selections | Sentences |
| Tagger | Sorting | Syllables |
| Tags | Split | Pronounce |
| Tokenization | Text | Strict |
| Named-Entities | Utils | Penn-tags |
| Whitespace | Verbs | Typeahead |
| World data | Normalization | Sweep |
| Fuzzy-matching | Typescript | Mutation |
| Root-forms | | |
Talks:
Articles:
Some fun Applications:
Comparisons

Plugins:

These are some helpful extensions:

Dates

npm install compromise-dates

Stats

npm install compromise-stats

Speech

npm install compromise-speech

Wikipedia

npm install compromise-wikipedia


Typescript

we're committed to typescript/deno support, both in main and in the official plugins:

import nlp from 'compromise'
import stats from 'compromise-stats'

const nlpEx = nlp.extend(stats)

nlpEx('This is type safe!').ngrams({ min: 1 })

Limitations:

  • slash-support: We currently split slashes up as different words, like we do for hyphens. So things like this don't work: nlp('the koala eats/shoots/leaves').has('koala leaves') // false

  • inter-sentence match: By default, sentences are the top-level abstraction. Inter-sentence, or multi-sentence matches aren't supported without a plugin: nlp("that's it. Back to Winnipeg!").has('it back') // false

  • nested match syntax: the dangerous beauty of regex is that it can recurse indefinitely. Our match syntax is much weaker. Things like this are not (yet) possible: doc.match('(modern (major|minor))? general'). Complex matches must be achieved with successive .match() statements.

  • dependency parsing: Proper sentence transformation requires understanding the syntax tree of a sentence, which we don't currently do. We should! Help wanted with this.

FAQ

    ☂️ Isn't javascript too...

      yeah it is!
      it wasn't built to compete with NLTK, and may not fit every project.
      string processing is synchronous too, and parallelizing node processes is weird.
      See here for information about speed & performance, and here for project motivations

    💃 Can it run on my arduino-watch?

      Only if it's water-proof!
      Read quick start for running compromise in workers, mobile apps, and all sorts of funny environments.

    🌎 Compromise in other Languages?

    ✨ Partial builds?

      we do offer a tokenize-only build, which has the POS-tagger pulled-out.
      but otherwise, compromise isn't easily tree-shaken.
      the tagging methods are competitive, and greedy, so it's not recommended to pull things out.
      Note that without full POS-tagging, the contraction-parser won't work perfectly. ((spencer's cool) vs. (spencer's house))
      It's recommended to run the library fully.

See Also:

MIT

compromise's People

Contributors

axaysushir, creatorrr, davidbuhler, fdawgs, fmacpro, himself65, ilyankou, jakeii, jaredreisinger, jfemia, johnyesberg, kahwee, kelvinhammond, khtdr, kiran-rao, leoseccia, lostfictions, marketingpip, markherhold, nloveladyallen, rotemdan, ryancasburn-kai, scagood, shamoons, silentrob, soyjavi, spencermountain, srubin, thegoatherder, wallali


compromise's Issues

Prepositions (IN)

Hey there,

when I initially compared the data, it ignored the prepositions (IN) due to a typo, and our db splits them into pre- and postpositions. I've got that sorted now and found some prepositions which are listed in other categories:
"before": is also CC,
"round": is also JJ,
"apart": is also RB (but can be preposition "apart from this" OR postposition "this apart")

And we list some prepositions which are not in the array yet (did NOT check other categories):

[
{en: 'a'},
{en: 'an'},
{en: 'abaft'},
{en: 'abeam'},
{en: 'aboard'},
{en: 'absent'},
{en: 'afore'},
{en: 'alongside'},
{en: 'amidst'},
{en: 'amongst'},
{en: 'anenst'},
{en: 'apropos'},
{en: 'apud'},
{en: 'aside'},
{en: 'astride'},
{en: 'athwart'},
{en: 'atop'},
{en: 'barring'},
{en: 'beneath'},
{en: 'beside'},
{en: 'beyond'},
{en: 'but'},
{en: 'chez'},
{en: 'circa'},
{en: 'concerning'},
{en: 'excluding'},
{en: 'failing'},
{en: 'following'},
{en: 'for'},
{en: 'forenenst'},
{en: 'given'},
{en: 'including'},
{en: 'inside'},
{en: 'like'},
{en: 'mid'},
{en: 'midst'},
{en: 'minus'},
{en: 'modulo'},
{en: 'near'},
{en: 'next'},
{en: 'notwithstanding'},
{en: 'opposite'},
{en: 'outside'},
{en: 'pace'},
{en: 'past'},
{en: 'plus'},
{en: 'pro'},
{en: 'qua'},
{en: 'regarding'},
{en: 'sans'},
{en: 'save'},
{en: 'times'},
{en: 'toward'},
{en: 'underneath'},
{en: 'unto'},
{en: 'worth'},
{en: 'together', description: 'questionable'},
{en: 'vis-à-vis', description: 'questionable'},
{en: 'thru', description: 'informal', meta: {entitySubstitution: ['en']}},
{en: 'thruout', description: 'informal', meta: {entitySubstitution: ['en']}},
{en: 'till', description: 'same as "until", wikipedia: "with prosodic restrictions"'},
{en: 'versus', description: 'NAB conflict: commonly abbreviated as "vs.", or (law or sports) as "v."'},
{en: 'vice', description: 'used as "in place of"'},
{en: 'with', description: 'sometimes written as "w/"'},
{en: 'w/', meta: {entitySubstitution: ['en']}},
{en: 'within', description: 'sometimes written as "w/in" or "w/i"'},
{en: 'w/in', meta: {entitySubstitution: ['en']}},
{en: 'w/i', meta: {entitySubstitution: ['en']}},
{en: 'without', description: 'sometimes written as "w/o"'},
{en: 'w/o', meta: {entitySubstitution: ['en']}},
{en: 'o\'', description: 'apocopic form of "of"', meta: {entitySubstitution: ['en']}}
]

btw - a nice one: https://www.youtube.com/watch?t=108&v=MHX-CiJBVy0

PP - compared to fork

Hey,
sorry for opening a new one (trying to separate the different questions):
This is about possessive pronouns (PP), which were not covered in the initial comparison (same reason: they split into different categories in our db - I made the db compatible now; when you look for PP it'll join all 3 categories). I am pasting the question I just added to the code (not sure if there are better expressions for the cases):

    // TODO - this covers more than the original :
    // possessive pronouns (should) have 3 forms :

    // as possessive (adjective) determiner pronoun (my) OR
    // as possessive (noun) pronoun (mine) OR
    // as a reflexive pronoun (myself)

What do you think?

btw: some changes you proposed in contributing.md have also made it into the fork.
E.g. it now has JSDoc documentation (WIP, standard template for now) ...

Adding new Names for recognition using spot?

I looked through a variety of files, but haven't found either a list or a method of where I can append known people/organization names for recognition using .spot -- does this exist and I'm just missing it?

Sentence boundary detection.

Very nice library.

When playing with some text pulled from a web article, noticed that the sentence boundary does not always work.

For example, the text below does not split sentences correctly.

The man who tried to kill former Pope John Paul II 33 years ago showed up at the Vatican on Saturday to put white roses on his tomb and said he wanted to meet Pope Francis.Mehmet Ali Agca, a Turk, left John Paul critically injured after firing several shots in the failed assassination attempt in St. Peter's Square on May 13, 1981.The former pope forgave Agca, once a member of a Turkish far right group known as the Grey Wolves, and went to meet him in 1983 in the Rome prison where he had been sentenced to life imprisonment for the attack.Agca called the Italian daily la Repubblica on Saturday to announce he had arrived in the Vatican, his first visit since the assassination attempt and exactly 31 years after John Paul met him in prison.The visit was confirmed to Reuters by Father Ciro Benedettini, the Vatican's deputy spokesman, who said Agca stood for a few moments in silent meditation over the tomb in St. Peter's Basilica before leaving two bunches of white roses.Agca, 56, was pardoned by Italy in 2000 and extradited to Turkey where he was imprisoned for the 1979 murder of a journalist and other crimes. He was released from jail in 2010.The attack against John Paul, who died in 2005, has remained clouded by unanswered questions over who may have been behind it. An Italian investigative parliamentary commission said in 2006 it was "beyond reasonable doubt" that it was masterminded by leaders of the former Soviet Union.The Vatican on Saturday gave a cool response to Agca's request to meet with Pope Francis. "He has put his flowers on John Paul's tomb; I think that is enough," Vatican spokesman father Federico Lombardi told la Repubblica.
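A naive patch for this particular failure mode can be sketched in a few lines. This is illustrative only, not the library's tokenizer; a real splitter also needs an abbreviation list, since text like "St. Peter's" would still confuse it:

```javascript
// Insert a space when a period is glued directly to a capitalized word
// (as in "...meet Pope Francis.Mehmet Ali Agca..."), then split on
// whitespace that follows sentence-final punctuation.
// Caveat: abbreviations ("St.", "Mr.") still produce false splits.
function splitSentences(text) {
  const spaced = text.replace(/\.([A-Z])/g, '. $1')
  return spaced.split(/(?<=[.?!])\s+/)
}
```

Running it on a small glued-together sample recovers the sentence boundaries the issue describes.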

nlp.pos('constructor') is returning an error

Uncaught TypeError: Cannot read property 'match' of undefined
if (w.match(/^(over|under|out|-|un|re|en).{4}/)) {
  var attempt = w.replace(/^(over|under|out|.*?-|un|re|en)/, '')
  return parts_of_speech[lexicon[attempt]]
}
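A likely mechanism (my reading of the error, not a confirmed fix from the source): 'constructor' is an inherited key on every plain object, so `lexicon['constructor']` returns a function instead of undefined, and later code falls over. Two standard guards, sketched:

```javascript
// Guard 1: only honour the object's own keys, never inherited ones.
function lookupSafe(lexicon, word) {
  return Object.prototype.hasOwnProperty.call(lexicon, word)
    ? lexicon[word]
    : undefined
}

// Guard 2: build the lexicon with no prototype, so there are no
// surprise keys like 'constructor' or 'hasOwnProperty' to begin with.
const bareLexicon = Object.create(null)
bareLexicon.walk = 'Verb'
```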

date_extractor's regex not replacing properly

Hi,

in date_extractor.js, line 24 to 35, the replace regex replaces dates in the format "Feb. 14, 1969" to "February14, 1969" (no space between the month and the date), leading the parser to skip the date and only match the year.

Fixed by surrounding the replaced month names by spaces:

    text = text.replace(/ Feb\.? /g, ' February ');
    text = text.replace(/ Mar\.? /g, ' March ');
    text = text.replace(/ Apr\.? /g, ' April ');
    text = text.replace(/ Jun\.? /g, ' June ');
    text = text.replace(/ Jul\.? /g, ' july ');
    text = text.replace(/ Aug\.? /g, ' august ');
    text = text.replace(/ Sep\.? /g, ' september ');
    text = text.replace(/ Oct\.? /g, ' october ');
    text = text.replace(/ Nov\.? /g, ' november ');
    text = text.replace(/ Dec\.? /g, ' december ');
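The per-month replaces above can also be collapsed into a single pass. A sketch of the same fix, not the library's actual code (the month tables here are my own):

```javascript
// One regex for all abbreviations; the lookahead keeps the trailing
// space/end-of-string intact so "Feb. 14" becomes "February 14".
const abbrs = ['Jan', 'Feb', 'Mar', 'Apr', 'Jun', 'Jul',
  'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
const full = ['January', 'February', 'March', 'April', 'June', 'July',
  'August', 'September', 'October', 'November', 'December']
function expandMonths(text) {
  return text.replace(/\b(Jan|Feb|Mar|Apr|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\.?(?=\s|$)/g,
    (match, abbr) => full[abbrs.indexOf(abbr)])
}
```

Full month names pass through untouched, since the lookahead fails mid-word.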

referenced_by and reference_to

Hey,
please see https://github.com/spencermountain/nlp_compromise/blob/master/src/parents/noun/index.js

referenced_by uses the var posessives (typo ?). It is defined in the scope of the module and is

{
    "his": "he",
    "her": "she",
    "hers": "she",
    "their": "they",
    "them": "they",
    "its": "it"
  }

while reference_to uses the var var possessives defined in the scope of the function which is just

{
    "his":"he",
    "her":"she",
    "their":"they"
}

Shouldn't both be the same and maybe

{ 
    mine: 'i',
    yours: 'you',
    his: 'he',
    her: 'she',
    its: 'it',
    our: 'we',
    their: 'they',
    them: 'they' 
}

?
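The two variables could indeed share one module-level table, which would make both lookups agree. A sketch of the proposal above (hypothetical names, not the library's code):

```javascript
// One merged possessive/pronoun map, used by both referenced_by and
// reference_to, so the two code paths can never drift apart.
const possessives = {
  mine: 'i', yours: 'you', his: 'he',
  her: 'she', hers: 'she', its: 'it',
  our: 'we', ours: 'we',
  their: 'they', theirs: 'they', them: 'they',
}
function referenceTo(word) {
  const found = possessives[word.toLowerCase()]
  return found !== undefined ? found : null
}
```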

Circa, or c.

Is it possible to add an exception for the following regex?

/c\.(\ ?[0-9]+)/

Right now I'm using a small script to pre-process the text that I want to analyze with nlp_compromise. The current solution that I am using looks like this:

raw = raw.replace(/c\.(\ ?[0-9]+)/g, 'circa $1');

Basically, any c. YEAR will be replaced by circa YEAR so nlp doesn't mess up with that c.. While only c. might not be good enough to be added to abbreviations since it's not significant enough, this expression matches c. NUMBER, which I think it's unambiguous enough. What do you think? Is there a way to add this or other similar, case-specific abbreviations?

(I am not proposing to change c. for circa, this is just my solution, I am proposing to add, if possible, an exception for c. YEAR and not break sentences in that period).

Contractions

Please note that he's and she's become ['he', 'is'] and ['she', 'is'],
but they could also be ['he', 'has'] and ['she', 'has']
stackexchange

• how about it's?

• shouldn't the negative contractions be handled here too?
"cannot": ["can", "not"] is the only one.
But how about stuff like
"shouldn't": ["should", "not"]

This would affect logic_negate, I assume.
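One common heuristic for the 's ambiguity is to peek at the next word: before a past participle it's usually 'has', otherwise 'is'. A rough sketch of that idea (my assumption, not the library's actual logic; the tiny participle list stands in for a real lexicon):

```javascript
// "'s" disambiguation by lookahead:
//   "she's gone"  -> ['she', 'has']   (next word is a past participle)
//   "she's happy" -> ['she', 'is']    (anything else)
const pastParticiples = new Set(['been', 'gone', 'done', 'seen', 'taken'])
function expandApostropheS(pronoun, nextWord) {
  const verb = pastParticiples.has(nextWord) ? 'has' : 'is'
  return [pronoun, verb]
}
```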

language independence ...

Hey there,
again : this is not an issue.
The changes recently done are totally fine but let me explain why I make (made or am planning to make) which changes in the fork https://github.com/redaktor/nlp_compromise

As a European I would love this project to be as multilingual as possible ;)
The changes made have these goals :
• for contributing be totally explanative and readable
• for transport be browser-friendly and thus very small
• completely separate data / language logic / project logic

Three new files in src/data
: dictionary.js
The file where we can contribute multilingual words in the categories like in the readme.
: dictionary_rules.js (tba)
The file where we can contribute multilingual rules.
: _build.js
To build the data modules for one/some/all languages.
This could also be the first grunt step.

It will generate or overwrite a folder like 'en'.
Check it out: node _build -l
Basically I am planning to let the build script generate a customized client side file and additional AMD browser modules.
See for instance the module.exports lines - there are more than 30, but they are useless in the browser, and apart from that I'd optimize the compression for the browser a bit further.

I do also try to avoid duplicates further. For example in phrasal verbs : Some verbs are already in the verb data module and some adjectives are already in the adj. module ...

When it is complete:
• each module e.g. in /parent should only be a little bit of 'project logic'.
• our database can autotranslate
• I could attach our web interface to encourage translators even more ;)

ngram: support for ngrams with size equal to 1

Currently there is no way to use nlp.ngram() to perform a simple word frequency calculation (i.e. ngrams with size equal to one). Setting the max_size option to 1 produces ngrams with size 2. Setting max_size to 0 also gives the same result. I suspect that these two are the lines responsible for this issue: https://github.com/spencermountain/nlp_compromise/blob/master/src/methods/tokenization/ngram.js#L11 (where max_size is incremented - why?) and https://github.com/spencermountain/nlp_compromise/blob/master/src/methods/tokenization/ngram.js#L6
(where since 0 is false, max_size is assigned the value 5).
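The expected behaviour is easy to state in isolation. A minimal n-gram extractor that handles size 1 correctly - a sketch of what the issue asks for, not the library's implementation:

```javascript
// n-grams without the off-by-one: size 1 yields plain word frequencies.
// A missing size falls back to 1 explicitly, rather than relying on
// truthiness (which is what mistakenly turned 0 into the default of 5).
function ngrams(words, size) {
  if (size == null || size < 1) size = 1
  const out = []
  for (let i = 0; i + size <= words.length; i++) {
    out.push(words.slice(i, i + size).join(' '))
  }
  return out
}
```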

nlp.sentences() collapses whitespace

Pretty self-explanatory, but if there are multiple whitespace characters between words, the detected sentence collapses these characters together

NPM installation throws error

Below is the stack trace --

npm http GET https://registry.npmjs.org/nlp_comprimise
npm http 304 https://registry.npmjs.org/nlp_comprimise
npm http GET https://registry.npmjs.org/nlp_comprimise/-/nlp_comprimise-0.0.3.tgz
npm http 404 https://registry.npmjs.org/nlp_comprimise/-/nlp_comprimise-0.0.3.tgz
npm ERR! fetch failed https://registry.npmjs.org/nlp_comprimise/-/nlp_comprimise-0.0.3.tgz
npm ERR! Error: 404 Not Found
npm ERR!     at WriteStream.<anonymous> (/usr/local/Cellar/node/0.10.25/lib/node_modules/npm/lib/utils/fetch.js:57:12)
npm ERR!     at WriteStream.EventEmitter.emit (events.js:117:20)
npm ERR!     at fs.js:1596:14
npm ERR!     at /usr/local/Cellar/node/0.10.25/lib/node_modules/npm/node_modules/graceful-fs/graceful-fs.js:103:5
npm ERR!     at Object.oncomplete (fs.js:107:15)
npm ERR! If you need help, you may report this *entire* log,
npm ERR! including the npm and node versions, at:
npm ERR!     <http://github.com/isaacs/npm/issues>

npm ERR! System Darwin 13.0.0
npm ERR! command "/usr/local/Cellar/node/0.10.25/bin/node" "/usr/local/bin/npm" "install" "nlp_comprimise" "--save"
npm ERR! cwd /Users/WS/nlp/natural
npm ERR! node -v v0.10.25
npm ERR! npm -v 1.3.24
npm ERR! 
npm ERR! Additional logging details can be found in:
npm ERR!     /Users/WS/nlp/natural/npm-debug.log
npm ERR! not ok code 0

date_extractor: Cannot read property '1' of null

Hi,
I'm getting this error when trying to parse certain strings.
I've put in a hack for the function to always return null as i'm not using date extraction, but it's not a fix.

/node_modules/nlp_compromise/src/parents/value/coffeejs/date_extractor.js:224
h[k] = arr[places[k]];
^
TypeError: Cannot read property '1' of null
at /node_modules/nlp_compromise/src/parents/value/coffeejs/date_extractor.js:224:21
at Array.reduce (native)
at Object.regexes.process (/node_modules/nlp_compromise/src/parents/value/coffeejs/date_extractor.js:223:36)
at main (/node_modules/nlp_compromise/src/parents/value/coffeejs/date_extractor.js:334:20)
at the.date (/node_modules/nlp_compromise/src/parents/value/index.js:13:11)
at /node_modules/nlp_compromise/src/parents/value/index.js:38:11
at new Value (/node_modules/nlp_compromise/src/parents/value/index.js:45:4)
at Object.parents.value (/node_modules/nlp_compromise/src/parents/parents.js:22:10)
at /node_modules/nlp_compromise/src/pos.js:366:47
at Array.map (native)

Sentence negate will negate everything

for example:
[Orig]
They are based on different physical effects use to guarantee a stable grasping between a gripper and the object to be grasped.

[negate]
They are not based on different physical didn't effects use to doesn't guarantee a stable grasping between a gripper and the object to be grasped.

Maybe just negate the first verb would be sufficient.

is_plural again

Hm - the last commit does not work properly because

in pluralize_rules we have rules for both singular to plural AND plural to plural
while
in singularize_rules it is only plural to singular

(???)

In general I am working on a factory method called "dictionary" based on the "words" and "rules" and this can be autotranslated by our database to several languages covering the ngram and metrics etc.

are/were

I'm new to nlp and am weak on my grammar, so maybe I'm barking up the wrong trees.

I'm using nlp_compromise to switch the verb-tense in sentences from past to present, or present to past, using nlp.verb(vb).to_present() or nlp.verb(vb).to_past() as required.

It's working great, for the most part, except for when I try to swap the tense of "They are friends" or "They were friends".

Is there some other way I should be going about this, am I using the wrong tools, or is this something that can be extended with some new rules?

Implementing the package on a data structure?

Hey, I am quite new to meteor and computer science basically. I am a linguist trying to learn computational linguistics somehow. I guess I am still struggling. I was wondering how it would be possible to use this on my own corpus? Let's say that I have a list of sentences, and whenever I choose a sentence I want to see the properties of that sentence. Would that be possible?

cannot find dates

using this from node gives an error:
Error: Cannot find module './dates'

Steps to reproduce:

  • npm install nlp_compromise
  • node
  • nlp = require('nlp_compromise');

2.0 - just a note : string to regex

see lexidates:
res.dayS = '\b('.concat(Object.keys(res.days).join('|'), ')\b');

When a string becomes a regex in javascript, you must double-escape anything with special regex meaning.
So \b should be \\b here - see my original code...

If you want to use it like on top you need to pass it to a quote function, e.g.
Mozilla:

function escapeRegExp(string){
  return string.replace(/([.*+?^=!:${}()|\[\]\/\\])/g, "\\$1");
}

or in dojo see .string ...
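To make the difference concrete, here is a minimal, self-contained demonstration of the point (illustrative day names, not the actual lexi data):

```javascript
// In a string literal, '\b' is the backspace character (\u0008),
// not a regex word boundary - you need '\\b' when building a RegExp
// from a string.
const days = { monday: 1, tuesday: 2 }
const alternation = Object.keys(days).join('|')
const wrong = new RegExp('\b(' + alternation + ')\b')   // matches backspaces
const right = new RegExp('\\b(' + alternation + ')\\b') // word boundaries
```

If the keys could contain regex metacharacters, they should also go through an escape function like the Mozilla one quoted above before being joined.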

Verb Tense Bug in Demo

Sentence:

joe carter plays patiently in toronto

Steps to reproduce:

  1. Change plays to past-tense
  2. Negate played

Result:

joe carter didn't playe patiently in toronto

feature request : phrasal verbs stemming

In the default mode [without {dont_combine:true}] it would be nice to have phrasal verbs recognized – as they can have a totally new meaning. For example

My grandfather likes to look back on his childhood.
`look back`

[taken from http://www.englisch-hilfen.de/grammar/phrasal_verbs.htm]

Just FYI.

@spencermountain
Please see this demo http://expresso-app.org/tutorial ...
I made the same demo with your nice project.
More or less lazily by porting the "python metrics logic" to .js.
The advantages are : .js only and onKeyPress ... Think of a better http://www.hemingwayapp.com ;))
Will work on it later today. Also pointed the author of expresso to your project.

The method could either be contributed as a .metrics() function to the "root level" used in a demo or as a standalone demo. Just tell me if you are interested by writing to @redaktor (I'll close this directly) .
Thank you for starting to build this missing javascript puzzle-piece!

January is not recognized by date_extractor?

Heya,

This is a wonderful library. Hoping to use it to extract dates in a project, but I noticed that January consistently fails to be properly extracted in tests. I'm wondering if this is a subtle bug with indexes / accidental type coercion of 0 to false in date_extractor.coffee.

Would be happy to help you track down the issue if you have trouble.

Here is an example I just tried on master:

nlp.value("Today is January 7, 2015").date()
{ month: null,
  day: 7,
  year: 2015,
  to_day: null,
  to_year: 2015,
  to_month: null }
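The reporter's hunch about zero-coercion is easy to illustrate: month indexes start at 0, so January is falsy and a plain truthiness check silently drops it. This is hypothetical code showing the pitfall, not the library's actual extractor:

```javascript
// Suspected bug pattern: 0 (January) fails a truthiness check
function keepMonthBuggy(month) {
  return month ? month : null; // month 0 -> null
}

// Fixed: test explicitly for null/undefined instead
function keepMonthFixed(month) {
  return month !== undefined && month !== null ? month : null;
}
```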

Question - Advanced Date Parsing

First off, this is such a great project!
Do you have any thoughts on returning an array of dates from a parsed sentence? Or more advanced logic, like ranges?

It looks like this does a lot of what I've done in https://github.com/silentrob/normalizer but I suspect much faster (basic normalization and commonwealth => american conversion).

I also have some code that deals with numbers and parsing math expressions here https://github.com/silentrob/superscript/blob/master/lib/math.js.

performance / why is it running twice ...

Hey there,
contributing from my fork doesn't make sense because the structure will change to 'only the 3 dictionary files and a factory' soon.
However - let me ask some performance questions.
Maybe I missed something hidden in the code, but several 'autoclosure' functions run every time a module is required.

Let's take an example - the conjugation of verbs which is used quite often.
I'll use simple console.log to demonstrate it.

In src/parents/verb/index

put some logs in the conjugate function

the.conjugate = function() {
  console.log('BEWARE! conjugate is conjugating');
  var verb_conjugate = require('./conjugate');
  var conjugated = verb_conjugate(the.word);
  console.log('conjugate result', conjugated);
  return conjugated; // was: return verb_conjugate(the.word);
}

and in the 'autoclosure' form function

the.form = (function() {
    console.log( 'BEWARE! the.form is conjugating' );
    var verb_conjugate = require('./conjugate');
    // don't choose infinitive if infinitive == present
    var order = [
      'past',
      'present',
      'gerund',
      'infinitive'
    ];
    var forms = verb_conjugate(the.word);
    console.log( 'forms result', forms );
    for (var i = 0; i < order.length; i++) {
        if (forms[order[i]] === the.word) {
            return order[i];
        }
    }
})()

When I do

console.log( nlp.verb('last') );

it will conjugate

and when I do

console.log( nlp.verb('last').conjugate() );

it will conjugate twice
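The double run happens because the.form is an IIFE that conjugates at construction time, and .conjugate() conjugates again. One possible fix (a sketch under assumed shapes - conjugateWord stands in for require('./conjugate')) is to memoize per word so both call sites share one computation:

```javascript
// Cache results per word so repeated calls do the expensive work once
function memoize(fn) {
  var cache = {};
  return function (word) {
    if (!(word in cache)) {
      cache[word] = fn(word);
    }
    return cache[word];
  };
}

var calls = 0;
var conjugate = memoize(function conjugateWord(word) {
  calls++; // count how often the expensive work actually runs
  return { infinitive: word, present: word + 's', past: word + 'ed' };
});

conjugate('last');
conjugate('last'); // served from cache, no second conjugation
```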

Double quotes to specify pos

We are using nlp_compromise to parse requests for data pulls. In many cases, a product or retailer name gets parsed in an undesirable fashion, e.g. "Stop & Shop" will not be recognized as a noun.

Is it possible today, or would it be possible, to allow double-quotes to group words together and default them to a particular part of speech, like NN?
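A hypothetical pre-pass sketch of the idea (names and shape are assumptions, not the library's API): pull double-quoted spans out of the request and record them as nouns in a lexicon object before handing the text to the parser.

```javascript
// Extract "quoted phrases" into a word -> tag lexicon, and return cleaned text
function extractQuoted(text) {
  var lexicon = {};
  var cleaned = text.replace(/"([^"]+)"/g, function (match, phrase) {
    lexicon[phrase.toLowerCase()] = 'Noun';
    return phrase; // drop the quotes, keep the phrase
  });
  return { cleaned: cleaned, lexicon: lexicon };
}
```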

nlp.tag is not defined

The README mentions it, but I don't see it in the exports nor does the current version published to npm have it.

Britishize asymmetry with Americanize

On version 1.1.3

If you try typing in something like

require("nlp_compromise").britishize("color");
> "color"

require("nlp_compromise").britishize("favorite");
> "favorite"

require("nlp_compromise").britishize("internationalization");
> "internationalization"

It just returns whatever the input is.

The americanize function works perfectly fine, though.
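For reference, the outputs the reporter expects ('colour', 'favourite', 'internationalisation') follow a handful of suffix rules. A minimal sketch of such rules (assumed for illustration, not the library's actual data):

```javascript
// A few sample suffix rules for en-US -> en-GB (illustrative only)
var rules = [
  [/(col|flav|hon|lab|neighb)or(s?)$/, '$1our$2'],
  [/(fav)or(ite)(s?)$/, '$1our$2$3'],
  [/ization$/, 'isation'],
];

function britishize(word) {
  for (var i = 0; i < rules.length; i++) {
    if (rules[i][0].test(word)) return word.replace(rules[i][0], rules[i][1]);
  }
  return word; // no rule matched
}
```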


past tense '-dy', '-ly'

I think maybe this is not working correctly, but as it seems broken for a bunch of verbs, maybe I'm missing something...

nlp.verb('study').to_past()
"studyed"

nlp.verb('apply').to_past()
"applyed"

Dictionary?

Are you using your own internal dictionary / algorithms to do the transformations, etc.? If so (and I seem to think this is the case), there is something off with the conjugation of the verb "load":

{ infinitive: 'loa',
  present: 'loads',
  past: 'loaded',
  gerund: 'loading',
  doer: 'loaer',
  future: 'will loa' }

Now if I try something else, like "to load":

{ infinitive: 'to load',
  present: 'to loads',
  past: 'to loaded',
  gerund: 'to loading',
  doer: 'to loader',
  future: 'will to load' }

This doesn't seem right either. Am I doing something wrong with the entry of the string word(s) - is there some form I am missing? Or is this a corner case in the algorithm, perhaps? Thought I'd at least report it 📦
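One guess at the cause (hypothetical code, not the library's): a rule that trims a trailing 'd' to recover an infinitive fires on the bare infinitive 'load' as well, yielding 'loa'. Guarding on the full -ed suffix avoids it:

```javascript
// Suspected buggy pattern: strips a 'd' even when there is no -ed suffix
function stripPastBuggy(verb) {
  return verb.replace(/d$/, ''); // 'load' -> 'loa'
}

// Safer: only strip a genuine -ed ending
function stripPastFixed(verb) {
  return verb.replace(/([^e])ed$/, '$1'); // 'loaded' -> 'load', 'load' unchanged
}
```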

Otherwise, great solution: much thanks!

Changes in the fork and the pull request ...

Hey,

just committed nearly the last of the changes to the fork
https://github.com/redaktor/nlp_compromise
before I can do a pull request.

I still need to eliminate:
• the 'hardcoded' dups in lexicon generation
• the last 37/1360(?) failing tests

The lexicon will then be at least 10% smaller, and I really think that with this structure, language-dependent contributions can become easy.
Just pinging because I saw you were recently active ...

feature: add custom verbs/nouns etc

Scenario: using the library for natural language processing for a calendar assistant. Doesn't recognise "schedule" as a verb.

Would it be possible to pass in some configuration when instantiating the library, e.g. an array of verbs + nouns, etc., to allow users to inject extra words?
In my case I might extend the verbs by passing in an array of my own:
["schedule"]

That seems to work for me if I hack the code and add 'schedule' to the list of verbs...but I don't grok grammar well enough to know if it's completely correct (it becomes an infinitive verb, VBP)
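A minimal sketch of the requested configuration shape (names and base entries are assumptions for illustration, not the library's API): merge user-supplied words into a base lexicon at instantiation, so 'schedule' can be tagged as a verb without hacking the source.

```javascript
// Merge user-supplied verbs/nouns into a (stand-in) base lexicon
function makeLexicon(extraVerbs, extraNouns) {
  var lexicon = { walk: 'Verb', meeting: 'Noun' }; // stand-in base entries
  (extraVerbs || []).forEach(function (w) { lexicon[w] = 'Verb'; });
  (extraNouns || []).forEach(function (w) { lexicon[w] = 'Noun'; });
  return lexicon;
}

var lexicon = makeLexicon(['schedule']);
```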
