spencermountain / compromise

modest natural-language processing

Home Page: http://compromise.cool

License: MIT License

JavaScript 99.75% TypeScript 0.19% HTML 0.06%
nlp part-of-speech named-entity-recognition

compromise's Introduction

compromise
modest natural language processing
npm install compromise

don't you find it strange,
    how easy text is to make,

     and how hard it is to actually parse and use?

compromise tries its best to turn text into data.
it makes limited and sensible decisions.
it's not as smart as you'd think.

import nlp from 'compromise'

let doc = nlp('she sells seashells by the seashore.')
doc.verbs().toPastTense()
doc.text()
// 'she sold seashells by the seashore.'

don't be fancy, at all:
if (doc.has('simon says #Verb')) {
  return true
}

grab parts of the text:
let doc = nlp(entireNovel)
doc.match('the #Adjective of times').text()
// "the blurst of times?"

and get data:

import plg from 'compromise-speech'
nlp.extend(plg)

let doc = nlp('Milwaukee has certainly had its share of visitors..')
doc.compute('syllables')
doc.places().json()
/*
[{
  "text": "Milwaukee",
  "terms": [{ 
    "normal": "milwaukee",
    "syllables": ["mil", "wau", "kee"]
  }]
}]
*/

avoid the problems of brittle parsers:

let doc = nlp("we're not gonna take it..")

doc.has('gonna') // true
doc.has('going to') // true (implicit)

// transform
doc.contractions().expand()
doc.text()
// 'we are not going to take it..'

and whip stuff around like it's data:

let doc = nlp('ninety five thousand and fifty two')
doc.numbers().add(20)
doc.text()
// 'ninety five thousand and seventy two'

-because it actually is-

let doc = nlp('the purple dinosaur')
doc.nouns().toPlural()
doc.text()
// 'the purple dinosaurs'

Use it on the client-side:

<script src="https://unpkg.com/compromise"></script>
<script>
  var doc = nlp('two bottles of beer')
  doc.numbers().minus(1)
  document.body.innerHTML = doc.text()
  // 'one bottle of beer'
</script>

or likewise:

import nlp from 'compromise'

var doc = nlp('London is calling')
doc.verbs().toNegative()
// 'London is not calling'

compromise is ~250kb (minified).

it's pretty fast. It can run on keypress.

it works mainly by conjugating all forms of a basic word list.

The final lexicon is ~14,000 words.

you can read more about how it works, here. it's weird.

okay -

compromise/one

A tokenizer of words, sentences, and punctuation.

import nlp from 'compromise/one'

let doc = nlp("Wayne's World, party time")
let data = doc.json()
/* [{ 
  normal:"wayne's world party time",
    terms:[{ text: "Wayne's", normal: "wayne" }, 
      ...
      ] 
  }]
*/

compromise/one splits your text up, wraps it in a handy API,

    and does nothing else -

/one is quick - most sentences take a 10th of a millisecond.

It can do ~1mb of text a second - or 10 wikipedia pages.

Infinite Jest takes 3s.

You can also parallelize, or stream text to it with compromise-speed.

compromise/two

A part-of-speech tagger, and grammar-interpreter.

import nlp from 'compromise/two'

let doc = nlp("Wayne's World, party time")
let str = doc.match('#Possessive #Noun').text()
// "Wayne's World"

compromise/two automatically calculates the very basic grammar of each word.

this is more useful than people sometimes realize.

Light grammar helps you write cleaner templates, and get closer to the information.

compromise has 83 tags, arranged in a handsome graph.

#FirstName → #Person → #ProperNoun → #Noun

you can see the grammar of each word by running doc.debug()

you can see the reasoning for each tag with nlp.verbose('tagger').

if you prefer Penn tags, you can derive them with:

let doc = nlp('welcome thrillho')
doc.compute('penn')
doc.json()

compromise/three

Phrase and sentence tooling.

import nlp from 'compromise/three'

let doc = nlp("Wayne's World, party time")
let str = doc.people().normalize().text()
// "wayne"

compromise/three is a set of tooling to zoom into and operate on parts of a text.

.numbers() grabs all the numbers in a document, for example - and extends it with new methods, like .subtract().

When you have a phrase, or group of words, you can see additional metadata about it with .json()

let doc = nlp('four out of five dentists')
console.log(doc.fractions().json())
/*[{
    text: 'four out of five',
    terms: [ [Object], [Object], [Object], [Object] ],
    fraction: { numerator: 4, denominator: 5, decimal: 0.8 }
  }
]*/
let doc = nlp('$4.09CAD')
doc.money().json()
/*[{
    text: '$4.09CAD',
    terms: [ [Object] ],
    number: { prefix: '$', num: 4.09, suffix: 'cad'}
  }
]*/

API

compromise/one

Output
  • .text() - return the document as text
  • .json() - return the document as data
  • .debug() - pretty-print the interpreted document
  • .out() - a named or custom output
  • .html({}) - output custom html tags for matches
  • .wrap({}) - produce custom output for document matches
Utils
  • .found [getter] - is this document empty?
  • .docs [getter] - get term objects as json
  • .length [getter] - count the # of characters in the document (string length)
  • .isView [getter] - identify a compromise object
  • .compute() - run a named analysis on the document
  • .clone() - deep-copy the document, so that no references remain
  • .termList() - return a flat list of all Term objects in match
  • .cache({}) - freeze the current state of the document, for speed-purposes
  • .uncache() - un-freezes the current state of the document, so it may be transformed
  • .freeze({}) - prevent any tags from being removed, in these terms
  • .unfreeze({}) - allow tags to change again, as default
Accessors
Match

(match methods use the match-syntax.)

  • .match('') - return a new Doc, with this one as a parent
  • .not('') - return all results except for this
  • .matchOne('') - return only the first match
  • .if('') - return each current phrase, only if it contains this match ('only')
  • .ifNo('') - Filter-out any current phrases that have this match ('notIf')
  • .has('') - Return a boolean if this match exists
  • .before('') - return all terms before a match, in each phrase
  • .after('') - return all terms after a match, in each phrase
  • .union() - return combined matches without duplicates
  • .intersection() - return only duplicate matches
  • .complement() - get everything not in another match
  • .settle() - remove overlaps from matches
  • .growRight('') - add any matching terms immediately after each match
  • .growLeft('') - add any matching terms immediately before each match
  • .grow('') - add any matching terms before or after each match
  • .sweep(net) - apply a series of match objects to the document
  • .splitOn('') - return a Document with three parts for every match ('splitOn')
  • .splitBefore('') - partition a phrase before each matching segment
  • .splitAfter('') - partition a phrase after each matching segment
  • .join() - merge any neighbouring terms in each match
  • .joinIf(leftMatch, rightMatch) - merge any neighbouring terms under given conditions
  • .lookup([]) - quick find for an array of string matches
  • .autoFill() - create type-ahead assumptions on the document
Tag
  • .tag('') - Give all terms the given tag
  • .tagSafe('') - Only apply tag to terms if it is consistent with current tags
  • .unTag('') - Remove this tag from the given terms
  • .canBe('') - return only the terms that can be this tag
Case
Whitespace
  • .pre('') - add this punctuation or whitespace before each match
  • .post('') - add this punctuation or whitespace after each match
  • .trim() - remove start and end whitespace
  • .hyphenate() - connect words with hyphen, and remove whitespace
  • .dehyphenate() - remove hyphens between words, and set whitespace
  • .toQuotations() - add quotation marks around these matches
  • .toParentheses() - add brackets around these matches
Loops
  • .map(fn) - run each phrase through a function, and create a new document
  • .forEach(fn) - run a function on each phrase, as an individual document
  • .filter(fn) - return only the phrases that return true
  • .find(fn) - return a document with only the first phrase that matches
  • .some(fn) - return true or false if there is one matching phrase
  • .random(fn) - sample a subset of the results
Insert
Transform
Lib

(these methods are on the main nlp object)

compromise/two:

Contractions

compromise/three:

Nouns
Verbs
Numbers
Sentences
Adjectives
Misc selections

.extend():

This library comes with a considerate, common-sense baseline for English grammar.

You're free to change, or lay waste to, any settings - which is the fun part, actually.

the easiest change is just to suggest tags for any given words:

let myWords = {
  kermit: 'FirstName',
  fozzie: 'FirstName',
}
let doc = nlp(muppetText, myWords)

or make heavier changes with a compromise-plugin.

import nlp from 'compromise'
nlp.extend({
  // add new tags
  tags: {
    Character: {
      isA: 'Person',
      notA: 'Adjective',
    },
  },
  // add or change words in the lexicon
  words: {
    kermit: 'Character',
    gonzo: 'Character',
  },
  // change inflections
  irregulars: {
    get: {
      pastTense: 'gotten',
      gerund: 'gettin',
    },
  },
  // add new methods to compromise
  api: View => {
    View.prototype.kermitVoice = function () {
      this.sentences().prepend('well,')
      this.match('i [(am|was)]').prepend('um,')
      return this
    }
  },
})

Docs:

gentle introduction:
Documentation:
| Concepts | API | Plugins |
|---|---|---|
| Accuracy | Accessors | Adjectives |
| Caching | Constructor-methods | Dates |
| Case | Contractions | Export |
| Filesize | Insert | Hash |
| Internals | Json | Html |
| Justification | Character Offsets | Keypress |
| Lexicon | Loops | Ngrams |
| Match-syntax | Match | Numbers |
| Performance | Nouns | Paragraphs |
| Plugins | Output | Scan |
| Projects | Selections | Sentences |
| Tagger | Sorting | Syllables |
| Tags | Split | Pronounce |
| Tokenization | Text | Strict |
| Named-Entities | Utils | Penn-tags |
| Whitespace | Verbs | Typeahead |
| World data | Normalization | Sweep |
| Fuzzy-matching | Typescript | Mutation |
| Root-forms | | |
Talks:
Articles:
Some fun Applications:
Comparisons

Plugins:

These are some helpful extensions:

Dates

npm install compromise-dates

Stats

npm install compromise-stats

Speech

npm install compromise-speech

Wikipedia

npm install compromise-wikipedia


Typescript

we're committed to typescript/deno support, both in main and in the official plugins:

import nlp from 'compromise'
import stats from 'compromise-stats'

const nlpEx = nlp.extend(stats)

nlpEx('This is type safe!').ngrams({ min: 1 })

Limitations:

  • slash-support: We currently split slashes up as different words, like we do for hyphens. So things like this don't work: nlp('the koala eats/shoots/leaves').has('koala leaves') // false

  • inter-sentence match: By default, sentences are the top-level abstraction. Inter-sentence, or multi-sentence matches aren't supported without a plugin: nlp("that's it. Back to Winnipeg!").has('it back') // false

  • nested match syntax: the dangerous beauty of regex is that it can recurse indefinitely. Our match syntax is much weaker. Things like this are not (yet) possible: doc.match('(modern (major|minor))? general'). Complex matches must be achieved with successive .match() statements.

  • dependency parsing: Proper sentence transformation requires understanding the syntax tree of a sentence, which we don't currently do. We should! Help wanted with this.

FAQ

    ☂️ Isn't javascript too...

      yeah it is!
      it wasn't built to compete with NLTK, and may not fit every project.
      string processing is synchronous too, and parallelizing node processes is weird.
      See here for information about speed & performance, and here for project motivations

    💃 Can it run on my arduino-watch?

      Only if it's water-proof!
      Read quick start for running compromise in workers, mobile apps, and all sorts of funny environments.

    🌎 Compromise in other Languages?

    ✨ Partial builds?

      we do offer a tokenize-only build, which has the POS-tagger pulled-out.
      but otherwise, compromise isn't easily tree-shaken.
      the tagging methods are competitive, and greedy, so it's not recommended to pull things out.
      Note that without full POS-tagging, the contraction-parser won't work perfectly. ((spencer's cool) vs. (spencer's house))
      It's recommended to run the library fully.

See Also:

MIT

compromise's People

Contributors

axaysushir, creatorrr, davidbuhler, fdawgs, fmacpro, himself65, ilyankou, jakeii, jaredreisinger, jfemia, johnyesberg, kahwee, kelvinhammond, khtdr, kiran-rao, leoseccia, lostfictions, marketingpip, markherhold, nloveladyallen, rotemdan, ryancasburn-kai, scagood, shamoons, silentrob, soyjavi, spencermountain, srubin, thegoatherder, wallali


compromise's Issues

Prepositions (IN)

Hey there,

when I initially compared the data, it ignored the prepositions (IN) due to a typo, and our db splits them into pre- and postpositions. I've got that sorted now and found some prepositions which are listed in other categories:
"before": is also CC,
"round": is also JJ,
"apart": is also RB (but can be preposition "apart from this" OR postposition "this apart")

And we list some prepositions which are not in the array yet (did NOT check other categories):

[
{en: 'a'},
{en: 'an'},
{en: 'abaft'},
{en: 'abeam'},
{en: 'aboard'},
{en: 'absent'},
{en: 'afore'},
{en: 'alongside'},
{en: 'amidst'},
{en: 'amongst'},
{en: 'anenst'},
{en: 'apropos'},
{en: 'apud'},
{en: 'aside'},
{en: 'astride'},
{en: 'athwart'},
{en: 'atop'},
{en: 'barring'},
{en: 'beneath'},
{en: 'beside'},
{en: 'beyond'},
{en: 'but'},
{en: 'chez'},
{en: 'circa'},
{en: 'concerning'},
{en: 'excluding'},
{en: 'failing'},
{en: 'following'},
{en: 'for'},
{en: 'forenenst'},
{en: 'given'},
{en: 'including'},
{en: 'inside'},
{en: 'like'},
{en: 'mid'},
{en: 'midst'},
{en: 'minus'},
{en: 'modulo'},
{en: 'near'},
{en: 'next'},
{en: 'notwithstanding'},
{en: 'opposite'},
{en: 'outside'},
{en: 'pace'},
{en: 'past'},
{en: 'plus'},
{en: 'pro'},
{en: 'qua'},
{en: 'regarding'},
{en: 'sans'},
{en: 'save'},
{en: 'times'},
{en: 'toward'},
{en: 'underneath'},
{en: 'unto'},
{en: 'worth'},
{en: 'together', description: 'questionable'},
{en: 'vis-à-vis', description: 'questionable'},
{en: 'thru', description: 'informal', meta: {entitySubstitution: ['en']}},
{en: 'thruout', description: 'informal', meta: {entitySubstitution: ['en']}},
{en: 'till', description: 'same as "until", wikipedia: "with prosodic restrictions"'},
{en: 'versus', description: 'NAB conflict: commonly abbreviated as "vs.", or (law or sports) as "v."'},
{en: 'vice', description: 'used as "in place of"'},
{en: 'with', description: 'sometimes written as "w/"'},
{en: 'w/', meta: {entitySubstitution: ['en']}},
{en: 'within', description: 'sometimes written as "w/in" or "w/i"'},
{en: 'w/in', meta: {entitySubstitution: ['en']}},
{en: 'w/i', meta: {entitySubstitution: ['en']}},
{en: 'without', description: 'sometimes written as "w/o"'},
{en: 'w/o', meta: {entitySubstitution: ['en']}},
{en: 'o\'', description: 'apocopic form of "of"', meta: {entitySubstitution: ['en']}}
]

btw - a nice one: https://www.youtube.com/watch?t=108&v=MHX-CiJBVy0

PP - compared to fork

Hey,
sorry for opening a new one (trying to separate the different questions):
This is about possessive pronouns (PP), which were not covered in the initial comparison (same reason: they split into different categories in our db - I made the db compatible now; when you look for PP it'll join all 3 categories). I am pasting the question I just added to the code (not sure if there are better expressions for the cases):

    // TODO - this covers more than the original :
    // possessive pronouns (should) have 3 forms :

    // as possessive (adjective) determiner pronoun (my) OR
    // as possessive (noun) pronoun (mine) OR
    // as a reflexive pronoun (myself)

What do you think?

btw: some changes you proposed in contributing.md have also made it into the fork.
E.g. it now has JSDoc documentation (WIP, standard template for now) ...

Adding new Names for recognition using spot?

I looked through a variety of files, but haven't found either a list or a method of where I can append known people/organization names for recognition using .spot -- does this exist and I'm just missing it?

Sentence boundary detection.

Very nice library.

When playing with some text pulled from a web article, noticed that the sentence boundary does not always work.

For example, the text below does not split sentences correctly.

The man who tried to kill former Pope John Paul II 33 years ago showed up at the Vatican on Saturday to put white roses on his tomb and said he wanted to meet Pope Francis.Mehmet Ali Agca, a Turk, left John Paul critically injured after firing several shots in the failed assassination attempt in St. Peter's Square on May 13, 1981.The former pope forgave Agca, once a member of a Turkish far right group known as the Grey Wolves, and went to meet him in 1983 in the Rome prison where he had been sentenced to life imprisonment for the attack.Agca called the Italian daily la Repubblica on Saturday to announce he had arrived in the Vatican, his first visit since the assassination attempt and exactly 31 years after John Paul met him in prison.The visit was confirmed to Reuters by Father Ciro Benedettini, the Vatican's deputy spokesman, who said Agca stood for a few moments in silent meditation over the tomb in St. Peter's Basilica before leaving two bunches of white roses.Agca, 56, was pardoned by Italy in 2000 and extradited to Turkey where he was imprisoned for the 1979 murder of a journalist and other crimes. He was released from jail in 2010.The attack against John Paul, who died in 2005, has remained clouded by unanswered questions over who may have been behind it. An Italian investigative parliamentary commission said in 2006 it was "beyond reasonable doubt" that it was masterminded by leaders of the former Soviet Union.The Vatican on Saturday gave a cool response to Agca's request to meet with Pope Francis. "He has put his flowers on John Paul's tomb; I think that is enough," Vatican spokesman father Federico Lombardi told la Repubblica.
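A naive patch for this particular failure mode can be sketched in a few lines. This is illustrative only, not the library's tokenizer; a real splitter also needs an abbreviation list, since text like "St. Peter's" would still confuse it:

```javascript
// Insert a space when a period is glued directly to a capitalized word
// (as in "...meet Pope Francis.Mehmet Ali Agca..."), then split on
// whitespace that follows sentence-final punctuation.
// Caveat: abbreviations ("St.", "Mr.") still produce false splits.
function splitSentences(text) {
  const spaced = text.replace(/\.([A-Z])/g, '. $1')
  return spaced.split(/(?<=[.?!])\s+/)
}
```

Running it on a small glued-together sample recovers the sentence boundaries the issue describes.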

nlp.pos('constructor') is returning an error

Uncaught TypeError: Cannot read property 'match' of undefined
if (w.match(/^(over|under|out|-|un|re|en).{4}/)) {
  var attempt = w.replace(/^(over|under|out|.*?-|un|re|en)/, '')
  return parts_of_speech[lexicon[attempt]]
}
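A likely mechanism (my reading of the error, not a confirmed fix from the source): 'constructor' is an inherited key on every plain object, so `lexicon['constructor']` returns a function instead of undefined, and later code falls over. Two standard guards, sketched:

```javascript
// Guard 1: only honour the object's own keys, never inherited ones.
function lookupSafe(lexicon, word) {
  return Object.prototype.hasOwnProperty.call(lexicon, word)
    ? lexicon[word]
    : undefined
}

// Guard 2: build the lexicon with no prototype, so there are no
// surprise keys like 'constructor' or 'hasOwnProperty' to begin with.
const bareLexicon = Object.create(null)
bareLexicon.walk = 'Verb'
```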

date_extractor's regex not replacing properly

Hi,

in date_extractor.js, line 24 to 35, the replace regex replaces dates in the format "Feb. 14, 1969" to "February14, 1969" (no space between the month and the date), leading the parser to skip the date and only match the year.

Fixed by surrounding the replaced month names by spaces:

    text = text.replace(/ Feb\.? /g, ' February ');
    text = text.replace(/ Mar\.? /g, ' March ');
    text = text.replace(/ Apr\.? /g, ' April ');
    text = text.replace(/ Jun\.? /g, ' June ');
    text = text.replace(/ Jul\.? /g, ' july ');
    text = text.replace(/ Aug\.? /g, ' august ');
    text = text.replace(/ Sep\.? /g, ' september ');
    text = text.replace(/ Oct\.? /g, ' october ');
    text = text.replace(/ Nov\.? /g, ' november ');
    text = text.replace(/ Dec\.? /g, ' december ');
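The per-month replaces above can also be collapsed into a single pass. A sketch of the same fix, not the library's actual code (the month tables here are my own):

```javascript
// One regex for all abbreviations; the lookahead keeps the trailing
// space/end-of-string intact so "Feb. 14" becomes "February 14".
const abbrs = ['Jan', 'Feb', 'Mar', 'Apr', 'Jun', 'Jul',
  'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
const full = ['January', 'February', 'March', 'April', 'June', 'July',
  'August', 'September', 'October', 'November', 'December']
function expandMonths(text) {
  return text.replace(/\b(Jan|Feb|Mar|Apr|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\.?(?=\s|$)/g,
    (match, abbr) => full[abbrs.indexOf(abbr)])
}
```

Full month names pass through untouched, since the lookahead fails mid-word.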

referenced_by and reference_to

Hey,
please see https://github.com/spencermountain/nlp_compromise/blob/master/src/parents/noun/index.js

referenced_by uses the var posessives (typo ?). It is defined in the scope of the module and is

{
    "his": "he",
    "her": "she",
    "hers": "she",
    "their": "they",
    "them": "they",
    "its": "it"
  }

while reference_to uses the var var possessives defined in the scope of the function which is just

{
    "his":"he",
    "her":"she",
    "their":"they"
}

Shouldn't both be the same and maybe

{ 
    mine: 'i',
    yours: 'you',
    his: 'he',
    her: 'she',
    its: 'it',
    our: 'we',
    their: 'they',
    them: 'they' 
}

?
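The two variables could indeed share one module-level table, which would make both lookups agree. A sketch of the proposal above (hypothetical names, not the library's code):

```javascript
// One merged possessive/pronoun map, used by both referenced_by and
// reference_to, so the two code paths can never drift apart.
const possessives = {
  mine: 'i', yours: 'you', his: 'he',
  her: 'she', hers: 'she', its: 'it',
  our: 'we', ours: 'we',
  their: 'they', theirs: 'they', them: 'they',
}
function referenceTo(word) {
  const found = possessives[word.toLowerCase()]
  return found !== undefined ? found : null
}
```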

Circa, or c.

Is it possible to add an exception for the following regex?

/c\.(\ ?[0-9]+)/

Right now I'm using a small script to pre-process the text that I want to analyze with nlp_compromise. The current solution that I am using looks like this:

raw = raw.replace(/c\.(\ ?[0-9]+)/g, 'circa $1');

Basically, any c. YEAR will be replaced by circa YEAR so nlp doesn't mess up with that c.. While only c. might not be good enough to be added to abbreviations since it's not significant enough, this expression matches c. NUMBER, which I think it's unambiguous enough. What do you think? Is there a way to add this or other similar, case-specific abbreviations?

(I am not proposing to change c. for circa, this is just my solution, I am proposing to add, if possible, an exception for c. YEAR and not break sentences in that period).

Contractions

Please note that he's and she's become ['he', 'is'] and ['she', 'is'],
but they could also be ['he', 'has'] and ['she', 'has']
stackexchange

• how about it's?

• shouldn't the negative contractions be handled here too?
"cannot": ["can", "not"] is the only one.
But how about stuff like
"shouldn't": ["should", "not"]

This would affect logic_negate, I assume.
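One common heuristic for the 's ambiguity is to peek at the next word: before a past participle it's usually 'has', otherwise 'is'. A rough sketch of that idea (my assumption, not the library's actual logic; the tiny participle list stands in for a real lexicon):

```javascript
// "'s" disambiguation by lookahead:
//   "she's gone"  -> ['she', 'has']   (next word is a past participle)
//   "she's happy" -> ['she', 'is']    (anything else)
const pastParticiples = new Set(['been', 'gone', 'done', 'seen', 'taken'])
function expandApostropheS(pronoun, nextWord) {
  const verb = pastParticiples.has(nextWord) ? 'has' : 'is'
  return [pronoun, verb]
}
```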

language independence ...

Hey there,
again : this is not an issue.
The changes recently done are totally fine but let me explain why I make (made or am planning to make) which changes in the fork https://github.com/redaktor/nlp_compromise

As a European I would love this project to be as multilingual as possible ;)
The changes made have these goals :
• for contributing be totally explanative and readable
• for transport be browser-friendly and thus very small
• completely separate data / language logic / project logic

Three new files in src/data
: dictionary.js
The file where we can contribute multilingual words in the categories like in the readme.
: dictionary_rules.js (tba)
The file where we can contribute multilingual rules.
: _build.js
To build the data modules for one/some/all languages.
This could also be the first grunt step.

It will generate or overwrite a folder like 'en'.
Check it out: node _build -l
Basically I am planning to let the build script generate a customized client side file and additional AMD browser modules.
See for instance the module.exports lines - there are more than 30, but they are useless in the browser, and apart from that I'd optimize the compression for the browser a bit further.

I do also try to avoid duplicates further. For example in phrasal verbs : Some verbs are already in the verb data module and some adjectives are already in the adj. module ...

When it is complete:
• each module e.g. in /parent should only be a little bit of 'project logic'.
• our database can autotranslate
• I could attach our web interface to encourage translators even more ;)

ngram: support for ngrams with size equal to 1

Currently there is no way to use nlp.ngram() to perform a simple word frequency calculation (i.e. ngrams with size equal to one). Setting the max_size option to 1 produces ngrams with size 2. Setting max_size to 0 also gives the same result. I suspect that these two are the lines responsible for this issue: https://github.com/spencermountain/nlp_compromise/blob/master/src/methods/tokenization/ngram.js#L11 (where max_size is incremented - why?) and https://github.com/spencermountain/nlp_compromise/blob/master/src/methods/tokenization/ngram.js#L6
(where since 0 is false, max_size is assigned the value 5).
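The expected behaviour is easy to state in isolation. A minimal n-gram extractor that handles size 1 correctly - a sketch of what the issue asks for, not the library's implementation:

```javascript
// n-grams without the off-by-one: size 1 yields plain word frequencies.
// A missing size falls back to 1 explicitly, rather than relying on
// truthiness (which is what mistakenly turned 0 into the default of 5).
function ngrams(words, size) {
  if (size == null || size < 1) size = 1
  const out = []
  for (let i = 0; i + size <= words.length; i++) {
    out.push(words.slice(i, i + size).join(' '))
  }
  return out
}
```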

nlp.sentences() collapses whitespace

Pretty self-explanatory, but if there are multiple whitespace characters between words, the detected sentence collapses these characters together

NPM installation throws error

Below is the stack trace --

npm http GET https://registry.npmjs.org/nlp_comprimise
npm http 304 https://registry.npmjs.org/nlp_comprimise
npm http GET https://registry.npmjs.org/nlp_comprimise/-/nlp_comprimise-0.0.3.tgz
npm http 404 https://registry.npmjs.org/nlp_comprimise/-/nlp_comprimise-0.0.3.tgz
npm ERR! fetch failed https://registry.npmjs.org/nlp_comprimise/-/nlp_comprimise-0.0.3.tgz
npm ERR! Error: 404 Not Found
npm ERR!     at WriteStream.<anonymous> (/usr/local/Cellar/node/0.10.25/lib/node_modules/npm/lib/utils/fetch.js:57:12)
npm ERR!     at WriteStream.EventEmitter.emit (events.js:117:20)
npm ERR!     at fs.js:1596:14
npm ERR!     at /usr/local/Cellar/node/0.10.25/lib/node_modules/npm/node_modules/graceful-fs/graceful-fs.js:103:5
npm ERR!     at Object.oncomplete (fs.js:107:15)
npm ERR! If you need help, you may report this *entire* log,
npm ERR! including the npm and node versions, at:
npm ERR!     <http://github.com/isaacs/npm/issues>

npm ERR! System Darwin 13.0.0
npm ERR! command "/usr/local/Cellar/node/0.10.25/bin/node" "/usr/local/bin/npm" "install" "nlp_comprimise" "--save"
npm ERR! cwd /Users/WS/nlp/natural
npm ERR! node -v v0.10.25
npm ERR! npm -v 1.3.24
npm ERR! 
npm ERR! Additional logging details can be found in:
npm ERR!     /Users/WS/nlp/natural/npm-debug.log
npm ERR! not ok code 0

date_extractor: Cannot read property '1' of null

Hi,
I'm getting this error when trying to parse certain strings.
I've put in a hack for the function to always return null as i'm not using date extraction, but it's not a fix.

/node_modules/nlp_compromise/src/parents/value/coffeejs/date_extractor.js:224
h[k] = arr[places[k]];
^
TypeError: Cannot read property '1' of null
at /node_modules/nlp_compromise/src/parents/value/coffeejs/date_extractor.js:224:21
at Array.reduce (native)
at Object.regexes.process (/node_modules/nlp_compromise/src/parents/value/coffeejs/date_extractor.js:223:36)
at main (/node_modules/nlp_compromise/src/parents/value/coffeejs/date_extractor.js:334:20)
at the.date (/node_modules/nlp_compromise/src/parents/value/index.js:13:11)
at /node_modules/nlp_compromise/src/parents/value/index.js:38:11
at new Value (/node_modules/nlp_compromise/src/parents/value/index.js:45:4)
at Object.parents.value (/node_modules/nlp_compromise/src/parents/parents.js:22:10)
at /node_modules/nlp_compromise/src/pos.js:366:47
at Array.map (native)

Sentence negate will negate everything

for example:
[Orig]
They are based on different physical effects use to guarantee a stable grasping between a gripper and the object to be grasped.

[negate]
They are not based on different physical didn't effects use to doesn't guarantee a stable grasping between a gripper and the object to be grasped.

Maybe just negate the first verb would be sufficient.

is_plural again

Hm - the last commit does not work properly because

in pluralize_rules we have rules for both singular to plural AND plural to plural
while
in singularize_rules it is only plural to singular

(???)

In general I am working on a factory method called "dictionary" based on the "words" and "rules" and this can be autotranslated by our database to several languages covering the ngram and metrics etc.

are/were

I'm new to nlp and am weak on my grammar, so maybe I'm barking up the wrong trees.

I'm using nlp_compromise to switch the verb-tense in sentences from past to present, or present to past, using nlp.verb(vb).to_present() or nlp.verb(vb).to_past() as required.

It's working great, for the most part, except for when I try to swap the tense of "They are friends" or "They were friends".

Is there some other way I should be going about this, am I using the wrong tools, or is this something that can be extended with some new rules?

Implementing the package on a data structure?

Hey, I am quite new to meteor and computer science basically. I am a linguist trying to learn computational linguistics somehow. I guess I am still struggling. I was wondering how it would be possible to use this on my own corpus? Let's say that I have a list of sentences, and whenever I choose a sentence I want to see the properties of that sentence. Would that be possible?

cannot find dates

using this from node gives an error:
Error: Cannot find module './dates'

Steps to reproduce:

  • npm install nlp_compromise
  • node
  • nlp = require('nlp_compromise');

2.0 - just a note : string to regex

see lexidates:
res.dayS = '\b('.concat(Object.keys(res.days).join('|'), ')\b');

When a string becomes a regex in javascript, you must double-escape anything with special regex meaning.
So \b should be \\b here - see my original code...

If you want to use it like on top you need to pass it to a quote function, e.g.
Mozilla:

function escapeRegExp(string){
  return string.replace(/([.*+?^=!:${}()|\[\]\/\\])/g, "\\$1");
}

or in dojo see .string ...
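To make the difference concrete, here is a minimal, self-contained demonstration of the point (illustrative day names, not the actual lexi data):

```javascript
// In a string literal, '\b' is the backspace character (\u0008),
// not a regex word boundary - you need '\\b' when building a RegExp
// from a string.
const days = { monday: 1, tuesday: 2 }
const alternation = Object.keys(days).join('|')
const wrong = new RegExp('\b(' + alternation + ')\b')   // matches backspaces
const right = new RegExp('\\b(' + alternation + ')\\b') // word boundaries
```

If the keys could contain regex metacharacters, they should also go through an escape function like the Mozilla one quoted above before being joined.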

Verb Tense Bug in Demo

Sentence:

joe carter plays patiently in toronto

Steps to reproduce:

  1. Change plays to past-tense
  2. Negate played

Result:

joe carter didn't playe patiently in toronto

feature request : phrasal verbs stemming

In the default mode [without {dont_combine:true}] it would be nice to have phrasal verbs recognized – as they can have a totally new meaning. For example

My grandfather likes to look back on his childhood.
`look back`

[taken from http://www.englisch-hilfen.de/grammar/phrasal_verbs.htm]

Just FYI.

@spencermountain
Please see this demo http://expresso-app.org/tutorial ...
I made the same demo with your nice project.
More or less lazily by porting the "python metrics logic" to .js.
The advantages are : .js only and onKeyPress ... Think of a better http://www.hemingwayapp.com ;))
Will work on it later today. Also pointed the author of expresso to your project.

The method could either be contributed as a .metrics() function to the "root level" used in a demo or as a standalone demo. Just tell me if you are interested by writing to @redaktor (I'll close this directly) .
Thank you for starting to build this missing javascript puzzle-piece!

January is not recognized by date_extractor?

Heya,

This is a wonderful library. Hoping to use it to extract dates in a project, but I noticed that January consistently fails to be properly extracted in tests. I'm wondering if this is a subtle bug with indexes / accidental type coercion of 0 to false in date_extractor.coffee.

Would be happy to help you track down the issue if you have trouble.

Here is an example I just tried on master:

nlp.value("Today is January 7, 2015").date()
{ month: null,
  day: 7,
  year: 2015,
  to_day: null,
  to_year: 2015,
  to_month: null }
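The reporter's hunch about zero-coercion is easy to illustrate: month indexes start at 0, so January is falsy and a plain truthiness check silently drops it. This is hypothetical code showing the pitfall, not the library's actual extractor:

```javascript
// Suspected bug pattern: 0 (January) fails a truthiness check
function keepMonthBuggy(month) {
  return month ? month : null; // month 0 -> null
}

// Fixed: test explicitly for null/undefined instead
function keepMonthFixed(month) {
  return month !== undefined && month !== null ? month : null;
}
```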

Question - Advanced Date Parsing

First off, this is such a great project!
Do you have any thoughts on returning an array of dates from a parsed sentence? Or more advanced logic, like ranges?

It looks like this does a lot of what I've done in https://github.com/silentrob/normalizer but I suspect much faster (basic normalization and commonwealth => american conversion).

I also have some code that deals with numbers and parsing math expressions here https://github.com/silentrob/superscript/blob/master/lib/math.js.

performance / why is it running twice ...

Hey there,
contributing from my fork doesn't make sense because the structure will change to 'only the 3 dictionary files and a factory' soon.
However - let me ask some performance questions.
Maybe I missed something hidden in the code, but several 'autoclosure' functions run every time a module is required.

Let's take an example - the conjugation of verbs which is used quite often.
I'll use simple console.log to demonstrate it.

In src/parents/verb/index

put some logs in the conjugate function

the.conjugate = function() {
  console.log('BEWARE! conjugate is conjugating');
  var verb_conjugate = require('./conjugate');
  var conjugated = verb_conjugate(the.word);
  console.log('conjugate result', conjugated);
  return conjugated; // was: return verb_conjugate(the.word);
}

and in the 'autoclosure' form function

the.form = (function() {
    console.log( 'BEWARE! the.form is conjugating' );
    var verb_conjugate = require('./conjugate');
    // don't choose infinitive if infinitive == present
    var order = [
      'past',
      'present',
      'gerund',
      'infinitive'
    ];
    var forms = verb_conjugate(the.word);
    console.log( 'forms result', forms );
    for (var i = 0; i < order.length; i++) {
        if (forms[order[i]] === the.word) {
            return order[i];
        }
    }
})()

When I do

console.log( nlp.verb('last') );

it will conjugate

and when I do

console.log( nlp.verb('last').conjugate() );

it will conjugate twice
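The double run happens because the.form is an IIFE that conjugates at construction time, and .conjugate() conjugates again. One possible fix (a sketch under assumed shapes - conjugateWord stands in for require('./conjugate')) is to memoize per word so both call sites share one computation:

```javascript
// Cache results per word so repeated calls do the expensive work once
function memoize(fn) {
  var cache = {};
  return function (word) {
    if (!(word in cache)) {
      cache[word] = fn(word);
    }
    return cache[word];
  };
}

var calls = 0;
var conjugate = memoize(function conjugateWord(word) {
  calls++; // count how often the expensive work actually runs
  return { infinitive: word, present: word + 's', past: word + 'ed' };
});

conjugate('last');
conjugate('last'); // served from cache, no second conjugation
```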

Double quotes to specify pos

We are using nlp_compromise to parse requests for data pulls. In many cases, a product or retailer name gets parsed in an undesirable fashion, e.g. "Stop & Shop" will not be recognized as a noun.

Is it possible today, or would it be possible, to allow double-quotes to group words together and default them to a particular part of speech, like NN?
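A hypothetical pre-pass sketch of the idea (names and shape are assumptions, not the library's API): pull double-quoted spans out of the request and record them as nouns in a lexicon object before handing the text to the parser.

```javascript
// Extract "quoted phrases" into a word -> tag lexicon, and return cleaned text
function extractQuoted(text) {
  var lexicon = {};
  var cleaned = text.replace(/"([^"]+)"/g, function (match, phrase) {
    lexicon[phrase.toLowerCase()] = 'Noun';
    return phrase; // drop the quotes, keep the phrase
  });
  return { cleaned: cleaned, lexicon: lexicon };
}
```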

nlp.tag is not defined

The README mentions it, but I don't see it in the exports nor does the current version published to npm have it.

Britishize asymmetry with Americanize

On version 1.1.3

If you try typing in something like

require("nlp_compromise").britishize("color");
> "color"

require("nlp_compromise").britishize("favorite");
> "favorite"

require("nlp_compromise").britishize("internationalization");
> "internationalization"

It just returns whatever the input is.

The americanize function works perfectly fine, though.
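For reference, the outputs the reporter expects ('colour', 'favourite', 'internationalisation') follow a handful of suffix rules. A minimal sketch of such rules (assumed for illustration, not the library's actual data):

```javascript
// A few sample suffix rules for en-US -> en-GB (illustrative only)
var rules = [
  [/(col|flav|hon|lab|neighb)or(s?)$/, '$1our$2'],
  [/(fav)or(ite)(s?)$/, '$1our$2$3'],
  [/ization$/, 'isation'],
];

function britishize(word) {
  for (var i = 0; i < rules.length; i++) {
    if (rules[i][0].test(word)) return word.replace(rules[i][0], rules[i][1]);
  }
  return word; // no rule matched
}
```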


past tense '-dy', '-ly'

I think maybe this is not working correctly, but as it seems broken for a bunch of verbs, maybe I'm missing something...

nlp.verb('study').to_past()
"studyed"

nlp.verb('apply').to_past()
"applyed"

Dictionary?

Are you using your own internal dictionary / algorithms to do the transformations, etc.? If so (and I seem to think this is the case), there is something off with the conjugation of the verb "load":

{ infinitive: 'loa',
  present: 'loads',
  past: 'loaded',
  gerund: 'loading',
  doer: 'loaer',
  future: 'will loa' }

Now if I try something else, like "to load":

{ infinitive: 'to load',
  present: 'to loads',
  past: 'to loaded',
  gerund: 'to loading',
  doer: 'to loader',
  future: 'will to load' }

This doesn't seem right either. Am I doing something wrong with the entry of the string word(s) - is there some form I am missing? Or is this a corner case in the algorithm, perhaps? Thought I'd at least report it 📦
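One guess at the cause (hypothetical code, not the library's): a rule that trims a trailing 'd' to recover an infinitive fires on the bare infinitive 'load' as well, yielding 'loa'. Guarding on the full -ed suffix avoids it:

```javascript
// Suspected buggy pattern: strips a 'd' even when there is no -ed suffix
function stripPastBuggy(verb) {
  return verb.replace(/d$/, ''); // 'load' -> 'loa'
}

// Safer: only strip a genuine -ed ending
function stripPastFixed(verb) {
  return verb.replace(/([^e])ed$/, '$1'); // 'loaded' -> 'load', 'load' unchanged
}
```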

Otherwise, great solution: much thanks!

Changes in the fork and the pull request ...

Hey,

just committed nearly the last of the changes to the fork
https://github.com/redaktor/nlp_compromise
before I can do a pull request.

I still need to eliminate:
• the 'hardcoded' dups in lexicon generation
• the last 37/1360(?) failing tests

The lexicon will then be at least 10% smaller, and I really think that with this structure, language-dependent contributions can become easy.
Just pinging because I saw you were recently active ...

feature: add custom verbs/nouns etc

Scenario: using the library for natural language processing for a calendar assistant. Doesn't recognise "schedule" as a verb.

Would it be possible to pass in some configuration when instantiating the library, e.g. an array of verbs + nouns, etc., to allow users to inject extra words?
In my case I might extend the verbs by passing in an array of my own:
["schedule"]

That seems to work for me if I hack the code and add 'schedule' to the list of verbs...but I don't grok grammar well enough to know if it's completely correct (it becomes an infinitive verb, VBP)
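A minimal sketch of the requested configuration shape (names and base entries are assumptions for illustration, not the library's API): merge user-supplied words into a base lexicon at instantiation, so 'schedule' can be tagged as a verb without hacking the source.

```javascript
// Merge user-supplied verbs/nouns into a (stand-in) base lexicon
function makeLexicon(extraVerbs, extraNouns) {
  var lexicon = { walk: 'Verb', meeting: 'Noun' }; // stand-in base entries
  (extraVerbs || []).forEach(function (w) { lexicon[w] = 'Verb'; });
  (extraNouns || []).forEach(function (w) { lexicon[w] = 'Noun'; });
  return lexicon;
}

var lexicon = makeLexicon(['schedule']);
```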
