Pelias analysis libraries

This repository contains prebuilt text analysis functions (analyzers), each composed of smaller modules (tokenizers). Each tokenizer performs actions such as transforming, filtering and enriching word tokens.

Using Analyzers

Analyzers are available as functions and can be called like any regular function. The input is a single string and the output is also a single string:

var street = require('./analyzer/street')
var analyzer = street()

analyzer('main str s')
// Main Street South

Analyzers also accept a 'context' object which is available throughout the analysis pipeline:

var analyzer = street({ locale: 'de' })

analyzer('main str s')
// Main Strasse Sued
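
The source does not show how an analyzer is assembled internally, so the following is only a minimal sketch of the idea. It assumes an analyzer is a factory which binds a context to an ordered list of tokenizers and runs them one after another. The names createAnalyzer, expand and capitalize are hypothetical and are not part of this repository:

```javascript
// hypothetical helper: build an analyzer factory from an ordered list of tokenizers
var createAnalyzer = function( tokenizers ){
  return function( context ){

    // bind the (optional) context to every tokenizer
    var ctx = context || {}
    var bound = tokenizers.map( function( tokenizer ){
      return tokenizer.bind( ctx )
    })

    return function( input ){

      // split the input string in to words, ignoring empty strings
      var words = input.split(/\s+/).filter( Boolean )

      // run each tokenizer over the full token stream, in order
      var tokens = bound.reduce( function( stream, tokenizer ){
        return stream.reduce( tokenizer, [] )
      }, words )

      return tokens.join(' ')
    }
  }
}

// hypothetical tokenizer: expand 'str' according to the locale
var expand = function( res, word ){
  var expansions = { en: 'street', de: 'strasse' }
  var locale = this.locale || 'en'
  res.push( ( word === 'str' && expansions[ locale ] ) || word )
  return res
}

// hypothetical tokenizer: uppercase the first letter of each word
var capitalize = function( res, word ){
  res.push( word.charAt(0).toUpperCase() + word.slice(1) )
  return res
}

var sketch = createAnalyzer([ expand, capitalize ])

var defaultOutput = sketch()('main str')
// 'Main Street'

var germanOutput = sketch({ locale: 'de' })('main str')
// 'Main Strasse'
```

Because the context is bound once, when the analyzer is created, every tokenizer in the pipeline sees the same context object.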

Using Tokenizers

Tokenizers are intended to be used as part of an analyzer, but can also be used independently by calling Array.reduce on an array of tokens:

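var tokenizer = require('./tokenizer/diacritic')

// assign the result so the leading '[' is not parsed as a property access on the previous line
var tokens = [ 'žůžo', 'Cinématte' ].reduce( tokenizer, [] )
// [ 'zuzo', 'Cinematte' ]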
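
Since each call to reduce returns a new array of tokens, tokenizers can be chained by calling reduce again on the result. The sketch below illustrates this with two stand-in tokenizers. stripDiacritics and lowercase are hypothetical names, and stripDiacritics is an assumption about how accents could be removed, not the implementation in ./tokenizer/diacritic:

```javascript
// hypothetical stand-in for a diacritic tokenizer (not the repository's implementation)
var stripDiacritics = function( res, word ){

  // decompose accented characters, then remove the combining marks (U+0300 to U+036F)
  res.push( word.normalize('NFD').replace(/[\u0300-\u036f]/g, '') )

  // you must always return $res
  return res
}

// hypothetical tokenizer: lowercase each word
var lowercase = function( res, word ){
  res.push( word.toLowerCase() )
  return res
}

var stripped = [ 'žůžo', 'Cinématte' ].reduce( stripDiacritics, [] )
// [ 'zuzo', 'Cinematte' ]

// chain a second tokenizer by reducing the output of the first
var chained = stripped.reduce( lowercase, [] )
// [ 'zuzo', 'cinematte' ]
```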

Writing Tokenizers

Tokenizers are functions with the interface expected by Array.reduce.

In its simplest form, a tokenizer is written as:

// a delete-all tokenizer emits no words
var tokenizer = function( res, word, pos, arr ){

  // you must always return $res
  return res
}

For a tokenizer to have no effect on the token stream, it must push each word it receives on to the response array using res.push():

// a no-op tokenizer emits words verbatim as they were taken in
var tokenizer = function( res, word, pos, arr ){

  // push the word on to the response array unmodified
  res.push( word )

  // you must always return $res
  return res
}

A tokenizer can choose which words are pushed downstream. It can also modify words and push more than one word on to the response array:

// a split tokenizer cuts a string on word boundaries, producing multiple words
var tokenizer = function( res, word, pos, arr ){

  // split the input word on word boundaries
  var parts = word.split(/\b/g)

  // push each part downstream
  parts.forEach( function( part ){
    res.push( part )
  })

  // you must always return $res
  return res
}

Using these techniques, you can write tokenizers which delete, modify or create new words.
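
As a rough illustration of the three techniques working together, the sketch below chains one tokenizer of each kind. The names split, dropBlank and uppercase are hypothetical and chosen only for this example:

```javascript
// create: split each word on word boundaries, producing multiple words
var split = function( res, word ){
  word.split(/\b/).forEach( function( part ){ res.push( part ) })
  return res
}

// delete: drop tokens which contain only whitespace
var dropBlank = function( res, word ){
  if( word.trim().length > 0 ){ res.push( word ) }
  return res
}

// modify: uppercase each remaining word
var uppercase = function( res, word ){
  res.push( word.toUpperCase() )
  return res
}

var output = [ 'main st', 'w' ]
  .reduce( split, [] )
  .reduce( dropBlank, [] )
  .reduce( uppercase, [] )
// [ 'MAIN', 'ST', 'W' ]
```

Splitting 'main st' on word boundaries yields the whitespace between the words as its own token, which is why the dropBlank step follows the split.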

Writing Tokenizers (advanced)

More advanced tokenizers require information about the context in which they are run. For example, knowing the locale of the input tokens allows a tokenizer to vary its behavior accordingly.

Context is provided to tokenizers by using Function.bind to bind the context to the tokenizer. This information will then be available inside the tokenizer using the this keyword:

// an abbreviation tokenizer converts the contracted form of a word to its equivalent expanded form
var tokenizer = function( res, word, pos, arr ){

  // detect the input locale (or default to english)
  var locale = this.locale || 'en'

  if( 'str.' === word ){
    switch( locale ){
      case 'de':
        // transform to German expansion
        res.push( 'strasse' )
        return res
      case 'en':
        // transform to English expansion
        res.push( 'street' )
        return res
    }
  }

  // push the word on to the response array unmodified
  res.push( word )

  // you must always return $res
  return res
}

You can then control the runtime context of the tokenizer using Function.bind:

var english = tokenizer.bind({ locale: 'en' })
var englishTokens = [ 'str.' ].reduce( english, [] )
// [ 'street' ]

var german = tokenizer.bind({ locale: 'de' })
var germanTokens = [ 'str.' ].reduce( german, [] )
// [ 'strasse' ]
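
In the abbreviation tokenizer above, a locale with no matching case falls through the switch and the word is emitted unmodified. The variation below is a sketch which makes that fallback explicit and also tolerates being called without a bound context. The name safeTokenizer and the expansions lookup table are assumptions made for illustration:

```javascript
// a variation on the abbreviation tokenizer which tolerates a missing context
var safeTokenizer = function( res, word ){

  // hypothetical lookup table, one entry per supported locale
  var expansions = { en: 'street', de: 'strasse' }

  // 'this' may be undefined (strict mode) when the tokenizer was never bound
  var locale = ( this && this.locale ) || 'en'

  // expand the abbreviation only when the locale is supported
  if( 'str.' === word && expansions.hasOwnProperty( locale ) ){
    res.push( expansions[ locale ] )
    return res
  }

  // push the word on to the response array unmodified
  res.push( word )

  // you must always return $res
  return res
}

var unbound = [ 'str.' ].reduce( safeTokenizer, [] )
// [ 'street' ]

var french = [ 'str.' ].reduce( safeTokenizer.bind({ locale: 'fr' }), [] )
// [ 'str.' ]
```

Note that tokenizers which read from this must be written as regular functions. Arrow functions ignore the context supplied to Function.bind.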

Command line interface

An included CLI script allows you to easily pipe in files for testing an analyzer:

# test a single input
$ node cli.js en street <<< "n foo st w"

North Foo Street West

# test multiple inputs
$ echo -e "n foo st w\nw 16th st" | node cli.js en street

North Foo Street West
West 16 Street

# test against the contents of a file
$ node cli.js en street < nyc.names

100 Avenue
100 Drive
100 Road
... etc

# test against openaddresses data
$ cut -d',' -f4 /data/oa/de/berlin.csv | sort | uniq | node cli.js de street

Aachener Strasse
Aalemannufer
Aalesunder Strasse
... etc

Using the Linux diff command, you can view a side-by-side comparison of the data before and after analysis:

$ diff \
  --side-by-side \
  --ignore-blank-lines \
  --suppress-common-lines \
  --width=100 \
  --expand-tabs \
  nyc.names \
  <(node cli.js en street < nyc.names)

ZEBRA PL                  | Zebra Place
ZECK CT                   | Zeck Court
ZEPHYR AVE                | Zephyr Avenue
... etc

Running tests

Unit tests are run with:

$ npm test

Functional tests are run with:

$ npm run funcs
