
pelias's Introduction

A modular, open-source search engine for our world.

Pelias is a geocoder powered completely by open data, available freely to everyone.

Local Installation · Cloud Webservice · Documentation · Community Chat

What is Pelias?
Pelias is a search engine for places worldwide, powered by open data. It turns addresses and place names into geographic coordinates, and turns geographic coordinates into places and addresses. With Pelias, you’re able to turn your users’ place searches into actionable geodata and transform your geodata into real places.

We think open data, open source, and open strategy win over proprietary solutions at any part of the stack and we want to ensure the services we offer are in line with that vision. We believe that an open geocoder improves over the long-term only if the community can incorporate truly representative local knowledge.

Pelias

A modular, open-source geocoder built on top of Elasticsearch for fast and accurate global search.

What's a geocoder do anyway?

Geocoding is the process of taking input text, such as an address or the name of a place, and returning a latitude/longitude location on the Earth's surface for that place.
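
For example, a running Pelias instance answers forward geocoding queries on its /v1/search endpoint. A minimal sketch in Node.js, assuming a local instance listening on port 4000 (the default in the Docker setup) and a fetch-capable runtime:

const params = new URLSearchParams({ text: '30 W 26th St, New York, NY', size: '1' });

fetch(`http://localhost:4000/v1/search?${params}`)
  .then((res) => res.json())
  .then((geojson) => {
    // Pelias responds with a GeoJSON FeatureCollection; each feature has a
    // point geometry plus properties such as a human-readable label.
    const [best] = geojson.features;
    console.log(best.properties.label, best.geometry.coordinates);
  });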


... and a reverse geocoder, what's that?

Reverse geocoding is the opposite: returning a list of places near a given latitude/longitude point.


What are the most interesting features of Pelias?

  • Completely open-source and MIT licensed
  • A powerful data import architecture: Pelias supports many open-data projects out of the box but also works great with private data
  • Support for searching and displaying results in many languages
  • Fast and accurate autocomplete for user-facing geocoding
  • Support for many result types: addresses, venues, cities, countries, and more
  • Modular design, so you don't need to be an expert in everything to make changes
  • Easy installation with minimal external dependencies

What are the main goals of the Pelias project?

  • Provide accurate search results
  • Work equally well for a small city and the entire planet
  • Be highly configurable, so different use cases can be handled easily and efficiently
  • Provide a friendly, welcoming, helpful community that takes input from people all over the world

Where did Pelias come from?

Pelias was created in 2014 as an early project at Mapzen. After Mapzen's shutdown in 2017, Pelias became a project of the Linux Foundation.

How does it work?

Magic! (Just kidding) Like any geocoder, Pelias combines full text search techniques with knowledge of geography to quickly search over many millions of records, each representing some sort of location on Earth.

The Pelias architecture has three main components and several smaller pieces.

A diagram of the Pelias architecture.

Data importers

The importers filter, normalize, and ingest geographic datasets into the Pelias database. Currently there are six officially supported importers: OpenStreetMap, OpenAddresses, Who's on First, Geonames, the polylines importer (for street geometries derived from OSM), and a CSV importer.

We are always discussing supporting additional datasets. Pelias users can also write their own importers, for example to import proprietary data into their own instance of Pelias.

Database

The underlying datastore that does most of the query heavy-lifting and powers our search results. We use Elasticsearch. Currently versions 7 and 8 are supported.

We've built a tool called pelias-schema that sets up Elasticsearch indices properly for Pelias.

Frontend services

This is where the actual geocoding process happens, and includes the components that users interact with when performing geocoding queries. The services are:

  • API: The API service defines the Pelias API, and talks to Elasticsearch or other services as needed to perform queries.
  • Placeholder: A service built specifically to capture the relationships between administrative areas (a catch-all term for anything like a city, state, or country). Elasticsearch does not handle relational data very well, so we built Placeholder specifically to manage this piece.
  • PIP: For reverse geocoding, it's important to be able to perform point-in-polygon (PIP) calculations quickly. The PIP service is very good at quickly determining which admin-area polygons a given point lies in.
  • Libpostal: Pelias uses the libpostal project for parsing addresses using the power of machine learning. We use a Go service built by the Who's on First team to make this happen quickly and efficiently.
  • Interpolation: This service knows all about addresses and streets. With that knowledge, it is able to supplement the known addresses that are stored directly in Elasticsearch and return fairly accurate estimated address results for many more queries than would otherwise be possible.
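
The API locates these services through pelias.json. As a rough illustration only (the key names and ports below mirror common example configurations and are not authoritative), the relevant fragment might look like:

{
  "services": {
    "libpostal": { "url": "http://localhost:4400" },
    "placeholder": { "url": "http://localhost:4100" },
    "pip": { "url": "http://localhost:4200" },
    "interpolation": { "url": "http://localhost:4300" }
  }
}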

Dependencies

These are software projects that are not used directly but are used by other components of Pelias.

There are lots of these, but here are some important ones:

  • model: provides a single library for creating documents that fit the Pelias Elasticsearch schema. This is a core component of our flexible importer architecture (a short usage sketch follows this list).
  • wof-admin-lookup: A library for performing administrative lookup using point-in-polygon math. Previously included in each of the importers, but now only used by the PIP service.
  • query: This is where most of our actual Elasticsearch query generation happens.
  • config: Pelias is very configurable, and all of it is driven from a single JSON file which we call pelias.json. This package provides a library for reading, validating, and working with this configuration. It is used by almost every other Pelias component.
  • dbclient: A Node.js stream library for quickly and efficiently importing records into Elasticsearch.
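
As a small illustration of how these libraries fit together, here is a sketch of an importer-style snippet (method names follow the pelias-config and pelias-model READMEs; the values are made up):

// read pelias.json, merged with the built-in defaults
const config = require('pelias-config').generate();
console.log(config.esclient); // Elasticsearch connection settings

// build a document that matches the Pelias Elasticsearch schema
const Document = require('pelias-model').Document;

const doc = new Document('example-source', 'venue', 'example-id-1')
  .setName('default', 'Example Cafe')
  .setCentroid({ lon: -73.99, lat: 40.743 });

// importers stream documents like this through pelias-dbclient into Elasticsearch
console.log(doc.toESDocument());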

Helpful tools

Finally, while not part of Pelias proper, we have built several useful tools for working with and testing Pelias.

Notable examples include:

  • acceptance-tests: A Node.js command line tool for testing a full planet build of Pelias and ensuring everything works. Familiarity with this tool is very important for ensuring Pelias is working. It supports all Pelias features and has special facilities for testing autocomplete queries.
  • compare: A web-based tool for comparing different instances of Pelias (for example a production and staging environment). We have a reference instance at pelias.github.io/compare/
  • dashboard: Another web-based tool for providing statistics about the contents of a Pelias Elasticsearch index such as import speed, number of total records, and a breakdown of records of various types.

Documentation

The main documentation lives in the pelias/documentation repository.

Additionally, the README file in each of the component repositories listed above provides more detail on that piece.

Here's an example API response for a reverse geocoding query:
$ curl -s "search.mapzen.com/v1/reverse?size=1&point.lat=40.74358294846026&point.lon=-73.99047374725342&api_key={YOUR_API_KEY}" | json
{
    "geocoding": {
        "attribution": "https://search.mapzen.com/v1/attribution",
        "engine": {
            "author": "Mapzen",
            "name": "Pelias",
            "version": "1.0"
        },
        "query": {
            "boundary.circle.lat": 40.74358294846026,
            "boundary.circle.lon": -73.99047374725342,
            "boundary.circle.radius": 500,
            "point.lat": 40.74358294846026,
            "point.lon": -73.99047374725342,
            "private": false,
            "querySize": 1,
            "size": 1
        },
        "timestamp": 1460736907438,
        "version": "0.1"
    },
    "type": "FeatureCollection",
    "features": [
        {
            "geometry": {
                "coordinates": [
                    -73.99051,
                    40.74361
                ],
                "type": "Point"
            },
            "properties": {
                "borough": "Manhattan",
                "borough_gid": "whosonfirst:borough:421205771",
                "confidence": 0.9,
                "country": "United States",
                "country_a": "USA",
                "country_gid": "whosonfirst:country:85633793",
                "county": "New York County",
                "county_gid": "whosonfirst:county:102081863",
                "distance": 0.004,
                "gid": "geonames:venue:9851011",
                "id": "9851011",
                "label": "Arlington, Manhattan, NY, USA",
                "layer": "venue",
                "locality": "New York",
                "locality_gid": "whosonfirst:locality:85977539",
                "name": "Arlington",
                "neighbourhood": "Flatiron District",
                "neighbourhood_gid": "whosonfirst:neighbourhood:85869245",
                "region": "New York",
                "region_a": "NY",
                "region_gid": "whosonfirst:region:85688543",
                "source": "geonames"
            },
            "type": "Feature"
        }
    ],
    "bbox": [
        -73.99051,
        40.74361,
        -73.99051,
        40.74361
    ]
}

How can I install my own instance of Pelias?

To try out Pelias quickly, use our Docker setup. It uses Docker and docker-compose to allow you to quickly set up a Pelias instance for a small area (by default Portland, Oregon) in under 30 minutes.

Do you offer a free geocoding API?

You can sign up for a trial API key at Geocode Earth. A commercial service has been operated by the core development team behind Pelias since 2014 (previously at search.mapzen.com). Discounts and free plans are available for free and open-source software projects.

What's it built with?

Pelias itself (the import pipelines and API) is written in Node.js, which makes it highly accessible for other developers and performant under heavy I/O. It aims to be modular and is distributed across a number of Node packages, each with its own repository under the Pelias GitHub organization.

For a select few components that have performance requirements that Node.js cannot meet, we prefer to write things in Go. A good example of this is the pbf2json tool that quickly converts OSM PBF files to JSON for our OSM importer.

Elasticsearch is our datastore of choice because of its unparalleled full text search functionality, scalability, and sufficiently robust geospatial support.

Contributing


We built Pelias as an open source project not just because we believe that users should be able to view and play with the source code of tools they use, but to get the community involved in the project itself.

Especially for a geocoder with global coverage, it's just not possible for a small team to do it alone. We need you.

Anything that we can do to make contributing easier, we want to know about. Feel free to reach out to us via Github, Gitter, email, or Twitter. We'd love to help people get started working on Pelias, especially if you're new to open source or programming in general.

We have a list of Good First Issues for new contributors.

Both this meta-repo and the API service repo are worth looking at, as they're where most issues live. We also welcome reporting issues or suggesting improvements to our documentation.

The current Pelias team can be found on Github as missinglink and orangejulius.

Members emeritus include:

pelias's People

Contributors

avulfson17, bradh, defozo, dianashk, easherma, echelon9, heffergm, hkrishna, jayaddison, kathleenld, matkoniecz, meetar, michaelkirk, migurski, missinglink, oliverbienert, orangejulius, riordan, rmglennon, sevko, stephenlacy, tigerlily-he, tobiasdierich, trescube


pelias's Issues

Agree on a consistent unit testing framework to use in all pelias repos

As discussed in a recent chat, it is important to agree on a single framework to be used for unit testing across the pelias organization. This will ensure consistency and cohesion.
We are currently predominantly using tape, but there is no strong preference for tape. I prefer mocha. I've compiled a short list of my reasons for liking it, and an even shorter list of my reasons for not liking it. Also, I like using mocha with should, which reads a lot like natural language and has extensive assertions built in.
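
For comparison, here is the same trivial test written both ways (illustrative snippets only, not taken from any pelias repo):

// tape
var test = require('tape');
test('reverse geocode', function (t) {
  t.equal(1 + 1, 2, 'adds up');
  t.end();
});

// mocha + should
require('should');
describe('reverse geocode', function () {
  it('adds up', function () {
    (1 + 1).should.equal(2);
  });
});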

Pros:

  1. skip/only are great tools during development

    • skip makes the tests show up as pending so you don't forget to come back to them. You can skip entire suites or single tests:

      describe.skip('some test', function () { ... });

      describe('another test', function () {
        it.skip('should work', function (done) { ... });
      });

    • you can write unimplemented tests, which are effective placeholders:

      describe('another test', function () {
        it('should work in the future');
      });

    • only lets you run a single test while debugging an issue, with no need to comment out all other test cases
  2. before/beforeEach/after/afterEach can be nested at various levels

  3. can run tests matching some regex pattern, or the inverse of the regex match results

  4. lots of built-in/plug-in reporters

  5. easy hookup to coverage tools (if we decide to go that route)

  6. very well supported and embraced by the community. A lot of open source projects use it, so contributors would feel comfortable adding tests. Searching GitHub:

    • dependencies mocha extension:.json: 136,726
    • dependencies tape extension:.json: 10,808

Cons:

  1. have to add globals to .jshintrc because there is some magic that happens with both mocha and should
  2. need to rewrite existing tests for consistency

@missinglink @hkrishna @sevko opinions please

Expose bounding boxes in results where appropriate

When using a geocoder to jump to a particular location, bounding boxes are useful to appropriately set the viewport's zoom level. Assuming that Pelias stores them in ES, would it be possible to expose them in the API results?

Nominatim exposes boundingbox, e.g. http://nominatim.openstreetmap.org/search?q=Brooklyn&format=json:

{
  "place_id": "5988439137",
  "licence": "Data © OpenStreetMap contributors, ODbL 1.0. http://www.openstreetmap.org/copyright",
  "osm_type": "node",
  "osm_id": "158857828",
  "boundingbox": [
    "40.6501007080078",
    "40.6501045227051",
    "-73.9495849609375",
    "-73.949577331543"
  ],
  "lat": "40.6501038",
  "lon": "-73.9495823",
  "display_name": "Brooklyn, Downtown Brooklyn, Kings County, New York City, New York, United States of America",
  "class": "place",
  "type": "suburb",
  "importance": 0.79454442710904,
  "icon": "http://nominatim.openstreetmap.org/images/mapicons/poi_place_village.p.20.png"
}

In Pelias' case, it would probably make sense to expose the bounding box in addition to the point geometry. GeoJSON's bbox looks appropriate for this.
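
For illustration, a Pelias feature for Brooklyn could then carry a GeoJSON bbox member ([west, south, east, north]) alongside its point geometry (the values below are made up):

{
  "type": "Feature",
  "bbox": [-74.042, 40.566, -73.833, 40.739],
  "geometry": { "type": "Point", "coordinates": [-73.9496, 40.6501] },
  "properties": { "name": "Brooklyn", "layer": "locality" }
}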

standardize logging

We should discuss a uniform logging solution:

  1. what logger to use: we've basically already settled on using winston, so we'll stick with that unless there's a good reason not to.

  2. default logger configuration: what default settings do we want to instantiate our logger with? I've been using the following in openaddresses and dbclient:

    winston.remove( winston.transports.Console );
    winston.add( winston.transports.Console, {
      timestamp: true,
      colorize: true,
      level: 'verbose'
    });
    

    Is it worth spinning up a Pelias package containing our logger preferences, so that we can simply require( 'pelias-logger' ) and not duplicate them everywhere? (A rough sketch of such a package follows this list.)

  3. environment-specific logger configuration: we'll presumably want to be able to tweak the defaults from pelias-config.
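
A rough sketch of what such a shared package could export (hypothetical module, reusing the winston defaults above):

var winston = require('winston');

// apply our preferred defaults once, in one place
winston.remove(winston.transports.Console);
winston.add(winston.transports.Console, {
  timestamp: true,
  colorize: true,
  level: process.env.PELIAS_LOG_LEVEL || 'verbose'
});

module.exports = winston;

Each repo could then do var logger = require('pelias-logger'); instead of repeating the configuration.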

cc @dianashk , @hkrishna , @missinglink

investigate postal_code polygons

We should explore whether zip codes are valuable to index. They currently aren't accounted for in our Elasticsearch pipeline, and are mostly discarded by import scripts.

openstreetmap pipeline improvements

  • implement admin-lookup
  • remove dependency on having quattroshapes indexed before running
  • implement pelias-model
  • remove osm_types.js
  • general clean up
  • extract address data and store in ES
  • inconsistent ways count from subsequent import runs
  • fix stats.js module via unit tests, maybe extract to separate module
  • upgrade to the latest suggester-pipeline module
  • upgrade all dependencies
  • fix nodejs version inconsistencies to work with both 0.10 and 0.12
  • improve global module.exports, add tests.
  • configure code linting and precommit hook, lint everything.
  • make osm data mappers clearer and easier to write/modify
  • simplify features.js
  • add a unit test for every module
  • add detailed comments at the top of complex streams to explain their purpose
  • add doc.setAddress() and doc.getAddress() functions to pelias/model
  • add end-to-end system test
  • better documentation of pipeline process
  • merge experimental branch, publish and bump major version

[?] add regression tests to cover mappings, including the new address mapper

Hierarchy lookup should also provide a score

When we import geonames, osm nodes, ways etc., we look up which admin boundaries each point (lat/lon) belongs to and populate a document object. I think this lookup should return all admin info (admin0, admin1, neighborhood, locality, alpha3 etc) and a score (based on population, popularity, and category scores of individual admin types).

This way, when we search for 123 main st, "123 main st, new york, ny" gets a higher score than "123 main st, lynxville, wi".
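
Roughly, the lookup could return something like this for a point in Manhattan (field names and the scoring formula are illustrative only):

{
  admin0: 'United States',
  admin1: 'New York',
  admin2: 'New York County',
  locality: 'New York',
  neighborhood: 'Flatiron District',
  alpha3: 'USA',
  score: 0.92 // e.g. a weighted blend of population, popularity and per-admin-type weights
}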

Add vagrant info to readme

Add information about vagrant installs and a link to that repo in the README file in this repository.

document code with JSDoc-style block comments

I'm a fan of JSDoc-style documentation comments, since it's usually helpful to comprehensively document your module/function APIs (while the implementation itself should be self-documented, unless you're doing something quite clever). We won't be using the jsdoc utility to actually generate documentation, but I think it's a decent format to adhere to. I only typically use @param and @return tags, like in this verbose example:

/**
 * Import all OpenAddresses CSV files in a directory into Pelias elasticsearch.
 *
 * @param {string} dirPath The path to a directory. All *.csv files inside of
 *    it will be read and imported (they're assumed to contain OpenAddresses
 *    data).
 * @param {object} opts Options to configure the import. Supports the following
 *    keys:
 *
 *      deduplicate: Pass address object through `address-deduplicator-stream`
 *        to perform deduplication. See the documentation:
 *        https://github.com/pelias/address-deduplicator-stream
 *
 *      admin-values: Add admin values to each address object (since
 *        OpenAddresses doesn't contain any) using `hierarchy-lookup`. See the
 *        documentation: https://github.com/pelias/hierarchy-lookup
 */

I'll eschew it for short/obvious functions.

Thoughts from @pelias/contributors ?

problem with import

I'm trying to import data for Europe but fail at the first step. Postgres is 9.3 with PostGIS 2.1.1 (shp2pgsql is the same version). The encoding is UTF-8 and osm2pgsql data is already imported in that database. Any ideas what might cause the issue?

bundle exec rake quattroshapes:prepare_all
....
invalid command \N
invalid command \N
invalid command \.
ERROR:  syntax error at or near "AUS"
LINE 1: AUS Australia adm2 AU Australia 
        ^
ROLLBACK
rm /tmp/mapzen/qs_adm2*
wget http://static.quattroshapes.com/qs_localadmin.zip -P /tmp/mapzen

character encoding problem - special letters ( admin0 , admin1, ... )

I see some character encoding problems
test case: http://mapzen.com/pelias ; search: kamut

result :

Kamut (locality)
Kamut, Békés
admin0: Magyarország
admin1: Békés
locality: Kamut


But the Correct:

Kamut (locality)
Kamut, Békés
admin0: Magyarország
admin1: Békés
locality: Kamut


Kamut: ( http://en.wikipedia.org/wiki/Kamut,_Hungary )
Kamut is a village in Békés County, in the Southern Great Plain region of south-east Hungary.

Failure to create ES index?

Hi,

Thx for building and releasing this!

I'm having a bit of trouble creating the indices using bundle exec rake index:create.

I've cloned the master branch into a 12.04 lxc container with PostGIS 2 up and running. Also have ES 1.1.0 running on the default port.
I installed Ruby 2.1.1p76 (2014-02-24 revision 45161) [x86_64-linux] and changed the version of debugger from 1.6.5 to 1.6.6 (after which bundle succeeded).

Any ideas on where to look? I'm not very familiar with Ruby...

Thx!

Output from Ruby (and ES below):

vagrant@vagrant-base-precise-amd64:/vagrant/tmp/pelias$ bundle exec rake index:create --trace
** Invoke index:create (first_time)
** Execute index:create
rake aborted!
[500] {"error":"IndexCreationException[[pelias] failed to create index]; nested: FailedToResolveConfigException[Failed to resolve config path [/vagrant/tmp/pelias/config/synonyms.txt], tried file path [/vagrant/tmp/pelias/config/synonyms.txt], path file [/vagrant/tmp/elasticsearch-1.1.0/config/vagrant/tmp/pelias/config/synonyms.txt], and classpath]; ","status":500}
/home/vagrant/.rvm/gems/ruby-2.1.1/gems/elasticsearch-transport-1.0.1/lib/elasticsearch/transport/transport/base.rb:132:in `__raise_transport_error'
/home/vagrant/.rvm/gems/ruby-2.1.1/gems/elasticsearch-transport-1.0.1/lib/elasticsearch/transport/transport/base.rb:227:in `perform_request'
/home/vagrant/.rvm/gems/ruby-2.1.1/gems/elasticsearch-transport-1.0.1/lib/elasticsearch/transport/transport/http/faraday.rb:20:in `perform_request'
/home/vagrant/.rvm/gems/ruby-2.1.1/gems/elasticsearch-transport-1.0.1/lib/elasticsearch/transport/client.rb:102:in `perform_request'
/home/vagrant/.rvm/gems/ruby-2.1.1/gems/elasticsearch-api-1.0.1/lib/elasticsearch/api/namespace/common.rb:21:in `perform_request'
/home/vagrant/.rvm/gems/ruby-2.1.1/gems/elasticsearch-api-1.0.1/lib/elasticsearch/api/actions/indices/create.rb:77:in `create'
/vagrant/tmp/pelias/lib/pelias/tasks/index.rake:9:in `block (2 levels) in <top (required)>'
/home/vagrant/.rvm/gems/ruby-2.1.1/gems/rake-10.1.1/lib/rake/task.rb:236:in `call'
/home/vagrant/.rvm/gems/ruby-2.1.1/gems/rake-10.1.1/lib/rake/task.rb:236:in `block in execute'
/home/vagrant/.rvm/gems/ruby-2.1.1/gems/rake-10.1.1/lib/rake/task.rb:231:in `each'
/home/vagrant/.rvm/gems/ruby-2.1.1/gems/rake-10.1.1/lib/rake/task.rb:231:in `execute'
/home/vagrant/.rvm/gems/ruby-2.1.1/gems/rake-10.1.1/lib/rake/task.rb:175:in `block in invoke_with_call_chain'
/home/vagrant/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/monitor.rb:211:in `mon_synchronize'
/home/vagrant/.rvm/gems/ruby-2.1.1/gems/rake-10.1.1/lib/rake/task.rb:168:in `invoke_with_call_chain'
/home/vagrant/.rvm/gems/ruby-2.1.1/gems/rake-10.1.1/lib/rake/task.rb:161:in `invoke'
/home/vagrant/.rvm/gems/ruby-2.1.1/gems/rake-10.1.1/lib/rake/application.rb:149:in `invoke_task'
/home/vagrant/.rvm/gems/ruby-2.1.1/gems/rake-10.1.1/lib/rake/application.rb:106:in `block (2 levels) in top_level'
/home/vagrant/.rvm/gems/ruby-2.1.1/gems/rake-10.1.1/lib/rake/application.rb:106:in `each'
/home/vagrant/.rvm/gems/ruby-2.1.1/gems/rake-10.1.1/lib/rake/application.rb:106:in `block in top_level'
/home/vagrant/.rvm/gems/ruby-2.1.1/gems/rake-10.1.1/lib/rake/application.rb:115:in `run_with_threads'
/home/vagrant/.rvm/gems/ruby-2.1.1/gems/rake-10.1.1/lib/rake/application.rb:100:in `top_level'
/home/vagrant/.rvm/gems/ruby-2.1.1/gems/rake-10.1.1/lib/rake/application.rb:78:in `block in run'
/home/vagrant/.rvm/gems/ruby-2.1.1/gems/rake-10.1.1/lib/rake/application.rb:165:in `standard_exception_handling'
/home/vagrant/.rvm/gems/ruby-2.1.1/gems/rake-10.1.1/lib/rake/application.rb:75:in `run'
/home/vagrant/.rvm/gems/ruby-2.1.1/gems/rake-10.1.1/bin/rake:33:in `<top (required)>'
/home/vagrant/.rvm/gems/ruby-2.1.1/bin/rake:23:in `load'
/home/vagrant/.rvm/gems/ruby-2.1.1/bin/rake:23:in `<main>'
/home/vagrant/.rvm/gems/ruby-2.1.1/bin/ruby_executable_hooks:15:in `eval'
/home/vagrant/.rvm/gems/ruby-2.1.1/bin/ruby_executable_hooks:15:in `<main>'
Tasks: TOP => index:create

vagrant@vagrant-base-precise-amd64:/vagrant/tmp/pelias$ bundle exec rake index:create
rake aborted!
[500] {"error":"IndexCreationException[[pelias] failed to create index]; nested: FailedToResolveConfigException[Failed to resolve config path [/vagrant/tmp/pelias/config/synonyms.txt], tried file path [/vagrant/tmp/pelias/config/synonyms.txt], path file [/vagrant/tmp/elasticsearch-1.1.0/config/vagrant/tmp/pelias/config/synonyms.txt], and classpath]; ","status":500}
/home/vagrant/.rvm/gems/ruby-2.1.1/gems/elasticsearch-transport-1.0.1/lib/elasticsearch/transport/transport/base.rb:132:in `__raise_transport_error'
/home/vagrant/.rvm/gems/ruby-2.1.1/gems/elasticsearch-transport-1.0.1/lib/elasticsearch/transport/transport/base.rb:227:in `perform_request'
/home/vagrant/.rvm/gems/ruby-2.1.1/gems/elasticsearch-transport-1.0.1/lib/elasticsearch/transport/transport/http/faraday.rb:20:in `perform_request'
/home/vagrant/.rvm/gems/ruby-2.1.1/gems/elasticsearch-transport-1.0.1/lib/elasticsearch/transport/client.rb:102:in `perform_request'
/home/vagrant/.rvm/gems/ruby-2.1.1/gems/elasticsearch-api-1.0.1/lib/elasticsearch/api/namespace/common.rb:21:in `perform_request'
/home/vagrant/.rvm/gems/ruby-2.1.1/gems/elasticsearch-api-1.0.1/lib/elasticsearch/api/actions/indices/create.rb:77:in `create'
/vagrant/tmp/pelias/lib/pelias/tasks/index.rake:9:in `block (2 levels) in <top (required)>'
/home/vagrant/.rvm/gems/ruby-2.1.1/bin/ruby_executable_hooks:15:in `eval'
/home/vagrant/.rvm/gems/ruby-2.1.1/bin/ruby_executable_hooks:15:in `<main>'
Tasks: TOP => index:create (See full trace by running task with --trace)

Elasticsearch complains with the following traceback:

[2014-04-14 20:55:23,048][DEBUG][action.admin.indices.create] [Patsy Hellstrom] [pelias] failed to create
org.elasticsearch.indices.IndexCreationException: [pelias] failed to create index
    at org.elasticsearch.indices.InternalIndicesService.createIndex(InternalIndicesService.java:300)
    at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService$2.execute(MetaDataCreateIndexService.java:343)
    at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:308)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:134)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:701)
Caused by: org.elasticsearch.env.FailedToResolveConfigException: Failed to resolve config path [/vagrant/tmp/pelias/config/synonyms.txt], tried file path [/vagrant/tmp/pelias/config/synonyms.txt], path file [/vagrant/tmp/elasticsearch-1.1.0/config/vagrant/tmp/pelias/config/synonyms.txt], and classpath
    at org.elasticsearch.env.Environment.resolveConfig(Environment.java:207)
    at org.elasticsearch.index.analysis.Analysis.getReaderFromFile(Analysis.java:270)
    at org.elasticsearch.index.analysis.SynonymTokenFilterFactory.<init>(SynonymTokenFilterFactory.java:66)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:534)
    at org.elasticsearch.common.inject.DefaultConstructionProxyFactory$1.newInstance(DefaultConstructionProxyFactory.java:54)
    at org.elasticsearch.common.inject.ConstructorInjector.construct(ConstructorInjector.java:86)
    at org.elasticsearch.common.inject.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:98)
    at org.elasticsearch.common.inject.FactoryProxy.get(FactoryProxy.java:52)
    at org.elasticsearch.common.inject.InjectorImpl$5$1.call(InjectorImpl.java:781)
    at org.elasticsearch.common.inject.InjectorImpl.callInContext(InjectorImpl.java:837)
    at org.elasticsearch.common.inject.InjectorImpl$5.get(InjectorImpl.java:777)
    at org.elasticsearch.common.inject.assistedinject.FactoryProvider2.invoke(FactoryProvider2.java:221)
    at com.sun.proxy.$Proxy18.create(Unknown Source)
    at org.elasticsearch.index.analysis.AnalysisService.<init>(AnalysisService.java:151)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:534)
    at org.elasticsearch.common.inject.DefaultConstructionProxyFactory$1.newInstance(DefaultConstructionProxyFactory.java:54)
    at org.elasticsearch.common.inject.ConstructorInjector.construct(ConstructorInjector.java:86)
    at org.elasticsearch.common.inject.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:98)
    at org.elasticsearch.common.inject.ProviderToInternalFactoryAdapter$1.call(ProviderToInternalFactoryAdapter.java:45)
    at org.elasticsearch.common.inject.InjectorImpl.callInContext(InjectorImpl.java:837)
    at org.elasticsearch.common.inject.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:42)
    at org.elasticsearch.common.inject.Scopes$1$1.get(Scopes.java:57)
    at org.elasticsearch.common.inject.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:45)
    at org.elasticsearch.common.inject.SingleParameterInjector.inject(SingleParameterInjector.java:42)
    at org.elasticsearch.common.inject.SingleParameterInjector.getAll(SingleParameterInjector.java:66)
    at org.elasticsearch.common.inject.ConstructorInjector.construct(ConstructorInjector.java:85)
    at org.elasticsearch.common.inject.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:98)
    at org.elasticsearch.common.inject.FactoryProxy.get(FactoryProxy.java:52)
    at org.elasticsearch.common.inject.ProviderToInternalFactoryAdapter$1.call(ProviderToInternalFactoryAdapter.java:45)
    at org.elasticsearch.common.inject.InjectorImpl.callInContext(InjectorImpl.java:837)
    at org.elasticsearch.common.inject.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:42)
    at org.elasticsearch.common.inject.Scopes$1$1.get(Scopes.java:57)
    at org.elasticsearch.common.inject.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:45)
    at org.elasticsearch.common.inject.SingleParameterInjector.inject(SingleParameterInjector.java:42)
    at org.elasticsearch.common.inject.SingleParameterInjector.getAll(SingleParameterInjector.java:66)
    at org.elasticsearch.common.inject.ConstructorInjector.construct(ConstructorInjector.java:85)
    at org.elasticsearch.common.inject.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:98)
    at org.elasticsearch.common.inject.FactoryProxy.get(FactoryProxy.java:52)
    at org.elasticsearch.common.inject.ProviderToInternalFactoryAdapter$1.call(ProviderToInternalFactoryAdapter.java:45)
    at org.elasticsearch.common.inject.InjectorImpl.callInContext(InjectorImpl.java:837)
    at org.elasticsearch.common.inject.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:42)
    at org.elasticsearch.common.inject.Scopes$1$1.get(Scopes.java:57)
    at org.elasticsearch.common.inject.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:45)
    at org.elasticsearch.common.inject.InjectorBuilder$1.call(InjectorBuilder.java:200)
    at org.elasticsearch.common.inject.InjectorBuilder$1.call(InjectorBuilder.java:193)
    at org.elasticsearch.common.inject.InjectorImpl.callInContext(InjectorImpl.java:830)
    at org.elasticsearch.common.inject.InjectorBuilder.loadEagerSingletons(InjectorBuilder.java:193)
    at org.elasticsearch.common.inject.InjectorBuilder.injectDynamically(InjectorBuilder.java:175)
    at org.elasticsearch.common.inject.InjectorBuilder.build(InjectorBuilder.java:110)
    at org.elasticsearch.common.inject.InjectorImpl.createChildInjector(InjectorImpl.java:131)
    at org.elasticsearch.common.inject.ModulesBuilder.createChildInjector(ModulesBuilder.java:69)
    at org.elasticsearch.indices.InternalIndicesService.createIndex(InternalIndicesService.java:298)
    ... 6 more

Elasticsearch plugin [PL-GG11]

Clean up trello tickets for the work done by Francisco, consider rolling back master to previous stable version as HEAD is unstable.

related issues:

TokenStream expanded to 384 finite strings. Only <= 256 finite strings are supported

Hello! I'm trying to execute the following command:

sudo -u gis bundle exec rake quattroshapes:populate_locality ES_INLINE=1 --trace

But subsequently I have this error in the /var/log/elasticsearch/elasticsearch.log:

.....
java.lang.IllegalArgumentException: TokenStream expanded to 384 finite strings. Only <= 256 finite strings are supported
......

What is the reason for the error? =(

TokenStream error: "Only <= 256 finite strings are supported"

These non-fatal errors are recurring and may be fixed with a config/schema/plugin update. @hkrishna may be able to provide more info.

[2015-02-26 16:58:53,415][DEBUG][action.bulk              ] [Demolition Man] [pelias][0] failed to execute bulk item (index) index {[pelias][osmnode][1974379318], source[{"center_point":{"lat":51.7543469,"lon":-0.3363454},"name":{"default":"St Peter's St o/s St Albans Tandoori"},"type":"node","alpha3":"GBR","admin1":"Hertfordshire","locality":"St Albans","neighborhood":"Porters Wood","admin0":"United Kingdom","admin2":"Hertfordshire","suggest":{"input":["st peter's st o/s st albans tandoori"],"output":"osmnode:1974379318","weight":6}}]}
java.lang.IllegalArgumentException: TokenStream expanded to 336 finite strings. Only <= 256 finite strings are supported
    at org.elasticsearch.search.suggest.completion.CompletionTokenStream.incrementToken(CompletionTokenStream.java:66)
    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:618)
    at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:359)
    at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:318)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:239)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:457)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1511)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1246)
    at org.elasticsearch.index.engine.internal.InternalEngine.innerIndex(InternalEngine.java:594)
    at org.elasticsearch.index.engine.internal.InternalEngine.index(InternalEngine.java:522)
    at org.elasticsearch.index.shard.service.InternalIndexShard.index(InternalIndexShard.java:425)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:439)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:150)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:512)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:419)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Address de-duper should be part of the import pipeline

When we import any geospatial data into the Pelias index, it should undergo the following:

  • de-duper: to make sure we don't have two points with the same name from one or more sources.

Ideally, a de-duper instance should be kept alive throughout the import process across different sources (geonames, osm, quattroshapes etc) so that it catches duplicate names/points before adding them to the ES index.
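
A minimal sketch of what such a long-lived de-duper could look like as a Node.js transform stream (hypothetical; the key function below uses exact name plus rounded coordinates, while a real de-duper would need fuzzier matching):

var Transform = require('stream').Transform;

// keep one instance of this stream alive for the whole import run, across sources
function createDeduper() {
  var seen = new Set();
  return new Transform({
    objectMode: true,
    transform: function (record, enc, next) {
      // assumes records with name.default and center_point fields
      var key = [
        record.name.default.toLowerCase(),
        record.center_point.lat.toFixed(4),
        record.center_point.lon.toFixed(4)
      ].join('|');
      if (!seen.has(key)) {
        seen.add(key);
        this.push(record);
      }
      next();
    }
  });
}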

Why Ruby?

If you want this to be an easy install for people, then I don't believe Ruby is going to help.
Just look at the discourse.org project and all the troubles they have trying to roll out an easy-to-install server.

Is there any way this could use maybe Node.js or Mono?

Street fallback

In cases where no street numbers exist for a certain street, or there are few numbers on that street, it would be ideal to simply return the name of the street with a centroid of the polyline.

An example would be if for 'Main Street' we only had numbers 1, 2 and 42. A user should still be able to type 'Main Street' and get the central point for that street, while also being able to search for "1 Main Street" etc.

Street segments may need to be re-assembled to accurately compute the centroid, or alternatively we could try to import the roads as a GeoJSON polyline type.
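
As a rough illustration, the fallback centroid could be approximated by walking the merged polyline to its halfway point (simplistic sketch; real street segments would need proper merging and a projected distance measure):

// coords is an array of [lon, lat] pairs for the merged street polyline
function streetMidpoint(coords) {
  function dist(a, b) {
    return Math.hypot(b[0] - a[0], b[1] - a[1]); // naive planar distance
  }

  var total = 0;
  for (var i = 1; i < coords.length; i++) total += dist(coords[i - 1], coords[i]);

  var remaining = total / 2;
  for (var j = 1; j < coords.length; j++) {
    var seg = dist(coords[j - 1], coords[j]);
    if (remaining <= seg) {
      var t = seg === 0 ? 0 : remaining / seg;
      return [
        coords[j - 1][0] + t * (coords[j][0] - coords[j - 1][0]),
        coords[j - 1][1] + t * (coords[j][1] - coords[j - 1][1])
      ];
    }
    remaining -= seg;
  }
  return coords[coords.length - 1];
}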

reported by: @randymeech

Autocorrect query text [PL-PG03]

If a user mistypes a known word in the dictionary, it should be autocorrected.
Also, search with a Levenshtein distance threshold for names.

Needs to work across all endpoints.
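
One way to approximate this in Elasticsearch is a fuzzy match on the name field (a sketch only, not how Pelias currently builds its queries):

// tolerate small misspellings via a bounded Levenshtein edit distance
var userText = 'Brookyln';
var body = {
  query: {
    match: {
      'name.default': {
        query: userText,
        fuzziness: 1 // maximum edit distance allowed per term
      }
    }
  }
};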

Advanced Admin area scoring - taking population and popularity into account [PL-CG12]

Consider various possible scoring systems:

  • Score admin areas by population
  • Score admin areas by search frequency/popularity

There has been a lot of work done around scoring, and it's crucial that we use population and popularity information correctly; this helps with coarse geocoding without a geo bias.
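
One possible shape for this in Elasticsearch is a function_score query that boosts text matches by a population field (illustrative only; the field name is an assumption):

// multiply the text-match score by log1p(population) so big cities rank higher
var body = {
  query: {
    function_score: {
      query: { match: { 'name.default': 'san francisco' } },
      functions: [
        {
          field_value_factor: {
            field: 'population', // assumed field on admin-area documents
            modifier: 'log1p',
            missing: 1
          }
        }
      ],
      boost_mode: 'multiply'
    }
  }
};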

related issues:

Documented Email

The "get in contact" email in the README.md is bouncing emails sent to it.

Potential optimizations

This looks like a very cool project! I heard about it on Twitter and took a look...very impressive! I just wanted to drop off a few potential optimizations

Feel free to ignore any/all of these, especially if you've already tried them out. :)

Heap Usage

It was mentioned on Twitter (and the README) that Pelias uses a bunch of heap. There are a few things you can do to reduce heap usage:

  • If the data is static after import, you could disable bloom filters on each index. The bloom filters are used to speed up indexing, but if the data is static, it just represents wasted heap space. Details about unloading are here (see the "tip").
  • Similarly, if the data is fairly static, you can probably reduce your primary shard count. Extra shards lying around will eat up heap space through Lucene overhead (term dictionaries, etc) and reduced inverted index compression. If you reduce the shard count to 40 or even 20 primary shards, you will save some memory. This could potentially increase your query throughput, but may also increase the query latency (since queries will be mostly CPU bound by the geo_* filters, decreasing primary shards decreases how many machines participate in each query)

Ingestion speed

  • The README says it takes three days to ingest the data? Is that all Elasticsearch or also other components? I would bump your bulk size to around 1000 docs. The dataset is 66m docs and 300gb, so each doc is probably around 4kb. Bumping the bulk size to 1000 docs will put you around 5-6mb per bulk, which is more optimal (a rough sketch follows this list).
  • Do you spin up multiple import processes/threads? This can drastically reduce ingestion time
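
A rough sketch of a ~1000-document bulk with the Node.js elasticsearch client (illustrative only; the real importers batch through pelias-dbclient, and the document shape here is made up):

var elasticsearch = require('elasticsearch');
var client = new elasticsearch.Client({ host: 'localhost:9200' });

// flush in batches of ~1000 docs (~5-6mb) instead of many small bulk requests
function flush(docs, callback) {
  var body = [];
  docs.forEach(function (doc) {
    body.push({ index: { _index: 'pelias', _type: doc.type, _id: doc.id } });
    body.push(doc.data);
  });
  client.bulk({ body: body }, callback);
}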

Query optimizations

The search looks to mostly be a demonstration, so you may not care about these, but a few easy wins:

  • You may consider enabling lat_lon for the geo_points. This indexes the lat/lon as individual fields, which enables the geo_* filters to execute ranges on the inverted index instead of field data. This is sometimes faster depending on the data involved.
  • The "closest" query can be restructured into a filtered query, which will potentially be faster. Something like:
{
    "query": {
        "filtered": {
           "filter": {
               "and": {
                  "filters": [
                     {"term": {"location.type": "search.type"}},
                     {"geo_distance": {}}
                  ]
               }
           }
        }
    }
}

This does a few things. First, it executes the Term as a filter instead of a query, which allows caching and should be faster. Second, it executes the Term filter before the expensive geo because of the And compound filter.

Unclear if this would be faster, however, since your usage of the top-level filter will do a good job limiting the number of documents that the geo will see. You might also try a hybrid approach like this:

{
    "query": {
        "filtered": {
           "filter": {
              "term": {"location.type": "search.type"}
           }
        }
    },
    "post_filter" : {
        "geo_distance: {}
    }
}

Basically identical to the existing query, except it uses a filter instead of a term query.
