
openvenues

Open information extraction project for indexing and normalizing real-world venue/POI information from across the Web. Can be used standalone to extract venues from individual websites, or on a full-fledged copy of the entire Internet using the Common Crawl.

Project layout

  • extract: the "easy way": extracts structured (or at least semi-structured) address and geo data from HTML markup. Supports schema.org microdata, RDFa Lite, hCard, geotags, HTML5 <address> elements, OpenGraph, and URL parameters from Google map embeds.
  • jobs: Amazon Elastic MapReduce jobs for extracting places from the Common Crawl (224TB or 3.6+ billion URLs available on S3 as of August 2014; new crawls are published periodically).
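
As an illustration of the last item in the extract list (URL parameters from Google map embeds), here is a minimal sketch using only the Python standard library. It is not the project's actual extractor: the names MapEmbedParser and extract_map_coordinates are hypothetical, and it assumes the embed URL carries its coordinates in a classic `ll=lat,lng` query parameter, which not every embed does.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs


class MapEmbedParser(HTMLParser):
    """Collects (lat, lon) pairs from iframe src URLs that look like Google map embeds.

    Hypothetical helper for illustration; assumes an `ll=lat,lng` query parameter.
    """

    def __init__(self):
        super().__init__()
        self.coordinates = []

    def handle_starttag(self, tag, attrs):
        if tag != "iframe":
            return
        src = dict(attrs).get("src") or ""
        parsed = urlparse(src)
        host = parsed.netloc.lower()
        # Loose check: a Google host with "maps" somewhere in the host or path.
        if "google." not in host or "maps" not in (host + parsed.path):
            return
        params = parse_qs(parsed.query)
        for value in params.get("ll", []):
            try:
                lat, lon = (float(part) for part in value.split(","))
            except ValueError:
                continue  # malformed or unexpected value, skip it
            self.coordinates.append((lat, lon))


def extract_map_coordinates(html):
    """Return a list of (lat, lon) tuples found in Google map embeds in `html`."""
    parser = MapEmbedParser()
    parser.feed(html)
    return parser.coordinates
```

The standard-library HTMLParser is used here only to keep the sketch dependency-free; the same attribute check could be applied to tags produced by any other parser.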

Notes

BeautifulSoup vs. lxml

The first version of the Common Crawl extraction job used lxml, a fast C library based on libxml2, for parsing. However, running that parser over billions of badly-encoded webpages revealed bugs in lxml/libxml2 related to reading from uninitialized memory at the C level (see https://bugs.launchpad.net/lxml/+bug/1240696). When triggered, the bug eats up all of the system's memory and crashes the box. It occurs non-deterministically, so it is hard to track down, but it will occur, on different documents, if the job runs for long enough. Until there's a fix, lxml won't be usable for this project.

BeautifulSoup is a forgiving pure-Python regex-based "parser" designed for working with "tag soup". It's up to 100x slower than lxml, so we currently use a high-recall (not necessarily high-precision) regex to filter out documents that definitely don't contain the keywords we're looking for before committing to a full parse. With this filter, the job still completes in a reasonable amount of time using 100 8-core machines.
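
The prefilter idea can be sketched roughly as follows. This is an assumption-laden illustration rather than the job's real filter: the keyword list in CANDIDATE_RE and the function names might_contain_venue and filter_candidates are made up for the example. The pattern is matched against raw bytes so that badly-encoded pages never need to be decoded just to be rejected.

```python
import re

# Hypothetical keyword list: cheap markers for the markup formats the extractor supports.
# High recall is the goal, so false positives are acceptable; they only cost a full parse.
CANDIDATE_RE = re.compile(
    rb"schema\.org|itemprop|vcard|<address|og:latitude|geo\.position|maps\.google",
    re.IGNORECASE,
)


def might_contain_venue(raw_html):
    """Return True if the raw page bytes contain any keyword worth a full parse."""
    return CANDIDATE_RE.search(raw_html) is not None


def filter_candidates(documents):
    """Yield only the (url, raw_html) pairs that pass the cheap regex check."""
    for url, raw_html in documents:
        if might_contain_venue(raw_html):
            yield url, raw_html
```

Only the documents that survive filter_candidates would then be handed to the slow parser, so the expensive step runs on a small fraction of the crawl.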

Coming up next:

  • Address extraction (find postal addresses in text)
  • Deduping and normalization of venue names, addresses and locations

openvenues's Projects

address_languages

Frequent n-grams in OSM addresses by language. Helpful when contributing abbreviations to libpostal

chain_stores

Frequent venue names in OSM. Used to construct the libpostal chains dictionary.

common_crawl

Simple Python MapReduce jobs for processing the Common Crawl plus command-line utilities

gopostal

Go (cgo) interface to libpostal for fast international address parsing/normalization

jpostal

Java/JNI bindings to libpostal for fast international street address parsing/normalization

libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.

lieu

Dedupe/batch geocode addresses and venues around the world with libpostal

node-postal

NodeJS bindings to libpostal for fast international address parsing/normalization

php-postal

PHP bindings to libpostal for fast international street address parsing/normalization

pypostal

Python bindings to libpostal for fast international address parsing/normalization

ruby_postal

Ruby bindings to libpostal for fast international address parsing/normalization

sparkey

Simple constant key/value storage library, for read-heavy systems with infrequent large bulk inserts.
