openvenues / sparkey

This project forked from spotify/sparkey

Simple constant key/value storage library, for read-heavy systems with infrequent large bulk inserts.

License: Apache License 2.0

Makefile 0.44% M4 0.35% C 99.20%

sparkey's Introduction

openvenues

Open information extraction project for indexing and normalizing real-world venue/POI information from across the Web. Can be used standalone to extract venues from individual websites, or on a full-fledged copy of the entire Internet using the Common Crawl.

Project layout

  • extract: the "easy way": extract structured (or at least semi-structured) address and geo data from HTML markup. Supports schema.org microdata, RDFa Lite, hCard, geotags, HTML5 <address> elements, OpenGraph, and URL parameters from Google Maps embeds.
  • jobs: Amazon Elastic MapReduce jobs for extracting places from the Common Crawl (224 TB, or 3.6+ billion URLs, available on S3 as of August 2014; new crawls are published periodically).
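As a rough illustration of the last item in the extract list (URL parameters from map embeds), the sketch below pulls a coordinate pair out of an embed URL using only the Python standard library. It is not taken from this project's code: the function name `latlon_from_map_url` is hypothetical, and the assumption that the coordinates arrive in an `ll=lat,lon` query parameter is ours and may not cover every embed format.

```python
from urllib.parse import urlparse, parse_qs


def latlon_from_map_url(url):
    """Return (lat, lon) parsed from an 'll=lat,lon' query parameter, or None.

    Assumption: the embed URL carries its coordinates in a parameter
    named 'll'. Other embed formats are not handled by this sketch.
    """
    params = parse_qs(urlparse(url).query)
    values = params.get("ll")
    if not values:
        return None

    parts = values[0].split(",")
    if len(parts) != 2:
        return None

    try:
        lat, lon = float(parts[0]), float(parts[1])
    except ValueError:
        return None

    # Reject values outside the valid coordinate ranges (this also rejects NaN).
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return None
    return lat, lon
```

Returning `None` for anything malformed keeps the caller simple: on crawl-scale input, most URLs will not match, and a silent miss is cheaper than an exception.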

Notes

BeautifulSoup vs. lxml

The first version of the Common Crawl extraction job used lxml, a fast Python library built on the C library libxml2, for parsing. However, running that parser over billions of badly encoded web pages revealed bugs in lxml/libxml2 related to reading from uninitialized memory at the C level (see https://bugs.launchpad.net/lxml/+bug/1240696), which eat up all the system's memory and crash the box. The bug occurs non-deterministically, so it is hard to track down, but it will occur, on different documents, if the job runs for long enough. Until there's a fix, lxml won't be usable for this project.

BeautifulSoup is a forgiving pure-Python regex-based "parser" designed for working with "tag soup". It's up to 100x slower than lxml, so we currently use a high-recall (not necessarily high-precision) regex to filter out documents that definitely don't contain the keywords we're looking for before committing to a full parse. With this filter, the job still completes in a reasonable amount of time using 100 8-core machines.
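The filter-then-parse idea described above can be sketched as follows. This is a minimal illustration, not the job's actual code: the keyword list in `CANDIDATE_RE` is a guess at what such a filter might look for, and the standard library's `html.parser` stands in for BeautifulSoup so that the example has no third-party dependencies. The class and function names are hypothetical.

```python
import re
from html.parser import HTMLParser

# Hypothetical high-recall filter: matches any document that *might* contain
# venue markup. False positives are fine; they only cost one full parse.
CANDIDATE_RE = re.compile(
    rb"itemprop|itemtype|vcard|<address|og:latitude|geo\.position|maps\.google",
    re.IGNORECASE,
)


class AddressCollector(HTMLParser):
    """Stand-in for the expensive full parse: collects <address> element text."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # nesting depth inside <address> elements
        self.chunks = []      # text fragments of the current <address>
        self.addresses = []   # completed, whitespace-normalized addresses

    def handle_starttag(self, tag, attrs):
        if tag == "address":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "address" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                text = " ".join(" ".join(self.chunks).split())
                if text:
                    self.addresses.append(text)
                self.chunks = []

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_addresses(raw_html: bytes):
    """Run the cheap regex first; only parse documents that pass it."""
    if not CANDIDATE_RE.search(raw_html):
        return []  # definitely nothing of interest, so skip the slow parse
    parser = AddressCollector()
    parser.feed(raw_html.decode("utf-8", errors="replace"))
    parser.close()
    return parser.addresses
```

Running the regex on raw bytes avoids decoding documents that are going to be discarded anyway, which matters when most pages in a crawl contain no venue markup at all.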

Coming up next:

  • Address extraction (find postal addresses in text)
  • Deduping and normalization of venue names, addresses and locations

sparkey's People

Contributors

kirang89, noj, nresare, rohansingh, rschildmeijer, spkrka, thedrow, wbolster
