Stack Exchange Parquet Conversion

The Stack Exchange Network periodically publishes an anonymized dump of all user-contributed content via The Internet Archive.

The dump consists of XML files that encode the user data. This project contains a Spark job that converts the data into Parquet files, which are easier to use for subsequent processing.
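To give a feel for the input, here is a minimal Python sketch of reading one dump file. It is not the Spark job from this project. It assumes that each XML file holds a flat list of `<row ... />` elements whose attributes carry the record's fields; the sample data, the attribute names and the `iter_rows` helper are illustrative only.

```python
import xml.etree.ElementTree as ET
from io import BytesIO

# Illustrative sample only: the layout and attribute names are assumptions,
# not taken from this project.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" Score="12" Title="Example question" />
  <row Id="2" PostTypeId="2" ParentId="1" Score="3" />
</posts>
"""

def iter_rows(source):
    """Yield the attributes of each <row> element as a dict, streaming the input."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row":
            yield dict(elem.attrib)
            elem.clear()  # release the memory held by the processed element

rows = list(iter_rows(BytesIO(SAMPLE)))
print(rows[0]["Id"], rows[1].get("Title"))
```

Note that every value arrives as a string and that attributes can be absent on some rows, so a conversion to a typed columnar format such as Parquet has to cast values and allow for missing fields.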

At present only the following sites are converted, but it is trivial to add more:

  • travel.stackexchange.com
  • diy.stackexchange.com
  • security.stackexchange.com
  • english.stackexchange.com
  • stackoverflow.com

(These are ordered from smallest to largest dataset.)

Preparing the Data

The Spark job assumes that you have obtained the complete dump, unpacked it, and uploaded the files to HDFS in the /stackexchange/ directory.

The individual XML files may be gzip-compressed or not. Compressed files are slower to convert because gzip is not a splittable format, so each file has to be read by a single task.
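If you would rather upload uncompressed files, a small local pre-processing step can do it. The following is a sketch in Python that checks for the gzip magic bytes and decompresses when needed; the helper names `is_gzipped` and `ensure_uncompressed` are hypothetical and not part of this project.

```python
import gzip
import shutil
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream

def is_gzipped(path):
    """Return True if the file starts with the gzip magic bytes."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC

def ensure_uncompressed(path):
    """Return the path of an uncompressed copy of `path`, decompressing if needed."""
    path = Path(path)
    if not is_gzipped(path):
        return path  # already plain, nothing to do
    if path.suffix == ".gz":
        target = path.with_suffix("")  # Posts.xml.gz -> Posts.xml
    else:
        target = path.with_name(path.name + ".raw")
    with gzip.open(path, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream, so large files do not fill memory
    return target
```

Checking the magic bytes rather than the file extension also catches compressed files that were renamed along the way.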

Building the Job

The job can be built using SBT:

% sbt assembly

This will build a JAR containing the Spark job: target/scala-2.10/stackexchange-parquet-assembly-0.1.jar

Running the Conversion

The conversion can be executed by submitting the job with an appropriate number of executors:

% spark-submit --num-executors 32 target/scala-2.10/stackexchange-parquet-assembly-0.1.jar

Depending on the size of your cluster, this can take anywhere from minutes to hours. As of June 2015, an AWS-based Hadoop cluster with 15 c4.4xlarge nodes completes the job in a few minutes when 32 executors are requested.
