Databricks Code Assignment: External GroupBy

This project is developed with Eclipse.

GroupByMap.java: Processes the input. If all the input data fits in memory, it returns the result directly. Otherwise, it saves partial results to temp files.
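The in-memory-or-spill behaviour described above can be illustrated with a minimal sketch. It is not the repository's GroupByMap: the class name `SpillingGroupBySketch`, the `process` signature, and the tab-separated `key<TAB>value` temp-file layout are all assumptions made for illustration.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

// Hypothetical sketch of a spill-to-disk group-by; not the repository's GroupByMap.
class SpillingGroupBySketch {
    /** Temp files written so far; stays empty when the whole input fit in memory. */
    final List<Path> spillFiles = new ArrayList<>();

    /**
     * Groups (key, value) records. Returns the grouped map if the input never
     * exceeded chunkSize records, or Optional.empty() if partial results were
     * written to spillFiles instead.
     */
    Optional<Map<String, List<String>>> process(Iterator<Map.Entry<String, String>> input, int chunkSize) {
        if (chunkSize < 1) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        Map<String, List<String>> groups = new TreeMap<>();
        int buffered = 0;
        while (input.hasNext()) {
            Map.Entry<String, String> entry = input.next();
            groups.computeIfAbsent(entry.getKey(), k -> new ArrayList<>()).add(entry.getValue());
            buffered++;
            // The buffer is full and more input is waiting: flush it to disk.
            if (buffered == chunkSize && input.hasNext()) {
                spill(groups);
                groups.clear();
                buffered = 0;
            }
        }
        if (spillFiles.isEmpty()) {
            return Optional.of(groups);
        }
        spill(groups); // flush the final, partially filled chunk
        return Optional.empty();
    }

    /** Writes one chunk as "key<TAB>value" lines (assumed format), sorted by key. */
    private void spill(Map<String, List<String>> groups) {
        try {
            Path file = Files.createTempFile("groupby-part-", ".txt");
            try (BufferedWriter out = Files.newBufferedWriter(file)) {
                for (Map.Entry<String, List<String>> group : groups.entrySet()) {
                    for (String value : group.getValue()) {
                        out.write(group.getKey() + "\t" + value);
                        out.newLine();
                    }
                }
            }
            spillFiles.add(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Using a TreeMap keeps each spilled chunk sorted by key, which is one way to make a later merge step cheap. Whether the actual project sorts its partial results is not stated here.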

GroupByReduce.java: Similar to Reduce in MapReduce, it processes the partial results in the temp files and writes the final result to a temp file. A FileIterator over that file is then returned.
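One common way to combine partial results without loading them all into memory is a k-way merge. The sketch below shows that technique under stated assumptions: every partial file holds `key<TAB>value` lines already sorted by key, and the output uses one `key<TAB>value1<TAB>value2...` line per key. The class `MergeReduceSketch` and both file formats are hypothetical, and the repository's GroupByReduce may work differently.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical k-way merge of key-sorted partial files; not the repository's GroupByReduce.
class MergeReduceSketch {
    /** One open partial file plus the record it is currently positioned on. */
    private static final class Cursor {
        final BufferedReader reader;
        String key;
        String value;

        Cursor(BufferedReader reader) {
            this.reader = reader;
        }

        /** Reads the next "key<TAB>value" line; returns false at end of file. */
        boolean advance() throws IOException {
            String line = reader.readLine();
            if (line == null) {
                return false;
            }
            int tab = line.indexOf('\t');
            if (tab < 0) {
                throw new IllegalArgumentException("expected key<TAB>value, got: " + line);
            }
            key = line.substring(0, tab);
            value = line.substring(tab + 1);
            return true;
        }
    }

    /** Merges the partial files and returns the temp file holding the final groups. */
    static Path merge(List<Path> sortedParts) {
        // The heap always yields the cursor with the smallest current key.
        PriorityQueue<Cursor> heap = new PriorityQueue<>((x, y) -> x.key.compareTo(y.key));
        List<BufferedReader> readers = new ArrayList<>();
        try {
            Path output = Files.createTempFile("groupby-result-", ".txt");
            try (BufferedWriter out = Files.newBufferedWriter(output)) {
                for (Path part : sortedParts) {
                    BufferedReader reader = Files.newBufferedReader(part);
                    readers.add(reader);
                    Cursor cursor = new Cursor(reader);
                    if (cursor.advance()) {
                        heap.add(cursor);
                    }
                }
                String currentKey = null;
                while (!heap.isEmpty()) {
                    Cursor cursor = heap.poll();
                    if (!cursor.key.equals(currentKey)) {
                        if (currentKey != null) {
                            out.newLine(); // finish the previous group's line
                        }
                        currentKey = cursor.key;
                        out.write(currentKey);
                    }
                    out.write('\t');
                    out.write(cursor.value);
                    if (cursor.advance()) {
                        heap.add(cursor);
                    }
                }
                if (currentKey != null) {
                    out.newLine();
                }
            } finally {
                for (BufferedReader reader : readers) {
                    reader.close();
                }
            }
            return output;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because values are streamed straight to the output, memory use is bounded by the number of open files rather than by the size of any one group. The order of values within a key is not guaranteed when the same key appears in several partial files.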

FileIterator.java: Iterates over the records in a file, parses them, and returns them in the required format.
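A lazy, line-by-line iterator of this kind might look like the following sketch. The record layout (one `key<TAB>value1<TAB>value2...` line per group) and the class name `FileIteratorSketch` are assumptions; the format the real FileIterator parses is not documented here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

// Hypothetical sketch; the real FileIterator's record format may differ.
class FileIteratorSketch implements Iterator<Map.Entry<String, List<String>>>, AutoCloseable {
    private final BufferedReader reader;
    private String nextLine; // look-ahead line; null once the file is exhausted

    FileIteratorSketch(Path file) {
        try {
            reader = Files.newBufferedReader(file);
            nextLine = reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public boolean hasNext() {
        return nextLine != null;
    }

    /** Parses the current line into (key, values) and reads one line ahead. */
    @Override
    public Map.Entry<String, List<String>> next() {
        if (nextLine == null) {
            throw new NoSuchElementException();
        }
        String[] fields = nextLine.split("\t");
        try {
            nextLine = reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> values = Arrays.asList(fields).subList(1, fields.length);
        return Map.entry(fields[0], values);
    }

    @Override
    public void close() {
        try {
            reader.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Only one line is held in memory at a time, so the caller can walk a result file far larger than the heap. Wrapping the iterator in try-with-resources closes the underlying reader.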

ExternalGroupBy.java: The main class for processing input data. It requires two parameters:

  • chunkSize: the number of records that can be processed in memory
  • reduceSum: the number of GroupByReduce threads that can run in parallel
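To illustrate how a parameter like reduceSum can cap parallelism, here is a minimal sketch that runs one reduce task per partition on a fixed-size thread pool. The class `ParallelReduceSketch`, its method names, and the use of in-memory partitions (instead of temp files, for brevity) are assumptions; how ExternalGroupBy actually schedules its GroupByReduce threads is not shown in this README.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of bounding reduce parallelism; not the repository's ExternalGroupBy.
class ParallelReduceSketch {
    /** Groups every partition, running at most reduceSum tasks at the same time. */
    static Map<String, List<String>> reduceAll(List<List<Map.Entry<String, String>>> partitions, int reduceSum) {
        ExecutorService pool = Executors.newFixedThreadPool(reduceSum);
        try {
            List<Future<Map<String, List<String>>>> pending = new ArrayList<>();
            for (List<Map.Entry<String, String>> partition : partitions) {
                pending.add(pool.submit(() -> groupOne(partition)));
            }
            // Combine the per-partition results in submission order.
            Map<String, List<String>> merged = new TreeMap<>();
            for (Future<Map<String, List<String>>> future : pending) {
                future.get().forEach((key, values) ->
                        merged.computeIfAbsent(key, k -> new ArrayList<>()).addAll(values));
            }
            return merged;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while reducing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("reduce task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    /** The work done by a single reduce task: group one partition by key. */
    static Map<String, List<String>> groupOne(List<Map.Entry<String, String>> partition) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (Map.Entry<String, String> entry : partition) {
            groups.computeIfAbsent(entry.getKey(), k -> new ArrayList<>()).add(entry.getValue());
        }
        return groups;
    }
}
```

With a fixed pool, extra partitions simply queue until a thread frees up, so reduceSum bounds concurrent work regardless of how many partial results exist.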

Run Test:

Overall test:

javac ExternalGroupByTest.java
java ExternalGroupByTest

Overall test output:

Testing in-memory data processing:
Alice   [value 6, value 7, value 8]
Bob     [value 9, value 10, value 11, value 12]
Charlie [value 1, value 2, value 3, value 4, value 5]
David   [value 13, value 14, value 15, value 16, value 17, value 18]

Testing external data processing:
Total input records:    24360
Output file:    /var/folders/3y/0fbxjxf914q4dt23_zttp07cgr1sgp/T/2789571709755756762.txt
Total records:  145
Total data points in output:    24360

FileIterator test:

javac FileIteratorTest.java
java FileIteratorTest
