Coder Social home page Coder Social logo

big-sort's Introduction

Introduction

Sorting big files that do not fit in memory drives us to use exeternal sorting methods that some common and widely used utilities - such as Unix sort - have.

" External sorting is a term for a class of sorting algorithms that can handle massive amounts of data." (Source: Wikipedia)

So, basically, we are using a "divide and conquer" strategy, sorting small chunks of the input and storing them into the disk. After that, we can read the first N bytes of each chunk and merge them into the output file. A high level diagram of this strategy is available here. Example:

chunk-1: [a, b, c, k, z]
chunk-2: [d, e, f]

Having these two ordered chunks, we should now select, comparing them, what is the order to write them into the output file. In this case, viewing the algorithm in high-level, we should have something like this:

1. a < d? Write `a` and advance in `chunk-1`
2. b < d? Write `b` and advance in `chunk-1`
3. c < d? Write `c` and advance in `chunk-1`
4. k < d? Write `d` and advance in `chunk-2`
5. k < e? Write `e` and advance in `chunk-2`
6. k < f? Write `f` and advance in `chunk-2`
7. Write `k`and advance in `chunk-1`
8. Write `z`and advance in `chunk-1``
9. Close the output file

About the solution

The solution to this problem is written in Java8 and does not use any external library. The program has a Mainclass that takes two arguments: the first one is the input file and the second one is the file where the output should be written. The architeture of the solution is the following:

  • ExternalSort=> just a wrapper for the two steps (divide and conquer) of the algorithm;
  • FileSplitter=> splits a given input file, sorts its contents in N chunks and writes them into disk;
  • SortedFileBuffer=> utility data structure that has the list of chunks that are opened, a relationship between these buffers and the last line that was read from them and a set of buffers that were completely read;
  • FileMerger=> Using the above class, selects the entry to write to the output buffer, from the list of open and sorted buffers, based on alphabetic comparation.

Test

There is a file test.txtin the root directory of the project. It's a blacklist from somewhere on the Internet. It has some lines that are not sorted, for test and demonstration purposes.

NOTE: this implementation should not be used in production. There are, certainly, a some optimizations and corrections to do. This is just a proof-of-concept.

big-sort's People

Contributors

joaoqalves avatar

Stargazers

Gökçer Belgüsen avatar Jose Ignacio Acin Pozo avatar

Watchers

 avatar

Forkers

gokcerbelgusen

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.