Coder Social home page Coder Social logo

pebble's Introduction

Image of Yaktocat

Pebble is a project with the vision of building a key value store that allows the storage and retrieval of big data in main memory, instead of distributed systems on disk. Making simpler, faster and cheaper to access and process this data.

Pebble is based on the work made by Paolo Boldi and Sebastiano Vigna with WebGraph. Reference to their work can be found in this paper

Paolo Boldi and Sebastiano Vigna. The WebGraph framework I: Compression techniques. In Proc.of the Thirteenth International World Wide Web Conference (WWW 2004), pages 595โˆ’601, Manhattan, USA, 2004. ACM Press

and the Web Graph page.

The Pebble projects extends their algorithms to store data tables with an extended set of supported data types and expects to provide an implementation better suited for production systems.

The project is on its early stage of implementation and work remains to be done to achieve its vision.

Quick Start

1. Compile and build package

mvn package

2. If desired build jar containing all dependencies.

mvn assembly:single

This could take around 5 minutes.

Pebble API Description

The current implementation supports the compression and decompression of lists of positive integer. These lists can be of three types: strictly incremental lists, incremental lists and unsorted lists. For details regarding the library api, refer to the Api Documentation. On following sections the basics around encoding and decoding lists is explained.

Encoding Lists

The core pebble API exposes the compression methods through the OutputSuccinctStream class. This class exposes three main methods:

  • [writeStrictlyIncrementalList](//groupon.github.io/pebble/org/pebble/core/encoding/OutputSuccinctStream.html#writeStrictlyIncrementalList(it.unimi.dsi.fastutil.ints.IntList, int, int, org.pebble.core.encoding.ints.datastructures.IntReferenceListsStore)) used to compress strictly incremental lists, such as:

    {4, 6, 7, 8, 11, 12}
  • [writeIncrementalList](//groupon.github.io/pebble/org/pebble/core/encoding/OutputSuccinctStream.html#writeIncrementalList(it.unimi.dsi.fastutil.ints.IntList, int, int, org.pebble.core.encoding.ints.datastructures.IntReferenceListsStore)) used to compress incremental lists, such as:

    {4, 6, 6, 6, 6, 7, 8, 8, 8, 11, 11, 12, 12, 12, 12, 12}
  • [writeList](//groupon.github.io/pebble/org/pebble/core/encoding/OutputSuccinctStream.html#writeList(it.unimi.dsi.fastutil.ints.IntList, int, int, org.pebble.core.encoding.ints.datastructures.IntReferenceListsStore)) used to compress lists with none order restrictions, such as:

    {6, 12, 11, 8, 12, 8, 6, 12, 8, 12, 6, 11, 6, 4, 12, 7}

These three methods receives four arguments: list, index, bits size and a reference lists store, each explained below. These methods returns the numbers of bits used to encode the input list.

List

This is the list to be encoded. The lists needs to be stored in a an instance of IntList class . For example:

IntList list = new IntArrayList(new int[] {4, 6, 7, 8, 11, 12});

WARNING the encoding process it might modify the input list. So in case the lists needs to be kept unchanged, a copy of the input list should be passed to the encoding function.

Index

Index of the input list. This is a correlative number of the passed list. The first list, should be passed with index 0, second list should be passed with index 1 and so on...

Bits Size

Number of bits required for the binary representation of the expected biggest value on the list. Is important to highlight that this is not the maximum number of the input list, instead correspond to the expected maximum number of the whole lists collection to be encoded. In case the values can be any integer, the value should be 31 given that only positive numbers are supported.

Reference Lists Store

Collection of previously compressed lists. This collection is used internally for Pebble's compression algorithm, to store and find reference lists candidates for the input list. To obtain an instance of IntReferenceListsStore for parameters needs to be passed:

  • size - maximum numbers of lists to be stored. When the number of lists stored exceeds size. Oldest stored lists will be replaced by newer lists.
  • maxRecursiveReferences - maximum number of allowed recursive references. This limits the maximum number of recursive references a list can have in order to be used as a reference lists. As this parameters gets bigger, the likelihood of finding better reference candidates in terms of compression increases, but with the trade of of reading speed when decoding lists.
  • minListSize - minimum required size of a list to be stored on the reference lists collection.
  • referenceListIndex - index is used to find the best reference list candidate. This must be an instance of any class which implements the IntReferenceListsIndex interface. The pebble core library provides the class InvertedListIntReferenceListsIndex as an implementing class of this interface. A reference lists store can be instantiated as:
final int size = 10000;
final int maxRecursiveReferences = 3;
final int minListSize = 1;
final IntReferenceListsIndex referenceListsIndex = new InvertedListIntReferenceListsIndex();
final IntReferenceListsStore referenceListsStore = new IntReferenceListsStore(
    size,
    maxRecursiveReferences,
    minListSize,
    referenceListsIndex
);

Finally to encode the example set of lists {4, 6, 7, 8, 11, 12} and {4, 5, 6, 7, 8, 9, 10, 11, 12} a possible code would be:

IntList list1 = new IntArrayList(new int[] {4, 6, 7, 8, 11, 12});
IntList list2 = new IntArrayList(new int[] {4, 5, 6, 7, 8, 9, 10, 11, 12});
OutputStream outputStream = new FileOutputStream("example.pz");
final OutputSuccinctStream outputSuccinctStream = new OutputSuccinctStream(outputStream);
final long[] offsets = new long[2];
int offset = 0;
offset += outputSuccinctStream.writeStrictlyIncrementalList(list1, 0, 31, referenceListsStore);
offsets[1] = offset;
offset += outputSuccinctStream.writeStrictlyIncrementalList(list2, 1, 31, referenceListsStore);
outputSuccinctStream.close();

At the end of this code the offset value will contains the total number of bits used to compress the collection.

Decoding Lists

The core pebble API exposes three types of iterators used to iterate over the compressed representation of the three types of supported lists.

Each of these tree iterators has a build method that returns an instance of the iterator starting on the beginning of the compressed list. This method receives four arguments: index, bits size and a bytes store, each explained below.

Index

Index of the input list to be retrieved.

Bits Size

Number of bits required for the binary representation of the expected biggest value on the compressed collection of lists.

Bytes Store

This must be an instance of any class which implements the PebbleBytesStore interface. This abstracts the specifics on how the compressed data is stored and retrieved, adding flexibility in the way the compressed data is handled. The pebble core library provides a basic class that implements this interface BytesArrayPebbleBytesStore as an implementing class of this interface. This class is a simple wrapper of a single byte array.

Finally to decode the first list used on the encoding example on Encoding section, a possible code would be:

byte[] data = Files.readAllBytes(Paths.get("example.pz"));
PebbleBytesStore bytesStore = new BytesArrayPebbleBytesStore(data, offsets);
IntIterator iterator = StrictlyIncrementalListIterator.build(0, 31, bytesStore);
while (iterator.hasNext()) {
    System.out.println(iterator.nextInt());
}

TODO

  • Add support for lists of long type.
  • Add support for storing more than integer values, such as: uuids, timestamps, finite text, free text and real numbers.
  • Explore the usage of Trove library instead of FastUtils WebGraph library.
  • Add support for storing complex data types. Data that is build on top primitive data types.
  • Implement java.util.Map interface.
  • Explore hbase memory cache.
  • Explore hadoop files storage application.
  • Use same compressed list as the reference list buffer. This will be helpful as well, to update the store on real time.

pebble's People

Contributors

rzilleruelogroupon avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.