This project forked from gitter-badger/grimlock

License: Apache License 2.0

Grimlock

Overview

Grimlock is a library for performing data-science and machine learning related data preparation, aggregation, manipulation and querying tasks. It can be used for such tasks as:

  • Normalisation/Standardisation/Bucketing of numeric variables;
  • Binarisation of categorical variables;
  • Creating indicator variables;
  • Computing statistics in all dimensions;
  • Generating data for a variety of machine learning tools;
  • Partitioning and sampling of data;
  • Derived features;
  • Text analysis (tf-idf/LDA);
  • Computing pairwise distances.

Grimlock has default implementations for many of the above tasks. It also has a number of useful properties:

  • Supports a wide variety of variable types;
  • Is easily extensible;
  • Can operate in multiple dimensions (currently up to 5);
  • Supports heterogeneous data;
  • Can be used in the Scalding REPL (with a simple symlink);
  • Supports basic as well as structured data types.

Concepts

Data Structures

The basic data structure in Grimlock is an N-dimensional sparse Matrix (N=1..5). Each cell in the matrix consists of a Position and Content tuple.

         Matrix
           ^ 1
           |
           | M
  (Position, Content)

The position is, essentially, a list of N coordinates (Value). The content consists of a Schema together with a Value. The value contains the actual value of the cell, while the schema defines what type of variable is in the cell, and what its legal values are.

   Position              Content
       ^ 1                  ^ 1
       |                    |
       | N           +------+------+
     Value           | 1           | 1
                  Schema         Value

Lastly, the Codex singleton objects can be used to parse and write the basic data types used in both coordinates and values.

  Schema       Value
     ^ 1         ^ 1
     |           |
     | 1         | 1
   Codex       Codex
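
The relationships in the diagrams above can be sketched in plain Scala. This is a simplified, hypothetical model for illustration only: the class names mirror the concepts in the text, but the definitions are assumptions and are not Grimlock's actual classes (for instance, the Codex layer is left out).

```scala
// Hypothetical, simplified model of the concepts above; NOT Grimlock's real classes.
object DataModelSketch {
  // A value is the payload of a coordinate or of a cell.
  sealed trait Value
  case class StringValue(value: String) extends Value
  case class LongValue(value: Long) extends Value

  // A schema records what kind of variable a cell holds.
  sealed trait Schema
  case object Nominal extends Schema
  case object Continuous extends Schema

  // A position is a list of N coordinates; a content pairs a schema with a value.
  case class Position(coordinates: List[Value])
  case class Content(schema: Schema, value: Value)

  // A sparse matrix stores only the cells that are present.
  type Matrix = Map[Position, Content]

  val example: Matrix = Map(
    Position(List(StringValue("iid:0064402"), StringValue("fid:B"))) ->
      Content(Nominal, StringValue("H")),
    Position(List(StringValue("iid:0064402"), StringValue("fid:E"))) ->
      Content(Continuous, LongValue(219L))
  )

  def main(args: Array[String]): Unit =
    example.foreach { case (pos, con) => println(s"$pos -> $con") }
}
```

Cells that are absent from the map are simply missing, which is what makes the structure sparse.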

Working with Dimensions

Grimlock supports performing operations along all directions of the matrix. This is realised through a Slice. There are two realisations of Slice: Along and Over. Both are constructed with a single Dimension, but differ in how the dimension is interpreted. When using Over, all data in the matrix is grouped by the dimension and operations, such as aggregation, are applied to the resulting groups. When using Along, the data is grouped by all dimensions except the dimension used when constructing the Slice. The differences between Over and Along are graphically presented below for a 3 dimensional matrix. Note that in 2 dimensions, Along and Over are each other's inverse.

      Over(Second)       Along(Third)

     +----+------+      +-----------+
    /    /|     /|     /     _     /|
   /    / |    / |    /    /|_|   / |
  +----+------+  |   +-----------+  |
  |    |  |   |  |   |   /_/ /   |  |
  |    |  +   |  +   |   |_|/    |  +
  |    | /    | /    |           | /
  |    |/     |/     |           |/
  +----+------+      +----+------+
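
The grouping behaviour of the two slices can be imitated with ordinary Scala collections. The sketch below does not use the Grimlock API; the Cell class, the sample data and the function names are all assumptions made for illustration, and summation stands in for an arbitrary aggregation.

```scala
// Plain-Scala illustration of the grouping described above; no Grimlock API is used.
object SliceSketch {
  // A cell of a 3-dimensional sparse matrix: three coordinates and a value.
  case class Cell(first: String, second: String, third: String, value: Long)

  val cells: List[Cell] = List(
    Cell("a", "x", "p", 1L),
    Cell("a", "y", "p", 2L),
    Cell("b", "x", "q", 3L),
    Cell("b", "x", "p", 4L)
  )

  // Like Over(Second): group by the second coordinate only, then aggregate.
  def overSecond(cs: List[Cell]): Map[String, Long] =
    cs.groupBy(_.second).map { case (key, group) => key -> group.map(_.value).sum }

  // Like Along(Third): group by every coordinate except the third, then aggregate.
  def alongThird(cs: List[Cell]): Map[(String, String), Long] =
    cs.groupBy(c => (c.first, c.second)).map { case (key, group) => key -> group.map(_.value).sum }

  def main(args: Array[String]): Unit = {
    println(overSecond(cells)) // one entry per distinct second coordinate
    println(alongThird(cells)) // one entry per distinct (first, second) pair
  }
}
```

With Over the result keeps one dimension; with Along it keeps every dimension but one, which is why the two coincide (with the roles swapped) for a 2 dimensional matrix.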

Data Format

The basic data format used by Grimlock (though others are supported) is a row-oriented, pipe-separated file in which each row is a single cell. The first N fields are the coordinates, optionally followed by the variable type and codex (again pipe-separated). If the variable type and codex are omitted from the data then they have to be provided by a Dictionary. The last field of each row is the value.

In the example below the first field is a coordinate identifying an instance, the second field is a coordinate identifying a feature. The third and fourth columns are the variable type and codex respectively. The last column has the actual value.

> head src/main/scala/au/com/cba/omnia/grimlock/examples/exampleInput.txt
iid:0064402|fid:B|nominal|string|H
iid:0064402|fid:E|continuous|long|219
iid:0064402|fid:H|nominal|string|C
iid:0066848|fid:A|continuous|long|371
iid:0066848|fid:B|nominal|string|H
iid:0066848|fid:C|continuous|long|259
iid:0066848|fid:D|nominal|string|F
iid:0066848|fid:E|continuous|long|830
iid:0066848|fid:F|nominal|string|G
iid:0066848|fid:H|nominal|string|B
...
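
To make the field layout concrete, here is a minimal sketch of splitting one such row in plain Scala. It is not Grimlock's own parser; the Row class and the parse function are hypothetical names, and the sketch only separates the fields without validating the value against the codex.

```scala
// Illustrative only: splits a self-describing row into its parts.
object RowParserSketch {
  // Hypothetical record for one parsed row; not a Grimlock type.
  case class Row(coordinates: List[String], variableType: String, codex: String, value: String)

  // Expects `dimensions` coordinates followed by variable type, codex and value.
  def parse(line: String, dimensions: Int): Option[Row] = {
    // A limit of -1 keeps trailing empty fields instead of dropping them.
    val fields = line.split("\\|", -1).toList
    if (fields.length != dimensions + 3) None
    else {
      val (coordinates, rest) = fields.splitAt(dimensions)
      rest match {
        case List(variableType, codex, value) => Some(Row(coordinates, variableType, codex, value))
        case _ => None
      }
    }
  }

  def main(args: Array[String]): Unit =
    println(parse("iid:0064402|fid:E|continuous|long|219", 2))
}
```

Rows with the wrong number of fields yield None rather than throwing, so malformed input can be filtered out.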

If the type and codex were omitted then the data would look as follows:

iid:0064402|fid:B|H
iid:0064402|fid:E|219
iid:0064402|fid:H|C
iid:0066848|fid:A|371
iid:0066848|fid:B|H
iid:0066848|fid:C|259
iid:0066848|fid:D|F
iid:0066848|fid:E|830
iid:0066848|fid:F|G
iid:0066848|fid:H|B
...

An external dictionary will then have to be provided to correctly decode and validate the values:

fid:A|long|continuous
fid:B|string|nominal
fid:C|long|continuous
fid:D|string|nominal
fid:E|long|continuous
fid:F|string|nominal
fid:H|string|nominal
...
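
The role of the dictionary can be sketched as a lookup from feature identifier to codex and variable type. The code below is a plain-Scala illustration under stated assumptions, not Grimlock's Dictionary implementation: all names are hypothetical, only the "long" and "string" codexes from the example are handled, and the field order follows the dictionary listing above.

```scala
// Illustrative only: decodes short-form rows with the help of a dictionary.
object DictionarySketch {
  // Hypothetical decoded-value types; not Grimlock classes.
  sealed trait Decoded
  case class LongDecoded(value: Long) extends Decoded
  case class StringDecoded(value: String) extends Decoded

  // Dictionary lines follow the layout shown above: feature|codex|type.
  def parseDictionary(lines: List[String]): Map[String, (String, String)] =
    lines.flatMap { line =>
      line.split("\\|", -1).toList match {
        case List(feature, codex, variableType) => Some(feature -> ((codex, variableType)))
        case _ => None
      }
    }.toMap

  // Decode a raw string according to its codex; None if it cannot be decoded.
  def decodeValue(codex: String, raw: String): Option[Decoded] = codex match {
    case "long" => scala.util.Try(raw.toLong).toOption.map(LongDecoded(_))
    case "string" => Some(StringDecoded(raw))
    case _ => None
  }

  // Decode the value of a 2-dimensional short-form row: instance|feature|value.
  def decode(line: String, dictionary: Map[String, (String, String)]): Option[Decoded] =
    line.split("\\|", -1).toList match {
      case List(_, feature, raw) =>
        dictionary.get(feature).flatMap { case (codex, _) => decodeValue(codex, raw) }
      case _ => None
    }

  def main(args: Array[String]): Unit = {
    val dictionary = parseDictionary(List("fid:A|long|continuous", "fid:B|string|nominal"))
    println(decode("iid:0066848|fid:A|371", dictionary))
    println(decode("iid:0066848|fid:B|H", dictionary))
  }
}
```

A feature that is missing from the dictionary, or a value that does not fit its codex, decodes to None.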

Usage

Setting up REPL

The examples below are executed in the Scalding REPL. To use Grimlock in the REPL follow these steps:

  1. Install Scalding; follow these instructions.
  2. Check out tag (0.11.2); git checkout 0.11.2.
  3. Update scala version; edit project/Build.scala and set scala version to 2.10.4.
  4. In scalding-repl/src/main/scala add a symlink to Grimlock's code folder.
  5. Start REPL; ./sbt scalding-repl/console.

After the last command, the console should appear as follows:

> ./sbt scalding-repl/console
[info] Loading project definition from <path to>/scalding/project
[info] Set current project to scalding (in build file:<path to>/scalding/)
[info] Formatting 2 Scala sources {file:<path to>/scalding/}scalding-repl(compile) ...
[info] Compiling 2 Scala sources to <path to>/scalding/scalding-repl/target/scala-2.10/classes...
[warn] there were 7 feature warning(s); re-run with -feature for details
[warn] one warning found
[info] Starting scala interpreter...
[info] 
import com.twitter.scalding._
import com.twitter.scalding.ReplImplicits._
import com.twitter.scalding.ReplImplicitContext._
Welcome to Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.6.0_65).
Type in expressions to have them evaluated.
Type :help for more information.

scala>

Note that, for readability, the REPL info output is suppressed from now on.

Getting started

When at the Scalding REPL console, the first step is to import Grimlock's functionality (be sure to press ctrl-D after the last import statement):

scala> :paste
// Entering paste mode (ctrl-D to finish)

import grimlock._
import grimlock.contents._
import grimlock.contents.Contents._
import grimlock.contents.metadata._
import grimlock.encoding._
import grimlock.Matrix._
import grimlock.Names._
import grimlock.partition._
import grimlock.partition.Partitions._
import grimlock.position._
import grimlock.position.Positions._
import grimlock.reduce._
import grimlock.transform._
import grimlock.Type._
import grimlock.Types._

The next step is to read in data (be sure to change to the correct path to the Grimlock repo):

scala> val data = read2D("<path to>/grimlock/src/main/scala/au/com/cba/omnia/grimlock/examples/exampleInput.txt")

The returned data is a 2 dimensional matrix. To investigate its content, Scalding's dump command can be used in the REPL; to write to disk, use the matrix persist API:

scala> data.dump
(Position2D(StringValue(iid:0064402),StringValue(fid:B)),Content(NominalSchema[StringCodex](),StringValue(H)))
(Position2D(StringValue(iid:0064402),StringValue(fid:E)),Content(ContinuousSchema[LongCodex](),LongValue(219)))
(Position2D(StringValue(iid:0064402),StringValue(fid:H)),Content(NominalSchema[StringCodex](),StringValue(C)))
(Position2D(StringValue(iid:0066848),StringValue(fid:A)),Content(ContinuousSchema[LongCodex](),LongValue(371)))
(Position2D(StringValue(iid:0066848),StringValue(fid:B)),Content(NominalSchema[StringCodex](),StringValue(H)))
(Position2D(StringValue(iid:0066848),StringValue(fid:C)),Content(ContinuousSchema[LongCodex](),LongValue(259)))
(Position2D(StringValue(iid:0066848),StringValue(fid:D)),Content(NominalSchema[StringCodex](),StringValue(F)))
(Position2D(StringValue(iid:0066848),StringValue(fid:E)),Content(ContinuousSchema[LongCodex](),LongValue(830)))
(Position2D(StringValue(iid:0066848),StringValue(fid:F)),Content(NominalSchema[StringCodex](),StringValue(G)))
(Position2D(StringValue(iid:0066848),StringValue(fid:H)),Content(NominalSchema[StringCodex](),StringValue(B)))
...

The following shows a number of basic operations (get number of rows, get type of features, perform simple query):

scala> data.size(First).dump
(Position2D(StringValue(First),StringValue(size)),Content(DiscreteSchema[LongCodex](),LongValue(9)))

scala> data.types(Over(Second)).dump
(Position1D(StringValue(fid:A)),Numerical)
(Position1D(StringValue(fid:B)),Categorical)
(Position1D(StringValue(fid:C)),Numerical)
(Position1D(StringValue(fid:D)),Categorical)
(Position1D(StringValue(fid:E)),Numerical)
(Position1D(StringValue(fid:F)),Categorical)
(Position1D(StringValue(fid:G)),Numerical)
(Position1D(StringValue(fid:H)),Categorical)

scala> data.which((pos: Position, con: Content) => (con.value gtr 995) || (con.value equ "F")).dump
Position2D(StringValue(iid:0066848),StringValue(fid:D))
Position2D(StringValue(iid:0216406),StringValue(fid:E))
Position2D(StringValue(iid:0444510),StringValue(fid:D))

Now for something a little more interesting. Let's compute the number of features for each instance and then compute the moments of the distribution of counts:

scala> val counts = data.reduce(Over(First), Count())

scala> counts.dump
(Position1D(StringValue(iid:0064402)),Content(DiscreteSchema[LongCodex](),LongValue(3)))
(Position1D(StringValue(iid:0066848)),Content(DiscreteSchema[LongCodex](),LongValue(7)))
(Position1D(StringValue(iid:0216406)),Content(DiscreteSchema[LongCodex](),LongValue(5)))
(Position1D(StringValue(iid:0221707)),Content(DiscreteSchema[LongCodex](),LongValue(4)))
(Position1D(StringValue(iid:0262443)),Content(DiscreteSchema[LongCodex](),LongValue(2)))
(Position1D(StringValue(iid:0364354)),Content(DiscreteSchema[LongCodex](),LongValue(5)))
(Position1D(StringValue(iid:0375226)),Content(DiscreteSchema[LongCodex](),LongValue(3)))
(Position1D(StringValue(iid:0444510)),Content(DiscreteSchema[LongCodex](),LongValue(5)))
(Position1D(StringValue(iid:1004305)),Content(DiscreteSchema[LongCodex](),LongValue(2)))

scala> counts.reduceAndExpand(Along(First), Moments()).dump
(Position1D(StringValue(mean)),Content(ContinuousSchema[DoubleCodex](),DoubleValue(4.0,DoubleCodex)))
(Position1D(StringValue(std)),Content(ContinuousSchema[DoubleCodex](),DoubleValue(1.5634719199411433,DoubleCodex)))
(Position1D(StringValue(skewness)),Content(ContinuousSchema[DoubleCodex](),DoubleValue(0.348873899490999,DoubleCodex)))
(Position1D(StringValue(kurtosis)),Content(ContinuousSchema[DoubleCodex](),DoubleValue(-0.8057851239669427,DoubleCodex)))
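
As a cross-check, the four figures above can be reproduced from the nine counts with plain Scala. The sketch below is not Grimlock code and makes no claim about how Moments() is implemented; it assumes population (divide by n) central moments and excess kurtosis, which is what the printed values are consistent with.

```scala
// Plain-Scala cross-check of the moments printed above; not Grimlock code.
object MomentsSketch {
  // Feature counts per instance, copied from the counts.dump output.
  val counts: List[Double] = List(3, 7, 5, 4, 2, 5, 3, 5, 2).map(_.toDouble)

  def mean(xs: List[Double]): Double = xs.sum / xs.length

  // k-th central moment, dividing by n (population form).
  def centralMoment(xs: List[Double], k: Int): Double = {
    val m = mean(xs)
    xs.map(x => math.pow(x - m, k)).sum / xs.length
  }

  def std(xs: List[Double]): Double = math.sqrt(centralMoment(xs, 2))

  def skewness(xs: List[Double]): Double =
    centralMoment(xs, 3) / math.pow(std(xs), 3)

  // Excess kurtosis: subtract 3 so that a normal distribution scores 0.
  def kurtosis(xs: List[Double]): Double =
    centralMoment(xs, 4) / math.pow(centralMoment(xs, 2), 2) - 3.0

  def main(args: Array[String]): Unit = {
    println(s"mean     = ${mean(counts)}")
    println(s"std      = ${std(counts)}")
    println(s"skewness = ${skewness(counts)}")
    println(s"kurtosis = ${kurtosis(counts)}")
  }
}
```

Worked by hand: the counts sum to 36, so the mean is 4; the squared deviations sum to 22, so the variance is 22/9 and the standard deviation is about 1.5635.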

For more examples see Demo.scala

Documentation

Scaladoc
