

Tablesaw

HELP A GUY OUT: If you use tablesaw, please shoot me an email and tell me what you're up to: [email protected]. I'm thinking about next steps and would love your input. Or, if you decide not to use tablesaw, please let me know why. Thanks!

Tablesaw is the shortest path to data science in Java. It includes a data-frame, an embedded column-store, and hundreds of methods to transform, summarize, or filter data. If you work with data in Java, it will probably save you time and effort.

Tablesaw also supports descriptive statistics, data visualization, and machine learning. And it scales: you can munge half a billion rows on a laptop and over 2 billion records on a server.

There are other, more elaborate platforms for data science in Java. They're designed to work with vast amounts of data, and they require a huge stack and considerable effort. All it takes to get started with Tablesaw is one Maven dependency:

<dependency>
    <groupId>com.github.lwhite1</groupId>
    <artifactId>tablesaw</artifactId>
    <version>0.7.6.4</version>
</dependency>

Documentation and support:

A 1.0 release is planned for year end.

Tablesaw features:

Data processing & transformation

  • Import data from RDBMS and CSV files, local or remote (http, S3, etc.)
  • Combine files
  • Add and remove columns
  • Sort, Group, Filter
  • Map/Reduce operations
  • Store tables in a fast, compressed columnar storage format

Statistics and Machine Learning

  • Descriptive stats: mean, min, max, median, sum, product, standard deviation, variance, percentiles, geometric mean, skewness, kurtosis, etc.
  • Regression: Least Squares
  • Classification: Logistic Regression, Linear Discriminant Analysis, Decision Trees, k-Nearest Neighbors, Random Forests
  • Clustering: k-Means, x-Means, g-Means
  • Association: Frequent Item Sets, Association Rule Mining
  • Feature engineering: Principal Components Analysis

Visualization

  • Scatter plots
  • Line plots
  • Vertical and Horizontal Bar charts
  • Histograms
  • Box plots
  • Quantile Plots
  • Pareto Charts

Here's an example where we use XChart to map the locations of tornadoes:

[Image: map of tornado locations]

You can see examples and read more about plotting in Tablesaw here: https://jtablesaw.wordpress.com/2016/07/30/new-plot-types-in-tablesaw/.

Current performance:

You can load a 500,000,000-row, 4-column CSV file (35 GB on disk) entirely into about 10 GB of memory. If it's in Tablesaw's .saw format, you can load it in 22 seconds. You can query that table in 1-2 ms: fast enough to use as a cache for a Web app.

BTW, those numbers were achieved on a laptop.
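
To put those figures in per-row terms, here is a small back-of-the-envelope calculation. It is only a sketch: it assumes 1 GB = 10^9 bytes and treats "about 10 GB" as exactly 10 GB, so the results are rough averages, not measurements. The class and method names are made up for illustration and are not part of Tablesaw.

```java
public class MemoryFootprint {

    /** Average number of bytes per row for a table of the given total size. */
    public static double bytesPerRow(long totalBytes, long rows) {
        return (double) totalBytes / rows;
    }

    public static void main(String[] args) {
        long rows = 500_000_000L;
        int columns = 4;
        long onDiskBytes = 35_000_000_000L;   // 35 GB CSV, assuming 1 GB = 10^9 bytes
        long inMemoryBytes = 10_000_000_000L; // "about 10 GB", taken as exact here

        double csvBytesPerRow = bytesPerRow(onDiskBytes, rows);   // 70.0
        double memBytesPerRow = bytesPerRow(inMemoryBytes, rows); // 20.0
        double memBytesPerCell = memBytesPerRow / columns;        // 5.0
        double ratio = (double) onDiskBytes / inMemoryBytes;      // 3.5

        System.out.println("CSV bytes per row:        " + csvBytesPerRow);
        System.out.println("In-memory bytes per row:  " + memBytesPerRow);
        System.out.println("In-memory bytes per cell: " + memBytesPerCell);
        System.out.println("Disk-to-memory ratio:     " + ratio);
    }
}
```

Under these assumptions the in-memory table averages roughly 5 bytes per cell, about 3.5 times smaller than the CSV text.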

Easy to Use is Easy to Say

The goal in this example is to identify the production shifts with the worst performance. These few lines demonstrate data import, column-wise operations (differenceInSeconds()), filters (isInQ2()), grouping and aggregating (median() and .by()), and top-n calculations (top(n)).

    Table ops = Table.createFromCsv("data/operations.csv");                             // load data
    DateTimeColumn start = ops.dateColumn("Date").atTime(ops.timeColumn("Start"));
    DateTimeColumn end = ops.dateColumn("Date").atTime(ops.timeColumn("End"));
    LongColumn duration = start.differenceInSeconds(end);                        // calc duration
    duration.setName("Duration");
    ops.addColumn(duration);
    
    Table filtered = ops.selectWhere(                                            // filter
          allOf
              (column("Date").isInQ2(),
              (column("SKU").startsWith("429")),
              (column("Operation").isEqualTo("Assembly"))));
   
    Table summary = filtered.median("Duration").by("Facility", "Shift");         // group medians
    FloatArrayList tops = summary.floatColumn("Median").top(5);                  // get "slowest"
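
For readers who want to see what the last two lines compute, here is a plain-JDK sketch of the same idea: group durations by facility and shift, take the median of each group, then pick the n groups with the largest medians. It does not use Tablesaw and is not how Tablesaw implements these operations. The class, the `Op` type, and the sample rows are all hypothetical, and it returns group labels where the snippet above returns median values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class ShiftMedians {

    /** One hypothetical row: where and when the operation ran, and how long it took. */
    public static final class Op {
        final String facility;
        final String shift;
        final long durationSeconds;

        public Op(String facility, String shift, long durationSeconds) {
            this.facility = facility;
            this.shift = shift;
            this.durationSeconds = durationSeconds;
        }
    }

    /** Median of a non-empty list; averages the two middle values when the size is even. */
    public static double median(List<Long> values) {
        List<Long> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        int n = sorted.size();
        int mid = n / 2;
        return (n % 2 == 1) ? sorted.get(mid) : (sorted.get(mid - 1) + sorted.get(mid)) / 2.0;
    }

    /** Groups durations by "facility/shift" and returns the median duration of each group. */
    public static Map<String, Double> medianByShift(List<Op> ops) {
        Map<String, List<Long>> groups = new TreeMap<>();
        for (Op op : ops) {
            groups.computeIfAbsent(op.facility + "/" + op.shift, k -> new ArrayList<>())
                  .add(op.durationSeconds);
        }
        Map<String, Double> medians = new TreeMap<>();
        groups.forEach((key, durations) -> medians.put(key, median(durations)));
        return medians;
    }

    /** Returns the keys of the n groups with the largest medians, slowest first. */
    public static List<String> slowest(Map<String, Double> medians, int n) {
        return medians.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up sample data, for illustration only.
        List<Op> ops = Arrays.asList(
                new Op("Austin", "Day", 100), new Op("Austin", "Day", 120),
                new Op("Austin", "Day", 140), new Op("Austin", "Night", 200),
                new Op("Austin", "Night", 220), new Op("Reno", "Day", 90),
                new Op("Reno", "Day", 300), new Op("Reno", "Day", 310));

        Map<String, Double> medians = medianByShift(ops);
        System.out.println(medians);             // {Austin/Day=120.0, Austin/Night=210.0, Reno/Day=300.0}
        System.out.println(slowest(medians, 2)); // [Reno/Day, Austin/Night]
    }
}
```

The median is used here, as in the example above, because a single unusually long operation shifts it far less than it would shift a mean.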

If you see something that can be improved, please let me know.
