ocramz / heidi

heidi : tidy data in Haskell

License: Other

Haskell 100.00%
data-mining data-science dataframes dataframe-library dataframe data-analysis tidy-data generic-programming generics algebraic-data-types

heidi's Introduction

This library aims to bridge the gap between Haskell's precise but inflexible type discipline and the dynamic world of dataframes.

More specifically, heidi aims to make it easy to analyze collections of Haskell values; users encode their data (lists, maps and so on) into dataframes, and use functions provided by heidi for manipulation.

If this sounds interesting, read on!

Introduction

A "dataframe" is conceptually a table of data that can be manipulated with a computer program; it potentially contains numbers, text and anything else that can be rendered as text.

In scientific practice, a "tidy" dataframe is a specific way of arranging the data in which each row represents a distinct observation ("data point") and each column a "feature" (i.e. some observable aspect) of the data.

Data science is by now a well-established practice, and many software libraries offer excellent functionality for working with such dataframes: R has the tidyverse, Python has pandas, and so on.

What about Haskell?

The Frames [2] library offers rigorous type safety and good runtime performance, at the cost of some setup overhead. Heidi's main design goal is instead to have minimal setup overhead and to impose as little cognitive load as possible on data science practitioners, at the cost of some type safety.

Quickstart

The following snippet demonstrates the minimal setup necessary to use heidi:

{-# language DeriveGeneric #-}   -- (1)
module MyDataScienceTask where
import GHC.Generics    -- (2)

import Heidi

data Sales = Sales { item :: String, amount :: Int } deriving (Eq, Show, Generic)     -- (3)
instance Heidi Sales     -- (4)

All datatypes that are meant to be used within dataframes must be in the Heidi typeclass, which in turn requires a Generic instance.

The DeriveGeneric language extension (1) enables the compiler to automatically write the correct incantations (3), as long as the user also imports the GHC.Generics module (2) from base.

The automatic dataframe encoding mechanism is made possible by the empty Heidi instance (4).

It is also convenient to use DeriveAnyClass to avoid writing the empty typeclass instance:

{-# language DeriveGeneric, DeriveAnyClass #-}
import GHC.Generics (Generic)
import Heidi
data Foo = Foo Int String deriving (Generic, Heidi)
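
To give a feel for why a Generic instance is enough to drive an automatic encoding, here is a minimal sketch that recovers record field names from the generic representation of a type. It uses only GHC.Generics from base; the class GFieldNames and the function fieldNames are hypothetical names chosen for illustration and are not part of heidi, whose actual encoding is more involved.

```haskell
{-# LANGUAGE DeriveGeneric, FlexibleContexts, FlexibleInstances, TypeOperators #-}
import GHC.Generics

-- Hypothetical helper class (not heidi's API): walks the generic
-- representation of a single-constructor record and collects field names.
class GFieldNames f where
  gFieldNames :: f p -> [String]

-- Datatype metadata: look inside.
instance GFieldNames f => GFieldNames (M1 D c f) where
  gFieldNames (M1 x) = gFieldNames x

-- Constructor metadata: look inside.
instance GFieldNames f => GFieldNames (M1 C c f) where
  gFieldNames (M1 x) = gFieldNames x

-- Product of fields: concatenate the names found on both sides.
instance (GFieldNames f, GFieldNames g) => GFieldNames (f :*: g) where
  gFieldNames (x :*: y) = gFieldNames x ++ gFieldNames y

-- Selector metadata: this is where the field name lives.
instance Selector s => GFieldNames (M1 S s f) where
  gFieldNames m = [selName m]

fieldNames :: (Generic a, GFieldNames (Rep a)) => a -> [String]
fieldNames = gFieldNames . from

data Sales = Sales { item :: String, amount :: Int }
  deriving (Eq, Show, Generic)
```

In GHCi, fieldNames (Sales "pen" 3) evaluates to ["item","amount"]. The sketch deliberately covers only single-constructor records; sum types and positional fields (for which selName returns an empty string) would need further instances.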

Rationale

Out of the box, Haskell offers record types, e.g.

data Row = MkRow { column1 :: Int, column2 :: String } deriving (Eq, Show)

which is handy because a single declaration gives you a constructor MkRow and accessors column1 and column2, so a simple "data table" can be constructed as a list of such records.

One thing that the language doesn't natively support is lookup by accessor name. For example, column1 :: Row -> Int can only access a value of type Row, since without extensions a field name must be unique within its module (for a discussion of modern techniques to deal with this, see the Advanced section below).

In addition to lookup, many data tasks require relational operations across pairs of data tables; algorithmically, these require lookups both across rows and columns, and there's nothing in Haskell's implementation of records that supports this.

There are a number of additional tasks that are routine in data analysis such as plotting, rendering the dataset to various tabular formats (CSV, database ...), and this library aims to support those too with a convenient syntax.

Advanced

Haskell offers a number of advanced workarounds for manipulating types, such as generic traversals, lookups, etc. A brief list of keywords is given in the following, for those inclined to dive into the rabbit hole.

Row polymorphism

Elm, PureScript etc.

OverloadedRecordFields [1]

Row types

As you might know, the "row types" problem is well understood and has been explored in practice; discussing the various tradeoffs between approaches would be lengthy and quite technical (and your humble author is not too qualified to do full justice to the topic either).

In Haskell, the Frames [2] library and its related ecosystem stand out as a full-featured dataframe implementation that does not compromise on type safety.

Heidi instead offers generic transformations from the source datatypes to uni-typed values (conceptually, each row is a Map String T where data T = TInt Int | TChar Char etc.), a domain in which it's convenient to perform lookups and similar operations.
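
The uni-typed view described above can be sketched in a few lines. This is a conceptual illustration only: the TString constructor, the hand-written encodeSales function, and lookupInt are assumptions made for the example, and heidi's real row and value types differ (its rows are keyed by lists and produced generically).

```haskell
import qualified Data.Map.Strict as M

-- Hypothetical uni-typed value; the text only names TInt and TChar,
-- TString is added here for the sake of the example.
data T = TInt Int | TChar Char | TString String
  deriving (Eq, Show)

-- Conceptually, each row maps column names to uni-typed values.
type Row = M.Map String T

data Sales = Sales { item :: String, amount :: Int }
  deriving (Eq, Show)

-- Hand-written stand-in for what a generic encoder would derive.
encodeSales :: Sales -> Row
encodeSales s = M.fromList
  [ ("item",   TString (item s))
  , ("amount", TInt (amount s)) ]

-- Look up a column by name, succeeding only if it holds an Int.
lookupInt :: String -> Row -> Maybe Int
lookupInt k row = case M.lookup k row of
  Just (TInt n) -> Just n
  _             -> Nothing
```

The price of this representation is visible in lookupInt: a lookup can fail at runtime either because the column is missing or because it holds a value of another type, which is the type safety traded away for convenience.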

Exploring further: vinyl [3], heterogeneous lists, sums-of-products (generics-sop [4]) ...

References

[1] OverloadedRecordFields : https://downloads.haskell.org/ghc/latest/docs/html/users_guide/glasgow_exts.html#record-field-selector-polymorphism

[2] Frames : https://hackage.haskell.org/package/Frames

[3] vinyl : https://hackage.haskell.org/package/vinyl

[4] generics-sop : https://hackage.haskell.org/package/generics-sop

heidi's People

Contributors

ocramz

heidi's Issues

Couple Frame and Header

  • A Header is uniquely determined by the type of the input data
  • once data are encoded in a Frame, we compute a Header from the type (with header)
  • relational operations such as JOINs, and tidying operations, produce dataframes that have a different (larger or smaller) column set than that of either operand
  • this is why a new Header should be derived when producing the operation result, and stored in the resulting Frame

possible representation :

data Frame = Frame {
    frameRows :: [Row [TC] VP]
  , frameHeader :: Header [TC]
}

currently, the information in Header and Val is not compatible:

λ> gflattenHM $ MkC1 42
fromList [([TC "C" "c1"],42)]
λ> 
λ> header (Proxy @C)
HSum "C" (fromList [("MkC3",HLeaf "()"),("MkC1",HLeaf "Int"),("MkC2",HProd "A" (fromList [("MkA",HLeaf "Int")]))])

tidy data operations with lists as keys

Two internal details of the library must be reconciled, and the UX figured out: how can list-valued indexing keys be manipulated with little boilerplate?

  • On one hand, the generic encoding produces Row values which are keyed by lists (since the original values are flattened into a single trie, collecting record names depth-first)

encode :: (Foldable t, Heidi a) => t a -> Frame (Row [TC] VP)

  • On the other, the relational operations are completely polymorphic in the key type (as long as it's TrieKey from generic-trie, i.e. either a primitive type or a list of such etc.)

https://hackage.haskell.org/package/heidi-0.0.0/docs/Heidi-Data-Frame-Algorithms-GenericTrie.html

innerJoin :: (Foldable t, Ord v, TrieKey k, Eq v, Eq k) => k  -> k  -> t (Row k v)  -> t (Row k v)  -> Frame (Row k v)
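
To make the semantics of the signature above concrete, here is a simplified sketch of an inner join over rows represented as plain Data.Map values. It is an assumption-laden stand-in rather than heidi's implementation: it uses Ord-keyed maps instead of TrieKey-based tries, plain lists instead of Frame, and the name innerJoinSketch is hypothetical.

```haskell
import qualified Data.Map.Strict as M

-- Simplified row type: a map from column key to value.
type Row k v = M.Map k v

-- Pair up rows whose value under k1 (left table) equals the value
-- under k2 (right table). Rows lacking the key column are dropped.
innerJoinSketch :: (Ord k, Ord v)
                => k -> k -> [Row k v] -> [Row k v] -> [Row k v]
innerJoinSketch k1 k2 left right =
  [ M.union l r                      -- left-biased on clashing columns
  | l <- left
  , Just v <- [M.lookup k1 l]
  , r <- M.findWithDefault [] v index
  ]
  where
    -- Index the right table by join-key value, preserving row order.
    index = M.fromListWith (flip (++))
              [ (v, [r]) | r <- right, Just v <- [M.lookup k2 r] ]
```

Because the key type k is fully polymorphic here as well, the same function works unchanged when k is a list such as the [TC] keys produced by the generic encoding.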

add dplyr operations

  • mutate() adds new variables that are functions of existing variables
  • select() picks variables based on their names.
    • see the text, int etc. lenses
  • filter() picks cases based on their values.
  • summarise() reduces multiple values down to a single summary.
  • arrange() changes the ordering of the rows (not needed).
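
As a rough illustration of what the dplyr verbs listed above could look like, here is a sketch over rows modelled as maps from column name to Double. All names and types are assumptions for the example (filterRows is named so to avoid clashing with the Prelude's filter); none of them are heidi's API, which works on Frame values and lenses.

```haskell
import qualified Data.Map.Strict as M

-- Simplified row: column name to numeric value.
type Row = M.Map String Double

-- mutate: add (or overwrite) a column computed from the existing row.
mutate :: String -> (Row -> Double) -> [Row] -> [Row]
mutate name f = map (\r -> M.insert name (f r) r)

-- select: keep only the named columns.
select :: [String] -> [Row] -> [Row]
select names = map (M.filterWithKey (\k _ -> k `elem` names))

-- filterRows: keep the rows that satisfy a predicate.
filterRows :: (Row -> Bool) -> [Row] -> [Row]
filterRows = filter

-- summarise: reduce one column to a single value;
-- rows that lack the column are skipped.
summarise :: ([Double] -> Double) -> String -> [Row] -> Double
summarise agg name rows = agg [ v | r <- rows, Just v <- [M.lookup name r] ]
```

For instance, mutate "total" (\r -> M.findWithDefault 0 "price" r * M.findWithDefault 0 "qty" r) adds a derived column, after which summarise sum "total" reduces it to one number.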

date/time types

Fix usage of Semigroup/Monoid instance of Header

Frames are not supposed to be constructed directly by the user (they should only be 'encode'd from data).
Currently we say 'hdr = mempty' in a few places for convenience (grep for all FIXMEs), but the header should become a function of the inputs.

For example, the header of a join of two Frames is the outer product of headers (?) -- look up schema computation in relational algebra/SQL

Blocks #11

CSV output

the Row type has a natural interpretation in cassava as well

  • import cassava
  • add necessary instances to Row

Multi-key-column joins

It seems essential to me that the library be able to join using multiple columns as the join key. I don't know if the underlying Trie makes that simpler.

The simplest version might simply add a column holding a product (maybe [v]?) of the key-columns to each Frame, then join the new Frames on that new column, then remove that column in the result. That's how I'd do it from outside the library. And that doesn't seem horribly inefficient. But I'm imagining there's a better way?
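
A variation on the approach just described is to compute the composite key on the fly rather than storing it as a temporary column, which avoids the add-then-remove step. The sketch below assumes rows are plain Data.Map values and that both tables use the same key column names; multiKeyJoin and compositeKey are hypothetical names, not part of the library.

```haskell
import qualified Data.Map.Strict as M

type Row k v = M.Map k v

-- The composite key is the list of values under all key columns,
-- or Nothing if any of those columns is missing from the row.
compositeKey :: Ord k => [k] -> Row k v -> Maybe [v]
compositeKey ks row = traverse (`M.lookup` row) ks

-- Inner join on several key columns at once.
multiKeyJoin :: (Ord k, Ord v)
             => [k] -> [Row k v] -> [Row k v] -> [Row k v]
multiKeyJoin ks left right =
  [ M.union l r
  | l <- left
  , Just key <- [compositeKey ks l]
  , r <- M.findWithDefault [] key index
  ]
  where
    -- Index the right table by composite key, preserving row order.
    index = M.fromListWith (flip (++))
              [ (key, [r]) | r <- right, Just key <- [compositeKey ks r] ]
```

Since the composite key is itself a list of values, a trie keyed on lists could serve as the index in place of the Map used here.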

Traversal' rather than Decode

We now use microlens for a more principled approach to getting/setting entries from dataframe rows.

  • filterDecode should be ported to use a Traversal' rather than Decode

explain/log outcome of each operation

e.g. as in https://elbersb.com/public/posts/tidylog100/

filtered <- filter(mtcars, cyl == 4)
#> filter: removed 21 rows (66%), 11 rows remaining

joined <- left_join(nycflights13::flights, nycflights13::weather,
    by = c("year", "month", "day", "origin", "hour", "time_hour"))
#> left_join: added 9 columns (temp, dewp, humid, wind_dir, wind_speed, …)
#>            > rows only in x     1,556
#>            > rows only in y  (  6,737)
#>            > matched rows     335,220
#>            >                 =========
#>            > rows total       336,776
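
A Haskell counterpart of the filter message above could be sketched as a function that returns the kept rows together with a summary string, leaving it to the caller to print or collect it. The name filterLogged and the choice of returning a pair (rather than, say, using a Writer monad) are assumptions made for this example.

```haskell
-- Filter a list and describe the outcome, tidylog-style.
filterLogged :: (a -> Bool) -> [a] -> ([a], String)
filterLogged p xs = (kept, msg)
  where
    kept    = filter p xs
    total   = length xs
    nKept   = length kept
    removed = total - nKept
    -- Percentage of rows removed, rounded to the nearest integer.
    pct :: Int
    pct | total == 0 = 0
        | otherwise  = round (100 * fromIntegral removed / fromIntegral total :: Double)
    msg = "filter: removed " ++ show removed ++ " rows (" ++ show pct
          ++ "%), " ++ show nKept ++ " rows remaining"
```

Keeping the message as a pure value means that logging stays optional and the function remains easy to test.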

Categorical variables

Various functions for manipulating categorical variables; we could copy the API: https://forcats.tidyverse.org/

  • fct_reorder(): Reordering a factor by another variable.
  • fct_infreq(): Reordering a factor by the frequency of values.
  • fct_relevel(): Changing the order of a factor by hand.
  • fct_lump(): Collapsing the least/most frequent values of a factor into “other”.
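
Two of these operations can be approximated on plain lists, as a sketch of the behaviour rather than a proposal for the API. The names levelsByFrequency and lumpLevels are hypothetical, and tie-breaking (by the Ord instance of the level type) is an assumption that may not match forcats.

```haskell
import qualified Data.Map.Strict as M
import Data.List (sortBy)
import Data.Ord (comparing, Down(..))

-- Distinct levels ordered by decreasing frequency, in the spirit of
-- fct_infreq. Ties are broken by the Ord instance of the level type.
levelsByFrequency :: Ord a => [a] -> [a]
levelsByFrequency xs =
  map fst (sortBy (comparing (Down . snd)) (M.toList counts))
  where
    counts = M.fromListWith (+) [ (x, 1 :: Int) | x <- xs ]

-- Keep the n most frequent levels and replace every other value with
-- the given "other" level, in the spirit of fct_lump.
lumpLevels :: Ord a => Int -> a -> [a] -> [a]
lumpLevels n other xs = map (\x -> if x `elem` keep then x else other) xs
  where
    keep = take n (levelsByFrequency xs)
```

For example, levelsByFrequency "abcbcc" gives "cba", and lumpLevels 2 'z' "abcbcc" gives "zbcbcc".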

FUNDING.yml parsing errors

I clicked the sponsor button for this repo and got the warning:

Some errors were encountered when parsing the FUNDING.yml file:

Some users provided are not enrolled in GitHub Sponsors. Apply to Sponsors.
Learn more about formatting FUNDING.yml.
