ocramz / heidi
heidi : tidy data in Haskell
License: Other
For the list representation this is trivial, not so much if the internals are a ConduitT or similar
Add a few useful date/time types from time
(https://hackage.haskell.org/package/time) , e.g.
A checklist for where to add things :
It seems essential to me that the library be able to join using multiple columns as the join key. I don't know if the underlying Trie makes that simpler.
The simplest version might simply add a column holding a product (maybe [v]
?) of the key-columns to each Frame, then join the new Frames on that new column, then remove that column in the result. That's how I'd do it from outside the library. And that doesn't seem horribly inefficient. But I'm imagining there's a better way?
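The composite-key idea described above can be sketched without touching the library internals. This is only an illustration under stated assumptions: rows are modelled as plain `Data.Map` values rather than heidi's actual `Row`, and `compositeKey` and `innerJoinOn` are hypothetical names, not part of the heidi API.

```haskell
import qualified Data.Map.Strict as M

-- | Stand-in for a data-frame row: a map from column name to value.
-- This is an assumption for the sketch, not heidi's real 'Row' type.
type Row k v = M.Map k v

-- | Collect the values of the key columns, in the given order.
-- Returns Nothing if any key column is missing from the row.
compositeKey :: Ord k => [k] -> Row k v -> Maybe [v]
compositeKey ks row = traverse (`M.lookup` row) ks

-- | Inner join on several key columns (hypothetical helper).
-- Rows lacking any of the key columns are dropped.
innerJoinOn :: (Ord k, Ord v) => [k] -> [Row k v] -> [Row k v] -> [Row k v]
innerJoinOn ks left right =
  [ M.union l r                          -- left-biased merge of matching rows
  | l <- left
  , Just key <- [compositeKey ks l]
  , r <- M.findWithDefault [] key index
  ]
  where
    -- Index the right-hand rows by their composite key, keeping input order.
    index = M.fromListWith (flip (++))
      [ (key, [r]) | r <- right, Just key <- [compositeKey ks r] ]
```

Because the product of the key columns is only used as a lookup key, no temporary column has to be added to or removed from the frames.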
I clicked the sponsor button for this repo and got the warning:
Some errors were encountered when parsing the FUNDING.yml file:
Some users provided are not enrolled in GitHub Sponsors. Apply to Sponsors.
Learn more about formatting FUNDING.yml.
https://github.com/tidyverse/tidyr
e.g. ascii:
+-------------+-----------------+
| Person | House |
+-------+-----+-------+---------+
| Name | Age | Color | Price |
+-------+-----+-------+---------+
| David | 63 | Green | $170000 |
| Ava | 34 | Blue | $115000 |
| Sonia | 12 | Green | $150000 |
+-------+-----+-------+---------+
https://hackage.haskell.org/package/colonnade-1.2.0.1/docs/Colonnade.html#g:10
GT has an overly restrictive bound on base : glguy/tries#19
Frames are not supposed to be constructed directly by the user (they should only be 'encode'd from data).
Currently we say 'hdr = mempty' in a few places for convenience (grep for all FIXMEs) but these should become a function of the inputs
For example, the header of a join of two Frames is the outer product of headers (?) -- look up schema computation in relational algebra/SQL
Blocks #11
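As a point of reference for the header question above: in relational algebra the schema of a natural join is the union of the two input schemas, with shared attributes required to agree. A minimal sketch of that rule follows, assuming a flat header modelled as a map from column name to a type tag; `joinHeader` is a hypothetical name and heidi's real `Header` is a nested structure, so this is not the library's implementation.

```haskell
import qualified Data.Map.Strict as M

-- | Flat header: column name -> type tag (an assumption for this sketch).
type FlatHeader k t = M.Map k t

-- | Header of a natural join: the union of both headers.
-- Returns Left with the first column whose type tags disagree.
joinHeader :: (Ord k, Eq t) => FlatHeader k t -> FlatHeader k t -> Either k (FlatHeader k t)
joinHeader h1 h2 =
  case M.keys (M.filter id (M.intersectionWith (/=) h1 h2)) of
    (k : _) -> Left k                 -- shared column with conflicting types
    []      -> Right (M.union h1 h2)  -- compatible: take the union
```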
In-memory typed data -> dataframe
i.e. after parsing validation etc.
text, int etc. lenses
We use microlens now for a more principled approach to getting/setting entries from dataframe rows.
There are two internal details of the library that must be reconciled and the UX must be figured out. How to manipulate list-valued indexing keys, with little boilerplate?
Row values which are keyed by lists (since the original values are flattened into a single trie, collecting record names depth-first)
encode :: (Foldable t, Heidi a) => t a -> Frame (Row [TC] VP)
generic-trie, i.e. either a primitive type or a list of such etc.)
https://hackage.haskell.org/package/heidi-0.0.0/docs/Heidi-Data-Frame-Algorithms-GenericTrie.html
innerJoin :: (Foldable t, Ord v, TrieKey k, Eq v, Eq k) => k -> k -> t (Row k v) -> t (Row k v) -> Frame (Row k v)
the latest generic-trie [1] is a more sustainable bet, so let's import it and get rid of its vendored source
e.g. as in https://elbersb.com/public/posts/tidylog100/
filtered <- filter(mtcars, cyl == 4)
#> filter: removed 21 rows (66%), 11 rows remaining
joined <- left_join(nycflights13::flights, nycflights13::weather,
by = c("year", "month", "day", "origin", "hour", "time_hour"))
#> left_join: added 9 columns (temp, dewp, humid, wind_dir, wind_speed, …)
#> > rows only in x 1,556
#> > rows only in y ( 6,737)
#> > matched rows 335,220
#> > =========
#> > rows total 336,776
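A tidylog-style report for filtering could look roughly like the following in Haskell. This is a sketch over plain lists; `filterLogged` is a hypothetical helper, not an existing heidi function, and the message format simply mirrors the R output quoted above.

```haskell
-- | Filter a list of rows and also return a tidylog-style summary line.
filterLogged :: (a -> Bool) -> [a] -> ([a], String)
filterLogged p xs = (kept, msg)
  where
    kept     = filter p xs
    total    = length xs
    nKept    = length kept
    nRemoved = total - nKept
    -- Percentage of removed rows, rounded to the nearest integer.
    pct :: Int
    pct | total == 0 = 0
        | otherwise  = round (100 * fromIntegral nRemoved / fromIntegral total :: Double)
    msg = "filter: removed " ++ show nRemoved ++ " rows (" ++ show pct
       ++ "%), " ++ show nKept ++ " rows remaining"
```

Returning the message alongside the result keeps the function pure; the caller decides whether to print it or collect it, for instance in a Writer.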
the Row type has a natural interpretation in cassava as well
(<>) = union
mempty = empty
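Spelled out as instances, the two equations above might read as follows. The `Row` newtype here wraps a `Data.Map` purely for illustration; heidi's own `Row` is trie-backed, so treat this as a sketch of the instance shape rather than the library's code.

```haskell
import qualified Data.Map.Strict as M

-- | Illustrative row type (an assumption; not heidi's actual definition).
newtype Row k v = Row (M.Map k v) deriving (Eq, Show)

-- | (<>) = union: left-biased, so the left row wins on duplicate keys.
instance Ord k => Semigroup (Row k v) where
  Row a <> Row b = Row (M.union a b)

-- | mempty = empty
instance Ord k => Monoid (Row k v) where
  mempty = Row M.empty
```

Left-biased union is associative and has the empty row as identity, so the Monoid laws hold.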
various functions for manipulating cat variables, we could copy the API : https://forcats.tidyverse.org/
header)
possible representation :
data Frame = Frame {
    frameRows :: [Row [TC] VP]
  , frameHeader :: Header [TC]
  }
currently, the information in Header and Val is not compatible:
λ> gflattenHM $ MkC1 42
fromList [([TC "C" "c1"],42)]
λ>
λ> header (Proxy @C)
HSum "C" (fromList [("MkC3",HLeaf "()"),("MkC1",HLeaf "Int"),("MkC2",HProd "A" (fromList [("MkA",HLeaf "Int")]))])