
A high-performance Clojure data processing system

License: Eclipse Public License 1.0

Clojure 87.03% Shell 0.34% Java 12.21% Python 0.32% R 0.09%
clojure dataframe csv xlsx datascience machine-learning dataset etl-pipeline java

tech.ml.dataset's Introduction

tech.ml.dataset

TMD Logo

tech.ml.dataset (TMD) is a Clojure library for tabular data processing similar to Python's Pandas, or R's data.table. It supports pragmatic data-intensive work on the JVM by providing powerful abstractions that simplify implementing efficient solutions to real problems. Datasets shrink in memory through columnar storage and the use of primitive arrays, packed datetime types, and string tables.

Unlike their counterparts in Python or R, TMD datasets are functional (immutable), which makes them easier to reason about.

Installing

Installation instructions for your favorite build system (lein, deps.edn, etc.) can be found at Clojars, where the library is hosted:

Clojars Project

Verifying Installation

user> (require 'tech.v3.dataset)
nil
user> (->> (System/getProperties)
           (map (fn [[k v]] {:k k :v (apply str (take 40 (str v)))}))
           (tech.v3.dataset/->>dataset {:dataset-name "My Truncated System Properties"}))

My Truncated System Properties [53 2]:

|                         :k |                                       :v |
|----------------------------|------------------------------------------|
|                sun.desktop |                                    gnome |
|                awt.toolkit |                     sun.awt.X11.XToolkit |
| java.specification.version |                                       11 |
|            sun.cpu.isalist |                                          |
|           sun.jnu.encoding |                                    UTF-8 |
|            java.class.path | src:resources:target/classes:/home/harol |
|             java.vm.vendor |                                   Ubuntu |
|        sun.arch.data.model |                                       64 |
|            java.vendor.url |                      https://ubuntu.com/ |
|              user.timezone |                           America/Denver |
|                        ... |                                      ... |
|                    os.arch |                                    amd64 |
| java.vm.specification.name |       Java Virtual Machine Specification |
|        java.awt.printerjob |                   sun.print.PSPrinterJob |
|         sun.os.patch.level |                                  unknown |
|          java.library.path | /usr/java/packages/lib:/usr/lib/x86_64-l |
|               java.vm.info |                      mixed mode, sharing |
|                java.vendor |                                   Ubuntu |
|            java.vm.version |      11.0.17+8-post-Ubuntu-1ubuntu222.04 |
|    sun.io.unicode.encoding |                            UnicodeLittle |
|        apple.awt.UIElement |                                     true |
|         java.class.version |                                     55.0 |

📚 Documentation 📚

The best place to start is the "Getting Started" topic in the documentation: https://techascent.github.io/tech.ml.dataset/000-getting-started.html

The "Walkthrough" topic provides long-form examples of processing real data: https://techascent.github.io/tech.ml.dataset/100-walkthrough.html

The "Quick Reference" topic summarizes many of the most frequently used functions: https://techascent.github.io/tech.ml.dataset/200-quick-reference.html

The API docs document every available function: https://techascent.github.io/tech.ml.dataset/

The provided Java API (javadoc / with frames) and sample program (source) show how to use TMD from Java.

Questions / Community


Related Projects and Notes

License

Copyright © 2023 Complements of TechAscent, LLC

Distributed under the Eclipse Public License either version 1.0 or (at your option) any later version.


tech.ml.dataset's Issues

Variable length windows

The current rolling window system is a fixed-window system independent of the data.

An upgrade would be to allow windows to be dependent upon the data, so that you could have, e.g., a 2-second rolling-window average.
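For illustration, a plain-Clojure sketch of such a data-dependent rolling mean (not the TMD API; :timestamp-ms and :value are hypothetical column names, rows are assumed sorted by time, and it is O(n²) for simplicity):

(defn time-window-mean
  "Mean of :value over all rows within window-ms at or before each row's timestamp."
  [rows window-ms]
  (for [{t :timestamp-ms} rows
        :let [in-window (filter #(<= (- t window-ms) (:timestamp-ms %) t) rows)]]
    (/ (reduce + (map :value in-window)) (count in-window))))

;; (time-window-mean [{:timestamp-ms 0    :value 1.0}
;;                    {:timestamp-ms 1500 :value 3.0}
;;                    {:timestamp-ms 4000 :value 5.0}] 2000)
;; => (1.0 2.0 5.0)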

boxed value arrays can't be transferred to a dataset

(let [arr (into-array Double (repeat 10 1.0))]
  (tech.ml.dataset/name-values-seq->dataset  [["double" arr]]))

(let [arr (into-array Boolean (repeat 10 true))]
  (tech.ml.dataset/name-values-seq->dataset  [["boolean" arr]]))

tech.v2.tensor column parse/cast error?

Hi,
Thanks for this library! I'm using it for data science in Clojure and am experimenting with pymc3 and libpython-clj :)

I think I've stumbled upon a bug.
I read that :object type columns are implemented now. This works great when I use vectors as the values of a column; however, when I try to use a tech.v2.tensor as the values of a column, I get an error. Is something going wrong with parsing?

Reproducible example:

;; works
(ds/->dataset [{:a [0.4935 0.5552]} {:a [0.4935 0.5552]}])
;; => _unnamed [2 1]:
;; |              :a |
;; |-----------------|
;; | [0.4935 0.5552] |
;; | [0.4935 0.5552] |

;; doesn't work
(ds/->dataset [{:a (tech.v2.tensor/->tensor [0.4935 0.5552]) }
               {:a (tech.v2.tensor/->tensor [0.4935 0.5552]) }])
;; Execution error (ClassCastException) at tech.ml.dataset.parse.mapseq$map$reify$reify$reify__52074/doubleValue (mapseq.clj:39).
;; tech.v2.tensor.impl.Tensor cannot be cast to java.lang.Number

I'm using the newest versions of libpython-clj and datatype:

{:deps {
        clj-python/libpython-clj {:mvn/version "1.38" }
        techascent/tech.ml.dataset {:mvn/version "2.0-beta-11"}
        techascent/tech.datatype {:mvn/version "5.0-beta-2"}}}

Creating own aggregation function

What is the best way to create my own aggregate function?
I know that I can always reduce over columns, but then I leave the datatype flow.
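One minimal answer, sketched in plain Clojure (a dataset is invokable with a column name, as used elsewhere in this document; some-ds and :price are hypothetical):

(defn aggregate-column
  "Apply an arbitrary aggregate function to one column's values."
  [ds colname agg-fn]
  (agg-fn (seq (ds colname))))

;; e.g. a mean:
;; (aggregate-column some-ds :price (fn [vs] (/ (reduce + vs) (count vs))))

This stays outside the datatype flow, as noted; a datatype-aware version would reduce over the column's typed reader instead.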

->flyweight upgrades

  1. ->flyweight isn't very descriptive although it is fun. ->seq-of-maps would be a clearer name.

  2. optimizing it such that it creates a record on the fly and thus all maps share the same keys and structure would be a solid move forward.

semantic-csv does this here using create-struct.

categorical column and list of categories

What is the meaning behind a categorical column?

My understanding is that a column is categorical when its content is discrete. This can sometimes be inferred from the type, but usually it should be inferred from the content, or the user should mark such a column as categorical. Not every String column is categorical, and there are some int64 columns which are categorical.

What is interesting for me is the ability to get the sequence of categories without calling unique on the column. Maybe we should have an operation, set-as-categorical, on a dataset and column name, which adds a metadata tag with a (lazy) list of categories.
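A sketch of deriving that category list lazily, in plain Clojure (attaching it as the proposed metadata tag would depend on the column metadata API; the function name is hypothetical):

(defn categories
  "Lazily-computed distinct values of a column; forced at most once."
  [ds colname]
  (delay (vec (distinct (seq (ds colname))))))

;; @(categories ds "Country") => e.g. ["Germany" "Mexico"]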

zipped dataset

I found a page full of datasets, but the one I've chosen is zipped. Can you add support for the zip format as well? (It's supported by the Java SDK, fortunately; see the sketch below.)
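Meanwhile, a workaround sketch: unwrap the first zip entry with java.util.zip and hand the stream to ->dataset (datasets.zip is hypothetical, and the :file-type option is an assumption, since a stream carries no file extension):

(require '[clojure.java.io :as io])

(with-open [zis (java.util.zip.ZipInputStream. (io/input-stream "datasets.zip"))]
  (.getNextEntry zis)                    ;; position the stream at the first entry
  (ds/->dataset zis {:file-type :csv}))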

import from URL

Could you consider adding an option to read data from a URL?

It would be nice to have:

(ds/->dataset "https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv")

This (on beta-22) throws a Missing config value: :tech-io-cache-local exception.

Workaround:

(ds/->dataset (.openStream (java.net.URL. "https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv")))

recommend result of join-by-column be something that implements PColumnarDataset; Indexing error in :join-table result

Went to implement a naive left-join and a corresponding test, working off the ww3 sample.

Using the following data:
lhs

[{"CustomerID" 1,
"CustomerName" "Alfreds Futterkiste",
"ContactName" "Maria Anders",
"Address" "Obere Str. 57",
"City" "Berlin",
"PostalCode" 12209,
"Country" "Germany"}
{"CustomerID" 2,
"CustomerName" "Ana Trujillo Emparedados y helados",
"ContactName" "Ana Trujillo",
"Address" "Avda. de la Constituciรณn 2222",
"City" "Mรฉxico D.F.",
"PostalCode" 5021,
"Country" "Mexico"}
{"CustomerID" 3,
"CustomerName" "Antonio Moreno Taquerรญa",
"ContactName" "Antonio Moreno",
"Address" "Mataderos 2312",
"City" "Mรฉxico D.F.",
"PostalCode" 5023,
"Country" "Mexico"}]

rhs

[{"OrderID" 10308,
"CustomerID" 2,
"EmployeeID" 7,
"OrderDate" "1996-09-18",
"ShipperID" 3}
{"OrderID" 10309,
"CustomerID" 37,
"EmployeeID" 3,
"OrderDate" "1996-09-19",
"ShipperID" 1}
{"OrderID" 10310,
"CustomerID" 77,
"EmployeeID" 8,
"OrderDate" "1996-09-20",
"ShipperID" 2}]

The result of join-by-column is a map of :join-table and :rhs-missing, not something that can immediately be used by the ds protocols and API. The stock join-by-column completes, and the results print apparently successfully.

{:join-table join-table [1 11]:

| CustomerID |                       CustomerName |  ContactName |                       Address |        City | PostalCode | Country | OrderID | EmployeeID |  OrderDate | ShipperID |
|------------+------------------------------------+--------------+-------------------------------+-------------+------------+---------+---------+------------+------------+-----------|
|          2 | Ana Trujillo Emparedados y helados | Ana Trujillo |  Avda. de la Constitución 2222 |  México D.F. |       5021 |  Mexico |   10308 |          7 | 1996-09-18 |         3 |
, :rhs-missing-indexes [1 2]}

If I try to apply the ds functions to the :join-table result, e.g. columns, I get an error:

[Error printing return value (IndexOutOfBoundsException) at it.unimi.dsi.fastutil.longs.LongArrayList/getLong (LongArrayList.java:267).
Index (1) is greater than or equal to list size (1)
user> 

Seems like maybe this is an off-by-one error somewhere (looks like we should be getting 0 instead of 1 as an index).

Maybe the :rhs-missing information could be elevated to metadata on the join-table, so that the type of join-by-column is String|Keyword|Number -> PColumnarDataset -> PColumnarDataset -> PColumnarDataset ?

cols selection enhance

We have some variants for column(s) selection:

  1. column and columns return a columnar data type
  2. select-columns returns a dataset

That's ok. I would add:

  1. select-column - to create dataset with one column
  2. Enhance arguments to select-columns:
    • seq of names - as currently
    • map - rename and select
    • optionally: a sequence which can contain names or maps, mixing the two above (imagined usage below)
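Imagined usage of the enhanced arguments (hypothetical, not current API):

(ds/select-columns ds [:a :b])                  ;; seq of names - as currently
(ds/select-columns ds {:a "alpha" :b "beta"})   ;; map - rename and select
(ds/select-columns ds [:a {:b "beta"}])         ;; mixed names and maps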

name-values-seq->dataset not parsing tensor correctly?

Hi @cnuernber,

Am I wrong in expecting that

(require '[tech.ml.dataset :as ds])
(= (ds/->dataset [{:a 0 :tensor (tech.v2.tensor/->tensor [1 2 3])}
                  {:a 1 :tensor (tech.v2.tensor/->tensor [4 5 6])}])

   (ds/name-values-seq->dataset {:a [0 1]
                                 :tensor (tech.v2.tensor/->tensor [[1 2 3]
                                                                   [4 5 6]])}))

Instead, I get back the following error:

Execution error (ExceptionInfo) at tech.ml.dataset.impl.dataset/name-values-seq->dataset (dataset.clj:211).
Different sized columns detected: clojure.lang.LazySeq@405

operations on and propagation of missing/NaN values

  • In the integer example below I get an overflow error when I do a subtraction with a missing value. I expected to receive a missing/NaN as a result.
  • Missing values don't propagate. In the float example below, column :a-b should have 1 missing value.
;; integer
(as-> (ds/->dataset [{:a 1 :b 2}]) $
    (ds/update-column $ :b #(tech.ml.dataset.column/set-missing % [0]))
    (ds/new-column $ :a-b (dfn/- ($ :a) ($ :b))))
;; => Error printing return value (ArithmeticException) at clojure.lang.Numbers/throwIntOverflow (Numbers.java:1576).
;; integer overflow

;; float
(as-> (ds/->dataset [{:a 1.0 :b 2.0}]) $
  (ds/update-column $ :b #(tech.ml.dataset.column/set-missing % [0]))
  (ds/new-column $ :a-b (dfn/- ($ :a) ($ :b)))
  (ds/descriptive-stats $))
;; => _unnamed: descriptive-stats [3 9]:
;; | :col-name | :datatype | :n-valid | :n-missing | :mean |  :min |  :max | :standard-deviation | :skew |
;; |-----------+-----------+----------+------------+-------+-------+-------+---------------------+-------|
;; |        :a |  :float64 |        1 |          0 | 1,000 | 1,000 | 1,000 |               0,000 |   NAN |
;; |      :a-b |  :float64 |        1 |          0 |   NAN |   NAN |   NAN |               0,000 |   NAN |
;; |        :b |  :float64 |        0 |          1 |   NAN |   NAN |   NAN |                 NAN |   NAN |

Lightweight empty columns

Implementing a left join example, I realized that a flyweight empty column that projects an arbitrary range of missing values in O(1) space would be very useful for these kinds of joins. Right now, we have to manually create a missing set (e.g. via new-column) full of all the entries. We should be able to reify a set implementation that is just a countable thing that returns a range for reader/seq, and checks whether indices are within the range for containment.

The only tricky bit is promoting if we allow items to be added to the set. Either coerce it to the full missing set as before, or work on a sparse scheme to allow "present" values, sort of like we use missing values for the column stuff.
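A rough reify sketch of that countable thing, assuming only count, seq, and containment checks are needed (unimplemented java.util.Set methods would throw if called; the promotion-on-add behavior is left out):

(defn flyweight-missing-set
  "O(1)-space stand-in for the set of all indices in [0, n)."
  [^long n]
  (reify
    clojure.lang.Counted
    (count [_] (int n))
    clojure.lang.Seqable
    (seq [_] (seq (range n)))
    java.util.Set
    (size [_] (int n))
    (contains [_ idx]
      (and (integer? idx) (<= 0 (long idx)) (< (long idx) n)))
    (iterator [_] (.iterator ^Iterable (range n)))))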

date format yyyyMMdd doesn't parse automatically

The format "yyyyMMdd" is included in tech.ml.dataset.parse.datetime/date-parser-patterns, so I assumed that means that it parses automatically, but it doesn't:

(ds/select-rows (ds/->dataset "https://covidtracking.com/api/v1/states/daily.csv" {:column-whitelist ["date"]}) (range 3))
;; => https://covidtracking.com/api/v1/states/daily.csv [3 1]:
;; |     date |
;; |----------|
;; | 20200422 |
;; | 20200422 |
;; | 20200422 |

(ds/select-rows (ds/->dataset "https://covidtracking.com/api/v1/states/daily.csv"
                              {:column-whitelist ["date"]
                               :parser-fn {"date" [:local-date "yyyyMMdd"]}}) (range 3))
;; => https://covidtracking.com/api/v1/states/daily.csv [3 1]:
;; |       date |
;; |------------|
;; | 2020-04-22 |
;; | 2020-04-22 |
;; | 2020-04-22 |

iterator-seq memory leak (with surprising results)

Just documenting here what I found regarding the possible memory leak due to cached sequences in tech.ml.dataset.parse/csv->columns.

Retaining the head of the invoked iterator-seq on the iterable produced by the univocity CSV parser should result in e.g. large text files being fully realized and cached until after the dataset is constructed and the function returns. In theory, this would trivially blow the heap (e.g. 1GB of string references could expand to significantly more in memory, etc.).

I tested with the corrected test data, using a filesize of 1GB with the legacy (leaky) implementation. I then tested with the proposed fix that uses an atom to clear the reference later, here.

For a baseline file of ~1.3GB, the dataset is able to load in both cases in about the same time. More importantly, the memory footprint and GC pressure are fairly consistent, with the freed implementation having a bit less (around 1.5GB max used). It looks like univocity's CSV parser is returning many flyweight objects for the rows in its iterable, rather than actual arrays of strings, since the memory footprint is surprisingly low. So, the typically destructive caching behavior doesn't impact us at this scale.

To test this further, I pumped the row count for the synthetic data, and generated a 13.4 GB dataset to load. On a 4GB heap, we do see a difference in the memory profile, but again not drastic. Both implementations peak out around 3GB eventually, with the leaking version running into GC issues earlier. Consequently, the non-leaking version is around 5% faster to load the dataset. Both datasets load on a 4GB heap though, which is actually fairly impressive.

In all, I think this is a minor issue since univocity's parsing implementation appears to be efficient, thus mitigating the effects of the leak. For implementation's sake, it's probably fine to leave it as is, unless people run into problems in the future. In the extreme, say someone is trying to load a file that vastly outstrips available memory (say 100GB), it may be a problem, but at that point there are likely smarter options for processing. For laptop use cases, the existing solution appears to scale fine.

`ds-concat` nil punning

> (ds/ds-concat nil ds)
Execution error (IllegalArgumentException) at tech.ml.protocols.dataset/eval36872$fn$G (dataset.clj:7).
No implementation of method: :columns of protocol: #'tech.ml.protocols.dataset/PColumnarDataset found for class: nil

This could just result in ds instead of an error.

`update-column` type detection

Naive attempt:

tech.ml.dataset.vega> ds
#<tech.ml.dataset.generic_columnar_dataset.GenericColumnarDataset@401f8598  [560 3]:

| symbol |       date |  price |
|--------+------------+--------|
|   MSFT | Jan 1 2000 | 39.810 |
|   MSFT | Feb 1 2000 | 36.350 |
|   MSFT | Mar 1 2000 | 43.220 |
|   MSFT | Apr 1 2000 | 28.370 |
|   MSFT | May 1 2000 | 25.450 |
|   MSFT | Jun 1 2000 | 32.540 |
|   MSFT | Jul 1 2000 | 28.400 |
|   MSFT | Aug 1 2000 | 28.400 |
|   MSFT | Sep 1 2000 | 24.530 |
|   MSFT | Oct 1 2000 | 28.020 |
|   MSFT | Nov 1 2000 | 23.340 |
|   MSFT | Dec 1 2000 | 17.650 |
|   MSFT | Jan 1 2001 | 24.840 |
|   MSFT | Feb 1 2001 | 24.000 |
|   MSFT | Mar 1 2001 | 22.250 |
|   MSFT | Apr 1 2001 | 27.560 |
|   MSFT | May 1 2001 | 28.140 |
|   MSFT | Jun 1 2001 | 29.700 |
|   MSFT | Jul 1 2001 | 26.930 |
|   MSFT | Aug 1 2001 | 23.210 |
|   MSFT | Sep 1 2001 | 20.820 |
|   MSFT | Oct 1 2001 | 23.650 |
|   MSFT | Nov 1 2001 | 26.120 |
|   MSFT | Dec 1 2001 | 26.950 |
|   MSFT | Jan 1 2002 | 25.920 |
>
tech.ml.dataset.vega> (.getTime (.parse (java.text.SimpleDateFormat. "MMM dd yyyy") "Oct 1 2001"))
1001916000000
tech.ml.dataset.vega> (ds/update-column ds "date" (fn [col]
                                                    (->> col
                                                         (map (fn [v] (.getTime (.parse (java.text.SimpleDateFormat. "MMM dd yyyy") v)))))))
Execution error (IllegalArgumentException) at tech.libs.tablesaw.tablesaw-column/make-empty-column (tablesaw_column.clj:446).
No matching clause: :object


doubles imported as objects

Take a look at this dataset: https://github.com/generateme/cljplot/blob/master/data/chem97.json

When I load it, :gcsescore gets object type instead of float64.

(chem97 :gcsescore)
;; => #tech.ml.dataset.column<object>[31022]
:gcsescore
[6.625, 7.625, 7.250, 7.500, 6.444, 7.750, 6.750, 6.909, 6.375, 7.750, 7.857, 7.333, 7.750, 7.700, 6.300, 7.300, 6.636, 7.272, 7.200, 6.454, ...]

(reduce + (chem97 :gcsescore))
;; => 194994.49800000832

to-double-array on booleans

The code below raises an exception. The doc string states: "For booleans, true=1 false=0".

(let [d (ds/->dataset [{:a true} {:a true} {:a false}])]
  (col/to-double-array (d :a)))
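Until that's fixed, a plain-Clojure workaround sketch matching the documented semantics:

(let [d (ds/->dataset [{:a true} {:a true} {:a false}])]
  (double-array (map #(if % 1.0 0.0) (d :a))))
;; => doubles [1.0 1.0 0.0]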

:text implementation of make-container erroneously returns nil

tech.ml.dataset.impl.column

(defn make-container
  ([dtype n-elems]
   (case dtype
     :string (make-string-table n-elems "")
     :text (let [list-data (ArrayList.)]
             (dotimes [iter n-elems]
               (.add list-data ""))
             list-data) ;; this line must be added; otherwise the let returns nil, not the ArrayList
     (dtype/make-container :list dtype n-elems)))
  ([dtype]
   (make-container dtype 0)))

Vega viz design next thought

Doing dataset -> vega is good and fun and has already proved very useful.

It occurs to me that there's nothing really dataset-specific in the operations, however. I could imagine splitting the vega/viz portion out to work on (probably?) sequences of values, and then calling into that with thin wrappers that extract the sequences of values from datasets, to keep the code easy to use from ds.

The upside would be that if some data doesn't happen to be in a dataset, we could avoid the extra step of turning it into a dataset just for the purposes of visualization.

group-by and aggregate

Aggregation on a dataset and on groups (after group-by) should result in a dataset. As @keesterbrugge shows, it can be done with some Clojure operations. I wonder if there are more optimal versions.

(defn aggregate
  ([agg-fns-map ds]
   (aggregate {} agg-fns-map ds))
  ([m agg-fns-map ds]
   (into m (map (fn [[k agg-fn]]
                  [k (agg-fn ds)]) agg-fns-map))))

(def aggregate->dataset (comp ds/->dataset vector aggregate))

(defn group-by-columns-and-aggregate
  [gr-colls agg-fns-map ds]
  (->> (ds/group-by identity ds gr-colls)
       (map (fn [[group-idx group-ds]]
              (aggregate group-idx agg-fns-map group-ds)))
       ds/->dataset))

Results can be found here: https://github.com/genmeblog/techtest/blob/master/src/techtest/core.clj#L642

group-by categories

Is there any way to get the list of the keys from a group-by operation without actually grouping? I can do group-by and then call keys, but I'm not sure that's an efficient way.

dataset namespace loading time

Requiring the dataset namespace is very, very slow; it takes 25s on my machine:

user> (time (require '[tech.ml.dataset :as ds]))
"Elapsed time: 25646.0355 msecs"
nil

filter-by

There is a use case where we want to filter based on some column calculations. These calculations can't be done by a simple map.
For example: we need to filter by a (moving) average, or by R's rank. The flow looks like this:

  1. Take a column(s)
  2. Produce new temporary column (calculate moving average for example)
  3. Filter by this new column, or gather indices and select rows
  4. Drop column

Here is concrete case with rank: https://github.com/genmeblog/techtest/blob/master/src/techtest/datatable_dplyr.clj#L554

I can imagine a solution, a filter variant (filter-by maybe?), which takes a sequence and selects only the rows corresponding to the result of a predicate; a sketch follows.
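A sketch of that idea on top of select-rows (used elsewhere in this document; the function name is hypothetical):

(defn filter-by-seq
  "Keep only the rows whose corresponding element of value-seq satisfies pred."
  [ds pred value-seq]
  (ds/select-rows ds (keep-indexed (fn [i v] (when (pred v) i)) value-seq)))

;; e.g. with a precomputed moving-average sequence:
;; (filter-by-seq ds #(> % 10.0) moving-avg-seq)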

rows selection enhanced

I can imagine such helper functions:

  1. head - select first n rows (equivalent to (select-rows ds (range n)))
  2. tail - select last n rows
  3. sample - return n random rows (with repetitions or not)
  4. shuffle (permute) dataset
  5. unique (whole dataset) by rows
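Sketches of the first four on top of select-rows, assuming a row-count function as in the TMD API (unique-by-rows overlaps with the drop-duplicates issue below):

(defn head [ds n]
  (ds/select-rows ds (range (min n (ds/row-count ds)))))

(defn tail [ds n]
  (let [rc (ds/row-count ds)]
    (ds/select-rows ds (range (max 0 (- rc n)) rc))))

(defn sample [ds n]                              ;; with repetitions
  (ds/select-rows ds (repeatedly n #(rand-int (ds/row-count ds)))))

(defn shuffle-rows [ds]
  (ds/select-rows ds (shuffle (range (ds/row-count ds)))))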

ensure that options are passed in the correct place for function call type.

Some of the API functions are dataset-first, some are dataset-last. Regardless: if the dataset is first, it should always be first, and options and additional arguments should follow; the inverse holds for dataset-last functionality. Filter, for instance, breaks this, which makes the options to filter irritating to use.

sorting by columns with given order

It would be great to have the possibility to sort a dataset by columns with a given order. It's fairly easy to sort by columns with names as keywords, ordering ascending:

(->> (ds/->dataset [{:a 1 :b 2} {:a 11 :b -2} {:a 1 :b -1}])
     (ds/sort-by (juxt :a :b)))
;; => _unnamed [3 2]:
;;    | :a | :b |
;;    |----+----|
;;    |  1 | -1 |
;;    |  1 |  2 |
;;    | 11 | -2 |

But things are getting more complicated when:

  • names are not keywords (custom field selector needs to be written)
  • ordering is descending (custom comparator needs to be written)

I think the enhancement should add sort-by-columns with the same arguments as sort-by-column. Also, compare-fn could be either a function (as now) or a seq of orderings, each one of :asc and :desc.

Above example (imagined):

(->> (ds/->dataset [{:a 1 :b 2} {:a 11 :b -2} {:a 1 :b -1}])
     (ds/sort-by-columns [:a :b] [:asc :desc]))
;; => _unnamed [3 2]:
;;    | :a | :b |
;;    |----+----|
;;    |  1 |  2 |
;;    |  1 | -1 |
;;    | 11 | -2 |
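A comparator sketch for the :asc/:desc part, in plain Clojure and usable with any sort that accepts a comparator:

(defn columns-comparator
  "Compare two row maps column by column, honoring per-column :asc/:desc."
  [colnames orders]
  (fn [row-a row-b]
    (loop [[c & cs] colnames
           [o & os] orders]
      (if c
        (let [r (compare (get row-a c) (get row-b c))
              r (if (= o :desc) (- r) r)]
          (if (zero? r) (recur cs os) r))
        0))))

;; (sort (columns-comparator [:a :b] [:asc :desc])
;;       [{:a 1 :b 2} {:a 11 :b -2} {:a 1 :b -1}])
;; => ({:a 1 :b 2} {:a 1 :b -1} {:a 11 :b -2})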

remove-rows

I don't see a function to discard some rows by index.

Naive left join implementation, test

This is the manual implementation that ended up generating the aforementioned issues elsewhere.

joinr@eb11001

Looks like there's a more elegant way to join; the left-join test may be useful going forward though.

providing an infinite `idx-seq` to `set-missing` makes cider unresponsive

The following code makes cider unresponsive. Interrupting the REPL doesn't work; it gives "sync nrepl request timed out (op close)".

(-> (ds/->dataset [{:a 1 :b 2}])
    (ds/update-column :b #(tech.ml.dataset.column/set-missing % (range))))

When I shut down the JVM process, I received an out-of-memory error in the REPL.

Should left-join and right-join return all fields?

Referenced in pull request #25. I think the behavior for left and right joins should be to include the fields from both lhs and rhs; the difference between the two would be in the missing fields and missing values on either the lhs fields or rhs fields in the resulting joined table. Uncertain whether, given the design only returns indices, this is a pedantic difference; any join operation will then return all the fields of both tables. Test predicates in #25 can be altered trivially if I'm mistaken.

reshape

I would like to have a function for reshape. Currently I have only one need: columns -> rows.

input:

| a | b | c |
|---|---|---|
| 1 | 2 | 3 |
| 4 | 5 | 6 |

result:

| column | value |
|--------|-------|
|      a |     1 |
|      a |     4 |
|      b |     2 |
|      b |     5 |
|      c |     3 |
|      c |     6 |
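A plain-Clojure sketch of that reshape (a melt) over a seq of row maps; rebuilding a dataset from the result is then just ->dataset:

(defn melt
  "columns -> rows: one {:column ... :value ...} map per (column, row) pair."
  [rows]
  (for [col (keys (first rows))
        row rows]
    {:column col :value (get row col)}))

;; (melt [{:a 1 :b 2 :c 3} {:a 4 :b 5 :c 6}])
;; => ({:column :a, :value 1} {:column :a, :value 4}
;;     {:column :b, :value 2} {:column :b, :value 5}
;;     {:column :c, :value 3} {:column :c, :value 6})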

options when dataset is created

We have two ways of dataset creation:

  1. From a map or pairs [name seq] using name-values-seq->dataset
  2. Other cases via ->dataset or ->>dataset

Options are passed differently: in the former at a variadic position, in the latter as a second argument.

BTW, there should be one function (you can infer whether something is a map or a sequence of length-2 sequences); a dispatch sketch follows.
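A dispatch sketch of that one function (the name is hypothetical, and it assumes name-values-seq->dataset's variadic options described above):

(defn dataset
  "Route maps and [name values] pairs to name-values-seq->dataset,
  everything else to ->dataset."
  [data & [options]]
  (if (or (map? data)
          (and (sequential? data)
               (every? #(and (sequential? %) (= 2 (count %))) data)))
    (apply ds/name-values-seq->dataset data (mapcat identity options))
    (ds/->dataset data (or options {}))))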

pipeline is very magic

The current pipeline pathway has a lot of magic. Building pipelines requires quite a bit of quoting, and especially when pipelines are mixed with actual values, unquoting the right thing at the right time is a bit of a PITA.

Working to reduce the magic involved while still keeping the pipeline as data would help new people get used to it and also help the understandability of the codebase.

tech.libs.tablesaw.TablesawColumn doesn't implement get-column-value

Working on a subcolumn implementation as a proof of concept, I got almost there and noticed (after copying over the implementation of TablesawColumn and testing) that col-proto/get-column-value is abstract here. Is there a reason for this? It looks like the API function of the same name just delegates to the protocol, so I think not.

Sequence of integers is coerced to float64

Is this proper behaviour?

(first (ds/name-values-seq->dataset {:t (repeatedly 10 #(rand-nth [1 2 3 4]))}))
;; => #tech.ml.dataset.column<float64>[10]
:t
[1.000, 3.000, 4.000, 3.000, 4.000, 4.000, 1.000, 2.000, 1.000, 2.000, ]

Failed to find op

After calling dfn/max or dfn/min on a column, I'm getting the following exception:

1. Unhandled java.lang.Exception
   Failed to find op: :min

  builtin_op_providers.clj:   30  tech.v2.datatype.builtin-op-providers/get-op
  builtin_op_providers.clj:   26  tech.v2.datatype.builtin-op-providers/get-op

dropping/selecting rows with missing values

When I try to drop rows with missing values and all the rows in the dataset have missing values, I get an exception.

;; works (there is one row without missing values)
(let [ds (ds/->dataset [{:a 1 :b 1} {:b 2}])]
  (ds/drop-rows ds (ds/missing ds)))
;; => _unnamed [1 2]:
;;    | :a | :b |
;;    |----+----|
;;    |  1 |  1 |

;; all rows have missing values -> exception
(let [ds (ds/->dataset [{:a 1} {:b 2}])]
  (ds/drop-rows ds (ds/missing ds)))

1. Unhandled java.util.NoSuchElementException
   Empty RoaringArray

         RoaringArray.java:  998  org.roaringbitmap.RoaringArray/assertNonEmpty
         RoaringArray.java:  978  org.roaringbitmap.RoaringArray/first
        RoaringBitmap.java: 2745  org.roaringbitmap.RoaringBitmap/first
                bitmap.clj:   88  tech.v2.datatype.bitmap/eval23542/fn
             protocols.clj:  306  tech.v2.datatype.protocols/eval15210/fn/G
                bitmap.clj:  217  tech.v2.datatype.bitmap/bitmap->efficient-random-access-reader
                bitmap.clj:  210  tech.v2.datatype.bitmap/bitmap->efficient-random-access-reader
                column.clj:  134  tech.ml.dataset.impl.column/->efficient-reader
                column.clj:  130  tech.ml.dataset.impl.column/->efficient-reader
                column.clj:  313  tech.ml.dataset.impl.column.Column/select
                column.clj:  117  tech.ml.dataset.column/select
                column.clj:  114  tech.ml.dataset.column/select
               dataset.clj:  111  tech.ml.dataset.impl.dataset.Dataset/fn
                  core.clj: 2753  clojure.core/map/fn
              LazySeq.java:   42  clojure.lang.LazySeq/sval
              LazySeq.java:   51  clojure.lang.LazySeq/seq
                   RT.java:  535  clojure.lang.RT/seq
                   RT.java:  650  clojure.lang.RT/countFrom
                   RT.java:  643  clojure.lang.RT/count
                column.clj:  321  tech.ml.dataset.column/ensure-column-seq
                column.clj:  302  tech.ml.dataset.column/ensure-column-seq
               dataset.clj:  191  tech.ml.dataset.impl.dataset/new-dataset
               dataset.clj:  181  tech.ml.dataset.impl.dataset/new-dataset
               dataset.clj:  113  tech.ml.dataset.impl.dataset.Dataset/select
                  base.clj:  203  tech.ml.dataset.base/select
                  base.clj:  191  tech.ml.dataset.base/select
                  base.clj:  257  tech.ml.dataset.base/drop-rows
                  base.clj:  251  tech.ml.dataset.base/drop-rows

Move to graph structure for pipeline

This has been suggested several times by several people and it is just getting to be time.

We would like to get these things out of the transform:

  1. Dependency analysis - which columns are dependent upon which inputs, thus making it easier to trim out columns when we aren't going to infer or train on them.

  2. Variable promotion - be able to declaratively specify variables that get automagically promoted to a higher level. This level could be a gridsearch level or it could be a UI level.

Execution of the graph produces both a new graph and a new dataset. In this way we preserve the ability to auto-produce the inference process from the training process.

We intend to model the nodes of the graph as maps with ids, plus an edge list, similar to Cortex.
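For concreteness, one imagined shape for such a graph (all keys and ops here are hypothetical):

{:nodes [{:id :scale-age :op :standard-scaler :inputs [:age]}
         {:id :one-hot   :op :one-hot-encode  :inputs [:color]}]
 :edges [[:scale-age :one-hot]]}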

Looks like ds-concat doesn't respect the missing set from columns.

(ds-col/new-column (ds-col/column-name first-col)
                                         column-values
                                         (ds-col/metadata first-col))

This shows up when building a left-join dataset. I have a function in my branch, empty-column, that leverages new-column to create a column with all indices marked as missing for the datatype. This is a naive rough draft, but it works.

When I go to implement left-join, the results look correct, except the missing data for each column isn't propagated, since I leverage ds-concat above, which uses the arity of new-column that ignores missing data. I "think" this is not correct in the implementation of ds-concat, although I could be missing something about the guarantees concat is supposed to confer. There's no docstring at the moment, so I'm just projecting my expectation onto it.

Unique, drop duplicates

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html

https://www.rdocumentation.org/packages/data.table/versions/1.12.8/topics/duplicated

Pandas seems strictly better, as you get a choice of whether you want the first or the last occurrence.

Continuation of #44 .

Using a parallel-map-reduce algorithm of just iterating every row and building a hashtable of hash codes to index array lists seems like a good way of doing this. Then potentially there is a fast combine method of the hashtables that results in the set of indexes to either add or remove. That is sort of like (group-by-index identity dataset) followed by iterating the indexes and taking first or last or butfirst or butlast.

I guess that also implies an algorithm to get the hash value of a given row, which in and of itself is a thing. A single-threaded sketch of the grouping idea follows.
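The sketch, in plain Clojure (the parallel combine step is left out):

(defn duplicate-row-indexes
  "Indexes to drop, keeping the first occurrence of each distinct row."
  [rows]
  (->> (map-indexed vector rows)    ;; [idx row] pairs
       (group-by second)            ;; row -> [[idx row] ...] in encounter order
       vals
       (mapcat #(rest (map first %)))))  ;; keep-last would use butlast instead of rest

;; (duplicate-row-indexes [{:a 1} {:a 2} {:a 1}]) => (2)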

ip addresses interpreted as durations

tech.ip-addresses> (println (slurp "./test/data/ip-addresses.csv"))
name,ip
Harold,10.0.0.1
Google,172.217.1.206
nil

tech.ip-addresses> (ds/->dataset "./test/data/ip-addresses.csv")
#<tech.ml.dataset.impl.dataset.Dataset@3b60f35c ./test/data/ip-addresses.csv [2 2]:

|   name |              ip |
|--------+-----------------|
| Harold |     PT10H0.001S |
| Google | PT175H37M1.206S |
>
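A workaround sketch using the same :parser-fn option shown in the yyyyMMdd issue above, forcing the column to parse as :string:

(ds/->dataset "./test/data/ip-addresses.csv"
              {:parser-fn {"ip" :string}})
;; the ip column should then remain "10.0.0.1", "172.217.1.206"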


csv with empty column's name

When I export a dataset from R that has row names, the resulting CSV looks like this:

"","Rural Male","Rural Female","Urban Male","Urban Female"
"50-54",11.7,8.7,15.4,8.4
"55-59",18.1,11.7,24.3,13.6
"60-64",26.9,20.3,37,19.3
"65-69",41,30.9,54.6,35.1
"70-74",66,54.3,71.1,50

The CSV contains an empty string as the first column's name (the row names). In such a case, the import fails. I think it would be great to assign some dummy name in that case.
