Comments (9)
It does - due to previous bugs it uses a linked hash map.
from tech.ml.dataset.
Which bugs? Why is it better this way?
Specifically, you use a linked hash map when you care about order of insertion or modification. There was an issue around group-by-column, I think, but I can't find it. There is an expectation among people coming from R that the order of the returned map matches the order in which the data was found in the original column.
We don't respect this for group-by-column-agg, as there we use the concurrent hashmap, but for the non-massive cases of group-by I tried to respect this ordering.
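A minimal sketch of the ordering behaviour described above, using only `java.util.LinkedHashMap` and Clojure core. The helper `group-indexes` is hypothetical and is not the tech.ml.dataset implementation; it only illustrates how a linked hash map keeps keys in first-seen order.

```clojure
(import 'java.util.LinkedHashMap)

(defn group-indexes
  "Hypothetical helper: maps each distinct value to the vector of indexes
  where it occurs. Keys stay in first-seen order because LinkedHashMap
  iterates in insertion order."
  [values]
  (let [m (LinkedHashMap.)]
    (doseq [[idx v] (map-indexed vector values)]
      ;; Re-putting an existing key does not change its position.
      (.put m v (conj (or (.get m v) []) idx)))
    m))

(def grouped (group-indexes ["b" "a" "b" "c" "a"]))

(vec (keys grouped)) ;; => ["b" "a" "c"]
(get grouped "b")    ;; => [0 2]
```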
oic, that's cool and makes sense - I wonder if it would be possible 🍰 to have cake and also eat it by using a Clojure `sorted-map`? This situation was very confusing yesterday because CSV import helpfully made a column integers (nice, saves ram!) but then, after `group-by-column->indexes`, another source of the same numbers must have been `Long`s, and so every lookup was returning `nil`.
Linked is not same as sorted - linked is insertion order so order of the column data.
The map itself should have had longs in it regardless - are you saying the keys in the group-by-column->indexes were integers? Regardless, that sounds like a lot of pain. We should file a small repro of that because it is annoying. I am working on a linked hash map in hamf to address this, which will use the normal equiv pathways.
Unequivocally it is fairly wrong, but it is extremely wrong if the map keys aren't longs.
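To illustrate the linked versus sorted distinction made in this comment, here is a small sketch using plain `java.util.LinkedHashMap` and Clojure's `sorted-map`. The sample `column` values are made up for illustration.

```clojure
(import 'java.util.LinkedHashMap)

;; Made-up column data, deliberately not in ascending order.
(def column [30 10 20 10])

;; Linked: keys come back in the order they were first inserted.
(def linked
  (let [m (LinkedHashMap.)]
    (doseq [v column]
      (.put m v true))
    m))

;; Sorted: keys come back in comparator order, whatever the input order was.
(def sorted
  (into (sorted-map) (map (fn [v] [v true]) column)))

(vec (keys linked)) ;; => [30 10 20]
(vec (keys sorted)) ;; => [10 20 30]
```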
The big change is in hamf
> Linked is not same as sorted - linked is insertion order so order of the column data.
Got it! I was thinking of it wrong.
> are you saying the keys in the group-by-column->indexes were integers?
Ah, no, sorry for the confusion, the keys in the lookup map (result of `group-by-column->indexes`) were longs, but the column values they were derived from were int32.
So, when the ints in the column were used to look up from the map keyed by longs, it failed.
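A sketch of why such a lookup fails, assuming the lookup map behaves like a plain `java.util` map (here a `HashMap` stands in for it; this is not the library's own class). Java maps compare keys with `.equals`, and an `Integer` is never `.equals` to a `Long`, even though Clojure's `=` treats them as the same number.

```clojure
(import 'java.util.HashMap)

(def long-key 111111)       ; Clojure integer literals are java.lang.Long
(def int-key (int 111111))  ; boxed as java.lang.Integer when stored in a var

(= long-key int-key)        ;; => true  (Clojure numeric equivalence)
(.equals long-key int-key)  ;; => false (Java equality is type-sensitive)

;; Stand-in for a lookup map keyed by longs.
(def java-map (doto (HashMap.) (.put long-key [0])))

(get java-map long-key) ;; => [0]
(get java-map int-key)  ;; => nil
```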
> We should file a small repro of that because it is annoying.
Here's a minimal case - the real case, in the annealer, was more delicate and convoluted, as you can imagine.
However, this demonstrates the unexpected `nil`.
repl'ing this stuff is 🎲 actually dicey, because `Integer`s print as things that read as `Long`s - so it's very easy to get confused.
user> (require '[tech.v3.dataset :as ds])
nil
user> (spit "tmp.csv" "a\n111111\n2222222\n333333\n")
nil
user> (def ds (ds/->dataset "tmp.csv"))
#'user/ds
user> (get ds "a")
#tech.v3.dataset.column<int32>[3]
a
[111111, 2222222, 333333]
user> (def lookup (ds/group-by-column->indexes ds "a"))
#'user/lookup
user> (def v (first (int-array (get ds "a"))))
#'user/v
user> v
111111
user> (get lookup v)
nil
user> (get lookup 111111)
[0]
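The REPL pitfall mentioned before the transcript can be seen without any dataset at all. This is a sketch using only Clojure core: an `Integer` prints exactly like a `Long`, and reading that printed text back yields a `Long`.

```clojure
(def v (int 111111))

(class v)                         ;; => java.lang.Integer
(pr-str v)                        ;; => "111111"

;; Pasting the printed value back into the REPL gives a different type.
(class (read-string (pr-str v)))  ;; => java.lang.Long
```

Checking `(class v)` is therefore more reliable than eyeballing the printed value when a lookup unexpectedly returns `nil`.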
- https://github.com/cnuernber/dtype-next/blob/master/test/tech/v3/datatype_test.clj#L855
- https://github.com/techascent/tech.ml.dataset/blob/master/test/tech/v3/dataset_test.clj#L1719
The bigger question to me is when will we find the first subtle issues with the linkedhashmap implementation...
> when will we find the first subtle issues with the linkedhashmap implementation...
With a sufficiently royal interpretation of the 'we' in this question, the answer becomes "in due time" (:
New tests look cool and relevant, I look forward to upgrading and taking out some of the hacks `(into {} ...)` I've put in place.
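For readers hitting the same problem, here is a sketch of the kind of `(into {} ...)` workaround referred to above, plus an alternative that coerces the key. It assumes the lookup behaves like a plain `java.util.LinkedHashMap`; the map contents are made up, and this is not taken from the actual annealer code.

```clojure
(import 'java.util.LinkedHashMap)

;; Stand-in for a lookup map keyed by longs.
(def lookup
  (doto (LinkedHashMap.)
    (.put 111111 [0])
    (.put 2222222 [1])))

(def v (int 111111)) ; an Integer, as produced from an int32 column

(get lookup v)             ;; => nil, Java .equals rejects Integer vs Long

;; Workaround 1: copy into a Clojure map, which compares numbers by equiv.
(get (into {} lookup) v)   ;; => [0]

;; Workaround 2: coerce the key to a long before looking it up.
(get lookup (long v))      ;; => [0]
```

Note that copying into a Clojure hash map gives up the insertion-order guarantee for larger maps, so the key coercion may be preferable when ordering matters.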