Comments (7)
I believe this is actually correct -- when we first got the FB dataset, we hadn't realized the original authors had smoothed out several outliers at the end of the file. In the new version, we've restored those original outliers.
To see that the distribution is different from lognormal, plot all the keys but the last 1000.
See also the discussion in S4.2 on the arXiv draft, "Performance of RBS."
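A minimal sketch of that check, using only the standard library and printing quantiles instead of drawing a plot. It assumes the data file is a little-endian uint64 element count followed by that many sorted uint64 keys; that layout, and the helper names `load_keys` and `quantiles`, are assumptions for illustration and are not taken from the SOSD code.

```python
import struct
import sys
from array import array


def load_keys(path):
    """Read keys from a file ASSUMED to hold a uint64 count, then that many uint64 keys."""
    with open(path, "rb") as f:
        (count,) = struct.unpack("<Q", f.read(8))
        keys = array("Q")
        keys.fromfile(f, count)
    if sys.byteorder == "big":
        keys.byteswap()  # the file is assumed to be little-endian
    return keys


def quantiles(keys, drop_last=0, points=5):
    """Evenly spaced quantiles of sorted keys, ignoring the last `drop_last` keys."""
    n = len(keys) - drop_last
    return [keys[(n - 1) * i // (points - 1)] for i in range(points)]


# Hypothetical usage: compare the key range with and without the tail.
# keys = load_keys("data/fb_200M_uint64")
# print(quantiles(keys))                  # includes the restored outliers
# print(quantiles(keys, drop_last=1000))  # "all the keys but the last 1000"
```

If the outliers are confined to the tail, the largest quantile should shrink sharply once the last 1000 keys are dropped, while the lower quantiles stay put.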
from sosd.
Interesting. Thanks for the clarification.
I believe this is actually correct -- when we first got the FB dataset, we hadn't realized the original authors had smoothed out several outliers at the end of the file. In the new version, we've restored those original outliers.
To see that the distribution is different from lognormal, plot all the keys but the last 1000.
See also the discussion in S4.2 on the arXiv draft, "Performance of RBS."
I understand this, but it makes the comparison unfair. For example, RadixBinarySearch suffers from the outliers, but RadixSpline (which works in almost the same way) is manually set to ignore the last 200 records:
SOSD/competitors/radix_spline.h, line 193 in 3b52d09
Wouldn't it be more convenient to just revert to the original FB dataset from the MLforSys paper? I don't think it is common in the real world to have a uniform distribution with only 0.0000125% outliers on such an extremely different scale.
@alexandervanrenen @andreaskipf we put in this hack for radix spline because the new version, which wasn't released yet, could handle such outliers automatically. Is this the case? If so, can we please get the new code merged ASAP? If not, we need to remove the hack and show the degraded RS numbers.
I don't think it is fruitful to argue over whether extreme outliers are common in real-world data. 2^64-x is often used to mark a special value (e.g., PostgreSQL NULLs and default flags). I think there's obvious utility in having at least one dataset that contains them.
Of course, no structure should be manually specialized, beyond tuning, to handle any particular dataset.
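To make the effect of such sentinel-style keys concrete, here is a toy sketch of a radix table that buckets keys by the top bits of their offset from the minimum key. It is an illustrative model only and is not the RadixSpline or RadixBinarySearch code; the function name `radix_bucket_counts` is hypothetical.

```python
def radix_bucket_counts(keys, radix_bits):
    """Count keys per bucket when bucketing by the top `radix_bits` of (key - min)."""
    lo, hi = min(keys), max(keys)
    # Number of bits needed to cover the key range (at least radix_bits).
    span_bits = max((hi - lo).bit_length(), radix_bits)
    shift = span_bits - radix_bits
    counts = [0] * (1 << radix_bits)
    for k in keys:
        counts[(k - lo) >> shift] += 1
    return counts


dense = list(range(1000))
print(radix_bucket_counts(dense, 2))                # keys spread over all buckets
print(radix_bucket_counts(dense + [2**64 - 1], 2))  # one outlier stretches the range
```

In this model a single key near 2^64 widens the range so much that every other key lands in the first bucket, which leaves the radix layer with almost no discriminating power.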
In the meantime, looks like the best RS config is to use a single radix bit. Tests on my desktop:
Old numbers:
[ryan@ryan-arch-tower SOSD]$ build/benchmark --pareto --only RS data/fb_200M_uint64 data/fb_200M_uint64_equality_lookups_10M
Repeating lookup code 1 time(s).
Using 1 thread(s).
Only executing indexes matching RS
read 200000000 values from data/fb_200M_uint64 in 825 ms (242.424 M values/s)
data is unique
read 10000000 values from data/fb_200M_uint64_equality_lookups_10M in 75 ms (133.333 M values/s)
RESULT: RS,1,281.006,636316028,5778155784,BinarySearch
RESULT: RS,2,305.131,286466588,5279849287,BinarySearch
RESULT: RS,3,331.326,112973820,4987933362,BinarySearch
RESULT: RS,4,348.465,34267228,4846300852,BinarySearch
RESULT: RS,5,358.911,9114700,4733765847,BinarySearch
RESULT: RS,6,427.388,1941628,4670745930,BinarySearch
RESULT: RS,7,498.816,805676,4816661161,BinarySearch
RESULT: RS,8,571.965,277708,4835943355,BinarySearch
RESULT: RS,9,791.477,80220,4833923097,BinarySearch
RESULT: RS,10,855.381,22956,4848160201,BinarySearch
New numbers:
[ryan@ryan-arch-tower SOSD]$ build/benchmark --pareto --only RS data/fb_200M_uint64 data/fb_200M_uint64_equality_lookups_10M
Repeating lookup code 1 time(s).
Using 1 thread(s).
Only executing indexes matching RS
read 200000000 values from data/fb_200M_uint64 in 821 ms (243.605 M values/s)
data is unique
read 10000000 values from data/fb_200M_uint64_equality_lookups_10M in 72 ms (138.889 M values/s)
RESULT: RS,1,772.807,502098748,5536266535,BinarySearch
RESULT: RS,2,727.458,269689644,5248179606,BinarySearch
RESULT: RS,3,679.077,108779644,4881358643,BinarySearch
RESULT: RS,4,656.597,33218684,4736136287,BinarySearch
RESULT: RS,5,626.173,8066140,4704914496,BinarySearch
RESULT: RS,6,647.496,1810572,4684253574,BinarySearch
RESULT: RS,7,685.563,740156,4674584564,BinarySearch
RESULT: RS,8,721.185,261340,4680036937,BinarySearch
RESULT: RS,9,777.845,80204,4677753423,BinarySearch
RESULT: RS,10,828.457,22924,4726326099,BinarySearch
As expected, a major regression. With only a spline layer, RS has the same non-monotonic behavior as many tree structures.
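The two runs can be compared programmatically with a short sketch like the one below. It assumes that the second field of each RESULT line identifies the configuration and that the third field is the lookup latency; that reading of the columns is an assumption, and the helper names `parse_results` and `slowdown` are hypothetical.

```python
def parse_results(log_text):
    """Map configuration number to the third RESULT field (assumed lookup latency)."""
    results = {}
    for line in log_text.splitlines():
        if not line.startswith("RESULT:"):
            continue
        fields = line[len("RESULT:"):].strip().split(",")
        results[int(fields[1])] = float(fields[2])
    return results


def slowdown(old_log, new_log):
    """Ratio new/old per configuration; values above 1 mean the new run is slower."""
    old, new = parse_results(old_log), parse_results(new_log)
    return {v: new[v] / old[v] for v in sorted(old) if v in new}


# Two lines from each run above, copied verbatim.
old_log = """RESULT: RS,1,281.006,636316028,5778155784,BinarySearch
RESULT: RS,10,855.381,22956,4848160201,BinarySearch"""
new_log = """RESULT: RS,1,772.807,502098748,5536266535,BinarySearch
RESULT: RS,10,828.457,22924,4726326099,BinarySearch"""

for variant, ratio in slowdown(old_log, new_log).items():
    print(f"RS config {variant}: {ratio:.2f}x")
```

Feeding in the full logs gives one ratio per configuration, which makes it easy to see where the regression is concentrated.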
We have just updated RS. The new version won't handle these outliers in the immediate future.