
Comments (7)

RyanMarcus avatar RyanMarcus commented on August 12, 2024

I believe this is actually correct -- when we first got the FB dataset, we hadn't realized the original authors had smoothed out several outliers at the end of the file. In the new version, we've restored those original outliers.

To see that the distribution is different from lognormal, plot all the keys but the last 1000.

See also the discussion in S4.2 on the arXiv draft, "Performance of RBS."
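A minimal sketch of that check, assuming the dataset file is laid out as an 8-byte unsigned count followed by that many 64-bit keys in native byte order, and that matplotlib is installed. The helpers `load_sosd_keys`, `drop_tail`, and `plot_cdf` are hypothetical names for illustration, not part of SOSD.

```python
import struct
from array import array


def load_sosd_keys(path):
    """Read keys from a file assumed to hold a uint64 count followed by uint64 keys."""
    with open(path, "rb") as f:
        (count,) = struct.unpack("=Q", f.read(8))
        keys = array("Q")  # unsigned 64-bit integers, native byte order
        keys.fromfile(f, count)
    return keys


def drop_tail(keys, n=1000):
    """Return the (sorted) keys without the last n entries, i.e. the suspected outliers."""
    return keys[:max(0, len(keys) - n)]


def plot_cdf(keys, max_points=100_000):
    """Plot the empirical CDF of sorted keys, subsampled to keep the plot light."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    step = max(1, len(keys) // max_points)
    xs = list(keys[::step])
    ys = [i * step / len(keys) for i in range(len(xs))]
    plt.plot(xs, ys)
    plt.xlabel("key")
    plt.ylabel("fraction of keys <= key")
    plt.show()


# Example usage (path taken from the benchmark commands in this thread):
# keys = load_sosd_keys("data/fb_200M_uint64")
# plot_cdf(drop_tail(keys, 1000))
```

Plotting the full key set and the trimmed key set side by side should show how much the last entries stretch the x-axis.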

from sosd.

alihadian avatar alihadian commented on August 12, 2024

Interesting. Thanks for the clarification.


alihadian avatar alihadian commented on August 12, 2024

> I believe this is actually correct -- when we first got the FB dataset, we hadn't realized the original authors had smoothed out several outliers at the end of the file. In the new version, we've restored those original outliers.
>
> To see that the distribution is different from lognormal, plot all the keys but the last 1000.
>
> See also the discussion in S4.2 on the arXiv draft, "Performance of RBS."

I understand this, but it makes the comparison unfair. For example, RadixBinarySearch suffers from the outliers, but RadixSpline (which is very similar in nature) is manually set to ignore the last 200 records:

truncate_by = 200;

Wouldn't it be more convenient to just revert to the original FB dataset used in the MLforSys paper? I don't think it is common in the real world to have a uniform distribution with only 0.0000125% outliers at such an extremely different scale.


RyanMarcus avatar RyanMarcus commented on August 12, 2024

@alexandervanrenen @andreaskipf we put in this hack for radix spline because the new version, which wasn't released yet, could handle such outliers automatically. Is this the case? If so, can we please get the new code merged ASAP? If not, we need to remove the hack and show the degraded RS numbers.

I don't think it is fruitful to argue over whether extreme outliers are common in real-world data. 2^64-x is often used to mark a special value (e.g., PostgreSQL NULLs and default flags). I think there's obvious utility in having at least one dataset that contains them.

Of course, no structure should be manually specialized, beyond tuning, to handle any particular dataset.
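To make the effect of such sentinel-sized keys concrete, here is a minimal sketch of a radix table that buckets keys on the top r bits of `key - min_key`. It is a simplified model under that assumption, not the actual RS or RBS code; `radix_bucket_counts` and the synthetic keys are hypothetical.

```python
def radix_bucket_counts(keys, radix_bits):
    """Count keys per bucket when bucketing on the top `radix_bits` significant bits.

    Simplified model: bits are taken from (key - min_key), and the number of
    significant bits is set by the largest such difference.
    """
    lo = min(keys)
    span_bits = max(1, (max(keys) - lo).bit_length())
    shift = max(0, span_bits - radix_bits)
    counts = [0] * (1 << min(radix_bits, span_bits))
    for k in keys:
        counts[(k - lo) >> shift] += 1
    return counts


# 100,000 evenly spread keys, then the same keys plus two sentinel-like values.
dense = list(range(0, 1_000_000, 10))
with_outliers = dense + [2**64 - 2, 2**64 - 1]

spread = radix_bucket_counts(dense, 10)
skewed = radix_bucket_counts(with_outliers, 10)

print("largest bucket without outliers:", max(spread))
print("bucket 0 with outliers:", skewed[0])
print("non-empty buckets with outliers:", sum(1 for c in skewed if c))
```

In this model, two keys near 2^64 are enough to push every other key into bucket 0, so the radix layer stops narrowing the search and the remaining work falls on whatever sits below it.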


andreaskipf avatar andreaskipf commented on August 12, 2024


RyanMarcus avatar RyanMarcus commented on August 12, 2024

In the meantime, looks like the best RS config is to use a single radix bit. Tests on my desktop:

Old numbers:

[ryan@ryan-arch-tower SOSD]$ build/benchmark --pareto --only RS data/fb_200M_uint64 data/fb_200M_uint64_equality_lookups_10M 
Repeating lookup code 1 time(s).
Using 1 thread(s).
Only executing indexes matching RS
read 200000000 values from data/fb_200M_uint64 in 825 ms (242.424 M values/s)
data is unique
read 10000000 values from data/fb_200M_uint64_equality_lookups_10M in 75 ms (133.333 M values/s)
RESULT: RS,1,281.006,636316028,5778155784,BinarySearch
RESULT: RS,2,305.131,286466588,5279849287,BinarySearch
RESULT: RS,3,331.326,112973820,4987933362,BinarySearch
RESULT: RS,4,348.465,34267228,4846300852,BinarySearch
RESULT: RS,5,358.911,9114700,4733765847,BinarySearch
RESULT: RS,6,427.388,1941628,4670745930,BinarySearch
RESULT: RS,7,498.816,805676,4816661161,BinarySearch
RESULT: RS,8,571.965,277708,4835943355,BinarySearch
RESULT: RS,9,791.477,80220,4833923097,BinarySearch
RESULT: RS,10,855.381,22956,4848160201,BinarySearch

New numbers:

[ryan@ryan-arch-tower SOSD]$ build/benchmark --pareto --only RS data/fb_200M_uint64 data/fb_200M_uint64_equality_lookups_10M 
Repeating lookup code 1 time(s).
Using 1 thread(s).
Only executing indexes matching RS
read 200000000 values from data/fb_200M_uint64 in 821 ms (243.605 M values/s)
data is unique
read 10000000 values from data/fb_200M_uint64_equality_lookups_10M in 72 ms (138.889 M values/s)
RESULT: RS,1,772.807,502098748,5536266535,BinarySearch
RESULT: RS,2,727.458,269689644,5248179606,BinarySearch
RESULT: RS,3,679.077,108779644,4881358643,BinarySearch
RESULT: RS,4,656.597,33218684,4736136287,BinarySearch
RESULT: RS,5,626.173,8066140,4704914496,BinarySearch
RESULT: RS,6,647.496,1810572,4684253574,BinarySearch
RESULT: RS,7,685.563,740156,4674584564,BinarySearch
RESULT: RS,8,721.185,261340,4680036937,BinarySearch
RESULT: RS,9,777.845,80204,4677753423,BinarySearch
RESULT: RS,10,828.457,22924,4726326099,BinarySearch

As expected, a major regression. With only a spline layer, RS has the same non-monotonic behavior as many tree structures.
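The two runs can also be compared mechanically. The sketch below assumes the comma-separated fields of a `RESULT:` line are index name, variant, lookup latency in ns, index size in bytes, build time in ns, and search method; that reading of the columns is an assumption the thread does not confirm. `parse_results` is a hypothetical helper, and the sample rows are copied from the logs above.

```python
def parse_results(log_text):
    """Map variant -> (latency, size) for each 'RESULT:' line in a benchmark log."""
    results = {}
    for line in log_text.splitlines():
        line = line.strip()
        if not line.startswith("RESULT:"):
            continue
        fields = line[len("RESULT:"):].strip().split(",")
        # Assumed column order: name, variant, latency, size, build time, search method.
        _name, variant, latency, size, _build, _search = fields
        results[int(variant)] = (float(latency), int(size))
    return results


old_log = """
RESULT: RS,1,281.006,636316028,5778155784,BinarySearch
RESULT: RS,5,358.911,9114700,4733765847,BinarySearch
RESULT: RS,10,855.381,22956,4848160201,BinarySearch
"""
new_log = """
RESULT: RS,1,772.807,502098748,5536266535,BinarySearch
RESULT: RS,5,626.173,8066140,4704914496,BinarySearch
RESULT: RS,10,828.457,22924,4726326099,BinarySearch
"""

old, new = parse_results(old_log), parse_results(new_log)
for variant in sorted(old):
    ratio = new[variant][0] / old[variant][0]
    print(f"variant {variant}: {old[variant][0]:.1f} -> {new[variant][0]:.1f} ({ratio:.2f}x)")
```

Pasting the full logs into `old_log` and `new_log` gives the per-variant ratio for all ten configurations.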


andreaskipf avatar andreaskipf commented on August 12, 2024

We have just updated RS. The new version won't handle these outliers in the immediate future.

