
Comments (6)

jeancroy commented on August 16, 2024

That's interesting.

Why does fuzzaldrin fail?
It zooms in on the filename, basically computing score = score(fullpath) + score(filename).
It also bails out (returns 0) when it cannot find some of the query characters.
Because it cannot find "model" in the zoomed-in "user.rb", that part gets 0.
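
To make the zoom-and-bail-out behaviour concrete, here is a minimal Python sketch. `toy_score` is a made-up stand-in (query length divided by string length), not fuzzaldrin's real formula; only the "sum of full path and filename" shape and the "return 0 when a character is missing" rule come from the description above. The query "modeluser" is assumed from the file names in this thread.

```python
import posixpath


def is_subsequence(query: str, text: str) -> bool:
    """True if every query character appears in text, in order (case-insensitive)."""
    remaining = iter(text.lower())
    return all(ch in remaining for ch in query.lower())


def toy_score(query: str, text: str) -> float:
    """Stand-in scorer. This is an assumption, NOT fuzzaldrin's real formula."""
    if not text or not is_subsequence(query, text):
        return 0.0  # bail out: some query character was not found
    return len(query) / len(text)


def zoomed_score(query: str, path: str) -> float:
    """score(fullpath) + score(filename), as described above."""
    return toy_score(query, path) + toy_score(query, posixpath.basename(path))


good = zoomed_score("modeluser", "app/models/user.rb")
bad = zoomed_score("modeluser", "drop_moderator_column_on_users.rb")
print(good, bad)
```

With this toy scorer the filename part of app/models/user.rb is 0, because "user.rb" does not contain the whole query, while the long migration name is counted twice and ends up ahead.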

Will the new version solve it?
The good news is that dynamic programming supports errors. I do the bail-out thing to stay compatible, and because it brings a lot of speed. But I do it on the full string.

On the full path, the entry you marked "Best Result" clearly scores higher.

Zoomed in, it's less clear.
The wrong result contains the whole "user" part, as well as "mode", which is requested in the query.
The correct result starts sooner and is shorter. However, size and position do not weigh much. (This is done to support longer strings where the query is an acronym.)

All in all, I believe it's a matter of preference for file name or full path.
I might add a test about this as a reminder that people use the full path too.
Because right now pretty much every test is "prefer basename", except for the exact-match case, which does not cover this.

from atomfiles.

aaronjensen commented on August 16, 2024

@jeancroy the problem is that the algorithm (and, it sounds like, your new optimizations) does not take into account word boundaries or match length. That's very important for fuzzy matching in my experience.

See the example here: https://github.com/garybernhardt/selecta#theory-of-operation
and the algorithm that this uses: https://github.com/JazzCore/ctrlp-cmatcher

Basically, the length of the match of app/models/user.rb when word boundaries are taken into account is as low as it can be (modeluser = 9) whereas the match length of the winning option is massive:

drop_moderator_column_on_users.rb
.....1234567890123.......4567.... -> 17

app/models/user.rb
....12345..6789... -> 9

By taking into account word boundaries and minimum match length, app/models/user.rb is clearly better.
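
The counting in the two diagrams can be reproduced with a small dynamic-programming sketch in Python. This is only my reading of the diagrams, not the actual selecta or ctrlp-cmatcher code: a matched character costs 1 when it sits at a word boundary (the jump to it is free), otherwise it costs its distance from the previous matched character. The separator set and the function names are assumptions made for illustration.

```python
import math

SEPARATORS = set("/_-. ")  # assumed set of word separators


def is_boundary(text: str, j: int) -> bool:
    """A character starts a word if it is first or follows a separator."""
    return j == 0 or text[j - 1] in SEPARATORS


def match_length(query: str, text: str):
    """Minimum boundary-aware match length, or None if the query does not match."""
    q, t = query.lower(), text.lower()
    if not q:
        return 0
    n = len(t)
    # prev[j] = best cost with the current query char matched at position j
    prev = [1 if t[j] == q[0] else math.inf for j in range(n)]
    for ch in q[1:]:
        cur = [math.inf] * n
        for j in range(n):
            if t[j] != ch:
                continue
            for i in range(j):
                if prev[i] == math.inf:
                    continue
                # Jumping to a word boundary is free; otherwise the gap counts.
                step = 1 if is_boundary(t, j) else j - i
                cur[j] = min(cur[j], prev[i] + step)
        prev = cur
    best = min(prev, default=math.inf)
    return None if best == math.inf else best


print(match_length("modeluser", "drop_moderator_column_on_users.rb"))  # 17
print(match_length("modeluser", "app/models/user.rb"))  # 9
```

Under these assumptions the sketch gives the same 17 and 9 as the diagrams, so ranking by smallest match length puts app/models/user.rb first.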

aaronjensen commented on August 16, 2024

By the way, instead of considering base name vs. full path, you might consider giving preference to shorter paths and to matches closer to the end of the path. I'm not sure how selecta and ctrlp-cmatcher do it, but they do it well.

jeancroy commented on August 16, 2024

@aaronjensen added test, passed without modification 👍

aaronjensen commented on August 16, 2024

awesome! hope to see it get merged in soon :) Thanks.

jeancroy commented on August 16, 2024

Because you were kind enough to share ideas, I'll elaborate a bit more.

See this use case?
atom/fuzzy-finder#57 (A)
Somehow ImportanceTableCtrl should be preferred to switch.css when searching for itc.
So that puts some limit on the size-related penalty.

Another one:
atom/fuzzaldrin#17 (B)
diagonals should be preferred to Diagonal when searching for diag.
Here a single character's case, 'D' vs 'd', should win, even if there's one extra s at the end of the word.

Those expectations kind of put haystack size in the role of tiebreaker.
Why would we want to penalize large strings? Because their sheer size allows accidental matches. However, when that happens, the matches tend to be isolated characters in the middle of nowhere, and that's what we are going to detect! We can do so by counting the number of consecutive matches (grouped characters) and giving a large bonus for that.
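
A rough Python sketch of the "count grouped characters" idea, given the positions where the query characters matched. The function names and the squared weighting are assumptions for illustration only; the real scorer may weight runs differently.

```python
from typing import List


def consecutive_runs(positions: List[int]) -> List[int]:
    """Split sorted match positions into the lengths of their consecutive runs."""
    runs: List[int] = []
    for k, pos in enumerate(positions):
        if k > 0 and pos == positions[k - 1] + 1:
            runs[-1] += 1  # extends the current group
        else:
            runs.append(1)  # isolated character starts a new group
    return runs


def consecutive_bonus(positions: List[int]) -> int:
    """Reward long groups; squaring the run length is an assumed weighting."""
    return sum(n * n for n in consecutive_runs(positions))


grouped = consecutive_bonus([5, 6, 7, 8])      # one run of 4
scattered = consecutive_bonus([0, 5, 11, 20])  # four isolated characters
print(grouped, scattered)
```

Four isolated hits earn far less than the same four characters in one group, which is the "accidental match in a large string" case being detected.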

However, in this case "moderator" is almost "model". Also, my script would have attached the "u" to "user" instead of "column". Finally, both match a landmark character (word boundary) at "m" and "u". So I cannot guarantee that a large string will be scored like garbage (because it's not garbage), and a tiebreaker is exactly what you need.

Another interesting fact is that I find consecutive characters very intuitive and use them to resolve otherwise contradictory cases. How do I know the lowercase "i" of "itc" should prefer the uppercase "I" of "Importance", while the lowercase "d" of "diag" should reject the uppercase "D" of "Diagonal" in favor of "diagonals"? Well, "diag" scores consecutive points in the actual word, while "itc" scores consecutive points in the acronym of the word! Where the consecutive matches are controls the affinity for exact case vs. acronym (camelCase). It also ensures that if a large string accidentally matches a landmark that is not part of a sequence requested by the query, we still get a garbage-like score.
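
To illustrate where the consecutive run sits in cases (A) and (B), here is a small Python sketch. It only locates the run (in the word itself or in its acronym) and does not reproduce the actual scoring. `acronym` and `longest_prefix_run` are hypothetical helpers, and the landmark rule (first character, character after a non-alphanumeric one, camelCase capital) is an assumption.

```python
def acronym(text: str) -> str:
    """Landmark characters: word starts and camelCase capitals (assumed rule)."""
    out = []
    for j, ch in enumerate(text):
        if not ch.isalnum():
            continue
        after_separator = j == 0 or not text[j - 1].isalnum()
        camel_hump = ch.isupper() and j > 0 and text[j - 1].islower()
        if after_separator or camel_hump:
            out.append(ch)
    return "".join(out)


def longest_prefix_run(query: str, text: str) -> int:
    """Length of the longest query prefix found contiguously in text (case-insensitive)."""
    q, t = query.lower(), text.lower()
    n = 0
    while n < len(q) and q[: n + 1] in t:
        n += 1
    return n


for query, candidate in [("itc", "ImportanceTableCtrl"), ("diag", "diagonals")]:
    in_word = longest_prefix_run(query, candidate)
    in_acronym = longest_prefix_run(query, acronym(candidate))
    print(query, candidate, "word run:", in_word, "acronym run:", in_acronym)
```

For "itc" the full run shows up in the acronym "ITC" and not in the word, while for "diag" it shows up in the word and not in the acronym. That is the signal that decides between acronym affinity and exact-case affinity.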
