Comments (4)
I'd like to propose that the easiest first cut at this problem (which I am happy to do if everyone is busy) would be to compare the distribution of term frequencies between every pair of slices, using Kullback-Leibler divergence (an asymmetric "distance" measure for probability distributions), which is quick to compute and easy to interpret.
Following this measurement, one of three things will be true:
- The divergence between all pairs (going both ways, since KL divergence is not symmetric) is "fairly low" (for some definition of that term), in which case we can conclude that all term tables encode very similar information, and it is likely we can eliminate the vast majority of them.
- The divergence between "most" pairs is "fairly low", in which case we can conclude that some tables encode very similar information, and it is likely we can eliminate some of them.
- The divergence between pairs is fairly high, in which case we should re-evaluate whether band table training should work the way it currently does.
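To make the asymmetry concrete, here is a minimal sketch (plain Python, not BitFunnel code; the two toy distributions are made up for illustration) of computing KL divergence in both directions:

```python
import math

def kl_divergence(p, q):
    # D(p || q) = sum over terms t of p[t] * log(p[t] / q[t]), in nats.
    # Assumes q[t] > 0 wherever p[t] > 0 (smoothing would guarantee this).
    return sum(p_t * math.log(p_t / q[t]) for t, p_t in p.items() if p_t > 0)

# Two toy term-frequency distributions over the same vocabulary.
p = {"the": 0.5, "cat": 0.3, "dog": 0.2}
q = {"the": 0.6, "cat": 0.1, "dog": 0.3}

print(kl_divergence(p, q))  # D(p || q)
print(kl_divergence(q, p))  # D(q || p) -- generally a different number
```

Because the two directions generally disagree, any "fairly low" threshold has to hold for both D(p || q) and D(q || p), which is why the pairs above are checked going both ways.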
Note that this only tells us whether the term tables are similar -- on top of that, we will need to do additional work to make sure that band table training really does behave similarly for nearly-identical term tables. It may be worth verifying experimentally that this is true as well, depending on how complex the training process is.
Now, lastly, I'd like to suggest a methodology for performing this evaluation:
- Decoding the term table and loading it programmatically into memory. May be annoying, depending on how the config files are laid out.
- Converting the frequencies of each term table to a probability distribution (which includes justifying our position on subtle issues like whether we should smooth all the distributions over the union of vocabularies of all term tables).
- Taking each pair of term tables and computing the KL divergence in both directions, since KL divergence is not symmetric.
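The steps above can be sketched roughly as follows. This is a hypothetical illustration, assuming the term tables have already been decoded into term-to-count dicts; the table names, counts, and the `smooth_over_union` helper are all made up, and additive smoothing over the union vocabulary is just one defensible choice among several:

```python
import math
from itertools import permutations

def smooth_over_union(term_counts, vocab, alpha=0.5):
    # Additive smoothing over the union vocabulary, so every term gets
    # nonzero probability and the KL divergence stays finite.
    total = sum(term_counts.values()) + alpha * len(vocab)
    return {t: (term_counts.get(t, 0) + alpha) / total for t in vocab}

def kl(p, q):
    # D(p || q) in nats.
    return sum(p[t] * math.log(p[t] / q[t]) for t in p if p[t] > 0)

# Stand-ins for decoded term tables (term -> raw frequency).
tables = {
    "tableA": {"the": 120, "cat": 30, "dog": 10},
    "tableB": {"the": 100, "cat": 5, "fish": 40},
    "tableC": {"the": 110, "cat": 28, "dog": 12},
}

vocab = set().union(*tables.values())
dists = {name: smooth_over_union(counts, vocab)
         for name, counts in tables.items()}

# Every ordered pair, since KL divergence is not symmetric.
for a, b in permutations(dists, 2):
    print(f"D({a} || {b}) = {kl(dists[a], dists[b]):.4f}")
```

On this toy data, tableA and tableC are near-duplicates and show a much smaller divergence than either does against tableB; that is the kind of signal that would justify eliminating redundant tables.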
from bitfunnel.
I'd like to know the result of that experiment, but I'm still missing part of the bigger picture here.
1. Do we have any reason to believe that our term tables will be useful for arbitrary users? I'd imagine, for example, that `www` is in a private row for us (I should really find out where these config files live and write something that can query them, but let's just posit that for now). My guess is that `www` doesn't need to be in a private row for a lot of users. That's not so bad, I guess -- they'll just see degraded performance. However, something that I suspect will be a problem is that users may have a lot of terms that should be in private rows that our term tables won't place in private rows.
2. Do we have any reason to believe that experimental results about how different our term tables are will generalize to all other users? For "all", the answer is surely no, since you can construct some kind of pathological data set, but my guess is that the answer is also no for a lot of actual data sets.
3. If the answer to "1" is no, what simple process can we have that lets other folks onboard easily, whether they need 1 term table or 10?
[NOTE: The following was originally posted from hausdorff-msft, which is an account I used to join the Microsoft group. I've deleted the original comment and reposted it from hausdorff because I want to link this work to that account instead; it is onerous on GH to be constantly swapping between accounts.]
I'm not sure you are missing part of the big picture, actually. It seems like we both believe that (1) and (2) are clearly "no", and I think the consensus around (3) is that it's not yet clear. Please correct me if this is wrong.
Also, it is not 100% clear to me what implication this has, if any, for first steps. My belief is that it will be difficult to hit more general usability goals unless we have good tools and repeatable, documented methodologies for quantifying, evaluating, and describing the behavior of critical system components, many of which are at this point completely black boxes. I think it is likely we are on the same page with this, but maybe not with the proposal that a good first step is a simple, easy set of tools for evaluating the similarity of term tables.
Thoughts?
The answer to this question appears to be "no". The method used in the `ExperimentalTermTreatment` and also the `OptimalTermTreatment` produces an optimal configuration, modulo some assumptions and some bugs. There's still a lot of room for improvement that we could get from relaxing assumptions, allowing higher ranks, etc., but from what we know of row configurations in Bing, this seems likely to produce better results than the BandTable and the BandTable trainer, with something that's also much simpler.