It's probably not reasonable to ship the 1.6GB that we get as output from training. It

Determine whether or not we need TermTable/BandTable training (bitfunnel, CLOSED)

bitfunnel commented on June 22, 2024

Determine whether or not we need TermTable/BandTable training


Comments (4)

hausdorff-msft commented on June 22, 2024

I'd like to propose that the easiest first cut at this problem (which I am happy to do if everyone is busy) would be to compare the distribution of term frequencies between every pair of slices using Kullback-Leibler divergence, an asymmetric "distance" measure for probability distributions that is quick to compute and easy to interpret.
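For reference, the standard definition of the divergence for two discrete distributions is given below. The symbols $P$, $Q$ (term-frequency distributions of two slices) and $V$ (their shared vocabulary) are introduced here for illustration and do not appear in the thread.

```latex
D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{t \in V} P(t) \log \frac{P(t)}{Q(t)}
```

In general $D_{\mathrm{KL}}(P \,\|\, Q) \neq D_{\mathrm{KL}}(Q \,\|\, P)$, which is why each pair has to be measured in both directions. The sum is finite only when $Q(t) > 0$ wherever $P(t) > 0$, so terms missing from one slice need some form of smoothing.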

Following this measurement, one of three things will be true:

1. The divergence between all pairs (going both ways, since divergences are not necessarily symmetric) is "fairly low" (for some definition of that term), in which case we can conclude that all term tables encode very similar information, and it is likely we can eliminate the vast majority of them.
2. The divergence between "most" pairs is "fairly low", in which case we can conclude that some tables encode very similar information, and it is likely we can eliminate some of them.
3. The divergence between pairs is fairly high, in which case we should re-evaluate our options for determining whether band table training should work as it does.

Note that this only tells us whether the term tables are similar -- on top of that, we will need to do additional work to make sure that band table training really is similar for nearly-identical term tables. It may be worth verifying experimentally that this is true as well, depending on how complex the training process is.

Now, lastly, I'd like to suggest a methodology for performing this evaluation:

1. Decoding each term table and loading it programmatically into memory. This may be annoying, depending on how the config files are laid out.
2. Converting the frequencies of each term table to a probability distribution (which includes justifying our position on subtle issues like whether we should smooth all the distributions over the union of vocabularies of all term tables).
3. Taking each pair of term-table distributions and computing the KL divergence on each pair, going both ways, because divergences are not necessarily symmetric.
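The last two steps can be sketched roughly as follows. This is a minimal illustration, not BitFunnel code: it assumes each term table has already been decoded into a plain dict of term to frequency count, it picks additive (Laplace) smoothing over the union of vocabularies as one possible answer to the smoothing question, and the function names, the `alpha` parameter, and the sample slices are all hypothetical.

```python
import math
from itertools import permutations


def to_distribution(counts, vocabulary, alpha=1.0):
    """Turn raw term counts into a probability distribution over `vocabulary`.

    Uses additive smoothing (an assumption, not something the thread settled on)
    so that every term in the shared vocabulary gets non-zero probability.
    """
    total = sum(counts.get(term, 0) for term in vocabulary) + alpha * len(vocabulary)
    return {term: (counts.get(term, 0) + alpha) / total for term in vocabulary}


def kl_divergence(p, q):
    """Return D_KL(P || Q) in bits; P and Q must share the same keys, with q > 0."""
    return sum(p[t] * math.log2(p[t] / q[t]) for t in p if p[t] > 0)


def pairwise_kl(tables, alpha=1.0):
    """Compute the divergence for every ordered pair of term tables."""
    vocabulary = set().union(*(counts.keys() for counts in tables.values()))
    dists = {name: to_distribution(counts, vocabulary, alpha)
             for name, counts in tables.items()}
    # permutations() yields both (a, b) and (b, a), since KL is asymmetric.
    return {(a, b): kl_divergence(dists[a], dists[b])
            for a, b in permutations(dists, 2)}


# Hypothetical term counts standing in for two decoded term tables.
slices = {
    "slice_a": {"www": 90, "com": 10},
    "slice_b": {"www": 50, "com": 30, "bitfunnel": 20},
}

divergences = pairwise_kl(slices)
for (a, b), value in sorted(divergences.items()):
    print(f"D_KL({a} || {b}) = {value:.4f} bits")
```

What counts as "fairly low" is still a judgment call. The sketch only produces the numbers that such a threshold would be applied to.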


danluu commented on June 22, 2024

I'd like to know the result of that experiment, but I'm still missing part of the bigger picture here.

1. Do we have any reason to believe that our term tables will be useful for arbitrary users? I'd imagine, for example, that www is in a private row for us (I should really find out where these config files live and write something that can query them, but let's just posit that for now). My guess is that www doesn't need to be in a private row for a lot of users. That's not so bad, I guess -- they'll just see degraded performance. However, something that I suspect will be a problem is that users may have a lot of terms that should be in private rows that we won't place in private rows with our term tables.
2. Do we have any reason to believe that experimental results about how different our term tables are generalize to all other users? For "all", the answer is surely no since you can construct some kind of pathological data set, but my guess is that the answer here is also no for a lot of actual data sets.
3. If the answer to "1" is no, what simple process can we have that lets other folks onboard easily, whether they need 1 term table or 10?


hausdorff commented on June 22, 2024

[NOTE: The following was originally posted from hausdorff-msft, which is an account I used to join the Microsoft group. I've deleted the original comment and reposted it from hausdorff because I want to link this work to that account instead; it is onerous on GH to be constantly swapping between accounts.]

I'm not sure you are missing part of the big picture, actually. It seems like we both believe that (1) and (2) are clearly "no", and I think the consensus around (3) is that it's not yet clear. Please correct me if this is wrong.

Also, it is not 100% clear to me what implication this has, if any, for first steps. My belief is that it will be difficult to hit more general usability goals unless we have good tools and repeatable, documented methodologies for quantifying, evaluating, and describing the behavior of critical system components, many of which at this point are completely black-box. I think it is likely we are on the same page with this, but maybe not with the proposal that a good first step is to have a simple, easy set of tools for evaluating similarity of term tables.

Thoughts?


danluu commented on June 22, 2024

The answer to this question appears to be "no". The method used in the Experimental TermTreatment and also the Optimal TermTreatment produces an optimal configuration, modulo some assumptions and some bugs. There's still a lot of room for improvement that we could get from relaxing assumptions, allowing higher ranks, etc., but from what we know of row configurations in Bing, this seems likely to produce better results than the BandTable and the BandTable trainer, with something that's also much simpler.

