Comments (4)

hausdorff-msft commented on June 22, 2024

I'd like to propose that the easiest first cut at this problem (which I am happy to do if everyone is busy) would be to compare the distribution of term frequencies between every pair of slices, using Kullback-Leibler divergence (an asymmetric "distance" measurement for probability distributions), which is quick to compute and easy to interpret.
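
For reference, the standard definition for two discrete distributions over a shared vocabulary is given below. The symbols P, Q, V, and t are notation introduced here for illustration and do not come from the thread: P and Q stand for the normalized term-frequency distributions of two slices, and V for their common vocabulary.

```latex
D_{\mathrm{KL}}(P \,\|\, Q) \;=\; \sum_{t \in V} P(t)\,\log\frac{P(t)}{Q(t)}
```

The value is non-negative and is zero exactly when P = Q. In general D_KL(P || Q) differs from D_KL(Q || P), which is why each pair has to be measured in both directions. It is also infinite whenever Q(t) = 0 for some term with P(t) > 0, which is one reason smoothing matters when the vocabularies differ.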

Following this measurement, one of three things will be true:

  • The divergence between all pairs (going both ways, since divergences are not necessarily symmetric) is "fairly low" (for some definition of that term), in which case we can conclude that all term tables encode very similar information, and it is likely we can eliminate the vast majority of them.
  • The divergence between "most" pairs is "fairly low", in which case we can conclude that some tables encode very similar information, and it is likely we can eliminate some of them.
  • The divergence between pairs is fairly high, in which case we should re-evaluate our options for determining whether band table training should work as it does.
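
A minimal sketch of how the three outcomes above could be told apart once the pairwise divergences exist. Everything named here is hypothetical: the function `summarize`, the labels it returns, and especially the thresholds `low` and `most`, since the thread deliberately leaves "fairly low" and "most" undefined. Anything that is neither "all low" nor "most low" is lumped into the third case.

```python
def summarize(divergences, low=0.1, most=0.8):
    """Classify a set of pairwise divergences into the three outcomes.

    divergences: dict mapping (slice_a, slice_b) -> divergence, with both
                 directions of every pair present.
    low:         hypothetical cutoff for "fairly low" (same unit as the input).
    most:        hypothetical fraction of pairs that counts as "most".
    """
    values = list(divergences.values())
    if not values:
        raise ValueError("no divergences to summarize")

    low_count = sum(1 for v in values if v <= low)

    if low_count == len(values):
        return "all_low"    # tables look interchangeable
    if low_count / len(values) >= most:
        return "most_low"   # some tables look redundant
    return "high"           # tables differ; revisit the approach
```

The cutoffs would need to be chosen from real measurements rather than taken from this sketch.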

Note that this only tells us whether the term tables are similar -- on top of that, we will need to do additional work to make sure that band table training really is similar for nearly-identical term tables. It may be worth verifying experimentally that this is true as well, depending on how complex the training process is.

Now, lastly, I'd like to suggest a methodology for performing this evaluation:

  1. Decoding the term table and loading it programmatically into memory. May be annoying, depending on how the config files are laid out.
  2. Converting the frequencies of each term table to a probability distribution (which includes justifying our position on subtle issues like whether we should smooth all the distributions over the union of vocabularies of all term tables).
  3. Taking each pair of term tables and computing the KL divergence between them, going both ways, because divergences are not necessarily symmetric.
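
Steps 2 and 3 can be sketched roughly as follows. This is an illustration under stated assumptions, not BitFunnel code: it assumes step 1 has already produced one `{term: count}` dict per slice (the real term table format is not described in the thread), it picks additive smoothing over the union of vocabularies as one possible answer to the smoothing question, and the names `to_distribution`, `kl_divergence`, `pairwise_kl`, and the sample slices are all hypothetical.

```python
import math
from itertools import permutations


def to_distribution(counts, vocabulary, alpha=1.0):
    """Turn raw term counts into probabilities with additive smoothing.

    Every term in the shared vocabulary gets `alpha` pseudo-counts, so no
    probability is zero and the divergence below stays finite.
    """
    total = sum(counts.get(t, 0) for t in vocabulary) + alpha * len(vocabulary)
    return {t: (counts.get(t, 0) + alpha) / total for t in vocabulary}


def kl_divergence(p, q):
    """D_KL(p || q) in nats; p and q must share keys, and q must be > 0."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p if p[t] > 0)


def pairwise_kl(tables, alpha=1.0):
    """Compute the divergence for every ordered pair of slices.

    tables: dict mapping a slice name to its {term: count} dict.
    Returns {(a, b): D_KL(a || b)} covering both directions of each pair.
    """
    vocabulary = set().union(*tables.values())
    dists = {
        name: to_distribution(counts, vocabulary, alpha)
        for name, counts in tables.items()
    }
    return {
        (a, b): kl_divergence(dists[a], dists[b])
        for a, b in permutations(dists, 2)
    }


# Made-up counts for two hypothetical slices, for illustration only.
tables = {
    "slice_a": {"www": 90, "com": 10},
    "slice_b": {"www": 10, "com": 80, "org": 10},
}
result = pairwise_kl(tables)
for (a, b), d in sorted(result.items()):
    print(f"D_KL({a} || {b}) = {d:.4f}")
```

The choice of `alpha` changes the numbers, particularly for rare terms, so it is one of the "subtle issues" from step 2 that would need to be justified before reading much into the results.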

from bitfunnel.

danluu commented on June 22, 2024

I'd like to know the result of that experiment, but I'm still missing part of the bigger picture here.

  1. Do we have any reason to believe that our term tables will be useful for arbitrary users? I'd imagine, for example, that www is in a private row for us (I should really find out where these config files live and write something that can query them, but let's just posit that for now). My guess is that www doesn't need to be in a private row for a lot of users. That's not so bad, I guess -- they'll just see degraded performance. However, something that I suspect will be a problem is that users may have a lot of terms that should be in private rows that we won't place in private rows with our term tables.
  2. Do we have any reason to believe that experimental results about how different our term tables are generalize to all other users? For "all", the answer is surely no since you can construct some kind of pathological data set, but my guess is that the answer here is also no for a lot of actual data sets.
  3. If the answer to "1" is no, what simple process can we have that lets other folks onboard easily, whether they need 1 term table or 10?

hausdorff commented on June 22, 2024

[NOTE: The following was originally posted from hausdorff-msft, which is an account I used to join the Microsoft group. I've deleted the original comment and reposted it from hausdorff because I want to link this work to that account instead; it is onerous on GH to be constantly swapping between accounts.]

I'm not sure you are missing part of the big picture, actually. It seems like we both believe that (1) and (2) are clearly "no", and I think the consensus around (3) is that it's not yet clear. Please correct me if this is wrong.

Also, it is not 100% clear to me what implication this has, if any, for first steps. My belief is that it will be difficult to hit more general usability goals unless we have good tools and repeatable, documented methodologies for quantifying, evaluating, and describing the behavior of critical system components, many of which are completely black-box at this point. I think it is likely we are on the same page with this, but maybe not with the proposal that a good first step is to have a simple, easy set of tools for evaluating similarity of term tables.

Thoughts?

danluu commented on June 22, 2024

The answer to this question appears to be "no". The method used in the Experimental TermTreatment and also the Optimal TermTreatment produces an optimal configuration, modulo some assumptions and some bugs. There's still a lot of room for improvement that we could get from relaxing assumptions, allowing higher ranks, etc., but from what we know of row configurations in Bing, this seems likely to produce better results than the BandTable and the BandTable trainer, while also being much simpler.
