
Comments (2)

talevy23 commented on June 25, 2024

It seems the count is the number of reviews a word appears in, not its total number of occurrences.
A word W is more likely to be an indicator of bad reviews if it appears in many bad reviews than if it appears many times in a single review.
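
To make the difference concrete, here is a minimal sketch of per-review counting. It is not the repository's code: the function name `document_frequency` and the toy reviews are hypothetical, and it assumes reviews are whitespace-tokenized strings.

```python
from collections import defaultdict


def document_frequency(reviews):
    """Count, for each word, the number of reviews that contain it."""
    vocab = defaultdict(int)
    for review in reviews:
        # set() collapses repeats, so one review adds at most 1 per word.
        for word in set(review.split()):
            vocab[word] += 1
    return vocab


# Hypothetical toy data: "bad" occurs 4 times in total but in only 2 reviews.
reviews = ["bad bad bad movie", "bad plot", "great movie"]
df = document_frequency(reviews)
print(df["bad"], df["movie"], df["great"])  # 2 2 1
```

Repeating "bad" three times in the first review does not raise its count above that of a word that appears once in each of two reviews.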

This count is used later when adding 'unknown words'.
If you scroll down the code, you'll find:

import numpy as np

def add_unknown_words(word_vecs, vocab, min_df=1, k=300):
    """
    For words that occur in at least min_df documents, create a separate word vector.
    0.25 is chosen so the unknown vectors have (approximately) the same variance as pre-trained ones.
    """
    for word in vocab:
        # vocab[word] is the number of documents (reviews) that contain the word
        if word not in word_vecs and vocab[word] >= min_df:
            word_vecs[word] = np.random.uniform(-0.25, 0.25, k)

Note that with the default min_df=1 this check keeps every word in vocab, including words that appear in only a single review.
I think the intent would have been clearer with a higher threshold.
For example: filter out words that appear in fewer than 10 reviews.
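
As a rough illustration of what the threshold does, the sketch below applies the same condition as `add_unknown_words` but only lists the words that would receive a random vector. The helper name `words_to_initialize` and the toy counts are hypothetical, and `vocab` is assumed to map each word to the number of reviews containing it.

```python
def words_to_initialize(vocab, word_vecs, min_df=1):
    """Return words that lack a pre-trained vector and occur in >= min_df reviews."""
    return [w for w in vocab if w not in word_vecs and vocab[w] >= min_df]


# Hypothetical counts: number of reviews in which each word appears.
vocab = {"bad": 12, "movie": 40, "zzyzx": 1}
pretrained = {"movie"}  # stands in for the keys of word_vecs

print(words_to_initialize(vocab, pretrained, min_df=1))   # ['bad', 'zzyzx']
print(words_to_initialize(vocab, pretrained, min_df=10))  # ['bad']
```

With min_df=1 every out-of-vocabulary word passes the check, because any word in `vocab` occurs in at least one review; only a larger threshold actually drops rare words.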

from cnn_sentence.

Larry955 commented on June 25, 2024

@talevy23
Thanks a lot! Your explanation really helped and cleared up my confusion. It's a good rationale for filtering out words that appear in fewer than 10 (or any other number of) reviews. So we can conclude that the code only cares about how many reviews a word appears in, not how often it appears within a single review, right?

