Comments (2)
It seems the count we get is per review: the number of reviews a word appears in, not its total number of occurrences.
Word W is more likely to be an indicator of bad reviews if it appears in many bad reviews than if it appears many times in a single review.
This count is used later when adding 'unknown words'.
If you scroll down the code you'll find:
def add_unknown_words(word_vecs, vocab, min_df=1, k=300):
    """
    For words that occur in at least min_df documents, create a separate word vector.
    0.25 is chosen so the unknown vectors have (approximately) same variance as pre-trained ones
    """
    for word in vocab:
        if word not in word_vecs and vocab[word] >= min_df:
            word_vecs[word] = np.random.uniform(-0.25,0.25,k)
Here we don't consider how often a word appears within a single review, only how many reviews it appears in.
I think it would have been clearer with a higher threshold.
For example: filter out words that appear in fewer than 10 reviews.
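To make the distinction concrete, here is a minimal sketch of counting per-review frequency and then applying a threshold. The helper `build_doc_freq` and the toy reviews are hypothetical and only for illustration; the actual vocabulary construction in the repository may differ. `add_unknown_words` repeats the function quoted above.

```python
from collections import defaultdict

import numpy as np


def build_doc_freq(reviews):
    """Hypothetical helper: count how many reviews contain each word."""
    doc_freq = defaultdict(int)
    for review in reviews:
        # set() collapses repeats, so a word counts once per review
        for word in set(review.split()):
            doc_freq[word] += 1
    return doc_freq


def add_unknown_words(word_vecs, vocab, min_df=1, k=300):
    """Same logic as the function quoted above."""
    for word in vocab:
        if word not in word_vecs and vocab[word] >= min_df:
            word_vecs[word] = np.random.uniform(-0.25, 0.25, k)


# Toy data (made up): "bad" occurs 3 times in the first review,
# but it is still counted only once for that review.
reviews = [
    "bad bad bad movie",
    "bad plot",
    "great movie",
]
vocab = build_doc_freq(reviews)

word_vecs = {}
# A threshold of 2 keeps only words found in at least two reviews.
add_unknown_words(word_vecs, vocab, min_df=2, k=5)
print(sorted(word_vecs))
```

With these toy reviews, "bad" has a count of 2 (two reviews), not 4 (total occurrences), and only words that reach the threshold receive a random vector.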
@talevy23
Thanks a lot! Your explanation really inspired me and cleared up my confusion. It is a good explanation for filtering out words that appear in fewer than 10 (or any other number of) reviews. From that, can we conclude that the code only cares about how many reviews a word appears in and doesn't care about its frequency within a single review?