wikipedia_word2vec's Introduction

Word2vec 4 Wikipedia

Train a Word2vec model on Wikipedia with Python and Gensim

wikipedia_word2vec's People

Contributors

panyang, ringsaturn

wikipedia_word2vec's Issues

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb7 in position 4: invalid start byte

This is my code for Chinese word2vec with gensim. There are many tutorials on this topic, but almost all of them are the same (word2vec on the wiki corpus). I was wondering whether you have run into a problem like this; I couldn't figure it out:

import jieba
import time
from gensim.models import word2vec

# Segment the TXT file with jieba and write the result to another TXT file
stopwordset = set()
with open('stopwordset.txt', encoding='utf-8') as sw:
    for line in sw:
        stopwordset.add(line.strip('\n'))
        
output = open('result.txt', 'w')

with open('jieba.txt', 'r') as content:
    for line in content:
        words = jieba.cut(line, cut_all=False)
        for word in words:
            if word not in stopwordset:
                output.write(word + ' ')
output.close()

sentences = word2vec.Text8Corpus('result.txt')
model = word2vec.Word2Vec(sentences, size=20)

And the error message is as follows:

UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-4-7956a445b8ae> in <module>()
      1 sentences = word2vec.Text8Corpus('result.txt')
----> 2 model = word2vec.Word2Vec(sentences, size=20)

C:\Users\libin\AppData\Local\Continuum\Anaconda3\lib\site-packages\gensim\models\word2vec.py in __init__(self, sentences, size, alpha, window, min_count, max_vocab_size, sample, seed, workers, min_alpha, sg, hs, negative, cbow_mean, hashfxn, iter, null_word, trim_rule, sorted_vocab, batch_words, compute_loss)
    501             if isinstance(sentences, GeneratorType):
    502                 raise TypeError("You can't pass a generator as the sentences argument. Try an iterator.")
--> 503             self.build_vocab(sentences, trim_rule=trim_rule)
    504             self.train(sentences, total_examples=self.corpus_count, epochs=self.iter,
    505                        start_alpha=self.alpha, end_alpha=self.min_alpha)

C:\Users\libin\AppData\Local\Continuum\Anaconda3\lib\site-packages\gensim\models\word2vec.py in build_vocab(self, sentences, keep_raw_vocab, trim_rule, progress_per, update)
    575 
    576         """
--> 577         self.scan_vocab(sentences, progress_per=progress_per, trim_rule=trim_rule)  # initial survey
    578         self.scale_vocab(keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule, update=update)  # trim by min_count & precalculate downsampling
    579         self.finalize_vocab(update=update)  # build tables & arrays

C:\Users\libin\AppData\Local\Continuum\Anaconda3\lib\site-packages\gensim\models\word2vec.py in scan_vocab(self, sentences, progress_per, trim_rule)
    587         vocab = defaultdict(int)
    588         checked_string_types = 0
--> 589         for sentence_no, sentence in enumerate(sentences):
    590             if not checked_string_types:
    591                 if isinstance(sentence, string_types):

C:\Users\libin\AppData\Local\Continuum\Anaconda3\lib\site-packages\gensim\models\word2vec.py in __iter__(self)
   1501                 last_token = text.rfind(b' ')  # last token may have been split in two... keep for next iteration
   1502                 words, rest = (utils.to_unicode(text[:last_token]).split(),
-> 1503                                text[last_token:].strip()) if last_token >= 0 else ([], text)
   1504                 sentence.extend(words)
   1505                 while len(sentence) >= self.max_sentence_length:

C:\Users\libin\AppData\Local\Continuum\Anaconda3\lib\site-packages\gensim\utils.py in any2unicode(text, encoding, errors)
    238     if isinstance(text, unicode):
    239         return text
--> 240     return unicode(text, encoding, errors=errors)
    241 to_unicode = any2unicode
    242 

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb7 in position 4: invalid start byte
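
There is no accepted fix in the thread, so what follows is a hedged guess at the cause. Byte 0xb7 can never start a UTF-8 sequence, and result.txt is opened for writing without an explicit encoding, so on Windows it is written in the locale codec (typically GBK for Chinese text); Text8Corpus then fails when it tries to decode the file as UTF-8. Opening the files with an explicit encoding should avoid the mismatch:

# hedged fix sketch: force UTF-8 on the segmented output (and the input),
# so Text8Corpus, which decodes UTF-8, can read result.txt back cleanly
output = open('result.txt', 'w', encoding='utf-8')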

TypeError: sequence item 0: expected str instance, bytes found

While processing one Wikipedia text, the text read from the dump is of type bytes, but the source code uses ' '.join() on it; under Python 3 that line throws this error. If you like, you can use the following line instead:
output.write(space.join(map(bytes.decode, text)) + '\n')
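
For context, here is a sketch of the surrounding loop in process_wiki.py with that fix applied, assuming the wiki, output, and space names used by that script:

# hedged sketch: under Python 3 each token yielded by get_texts() is bytes,
# so decode every token before joining them into one line per article
for text in wiki.get_texts():
    output.write(space.join(map(bytes.decode, text)) + '\n')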

ParseError in process_wiki.py

I keep getting:
Traceback (most recent call last):
File "/anaconda2/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/anaconda2/lib/python2.7/site-packages/gensim/utils.py", line 843, in run
wrapped_chunk = [list(chunk)]
File "/anaconda2/lib/python2.7/site-packages/gensim/corpora/wikicorpus.py", line 302, in <genexpr>
texts = ((text, self.lemmatize, title, pageid) for title, text, pageid in extract_pages(bz2.BZ2File(self.fname), self.filter_namespaces))
File "/anaconda2/lib/python2.7/site-packages/gensim/corpora/wikicorpus.py", line 214, in extract_pages
for elem in elems:
File "/anaconda2/lib/python2.7/site-packages/gensim/corpora/wikicorpus.py", line 199, in <genexpr>
elems = (elem for _, elem in iterparse(f, events=("end",)))
File "<string>", line 107, in next
ParseError: no element found: line 45, column 0

How could I solve it? Thanks!
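
A "no element found" ParseError from iterparse usually means the XML stream ended early, i.e. the downloaded .bz2 dump is truncated or corrupt; that diagnosis is an assumption, not an answer from the thread. A minimal check that decompresses the whole archive end to end would surface such corruption directly:

import bz2

# hedged sketch: stream through the dump once; a truncated or corrupt
# download raises an error here, which would explain the ParseError
with bz2.BZ2File('enwiki-latest-pages-articles.xml.bz2') as f:
    for _ in f:
        pass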

MemoryError in train_word2vec_model.py

Hi,
I run:
v1# python train_word2vec_model.py wiki.zh.text.jian.seg.utf-8 wiki.zh.text.model wiki.zh.text.vector
2017-05-12 01:19:45,578: INFO: running train_word2vec_model.py wiki.zh.text.jian.seg.utf-8 wiki.zh.text.model wiki.zh.text.vector
2017-05-12 01:19:45,594: INFO: collecting all words and their counts
2017-05-12 01:19:45,648: INFO: PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-05-12 01:19:50,171: INFO: PROGRESS: at sentence #10000, processed 6464399 words, keeping 725285 word types
2017-05-12 01:19:53,546: INFO: PROGRESS: at sentence #20000, processed 11125064 words, keeping 1120049 word types
2017-05-12 01:19:58,920: INFO: PROGRESS: at sentence #30000, processed 15348776 words, keeping 1423306 word types
2017-05-12 01:20:01,128: INFO: PROGRESS: at sentence #40000, processed 19278980 words, keeping 1693287 word types
2017-05-12 01:20:03,203: INFO: PROGRESS: at sentence #50000, processed 22967412 words, keeping 1928859 word types
2017-05-12 01:20:04,554: INFO: PROGRESS: at sentence #60000, processed 26514303 words, keeping 2139812 word types
2017-05-12 01:20:07,120: INFO: PROGRESS: at sentence #70000, processed 29850501 words, keeping 2337565 word types
2017-05-12 01:20:09,387: INFO: PROGRESS: at sentence #80000, processed 33111262 words, keeping 2527187 word types
2017-05-12 01:20:11,163: INFO: PROGRESS: at sentence #90000, processed 36251605 words, keeping 2695901 word types
Traceback (most recent call last):
File "train_word2vec_model.py", line 27, in
workers=multiprocessing.cpu_count())
File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 478, in init
self.build_vocab(sentences, trim_rule=trim_rule)
File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 553, in build_vocab
self.scan_vocab(sentences, progress_per=progress_per, trim_rule=trim_rule) # initial survey
File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 575, in scan_vocab
vocab[word] += 1
MemoryError
Thank you!
molyswu
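
The MemoryError is raised while gensim counts raw word types (already about 2.7 million at sentence #90000), so the initial vocabulary scan itself exhausts RAM. A hedged mitigation sketch, not taken from the thread: cap the raw vocabulary with max_vocab_size and/or raise min_count. The sentences iterator and all parameter values below are illustrative assumptions, not settings from the repo.

import multiprocessing
from gensim.models.word2vec import LineSentence, Word2Vec

# hedged sketch: max_vocab_size prunes the rarest raw words during the
# initial scan, bounding peak memory; 2000000 is an illustrative cap
sentences = LineSentence('wiki.zh.text.jian.seg.utf-8')
model = Word2Vec(sentences, size=400, min_count=10,
                 max_vocab_size=2000000,
                 workers=multiprocessing.cpu_count())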

UnicodeEncodeError: 'ascii' codec can't encode characters in position 1458-1459: ordinal not in range(128)

I had the error below and managed to fix it in process_wiki.py:
2017-10-23 08:23:14,607: INFO: running process_wiki.py /home/ay_salama/bigdata/wikipedia_download/enwiki-latest-pages-articles.xml.bz2 wiki.en.text
Traceback (most recent call last):
File "process_wiki.py", line 40, in
output.write(space.join(text) + "\n")
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1458-1459: ordinal not in range(128)
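
The poster says they fixed it, but the patch itself is not shown. A plausible fix, stated as an assumption: under Python 2 the output file encodes implicitly with the ASCII codec, so encoding the joined unicode line as UTF-8 before writing avoids the implicit conversion:

# hedged sketch (Python 2): encode explicitly so the file handle never
# falls back to the default ASCII codec on non-ASCII article text
output.write(space.join(text).encode('utf-8') + "\n")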
