wikipedia_word2vec's Introduction

Word2vec 4 Wikipedia

Train a Word2vec model on Wikipedia with Python and Gensim

wikipedia_word2vec's People

Contributors

panyang, ringsaturn

wikipedia_word2vec's Issues

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb7 in position 4: invalid start byte

This is my code for Chinese word2vec with gensim. There are many tutorials on this topic, but almost all of them are the same (word2vec on the wiki corpus). I was wondering whether you have run into a problem like this; I couldn't figure it out:

import jieba
import time
from gensim.models import word2vec

# Segment the TXT file with jieba and write the result to another TXT file
stopwordset = set()
with open('stopwordset.txt', encoding='utf-8') as sw:
    for line in sw:
        stopwordset.add(line.strip('\n'))
        
output = open('result.txt', 'w')

with open('jieba.txt', 'r') as content:
    for line in content:
        words = jieba.cut(line, cut_all=False)
        for word in words:
            if word not in stopwordset:
                output.write(word + ' ')
output.close()

sentences = word2vec.Text8Corpus('result.txt')
model = word2vec.Word2Vec(sentences, size=20)

And the error message is as follows:

UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-4-7956a445b8ae> in <module>()
      1 sentences = word2vec.Text8Corpus('result.txt')
----> 2 model = word2vec.Word2Vec(sentences, size=20)

C:\Users\libin\AppData\Local\Continuum\Anaconda3\lib\site-packages\gensim\models\word2vec.py in __init__(self, sentences, size, alpha, window, min_count, max_vocab_size, sample, seed, workers, min_alpha, sg, hs, negative, cbow_mean, hashfxn, iter, null_word, trim_rule, sorted_vocab, batch_words, compute_loss)
    501             if isinstance(sentences, GeneratorType):
    502                 raise TypeError("You can't pass a generator as the sentences argument. Try an iterator.")
--> 503             self.build_vocab(sentences, trim_rule=trim_rule)
    504             self.train(sentences, total_examples=self.corpus_count, epochs=self.iter,
    505                        start_alpha=self.alpha, end_alpha=self.min_alpha)

C:\Users\libin\AppData\Local\Continuum\Anaconda3\lib\site-packages\gensim\models\word2vec.py in build_vocab(self, sentences, keep_raw_vocab, trim_rule, progress_per, update)
    575 
    576         """
--> 577         self.scan_vocab(sentences, progress_per=progress_per, trim_rule=trim_rule)  # initial survey
    578         self.scale_vocab(keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule, update=update)  # trim by min_count & precalculate downsampling
    579         self.finalize_vocab(update=update)  # build tables & arrays

C:\Users\libin\AppData\Local\Continuum\Anaconda3\lib\site-packages\gensim\models\word2vec.py in scan_vocab(self, sentences, progress_per, trim_rule)
    587         vocab = defaultdict(int)
    588         checked_string_types = 0
--> 589         for sentence_no, sentence in enumerate(sentences):
    590             if not checked_string_types:
    591                 if isinstance(sentence, string_types):

C:\Users\libin\AppData\Local\Continuum\Anaconda3\lib\site-packages\gensim\models\word2vec.py in __iter__(self)
   1501                 last_token = text.rfind(b' ')  # last token may have been split in two... keep for next iteration
   1502                 words, rest = (utils.to_unicode(text[:last_token]).split(),
-> 1503                                text[last_token:].strip()) if last_token >= 0 else ([], text)
   1504                 sentence.extend(words)
   1505                 while len(sentence) >= self.max_sentence_length:

C:\Users\libin\AppData\Local\Continuum\Anaconda3\lib\site-packages\gensim\utils.py in any2unicode(text, encoding, errors)
    238     if isinstance(text, unicode):
    239         return text
--> 240     return unicode(text, encoding, errors=errors)
    241 to_unicode = any2unicode
    242 

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb7 in position 4: invalid start byte
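
There is no accepted fix in the thread, so what follows is a hedged guess at the cause. Byte 0xb7 can never start a UTF-8 sequence, and result.txt is opened for writing without an explicit encoding, so on Windows it is written in the locale codec (typically GBK for Chinese text); Text8Corpus then fails when it tries to decode the file as UTF-8. Opening the files with an explicit encoding should avoid the mismatch:

# hedged fix sketch: force UTF-8 on the segmented output (and the input),
# so Text8Corpus, which decodes UTF-8, can read result.txt back cleanly
output = open('result.txt', 'w', encoding='utf-8')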

TypeError: sequence item 0: expected str instance, bytes found

While processing one Wikipedia text, the text read from the dump is of type bytes, but the source code uses ' '.join() on it; under Python 3 that line throws this error. If you like, you can use the following line instead:
output.write(space.join(map(bytes.decode, text)) + '\n')
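
For context, here is a sketch of the surrounding loop in process_wiki.py with that fix applied, assuming the wiki, output, and space names used by that script:

# hedged sketch: under Python 3 each token yielded by get_texts() is bytes,
# so decode every token before joining them into one line per article
for text in wiki.get_texts():
    output.write(space.join(map(bytes.decode, text)) + '\n')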

ParseError in process_wiki.py

I keep getting:
Traceback (most recent call last):
File "/anaconda2/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/anaconda2/lib/python2.7/site-packages/gensim/utils.py", line 843, in run
wrapped_chunk = [list(chunk)]
File "/anaconda2/lib/python2.7/site-packages/gensim/corpora/wikicorpus.py", line 302, in <genexpr>
texts = ((text, self.lemmatize, title, pageid) for title, text, pageid in extract_pages(bz2.BZ2File(self.fname), self.filter_namespaces))
File "/anaconda2/lib/python2.7/site-packages/gensim/corpora/wikicorpus.py", line 214, in extract_pages
for elem in elems:
File "/anaconda2/lib/python2.7/site-packages/gensim/corpora/wikicorpus.py", line 199, in <genexpr>
elems = (elem for _, elem in iterparse(f, events=("end",)))
File "<string>", line 107, in next
ParseError: no element found: line 45, column 0

How could I solve it? Thanks!
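
A "no element found" ParseError from iterparse usually means the XML stream ended early, i.e. the downloaded .bz2 dump is truncated or corrupt; that diagnosis is an assumption, not an answer from the thread. A minimal check that decompresses the whole archive end to end would surface such corruption directly:

import bz2

# hedged sketch: stream through the dump once; a truncated or corrupt
# download raises an error here, which would explain the ParseError
with bz2.BZ2File('enwiki-latest-pages-articles.xml.bz2') as f:
    for _ in f:
        pass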

MemoryError in train_word2vec_model.py

Hi,
I run:
v1# python train_word2vec_model.py wiki.zh.text.jian.seg.utf-8 wiki.zh.text.model wiki.zh.text.vector
2017-05-12 01:19:45,578: INFO: running train_word2vec_model.py wiki.zh.text.jian.seg.utf-8 wiki.zh.text.model wiki.zh.text.vector
2017-05-12 01:19:45,594: INFO: collecting all words and their counts
2017-05-12 01:19:45,648: INFO: PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-05-12 01:19:50,171: INFO: PROGRESS: at sentence #10000, processed 6464399 words, keeping 725285 word types
2017-05-12 01:19:53,546: INFO: PROGRESS: at sentence #20000, processed 11125064 words, keeping 1120049 word types
2017-05-12 01:19:58,920: INFO: PROGRESS: at sentence #30000, processed 15348776 words, keeping 1423306 word types
2017-05-12 01:20:01,128: INFO: PROGRESS: at sentence #40000, processed 19278980 words, keeping 1693287 word types
2017-05-12 01:20:03,203: INFO: PROGRESS: at sentence #50000, processed 22967412 words, keeping 1928859 word types
2017-05-12 01:20:04,554: INFO: PROGRESS: at sentence #60000, processed 26514303 words, keeping 2139812 word types
2017-05-12 01:20:07,120: INFO: PROGRESS: at sentence #70000, processed 29850501 words, keeping 2337565 word types
2017-05-12 01:20:09,387: INFO: PROGRESS: at sentence #80000, processed 33111262 words, keeping 2527187 word types
2017-05-12 01:20:11,163: INFO: PROGRESS: at sentence #90000, processed 36251605 words, keeping 2695901 word types
Traceback (most recent call last):
File "train_word2vec_model.py", line 27, in
workers=multiprocessing.cpu_count())
File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 478, in init
self.build_vocab(sentences, trim_rule=trim_rule)
File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 553, in build_vocab
self.scan_vocab(sentences, progress_per=progress_per, trim_rule=trim_rule) # initial survey
File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 575, in scan_vocab
vocab[word] += 1
MemoryError
Thank you!
molyswu
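
The MemoryError is raised while gensim counts raw word types (already about 2.7 million at sentence #90000), so the initial vocabulary scan itself exhausts RAM. A hedged mitigation sketch, not taken from the thread: cap the raw vocabulary with max_vocab_size and/or raise min_count. The sentences iterator and all parameter values below are illustrative assumptions, not settings from the repo.

import multiprocessing
from gensim.models.word2vec import LineSentence, Word2Vec

# hedged sketch: max_vocab_size prunes the rarest raw words during the
# initial scan, bounding peak memory; 2000000 is an illustrative cap
sentences = LineSentence('wiki.zh.text.jian.seg.utf-8')
model = Word2Vec(sentences, size=400, min_count=10,
                 max_vocab_size=2000000,
                 workers=multiprocessing.cpu_count())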

UnicodeEncodeError: 'ascii' codec can't encode characters in position 1458-1459: ordinal not in range(128)

I had the error below and managed to fix it in process_wiki.py:
2017-10-23 08:23:14,607: INFO: running process_wiki.py /home/ay_salama/bigdata/wikipedia_download/enwiki-latest-pages-articles.xml.bz2 wiki.en.text
Traceback (most recent call last):
File "process_wiki.py", line 40, in
output.write(space.join(text) + "\n")
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1458-1459: ordinal not in range(128)
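
The poster says they fixed it, but the patch itself is not shown. A plausible fix, stated as an assumption: under Python 2 the output file encodes implicitly with the ASCII codec, so encoding the joined unicode line as UTF-8 before writing avoids the implicit conversion:

# hedged sketch (Python 2): encode explicitly so the file handle never
# falls back to the default ASCII codec on non-ASCII article text
output.write(space.join(text).encode('utf-8') + "\n")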
