Coder Social home page Coder Social logo

ja.text8's Introduction

ja.text8

ja.text8 is a small (100MB) text corpus from the web (japanese wikipedia).

You can download ja.text8 corpus from the following link:

Usage

You can train word2vec by ja.text8. After downloading ja.text8, run the following code. It takes about 2 minutes to finish training:

import logging
from gensim.models import word2vec

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
 
sentences = word2vec.Text8Corpus('ja.text8')
model = word2vec.Word2Vec(sentences, size=200)

After the training, you can test the model as follows:

>>> model.most_similar(['日本'])
[('**', 0.598496675491333),
 ('韓国', 0.5914819240570068),
 ('アメリカ', 0.5286925435066223),
 ('英国', 0.5090063810348511),
 ('**', 0.4761126637458801),
 ('米国', 0.45954638719558716),
 ('アメリカ合衆国', 0.45181626081466675),
 ('イギリス', 0.44740626215934753),
 ('ソ連', 0.43657147884368896),
 ('海外', 0.4325913190841675)]

Great!

Requirements

  • Python 3.x
  • MeCab
  • virtualenv

Make corpus by yourself

You can download ja.text8. But you can make the corpus by yourself.

Simply run:

$ ./setup.sh

License

CC-BY-SA

ja.text8's People

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.