jumanpp-jumandic's Introduction

Description

This repository contains a set of scripts to build a ready-to-use Juman++ model for Jumandic.

Prerequisites

  • Unix environment (on Windows use WSL or MSYS2/MinGW64)
  • Juman++ build environment
  • Python 3.6+
  • Ruby
  • Perl
  • SSH authentication configured for GitHub (several repositories are cloned via SSH)
  • 32 GB of RAM

Recommended

  • Original texts from Mainichi Shinbun (year 1995) for the Kyoto Corpus (see the page for more information). Otherwise, the Juman++ model will be trained only on the Leads corpus and will have poor quality.

How to Use

Run the configuration script: python3 configure.py. It will prompt for the location of Mainichi Shinbun texts.

After that, run make nornn to train a model without the RNN component, or make rnn to produce a model with the RNN component. The models will be inside the bld/model folder.
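The two commands above can be wrapped in a small shell function. This is only a sketch: build_model is a hypothetical helper name and is not part of this repository, and it assumes the current directory is the repository root.

```shell
# Hypothetical helper (not part of the repository): configure, then build.
# Assumes it is called from the root of the jumanpp-jumandic checkout.
build_model() {
  variant="${1:-nornn}"   # "nornn" or "rnn", the two make targets named above

  case "$variant" in
    nornn|rnn) ;;
    *) echo "unknown variant: $variant (expected nornn or rnn)" >&2; return 2 ;;
  esac

  # configure.py prompts for the location of the Mainichi Shinbun texts.
  python3 configure.py || return 1

  # Train the selected model variant.
  make "$variant"
}
```

Calling build_model with no argument builds the model without the RNN component; build_model rnn builds the one with it.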

Adding your words to the model

You can add your own words to the model. To do so:

  1. Perform the configuration as described above: python3 configure.py
  2. Fetch the repositories: make repo
  3. Go into the bld/repos/jumandic folder. It is a local clone of the JumanDIC repository.
  4. Create a new file with the .dic extension in the userdic folder inside bld/repos/jumandic.
  5. Put your words into that file in JUMAN dictionary format (refer to the other files for examples).
  6. Run make clean-dic if you have already built a Juman++ model.
  7. Build your model as shown above.

If the built model does not contain your words, ensure that the binary dictionary was rebuilt after adding new words.
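The steps above can be sketched as a single shell function. Everything here beyond the commands and paths named in the steps is an assumption: add_user_words is a hypothetical helper name, the word list is assumed to be already written in JUMAN dictionary format in a file of your own, and make clean-dic is run unconditionally on the assumption that it is harmless when no model has been built yet.

```shell
# Hypothetical helper (not part of the repository): install a user
# dictionary and rebuild. Assumes it is called from the repository root.
add_user_words() {
  src="$1"                 # your own .dic file, already in JUMAN dictionary format
  variant="${2:-nornn}"    # "nornn" or "rnn"

  case "$src" in
    *.dic) ;;
    *) echo "expected a file with the .dic extension: $src" >&2; return 2 ;;
  esac
  [ -f "$src" ] || { echo "no such file: $src" >&2; return 2; }

  python3 configure.py || return 1                      # step 1
  make repo || return 1                                 # step 2
  cp "$src" bld/repos/jumandic/userdic/ || return 1     # steps 3-5
  # Step 6: the README requires this only if a model was already built;
  # running it every time is an assumption made to keep the sketch simple.
  make clean-dic || return 1
  make "$variant"                                       # step 7
}
```

For example, add_user_words mywords.dic rnn would copy a (hypothetical) mywords.dic into the userdic folder and rebuild the model with the RNN component.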

jumanpp-jumandic's Issues

How to use a trained model

To use a model trained with these scripts in jumanpp, should it be specified like this?
jumanpp --model=jumanpp-jumandic/bld/models/jumandic-nornn.model

Error when the 1995 Mainichi Shinbun original texts are configured

When I specify the directory containing the 1995 Mainichi Shinbun texts and run make rnn, I get the following error:
Can't open jumanpp-jumandic/bld/repos/kyoto-corpus/id/full.id jumanpp-jumandic/bld/repos/kyoto-corpus/dat/num/950101.knp': cannot stat: No such file or directory
There was no file named full.id in that directory.

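Dictionary priority during training

When training a model with an added user dictionary, is there a priority setting or something similar?
When I run jumanpp with the trained model, the words from the added user dictionary are not picked up, so please let me know if there is anything that can be configured. Thank you.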