jeongukjae / nori-clone

Standalone Nori (Korean Morphological Analyzer)

License: Apache License 2.0

Starlark 16.09% C++ 63.41% Shell 3.87% Python 4.24% Go 3.96% C 1.21% Java 7.22%
korean korean-nlp morphological-analysis pos-tagging

nori-clone's Introduction

nori-clone

Standalone Nori (Korean Morphological Analyzer in Apache Lucene) written in C++.

blog post (written in Korean)

Introduction

Elasticsearch provides nori, a high-quality, high-performance Korean morphological analyzer. However, nori's code is tightly coupled to the Lucene codebase, and it is written in Java, the main language of the Lucene project. This makes it hard to use nori standalone from Python or Go with the same performance. I therefore re-implemented almost the same algorithms as Lucene's nori in C++ for portability and usability.

Usage

This project is written in C++ and also provides Python and Go bindings.

Pre-built dictionaries

The dictionary/ directory holds the pre-built dictionary files that are used for distribution and test cases. For now, there are two pre-built dictionaries, legacy and latest.

  • The legacy dictionary does not normalize inputs and is built with mecab-ko-dic-2.0.3-20170922, the same version the original nori uses.
  • The latest dictionary normalizes inputs with the NFKC form and is built with mecab-ko-dic-2.1.1-20180720.
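
To illustrate what NFKC normalization does to input text, here is a minimal sketch using Python's standard unicodedata module. It is not nori-clone's own normalization code; the helper name normalize_input is made up for this example.

```python
import unicodedata


def normalize_input(text: str) -> str:
    """Apply Unicode NFKC normalization (hypothetical helper, not a nori-clone API)."""
    return unicodedata.normalize("NFKC", text)


# Full-width Latin letters, digits, and the ideographic space are folded
# to their ASCII counterparts.
print(normalize_input("ＮＯＲＩ　２０１８"))  # NORI 2018

# Text that is already in NFKC form is returned unchanged.
print(normalize_input("nori") == "nori")  # True
```

With the legacy dictionary no such step is applied, so full-width and compatibility characters reach the tokenizer as written.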

Performance

[Figure: elapsed time]

For more details, check out tools/benchmark.

Differences from the original nori

Check out tools/comparison.

For the contributors

Check out CONTRIBUTING.md

nori-clone's People

Contributors

jeongukjae

nori-clone's Issues

Unifying dictionary format to single file

In the current implementation, we have to ship several files to use nori-clone, and most of them are in the protobuf format. It would be better to ship just a single file.

update python interface

The current python interface is not convenient to use.

import nori

dictionary = nori.Dictionary()
dictionary.load_prebuilt_dictionary("./dictionary/latest-dictionary.nori")
dictionary.load_user_dictionary("./dictionary/latest-userdict.txt")
tokenizer = nori.NoriTokenizer(dictionary)

result = tokenizer.tokenize("이 프로젝트는 nori를 재작성하는 프로젝트입니다.")

for token in result.tokens:
    print(token.surface)

I think the API should be changed as follows.

import nori

tokenizer = nori.NoriTokenizer()
tokenizer.load_prebuilt_dictionary("./dictionary/latest-dictionary.nori")
tokenizer.load_user_dictionary("./dictionary/latest-userdict.txt")

result = tokenizer.tokenize("이 프로젝트는 nori를 재작성하는 프로젝트입니다.")

for token in result.tokens:
    print(token.surface)
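
Until the interface changes, the proposed call pattern could be approximated with a thin wrapper over the current API. The sketch below is only an illustration: the class name TokenizerFacade is hypothetical, and it assumes that nori.Dictionary and nori.NoriTokenizer behave as in the first snippet. The module is passed in as an argument so the wrapper itself does not depend on how nori is installed.

```python
class TokenizerFacade:
    """Hypothetical wrapper exposing the proposed single-object interface."""

    def __init__(self, backend):
        # `backend` is expected to be the imported `nori` module.
        self._backend = backend
        self._dictionary = backend.Dictionary()
        self._tokenizer = None  # built lazily, after dictionaries are loaded

    def load_prebuilt_dictionary(self, path):
        self._dictionary.load_prebuilt_dictionary(path)
        self._tokenizer = None  # force a rebuild with the new dictionary

    def load_user_dictionary(self, path):
        self._dictionary.load_user_dictionary(path)
        self._tokenizer = None

    def tokenize(self, text):
        # Mirror the current usage order: load dictionaries first,
        # then construct the tokenizer from the dictionary object.
        if self._tokenizer is None:
            self._tokenizer = self._backend.NoriTokenizer(self._dictionary)
        return self._tokenizer.tokenize(text)


# Intended usage (requires the nori Python binding):
#   import nori
#   tokenizer = TokenizerFacade(nori)
#   tokenizer.load_prebuilt_dictionary("./dictionary/latest-dictionary.nori")
#   result = tokenizer.tokenize("...")
```

Building the tokenizer lazily keeps the same order of operations as the current interface, where the dictionary is fully loaded before NoriTokenizer is constructed.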

unknown char bugs

input: 도로ㆍ지반ㆍ수자원ㆍ건설환경ㆍ건축ㆍ화재설비연구
expected:

 도로, MORPHEME, NNG, NNG
 ᆞ, MORPHEME, UNKNOWN, UNKNOWN
 지반, MORPHEME, NNG, NNG
 ᆞ, MORPHEME, UNKNOWN, UNKNOWN
 수자원, COMPOUND, NNG, NNG
 ᆞ, MORPHEME, UNKNOWN, UNKNOWN
 건설, MORPHEME, NNG, NNG
 환경, MORPHEME, NNG, NNG
 ᆞ, MORPHEME, UNKNOWN, UNKNOWN
 건축, MORPHEME, NNG, NNG
 ᆞ, MORPHEME, UNKNOWN, UNKNOWN
 화재, MORPHEME, NNG, NNG
 설비, MORPHEME, NNG, NNG
 연구, MORPHEME, NNG, NNG

current:

 도로, MORPHEME, NNG, NNG
 ᆞ지반ᆞ수자원ᆞ건설환경ᆞ건축ᆞ화재설비연구, MORPHEME, UNKNOWN, UNKNOWN
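
Note that the separator in the input (ㆍ) and the one in the token listings (ᆞ) are different code points. This is consistent with NFKC normalization being applied before tokenization, as the latest dictionary does; whether this report was produced with that dictionary is an assumption. A quick check with Python's standard library, independent of nori-clone:

```python
import unicodedata

text = "도로ㆍ지반"
normalized = unicodedata.normalize("NFKC", text)

# Report every character that NFKC normalization changed.
for before, after in zip(text, normalized):
    if before != after:
        print(f"U+{ord(before):04X} -> U+{ord(after):04X}")
# prints: U+318D -> U+119E
```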

More proper POS types for terms in the user dictionary

It would be better if we could assign more appropriate POS types to terms in the user dictionary. Currently, all user dictionary terms are treated as NNG.

The right and left IDs required for the user dictionary can be found in the right-id.def and left-id.def files.

3533 NNG,*,*,*,*,*,*,*
3534 NNG,*,F,*,*,*,*,*
3535 NNG,*,T,*,*,*,*,*
3536 NNG,지명,F,*,*,*,*,*
3537 NNG,지명,T,*,*,*,*,*
3538 NNP,*,F,*,*,*,*,*
3539 NNP,*,T,*,*,*,*,*
3541 NNP,인명,*,*,*,*,*,*
3542 NNP,인명,F,*,*,*,*,*
3543 NNP,인명,T,*,*,*,*,*
3544 NNP,지명,*,*,*,*,*,*
3545 NNP,지명,F,*,*,*,*,*
3546 NNP,지명,T,*,*,*,*,*
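
As a rough sketch of how such a lookup could work, the snippet below parses lines in the format shown above into a table keyed by the first three feature columns. The function name parse_id_def is hypothetical, and reading the second and third columns as semantic class and final-consonant flag is an assumption based on the mecab-ko-dic feature layout. It is not code from nori-clone.

```python
# A few lines copied from the listing above; the real .def files contain many more.
SAMPLE_ID_DEF = """\
3533 NNG,*,*,*,*,*,*,*
3534 NNG,*,F,*,*,*,*,*
3535 NNG,*,T,*,*,*,*,*
3538 NNP,*,F,*,*,*,*,*
3539 NNP,*,T,*,*,*,*,*
3542 NNP,인명,F,*,*,*,*,*
3543 NNP,인명,T,*,*,*,*,*
"""


def parse_id_def(text):
    """Map (POS tag, semantic class, final-consonant flag) to a context ID."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        raw_id, features = line.split(maxsplit=1)
        pos, semantic_class, has_final = features.split(",")[:3]
        table[(pos, semantic_class, has_final)] = int(raw_id)
    return table


ids = parse_id_def(SAMPLE_ID_DEF)
# A person name (인명) ending in a final consonant, tagged NNP.
print(ids[("NNP", "인명", "T")])  # 3543
```

A user dictionary entry that carries a POS tag could then be resolved to its left and right IDs through such a table instead of always falling back to NNG.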
