Myan-word-breaker

A word segmentation tool for Myanmar sentences.

Demo

Try this library at flask-py-word-breaker

Introduction

This is the word segmentation tool for Myanmar text. Word segmentation is the process of determining word boundaries in a piece of text.

Word segmentation is the most basic, yet very important, step in Natural Language Processing of Myanmar text. It is also a non-trivial task, since Myanmar text is a string of characters without explicit word boundary delimiters.
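As a small illustration of the missing delimiters, the snippet below splits an arbitrary English phrase and the Myanmar example string from the Usage section on whitespace. The English phrase is only a placeholder chosen for this sketch, not a translation.

```python
# Whitespace tokenization works for space-delimited text but not for Myanmar text.
english = "word segmentation tool"   # placeholder English phrase (assumption)
myanmar = "သဘာဝဟာသဘာဝပါ"              # example string from the Usage section

english_tokens = english.split()
myanmar_tokens = myanmar.split()

print(len(english_tokens))  # one token per space-separated word
print(len(myanmar_tokens))  # the whole sentence comes back as a single token
```

Because the whole sentence comes back as one token, word boundaries have to be inferred from the characters themselves.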

This library uses the method described in Ko Tun Thura Thet's research paper "Word segmentation for the Myanmar Language" (2008).

The library supports two ways of finding the possible word combinations:

  1. All possible combinations loop
  2. Sub-word possibility recursive method.

The first method may give slightly more precise results, while the latter is much faster at runtime.
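The trade-off between the two strategies can be sketched as follows. This is not the library's implementation. It is a minimal illustration that assumes the input has already been split into syllables and that the recursive method prunes candidates with a word dictionary. The function names, the syllable list, and the toy dictionary are all hypothetical.

```python
from itertools import combinations


def all_segmentations(syllables):
    """Yield every way to group consecutive syllables into words.

    For n syllables there are 2 ** (n - 1) groupings, so this grows quickly.
    """
    n = len(syllables)
    for k in range(n):
        # Choose k cut positions among the n - 1 gaps between syllables.
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield ["".join(syllables[a:b]) for a, b in zip(bounds, bounds[1:])]


def dictionary_segmentations(syllables, dictionary, start=0):
    """Recursively yield only the groupings whose words are all in `dictionary`."""
    if start == len(syllables):
        yield []
        return
    for end in range(start + 1, len(syllables) + 1):
        word = "".join(syllables[start:end])
        if word in dictionary:  # abandon this branch early if the prefix is unknown
            for rest in dictionary_segmentations(syllables, dictionary, end):
                yield [word] + rest


# Hypothetical syllable split of the example string from the Usage section.
syllables = ["သ", "ဘာ", "ဝ", "ဟာ", "သ", "ဘာ", "ဝ", "ပါ"]
toy_dictionary = {"သဘာဝ", "ဟာ", "ပါ"}  # toy word list, not the library's data

print(sum(1 for _ in all_segmentations(syllables)))
print(list(dictionary_segmentations(syllables, toy_dictionary)))
```

The exhaustive loop never misses a candidate but its cost doubles with each extra syllable, whereas the recursive variant only explores prefixes that are known words.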

To evaluate these candidate combinations, the library uses the bigram collocation strength statistical approach described in the research paper.
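The idea of ranking candidates by how strongly adjacent words co-occur can be sketched as below. The paper's exact collocation strength formula is not reproduced here. Pointwise mutual information (PMI) is used as a stand-in measure, and the function names, the tiny corpus, and the penalty for unseen pairs are assumptions made for illustration only.

```python
import math
from collections import Counter


def train_counts(corpus):
    """Count single words and adjacent word pairs in a pre-segmented corpus."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))
    return unigrams, bigrams


def pmi(w1, w2, unigrams, bigrams, unseen=-10.0):
    """PMI of an adjacent pair; `unseen` is an arbitrary penalty for unknown pairs."""
    pair_count = bigrams[(w1, w2)]
    if pair_count == 0:
        return unseen
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    p_pair = pair_count / n_bi
    return math.log2(p_pair / ((unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)))


def score(candidate, unigrams, bigrams):
    """Sum the pair scores over every adjacent word pair in one candidate."""
    return sum(pmi(a, b, unigrams, bigrams) for a, b in zip(candidate, candidate[1:]))


def best_segmentation(candidates, unigrams, bigrams):
    """Return the candidate segmentation with the highest total score."""
    return max(candidates, key=lambda c: score(c, unigrams, bigrams))


# Toy pre-segmented corpus (hypothetical data, not the library's training set).
corpus = [["သဘာဝ", "ဟာ"], ["သဘာဝ", "ပါ"]]
unigrams, bigrams = train_counts(corpus)
candidates = [["သ", "ဘာဝ", "ဟာ"], ["သဘာဝ", "ဟာ"]]
print(best_segmentation(candidates, unigrams, bigrams))
```

Plain PMI is known to favour rare pairs, so a real scorer would likely need smoothing or a frequency threshold.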

Usage

# coding=utf8
from word_breaker.word_segment_v5 import WordSegment

wordSegmenter = WordSegment()
# Segment using sub_word_possibility segmentation method on Unicode String
print(wordSegmenter.normalize_break('သဘာဝဟာသဘာဝပါ', 'unicode', wordSegmenter.SegmentationMethod.sub_word_possibility))

# Segment using sub_word_possibility segmentation method on Zawgyi String
print(wordSegmenter.normalize_break('သဘာဝဟာသဘာဝပါ', 'zawgyi', wordSegmenter.SegmentationMethod.sub_word_possibility))

# Segment using all_possible_combination segmentation method on Unicode String
print(wordSegmenter.normalize_break('သဘာဝဟာသဘာဝပါ', 'unicode', wordSegmenter.SegmentationMethod.all_possible_combination))

# Segment using all_possible_combination segmentation method on Zawgyi String
print(wordSegmenter.normalize_break('သဘာဝဟာသဘာဝပါ', 'zawgyi', wordSegmenter.SegmentationMethod.all_possible_combination))

Todo

  • Create a pip package for proper distribution
  • Score and evaluate the library's precision
  • Provide dictionary files separated at the syllable level
  • Train the bigram collocation data as a model

Credit

License

MIT

Contributors

stevenay
