
Comments (3)

polm commented on June 28, 2024

Hey, this isn't a bug, just an example of MeCab's limitations. Depending on the settings for unknown words, some sequences of words can be merged into a single unknown word (UNK, 未知語) instead of being split. This is particularly easy to trigger when the word sequences you're using aren't similar to those in the dictionary's training data (mostly newspaper-article type text). That's because MeCab considers not only whether words exist in the dictionary, but also the transitions between parts of speech when calculating the best place to split words.
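To make the cost trade-off concrete, here is a minimal sketch of lattice-based segmentation with word costs and part-of-speech transition costs. It is not MeCab's implementation: the lexicon, the `CONNECTION` table, the unknown-word cost formula, and every number in it are hypothetical values chosen for illustration. It only shows how a single long unknown token can end up cheaper than several short ones.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy lexicon: surface -> (part of speech, word cost).
LEXICON: Dict[str, Tuple[str, int]] = {
    "すもも": ("NOUN", 10),
    "もも": ("NOUN", 10),
    "うち": ("NOUN", 10),
    "も": ("PARTICLE", 5),
    "の": ("PARTICLE", 5),
}

# Hypothetical transition costs between parts of speech.
CONNECTION: Dict[Tuple[str, str], int] = {
    ("NOUN", "PARTICLE"): 0,
    ("PARTICLE", "NOUN"): 0,
    ("NOUN", "NOUN"): 20,
    ("PARTICLE", "PARTICLE"): 20,
}
DEFAULT_CONNECTION = 10  # used for any pair not listed above

# Made-up unknown-word cost: a fixed base plus a per-character charge.
UNK_BASE_COST = 50
UNK_COST_PER_CHAR = 30


def segment(text: str) -> List[Tuple[str, str]]:
    """Return the minimum-cost segmentation as (surface, pos) pairs."""
    n = len(text)
    # best[i][pos] = (cost, backpointer) of the cheapest path that ends
    # at offset i with a token whose part of speech is pos.
    best: List[Dict[str, Tuple[int, Optional[Tuple[int, str]]]]] = [
        {} for _ in range(n + 1)
    ]
    best[0]["BOS"] = (0, None)

    for start in range(n):
        for end in range(start + 1, n + 1):
            surface = text[start:end]
            if surface in LEXICON:
                pos, word_cost = LEXICON[surface]
            else:
                # Every out-of-lexicon substring is an unknown candidate.
                pos = "UNK"
                word_cost = UNK_BASE_COST + UNK_COST_PER_CHAR * len(surface)
            for prev_pos, (prev_cost, _) in best[start].items():
                conn = CONNECTION.get((prev_pos, pos), DEFAULT_CONNECTION)
                total = prev_cost + conn + word_cost
                if pos not in best[end] or total < best[end][pos][0]:
                    best[end][pos] = (total, (start, prev_pos))

    # Backtrack from the cheapest state at the end of the text.
    pos = min(best[n], key=lambda p: best[n][p][0])
    end = n
    tokens: List[Tuple[str, str]] = []
    while end > 0:
        _, (start, prev_pos) = best[end][pos]
        tokens.append((text[start:end], pos))
        end, pos = start, prev_pos
    return tokens[::-1]


print(segment("すもももももも"))
# [('すもも', 'NOUN'), ('も', 'PARTICLE'), ('もも', 'NOUN'), ('も', 'PARTICLE')]
print(segment("すももパイ"))
# [('すもも', 'NOUN'), ('パイ', 'UNK')]
```

In the second call, "パイ" is not in the toy lexicon. Because each unknown token pays the base cost once, one two-character unknown token is cheaper than two one-character ones, so the characters are merged.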

In particular, MeCab treats sutegana (small kana) the same as ordinary kana. You can see this in the character class definitions in the char.def file distributed with ipadic, which is the dictionary you appear to be using.
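As a rough illustration of that point, the sketch below classifies characters by code point range. The ranges are an assumption modeled on the HIRAGANA and KATAKANA entries in ipadic's char.def (check your own copy of the file), and `char_class` and `SUTEGANA` are hypothetical names, not part of MeCab. Since the small kana sit inside the same ranges as the full-size kana, a range-based class definition cannot tell them apart.

```python
# Assumed code point ranges, modeled on the HIRAGANA/KATAKANA entries
# in ipadic's char.def. Verify against your own dictionary files.
CHAR_CLASSES = [
    (0x3041, 0x309F, "HIRAGANA"),
    (0x30A1, 0x30FF, "KATAKANA"),
]

# Small hiragana (sutegana), listed here only for the demonstration.
SUTEGANA = "ぁぃぅぇぉっゃゅょゎ"


def char_class(ch: str) -> str:
    """Return the name of the first range containing ch, else DEFAULT."""
    cp = ord(ch)
    for low, high, name in CHAR_CLASSES:
        if low <= cp <= high:
            return name
    return "DEFAULT"


for ch in SUTEGANA:
    print(ch, hex(ord(ch)), char_class(ch))
```

Every character in `SUTEGANA` lands in the same class as full-size hiragana such as つ, so under these ranges っ and つ are indistinguishable at the character-class level.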

You can read more about unk processing here.


rolzy commented on June 28, 2024

I have found another example, this time where the sutegana (捨て仮名) is not at the front but is still included:

ありがとうございますなんかはそうですね
ありがとう      感動詞,*,*,*,*,*,ありがとう,アリガトウ,アリガトー
ござい  助動詞,*,*,*,五段・ラ行特殊,連用形,ござる,ゴザイ,ゴザイ
ます    助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
なんか  助詞,副助詞,*,*,*,*,なんか,ナンカ,ナンカ
は      助詞,係助詞,*,*,*,*,は,ハ,ワ
そうですね      フィラー,*,*,*,*,*,そうですね,ソウデスネ,ソーデスネ
EOS
ありがとうございますなんかはそうですねちょっと
ありがとう      感動詞,*,*,*,*,*,ありがとう,アリガトウ,アリガトー
ござい  助動詞,*,*,*,五段・ラ行特殊,連用形,ござる,ゴザイ,ゴザイ
ます    助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
なんか  助詞,副助詞,*,*,*,*,なんか,ナンカ,ナンカ
は      助詞,係助詞,*,*,*,*,は,ハ,ワ
そうですね      フィラー,*,*,*,*,*,そうですね,ソウデスネ,ソーデスネ
ちょっと        副詞,助詞類接続,*,*,*,*,ちょっと,チョット,チョット
EOS
ありがとうございますなんかはそうですねちょっとやっぱ
ありがとう      感動詞,*,*,*,*,*,ありがとう,アリガトウ,アリガトー
ご      接頭詞,名詞接続,*,*,*,*,ご,ゴ,ゴ
ざいますなんかはそうですねちょっとやっぱ        名詞,一般,*,*,*,*,*
EOS
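One way to spot this kind of result programmatically is to look at the number of feature fields. In the output above, dictionary words carry nine comma-separated fields, while the unknown token carries only seven (no reading or pronunciation). The sketch below relies on that observation. It is a heuristic that assumes ipadic's default output format, and `parse_mecab_output` and `likely_unknown` are hypothetical helper names, not part of MeCab.

```python
from typing import List, Tuple


def parse_mecab_output(output: str) -> List[Tuple[str, List[str]]]:
    """Split default-format MeCab output into (surface, features) pairs."""
    tokens = []
    for line in output.splitlines():
        line = line.strip()
        if not line or line == "EOS":
            continue
        # Surface and features are separated by whitespace (a tab in
        # raw MeCab output, spaces in text copied from a web page).
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        surface, feature = parts
        # Naive split: does not handle quoted commas inside a field.
        tokens.append((surface, feature.split(",")))
    return tokens


def likely_unknown(features: List[str], expected_fields: int = 9) -> bool:
    """Heuristic: fewer fields than a dictionary entry suggests UNK."""
    return len(features) < expected_fields


sample = (
    "ありがとう\t感動詞,*,*,*,*,*,ありがとう,アリガトウ,アリガトー\n"
    "ご\t接頭詞,名詞接続,*,*,*,*,ご,ゴ,ゴ\n"
    "ざいますなんかはそうですねちょっとやっぱ\t名詞,一般,*,*,*,*,*\n"
    "EOS\n"
)

for surface, features in parse_mecab_output(sample):
    if likely_unknown(features):
        print("possible unknown word:", surface)
```

Other dictionaries use different field counts, so `expected_fields` would need adjusting for anything other than ipadic.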


rolzy avatar rolzy commented on June 28, 2024

Thank you very much for your reply!
