Comments (3)
Hey, this is not a bug, just an example of MeCab's limitations. Depending on the unknown-word settings, a sequence of words can be merged into a single unknown word (UNK, 未知語) instead of being split. This is particularly easy to trigger when the word sequences you use differ from those in the dictionary's training data (mostly newspaper-style text), because MeCab considers not only whether a word exists in the dictionary but also the transitions between parts of speech when choosing where to split.
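To make the point about transitions concrete, here is a toy sketch of a minimum-cost lattice search. It is not MeCab's code: the function name `segment`, the tags `A`, `B`, and `UNK`, and every cost are invented for illustration, and it simply allows an unknown-word candidate over any span, which is cruder than what MeCab really does.

```python
def segment(text, lexicon, conn, unk_cost, default_conn=5):
    """Return (total_cost, [(surface, tag), ...]) for the cheapest path.

    lexicon maps surface -> [(tag, word_cost)], conn maps
    (left_tag, right_tag) -> connection cost. All values are hypothetical.
    """
    n = len(text)
    # best[i][tag] = (cost, backpointer) for the cheapest path that ends
    # at offset i with a node carrying that tag.
    best = [dict() for _ in range(n + 1)]
    best[0]["BOS"] = (0, None)
    for start in range(n):
        for end in range(start + 1, n + 1):
            surface = text[start:end]
            # Dictionary entries plus one unknown-word candidate per span.
            candidates = list(lexicon.get(surface, [])) + [("UNK", unk_cost)]
            for tag, word_cost in candidates:
                for prev_tag, (prev_cost, _) in best[start].items():
                    step = conn.get((prev_tag, tag), default_conn)
                    cost = prev_cost + step + word_cost
                    if tag not in best[end] or cost < best[end][tag][0]:
                        best[end][tag] = (cost, (start, prev_tag, surface))
    # Pick the cheapest final node and follow the backpointers.
    tag = min(best[n], key=lambda t: best[n][t][0])
    total = best[n][tag][0]
    path, i = [], n
    while i > 0:
        start, prev_tag, surface = best[i][tag][1]
        path.append((surface, tag))
        i, tag = start, prev_tag
    return total, path[::-1]


lexicon = {"ござい": [("A", 2)], "ます": [("B", 2)]}
cheap = {("BOS", "A"): 0, ("BOS", "UNK"): 0, ("A", "B"): 1}
costly = {("BOS", "A"): 0, ("BOS", "UNK"): 0, ("A", "B"): 6}

print(segment("ございます", lexicon, cheap, unk_cost=8))
print(segment("ございます", lexicon, costly, unk_cost=8))
```

With the cheap A-to-B transition, the two dictionary words win. With the costly one, a single UNK over the whole string becomes the cheapest path even though both words are still in the lexicon, which is roughly the kind of effect described above.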
In particular, MeCab treats sutegana (small kana) the same as ordinary kana. You can see this in the character class definitions in the char.def file distributed with ipadic, which is the dictionary you appear to be using.
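As a rough illustration of what that means, the sketch below classifies characters by code point range and groups consecutive characters of the same class. The ranges are a simplified assumption modeled on ipadic's char.def (the real file lists more ranges and categories, so check your own copy), and `char_class` and `group_by_class` are hypothetical helper names, not part of MeCab.

```python
from itertools import groupby

# Simplified ranges, assumed to resemble ipadic's char.def.
# The hiragana block is one contiguous range, so small kana such as
# っ (U+3063) and ゃ (U+3083) get the same class as full-size kana.
RANGES = [
    (0x3041, 0x309F, "HIRAGANA"),
    (0x30A1, 0x30FF, "KATAKANA"),
    (0x4E00, 0x9FFF, "KANJI"),
]


def char_class(ch):
    """Return the class name for one character, or DEFAULT if unmatched."""
    cp = ord(ch)
    for lo, hi, name in RANGES:
        if lo <= cp <= hi:
            return name
    return "DEFAULT"


def group_by_class(text):
    """Group consecutive characters that share a character class."""
    return [(name, "".join(chars)) for name, chars in groupby(text, key=char_class)]


print(char_class("っ"))
print(group_by_class("ちょっとやっぱ"))
print(group_by_class("漢字ちょっと"))
```

Because the small kana fall in the same class as the kana around them, nothing at the character-class level forces a boundary inside a run like ちょっとやっぱ, so a grouped unknown-word candidate can span the whole run.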
You can read more about unknown-word (UNK) processing here.
from mecab.
I have found another example, this time one where the sutegana (捨て仮名) is not at the front of the input but is still included:
ありがとうございますなんかはそうですね
ありがとう 感動詞,*,*,*,*,*,ありがとう,アリガトウ,アリガトー
ござい 助動詞,*,*,*,五段・ラ行特殊,連用形,ござる,ゴザイ,ゴザイ
ます 助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
なんか 助詞,副助詞,*,*,*,*,なんか,ナンカ,ナンカ
は 助詞,係助詞,*,*,*,*,は,ハ,ワ
そうですね フィラー,*,*,*,*,*,そうですね,ソウデスネ,ソーデスネ
EOS
ありがとうございますなんかはそうですねちょっと
ありがとう 感動詞,*,*,*,*,*,ありがとう,アリガトウ,アリガトー
ござい 助動詞,*,*,*,五段・ラ行特殊,連用形,ござる,ゴザイ,ゴザイ
ます 助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
なんか 助詞,副助詞,*,*,*,*,なんか,ナンカ,ナンカ
は 助詞,係助詞,*,*,*,*,は,ハ,ワ
そうですね フィラー,*,*,*,*,*,そうですね,ソウデスネ,ソーデスネ
ちょっと 副詞,助詞類接続,*,*,*,*,ちょっと,チョット,チョット
EOS
ありがとうございますなんかはそうですねちょっとやっぱ
ありがとう 感動詞,*,*,*,*,*,ありがとう,アリガトウ,アリガトー
ご 接頭詞,名詞接続,*,*,*,*,ご,ゴ,ゴ
ざいますなんかはそうですねちょっとやっぱ 名詞,一般,*,*,*,*,*
EOS
Thank you very much for your reply!