koichiyasuoka / unidic2ud

Tokenizer POS-tagger Lemmatizer and Dependency-parser for modern and contemporary Japanese

License: MIT License

unidic2ud's Introduction

UniDic2UD

Tokenizer, POS-tagger, lemmatizer, and dependency-parser for modern and contemporary Japanese, working on Universal Dependencies.

Basic usage

>>> import unidic2ud
>>> nlp=unidic2ud.load("kindai")
>>> s=nlp("其國を治めんと欲する者は先づ其家を齊ふ")
>>> print(s)
# text = 其國を治めんと欲する者は先づ其家を齊ふ
1	其	其の	DET	連体詞	_	2	det	_	SpaceAfter=No|Translit=ソノ
2	國	国	NOUN	名詞-普通名詞-一般	_	4	obj	_	SpaceAfter=No|Translit=クニ
3	を	を	ADP	助詞-格助詞	_	2	case	_	SpaceAfter=No|Translit=ヲ
4	治め	収める	VERB	動詞-一般	_	7	advcl	_	SpaceAfter=No|Translit=オサメ
5	ん	む	AUX	助動詞	_	4	aux	_	SpaceAfter=No|Translit=ン
6	と	と	ADP	助詞-格助詞	_	4	case	_	SpaceAfter=No|Translit=ト
7	欲する	欲する	VERB	動詞-一般	_	8	acl	_	SpaceAfter=No|Translit=ホッスル
8	者	者	NOUN	名詞-普通名詞-一般	_	14	nsubj	_	SpaceAfter=No|Translit=モノ
9	は	は	ADP	助詞-係助詞	_	8	case	_	SpaceAfter=No|Translit=ハ
10	先づ	先ず	ADV	副詞	_	14	advmod	_	SpaceAfter=No|Translit=マヅ
11	其	其の	DET	連体詞	_	12	det	_	SpaceAfter=No|Translit=ソノ
12	家	家	NOUN	名詞-普通名詞-一般	_	14	obj	_	SpaceAfter=No|Translit=ウチ
13	を	を	ADP	助詞-格助詞	_	12	case	_	SpaceAfter=No|Translit=ヲ
14	齊ふ	整える	VERB	動詞-一般	_	0	root	_	SpaceAfter=No|Translit=トトノフ

>>> t=s[7]
>>> print(t.id,t.form,t.lemma,t.upos,t.xpos,t.feats,t.head.id,t.deprel,t.deps,t.misc)
7 欲する 欲する VERB 動詞-一般 _ 8 acl _ SpaceAfter=No|Translit=ホッスル
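
The attributes printed above (id, form, lemma, upos, xpos, feats, head, deprel, deps, misc) correspond to the ten tab-separated columns of the CoNLL-U output. As a minimal sketch that does not depend on unidic2ud itself, such output can be read back with plain Python. The names FIELDS and parse_conllu are hypothetical helpers introduced here for illustration; they are not part of the library.

```python
# Column names of the CoNLL-U format, in order.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def parse_conllu(text):
    """Return one dict per token line, skipping comment and blank lines."""
    tokens = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        tokens.append(dict(zip(FIELDS, line.split("\t"))))
    return tokens


# Rows 4, 7, 10 and 14 copied from the output above (tabs written as \t).
sample = "\n".join([
    "# four rows of the example sentence",
    "4\t治め\t収める\tVERB\t動詞-一般\t_\t7\tadvcl\t_\tSpaceAfter=No|Translit=オサメ",
    "7\t欲する\t欲する\tVERB\t動詞-一般\t_\t8\tacl\t_\tSpaceAfter=No|Translit=ホッスル",
    "10\t先づ\t先ず\tADV\t副詞\t_\t14\tadvmod\t_\tSpaceAfter=No|Translit=マヅ",
    "14\t齊ふ\t整える\tVERB\t動詞-一般\t_\t0\troot\t_\tSpaceAfter=No|Translit=トトノフ",
])

tokens = parse_conllu(sample)
for t in tokens:
    print(t["id"], t["form"], t["lemma"], t["upos"], t["head"], t["deprel"])
```

All values stay strings in this sketch; convert id and head with int() if you need to follow dependency links numerically.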

>>> print(s.to_tree())
    其 <══╗         det(決定詞)
    國 ═╗═╝<╗       obj(目的語)
    を <╝   ║       case(格表示)
  治め ═╗═╗═╝<╗     advcl(連用修飾節)
    ん <╝ ║   ║     aux(動詞補助成分)
    と <══╝   ║     case(格表示)
欲する ═══════╝<╗   acl(連体修飾節)
    者 ═╗═══════╝<╗ nsubj(主語)
    は <╝         ║ case(格表示)
  先づ <══════╗   ║ advmod(連用修飾語)
    其 <══╗   ║   ║ det(決定詞)
    家 ═╗═╝<╗ ║   ║ obj(目的語)
    を <╝   ║ ║   ║ case(格表示)
  齊ふ ═════╝═╝═══╝ root(親)

>>> f=open("trial.svg","w")
>>> f.write(s.to_svg())
>>> f.close()

unidic2ud.load(UniDic,UDPipe) loads a natural language processing pipeline that uses UniDic as the tokenizer, POS-tagger, and lemmatizer, then UDPipe as the dependency-parser. The default is UDPipe="japanese-modern". UniDic options that appear on this page are "kindai", "qkana", "kinsei", "gendai", and "spoken".

unidic2ud.UniDic2UDEntry.to_tree() takes an option, to_tree(BoxDrawingWidth=2), for old terminals that render Box Drawing characters as "fullwidth".

You can simply use unidic2ud on the command line:

echo 其國を治めんと欲する者は先づ其家を齊ふ | unidic2ud -U kindai

CaboCha emulator usage

>>> import unidic2ud.cabocha as CaboCha
>>> c=CaboCha.Parser("kindai")
>>> s=c.parse("其國を治めんと欲する者は先づ其家を齊ふ")
>>> print(s.toString(CaboCha.FORMAT_TREE_LATTICE))
  其-D
  國を-D
治めんと-D
    欲する-D
        者は-------D
          先づ-----D
              其-D |
              家を-D
                齊ふ
EOS
* 0 1D 0/0 0.000000
其	連体詞,*,*,*,*,*,其の,ソノ,*,DET	O	1<-det-2
* 1 2D 0/1 0.000000
國	名詞,普通名詞,一般,*,*,*,国,クニ,*,NOUN	O	2<-obj-4
を	助詞,格助詞,*,*,*,*,を,ヲ,*,ADP	O	3<-case-2
* 2 3D 0/1 0.000000
治め	動詞,一般,*,*,*,*,収める,オサメ,*,VERB	O	4<-advcl-7
ん	助動詞,*,*,*,*,*,む,ン,*,AUX	O	5<-aux-4
と	助詞,格助詞,*,*,*,*,と,ト,*,ADP	O	6<-case-4
* 3 4D 0/0 0.000000
欲する	動詞,一般,*,*,*,*,欲する,ホッスル,*,VERB	O	7<-acl-8
* 4 8D 0/1 0.000000
者	名詞,普通名詞,一般,*,*,*,者,モノ,*,NOUN	O	8<-nsubj-14
は	助詞,係助詞,*,*,*,*,は,ハ,*,ADP	O	9<-case-8
* 5 8D 0/0 0.000000
先づ	副詞,*,*,*,*,*,先ず,マヅ,*,ADV	O	10<-advmod-14
* 6 7D 0/0 0.000000
其	連体詞,*,*,*,*,*,其の,ソノ,*,DET	O	11<-det-12
* 7 8D 0/1 0.000000
家	名詞,普通名詞,一般,*,*,*,家,ウチ,*,NOUN	O	12<-obj-14
を	助詞,格助詞,*,*,*,*,を,ヲ,*,ADP	O	13<-case-12
* 8 -1D 0/0 0.000000
齊ふ	動詞,一般,*,*,*,*,整える,トトノフ,*,VERB	O	14<-root
EOS
>>> for c in [s.chunk(i) for i in range(s.chunk_size())]:
...   if c.link>=0:
...     print(c,"->",s.chunk(c.link))
...
其 -> 國を
國を -> 治めんと
治めんと -> 欲する
欲する -> 者は
者は -> 齊ふ
先づ -> 齊ふ
其 -> 家を
家を -> 齊ふ
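
The same chunk links can be read directly from the lattice text: each header line such as "* 4 8D 0/1 0.000000" starts with the chunk index followed by the index of the chunk it depends on (suffixed with D, and -1 for the root). The following is a minimal sketch under that reading of the format; chunk_links is a hypothetical helper, not a function of unidic2ud.cabocha.

```python
def chunk_links(lattice):
    """Map each chunk index to the index of the chunk it depends on."""
    links = {}
    for line in lattice.splitlines():
        if line.startswith("* "):
            # Header fields: "*", chunk index, link + "D", head/func, score.
            _, index, link = line.split()[:3]
            links[int(index)] = int(link.rstrip("D"))
    return links


# Header lines copied from the lattice output above; token lines are omitted.
headers = "\n".join([
    "* 0 1D 0/0 0.000000",
    "* 1 2D 0/1 0.000000",
    "* 2 3D 0/1 0.000000",
    "* 3 4D 0/0 0.000000",
    "* 4 8D 0/1 0.000000",
    "* 5 8D 0/0 0.000000",
    "* 6 7D 0/0 0.000000",
    "* 7 8D 0/1 0.000000",
    "* 8 -1D 0/0 0.000000",
    "EOS",
])

links = chunk_links(headers)
for index, link in links.items():
    if link >= 0:
        print(index, "->", link)
```

Eight of the nine chunks have a non-negative link, which matches the eight arrows printed by the loop over s.chunk(i).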

CaboCha.Parser(UniDic) is an alias for unidic2ud.load(UniDic,UDPipe="japanese-modern"), and its default is UniDic=None. CaboCha.Tree.toString(format) has five available formats:

  • CaboCha.FORMAT_TREE: tree (numbered as 0)
  • CaboCha.FORMAT_LATTICE: lattice (numbered as 1)
  • CaboCha.FORMAT_TREE_LATTICE: tree + lattice (numbered as 2)
  • CaboCha.FORMAT_XML: XML (numbered as 3)
  • CaboCha.FORMAT_CONLL: Universal Dependencies CoNLL-U (numbered as 4)

You can simply use udcabocha on the command line:

echo 其國を治めんと欲する者は先づ其家を齊ふ | udcabocha -U kindai -f 2

-U UniDic specifies the UniDic to use. -f format specifies the output format: 0 to 4 as numbered above (default is -f 0), with 5 to 8 selecting further formats.

Try the notebook for Google Colaboratory.

Usage via spaCy

If you have already installed spaCy 2.1.0 or later, you can use UniDic via the spaCy Language pipeline.

>>> import unidic2ud.spacy
>>> nlp=unidic2ud.spacy.load("kindai")
>>> d=nlp("其國を治めんと欲する者は先づ其家を齊ふ")
>>> print(unidic2ud.spacy.to_conllu(d))
# text = 其國を治めんと欲する者は先づ其家を齊ふ
1	其	其の	DET	連体詞	_	2	det	_	SpaceAfter=No|Translit=ソノ
2	國	国	NOUN	名詞-普通名詞-一般	_	4	obj	_	SpaceAfter=No|Translit=クニ
3	を	を	ADP	助詞-格助詞	_	2	case	_	SpaceAfter=No|Translit=ヲ
4	治め	収める	VERB	動詞-一般	_	7	advcl	_	SpaceAfter=No|Translit=オサメ
5	ん	む	AUX	助動詞	_	4	aux	_	SpaceAfter=No|Translit=ン
6	と	と	ADP	助詞-格助詞	_	4	case	_	SpaceAfter=No|Translit=ト
7	欲する	欲する	VERB	動詞-一般	_	8	acl	_	SpaceAfter=No|Translit=ホッスル
8	者	者	NOUN	名詞-普通名詞-一般	_	14	nsubj	_	SpaceAfter=No|Translit=モノ
9	は	は	ADP	助詞-係助詞	_	8	case	_	SpaceAfter=No|Translit=ハ
10	先づ	先ず	ADV	副詞	_	14	advmod	_	SpaceAfter=No|Translit=マヅ
11	其	其の	DET	連体詞	_	12	det	_	SpaceAfter=No|Translit=ソノ
12	家	家	NOUN	名詞-普通名詞-一般	_	14	obj	_	SpaceAfter=No|Translit=ウチ
13	を	を	ADP	助詞-格助詞	_	12	case	_	SpaceAfter=No|Translit=ヲ
14	齊ふ	整える	VERB	動詞-一般	_	0	root	_	SpaceAfter=No|Translit=トトノフ

>>> t=d[6]
>>> print(t.i+1,t.orth_,t.lemma_,t.pos_,t.tag_,t.head.i+1,t.dep_,t.whitespace_,t.norm_)
7 欲する 欲する VERB 動詞-一般 8 acl  ホッスル

>>> from deplacy.deprelja import deprelja
>>> for b in unidic2ud.spacy.bunsetu_spans(d):
...   for t in b.lefts:
...     print(unidic2ud.spacy.bunsetu_span(t),"->",b,"("+deprelja[t.dep_]+")")
...
其 -> 國を (決定詞)
國を -> 治めんと (目的語)
治めんと -> 欲する (連用修飾節)
欲する -> 者は (連体修飾節)
其 -> 家を (決定詞)
者は -> 齊ふ (主語)
先づ -> 齊ふ (連用修飾語)
家を -> 齊ふ (目的語)

unidic2ud.spacy.load(UniDic,parser) loads a spaCy pipeline that uses UniDic as the tokenizer, POS-tagger, and lemmatizer (as shown above), then uses parser as the dependency-parser. The default is parser="japanese-modern".

Installation for Linux

A tarball is available for Linux and is installed by default when you use pip:

pip install unidic2ud

With the default installation, UniDic is invoked through Web APIs. To invoke it locally, which is faster, download the UniDic you use as follows:

python -m unidic2ud download kindai
python -m unidic2ud dictlist

Licenses of dictionaries and models are: GPL/LGPL/BSD for gendai and spoken; CC BY-NC-SA 4.0 for others.

Installation for Cygwin

Make sure the gcc-g++, python37-pip, and python37-devel packages are installed, and then run:

pip3.7 install unidic2ud

In Cygwin, use the python3.7 command instead of python.

Installation for Jupyter Notebook (Google Colaboratory)

!pip install unidic2ud

Benchmarks

Results of 舞姬/雪國/荒野より-Benchmarks

舞姬              LAS    MLAS   BLEX
UniDic="kindai"   81.13  70.37  77.78
UniDic="qkana"    79.25  70.37  77.78
UniDic="kinsei"   72.22  60.71  64.29

雪國              LAS    MLAS   BLEX
UniDic="qkana"    89.29  85.71  81.63
UniDic="kinsei"   89.29  85.71  77.55
UniDic="kindai"   84.96  81.63  77.55

荒野より          LAS    MLAS   BLEX
UniDic="kindai"   76.44  61.54  53.85
UniDic="qkana"    75.39  61.54  53.85
UniDic="kinsei"   71.88  58.97  51.28
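
For readers unfamiliar with the LAS column: when the gold and system analyses share the same tokenization, the labeled attachment score reduces to the percentage of tokens whose head and dependency relation both match the gold annotation. The sketch below illustrates only that simplified case with made-up data; it is not the evaluation script behind the figures above, and MLAS and BLEX, which apply further conditions, are not reproduced here. The function name las and both token lists are hypothetical.

```python
def las(gold, system):
    """Percentage of tokens whose (head, deprel) pair equals the gold pair."""
    if len(gold) != len(system):
        raise ValueError("this sketch assumes identical tokenization")
    correct = sum(g == s for g, s in zip(gold, system))
    return 100.0 * correct / len(gold)


# Hypothetical (head, deprel) pairs for a four-token sentence.
gold = [(2, "det"), (4, "obj"), (2, "case"), (0, "root")]
system = [(2, "det"), (4, "nsubj"), (2, "case"), (0, "root")]

score = las(gold, system)
print(f"LAS = {score:.2f}")
```

Here three of the four tokens agree with the gold analysis, so the score is 75.00.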

Author

Koichi Yasuoka (安岡孝一)

unidic2ud's Issues

shlex call in mecab breaks library with downloaded dictionaries on Windows

A call to shlex.split(args) within mecab strips backslash (\) characters from paths unless the paths are enclosed in " characters.

Line 224 in unidic2ud.py does not do this:
self.mecab=Tagger("-r "+r+" -d "+d).parse

Thus the library does not work with downloaded dictionaries on Windows.
This modified line fixes the problem:

self.mecab=Tagger(f"""-r "{r}" -d "{d}" """).parse

mecab itself encloses paths in this way within its own code, so I find it highly unlikely that this modification will break anything on other platforms.

All the best :)
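
The behaviour described in this report can be reproduced with the standard library alone. The sketch below assumes, as the report states, that the argument string ends up in shlex.split; the two relative paths are illustrative values modelled on the error output quoted in the next issue, not paths taken from the library.

```python
import shlex

# Illustrative Windows-style paths (raw strings keep the backslashes).
rc = r"site-packages\unidic2ud\mecabrc"
dic = r"site-packages\unidic2ud\download\gendai"

# Unquoted: in POSIX mode a backslash escapes the next character and is dropped.
unquoted = shlex.split("-r " + rc + " -d " + dic)

# Double-quoted: a backslash is kept unless it precedes a quote or a backslash.
double_quoted = shlex.split(f'-r "{rc}" -d "{dic}"')

# Single-quoted: everything between the quotes is taken literally.
single_quoted = shlex.split("-r '" + rc + "' -d '" + dic + "'")

print(unquoted)
print(double_quoted)
print(single_quoted)
```

The unquoted call yields site-packagesunidic2udmecabrc for the -r argument, the same mangled path that appears in the error message, while both quoted forms return the paths unchanged. Note that a path ending in a backslash, or containing a doubled backslash, would still be altered inside double quotes.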

Bug in the file "unidic2ud/unidic2ud.py"

When I ran the demo code, it raised an error.

code

import unidic2ud
nlp=unidic2ud.load("kindai")
s=nlp("其國を治めんと欲する者は先づ其家を齊ふ")
print(s)

cmd output

------------------- ERROR DETAILS ------------------------
arguments: -r site-packages\unidic2ud\mecabrc -d site-packages\unidic2ud\download\gendai
[ifs] no such file or directory: site-packagesunidic2udmecabrc
----------------------------------------------------------

After I changed lines 224 and 231 of the file "unidic2ud/unidic2ud.py", the code ran.

224 self.mecab=Tagger("-r "+r+" -d "+d).parse
# self.mecab=Tagger("-r '"+r+"' -d '"+d+"'").parse
231 self.mecab=Tagger("-r "+r+" -d "+unidic_lite.DICDIR).parse
# self.mecab=Tagger("-r '"+r+"' -d '"+unidic_lite.DICDIR+"'").parse
