analysis-sudachi

analysis-sudachi is an Elasticsearch plugin for tokenizing Japanese text using Sudachi, a Japanese morphological analyzer.

What's new?

  • version 1.3.2

    • Upgrade sudachi morphological analyzer to 0.3.1
  • version 1.3.1

    • Upgrade sudachi morphological analyzer to 0.3.0
    • Minor bug fix
  • version 1.3.0

    • Upgrade sudachi morphological analyzer to 0.2.0
    • Import sudachi from maven central repository
    • Minor bug fix
  • version 1.2.0

    • Upgrade sudachi morphological analyzer to 0.2.0-SNAPSHOT
    • New filter sudachi_normalizedform was added; see sudachi_normalizedform
    • Default normalization behavior was changed; neither the baseform filter nor the normalizedform filter is applied by default
    • sudachi_readingform filter was changed to use new romaji mappings based on MS-IME
  • version 1.1.0

  • version 1.0.0

    • first release

Build

  1. Build analysis-sudachi.
   $ mvn package

Installation

  1. Download the analysis-sudachi-elasticsearch zip archive.
  2. Change the current directory to $ES_HOME.
  3. Execute "bin/elasticsearch-plugin install file:///plugin-zip-path".
  4. Download the Sudachi dictionary archive from https://github.com/WorksApplications/SudachiDict
  5. Extract the dic file and place it at config/sudachi_tokenizer/system_core.dic
  6. Execute "bin/elasticsearch".
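
Step 3 expects the plugin archive as a file:// URL. The following is a minimal sketch of deriving that URL from a local path and assembling the install command; the helper name `plugin_install_command` and the example zip path are hypothetical and not part of the plugin.

```python
from pathlib import Path


def plugin_install_command(zip_path, es_home="."):
    """Return the argument list for step 3 (hypothetical helper).

    zip_path: location of the downloaded plugin archive.
    es_home:  Elasticsearch home directory ($ES_HOME).
    """
    # as_uri() requires an absolute path and yields a file:// URL.
    url = Path(zip_path).absolute().as_uri()
    installer = Path(es_home) / "bin" / "elasticsearch-plugin"
    return [str(installer), "install", url]


if __name__ == "__main__":
    # The archive name below is an assumed example, not a real release file.
    print(" ".join(plugin_install_command("/tmp/analysis-sudachi.zip")))
```

The resulting list can be passed to `subprocess.run` once the archive actually exists at that path.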

Configuration

  • tokenizer: Select tokenizer. (sudachi) (string)
  • mode: Select mode. (normal or search or extended) (string, default: search)
    • normal: Regular segmentation. (Uses C mode of Sudachi)
      Ex) 関西国際空港 / アバラカダブラ
    • search: Additional segmentation useful for search. (Uses both C and A modes)
      Ex) 関西国際空港, 関西, 国際, 空港 / アバラカダブラ
    • extended: Similar to search mode, but also splits unknown words into unigrams.
      Ex) 関西国際空港, 関西, 国際, 空港 / アバラカダブラ, ア, バ, ラ, カ, ダ, ブ, ラ
  • discard_punctuation: Select to discard punctuation or not. (bool, default: true)
  • settings_path: Sudachi setting file path. The path may be absolute or relative; relative paths are resolved with respect to ES_HOME. (string, default: null)
  • resources_path: Sudachi dictionary path. The path may be absolute or relative; relative paths are resolved with respect to ES_HOME. (string, default: null)

Example

{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer",
            "mode": "search",
            "discard_punctuation": true,
            "resources_path": "/etc/elasticsearch/sudachi"
          }
        },
        "analyzer": {
          "sudachi_analyzer": {
            "filter": [],
            "tokenizer": "sudachi_tokenizer",
            "type": "custom"
          }
        }
      }
    }
  }
}
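
The same settings can be generated from a script. The sketch below assembles a tokenizer definition from the options listed under Configuration and rejects unknown modes; the helper names `sudachi_tokenizer_settings` and `index_settings` are hypothetical, and the `type` value simply mirrors the example above.

```python
import json

# Modes listed in the Configuration section.
VALID_MODES = {"normal", "search", "extended"}


def sudachi_tokenizer_settings(mode="search", discard_punctuation=True,
                               settings_path=None, resources_path=None):
    """Build a tokenizer definition (hypothetical helper, not part of the plugin)."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    tokenizer = {
        "type": "sudachi_tokenizer",
        "mode": mode,
        "discard_punctuation": discard_punctuation,
    }
    # Paths are omitted when unset so that the documented default (null) applies.
    if settings_path is not None:
        tokenizer["settings_path"] = settings_path
    if resources_path is not None:
        tokenizer["resources_path"] = resources_path
    return tokenizer


def index_settings(tokenizer, filters=()):
    """Wrap a tokenizer definition in the index settings layout shown above."""
    return {
        "settings": {
            "index": {
                "analysis": {
                    "tokenizer": {"sudachi_tokenizer": tokenizer},
                    "analyzer": {
                        "sudachi_analyzer": {
                            "type": "custom",
                            "tokenizer": "sudachi_tokenizer",
                            "filter": list(filters),
                        }
                    },
                }
            }
        }
    }


if __name__ == "__main__":
    tok = sudachi_tokenizer_settings(resources_path="/etc/elasticsearch/sudachi")
    print(json.dumps(index_settings(tok), ensure_ascii=False, indent=2))
```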

Filters

sudachi_part_of_speech

The sudachi_part_of_speech token filter removes tokens that match a set of part-of-speech tags. It accepts the following setting:

stoptags: An array of part-of-speech and/or inflection tags that should be removed. It defaults to the stoptags.txt file embedded in the lucene-analysis-sudachi.jar.

Sudachi POS information is a comma-separated list of 6 items:

  • 1-4 part-of-speech hierarchy (品詞階層)
  • 5 inflectional type (活用型)
  • 6 inflectional form (活用形)

With stoptags, you can filter out tokens using any of these forward-matching forms:

  • 1 - e.g., 名詞
  • 1,2 - e.g., 名詞,固有名詞
  • 1,2,3 - e.g., 名詞,固有名詞,地名
  • 1,2,3,4 - e.g., 名詞,固有名詞,地名,一般
  • 5 - e.g., 五段-カ行
  • 6 - e.g., 終止形-一般
  • 5,6 - e.g., 五段-カ行,終止形-一般
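
To make the matching forms above concrete, here is an illustrative sketch of how a stoptag could be compared against a token's 6-item POS list. It is written from the description in this section only and is not the plugin's actual implementation; the function name `matches_stoptag` and the sample POS lists are assumptions.

```python
def matches_stoptag(pos, stoptag):
    """Return True if a 6-item POS list matches a stoptag string.

    pos:     [level1, level2, level3, level4, inflectional type, inflectional form]
    stoptag: comma-separated tag such as "名詞,固有名詞" or "五段-カ行"
    Illustrative only; the plugin may implement matching differently.
    """
    tag = stoptag.split(",")
    hierarchy, inflection = pos[:4], pos[4:6]
    # Forms 1 / 1,2 / 1,2,3 / 1,2,3,4: a prefix of the POS hierarchy.
    if len(tag) <= 4 and hierarchy[:len(tag)] == tag:
        return True
    # Forms 5 and 6: a single inflectional type or inflectional form.
    if len(tag) == 1 and tag[0] in inflection:
        return True
    # Form 5,6: inflectional type and form together.
    return tag == inflection


if __name__ == "__main__":
    # Sample POS lists made up for illustration.
    place = ["名詞", "固有名詞", "地名", "一般", "*", "*"]
    verb = ["動詞", "一般", "*", "*", "五段-カ行", "終止形-一般"]
    print(matches_stoptag(place, "名詞,固有名詞"))
    print(matches_stoptag(verb, "五段-カ行,終止形-一般"))
```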

PUT sudachi_sample

{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer",
            "resources_path": "/etc/elasticsearch/sudachi"
          }
        },
        "analyzer": {
          "sudachi_analyzer": {
            "filter": [
              "my_posfilter"
            ],
            "tokenizer": "sudachi_tokenizer",
            "type": "custom"
          }
        },
        "filter":{
         "my_posfilter":{
          "type":"sudachi_part_of_speech",
          "stoptags":[
           "助詞",
           "助動詞",
           "補助記号,句点",
           "補助記号,読点"
          ]
         }
        }
      }
    }
  }
}

POST sudachi_sample/_analyze

{
    "analyzer":"sudachi_analyzer",
    "text":"寿司がおいしいね"
}

Which responds with:

{
    "tokens": [
        {
            "token": "寿司",
            "start_offset": 0,
            "end_offset": 2,
            "type": "word",
            "position": 0
        },
        {
            "token": "美味しい",
            "start_offset": 3,
            "end_offset": 7,
            "type": "word",
            "position": 2
        }
    ]
}
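
The analyze request can also be issued from a script. The sketch below uses only Python's standard library and targets Elasticsearch's `_analyze` endpoint; the base URL `http://localhost:9200` is an assumption about where the node is running, and the helper names `build_analyze_request` and `analyze` are hypothetical.

```python
import json
import urllib.request


def build_analyze_request(text, analyzer="sudachi_analyzer",
                          index="sudachi_sample",
                          base_url="http://localhost:9200"):
    """Build (but do not send) an _analyze request for the given text."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(text, **kwargs):
    """Send the request and return the token strings from the response."""
    with urllib.request.urlopen(build_analyze_request(text, **kwargs)) as resp:
        return [t["token"] for t in json.load(resp)["tokens"]]


if __name__ == "__main__":
    # Requires a running node with the sudachi_sample index created as above.
    print(analyze("寿司がおいしいね"))
```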

sudachi_ja_stop

The sudachi_ja_stop token filter filters out Japanese stopwords (_japanese_) and any other custom stopwords specified by the user. This filter only supports the predefined _japanese_ stopwords list. If you want to use a different predefined list, use the stop token filter instead.

PUT sudachi_sample

{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer",
            "resources_path": "/etc/elasticsearch/sudachi"
          }
        },
        "analyzer": {
          "sudachi_analyzer": {
            "filter": [
              "my_stopfilter"
            ],
            "tokenizer": "sudachi_tokenizer",
            "type": "custom"
          }
        },
        "filter":{
         "my_stopfilter":{
          "type":"sudachi_ja_stop",
          "stopwords":[
            "_japanese_",
            "は",
            "です"
          ]
         }
        }
      }
    }
  }
}

POST sudachi_sample/_analyze

{
 "analyzer":"sudachi_analyzer",
 "text":"私は宇宙人です。"
}

Which responds with:

{
    "tokens": [
        {
            "token": "私",
            "start_offset": 0,
            "end_offset": 1,
            "type": "word",
            "position": 0
        },
        {
            "token": "宇宙",
            "start_offset": 2,
            "end_offset": 4,
            "type": "word",
            "position": 2
        },
        {
            "token": "人",
            "start_offset": 4,
            "end_offset": 5,
            "type": "word",
            "position": 3
        }
    ]
}

sudachi_baseform

The sudachi_baseform token filter replaces terms with their SudachiBaseFormAttribute. This acts as a lemmatizer for verbs and adjectives.

PUT sudachi_sample

{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer",
            "resources_path": "/etc/elasticsearch/sudachi"
          }
        },
        "analyzer": {
          "sudachi_analyzer": {
            "filter": [
              "sudachi_baseform"
            ],
            "tokenizer": "sudachi_tokenizer",
            "type": "custom"
          }
        }
      }
    }
  }
}

POST sudachi_sample/_analyze

{
  "analyzer": "sudachi_analyzer",
  "text": "飲み"
}

Which responds with:

{
    "tokens": [
        {
            "token": "飲む",
            "start_offset": 0,
            "end_offset": 2,
            "type": "word",
            "position": 0
        }
    ]
}

sudachi_normalizedform

The sudachi_normalizedform token filter replaces terms with their SudachiNormalizedFormAttribute. This acts as a normalizer for spelling variants.

This filter also lemmatizes verbs and adjectives, so you don't need to combine it with the sudachi_baseform filter.

PUT sudachi_sample

{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer",
            "resources_path": "/etc/elasticsearch/sudachi"
          }
        },
        "analyzer": {
          "sudachi_analyzer": {
            "filter": [
              "sudachi_normalizedform"
            ],
            "tokenizer": "sudachi_tokenizer",
            "type": "custom"
          }
        }
      }
    }
  }
}

POST sudachi_sample/_analyze

{
  "analyzer": "sudachi_analyzer",
  "text": "呑み"
}

Which responds with:

{
    "tokens": [
        {
            "token": "飲む",
            "start_offset": 0,
            "end_offset": 2,
            "type": "word",
            "position": 0
        }
    ]
}

sudachi_readingform

The sudachi_readingform token filter replaces each token with its reading form in either katakana or romaji. It accepts the following setting:

use_romaji

Whether romaji reading form should be output instead of katakana. Defaults to false.

When using the pre-defined sudachi_readingform filter, use_romaji is set to true. The default when defining a custom sudachi_readingform, however, is false. The only reason to use the custom form is if you need the katakana reading form:

PUT sudachi_sample

{
    "settings": {
        "index": {
            "analysis": {
                "filter": {
                    "romaji_readingform": {
                        "type": "sudachi_readingform",
                        "use_romaji": true
                    },
                    "katakana_readingform": {
                        "type": "sudachi_readingform",
                        "use_romaji": false
                    }
                },
                "tokenizer": {
                    "sudachi_tokenizer": {
                        "type": "sudachi_tokenizer",
                        "resources_path": "/etc/elasticsearch/sudachi"
                    }
                },
                "analyzer": {
                    "romaji_analyzer": {
                        "tokenizer": "sudachi_tokenizer",
                        "filter": [
                            "romaji_readingform"
                        ]
                    },
                    "katakana_analyzer": {
                        "tokenizer": "sudachi_tokenizer",
                        "filter": [
                            "katakana_readingform"
                        ]
                    }
                }
            }
        }
    }
}

POST sudachi_sample/_analyze

{
  "analyzer": "katakana_analyzer",
  "text": "寿司"
}

Returns スシ.

{
  "analyzer": "romaji_analyzer",
  "text": "寿司"
}

Returns susi.

License

Copyright (c) 2017-2019 Works Applications Co., Ltd.
Originally under elasticsearch, https://www.elastic.co/jp/products/elasticsearch
Originally under lucene, https://lucene.apache.org/
