gitchennan / elasticsearch-analysis-lc-pinyin

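An intelligent Chinese pinyin analysis plugin for Elasticsearch that supports search by full pinyin, by first letters, and by input mixing pinyin with Chinese characters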

License: Artistic License 2.0

Java 100.00%

elasticsearch-analysis-lc-pinyin's Issues

[2017-07-04T09:38:28,839][ERROR][o.e.b.Bootstrap ] Exception
java.lang.IllegalArgumentException: plugin [analysis-lc-pinyin] is incompatible with version [5.4.3]; was designed for version [5.3.0]
at org.elasticsearch.plugins.PluginInfo.readFromProperties(PluginInfo.java:146) ~[elasticsearch-5.4.3.jar:5.4.3]
at org.elasticsearch.bootstrap.Spawner.spawnNativePluginControllers(Spawner.java:86) ~[elasticsearch-5.4.3.jar:5.4.3]
at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:167) ~[elasticsearch-5.4.3.jar:5.4.3]
at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:350) [elasticsearch-5.4.3.jar:5.4.3]
at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:123) [elasticsearch-5.4.3.jar:5.4.3]
at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:114) [elasticsearch-5.4.3.jar:5.4.3]
at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:67) [elasticsearch-5.4.3.jar:5.4.3]
at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:122) [elasticsearch-5.4.3.jar:5.4.3]
at org.elasticsearch.cli.Command.main(Command.java:88) [elasticsearch-5.4.3.jar:5.4.3]
at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:91) [elasticsearch-5.4.3.jar:5.4.3]
at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:84) [elasticsearch-5.4.3.jar:5.4.3]
[2017-07-04T09:38:28,852][WARN ][o.e.b.ElasticsearchUncaughtExceptionHandler] [node-1] uncaught exception in thread [main]
org.elasticsearch.bootstrap.StartupException: java.lang.IllegalArgumentException: plugin [analysis-lc-pinyin] is incompatible with version [5.4.3]; was designed for version [5.3.0]
at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:127) ~[elasticsearch-5.4.3.jar:5.4.3]
at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:114) ~[elasticsearch-5.4.3.jar:5.4.3]
at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:67) ~[elasticsearch-5.4.3.jar:5.4.3]
at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:122) ~[elasticsearch-5.4.3.jar:5.4.3]
at org.elasticsearch.cli.Command.main(Command.java:88) ~[elasticsearch-5.4.3.jar:5.4.3]
at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:91) ~[elasticsearch-5.4.3.jar:5.4.3]
at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:84) ~[elasticsearch-5.4.3.jar:5.4.3]
Caused by: java.lang.IllegalArgumentException: plugin [analysis-lc-pinyin] is incompatible with version [5.4.3]; was designed for version [5.3.0]
at org.elasticsearch.plugins.PluginInfo.readFromProperties(PluginInfo.java:146) ~[elasticsearch-5.4.3.jar:5.4.3]
at org.elasticsearch.bootstrap.Spawner.spawnNativePluginControllers(Spawner.java:86) ~[elasticsearch-5.4.3.jar:5.4.3]
at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:167) ~[elasticsearch-5.4.3.jar:5.4.3]
at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:350) ~[elasticsearch-5.4.3.jar:5.4.3]
at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:123) ~[elasticsearch-5.4.3.jar:5.4.3]
... 6 more
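
The log above states the mismatch directly: the installed plugin build was designed for Elasticsearch 5.3.0 while the node runs 5.4.3. As a small triage sketch, the message can be parsed to pull out both versions. The function name `parse_plugin_mismatch` is hypothetical, and the pattern is derived only from the message shown in this log.

```python
import re

# Pattern derived from the message in the log above; other Elasticsearch
# versions may word the error differently.
MISMATCH = re.compile(
    r"plugin \[(?P<plugin>[^\]]+)\] is incompatible with version "
    r"\[(?P<running>[^\]]+)\]; was designed for version \[(?P<designed>[^\]]+)\]"
)

def parse_plugin_mismatch(line):
    """Return (plugin, running_version, designed_version), or None if no match."""
    m = MISMATCH.search(line)
    if m is None:
        return None
    return m.group("plugin"), m.group("running"), m.group("designed")

log_line = (
    "java.lang.IllegalArgumentException: plugin [analysis-lc-pinyin] is "
    "incompatible with version [5.4.3]; was designed for version [5.3.0]"
)
print(parse_plugin_mismatch(log_line))
```

Once the two versions are known, the usual remedy is to install a plugin build whose target version matches the node, or to run the Elasticsearch version the plugin was built for.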

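The plugin has problems with words that contain polyphonic characters

When a word contains a polyphonic character, does the plugin turn every reading of that character into a term? For example: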
GET /_analyze?analyzer=lc_index&text=成长&pretty
{
  "tokens" : [
    { "token" : "成", "start_offset" : 0, "end_offset" : 1, "type" : "word", "position" : 0 },
    { "token" : "cheng", "start_offset" : 0, "end_offset" : 1, "type" : "word", "position" : 0 },
    { "token" : "c", "start_offset" : 0, "end_offset" : 1, "type" : "word", "position" : 0 },
    { "token" : "长", "start_offset" : 1, "end_offset" : 2, "type" : "word", "position" : 1 },
    { "token" : "zhang", "start_offset" : 1, "end_offset" : 2, "type" : "word", "position" : 1 },
    { "token" : "z", "start_offset" : 1, "end_offset" : 2, "type" : "word", "position" : 1 },
    { "token" : "chang", "start_offset" : 1, "end_offset" : 2, "type" : "word", "position" : 1 },
    { "token" : "c", "start_offset" : 1, "end_offset" : 2, "type" : "word", "position" : 1 }
  ]
}
Clearly, the reading chang should not appear here, since 成长 is read chengzhang.
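Does the plugin not support a dictionary of polyphonic words? Or is there some other way to handle this case? Thanks for your help.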
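
One common way to handle this is a word-level pronunciation dictionary that is consulted before falling back to per-character readings. The sketch below shows the idea with toy data. It is not the plugin's implementation, and `CHAR_READINGS`, `WORD_READINGS`, and `pinyin_terms` are hypothetical names introduced for this example. It relies on 成长 being read "cheng zhang".

```python
# Toy data for illustration only; these tables are not the plugin's dictionary.
CHAR_READINGS = {"成": ["cheng"], "长": ["zhang", "chang"]}
WORD_READINGS = {"成长": ["cheng", "zhang"], "长江": ["chang", "jiang"]}

def pinyin_terms(text, max_word_len=4):
    """Return one list of readings per character, preferring word-level entries."""
    result = []
    i = 0
    while i < len(text):
        # Try the longest dictionary word starting at position i.
        for size in range(min(max_word_len, len(text) - i), 1, -1):
            word = text[i:i + size]
            if word in WORD_READINGS:
                result.extend([r] for r in WORD_READINGS[word])
                i += size
                break
        else:
            # No word matched: fall back to every known reading of the character.
            ch = text[i]
            result.append(list(CHAR_READINGS.get(ch, [ch])))
            i += 1
    return result

print(pinyin_terms("成长"))  # word entry wins: a single reading per character
print(pinyin_terms("长"))    # isolated character: all readings are kept
```

With a word entry present, each character yields exactly one reading. An isolated polyphonic character still yields all of its readings, because nothing is available to disambiguate it.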

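When lc_search tokenizes Chinese text, can it also emit pinyin? At the moment it only splits into single characters

As the title says. Thanks.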

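Same configuration as the demo, so why does searching for "yundo" return nothing?

Hello, I want to support search by Chinese characters, full pinyin, and pinyin initials. My configuration and indexed data are below. Searching for "yd", "yundong", and "运动" all return data, so why does "yundo" return nothing?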

curl -XPUT http://192.168.0.101:9200/index/ -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ik_letter_smart": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": [
            "lc_first_letter"
          ]
        },
        "ik_py_smart": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": [
            "lc_full_pinyin"
          ]
        }
      }
    }
  }
}'

curl -XPOST http://192.168.0.101:9200/index/_mapping/brand -d'
{
  "brand": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "lc_index",
        "search_analyzer": "lc_search",
        "term_vector": "with_positions_offsets"
      }
    }
  }
}'

curl -XPOST http://192.168.0.101:9200/index/brand/1 -d'{"name":"百度"}'
curl -XPOST http://192.168.0.101:9200/index/brand/8 -d'{"name":"百度糯米"}'
curl -XPOST http://192.168.0.101:9200/index/brand/2 -d'{"name":"阿里巴巴"}'
curl -XPOST http://192.168.0.101:9200/index/brand/3 -d'{"name":"腾讯科技"}'
curl -XPOST http://192.168.0.101:9200/index/brand/4 -d'{"name":"网易游戏"}'
curl -XPOST http://192.168.0.101:9200/index/brand/9 -d'{"name":"大众点评"}'
curl -XPOST http://192.168.0.101:9200/index/brand/10 -d'{"name":"携程旅行网"}'
curl -XPOST http://192.168.0.101:9200/index/brand/11 -d'{"name":"运动"}'
curl -XPOST http://192.168.0.101:9200/index/brand/12 -d'{"name":"运动鞋"}'
curl -XPOST http://192.168.0.101:9200/index/brand/13 -d'{"name":"运动鞋 男"}'
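
One thing worth checking in the configuration above: the index settings define `ik_letter_smart` and `ik_py_smart`, but the `name` field is mapped to `lc_index` and `lc_search`, so those two custom analyzers never touch the indexed data. The sketch below uses only the request bodies quoted in this issue and lists analyzers that are defined but not referenced. The helper names are made up for illustration. This check alone does not explain why "yundo" fails to match, which depends on how `lc_search` segments an incomplete syllable.

```python
import json

settings = json.loads("""
{"settings": {"analysis": {"analyzer": {
  "ik_letter_smart": {"type": "custom", "tokenizer": "ik_max_word", "filter": ["lc_first_letter"]},
  "ik_py_smart": {"type": "custom", "tokenizer": "ik_max_word", "filter": ["lc_full_pinyin"]}
}}}}
""")
mapping = json.loads("""
{"brand": {"properties": {"name": {
  "type": "text", "analyzer": "lc_index", "search_analyzer": "lc_search",
  "term_vector": "with_positions_offsets"
}}}}
""")

def defined_analyzers(body):
    """Names of custom analyzers declared in the index settings."""
    return set(body["settings"]["analysis"]["analyzer"])

def referenced_analyzers(body):
    """Names of analyzers that any mapped field actually uses."""
    names = set()
    for type_def in body.values():
        for field in type_def["properties"].values():
            for key in ("analyzer", "search_analyzer"):
                if key in field:
                    names.add(field[key])
    return names

unused = defined_analyzers(settings) - referenced_analyzers(mapping)
print(sorted(unused))
```

Running the `_analyze` API with `lc_search` on the text "yundo" and comparing its tokens with the `lc_index` tokens for "运动" would show directly whether the two sides can match.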

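How to install lc-pinyin

The README only explains how to use the plugin, not how to install it. Could installation instructions be added?

Parameterized support for first-letter search and smart best-match in search mode

Search mode currently does not let the user choose whether to tokenize by first letters or by smart best match. This change makes both modes selectable through a parameter.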

Problem configuring the tokenizer in settings

Tokenizer
lc_index: parameter mode: full_pinyin, first_letter, chinese_char
lc_search: parameter mode: smart_pinyin, single_letter
The above is what you documented, but how is this mode actually used in practice?

Here are my settings:

{
  "number_of_shards": 5,
  "number_of_replicas": 1,
  "index": {
    "settings": {
      "analysis": {
        "analyzer": {
          "lc_analyzer": {
            "type": "custom",
            "tokenizer": {
              "lc_index":{
                "mode":"full_pinyin"
              }
            },
            "filter": [
              "lc_full_pinyin"
            ]
          }
        }
      }
    }
  }
}

Here is my mapping:

{
  "news": {
    "properties": {
      "newsId": {
        "type": "long"
      },
      "cityId": {
        "type": "integer"
      },
      "desId": {
        "type": "String"
      },
      "newsTitle": {
        "type": "string",
        "store": true,
        "analyzer": "lc_analyzer",
        "search_analyzer": "lc_search"
      },
      "newsTitlePinYin": {
        "type": "string",
        "store": true,
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      },
      "newsTitleJianPin": {
        "type": "string",
        "store": true,
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      },
      "newsTitleSource": {
        "type": "string",
        "store": true,
        "index": "not_analyzed"
      },
      "newsTitlePinYinSource": {
        "type": "string",
        "store": true,
        "index": "not_analyzed"
      },
      "newsTitleJianPinSource": {
        "type": "string",
        "store": true,
        "index": "not_analyzed"
      },
      "newsAbstract": {
        "type": "string",
        "store": true,
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      },
      "newsEditor": {
        "type": "string",
        "store": true,
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      },
      "editorNickName":{
        "type": "string",
        "store": true,
        "index": "not_analyzed"
      },
      "editorAvatar":{
        "type": "string",
        "store": true,
        "index": "not_analyzed"
      },
      "publishTime":{
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
      },
      "newsTitleSuggest": {
        "type": "completion",
        "payloads": true,
        "analyzer": "ik_smart",
        "search_analyzer": "ik_smart"
      }
    }
  }
}

And this is the error:

{
  "error": {
    "root_cause": [
      {
        "type": "mapper_parsing_exception",
        "reason": "analyzer [lc_analyzer] not found for field [newsTitle]"
      }
    ],
    "type": "mapper_parsing_exception",
    "reason": "analyzer [lc_analyzer] not found for field [newsTitle]"
  },
  "status": 400
}

=========================== A different setting does not work either =======================

{
  "number_of_shards": 5,
  "number_of_replicas": 1,
  "index": {
    "settings": {
      "analysis": {
        "analyzer": {
          "lc_analyzer": {
            "type": "custom",
            "tokenizer": "lc_index",
            "filter": [
              "lc_full_pinyin"
            ]
          }
        }
      }
    }
  }
}

This produces the same error as above: analyzer lc_analyzer is not found for field newsTitle.
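
A possible cause, offered as an assumption rather than a confirmed diagnosis: both settings bodies nest `analysis` under `index.settings`, which is not where Elasticsearch looks for analyzers, so `lc_analyzer` would never be registered. In addition, a custom analyzer refers to its tokenizer by name (a string); a tokenizer with parameters is normally declared under `analysis.tokenizer` with a `type`. The sketch below builds a create-index body along those lines. The tokenizer name `lc_index_custom` is hypothetical, and the `lc_index` type and its `mode` parameter are taken from the README excerpt quoted in this issue, not verified against the plugin source. The filter list from the original settings is left out to keep the sketch short.

```python
import json

def build_index_body(mode="full_pinyin"):
    """Build a create-index body with a named lc_index tokenizer (sketch)."""
    return {
        "settings": {
            "number_of_shards": 5,
            "number_of_replicas": 1,
            "analysis": {
                "tokenizer": {
                    # "lc_index_custom" is a hypothetical name chosen for this example.
                    "lc_index_custom": {"type": "lc_index", "mode": mode}
                },
                "analyzer": {
                    "lc_analyzer": {
                        "type": "custom",
                        # A custom analyzer refers to its tokenizer by name.
                        "tokenizer": "lc_index_custom",
                    }
                },
            },
        }
    }

print(json.dumps(build_index_body(), indent=2))
```

The printed JSON would be sent as the body of the index creation request, before the mapping is applied, so that `lc_analyzer` exists by the time the `newsTitle` field refers to it.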

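Indexing with lc_index for the completion suggester is extremely expensive

When using the ES completion suggester with lc_index as the index analyzer, the ES process hangs and CPU goes straight to 100%. What could be the cause?
The field mapping is as follows:

"suggestText": {
  "type": "completion",
  "analyzer": "lc_index",
  "search_analyzer": "lc_search",
  "payloads": true,
  "preserve_separators": false,
  "preserve_position_increments": true,
  "max_input_length": 50
}
lc_index works normally on fields whose type is not completion, and other analyzers such as ik_max_word and ik_smart index completion fields normally as well. Only the combination of lc_index with a completion field causes the huge CPU load and extremely slow indexing.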
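
A possible explanation, not confirmed by the plugin author: the completion suggester converts the analyzed token stream into an automaton and expands its paths, and `lc_index` stacks several tokens at each position (the `_analyze` output earlier on this page shows the character, its full pinyin, and its first letter, with more for polyphonic characters). The number of paths is then the product of the token counts per position, which grows exponentially with input length. The sketch below only does that arithmetic; the figure of 3 tokens per character is an assumption taken from that output.

```python
from math import prod

def path_count(tokens_per_position):
    """Number of distinct paths through a token graph with stacked tokens."""
    return prod(tokens_per_position)

# Assumption: 3 tokens per character (character, full pinyin, first letter),
# as in the lc_index output for non-polyphonic characters.
for length in (2, 10, 20, 50):
    print(length, path_count([3] * length))
```

If this is indeed the cause, a shorter `max_input_length` or an index analyzer that emits fewer tokens per position for the completion field would reduce the work.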

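How do I reference this plugin in a Java project?

Is 5.1.1 supported?

Pinyin suggestions return Chinese text that is not a prefix of the content but a word inside it?

For example, searching for “ali” returns both “阿里巴巴” and “你是阿里”. Is it possible to return only content that starts with “阿里”, rather than everything that contains “阿里”? Thanks.

Is ES 2.3.4 supported?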

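Problem when searching for “baidu” on version 5.5.2

I tested locally with the DEMO provided by lc. When searching for “baidu”, the “百度” entry gets a lower score than “百度糯米”.
Here is my query result: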
{
  "took": 14,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 2.8384802,
    "hits": [
      {
        "_index": "index",
        "_type": "brand",
        "_id": "8",
        "_score": 2.8384802,
        "_source": {
          "name": "百度糯米"
        },
        "highlight": {
          "name": [
            "百度糯米"
          ]
        }
      },
      {
        "_index": "index",
        "_type": "brand",
        "_id": "1",
        "_score": 0.8271048,
        "_source": {
          "name": "百度"
        },
        "highlight": {
          "name": [
            "百度"
          ]
        }
      }
    ]
  }
}

The question is: why do the results differ between the ES version used in the DEMO and ES 5.5.2? The demo shows “百度” first and “百度糯米” second.
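
A possible explanation, stated as an assumption and not verified against this index: the response reports 5 shards, and by default each shard scores hits with its own term statistics. With only a handful of documents, the two hits can land on shards of different sizes and receive very different IDF values. Elasticsearch 5.x also switched its default similarity to BM25, so scores are not directly comparable with older versions. The sketch below evaluates the BM25 IDF term for two hypothetical shards; the document counts are invented for illustration.

```python
import math

def bm25_idf(doc_freq, doc_count):
    """IDF term of BM25 as used by Lucene's BM25Similarity."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

# Hypothetical shard contents: the term matches one document on each shard,
# but the shards hold different numbers of documents.
small_shard = bm25_idf(doc_freq=1, doc_count=1)
large_shard = bm25_idf(doc_freq=1, doc_count=3)
print(round(small_shard, 4), round(large_shard, 4))
```

One way to test this hypothesis is to repeat the search with `search_type=dfs_query_then_fetch`, which uses index-wide term statistics, or to recreate the test index with a single shard, and then compare the ordering.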

{
    "query": {
        "match": {
          "keyword": {
            "query": "iphone",
            "analyzer": "lc_search",
            "type": "phrase"
          }
        }
    }
}