gitchennan / elasticsearch-analysis-lc-pinyin
An intelligent Chinese pinyin analysis plugin for Elasticsearch that supports full-pinyin, first-letter, and mixed Chinese searches
License: Artistic License 2.0
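Below is the error message. Also, could you provide a prebuilt binary package that works after simply unzipping it, with no compilation needed?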
[2017-07-04T09:38:28,839][ERROR][o.e.b.Bootstrap ] Exception
java.lang.IllegalArgumentException: plugin [analysis-lc-pinyin] is incompatible with version [5.4.3]; was designed for version [5.3.0]
at org.elasticsearch.plugins.PluginInfo.readFromProperties(PluginInfo.java:146) ~[elasticsearch-5.4.3.jar:5.4.3]
at org.elasticsearch.bootstrap.Spawner.spawnNativePluginControllers(Spawner.java:86) ~[elasticsearch-5.4.3.jar:5.4.3]
at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:167) ~[elasticsearch-5.4.3.jar:5.4.3]
at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:350) [elasticsearch-5.4.3.jar:5.4.3]
at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:123) [elasticsearch-5.4.3.jar:5.4.3]
at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:114) [elasticsearch-5.4.3.jar:5.4.3]
at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:67) [elasticsearch-5.4.3.jar:5.4.3]
at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:122) [elasticsearch-5.4.3.jar:5.4.3]
at org.elasticsearch.cli.Command.main(Command.java:88) [elasticsearch-5.4.3.jar:5.4.3]
at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:91) [elasticsearch-5.4.3.jar:5.4.3]
at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:84) [elasticsearch-5.4.3.jar:5.4.3]
[2017-07-04T09:38:28,852][WARN ][o.e.b.ElasticsearchUncaughtExceptionHandler] [node-1] uncaught exception in thread [main]
org.elasticsearch.bootstrap.StartupException: java.lang.IllegalArgumentException: plugin [analysis-lc-pinyin] is incompatible with version [5.4.3]; was designed for version [5.3.0]
at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:127) ~[elasticsearch-5.4.3.jar:5.4.3]
at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:114) ~[elasticsearch-5.4.3.jar:5.4.3]
at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:67) ~[elasticsearch-5.4.3.jar:5.4.3]
at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:122) ~[elasticsearch-5.4.3.jar:5.4.3]
at org.elasticsearch.cli.Command.main(Command.java:88) ~[elasticsearch-5.4.3.jar:5.4.3]
at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:91) ~[elasticsearch-5.4.3.jar:5.4.3]
at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:84) ~[elasticsearch-5.4.3.jar:5.4.3]
Caused by: java.lang.IllegalArgumentException: plugin [analysis-lc-pinyin] is incompatible with version [5.4.3]; was designed for version [5.3.0]
at org.elasticsearch.plugins.PluginInfo.readFromProperties(PluginInfo.java:146) ~[elasticsearch-5.4.3.jar:5.4.3]
at org.elasticsearch.bootstrap.Spawner.spawnNativePluginControllers(Spawner.java:86) ~[elasticsearch-5.4.3.jar:5.4.3]
at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:167) ~[elasticsearch-5.4.3.jar:5.4.3]
at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:350) ~[elasticsearch-5.4.3.jar:5.4.3]
at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:123) ~[elasticsearch-5.4.3.jar:5.4.3]
... 6 more
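The exception above is raised because Elasticsearch 5.x refuses to load a plugin whose descriptor names a different Elasticsearch version than the running node. A minimal sketch of that check follows, assuming the descriptor is the usual `plugin-descriptor.properties` file with an `elasticsearch.version` key; the helper names and the sample descriptor text are illustrative only and are not part of the plugin.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def is_compatible(descriptor_text, node_version):
    """Return True only when the declared version equals the node version."""
    declared = parse_properties(descriptor_text).get("elasticsearch.version")
    return declared == node_version


# Illustrative descriptor content mirroring the versions in the log above.
descriptor = """
# sample values, not copied from the real plugin package
name=analysis-lc-pinyin
elasticsearch.version=5.3.0
"""

print(is_compatible(descriptor, "5.4.3"))  # False
print(is_compatible(descriptor, "5.3.0"))  # True
```

In practice this means the plugin has to be rebuilt against the exact target version (or a matching release has to be installed); a check like this only helps to spot the mismatch before starting the node.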
When a word contains a polyphonic character, are all of that character's readings converted into terms? For example:
GET /_analyze?analyzer=lc_index&text=成长&pretty
{
  "tokens" : [
    { "token" : "成", "start_offset" : 0, "end_offset" : 1, "type" : "word", "position" : 0 },
    { "token" : "cheng", "start_offset" : 0, "end_offset" : 1, "type" : "word", "position" : 0 },
    { "token" : "c", "start_offset" : 0, "end_offset" : 1, "type" : "word", "position" : 0 },
    { "token" : "长", "start_offset" : 1, "end_offset" : 2, "type" : "word", "position" : 1 },
    { "token" : "zhang", "start_offset" : 1, "end_offset" : 2, "type" : "word", "position" : 1 },
    { "token" : "z", "start_offset" : 1, "end_offset" : 2, "type" : "word", "position" : 1 },
    { "token" : "chang", "start_offset" : 1, "end_offset" : 2, "type" : "word", "position" : 1 },
    { "token" : "c", "start_offset" : 1, "end_offset" : 2, "type" : "word", "position" : 1 }
  ]
}
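Clearly, only one reading of 长 should appear here: 成长 is read "chengzhang", so the reading "chang" should not be emitted.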
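Does the plugin not support a dictionary for polyphonic words? Or is there some other way to handle this case? Thanks for your help.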
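One common way to handle polyphonic characters is a phrase-level dictionary that overrides the per-character readings whenever a known word is found. The sketch below illustrates that idea only; it is not the plugin's implementation, and both tables (`CHAR_READINGS`, `PHRASE_READINGS`) are tiny hypothetical samples.

```python
# Hypothetical sample data: all readings per character, plus phrase overrides.
CHAR_READINGS = {"成": ["cheng"], "长": ["zhang", "chang"], "江": ["jiang"]}
PHRASE_READINGS = {"成长": ["cheng", "zhang"], "长江": ["chang", "jiang"]}


def pinyin_readings(text):
    """Return one list of readings per character.

    Characters covered by a known phrase get exactly one reading; all other
    characters fall back to every reading in CHAR_READINGS.
    """
    result = []
    i = 0
    while i < len(text):
        phrase = None
        # Look for the longest known phrase (2+ characters) starting at i.
        for j in range(len(text), i + 1, -1):
            if text[i:j] in PHRASE_READINGS:
                phrase = text[i:j]
                break
        if phrase:
            result.extend([reading] for reading in PHRASE_READINGS[phrase])
            i += len(phrase)
        else:
            result.append(list(CHAR_READINGS.get(text[i], [text[i]])))
            i += 1
    return result


print(pinyin_readings("成长"))  # [['cheng'], ['zhang']]
print(pinyin_readings("长"))    # [['zhang', 'chang']]
```

With a lookup like this, 成长 yields only "zhang" for 长, while an isolated 长 still produces both readings.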
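As the title says. Thanks.
Hello, I want to support searching by Chinese characters, full pinyin, and pinyin initials. Below are my configuration and the indexed data. Searching for "yd", "yundong", and "运动" all return results, so why does "yundo" return nothing?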
curl -XPUT http://192.168.0.101:9200/index/ -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ik_letter_smart": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": [
            "lc_first_letter"
          ]
        },
        "ik_py_smart": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": [
            "lc_full_pinyin"
          ]
        }
      }
    }
  }
}'
curl -XPOST http://192.168.0.101:9200/index/_mapping/brand -d'
{
  "brand": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "lc_index",
        "search_analyzer": "lc_search",
        "term_vector": "with_positions_offsets"
      }
    }
  }
}'
curl -XPOST http://192.168.0.101:9200/index/brand/1 -d'{"name":"百度"}'
curl -XPOST http://192.168.0.101:9200/index/brand/8 -d'{"name":"百度糯米"}'
curl -XPOST http://192.168.0.101:9200/index/brand/2 -d'{"name":"阿里巴巴"}'
curl -XPOST http://192.168.0.101:9200/index/brand/3 -d'{"name":"腾讯科技"}'
curl -XPOST http://192.168.0.101:9200/index/brand/4 -d'{"name":"网易游戏"}'
curl -XPOST http://192.168.0.101:9200/index/brand/9 -d'{"name":"大众点评"}'
curl -XPOST http://192.168.0.101:9200/index/brand/10 -d'{"name":"携程旅行网"}'
curl -XPOST http://192.168.0.101:9200/index/brand/11 -d'{"name":"运动"}'
curl -XPOST http://192.168.0.101:9200/index/brand/12 -d'{"name":"运动鞋"}'
curl -XPOST http://192.168.0.101:9200/index/brand/13 -d'{"name":"运动鞋 男"}'
The README only describes how to use the plugin, not how to install it. Could installation instructions be added?
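Currently, search mode does not let the user choose between first-letter tokenization and smart best-match tokenization. Both modes are now exposed as user-configurable parameters.
Tokenizer
lc_index: parameter mode: full_pinyin, first_letter, chinese_char
lc_search: parameter mode: smart_pinyin, single_letter
The above is what you provided, but how is this mode actually used in practice?
These are my settings: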
{
  "number_of_shards": 5,
  "number_of_replicas": 1,
  "index": {
    "settings": {
      "analysis": {
        "analyzer": {
          "lc_analyzer": {
            "type": "custom",
            "tokenizer": {
              "lc_index":{
                "mode":"full_pinyin"
              }
            },
            "filter": [
              "lc_full_pinyin"
            ]
          }
        }
      }
    }
  }
}
This is my mapping:
{
  "news": {
    "properties": {
      "newsId": {
        "type": "long"
      },
      "cityId": {
        "type": "integer"
      },
      "desId": {
        "type": "String"
      },
      "newsTitle": {
        "type": "string",
        "store": true,
        "analyzer": "lc_analyzer",
        "search_analyzer": "lc_search"
      },
      "newsTitlePinYin": {
        "type": "string",
        "store": true,
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      },
      "newsTitleJianPin": {
        "type": "string",
        "store": true,
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      },
      "newsTitleSource": {
        "type": "string",
        "store": true,
        "index": "not_analyzed"
      },
      "newsTitlePinYinSource": {
        "type": "string",
        "store": true,
        "index": "not_analyzed"
      },
      "newsTitleJianPinSource": {
        "type": "string",
        "store": true,
        "index": "not_analyzed"
      },
      "newsAbstract": {
        "type": "string",
        "store": true,
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      },
      "newsEditor": {
        "type": "string",
        "store": true,
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      },
      "editorNickName":{
        "type": "string",
        "store": true,
        "index": "not_analyzed"
      },
      "editorAvatar":{
        "type": "string",
        "store": true,
        "index": "not_analyzed"
      },
      "publishTime":{
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
      },
      "newsTitleSuggest": {
        "type": "completion",
        "payloads": true,
        "analyzer": "ik_smart",
        "search_analyzer": "ik_smart"
      }
    }
  }
}
And here is the error:
{
  "error": {
    "root_cause": [
      {
        "type": "mapper_parsing_exception",
        "reason": "analyzer [lc_analyzer] not found for field [newsTitle]"
      }
    ],
    "type": "mapper_parsing_exception",
    "reason": "analyzer [lc_analyzer] not found for field [newsTitle]"
  },
  "status": 400
}
=========================== A different setting does not work either =======================
{
  "number_of_shards": 5,
  "number_of_replicas": 1,
  "index": {
    "settings": {
      "analysis": {
        "analyzer": {
          "lc_analyzer": {
            "type": "custom",
            "tokenizer": "lc_index",
            "filter": [
              "lc_full_pinyin"
            ]
          }
        }
      }
    }
  }
}
This produces the same error as above: analyzer [lc_analyzer] not found for field [newsTitle].
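Two things in the settings above look suspicious, independent of the plugin: in Elasticsearch the `tokenizer` of a custom analyzer must be the name of a tokenizer (a string, not an inline object), and the `analysis` block is expected directly under `settings` in the index-creation body rather than under `index.settings`. The sketch below builds a body in that shape. It assumes the `lc_index` tokenizer type accepts the `mode` parameter listed earlier in this thread (not verified against the plugin source), and the tokenizer name `my_lc_index` is a hypothetical placeholder.

```python
import json


def build_index_body(mode="full_pinyin"):
    """Build an index-creation body with a parameterized lc_index tokenizer."""
    return {
        "settings": {
            "number_of_shards": 5,
            "number_of_replicas": 1,
            "analysis": {
                # Custom tokenizer definition; "my_lc_index" is a made-up name.
                "tokenizer": {
                    "my_lc_index": {"type": "lc_index", "mode": mode}
                },
                # The analyzer refers to the tokenizer by name only.
                "analyzer": {
                    "lc_analyzer": {
                        "type": "custom",
                        "tokenizer": "my_lc_index",
                    }
                },
            },
        }
    }


print(json.dumps(build_index_body(), indent=2))
```

The printed JSON can be sent as the body of the index-creation request; the mapping can then reference `lc_analyzer` once the index exists with these settings.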
When using the ES completion suggester and indexing data with lc_index, the ES process hangs and CPU usage goes straight to 100%. What could be the cause?
The field mapping is as follows:
"suggestText": {
  "type": "completion",
  "analyzer": "lc_index",
  "search_analyzer": "lc_search",
  "payloads": true,
  "preserve_separators": false,
  "preserve_position_increments": true,
  "max_input_length": 50
}
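lc_index works fine when indexing fields whose type is not completion, and other analyzers such as ik_max_word and ik_smart also index completion fields without problems. Only indexing completion fields with lc_index causes enormous CPU usage and extremely slow indexing.
For example, searching for "ali" returns both "阿里巴巴" and "你是阿里". Is it possible to return only entries that start with "阿里", rather than everything that contains "阿里"? Thanks.
I tested locally using the demo provided with lc. When searching for "baidu", the entry "百度" scores lower than "百度糯米".
Here is my query result: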
{
  "took": 14,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 2.8384802,
    "hits": [
      {
        "_index": "index",
        "_type": "brand",
        "_id": "8",
        "_score": 2.8384802,
        "_source": {
          "name": "百度糯米"
        },
        "highlight": {
          "name": [
            "百度糯米"
          ]
        }
      },
      {
        "_index": "index",
        "_type": "brand",
        "_id": "1",
        "_score": 0.8271048,
        "_source": {
          "name": "百度"
        },
        "highlight": {
          "name": [
            "百度"
          ]
        }
      }
    ]
  }
}
The question is: why do the results differ between the ES version used in the demo and ES 5.5.2? The demo shows "百度" first and "百度糯米" second.
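One possible contributor to a ranking like the one above is that, by default, Elasticsearch computes term statistics per shard, so with only a handful of documents spread over 5 shards the same term can receive very different IDF weights depending on which shard a document landed in. The sketch below uses the IDF formula from Lucene's BM25 similarity (the default similarity in Elasticsearch 5.x) with made-up shard counts; it illustrates the effect and is not a diagnosis of this particular index.

```python
import math


def bm25_idf(doc_count, doc_freq):
    """IDF as used by Lucene's BM25 similarity: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical shards: the same term, seen by two shards with different contents.
idf_shard_a = bm25_idf(doc_count=3, doc_freq=1)  # 1 of 3 docs contains the term
idf_shard_b = bm25_idf(doc_count=1, doc_freq=1)  # the only doc contains the term

print(idf_shard_a > idf_shard_b)  # True
```

If this is the cause, running the same search with `search_type=dfs_query_then_fetch`, or repeating the test on a single-shard index, should make the scores of the two documents directly comparable.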
I have Elasticsearch 6.0.0 installed, but the latest version of this plugin targets 5.3. How can I make it compatible?
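Currently, tokenization in search mode uses reverse maximum matching and does not take into account how many single letters are left over after segmentation. This changes it to forward matching with backtracking to find the optimal segmentation.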
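The idea of forward matching with backtracking can be sketched as follows. This is not the plugin's code: the syllable table is a tiny hypothetical subset, and "optimal" is assumed here to mean the fewest leftover single letters, with ties broken by the fewest tokens.

```python
from functools import lru_cache

# Tiny illustrative subset of valid pinyin syllables (hypothetical sample).
SYLLABLES = frozenset({"shan", "shang", "gu", "pa", "yun", "dong"})
MAX_LEN = max(len(s) for s in SYLLABLES)


def segment(text):
    """Split text into syllables and leftover single letters at minimal cost."""

    @lru_cache(maxsize=None)
    def best(i):
        # Returns ((single_letters, token_count), tokens) for text[i:].
        if i == len(text):
            return (0, 0), ()
        candidates = []
        # Forward matching: try the longest piece first, then shorter ones.
        for j in range(min(len(text), i + MAX_LEN), i, -1):
            piece = text[i:j]
            is_syllable = piece in SYLLABLES
            # A single letter is always allowed as a fallback token.
            if is_syllable or j == i + 1:
                (singles, count), rest = best(j)
                cost = (singles + (0 if is_syllable else 1), count + 1)
                candidates.append((cost, (piece,) + rest))
        # Backtracking: keep the cheapest of all explored alternatives.
        return min(candidates, key=lambda c: c[0])

    return list(best(0)[1])


print(segment("shangu"))  # ['shan', 'gu']
print(segment("payh"))    # ['pa', 'y', 'h']
```

A purely greedy forward match would take "shang" first and leave "u" as a stray letter; exploring the shorter prefix "shan" is what recovers the cleaner split "shan" + "gu".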
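Searching for entries that contain English words does not work correctly
When tokenizing with lc_search, I want to match on first letters: entering payh should return 平安银行, but because the input is tokenized as pa, y, h, the correct result is not found.
Thanks
For example, the index contains the keyword "iphone 6s", but searching for "iphone" finds nothing. What is going on?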
{
  "query": {
    "match": {
      "keyword": {
        "query": "iphone",
        "analyzer": "lc_search",
        "type": "phrase"
      }
    }
  }
}
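As a side note on the query above, the `type: phrase` option of the `match` query was deprecated in Elasticsearch 5.0 in favor of the dedicated `match_phrase` query. The sketch below builds the equivalent `match_phrase` body; the helper name `phrase_query` is hypothetical, and rewriting the query this way does not by itself explain why "iphone" fails to match.

```python
import json


def phrase_query(field, text, analyzer=None):
    """Build a match_phrase query body, optionally overriding the analyzer."""
    clause = {"query": text}
    if analyzer is not None:
        clause["analyzer"] = analyzer
    return {"query": {"match_phrase": {field: clause}}}


print(json.dumps(phrase_query("keyword", "iphone", "lc_search"), indent=2))
```

To narrow down the miss itself, it may help to run both the indexed text "iphone 6s" and the search text "iphone" through the `_analyze` API with the index and search analyzers and compare the resulting tokens and positions.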