
BosonNLP Chinese Analyzer Plugin for ElasticSearch (Beta)

Overview

ElasticSearch is a powerful Lucene-based search server and one of the most popular enterprise search engines. ES itself, however, is limited when it comes to Chinese segmentation and search, because its built-in analyzers handle Chinese in only two ways. One is unigrams: every single Chinese character is crudely treated as its own token. The other is bigrams: every pair of adjacent characters is treated as a token. Neither approach meets real-world Chinese segmentation needs, and search quality suffers as a result. BosonNLP therefore developed an ElasticSearch plugin based on BosonNLP Chinese word segmentation (BosonNLP Analyzer for ElasticSearch) to make accurate Chinese search easy.
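The two built-in strategies described above can be illustrated with a short Python sketch (hypothetical helper names for illustration only, not ES code):

```python
# Sketch of the two built-in Chinese tokenization strategies mentioned above.

def unigrams(text):
    """Split every character into its own token (ES 'unigram' behaviour)."""
    return [ch for ch in text if not ch.isspace()]

def bigrams(text):
    """Emit every pair of adjacent characters as a token (ES 'bigram' behaviour)."""
    chars = [ch for ch in text if not ch.isspace()]
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]

print(unigrams("中文分词"))  # ['中', '文', '分', '词']
print(bigrams("中文分词"))   # ['中文', '文分', '分词']
```

Note that bigrams produce tokens like 文分 that are not real words, which is why dictionary-based segmentation gives better search results.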

Installation

Dependencies

Official ElasticSearch installation guide: https://www.elastic.co/guide/en/elasticsearch/guide/1.x/_installing_elasticsearch.html

Choosing a plugin version

The plugin versions and their corresponding ElasticSearch versions are:

BosonNLP version    ES version
master              2.2.0 -> master
1.3.0-beta          2.2.0
1.2.2-beta          2.1.2
1.2.1-beta          2.1.1
1.2.0-beta          2.1.0
1.1.0-beta          2.0.0
1.0.0-beta          1.7.x

Installing the plugin

Two ways to install the plugin are currently provided.

Method 1

Download the plugin directly from its GitHub link; the links for the different versions are given in the corresponding versions' READMEs. The example below is the plugin install command for ES 2.0.0 and above.

$ sudo bin/plugin install https://github.com/bosondata/elasticsearch-analysis-bosonnlp/releases/download/{version}/elasticsearch-analysis-bosonnlp-{version}.zip

Example: to install the 1.3.0 version of the plugin, fill in the corresponding version for {version}; the exact command is given in the README of the 1.3.0-beta branch:

$ sudo bin/plugin install https://github.com/bosondata/elasticsearch-analysis-bosonnlp/releases/download/1.3.0-beta/elasticsearch-analysis-bosonnlp-1.3.0-beta.zip
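The version table and URL template above can be combined into a small helper (an illustration based on this README, not an official tool):

```python
# Plugin version for each ES version, from the compatibility table above.
PLUGIN_FOR_ES = {
    "2.2.0": "1.3.0-beta",
    "2.1.2": "1.2.2-beta",
    "2.1.1": "1.2.1-beta",
    "2.1.0": "1.2.0-beta",
    "2.0.0": "1.1.0-beta",
    "1.7.x": "1.0.0-beta",
}

# Release download URL template used by `bin/plugin install`.
URL_TEMPLATE = (
    "https://github.com/bosondata/elasticsearch-analysis-bosonnlp/"
    "releases/download/{v}/elasticsearch-analysis-bosonnlp-{v}.zip"
)

def plugin_url(es_version):
    """Return the plugin download URL matching a given ES version."""
    return URL_TEMPLATE.format(v=PLUGIN_FOR_ES[es_version])

print(plugin_url("2.2.0"))
```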

Method 2

Build the plugin locally.

  1. Build the project package

    Download the BosonNLP Chinese analyzer project, then build the package with Maven from the project root:

    mvn clean package
    

    The built package elasticsearch-analysis-bosonnlp-{version}.zip is generated under target/releases/.

  2. Install the plugin

    Load the plugin through ElasticSearch's plugin command by running the following from the ElasticSearch root directory:

    $ sudo bin/plugin install file:/root/path/to/your/elasticsearch-analysis-bosonnlp-{version}.zip

Configuration

Before running ElasticSearch, edit elasticsearch.yml in the config folder to define the BosonNLP Chinese analyzer and fill in your BosonNLP API_TOKEN and the BosonNLP tag API address, i.e. append the following at the end of the file:

index:
  analysis:
    analyzer:
      bosonnlp:
          type: bosonnlp
          API_URL: http://api.bosonnlp.com/tag/analysis
          # You MUST give the API_TOKEN value, otherwise it doesn't work
          API_TOKEN: *PUT YOUR API TOKEN HERE*
          # Please uncomment if you want to specify ANY ONE of the following
          # arguments, otherwise the DEFAULT value will be used, i.e.,
          # space_mode is 0,
          # oov_level is 3,
          # t2s is 0,
          # special_char_conv is 0.
          # More details can be found in the bosonnlp docs:
          # http://docs.bosonnlp.com/tag.html
          #
          #
          # space_mode: put your value here (range 0-3)
          # oov_level: put your value here (range 0-4)
          # t2s: put your value here (range 0-1)
          # special_char_conv: put your value here (range 0-1)
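These parameters are presumably appended to API_URL as a query string when the plugin calls the tag API; a rough Python sketch of that URL construction, stated as an assumption rather than the plugin's actual code:

```python
from urllib.parse import urlencode

# Default parameter values as listed in the YAML comments above.
DEFAULTS = {"space_mode": 0, "oov_level": 3, "t2s": 0, "special_char_conv": 0}

def build_request_url(api_url, **overrides):
    """Append the segmentation parameters to API_URL as a query string."""
    params = {**DEFAULTS, **overrides}
    return api_url + "?" + urlencode(params)

print(build_request_url("http://api.bosonnlp.com/tag/analysis", t2s=1))
```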

Please note:

  1. You must set API_URL to the tag API address given above and replace *PUT YOUR API TOKEN HERE* in API_TOKEN: with your BosonNLP API_TOKEN, otherwise the analyzer will not work. The API_TOKEN is obtained by registering a BosonNLP account.

  2. If other analyzers are already configured in the file, add the bosonnlp analyzer under analyzer as shown above.

  3. If there are multiple nodes and each of them needs the BosonNLP plugin, the plugin must be installed and the yml file configured as above on every node.

  4. In addition, BosonNLP segmentation provides four parameters (space_mode, oov_level, t2s, special_char_conv) to cover different segmentation needs. If the default values suit you, no change is needed; otherwise, uncomment the corresponding parameter and assign it a value.

Example: to enable traditional-to-simplified Chinese conversion (t2s), uncomment t2s and set it:

 t2s: 1

More information about the BosonNLP segmentation parameters can be found in the BosonNLP documentation (http://docs.bosonnlp.com/tag.html).

Once configured, you can start ElasticSearch. Any later change to these settings requires an ElasticSearch restart to take effect.

Testing

Segmentation test

Run ElasticSearch

The log shows the plugin loaded successfully:

...
[time][INFO ][plugins] [Gaza] loaded [analysis-bosonnlp]
...

Create an index

curl -XPUT 'localhost:9200/test'

Test whether the analyzer is configured correctly:

curl -XGET 'localhost:9200/test/_analyze?analyzer=bosonnlp&pretty' -d '这是玻森数据分词的测试'

Result

{
  "tokens" : [ {
    "token" : "",
    "start_offset" : 0,
    "end_offset" : 1,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "",
    "start_offset" : 1,
    "end_offset" : 2,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "玻森",
    "start_offset" : 2,
    "end_offset" : 4,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "数据",
    "start_offset" : 4,
    "end_offset" : 6,
    "type" : "word",
    "position" : 3
  }, {
    "token" : "分词",
    "start_offset" : 6,
    "end_offset" : 8,
    "type" : "word",
    "position" : 4
  }, {
    "token" : "",
    "start_offset" : 8,
    "end_offset" : 9,
    "type" : "word",
    "position" : 5
  }, {
    "token" : "测试",
    "start_offset" : 9,
    "end_offset" : 11,
    "type" : "word",
    "position" : 6
  } ]
}
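The start_offset/end_offset values above follow directly from each token's position in the input string; a small sketch of how such offsets can be derived from an ordered token list:

```python
def with_offsets(text, tokens):
    """Attach start/end character offsets to an ordered list of tokens."""
    out, pos = [], 0
    for tok in tokens:
        start = text.index(tok, pos)   # find the token at or after `pos`
        end = start + len(tok)
        out.append({"token": tok, "start_offset": start, "end_offset": end})
        pos = end
    return out

result = with_offsets("这是玻森数据分词的测试",
                      ["这", "是", "玻森", "数据", "分词", "的", "测试"])
print(result[2])  # {'token': '玻森', 'start_offset': 2, 'end_offset': 4}
```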

Search test

Create the mapping

curl -XPUT 'localhost:9200/test/text/_mapping' -d'
{
  "text": {
    "properties": {
      "content": {
        "type": "string", 
        "analyzer": "bosonnlp", 
        "search_analyzer": "bosonnlp"
      }
    }
  }
}

Index some documents

curl -XPUT 'localhost:9200/test/text/1' -d'
{"content": "美称**武器商很神秘 花巨资海外参展却一言不发"}
'
curl -XPUT 'localhost:9200/test/text/2' -d'
{"content": "复旦发现江浙沪儿童体内普遍有兽用抗生素"}
'
curl -XPUT 'localhost:9200/test/text/3' -d'
{"content": "37年后重启顶层设计 **未来城市发展料现四大变化"}
'

Search query

curl -XPOST 'localhost:9200/test/text/_search?pretty'  -d'
{
  "query" : { "term" : { "content" : "**" }}
}
'

Result

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.076713204,
    "hits" : [ {
      "_index" : "test",
      "_type" : "text",
      "_id" : "1",
      "_score" : 0.076713204,
      "_source":
{
 "content": "美称**武器商很神秘 花巨资海外参展却一言不发"}
    }, {
      "_index" : "test",
      "_type" : "text",
      "_id" : "3",
      "_score" : 0.076713204,
      "_source":
{
 "content": "37年后重启顶层设计 **未来城市发展料现四大变化"}
    } ]
  }
}

Search query

curl -XPOST 'localhost:9200/test/text/_search?pretty'  -d'
{
  "query" : { "term" : { "content" : "国武" }}
}'

Result

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

Search query

curl -XPOST 'localhost:9200/test/text/_search?pretty'  -d'
{
  "query" : { "term" : { "content" : "国" }}
}'

Result

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

For comparison, querying with ES's default analyzer (Standard Analyzer) gives the following results:

Search query

curl -XPOST 'localhost:9200/test/text/_search?pretty'  -d'
{
  "query" : { "term" : { "content" : "国" }}
}'

Result

{
  "took" : 8,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.057534903,
    "hits" : [ {
      "_index" : "test3",
      "_type" : "text",
      "_id" : "1",
      "_score" : 0.057534903,
      "_source":
{"content": "美称**武器商很神秘 花巨资海外参展却一言不发"}
    }, {
      "_index" : "test3",
      "_type" : "text",
      "_id" : "3",
      "_score" : 0.057534903,
      "_source":
{"content": "37年后重启顶层设计 **未来城市发展料现四大变化"}

    } ]
  }
}

Search query

curl -XPOST 'localhost:9200/test3/text/_search?pretty' -d '
{
 "query": {"term":{"content":"**"}}
}'

Result

{
  "took" : 14,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

Configuring a Token Filter

The BosonNLP analyzer has no built-in token filter. If you need to filter tokens, you can combine the BosonNLP tokenizer with the token filters provided by ES to build a custom analyzer.

Steps

Configuring a custom analyzer takes the following three steps:

  • Add the BosonNLP tokenizer

Add a tokenizer entry under analysis in the elasticsearch.yml file, and add the BosonNLP tokenizer configuration under it:

index:
  analysis:
    analyzer:
      ...
    tokenizer:
      bosonnlp:
          type: bosonnlp
          API_URL: http://api.bosonnlp.com/tag/analysis
          # You MUST give the API_TOKEN value, otherwise it doesn't work
          API_TOKEN: *PUT YOUR API TOKEN HERE*
          # Please uncomment if you want to specify ANY ONE of the following
          # arguments, otherwise the DEFAULT value will be used, i.e.,
          # space_mode is 0,
          # oov_level is 3,
          # t2s is 0,
          # special_char_conv is 0.
          # More details can be found in the bosonnlp docs:
          # http://docs.bosonnlp.com/tag.html
          #
          #
          # space_mode: put your value here (range 0-3)
          # oov_level: put your value here (range 0-4)
          # t2s: put your value here (range 0-1)
          # special_char_conv: put your value here (range 0-1)

Again, please note:

  1. You must set API_URL to the tag API address and replace *PUT YOUR API TOKEN HERE* in API_TOKEN: with your BosonNLP API_TOKEN, otherwise the BosonNLP tokenizer will not work.
  2. If other tokenizers are already configured in the file, add the bosonnlp tokenizer under tokenizer as shown above.
  3. To change a parameter's default value, uncomment the corresponding parameter and assign it a value.
  • Add the token filter

Add a filter entry under analysis in the elasticsearch.yml file and configure the filter you need (in the example below we use the lowercase filter):

index:
  analysis:
    analyzer:
      ...
    tokenizer:
      ...
    filter:
      lowercase:
          type: lowercase
  • Add the custom analyzer

Add the custom analyzer's configuration under analyzer in the elasticsearch.yml file (in the example below, the custom analyzer is named filter_bosonnlp):

index:
  analysis:
    analyzer:
      ...
      filter_bosonnlp:
          type: custom
          tokenizer: bosonnlp
          filter: [lowercase]

To add other filters, configure each filter as above and then append it to the filter: [] list, separated by commas.

The complete custom analyzer configuration:

index:
  analysis:
    analyzer:
      filter_bosonnlp:
          type: custom
          tokenizer: bosonnlp
          filter: [lowercase]
    tokenizer:
      bosonnlp:
          type: bosonnlp
          API_URL: http://api.bosonnlp.com/tag/analysis
          # You MUST give the API_TOKEN value, otherwise it doesn't work
          API_TOKEN: *PUT YOUR API TOKEN HERE*
          # Please uncomment if you want to specify ANY ONE of the following
          # arguments, otherwise the DEFAULT value will be used, i.e.,
          # space_mode is 0,
          # oov_level is 3,
          # t2s is 0,
          # special_char_conv is 0.
          # More details can be found in the bosonnlp docs:
          # http://docs.bosonnlp.com/tag.html
          #
          #
          # space_mode: put your value here (range 0-3)
          # oov_level: put your value here (range 0-4)
          # t2s: put your value here (range 0-1)
          # special_char_conv: put your value here (range 0-1)
    filter:
      lowercase:
          type: lowercase
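Conceptually, a `type: custom` analyzer like the one configured above pipes the tokenizer's output through each token filter in order. A toy Python model of that composition (the whitespace tokenizer here is a stand-in for illustration, not the BosonNLP tokenizer):

```python
def whitespace_tokenizer(text):
    """Stand-in tokenizer: split on whitespace."""
    return text.split()

def lowercase_filter(tokens):
    """Mimics ES's built-in `lowercase` token filter."""
    return [t.lower() for t in tokens]

def custom_analyzer(text, tokenizer, filters):
    """Tokenize, then apply each filter in the order it is listed."""
    tokens = tokenizer(text)
    for f in filters:
        tokens = f(tokens)
    return tokens

print(custom_analyzer("BosonNLP 分词 Test",
                      whitespace_tokenizer, [lowercase_filter]))
# ['bosonnlp', '分词', 'test']
```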

Note

Because of the design of the index file structure in Lucene (the search core of ES), every field of every document must be analyzed separately, so BosonNLP's batch API cannot be used; this incurs a considerable time overhead in network IO.
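As back-of-envelope arithmetic for that overhead (illustrative numbers, not measurements):

```python
def api_calls(docs, fields_per_doc, batch_size=None):
    """HTTP calls needed: one per analyzed field without batching,
    versus one per batch if batching were possible."""
    items = docs * fields_per_doc
    if batch_size is None:
        return items                 # one remote call per field per document
    return -(-items // batch_size)   # ceiling division for batched calls

# Indexing 10,000 documents with 3 analyzed fields each:
print(api_calls(10_000, 3))                  # 30000 calls without batching
print(api_calls(10_000, 3, batch_size=100))  # 300 calls with batching
```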

elasticsearch-analysis-bosonnlp's People

Contributors

ba9els, bryant1410, mrluanma

elasticsearch-analysis-bosonnlp's Issues

2.4

No version for 2.4.0.

java.util.ConcurrentModificationException

Version 1.0.0

I got an exception like this when trying to index (no matter one by one or in bulk mode):

java.util.ConcurrentModificationException
at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:859)
at java.util.ArrayList$Itr.next(ArrayList.java:831)
at org.bosonnlp.analyzer.lucene.BosonNLPTokenizer.incrementToken(BosonNLPTokenizer.java:66)
at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:618)
at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:359)
at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:318)
at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:241)
at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:465)
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1526)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1252)
at org.elasticsearch.index.engine.InternalEngine.innerIndex(InternalEngine.java:432)
at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:364)
at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:511)
at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:196)
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$PrimaryPhase.performOnPrimary(TransportShardReplicationOperationAction.java:574)
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$PrimaryPhase$1.doRun(TransportShardReplicationOperationAction.java:440)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[2016-12-26 19:26:43,270][DEBUG][http.netty ] [Llan the Sorcerer] Caught exception while handling client http traffic, closing connection [id: 0x55c87666, /182.92.69.212:51374 => /182.92.242.58:9288]
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:64)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Has any performance testing been done?

Word segmentation is called very frequently and is usually done locally. With a remote API call like this, how is performance once data volumes grow? How many QPS can it sustain — can it reach 1000 QPS?

Will this open-source library continue to be updated?

The official ElasticSearch release is already at v5.0.2, while the last commit to this library was 8 months ago. Will it still be maintained?

Is there a version that supports 2.3.3?

dev /usr/share/elasticsearch/bin sudo ./plugin install file:/home/dev/elasticsearch-analysis-bosonnlp-1.3.0-beta.zip
-> Installing from file:/home/dev/elasticsearch-analysis-bosonnlp-1.3.0-beta.zip...
Trying file:/home/dev/elasticsearch-analysis-bosonnlp-1.3.0-beta.zip ...
Downloading .......................DONE
Verifying file:/home/dev/elasticsearch-analysis-bosonnlp-1.3.0-beta.zip checksums if available ...
NOTE: Unable to verify checksum for downloaded plugin (unable to find .sha1 or .md5 file to verify)
ERROR: Plugin [elasticsearch-analysis-bosonnlp] is incompatible with Elasticsearch [2.3.3]. Was designed for version [2.2.0]

Cannot set bosonnlp as the default analyzer for elasticsearch

After installing the bosonnlp plugin per the documentation, everything works. But after adding index.analysis.analyzer.default.type: bosonnlp to the elasticsearch.yml config file to make it the default analyzer, a problem appears: elasticsearch starts normally, but analysis during PUT/GET/POST operations fails with an error. Details:

1. Assume the bosonnlp plugin is installed and configured as the default analyzer via index.analysis.analyzer.default.type: bosonnlp, all existing data has been deleted, and elasticsearch has been restarted, so the test environment is clean.

2. Create the index
curl -XPUT 'localhost:9200/test'

3. Test whether the analyzer is configured correctly
curl -XGET 'localhost:9200/test/_analyze?analyzer=bosonnlp&pretty' -d '这是玻森数据分词的测试'

The test result shows segmentation succeeds.

4. Create the mapping (using the default analyzer)
{
  "text": {
    "properties": {
      "content": {
        "type": "string"
      }
    }
  }
}

5. Indexing data then fails:
curl -XPUT 'localhost:9200/test/text/1' -d'
{"content": "美称**武器商很神秘 花巨资海外参展却一言不发"}
'

Error response:
{
  "error": {
    "root_cause": [
      {
        "type": "remote_transport_exception",
        "reason": "[xxx][10.10.xxx.7:9300][indices:data/write/index[p]]"
      }
    ],
    "type": "runtime_exception",
    "reason": "java.net.MalformedURLException: no protocol: ?space_mode=0&oov_level=3&t2s=0&special_char_conv=0",
    "caused_by": {
      "type": "malformed_u_r_l_exception",
      "reason": "no protocol: ?space_mode=0&oov_level=3&t2s=0&special_char_conv=0"
    }
  },
  "status": 500
}
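The "no protocol" message suggests that when the default-analyzer path fails to pick up API_URL, the plugin is left trying to open just the query-string suffix. The same failure can be reproduced in Python (an analogue illustrating the assumed cause, not the plugin's code):

```python
import urllib.request

# With API_URL missing, only the query-string suffix remains,
# which is not a valid absolute URL, so opening it fails immediately
# (before any network connection is attempted).
try:
    urllib.request.urlopen("?space_mode=0&oov_level=3&t2s=0&special_char_conv=0")
except ValueError as e:
    print("rejected:", e)
```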
