parsianalyzer's Introduction

ParsiAnalyzer

ParsiAnalyzer is an analysis plugin for Elasticsearch. Analysis is a process that consists of the following steps:

  • Tokenizing a block of text into individual terms
  • Normalizing these terms into a standard form

An analyzer is a wrapper that combines character filters, a tokenizer, and token filters. Elasticsearch provides many built-in analyzers, but there is still room for improvement, especially for the Persian language. This plugin provides tools for tokenizing, normalizing, and stemming Persian text.

Key features

  • Tokenize Persian text

    • Convert whitespace to a zero-width non-joiner (نیم‌فاصله) where necessary. For example, می رود becomes می‌رود.
    • Convert Persian punctuation to its English equivalent. For example, ۳/۱۴ becomes ۳.۱۴.
    • Tokenize Persian text on whitespace and punctuation.
  • Normalize Persian tokens into a single canonical form

    • Transform all forms of Yeh, Kaf, Heh, and Hamza to a single form. For example, براي becomes برای.
    • Convert all Persian and Arabic digits to their English equivalents. For example, ۱۴۳ becomes 143.
    • Remove diacritics (اِعراب) from words. For example, اَرّه becomes اره.
    • Remove Kashida from words. For example, بادبــــــادک becomes بادبادک.
  • Remove common Persian stop words

    • Persian stop words such as از and به are removed.
  • Stem Persian words

    • Remove common Persian suffixes, for example ها or ان.
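
The normalization steps listed above can be sketched in a few lines of Python. This is only an illustration of the idea, not the plugin's actual Java implementation: the character table below (which Yeh and Kaf variants are folded, which diacritic range is stripped) is an assumption, and the name normalize is hypothetical.

```python
import re

# Assumed folding table for illustration; the plugin's real tables may differ.
CHAR_MAP = {
    0x064A: "\u06CC",  # Arabic Yeh -> Persian Yeh
    0x0649: "\u06CC",  # Alef Maksura -> Persian Yeh
    0x0643: "\u06A9",  # Arabic Kaf -> Persian Kaf (Keheh)
    0x0640: None,      # Kashida (Tatweel) is dropped
}

# Persian digits (U+06F0..U+06F9) and Arabic-Indic digits (U+0660..U+0669)
# are both mapped to ASCII digits.
for value in range(10):
    CHAR_MAP[0x06F0 + value] = str(value)
    CHAR_MAP[0x0660 + value] = str(value)

# Fathatan through Sukun: the common Arabic-script diacritics.
DIACRITICS = re.compile("[\u064B-\u0652]")


def normalize(token):
    """Fold character variants and digits, then strip diacritics."""
    return DIACRITICS.sub("", token.translate(CHAR_MAP))


# Persian digits for 143 become the ASCII string "143".
print(normalize("\u06F1\u06F4\u06F3"))
```

A single translation table keeps the per-character work to one pass over the token; the diacritic removal is kept separate only because a character range is easier to read as a regular expression.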

Installation

To install the plugin for Elasticsearch 7.13.1, run this command:

bin\elasticsearch-plugin install https://www.dropbox.com/s/cr61dmnx95taivi/ParsiAnalyzer-7.13.1.zip?dl=1

Build

If you want to build ParsiAnalyzer for any specific version of Elasticsearch, follow these steps:

  1. Make sure you have installed a JDK and Maven on your computer
  2. Clone the project
  3. Open pom.xml
  4. Under the dependencies tag, change the Elasticsearch version to your desired version
  5. Open plugin-descriptor.properties
  6. Change elasticsearch.version to your desired version
  7. Run this Maven command: mvn clean package
  8. In the target/releases folder, you'll now find a zip file. Install the plugin using this command: bin/elasticsearch-plugin install file:///path/to/ParsiAnalyzer.zip
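
Steps 5 and 6 can also be scripted. The sketch below rewrites the elasticsearch.version entry in the text of a plugin-descriptor.properties file; the function name set_es_version is hypothetical, and the sample descriptor only mirrors the key names used in this project. It does not touch pom.xml, which still has to be edited as in steps 3 and 4.

```python
import re

VERSION_LINE = re.compile(r"^elasticsearch\.version=.*$", re.MULTILINE)


def set_es_version(properties_text, version):
    """Return the descriptor text with elasticsearch.version set to `version`."""
    if not VERSION_LINE.search(properties_text):
        raise ValueError("no elasticsearch.version entry found")
    # A function replacement avoids any escape handling in the version string.
    return VERSION_LINE.sub(lambda match: "elasticsearch.version=" + version,
                            properties_text)


sample = (
    "name=ParsiAnalyzer\n"
    "java.version=1.8\n"
    "elasticsearch.version=6.4.3\n"
)
print(set_es_version(sample, "7.13.1"))
```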

Usage

To see how this plugin works, you can use Elasticsearch's analyze API:

POST _analyze
{
  "analyzer" : "parsi",
  "text" : "روباه قهوه‌اي چابك از روی سگ تنبل می پرد"
}

If you would rather not apply stemming, use the standard variant of ParsiAnalyzer:

POST _analyze
{
  "analyzer" : "parsi_standard",
  "text" : "روباه قهوه‌اي چابك از روی سگ تنبل می پرد"
}
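
The two requests above can also be issued from a script. The following is a minimal sketch using only the Python standard library; the base URL http://localhost:9200 and the helper names build_analyze_request and analyze are assumptions for illustration, and a secured cluster would additionally need credentials.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: a local, unsecured node


def build_analyze_request(text, stemming=True, base_url=ES_URL):
    """Build a POST _analyze request for the parsi or parsi_standard analyzer."""
    analyzer = "parsi" if stemming else "parsi_standard"
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(text, stemming=True):
    """Send the request and return the token strings from the response."""
    with urllib.request.urlopen(build_analyze_request(text, stemming)) as response:
        payload = json.load(response)
    return [entry["token"] for entry in payload["tokens"]]
```

Calling analyze with the same sentence once with stemming=True and once with stemming=False is a quick way to compare the two analyzers side by side.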

ParsiAnalyzer can be specified directly in the field mapping as follows:

PUT /my_index
{
  "mappings": {
    "properties": {
      "title": {
        "type":     "text",
        "analyzer": "parsi"
      }
    }
  }
}

Contact me

Email: n.esmaielyfard [at] gmail.com

parsianalyzer's People

Contributors

narimann2

parsianalyzer's Issues

Error Installing Plugin for Elasticsearch 7.7

Hi, I tried to install this analyzer for Elasticsearch 7.7, but it throws the error below:
[screenshot of the error]
I think it needs to be updated for version 7.7.
Please update it so we can use it.
Thanks

Issue with sorting names that start with گ چ پ ژ

When I sort my documents, those whose name starts with one of the characters گ, چ, پ, or ژ are not sorted correctly.
This is the index I created:

PUT index_persian_names_test_with_nariman_analyzer
{
  "mappings": {
    "properties": {
      "name": {
        "type" : "text",
        "fields" : {
          "keyword" : {
            "type" : "keyword",
            "ignore_above" : 256
          }
        },
        "analyzer": "persian_custom_analyzer"
      }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": 5,
      "max_result_window": 5000,
      "analysis": {
        "analyzer": {
          "english_custom_analyzer": {
            "filter": [
              "lowercase",
              "decimal_digit"
            ],
            "tokenizer": "classic"
          },
          "persian_custom_analyzer": {
            "filter": [
              "lowercase",
              "decimal_digit",
              "parsi_normalizer"
            ],
            "char_filter": [
              "zero_width_spaces"
            ],
            "type": "custom",
            "tokenizer": "standard"
          }
        },
        "char_filter": {
          "zero_width_spaces": {
            "type": "mapping",
            "mappings": [
              """\u200C => \u0020""",
              """\u200B => \u0020""",
              """\u200D => \u0020""",
              """\u200E => \u0020""",
              """\u200F => \u0020""",
              """\u001F => \u0020""",
              """\u00AC => \u0020"""
            ]
          }
        }
      },
      "number_of_replicas": 0
    }
  }
}

I've added these documents:

POST index_persian_names_test_with_nariman_analyzer/_doc
{
  "name": "کرگدن"
}


POST index_persian_names_test_with_nariman_analyzer/_doc
{
  "name": "فیل"
}


POST index_persian_names_test_with_nariman_analyzer/_doc
{
  "name": "پاندا"
}


POST index_persian_names_test_with_nariman_analyzer/_doc
{
  "name": "قناری"
}


POST index_persian_names_test_with_nariman_analyzer/_doc
{
  "name": "گراز وحشی"
}


POST index_persian_names_test_with_nariman_analyzer/_doc
{
  "name": "ژیان"
}

POST index_persian_names_test_with_nariman_analyzer/_doc
{
  "name": "یوزپلنگ"
}

Finally, when I look at the sorted results, the document that starts with پ should come first, but it does not.

GET index_persian_names_test_with_nariman_analyzer/_search
{
  "query": {
    "match_all": {
      
    }
  },
  "sort": [
    {
      "name.keyword": {
        "order": "asc"
      }
    }
  ]
}

Here is the result:

{
  "took" : 624,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 7,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [
      {
        "_index" : "index_persian_names_test_with_nariman_analyzer",
        "_type" : "_doc",
        "_id" : "F0qtn4oBMhBe8matcKHy",
        "_score" : null,
        "_source" : {
          "name" : "فیل"
        },
        "sort" : [
          "فیل"
        ]
      },
      {
        "_index" : "index_persian_names_test_with_nariman_analyzer",
        "_type" : "_doc",
        "_id" : "30qtn4oBMhBe8matgqHw",
        "_score" : null,
        "_source" : {
          "name" : "قناری"
        },
        "sort" : [
          "قناری"
        ]
      },
      {
        "_index" : "index_persian_names_test_with_nariman_analyzer",
        "_type" : "_doc",
        "_id" : "3Eqtn4oBMhBe8mateKHt",
        "_score" : null,
        "_source" : {
          "name" : "پاندا"
        },
        "sort" : [
          "پاندا"
        ]
      },
      {
        "_index" : "index_persian_names_test_with_nariman_analyzer",
        "_type" : "_doc",
        "_id" : "pkqtn4oBMhBe8matn6Jp",
        "_score" : null,
        "_source" : {
          "name" : "ژیان"
        },
        "sort" : [
          "ژیان"
        ]
      },
      {
        "_index" : "index_persian_names_test_with_nariman_analyzer",
        "_type" : "_doc",
        "_id" : "FUqtn4oBMhBe8matXqGp",
        "_score" : null,
        "_source" : {
          "name" : "کرگدن"
        },
        "sort" : [
          "کرگدن"
        ]
      },
      {
        "_index" : "index_persian_names_test_with_nariman_analyzer",
        "_type" : "_doc",
        "_id" : "4Eqtn4oBMhBe8mati6HX",
        "_score" : null,
        "_source" : {
          "name" : "گراز وحشی"
        },
        "sort" : [
          "گراز وحشی"
        ]
      },
      {
        "_index" : "index_persian_names_test_with_nariman_analyzer",
        "_type" : "_doc",
        "_id" : "qUqtn4oBMhBe8matpqKq",
        "_score" : null,
        "_source" : {
          "name" : "یوزپلنگ"
        },
        "sort" : [
          "یوزپلنگ"
        ]
      }
    ]
  }
}

I will be grateful for your help...
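
A possible explanation, offered as a sketch rather than a confirmed diagnosis: the sort runs on name.keyword, which is not analyzed, so values are compared by their encoded character values. In Unicode, پ (U+067E), چ (U+0686), ژ (U+0698), and گ (U+06AF) are placed after the basic Arabic letters such as ف (U+0641) and ق (U+0642), which matches the order in the result above. The Python below reproduces that order with a plain code-point sort and contrasts it with a sort driven by a hand-written Persian alphabet table. The table and the persian_key helper are illustrative assumptions, not part of Elasticsearch or the plugin.

```python
# Persian alphabet in dictionary order, written as escapes to avoid
# look-alike Arabic code points: alef-madda, alef, beh, peh, teh, ...
PERSIAN_ALPHABET = (
    "\u0622\u0627\u0628\u067E\u062A\u062B\u062C\u0686\u062D\u062E"
    "\u062F\u0630\u0631\u0632\u0698\u0633\u0634\u0635\u0636\u0637"
    "\u0638\u0639\u063A\u0641\u0642\u06A9\u06AF\u0644\u0645\u0646"
    "\u0648\u0647\u06CC"
)
RANK = {letter: position for position, letter in enumerate(PERSIAN_ALPHABET)}


def persian_key(text):
    """Sort key: characters outside the table first, then letters by alphabet position."""
    return [(1, RANK[ch]) if ch in RANK else (0, ord(ch)) for ch in text]


# The seven names indexed in this issue.
names = [
    "\u06A9\u0631\u06AF\u062F\u0646",                    # rhinoceros
    "\u0641\u06CC\u0644",                                # elephant
    "\u067E\u0627\u0646\u062F\u0627",                    # panda
    "\u0642\u0646\u0627\u0631\u06CC",                    # canary
    "\u06AF\u0631\u0627\u0632 \u0648\u062D\u0634\u06CC",  # wild boar
    "\u0698\u06CC\u0627\u0646",                          # Zhian
    "\u06CC\u0648\u0632\u067E\u0644\u0646\u06AF",        # cheetah
]

by_code_point = sorted(names)                 # same order as the search result
by_alphabet = sorted(names, key=persian_key)  # panda first, as expected

print(by_code_point)
print(by_alphabet)
```

Inside Elasticsearch, language-aware ordering is usually obtained through collation rather than analysis, for example the icu_collation_keyword field type of the analysis-icu plugin; whether that fits this index is something to verify.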

Use filters independently

I need to use each of the filters inside the parsi analyzer independently in a custom analyzer. In other words, I need to extend the parsi analyzer in a custom way. Is there a way to achieve that?
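
As a starting point, another issue on this page references a token filter named parsi_normalizer inside a custom analyzer. Assuming the plugin registers a filter under that name (this is not confirmed by the README, and no other filter names are known from this page), the index settings could be assembled as in the hypothetical sketch below. The analyzer name my_parsi and the helper custom_parsi_settings are made up for illustration.

```python
import json


def custom_parsi_settings(extra_filters=()):
    """Build index settings for a custom analyzer that reuses parsi_normalizer.

    "parsi_normalizer" is taken from another issue on this page and is an
    unverified assumption; extra_filters are appended after it.
    """
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    "my_parsi": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["parsi_normalizer", *extra_filters],
                    }
                }
            }
        }
    }


# Body for a PUT <index> request, with the built-in lowercase filter added.
print(json.dumps(custom_parsi_settings(["lowercase"]), indent=2))
```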

Support elastic 7.2

Do you plan to support Elasticsearch 7.2? It's a good plugin, but 6.4.2 is old and we want to upgrade to 7.2. Could you please add support for this version?

ERROR: Unknown plugin ParsiAnalyzer-1.0-SNAPSHOT.zip

I followed the build steps for version 7.17.4 and built ParsiAnalyzer-1.0-SNAPSHOT.zip under the target/releases folder. When I try to install it with elasticsearch-plugin, it fails:

-> Installing /path/to/ParsiAnalyzer/target/releases/ParsiAnalyzer-1.0-SNAPSHOT.zip
-> Failed installing /path/to/ParsiAnalyzer/target/releases/ParsiAnalyzer-1.0-SNAPSHOT.zip
-> Rolling back /path/to/ParsiAnalyzer/target/releases/ParsiAnalyzer-1.0-SNAPSHOT.zip
-> Rolled back /path/to/ParsiAnalyzer/target/releases/ParsiAnalyzer-1.0-SNAPSHOT.zip
A tool for managing installed elasticsearch plugins

Non-option arguments:
command              

Option             Description        
------             -----------        
-E <KeyValuePair>  Configure a setting
-h, --help         Show help          
-s, --silent       Show minimal output
-v, --verbose      Show verbose output

ERROR: Unknown plugin /path/to/ParsiAnalyzer/target/releases/ParsiAnalyzer-1.0-SNAPSHOT.zip

Any idea about this problem?
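
A likely cause, judging only from the log above: the installer was given a bare filesystem path, whereas step 8 of the build instructions passes a file:// URL. When the argument is not a URL, the tool appears to treat it as a plugin name, which would explain "Unknown plugin". The sketch below turns an absolute POSIX path into such a URL; the helper name to_file_url is hypothetical.

```python
from urllib.parse import quote


def to_file_url(abs_path):
    """Convert an absolute POSIX path into a file:// URL."""
    if not abs_path.startswith("/"):
        raise ValueError("expected an absolute POSIX path")
    # quote() keeps "/" intact and percent-encodes spaces and other unsafe characters.
    return "file://" + quote(abs_path)


zip_path = "/path/to/ParsiAnalyzer/target/releases/ParsiAnalyzer-1.0-SNAPSHOT.zip"
print("bin/elasticsearch-plugin install " + to_file_url(zip_path))
```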

Error Installing with Elasticsearch 6.4.3

Hi community, I changed the Elasticsearch dependency to 6.4.3, but on install I get the error below:

Exception in thread "main" java.nio.file.NoSuchFileException: /usr/share/elasticsearch/plugins/.installing-2138313496221851015/plugin-descriptor.properties
        at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
        at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
        at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
        at java.base/sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:215)
        at java.base/java.nio.file.Files.newByteChannel(Files.java:369)
        at java.base/java.nio.file.Files.newByteChannel(Files.java:415)
        at java.base/java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:384)
        at java.base/java.nio.file.Files.newInputStream(Files.java:154)
        at org.elasticsearch.plugins.PluginInfo.readFromProperties(PluginInfo.java:162)
        at org.elasticsearch.plugins.InstallPluginCommand.loadPluginInfo(InstallPluginCommand.java:713)
        at org.elasticsearch.plugins.InstallPluginCommand.installPlugin(InstallPluginCommand.java:792)
        at org.elasticsearch.plugins.InstallPluginCommand.install(InstallPluginCommand.java:775)
        at org.elasticsearch.plugins.InstallPluginCommand.execute(InstallPluginCommand.java:231)
        at org.elasticsearch.plugins.InstallPluginCommand.execute(InstallPluginCommand.java:216)
        at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86)
        at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:124)
        at org.elasticsearch.cli.MultiCommand.execute(MultiCommand.java:77)
        at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:124)
        at org.elasticsearch.cli.Command.main(Command.java:90)
        at org.elasticsearch.plugins.PluginCli.main(PluginCli.java:47)

And here is my plugin-descriptor.properties:

description=Persian analyzer for elasticsearch.
version=1.0
name=ParsiAnalyzer
classname=org.elasticsearch.analyzer.ParsiAnalyzerPlugin
java.version=1.8
elasticsearch.version=6.4.3

Installation file link (zip): https://drive.google.com/open?id=1ac-cPCRqCPXcTrN8xBKUJ_MLLr5ZfOSK
I followed these steps:

1- Download the latest source code
2- Open the pom.xml file
3- Under the dependencies tag, change the Elasticsearch version from 5.6.3 to 6.1.2 and save it
4- Open the plugin-descriptor.properties file
5- Change elasticsearch.version from 6.4.2 to 5.6.13
6- Run this Maven command: mvn clean package -DskipTests
7- In the folder target/releases you'll now find a zip file called ParsiAnalyzer. Install the plugin using this command:
bin/elasticsearch-plugin install file:///path/to/ParsiAnalyzer.zip

Notice: I'm using Dockerized Elasticsearch.

What's wrong with my configs?
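
One thing worth checking, as a guess based on the stack trace: the installer looks for plugin-descriptor.properties directly inside the extracted archive, so the descriptor needs to be a top-level entry of the zip rather than sitting inside a subfolder. The sketch below inspects an archive for that; descriptor_at_root and make_zip are hypothetical helpers, and make_zip only builds in-memory archives for demonstration.

```python
import io
import zipfile

DESCRIPTOR = "plugin-descriptor.properties"


def descriptor_at_root(zip_source):
    """Return True if the descriptor is a top-level entry of the archive."""
    with zipfile.ZipFile(zip_source) as archive:
        return DESCRIPTOR in archive.namelist()


def make_zip(entry_names):
    """Build an in-memory zip with empty entries (demonstration only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in entry_names:
            archive.writestr(name, "")
    buffer.seek(0)
    return buffer


flat = make_zip([DESCRIPTOR, "ParsiAnalyzer.jar"])
nested = make_zip(["elasticsearch/" + DESCRIPTOR, "elasticsearch/ParsiAnalyzer.jar"])
print(descriptor_at_root(flat), descriptor_at_root(nested))
```

For a real build, descriptor_at_root can be pointed at the zip file path under target/releases instead of an in-memory buffer.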

Not compatible

Hi,
The plugin is not compatible with Elasticsearch 8.9.0.
Please update it.
