jprante / elasticsearch-analysis-baseform

Baseform lemmatization for Elasticsearch

License: Apache License 2.0

Java 100.00%

elasticsearch-analysis-baseform's Issues

Installation fails with file not found

The plugin no longer seems to be available at this URL:

http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-analysis-baseform/1.3.0.0/elasticsearch-analysis-baseform-1.3.0.0-plugin.zip

Could not find plugin descriptor 'plugin-descriptor.properties' in plugin zip

While installing the plugin on Elasticsearch 2.2.0, I get the following error:

Could not find plugin descriptor 'plugin-descriptor.properties' in plugin zip

Upgraded version for 2.4

I tried installing this plugin on Elasticsearch 2.4 and the installation failed.

StackOverflowError in Dictionary.lookup

I'm using the plugin version from the elasticsearch-plugin-bundle 1.4.0.4 with ES 1.4.2 and I've configured a filter and analyzer like this:

"analysis": {
    "analyzer": {
        "german_foobar": {
            "tokenizer": "standard",
            "filter": [
                "german_foobar"
            ],
            "type": "custom"
        }
    },
    "filter": {
        "german_foobar": {
            "language": "de",
            "type": "baseform"
        }
    }
}

When I try to analyze the string "wurde zum tollen gemacht" with this analyzer, I get a StackOverflowError in Dictionary.lookup on the server:

Exception in thread "main" org.elasticsearch.common.util.concurrent.UncategorizedExecutionException: Failed execution
    at org.elasticsearch.action.support.AdapterActionFuture.rethrowExecutionException(AdapterActionFuture.java:92)
    at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:79)
    at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:61)
    at com.fileee.search.impl.DefaultSearchClient.analyze(DefaultSearchClient.java:389)
    at com.fileee.search.impl.DefaultSearchClient.main(DefaultSearchClient.java:696)
Caused by: java.util.concurrent.ExecutionException: java.lang.StackOverflowError
    at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:288)
    at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:261)
    at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:92)
    at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:72)
    ... 3 more
Caused by: java.lang.StackOverflowError
    at java.nio.charset.CharsetDecoder.replaceWith(CharsetDecoder.java:303)
    at java.nio.charset.CharsetDecoder.<init>(CharsetDecoder.java:207)
    at java.nio.charset.CharsetDecoder.<init>(CharsetDecoder.java:233)
    at sun.nio.cs.UTF_8$Decoder.<init>(UTF_8.java:84)
    at sun.nio.cs.UTF_8$Decoder.<init>(UTF_8.java:81)
    at sun.nio.cs.UTF_8.newDecoder(UTF_8.java:68)
    at java.lang.StringCoding.decode(StringCoding.java:213)
    at java.lang.String.<init>(String.java:451)
    at org.xbib.elasticsearch.index.analysis.baseform.Dictionary.lookup(Dictionary.java:58)
    at org.xbib.elasticsearch.index.analysis.baseform.Dictionary.lookup(Dictionary.java:59)
    at org.xbib.elasticsearch.index.analysis.baseform.Dictionary.lookup(Dictionary.java:59)
...

StackOverflowError with German article "einem"

Hey, good job! The baseform analyzer is really cool, doing exactly what I needed!

However, I get a StackOverflowError when indexing or analyzing text that contains the German article "einem":

GET /myindex/_analyze?analyzer=german&text=mit einem test&pretty=1

throws
[2013-12-17 18:48:47,382][DEBUG][action.admin.indices.analyze] [Karl] failed to execute [org.elasticsearch.action.admin.indices.analyze.AnalyzeRequest@41a330e4]
java.lang.StackOverflowError
    at sun.nio.cs.UTF_8$Decoder.decodeLoop(UTF_8.java:324)
    at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:561)
    at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:158)
    at java.lang.StringCoding.decode(StringCoding.java:196)
    at java.lang.String.<init>(String.java:491)
    at org.xbib.elasticsearch.analysis.baseform.Dictionary.lookup(Dictionary.java:64)
    at org.xbib.elasticsearch.analysis.baseform.Dictionary.lookup(Dictionary.java:65)

The last line repeats over and over...

When I change the text to "das ist ein test" everything works fine!

I just found two more words that cause the exception: "lange" and "lang".

"dieser test dauert kurz" works fine
"dieser test dauert lange" causes a stack overflow
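
Both traces above end in Dictionary.lookup calling itself without terminating. One plausible cause, which is an assumption here and not confirmed by the reports, is a dictionary chain that loops back on itself (for example a word mapping to itself, or two words mapping to each other). A minimal sketch of a lookup that cannot overflow the stack in that situation follows. The class SafeBaseformLookup, its add method, and the sample entries are hypothetical and are not the plugin's actual code or data.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the plugin's Dictionary; not the actual implementation.
final class SafeBaseformLookup {
    private final Map<String, String> baseforms = new HashMap<>();

    // Registers a word -> baseform link (assumed dictionary shape).
    void add(String word, String baseform) {
        baseforms.put(word, baseform);
    }

    // Follows word -> baseform links iteratively instead of recursively.
    // The visited set stops the walk as soon as a word repeats, so a
    // cyclic chain ends the loop instead of recursing forever.
    String lookup(String word) {
        Set<String> seen = new HashSet<>();
        String current = word;
        while (seen.add(current)) {
            String next = baseforms.get(current);
            if (next == null) {
                return current; // end of chain: this is the baseform
            }
            current = next;
        }
        return current; // cycle detected: return the first word seen twice
    }

    public static void main(String[] args) {
        SafeBaseformLookup dict = new SafeBaseformLookup();
        dict.add("tollen", "toll");  // ordinary entry (made up for illustration)
        dict.add("lang", "lang");    // self-referencing entry (assumed failure case)
        dict.add("einem", "ein");    // two-word cycle (assumed failure case)
        dict.add("ein", "einem");
        System.out.println(dict.lookup("tollen"));
        System.out.println(dict.lookup("lang"));
        System.out.println(dict.lookup("einem"));
    }
}
```

Returning the repeated word on a cycle is one possible policy; returning the original input unchanged would be just as defensible.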

can't find package elasticsearch-analysis-baseform 2.2.1.1

Hi,
I am trying to install elasticsearch-analysis-baseform 2.2.1.1, but the link given in the instructions doesn't work.
98a6872

here is the link:
http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-analysis-baseform/2.2.1.1/elasticsearch-analysis-baseform-2.2.1.1-plugin.zip

THX for assistance

T.C

English adjectives are not lemmatized

For example, "quickly" is not reduced to "quick."

It looks like there are lemma files for nouns and verbs, but not for adjectives. Is there a resource for English adjective lemmatization that could be added to the plugin?

Thanks very much.

Installation fails with file not found

It seems xbib.org is down, and hence I am unable to install the plugin for Elasticsearch 1.x.

Output of installation step:

+ ./bin/plugin -install analysis-baseform -url http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-analysis-baseform/1.4.0.0/elasticsearch-analysis-baseform-1.4.0.0-plugin.zip
-> Installing analysis-baseform...
Trying http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-analysis-baseform/1.4.0.0/elasticsearch-analysis-baseform-1.4.0.0-plugin.zip...
Failed: ConnectException[Connection refused (Connection refused)]
Trying https://github.com/null/analysis-baseform/archive/master.zip...
Failed to install analysis-baseform, reason: failed to download out of all possible locations..., use --verbose to get detailed information

With ES 1.3.0 the baseform tokenization is failing because of deprecated Lucene APIs

Case sensitive

I'm using this plugin for German text and it seems to be case sensitive. Is that the case? If so, what's the reason for that?

ES 5.1 / 5.2

Hi @jprante ,

Is there any plan to upgrade the plugin for the newer ES versions? Are there any parts the community could help with?

BaseformTokenFilter sets incorrect offsets for inserted baseforms

At https://github.com/jprante/elasticsearch-analysis-baseform/blob/master/src/main/java/org/xbib/elasticsearch/index/analysis/baseform/BaseformTokenFilter.java#L43 we set the offsets from the saved token, but when the token was saved, we never set its offsets, so these offsets are always 0, which is dangerous since parts of Lucene assume offsets move forwards.

I think to fix this we should just remove that one line ... because the restoreState(current) right above it will already set the correct offsets.
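
The intended behaviour can be illustrated without Lucene: an inserted baseform should carry the same offsets as the token it was derived from, so offsets never move backwards. The sketch below is a plain-Java model of that rule, not the plugin's BaseformTokenFilter. The Token class, the expand and offsetsMoveForward helpers, and the sample dictionary entry are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java model of the offset rule; it does not use or imitate the Lucene API.
final class BaseformOffsets {

    // Minimal token: term text plus start/end character offsets in the source text.
    static final class Token {
        final String term;
        final int start;
        final int end;

        Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    // Emits each original token and, if a different baseform is known,
    // an extra token that reuses the original token's offsets.
    static List<Token> expand(List<Token> input, Map<String, String> baseforms) {
        List<Token> out = new ArrayList<>();
        for (Token t : input) {
            out.add(t);
            String base = baseforms.get(t.term);
            if (base != null && !base.equals(t.term)) {
                out.add(new Token(base, t.start, t.end));
            }
        }
        return out;
    }

    // Checks that start offsets never decrease and that end >= start.
    static boolean offsetsMoveForward(List<Token> tokens) {
        int lastStart = 0;
        for (Token t : tokens) {
            if (t.start < lastStart || t.end < t.start) {
                return false;
            }
            lastStart = t.start;
        }
        return true;
    }

    public static void main(String[] args) {
        // Tokens for the text "er wirkt"; the baseform entry is made up for illustration.
        List<Token> tokens = List.of(new Token("er", 0, 2), new Token("wirkt", 3, 8));
        List<Token> expanded = expand(tokens, Map.of("wirkt", "wirken"));
        System.out.println(offsetsMoveForward(expanded));

        // A baseform inserted with zeroed offsets, as described in the report, fails the check.
        List<Token> broken = new ArrayList<>(tokens);
        broken.add(new Token("wirken", 0, 0));
        System.out.println(offsetsMoveForward(broken));
    }
}
```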

elasticsearch 1.2.*

When will the plugin be compatible with version 1.2.* of Elasticsearch? Or is there a way to install it manually?

Problem with highlighting and baseform

If a text field contains the value "frostbit" (EN) or "wirkt" (DE) and the request asks for highlighting information, we get an error: InvalidTokenOffsetsException.

It seems the endOffset is greater than the length of the original value.

Thanks