snowballstem / snowball
Snowball compiler and stemming algorithms
Home Page: https://snowballstem.org/
License: BSD 3-Clause "New" or "Revised" License
Hi,
currently, the tarball of the C version of the libstemmer library is missing the 'german2' algorithm. Is there a particular reason for this?
Thanks!
I see BSD-3-Clause mentioned in README.rst. Could you please consider shipping the licence text in a separate file alongside README.rst? It is possible to store the license separately from the other docs.
Thank you for considering
I am trying to write a stemmer for the Occitan language, which is very close to Catalan, but I can't.
There is a difference in the Among constructor signature between the Java code generated by the compiler and what is used in the Solr build chain.
Where can I get the right version of the compiler in order to generate matching code?
For example
/* CatalanStemmer from SolR code */
@SuppressWarnings("unused") public class CatalanStemmer extends SnowballProgram {
private static final long serialVersionUID = 1L;
/* patched */ private static final java.lang.invoke.MethodHandles.Lookup methodObject = java.lang.invoke.MethodHandles.lookup();
private final static Among a_0[] = {
new Among ( "", -1, 13, "", methodObject ),
new Among ( "\u00B7", 0, 12, "", methodObject ),
/*OccitanStemmer generated by the compiler*/
public class OccitanStemmer extends org.tartarus.snowball.SnowballProgram {
private static final long serialVersionUID = 1L;
private final static Among a_0[] = {
new Among("", -1, 13),
Again
among_var = find_among(a_0);
and
among_var = find_among_b(a_1);
are different.
I just wanted to let you know that the Julia community provides pre-built binary artifacts for various native dependencies, and has recently built them for Snowball. These binaries, for many different architectures and operating systems, are available from https://github.com/JuliaBinaryWrappers/Snowball_jll.jl/releases . The bundles contain a shared library (.dylib/.so/.dll) and the stemwords executable. They were built from the Snowball C library download, for all available encodings.
Background information about the cross-compiler based build system can be found at https://binarybuilder.org/
[Please close this issue when convenient-- it's only for your information]
I think I have found a bug in the Swedish stemmer. When searching for "mötet" (the meeting) I should get results for "möte" and "möten". I think the problem is with stemming words ending in "et" (words ending in "andet" and "het" should work, though, since those endings are in the suffix list).
When searching for the longest suffix in the first step, I added the suffix "et" and that works. I don't know if that is the right way to fix this, though.
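A minimal sketch of what adding "et" to the first step's longest-match search might look like. This is pure Python and illustrative only: the suffix list is abbreviated and the real algorithm also restricts removal to the R1 region, which this sketch ignores.

```python
# Illustrative subset of step-1 suffixes, plus the proposed "et";
# the real stemmer has a longer list and only strips inside R1.
STEP1_SUFFIXES = sorted(
    ["a", "e", "ad", "ade", "ande", "andet", "are", "en", "heten",
     "het", "ar", "er", "or", "at", "ast", "et"],  # "et" is the proposed addition
    key=len, reverse=True)

def step1(word: str) -> str:
    # Remove the longest matching suffix, if any.
    for suffix in STEP1_SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

# With "et" in the list, all three forms conflate:
# "mötet" -> "möt", "möten" -> "möt", "möte" -> "möt"
```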
I installed the Stemmer and tried to use it with Arabic, but this is what I got:
raise KeyError("Stemming algorithm '%s' not found" % lang)
KeyError: "Stemming algorithm 'arabic' not found"
Am I missing some dependencies? If so, could you please clarify what I need to make this stemming work?
As http://snowballstem.org/algorithms/russian/stemmer.html properly mentions, the Russian alphabet contains the letter Ё [jo], which is quite often replaced with Е, especially in regular, non-academic texts.
So indeed the best approach is to replace Ё -> Е when stemming.
Now if you check the existing demo http://snowballstem.org/demo.html you can see that it doesn't actually happen.
Let's take Russian word for "honey" — «мёд», and its form with different ending — «мёдом».
If you paste it along with its "normalized" form (with Е) you can see that the form with Ё is not properly stemmed:
Here's sample input so that you can run the tests yourself: "мёд мёдом мед медом".
This is a serious problem when searching through a corpus of natural texts. Even if you're a purist (like me in this case) and type all your search terms with properly placed Ё, you won't be able to match the original texts that use Е.
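A pre-stemming normalization pass along these lines (a sketch, not part of the current library) would make the Ё and Е spellings conflate:

```python
def normalize_yo(text: str) -> str:
    # Fold Ё into Е before stemming, so «мёд» and «мед» (and their
    # inflected forms) produce the same stems.
    return text.replace("ё", "е").replace("Ё", "Е")

print(normalize_yo("мёд мёдом мед медом"))  # мед медом мед медом
```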
Could you please insert the licence into the distributed source code?
The BSD licence requires including its text with all derivative works, so without the full license text it is difficult or impossible for anyone to comply with the intended license terms.
There was a discussion on the old mailing list: http://thread.gmane.org/gmane.comp.search.snowball/1463
Thanks,
Marek
When calling the stemWords method in the Python stemmer, it may crash with the following error:
AttributeError: 'FrenchStemmer' object has no attribute 'clear_cache'
It must be an error in the code, as the _clear_cache method exists and is called by the sister method stemWord.
I've made a PR to fix this issue:
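The failure mode is easy to reproduce in isolation: the method is defined with a leading underscore, but one call site omits it. A minimal sketch follows; the class and method names mirror the report, not PyStemmer's actual implementation.

```python
class StemmerSketch:
    def _clear_cache(self):      # defined with a leading underscore
        self._cache = {}

    def stemWord(self, word):
        self._clear_cache()      # correct name: works
        return word

    def stemWords(self, words):
        self.clear_cache()       # misspelled name: AttributeError at runtime
        return words

s = StemmerSketch()
s.stemWord("manger")             # fine
try:
    s.stemWords(["manger"])
except AttributeError as e:
    print(e)                     # message names the missing 'clear_cache'
```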
I found an algorithm for the Ukrainian language in this repo:
https://github.com/Tapkomet/UAStemming
It would be great if you added it.
The Snowball compiler currently generates C code which intermingles variable declarations and code, which is incompatible with ANSI C.
The fix is to declare the variable at the start of the block. Here is a patch that accomplishes this:
compiler/generator.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/compiler/generator.c b/compiler/generator.c
index 30643ca..29aeef3 100644
--- a/compiler/generator.c
+++ b/compiler/generator.c
@@ -890,11 +890,11 @@ static void generate_slicefrom(struct generator * g, struct node * p) {
static void generate_setlimit(struct generator * g, struct node * p) {
int keep_c;
- writef(g, "~{~k~C", p);
- keep_c = g->keep_count;
- w(g, "~Mint mlimit");
+ w(g, "~{int mlimit");
+ keep_c = g->keep_count + 1;
write_int(g, keep_c);
w(g, ";~N");
+ writef(g, "~M~k~C", p);
generate(g, p->left);
w(g, "~Mmlimit");
I have visually checked the result for all generated C stemmer files and found just the intended changes. The new code compiles well as ANSI C and passes all my tests.
All uses of setlimit in the current stemmers we ship follow this pattern, and by special-casing we can avoid having to save and restore the cursor.
df2064d implemented this for C, but it would be good to handle this for all the target languages.
I'm working on a stemmer for the Lithuanian language. The stemmer is going to be used in a Lucene-based search engine, therefore I need to use the generated Java class.
In my algorithm I'm using the length of the word, and I believe this causes me a problem.
I've been following this tutorial. The resulting Java class works fine with the example program. But the Java code does not compile for use in Lucene, with this problem:
[...] /src/main/java/org/tartarus/snowball/ext/LithuanianStemmer.java:[589,27] error: current has private access in SnowballProgram
I see that the code which was generated by snowball compiler tries to get length of the string in this way:
(current.length());
Unfortunately, the implementation of the SnowballProgram class that is used by Lucene defines the current field in this way:
// current string
private char current[];
So, if I want to use the generated class in Lucene, I have to manually change the call for string length in this way:
getCurrent().length();
Am I facing a bug? Or this is the expected behaviour?
I first want to thank everyone on the Snowball project for creating this software. It's great that we can use the software to build more sophisticated search capabilities for our users. However, when I was testing several Dutch words, I noticed there are actually quite a lot of mistakes. I'm not quite sure how to fix the problems in the Dutch stemmer, so I thought I'd mention them here and hope someone picks it up.
Not sure where to start, so I'll mention a couple that are incorrect (the last word is the correct one):
gevaren gevar -> gevaar
gevaar gevar -> gevaar
gevaarlijk gevar -> gevaarlijk
gevaarlijke gevar -> gevaarlijk
gevaarlijker gevaarlijker -> gevaarlijk
gevaarten gevaart -> gevaarte
gevallen gevall -> geval
geven gev -> geef
gevist gevist -> vis
gewasbescherming gewasbescherm -> gewasbescherming
gewassen gewass -> gewas
geweer gewer -> geweer
aanbellen aanbell -> bel aan (yes, Dutch is weird)
aandeel aandel -> aandeel
aaneen aanen -> aaneen (should really be excluded from stemming if possible, since there is no way that this word occurs in any other form)
aalmoezen aalmoez -> aalmoes
gangetje gangetj -> gang
gebaartje gebaartj -> gebaar
These are just a few, but there are quite a lot more. Should you need help verifying or testing the stemmer for the Dutch words, I'm happy to help :)
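Until the algorithm itself is improved, one pragmatic workaround (at the application layer, not part of Snowball itself) is an exception table consulted before the algorithmic stemmer. The entries below are hypothetical, taken from the list above:

```python
# Hypothetical exception table for irregular Dutch forms; a real one
# would need curation by a native speaker.
DUTCH_EXCEPTIONS = {
    "gevaren": "gevaar",
    "gevaarlijk": "gevaarlijk",
    "gevaarlijke": "gevaarlijk",
    "aaneen": "aaneen",   # invariant word: exclude from stemming entirely
}

def stem_with_exceptions(word, algorithmic_stem):
    # Look the word up first; only fall back to the algorithmic stemmer
    # when no curated entry exists.
    return DUTCH_EXCEPTIONS.get(word) or algorithmic_stem(word)
```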
In the java TestApp, we have the statement at https://github.com/snowballstem/snowball/blob/master/java/org/tartarus/snowball/TestApp.java#L31
SnowballStemmer stemmer = (SnowballStemmer) stemClass.newInstance();
However, the generated Java class inherits from org.tartarus.snowball.SnowballProgram rather than org.tartarus.snowball.SnowballStemmer.
By simply making the generated class inherit from org.tartarus.snowball.SnowballStemmer (already a subclass of org.tartarus.snowball.SnowballProgram), the TestApp works out of the box.
How do I compile on Windows?
Hi there,
I would like to know if anyone is interested in having support for a C# generator in this project. Generating C# code is actually quite different from what the Java generator does, as those languages, albeit similar, don't have the same constructs (e.g. labelled breaks/loops). As such, in order to create a generator for C#, it would be necessary to mix parts from the C/C++ generator and the Java one. Plus, it would also be interesting to have a .NET library for Snowball, in the same spirit as the Java one that is available in this repository.
As such, I just wanted to add that I have started a new fork to address those issues that is almost completely finished, with support for a new generator_csharp.c that can already generate all stemmer definition files that come with Snowball. The generated C# files could be seen, for example, here:
I have also added some unit tests for some languages, though not all of them yet. So, if any project maintainer would like this feature, I can send a pull request to merge my forked version. Otherwise I can just keep my fork around for anyone else who might be interested.
The fork is currently available at https://github.com/cesarsouza/snowball. In order to generate C# files, you can pass "-cs" to the command line compiler, as in
An example generated file can be found here.
Hope it can be useful! 😃
When generating Java from *.sbl files, in combination with the current SnowballProgram Java file (2016072500L), current.length is not accessible anymore, because current is now a private StringBuffer.
So things like "len" and "size" don't compile; should they be changed to getCurrent().length()?
(See the part with "Thanks to Wolfram Esser for spotting this problem." in SnowballProgram.java.)
Java: there is an issue with the generated stemmer code using a stemmer variable referencing the abstract grandparent SnowballProgram class.
I think I have the solution (see below), but would like to have some feedback.
For example, when compiling the Schinke Latin latin.sbl.txt stemmer
(see http://snowball.tartarus.org/otherapps/schinke/intro.html):
The generated concrete Java class LatinStemmer extends SnowballStemmer, which extends the abstract SnowballProgram.
In this class, the $noun_form and $verb_form variables are copies of the LatinStemmer class.
Generated code example for LatinStemmer, from lines 48 and 57 respectively:
48: SnowballProgram v_2 = new SnowballProgram(this);
...
57: SnowballProgram v_4 = new SnowballProgram(this);
The issue is that SnowballProgram is an abstract class, and thus cannot be instantiated.
My possible solution: change the SnowballStemmer class copy constructor to call the super class SnowballProgram:
package org.tartarus.snowball;
import java.lang.reflect.InvocationTargetException;
public abstract class SnowballStemmer extends SnowballProgram {
public abstract boolean stem();
public SnowballStemmer(SnowballStemmer other) {
super(other);
}
static final long serialVersionUID = 2016072500L;
}
Then also generate the copy constructor in the generated class:
public LatinStemmer(SnowballStemmer other) {
super(other);
}
Then the generated code can be changed as follows for "LatinStemmer":
LatinStemmer v_2 = new LatinStemmer(this);
I can't find a better solution on Stack Overflow, but maybe I'm missing something here?
Note that LatinStemmer is one of the few Snowball algorithms that uses a stemmer variable.
Is there a way to eliminate such variables?
NB: the import java.lang.reflect.InvocationTargetException; seems to be no longer used, since there is no try/catch for InvocationTargetException in SnowballStemmer?
The Java file generated from this source:
externals (null)
define null as ()
contains this invalid method definition:
public boolean null() {
// (, line 2
return true;
}
Hi there,
Is there a way to compile Snowball to a shared library, generating something like libsnowball.so or libsnowball.dll?
PyStemmer throws a UnicodeDecodeError on specific input involving certain emojis with the Danish stemmer (note the difference in inputs: the second has two 'a's in between the emojis):
> mkvirtualenv -p /usr/bin/python3 stemmer
...
> pip install PyStemmer
...
Successfully installed PyStemmer-1.3.0
> python3 --version
Python 3.6.6
> python3 -c "import Stemmer; print(Stemmer.Stemmer('da').stemWord(b'\xf0\x9f\x98\x98a\xf0\x9f\x98\x98'.decode('utf-8')))"
😘a😘
> python3 -c "import Stemmer; print(Stemmer.Stemmer('da').stemWord(b'\xf0\x9f\x98\x98aa\xf0\x9f\x98\x98'.decode('utf-8')))"
Traceback (most recent call last):
File "Stemmer.pyx", line 184, in Stemmer.Stemmer.stemWord (src/Stemmer.c:1669)
KeyError: b'\xf0\x9f\x98\x98aa\xf0\x9f\x98\x98'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "Stemmer.pyx", line 192, in Stemmer.Stemmer.stemWord (src/Stemmer.c:1772)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 6-8: unexpected end of data
Other tested languages work:
> python3 -c "import Stemmer; [Stemmer.Stemmer(lang).stemWord(b'\xf0\x9f\x98\x98aa\xf0\x9f\x98\x98'.decode('utf-8')) for lang in ('en', 'sv', 'fi', 'fr', 'de')]" && echo 'ok'
ok
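The traceback is consistent with the stemmer cutting the word at a byte rather than a character boundary. The error can be reproduced without PyStemmer by truncating the UTF-8 encoding of the same input mid-codepoint:

```python
text = b'\xf0\x9f\x98\x98aa\xf0\x9f\x98\x98'.decode('utf-8')  # the failing input
raw = text.encode('utf-8')   # 10 bytes: 4 (emoji) + 2 ('aa') + 4 (emoji)
truncated = raw[:-1]         # drop one byte: the final emoji is incomplete
try:
    truncated.decode('utf-8')
except UnicodeDecodeError as e:
    # Same reason as in the traceback: 'unexpected end of data',
    # starting at byte position 6 (matching "position 6-8" above).
    print(e.reason, e.start, e.end)
```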
After trying the Turkish stemmer, I found there might be some inconsistencies with the original paper 'An Affix Stripping Morphological Analyzer for Turkish'.
Here are some error examples:
Yazacağım -> Yazacak // should stem to Yaz; acak is a tense suffix
Yazıyorsunuz -> Yaziyor // should stem to Yazi; yor is a tense suffix
Both examples above derive from Yazi (write).
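The behaviour asked for in the examples can be sketched as longest-first suffix stripping. This is a hypothetical toy, not the published algorithm; the suffix list is invented for illustration:

```python
# Hypothetical tense/person suffixes, illustrative only.
TENSE_SUFFIXES = sorted(["acağım", "yorsunuz", "acak", "yor"],
                        key=len, reverse=True)

def strip_tense(word: str) -> str:
    stem = word.lower()
    for suffix in TENSE_SUFFIXES:
        # Keep at least a two-letter stem.
        if stem.endswith(suffix) and len(stem) - len(suffix) >= 2:
            return stem[: -len(suffix)]
    return stem

# "Yazacağım" -> "yaz", "Yazıyorsunuz" -> "yazı"
```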
Above all, thanks a lot for the hard work helping people analyse languages they don't even know much about.
Just copying a request posted from the mailing archive.
I'm trying to generate Java sources for the http://snowballstem.org/otherapps/schinke/ algorithm. I added stem.sbl from schinke.tgz to "snowball/algorithms/latin/stem.sbl". Then I updated the GNUmakefile:
diff --git a/GNUmakefile b/GNUmakefile
index d6c7606..08237fa 100644
--- a/GNUmakefile
+++ b/GNUmakefile
@@ -29,7 +29,7 @@ libstemmer_algorithms = arabic \
danish dutch english finnish french german hungarian \
italian \
norwegian porter portuguese romanian \
- russian spanish swedish tamil turkish
+ russian spanish swedish tamil turkish latin
KOI8_R_algorithms = russian
ISO_8859_1_algorithms = danish dutch english finnish french german italian \
Apparently the generated sources do not compile against
java version "1.8.0_92"
Java(TM) SE Runtime Environment (build 1.8.0_92-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.92-b14, mixed mode)
with error:
[error] ./src/main/java/org/tartarus/snowball/ext/latinStemmer.java:260: missing return statement
[error] }
Even if I stub the error with return true or return false, the stemmer produces weird results. When I launch TestApp latin in.txt -o out.txt for the input datum, it produces the string datum datum, but it should just be dat.
When generating Java from *.sbl, the class name is empty.
Note: I use a Windows command / batch file from the algorithms folder.
Example: snowball dutch\stem_ISO_8859_1.sbl -o dutchStemmer_ISO -j
Resulting file: dutchStemmer_ISO.java
The generated class name should be dutchStemmer_ISO but is empty. I put [missing] to indicate where:
// This file was generated automatically by the Snowball to Java compiler
// http://snowballstem.org
package org.tartarus.snowball.ext;
import org.tartarus.snowball.Among;
/**
 * This class was automatically generated by a Snowball to Java compiler
 * It implements the stemming algorithm defined by a snowball script.
 */
public class [missing] extends org.tartarus.snowball.SnowballProgram {
...
Also at the end of this file the same problem arises:
public boolean equals( Object o ) {
    return o instanceof [missing];
}
public int hashCode() {
    return [missing].class.getName().hashCode();
}
Systeme -> System
Systemen -> System
Systemes -> System
Systems -> System
but
System -> Syst
Please see the thread here as well: http://www.postgresql.org/message-id/[email protected]
The following Snowball code:
externals (stem)
define stem as not true
produces invalid Java code:
public boolean stem() {
// not, line 2
lab0: do {
return false;
} while (false);
return true;
}
which does not compile:
org/tartarus/snowball/ext/testStemmer.java:23: error: unreachable statement
return true;
^
When entering a number with 2 or more trailing zeroes into the Finnish stemmer, the last zero is removed.
Addition: this issue applies not only to zeroes but to all numbers, e.g. "555", "9999" etc.
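Until the algorithm is fixed, a generic guard of this shape (a sketch at the application layer, not a change to the stemmer itself) keeps purely numeric tokens away from it:

```python
def stem_token(token, stem):
    # Pass numeric tokens through untouched; only words reach the stemmer.
    if token.isdigit():
        return token
    return stem(token)

# With a stand-in stemmer that drops the last character:
# stem_token("9000", lambda w: w[:-1]) -> "9000" (protected)
# stem_token("taloa", lambda w: w[:-1]) -> "talo"
```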
We're currently seeing a number of these warnings from CI here and below:
https://travis-ci.org/snowballstem/snowball/jobs/581014461#L342
The Rust compiler seems to be suggesting to just insert dyn in every instance, but I know pretty much nothing about Rust and have no idea if that's actually the appropriate fix.
@JDemler Please could you take a look?
Also, is there an equivalent of GCC's -Werror which would cause new warnings to make CI fail so we can't miss them? I only spotted this while cleaning up the CI config.
It would be good to stem the feminine versions of nouns in the German stemmers. For example, "Verkäuferin" should be stemmed to "Verkauf", as is done for the masculine form "Verkäufer". This is especially important for occupations, for example on job boards.
Is there a reason not to do that, or has no one gotten around to it or come up with an appropriate algorithm? I might try to do a PR if there is any chance of getting it accepted.
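A naive pre-step in this spirit might strip "-in"/"-innen" before the existing suffix steps. This is purely a hypothetical sketch: a bare "-in" rule would mangle words like "Berlin", so the version below only strips after an "-er" agent-noun ending, and a real rule would still need more linguistic care.

```python
def strip_feminine(word: str) -> str:
    # Hypothetical pre-step: plural "-innen" before singular "-in".
    # Only strip when the remainder ends in "er", approximating
    # agent nouns (Verkäufer|in), so "Berlin" is left alone.
    for suffix in ("innen", "in"):
        if word.endswith(suffix) and word[: -len(suffix)].endswith("er"):
            return word[: -len(suffix)]
    return word

# "Verkäuferin" -> "Verkäufer", "Verkäuferinnen" -> "Verkäufer"
```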
*.o files in the examples folder should be removed on make clean. The patch:
diff --git a/GNUmakefile b/GNUmakefile
index d6c7606..8dbfefb 100644
--- a/GNUmakefile
+++ b/GNUmakefile
@@ -309,7 +309,7 @@ dist_libstemmer_c: \
echo 'stemwords: examples/stemwords.o libstemmer.o' >> $${dest}/Makefile && \
echo ' $$(CC) -o $$@ $$^' >> $${dest}/Makefile && \
echo 'clean:' >> $${dest}/Makefile && \
- echo ' rm -f stemwords *.o $(c_src_dir)/*.o runtime/*.o libstemmer/*.o' >> $${dest}/Makefile && \
+ echo ' rm -f stemwords *.o $(c_src_dir)/*.o runtime/*.o libstemmer/*.o examples/*.o' >> $${dest}/Makefile && \
(cd dist && tar zcf $${destname}.tgz $${destname}) && \
rm -rf $${dest}
I saw snowballstemmer being brought in as a dependency and was curious what such an unusually named package was. The PyPI page doesn't mention what stemming is or why Snowball algorithms are noteworthy for it 🤷♀️
The GitHub repository's README mentions the site snowballstem.org which, frustratingly, also fails to say what stemming is. A Google search and a Wikipedia entry later explains it 🤦♀️
Both the PyPI page and the README ought to at least give a one-line description of what stemming is, perhaps:
Stemming is the process of finding word "stems" (bases or root forms) from derived or inflected forms, such as fisher, fishing, fished → fish.
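For illustration, the suggested one-line description could even be paired with a toy example. The snippet below uses crude suffix stripping, not the real Porter/Snowball rules:

```python
def toy_stem(word: str) -> str:
    # Crude illustration of stemming: strip one common English
    # inflectional suffix, keeping at least a three-letter stem.
    for suffix in ("ing", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([toy_stem(w) for w in ("fisher", "fishing", "fished")])  # ['fish', 'fish', 'fish']
```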
The following Snowball code:
externals (stem)
define stem as (false or true)
produces invalid Java and Rust code: the return statements of stem are missing.
The Java error is:
org/tartarus/snowball/ext/testStemmer.java:27: error: missing return statement
}
^
1 error
The Rust error is:
error[E0308]: mismatched types
--> /Users/dcorbett/snowball/rust/target/debug/build/testapp-ca6d6d95758b7568/out/test_stemmer.rs:25:9
|
25 | break 'lab0;
| ^^^^^^^^^^^ expected (), found bool
|
= note: expected type `()`
found type `bool`
Should arabic and tamil be added to modules and modules_utf8?
@mitya57 This was held for moderation by the mailing list, but it didn't seem relevant for most subscribers (and I'm not sure if you're even subscribed to the list) so putting it in an issue here instead:
Hello!
You have recently released a new version of snowballstemmer. Thanks! It will probably make a lot of people happy. There is a tiny issue with your package on PyPI though: it is a bit difficult to guess the exact license you use.
It is very clear that you use one of the BSD licenses. PyPI allows you to use the following classifier:
License :: OSI Approved :: BSD License
But it is not very precise. You can also use the "license" field in setuptools, but there is no "fixed" format, so you could write "BSD2", "BSD-2", "BSD-2-Clause", etc. This makes it hard for automated tools to properly guess which BSD license you are using.
May I suggest using the aforementioned classifier and an SPDX identifier (https://spdx.org/licenses/) in the "license" field?
Thanks,
Cyril Roelandt
(This email was automatically generated, but you can answer it! A real human being will read your answer.)
(If you think it makes more sense to list your address as maintainer for PyPI rather than the mailing list, please do).
From: algorithms/french/stem_ISO_8859_1.sbl
stringdef a^ hex 'E2' // a-circumflex
stringdef a` hex 'E0' // a-grave
stringdef c, hex 'E7' // c-cedilla
stringdef e" hex 'EB' // e-diaeresis (rare)
stringdef e' hex 'E9' // e-acute
stringdef e^ hex 'EA' // e-circumflex
stringdef e` hex 'E8' // e-grave
stringdef i" hex 'EF' // i-diaeresis
stringdef i^ hex 'EE' // i-circumflex
stringdef o^ hex 'F4' // o-circumflex
stringdef u^ hex 'FB' // u-circumflex
stringdef u` hex 'F9' // u-grave
So far there is no UTF-8 version. Why?
Hello! I hope this is the right place to report a problem I found with an app that uses PostgreSQL 10's full-text search.
There is a class of French nouns that form their plural in x: jeux, hiboux, choux, aulx, baux, etc.
Testing with PG and reading the doc at https://snowballstem.org/algorithms/french/stemmer.html makes me think that these are not handled.
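A hypothetical rule in the stemmer's spirit would treat a final "x" after "au"/"eu"/"ou" as a plural marker. This is a sketch only: suppletive cases like aulx → ail are beyond any suffix rule, and even this overreaches (baux → bau, while the singular is bail).

```python
def strip_x_plural(word: str) -> str:
    # Hypothetical: nouns in -aux/-eux/-oux drop the plural "x".
    if word.endswith("x") and word[-3:-1] in ("au", "eu", "ou"):
        return word[:-1]
    return word

# "jeux" -> "jeu", "hiboux" -> "hibou", "choux" -> "chou"
```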
Hi,
It would be very helpful for packagers if you created release tags.
https://help.github.com/articles/about-releases
Thanks!
In #66, it was reported that the Finnish stemmer damaged numbers (e.g. 2000 -> 200). This is now fixed, but there are similar (though more subtle) issues elsewhere.
E.g. the Danish stemmer damages alphanumeric codes where the initial alpha part meets certain criteria - e.g. space1999 -> space199, hal9000 -> hal900, 0x0e00 -> 0x0e0. These are significantly less problematic in practice, as in many cases there probably isn't any unwanted conflation (there isn't a "space199" or a "hal900") and the stem is really just an opaque internal token in typical usage. It's more likely to be a problem for cases such as hex codes in error messages, inventory codes, etc., and I think it is worth addressing in a similar way to the fix for Finnish - i.e. by replacing a non-v check with a c check, where c is all the letters in Danish except those in v.
We also should review the other algorithms where non is used (dutch, english, french, german, hungarian, indonesian, irish, italian, kraaij_pohlmann, lithuanian, norwegian, porter, portuguese, romanian, russian, spanish, swedish, turkish), and add a generic test that feeds some numbers and alphanumerics into the stemmers and checks they aren't changed, to help ensure there aren't other issues of this sort now or in the future.
Hi
can you provide maven artifact and metadata for the Java stemming algorithm library?
Thanks in advance
Regards
Hi,
All source files are missing license headers.
Please confirm the licensing of the code and/or contents, and add license headers.
Thanks in advance
Regards
Hello,
I installed the library from PyPI
pip install snowballstemmer
There is a bug in https://github.com/snowballstem/snowball/blob/master/python/create_init.py#L42
----> 1 snowballstemmer.algorithms()
67 return Stemmer.language()
68 else:
---> 69 return list(_languages.key())
70
71 def stemmer(lang):
AttributeError: 'dict' object has no attribute 'key'
It should be _languages.keys()
I think it would make it far easier for people to use it.
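The fix reported above is one character plus an "s". With a toy _languages mapping standing in for the generated table:

```python
_languages = {"english": None, "french": None}   # stand-in for the real table

def algorithms():
    # dict has no .key() method; .keys() returns the view of language names
    return list(_languages.keys())

print(sorted(algorithms()))  # ['english', 'french']
```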
Hi, I would like to contribute a stemming algorithm for Faroese.
How do I get the required permissions to do a push?
Regards
Brandur
Currently, there is no simple way to determine which changes have been made between the released versions. Consider creating a changelog which would contain this information.
The web page for the French stemming algorithm does not mention other, rarely used letters with diacritics. For instance:
Are these rare diacritics in French orthography considered in the French stemming algorithm?
I noticed @mitya57 imported Python code generator from my code (#24).
If someone wants to use the following namespace, I will give the ownership of the PyPI namespace:
https://pypi.org/project/snowballstemmer/
Newer versions should be generated from this repository instead of mine.