snowballstem / snowball

Snowball compiler and stemming algorithms

Home Page: https://snowballstem.org/

License: BSD 3-Clause "New" or "Revised" License

Makefile 3.88% C 74.10% Java 1.97% Perl 1.16% Python 2.65% JavaScript 1.80% Rust 2.58% Go 1.44% C# 3.86% Pascal 1.59% Ada 4.97%

snowball's People

Contributors

assem-ch, dbcerigo, dscorbett, graingert, iolalla, jannick0, jbaum98, jdemler, jdufresne, jimregan, jonathlela, jsoref, jsteemann, lindafr, martin-porter, master, mitya57, mschoch, oerd, ojwb, paper42, patperry, rboulton, rdamodharan, rmuir, stcarrez, stef4np, stefanor, turnerj, vstakhov

snowball's Issues

license in separate file

I see the BSD-3-Clause license text in README.rst. Could you please consider putting it in a separate file that is shipped alongside README.rst? It is possible to store the license separately from the other docs.

Thank you for considering

Stemmer for Occitan language in Java for SolR, code generated doesn't compile

I am trying to write a stemmer for the Occitan language, which is very close to Catalan, but I can't get the result to compile.
The Among signature in the generated Java code differs from the one used in the SolR build chain.
Where can I get the right version of the compiler to generate matching code?
For example:

/* CatalanStemmer  from SolR code */
@SuppressWarnings("unused") public class CatalanStemmer extends SnowballProgram {

private static final long serialVersionUID = 1L;

       /* patched */ private static final java.lang.invoke.MethodHandles.Lookup methodObject = java.lang.invoke.MethodHandles.lookup();  
                private final static Among a_0[] = {
                    new Among ( "", -1, 13, "", methodObject ),                                        
                    new Among ( "\u00B7", 0, 12, "", methodObject ),                                                              

/*OccitanStemmer generated by the compiler*/
public class OccitanStemmer extends org.tartarus.snowball.SnowballProgram {

    private static final long serialVersionUID = 1L;

    private final static Among a_0[] = {
        new Among("", -1, 13),

Again

        among_var = find_among(a_0);

and

        among_var = find_among_b(a_1);

are different.

Pre-built Binaries

I just wanted to let you know that the Julia community provides pre-built binary artifacts for various native dependencies, and has recently built them for Snowball. These binaries, for many different architectures and operating systems, are available from https://github.com/JuliaBinaryWrappers/Snowball_jll.jl/releases . The bundles contain a shared library (.dylib/.so/.dll) and the stemwords executable. They were built from the Snowball C library download. They've been built for all available encodings.

Background information about the cross-compiler based build system can be found at https://binarybuilder.org/

[Please close this issue when convenient-- it's only for your information]

Swedish stemmer

I think I have found a bug in the Swedish stemmer. When searching for "mötet" (the meeting) I should get results for "möte" and "möten". I think the problem is in stemming words ending with "et". (Words ending with "andet" and "het" should work, though; those endings are in the suffix list.)

When searching for the longest suffix in the first step, I added the suffix "et" and that works. I don't know if that is the right way to fix this, though.

Arabic Stemmer

I installed the Stemmer and tried to use it with Arabic, but this is what I got:

raise KeyError("Stemming algorithm '%s' not found" % lang)

KeyError: "Stemming algorithm 'arabic' not found"

Am I missing some dependencies? If so, could you please clarify what I need to make this stemming work?
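
A quick way to see which languages an installed package supports is to list them. The sketch below assumes the package in use is snowballstemmer (the wording of the error in the report suggests this, but that is an inference); `available_algorithms` is a hypothetical helper name. If 'arabic' is missing from the list, the installed release likely predates the Arabic stemmer, and upgrading may help.

```python
def available_algorithms():
    """Return the algorithm names offered by the installed snowballstemmer.

    Returns an empty list when the package is not installed.
    """
    try:
        import snowballstemmer
    except ImportError:
        return []
    return sorted(snowballstemmer.algorithms())


if __name__ == "__main__":
    names = available_algorithms()
    print("arabic supported:", "arabic" in names)
    print(names)
```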

Problems with Russian letter Ё

As http://snowballstem.org/algorithms/russian/stemmer.html correctly mentions, the Russian alphabet contains the letter Ё [jo], which is quite often replaced with Е, especially in regular, non-academic texts.

So indeed the best approach is to replace Ё -> Е when stemming.

Now if you check the existing demo http://snowballstem.org/demo.html you can see that it doesn't actually happen.

Let's take Russian word for "honey" — «мёд», and its form with different ending — «мёдом».
If you paste it along with its "normalized" form (with Е) you can see that the form with Ё is not properly stemmed:
[Screenshot of the demo output]

Here's sample input so that you can run the tests yourself: "мёд мёдом мед медом".

This is a serious problem when searching through a corpus of natural texts. Even if you're a purist (like me in this case) and type all your search terms with properly placed Ё, you won't be able to match original texts that use Е.
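
Until the stemmer handles this itself, the replacement can be applied to both indexed text and queries before stemming. A minimal sketch, assuming a hypothetical helper named `normalize_yo`; it is a workaround on the caller's side, not a change to the Russian algorithm:

```python
# Map both cases of Ё (U+0401, U+0451) to Е (U+0415, U+0435).
# Every other character is left untouched.
_YO_TO_YE = str.maketrans({"\u0451": "\u0435", "\u0401": "\u0415"})


def normalize_yo(text: str) -> str:
    """Replace Ё/ё with Е/е so both spellings reach the stemmer in one form."""
    return text.translate(_YO_TO_YE)


if __name__ == "__main__":
    # The sample input from the report; after normalization the first two
    # words are spelled the same way as the last two.
    print(normalize_yo("м\u0451д м\u0451дом мед медом"))
```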

Cache error in python stemmer

When calling the stemWords method in the Python stemmer, it may crash with the following error:

AttributeError: 'FrenchStemmer' object has no attribute 'clear_cache'

It must be an error in the code as the _clear_cache method exists and is called by the sister method stemWord.

I've made a PR to fix this issue:

#105
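
The failure mode described above can be sketched with a toy class. Everything here is hypothetical (the class name, the cache limit, the placeholder stemming rule); it only illustrates why calling `clear_cache` fails when the method is actually named `_clear_cache`, and is not the code from the PR.

```python
class ToyStemmer:
    """Hypothetical stand-in for illustration; not the real stemmer class."""

    max_cache_size = 2  # assumed limit, kept small so the example triggers it

    def __init__(self):
        self._cache = {}

    def _clear_cache(self):
        self._cache = {}

    def _stem(self, word):
        # Placeholder rule so the example runs; not a real stemming algorithm.
        return word.lower()

    def stemWord(self, word):
        if word not in self._cache:
            self._cache[word] = self._stem(word)
        result = self._cache[word]
        if len(self._cache) > self.max_cache_size:
            self._clear_cache()  # the private name, as defined above
        return result

    def stemWords(self, words):
        # Reusing stemWord avoids a second, misspelled call such as
        # self.clear_cache(), which would raise AttributeError.
        return [self.stemWord(word) for word in words]


if __name__ == "__main__":
    print(ToyStemmer().stemWords(["A", "B", "C", "A"]))
```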

Java builds and tests

Hi @rboulton ,

I would like the Java stemmers to also be included in travis-ci builds, ideally with stemming tests like their C counterparts (see #5).

If it's alright with you, I'd like to work on this and submit a pull-request shortly.

oerd

Compiler generates invalid code for ANSI C

The Snowball compiler currently generates C code which intermingles variable declarations and code, which is incompatible with ANSI C.

The fix is to declare the variable at the start of the block. Here is a patch that accomplishes this:

 compiler/generator.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/compiler/generator.c b/compiler/generator.c
index 30643ca..29aeef3 100644
--- a/compiler/generator.c
+++ b/compiler/generator.c
@@ -890,11 +890,11 @@ static void generate_slicefrom(struct generator * g, struct node * p) {
 
 static void generate_setlimit(struct generator * g, struct node * p) {
     int keep_c;
-    writef(g, "~{~k~C", p);
-    keep_c = g->keep_count;
-    w(g, "~Mint mlimit");
+    w(g, "~{int mlimit");
+    keep_c = g->keep_count + 1;
     write_int(g, keep_c);
     w(g, ";~N");
+    writef(g, "~M~k~C", p);
     generate(g, p->left);
 
     w(g, "~Mmlimit");

I have visually checked the result for all generated C stemmer files and found just the intended changes. The new code compiles well as ANSI C and passes all my tests.

Java class generation

I'm working on a stemmer for the Lithuanian language. The stemmer is going to be used in a Lucene-based search engine, so I need to use the generated Java class.

In my algorithm I'm using the length of the word and I believe this causes me a problem.

I've been following this tutorial. The resulting Java class works fine with the example program, but the Java code does not compile for use in Lucene, failing with this error:

[...] /src/main/java/org/tartarus/snowball/ext/LithuanianStemmer.java:[589,27] error: current has private access in SnowballProgram

I see that the code which was generated by snowball compiler tries to get length of the string in this way:

(current.length());

Unfortunately, the implementation of the SnowballProgram class that is used by Lucene defines the current field in this way:

 // current string
    private char current[];

So, if I want to use the generated class in Lucene, I have to manually change the call for string length in this way:

getCurrent().length();

Am I facing a bug, or is this the expected behaviour?

Mistakes in the Dutch stemmer

I first want to thank everyone on the Snowball project for creating this software. It's great that we can use it to build more sophisticated search capabilities for our users. However, when I was testing several Dutch words, I noticed there are actually quite a lot of mistakes. I'm not sure how to fix the problems in the Dutch stemmer, so I thought I'd mention them here and hope someone picks them up.

Not sure where to start, so I'll mention a couple that are incorrect (the last word is the correct one):

gevaren gevar -> gevaar
gevaar gevar -> gevaar
gevaarlijk gevar -> gevaarlijk
gevaarlijke gevar -> gevaarlijk
gevaarlijker gevaarlijker -> gevaarlijk
gevaarten gevaart -> gevaarte
gevallen gevall -> geval
geven gev -> geef
gevist gevist -> vis
gewasbescherming gewasbescherm -> gewasbescherming
gewassen gewass -> gewas
geweer gewer -> geweer
aanbellen aanbell -> bel aan (yes, Dutch is weird)
aandeel aandel -> aandeel
aaneen aanen -> aaneen (should really be excluded from stemming if possible, since there is no way that this word occurs in any other form)
aalmoezen aalmoez -> aalmoes
gangetje gangetj -> gang
gebaartje gebaartj -> gebaar

These are just a few, but there are quite a lot more. Should you need help verifying or testing the stemmer for the Dutch words, I'm happy to help :)

Generated Java code does not inherit from the correct class

In the java TestApp, we have the statement at https://github.com/snowballstem/snowball/blob/master/java/org/tartarus/snowball/TestApp.java#L31

SnowballStemmer stemmer = (SnowballStemmer) stemClass.newInstance();

However, the generated Java class inherits from org.tartarus.snowball.SnowballProgram rather than org.tartarus.snowball.SnowballStemmer.

By simply making the generated class inherit from org.tartarus.snowball.SnowballStemmer (already a subclass of org.tartarus.snowball.SnowballProgram) the TestApp works out of the box.

Add stemmer generators for C# code

Hi there,

I would like to know if anyone is interested in having support for C# generators in this project. Generating C# code is actually quite different from what the Java generator does, as those languages, albeit similar, don't have the same constructs (e.g. labelled breaks/loops). As such, in order to create a generator for C#, it would be necessary to mix parts from the C/C++ generator and the Java one. Plus, it would also be interesting to have a .NET library for Snowball, in the same spirit as the Java one that is available in this repository.

As such, I just wanted to add that I have started a new fork to address those issues that is almost completely finished, with support for a new generator_csharp.c that can already generate all stemmer definition files that come with Snowball. The generated C# files could be seen, for example, here:

I have also added some unit tests for some languages, though not all of them yet. So, if any project maintainer would like this feature, I can send a pull request to merge my fork'd version. Otherwise I can just keep my fork around for anyone else that could be interested.

The fork is currently available at https://github.com/cesarsouza/snowball. In order to generate C# files, you can pass "-cs" to the command line compiler, as in

  • snowball.exe stem_ISO_8859_1.sbl -cs -o EnglishStemmer -name EnglishStemmer -u

An example generated file can be found here.

Hope it can be useful! 😃

When generating Java from *.sbl, with current SnowballProgram and java sources, current.length is not accessible anymore.

When generating Java from *.sbl files , in combination with the current SnowballProgram java file (2016072500L), current.length is not accessible anymore, because current is now a private StringBuffer.

So things like "len" and "size" don't compile. Should they be changed to getCurrent().length()?
(see the part with "Thanks to Wolfram Esser for spotting this problem." in SnowballProgram.Java)

Java : generated stemmer code variable cannot instantiate abstract grandparent class `SnowballProgram`.

Java : there is an issue with the generated stemmer code using a stemmer variable referencing the abstract grandparent SnowballProgram class.

I think I have the solution (see below), but would like to have some feedback.

For example when compiling the Schinke Latin latin.sbl.txt stemmer :
see http://snowball.tartarus.org/otherapps/schinke/intro.html

The generated concrete Java class LatinStemmer extends SnowballStemmer, which in turn extends the abstract SnowballProgram.
In this class, the $noun_form and $verb_form variables are copies of the LatinStemmer class.

Generated code example for LatinStemmer from line 48 resp 57 :

 48: SnowballProgram v_2 = new SnowballProgram(this);
      ... 
 57: SnowballProgram v_4 = new SnowballProgram(this);	  

The issue is that SnowballProgram is an abstract class, and thus can not be instantiated.

My possible solution : change the SnowballStemmer class copy constructor to call the super class SnowballProgram

package org.tartarus.snowball;
import java.lang.reflect.InvocationTargetException;

public abstract class SnowballStemmer extends SnowballProgram {

  public abstract boolean stem();

  public SnowballStemmer(SnowballStemmer other) {
    super(other);
  }

 static final long serialVersionUID = 2016072500L;
}

Then also generate the copy constructor in the generated class:

public LatinStemmer(SnowballStemmer other) {
    super(other);
 }

Then the generated code can be changed as follows for "LatinStemmer":

   LatinStemmer v_2 = new LatinStemmer(this);

I can't find a better solution on Stack Overflow, but maybe I'm missing something here?
Note that LatinStemmer is one of the few snowball algorithms, that uses a stemmer variable.
Is there a way to eliminate such variables?

NB: the import java.lang.reflect.InvocationTargetException; no longer seems to be used,
since there is no try/catch for InvocationTargetException in SnowballStemmer.

Invalid generated Java

The Java file generated from this source:

externals (null)
define null as ()

contains this invalid method definition:

    public boolean null() {
        // (, line 2
        return true;
    }

Snowball as a shared library

Hi there,

Is there a way to compile snowball to a shared library?

Generating something like libsnowball.so ? or libsnowball.dll?

UnicodeDecodeError with Danish stemmer

PyStemmer throws a UnicodeDecodeError on specific input involving certain emojis with Danish stemmer (note the difference in inputs, the second has two 'a' in between the emojis):

> mkvirtualenv -p /usr/bin/python3 stemmer
...
> pip install PyStemmer
...
Successfully installed PyStemmer-1.3.0
> python3 --version
Python 3.6.6
> python3 -c "import Stemmer; print(Stemmer.Stemmer('da').stemWord(b'\xf0\x9f\x98\x98a\xf0\x9f\x98\x98'.decode('utf-8')))"
😘a😘
> python3 -c "import Stemmer; print(Stemmer.Stemmer('da').stemWord(b'\xf0\x9f\x98\x98aa\xf0\x9f\x98\x98'.decode('utf-8')))"
Traceback (most recent call last):
  File "Stemmer.pyx", line 184, in Stemmer.Stemmer.stemWord (src/Stemmer.c:1669)
KeyError: b'\xf0\x9f\x98\x98aa\xf0\x9f\x98\x98'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "Stemmer.pyx", line 192, in Stemmer.Stemmer.stemWord (src/Stemmer.c:1772)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 6-8: unexpected end of data

Other tested languages work:

> python3 -c "import Stemmer; [Stemmer.Stemmer(lang).stemWord(b'\xf0\x9f\x98\x98aa\xf0\x9f\x98\x98'.decode('utf-8')) for lang in ('en', 'sv', 'fi', 'fr', 'de')]" && echo 'ok'
ok
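
The traceback is consistent with the stemmer having dropped the final byte of the UTF-8 input, leaving the second emoji truncated. That is an inference from the error message, not a confirmed diagnosis. The sketch below reproduces the same decode error in plain Python, without PyStemmer; `decode_error_span` is a hypothetical helper.

```python
# The failing input from the report: emoji (4 bytes) + "aa" + emoji (4 bytes).
DATA = b"\xf0\x9f\x98\x98aa\xf0\x9f\x98\x98"


def decode_error_span(raw: bytes):
    """Return (start, end) of the first UTF-8 decode error, or None if valid."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start, exc.end
    return None


if __name__ == "__main__":
    print(decode_error_span(DATA))       # the full input decodes cleanly
    print(decode_error_span(DATA[:-1]))  # with the last byte removed
```

With the last byte removed, the error covers bytes 6 to 8, matching the "position 6-8: unexpected end of data" message in the report.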

problems with Turkish stemmer

After trying the Turkish stemmer, I found there might be some inconsistencies with the original paper 'An Affix Stripping Morphological Analyzer for Turkish'.
Here are some example errors:

Yazacağım->Yazacak // should stem to Yaz, acak is a tense suffix
Yazıyorsunuz-> Yaziyor// should stem to Yazi, yor is a tense suffix 

Both examples above derive from Yazi(write).

Above all, thanks a lot for the hard work to help people analyse languages they don't even know much about.

Multiple errors in generated Java sources for Latin algorithm

Just copying a request posted to the mailing list archive.

I’m trying to generate Java sources for http://snowballstem.org/otherapps/schinke/ algorithm. I added stem.sbl from schinke.tgz to “snowball/algorithms/latin/stem.sbl” sources. Then updated GNUmakefile:

diff --git a/GNUmakefile b/GNUmakefile
index d6c7606..08237fa 100644
--- a/GNUmakefile
+++ b/GNUmakefile
@@ -29,7 +29,7 @@ libstemmer_algorithms = arabic \
                        danish dutch english finnish french german hungarian \
                        italian \
                        norwegian porter portuguese romanian \
-                       russian spanish swedish tamil turkish
+                       russian spanish swedish tamil turkish latin
 
 KOI8_R_algorithms = russian
 ISO_8859_1_algorithms = danish dutch english finnish french german italian \

Apparently the generated sources do not compile with

java version "1.8.0_92"
Java(TM) SE Runtime Environment (build 1.8.0_92-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.92-b14, mixed mode)

with error:

[error] ./src/main/java/org/tartarus/snowball/ext/latinStemmer.java:260: missing return statement
[error]     } 

Even if I stub out the error with return true or return false, the stemmer produces weird results. When I run TestApp latin in.txt -o out.txt, for the input datum it produces the string datum datum, but it should be just dat.

Take default Java class name from output filename

When generating Java from *.sbl, the class name is empty.
Note: I use a Windows command/batch file from the algorithms folder.
Example: snowball dutch\stem_ISO_8859_1.sbl -o dutchStemmer_ISO -j

Resulting file: dutchStemmer_ISO.java

The generated class name should be dutchStemmer_ISO but is empty; I put [missing] to indicate where:

// This file was generated automatically by the Snowball to Java compiler
// http://snowballstem.org/

package org.tartarus.snowball.ext;

import org.tartarus.snowball.Among;

/**
 * This class was automatically generated by a Snowball to Java compiler
 * It implements the stemming algorithm defined by a snowball script.
 */

public class [missing] extends org.tartarus.snowball.SnowballProgram {
...

Also at the end of this file the same problem arises:

public boolean equals( Object o ) {
    return o instanceof [missing];
}

public int hashCode() {
    return [missing].class.getName().hashCode();
}

Unreachable statement in generated Java code

The following Snowball code:

externals (stem)
define stem as not true

produces invalid Java code:

public boolean stem() {
    // not, line 2
    lab0: do {
        return false;
    } while (false);
    return true;
}

which does not compile:

org/tartarus/snowball/ext/testStemmer.java:23: error: unreachable statement
    return true;
    ^

Numbers with finnish stemmer

When a number with two or more trailing zeroes is entered into the Finnish stemmer, the last zero is removed.

Addition: This issue applies not only to zeroes but to all numbers, e.g. "555", "9999" etc.

Steps to reproduce

  1. Open demo at snowballstem.org: http://snowballstem.org/demo.html
  2. Select Finnish stemming algorithm
  3. Enter a text including a number with two or more trailing zeroes, e.g. "testattiin 2000"

Expected result

  1. Output will be "testat 2000"

Actual result

  1. Output is "testat 200" (last zero is gone)

[rust] warning: trait objects without an explicit `dyn` are deprecated

We're currently seeing a number of these warnings from CI here and below:

https://travis-ci.org/snowballstem/snowball/jobs/581014461#L342

The Rust compiler seems to be suggesting just inserting dyn in every instance, but I know pretty much nothing about Rust and have no idea if that's actually the appropriate fix.

@JDemler Please could you take a look?

Also is there an equivalent of GCC's -Werror which would cause new warnings to make CI fail so we can't miss them? I only spotted this while cleaning up the CI config.

Stemming feminine nouns in german stemmer

It would be good to stem the feminine versions of nouns in the German stemmer. For example, "Verkäuferin" should be stemmed to "Verkauf", as is done for the masculine form "Verkäufer". This is especially important for occupations, for example on job boards.

Is there a reason not to do that, or has no one gotten around to doing it or come up with an appropriate algorithm? I might try to do a PR if there is any chance of getting it accepted.

examples folder in C-dist should be cleaned

*.o files in examples folder should be removed on make clean. The patch:

diff --git a/GNUmakefile b/GNUmakefile
index d6c7606..8dbfefb 100644
--- a/GNUmakefile
+++ b/GNUmakefile
@@ -309,7 +309,7 @@ dist_libstemmer_c: \
        echo 'stemwords: examples/stemwords.o libstemmer.o' >> $${dest}/Makefile && \
        echo '  $$(CC) -o $$@ $$^' >> $${dest}/Makefile && \
        echo 'clean:' >> $${dest}/Makefile && \
-       echo '  rm -f stemwords *.o $(c_src_dir)/*.o runtime/*.o libstemmer/*.o' >> $${dest}/Makefile && \
+       echo '  rm -f stemwords *.o $(c_src_dir)/*.o runtime/*.o libstemmer/*.o examples/*.o' >> $${dest}/Makefile && \
        (cd dist && tar zcf $${destname}.tgz $${destname}) && \
        rm -rf $${dest}

What is Stemming?

I saw snowballstemmer being brought in as a dependency and was curious what such an unusually named package was. The PyPI page doesn't mention what stemming is or why Snowball algorithms are noteworthy for it 🤷‍♀️

The GitHub repository's README mentions a site, snowballstem.org, which, frustratingly, also fails to say what stemming is. A Google search and a Wikipedia entry later, I had my explanation 🤦‍♀️

Both the PyPI and the README ought to at least give a one line description of what stemming is, perhaps:

Stemming is the process of finding word "stems" (bases or root forms) from derived or inflected forms, such as fisher, fishing, fished -> fish
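
To make that description concrete, here is a deliberately naive suffix stripper. It is an illustration only: the suffix list and the minimum stem length are assumptions made for this example, and real Snowball algorithms are considerably more careful.

```python
SUFFIXES = ("ing", "ed", "er")  # illustrative only, not a real suffix table


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for word in ("fisher", "fishing", "fished", "fish"):
        print(word, "->", toy_stem(word))
```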

Missing return statement in generated Java and Rust code

The following Snowball code:

externals (stem)
define stem as (false or true)

produces invalid Java and Rust code: the return statements of stem are missing.

The Java error is:

org/tartarus/snowball/ext/testStemmer.java:27: error: missing return statement
}
^
1 error

The Rust error is:

error[E0308]: mismatched types
  --> /Users/dcorbett/snowball/rust/target/debug/build/testapp-ca6d6d95758b7568/out/test_stemmer.rs:25:9
   |
25 |         break 'lab0;
   |         ^^^^^^^^^^^ expected (), found bool
   |
   = note: expected type `()`
              found type `bool`

arabic, tamil

Should arabic and tamil be added to modules and modules_utf8?

PyPI licence tags

@mitya57 This was held for moderation by the mailing list, but it didn't seem relevant for most subscribers (and I'm not sure if you're even subscribed to the list) so putting it in an issue here instead:

Hello!

You have recently released a new version of snowballstemmer. Thanks! It will
probably make a lot of people happy. There is a tiny issue with your
package on PyPI though: it is a bit difficult to guess the exact
license you use.

It is very clear that you use one of the BSD licenses. PyPI allows you
to use the following classifier:

License :: OSI Approved :: BSD License

But it is not very precise. You can also use the "license" field in
setuptools, but there is no "fixed" format, so you could write "BSD2",
"BSD-2", "BSD-2-Clause", etc. This makes it hard for automated tools
to properly guess which BSD license you are using.

May I suggest using the aforementioned classifier and an SPDX
identifier (https://spdx.org/licenses/) in the "license" field?

Thanks,
Cyril Roelandt

(This email was automatically generated, but you can answer it! A real
human being will read your answer.)

(If you think it makes more sense to list your address as maintainer for PyPI rather than the mailing list, please do).
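
The suggestion in the email could look roughly like the following. This is a hypothetical excerpt of the keyword arguments one might pass to setuptools' setup(), not the package's actual setup.py; the SPDX identifier is chosen on the assumption that the project's BSD 3-Clause license applies to the package as well.

```python
# Hypothetical metadata excerpt; in a real setup.py these would be
# passed as setup(**package_metadata).
package_metadata = {
    "name": "snowballstemmer",
    # SPDX identifier: unambiguous for automated tools.
    "license": "BSD-3-Clause",
    # Trove classifier: broader, does not say which BSD variant.
    "classifiers": [
        "License :: OSI Approved :: BSD License",
    ],
}

if __name__ == "__main__":
    print(package_metadata["license"])
```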

UTF-8 ?

From: algorithms/french/stem_ISO_8859_1.sbl

stringdef a^   hex 'E2'  // a-circumflex
stringdef a`   hex 'E0'  // a-grave
stringdef c,   hex 'E7'  // c-cedilla

stringdef e"   hex 'EB'  // e-diaeresis (rare)
stringdef e'   hex 'E9'  // e-acute
stringdef e^   hex 'EA'  // e-circumflex
stringdef e`   hex 'E8'  // e-grave
stringdef i"   hex 'EF'  // i-diaeresis
stringdef i^   hex 'EE'  // i-circumflex
stringdef o^   hex 'F4'  // o-circumflex
stringdef u^   hex 'FB'  // u-circumflex
stringdef u`   hex 'F9'  // u-grave

So far there is no UTF-8 version. Why?
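
For context, the hex values in the stringdefs above are single ISO-8859-1 bytes, and the same characters take two bytes in UTF-8. The sketch below shows the correspondence; `utf8_hex` is a hypothetical helper, and nothing here describes how the Snowball compiler itself treats encodings.

```python
def utf8_hex(latin1_hex: str) -> str:
    """Convert one ISO-8859-1 byte (given as hex) to the hex of its UTF-8 form."""
    char = bytes.fromhex(latin1_hex).decode("iso-8859-1")
    return char.encode("utf-8").hex().upper()


if __name__ == "__main__":
    # A few of the codes from the French stringdefs.
    for name, code in (("a^", "E2"), ("c,", "E7"), ("e'", "E9")):
        print(name, code, "->", utf8_hex(code))
```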

Issues with numbers and alphanumeric codes

In #66, it was reported that the Finnish stemmer damaged numbers (e.g. 2000 -> 200). This is now fixed, but there are similar (though more subtle) issues elsewhere.

E.g. The Danish stemmer damages alphanumeric codes where the initial alpha part meets certain criteria - e.g. space1999 -> space199, hal9000 -> hal900, 0x0e00 -> 0x0e0. These are significantly less problematic in practice as in many cases there probably isn't any unwanted conflation (there isn't a "space199" or a "hal900") and the stem is really just an opaque internal token in typical usage. It's more likely to be a problem for cases such as hex codes in error messages, inventory codes, etc, and I think it is worth addressing in a similar way to the fix for Finnish - i.e. by replacing a non-v check with a c check where c is all the letters in danish except those in v.

We should also review the other algorithms where non is used (dutch, english, french, german, hungarian, indonesian, irish, italian, kraaij_pohlmann, lithuanian, norwegian, porter, portuguese, romanian, russian, spanish, swedish, turkish), and add a generic test that feeds some numbers and alphanumeric codes into the stemmer and checks they aren't changed, to help ensure there aren't other issues of this sort now or in the future.
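
The generic test proposed above could be sketched along these lines. The function and sample names are hypothetical, and `buggy_stem` is a stand-in that imitates the reported damage so the check has something to catch; a real test would pass each actual stemmer's stemming function instead.

```python
from typing import Callable, Iterable, List, Tuple


def find_damaged_tokens(
    stem: Callable[[str], str], tokens: Iterable[str]
) -> List[Tuple[str, str]]:
    """Return (token, stemmed) pairs for every token the stemmer altered."""
    damaged = []
    for token in tokens:
        result = stem(token)
        if result != token:
            damaged.append((token, result))
    return damaged


def buggy_stem(word: str) -> str:
    """Hypothetical stemmer that drops one of a doubled final character."""
    if len(word) >= 2 and word[-1] == word[-2]:
        return word[:-1]
    return word


# Numbers and codes taken from the reports above, plus one control value.
SAMPLES = ["2000", "555", "space1999", "hal9000", "0x0e00", "42"]

if __name__ == "__main__":
    for token, result in find_damaged_tokens(buggy_stem, SAMPLES):
        print(token, "->", result)
```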

Stemming algorithm for Faroese

Hi, I would like to contribute a stemming algorithm for Faroese.
How do I get the required permissions to do a push?

Regards
Brandur

Consider creating a change log

Currently, there is no simple way to determine which changes have been made between the released versions. Consider creating a changelog which would contain this information.
