snowballstem / snowball
Snowball compiler and stemming algorithms
Home Page: https://snowballstem.org/
License: BSD 3-Clause "New" or "Revised" License
Hi,
currently, the tarball of the C version of the libstemmer library is missing the 'german2' algorithm. Is there a particular reason for this?
Thanks!
I see BSD-3-Clause mentioned in README.rst. Could you please consider shipping the licence text in a separate file alongside README.rst? It is possible to store the license separately from the other docs.
Thank you for considering
I am trying to write a stemmer for the Occitan language, which is very close to Catalan, but I can't.
There is a difference in the Among constructor signature between the Java code generated by the compiler and what is used in the Solr build chain.
Where can I get the right version of the compiler in order to generate matching code?
For example
/* CatalanStemmer from SolR code */
@SuppressWarnings("unused") public class CatalanStemmer extends SnowballProgram {
private static final long serialVersionUID = 1L;
/* patched */ private static final java.lang.invoke.MethodHandles.Lookup methodObject = java.lang.invoke.MethodHandles.lookup();
private final static Among a_0[] = {
new Among ( "", -1, 13, "", methodObject ),
new Among ( "\u00B7", 0, 12, "", methodObject ),
/*OccitanStemmer generated by the compiler*/
public class OccitanStemmer extends org.tartarus.snowball.SnowballProgram {
private static final long serialVersionUID = 1L;
private final static Among a_0[] = {
new Among("", -1, 13),
Again
among_var = find_among(a_0);
and
among_var = find_among_b(a_1);
are different.
I just wanted to let you know that the Julia community provides pre-built binary artifacts for various native dependencies, and has recently built them for Snowball. These binaries, for many different architectures and operating systems, are available from https://github.com/JuliaBinaryWrappers/Snowball_jll.jl/releases . The bundles contain a shared library (.dylib/.so/.dll) and the stemwords executable. They were built from the Snowball C library download, for all available encodings.
Background information about the cross-compiler based build system can be found at https://binarybuilder.org/
[Please close this issue when convenient-- it's only for your information]
I think I have found a bug in the Swedish stemmer. When searching for "mötet" (the meeting) I should get results for "möte" and "möten". I think the problem is with stemming words ending in "et" (words ending in "andet" and "het" should work, though, since those endings are in the suffix list).
When searching for the longest suffix in the first step, I added the suffix "et" and that works. I don't know if that is the right way to fix this, though.
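A minimal sketch of what adding "et" to the first step's longest-match search might look like. This is pure Python and illustrative only: the suffix list is abbreviated and the real algorithm also restricts removal to the R1 region, which this sketch ignores.

```python
# Illustrative subset of step-1 suffixes, plus the proposed "et";
# the real stemmer has a longer list and only strips inside R1.
STEP1_SUFFIXES = sorted(
    ["a", "e", "ad", "ade", "ande", "andet", "are", "en", "heten",
     "het", "ar", "er", "or", "at", "ast", "et"],  # "et" is the proposed addition
    key=len, reverse=True)

def step1(word: str) -> str:
    # Remove the longest matching suffix, if any.
    for suffix in STEP1_SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

# With "et" in the list, all three forms conflate:
# "mötet" -> "möt", "möten" -> "möt", "möte" -> "möt"
```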
I installed the Stemmer and tried to use it with Arabic, but this is what I got:
raise KeyError("Stemming algorithm '%s' not found" % lang)
KeyError: "Stemming algorithm 'arabic' not found"
Am I missing some dependencies? If so, could you please clarify what I need to make this stemming work?
As http://snowballstem.org/algorithms/russian/stemmer.html properly mentions, the Russian alphabet contains the letter Ё [jo], which is quite often replaced with Е, especially in regular, non-academic texts.
So indeed the best approach is to replace Ё -> Е when stemming.
Now if you check the existing demo http://snowballstem.org/demo.html you can see that it doesn't actually happen.
Let's take Russian word for "honey" — «мёд», and its form with different ending — «мёдом».
If you paste it along with its "normalized" form (with Е) you can see that the form with Ё is not properly stemmed:
Here's sample input so that you can run the tests yourself: "мёд мёдом мед медом".
This is a serious problem when searching through a corpus of natural texts. Even if you're a purist (like me in this case) and type all your search terms with properly placed Ё, you won't be able to match the original texts that use Е.
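A pre-stemming normalization pass along these lines (a sketch, not part of the current library) would make the Ё and Е spellings conflate:

```python
def normalize_yo(text: str) -> str:
    # Fold Ё into Е before stemming, so «мёд» and «мед» (and their
    # inflected forms) produce the same stems.
    return text.replace("ё", "е").replace("Ё", "Е")

print(normalize_yo("мёд мёдом мед медом"))  # мед медом мед медом
```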
Could you please insert the licence into the distributed source code?
The BSD licence requires including its text with all derivative works, so without the full license text it is difficult or impossible for anyone to comply with the intended license terms.
There was a discussion on the old mailing list: http://thread.gmane.org/gmane.comp.search.snowball/1463
Thanks,
Marek
When calling the stemWords method in the Python stemmer, it may crash with the following error:
AttributeError: 'FrenchStemmer' object has no attribute 'clear_cache'
It must be an error in the code, as the _clear_cache method exists and is called by the sister method stemWord.
I've made a PR to fix this issue:
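The failure mode is easy to reproduce in isolation: the method is defined with a leading underscore, but one call site omits it. A minimal sketch follows; the class and method names mirror the report, not PyStemmer's actual implementation.

```python
class StemmerSketch:
    def _clear_cache(self):      # defined with a leading underscore
        self._cache = {}

    def stemWord(self, word):
        self._clear_cache()      # correct name: works
        return word

    def stemWords(self, words):
        self.clear_cache()       # misspelled name: AttributeError at runtime
        return words

s = StemmerSketch()
s.stemWord("manger")             # fine
try:
    s.stemWords(["manger"])
except AttributeError as e:
    print(e)                     # message names the missing 'clear_cache'
```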
I found an algorithm for the Ukrainian language in this repo:
https://github.com/Tapkomet/UAStemming
It would be great if you added it.
The Snowball compiler currently generates C code which intermingles variable declarations and code, which is incompatible with ANSI C.
The fix is to declare the variable at the start of the block. Here is a patch that accomplishes this:
compiler/generator.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/compiler/generator.c b/compiler/generator.c
index 30643ca..29aeef3 100644
--- a/compiler/generator.c
+++ b/compiler/generator.c
@@ -890,11 +890,11 @@ static void generate_slicefrom(struct generator * g, struct node * p) {
static void generate_setlimit(struct generator * g, struct node * p) {
int keep_c;
- writef(g, "~{~k~C", p);
- keep_c = g->keep_count;
- w(g, "~Mint mlimit");
+ w(g, "~{int mlimit");
+ keep_c = g->keep_count + 1;
write_int(g, keep_c);
w(g, ";~N");
+ writef(g, "~M~k~C", p);
generate(g, p->left);
w(g, "~Mmlimit");
I have visually checked the result for all generated C stemmer files and found just the intended changes. The new code compiles well as ANSI C and passes all my tests.
All uses of setlimit in the current stemmers we ship follow this pattern, and by special-casing we can avoid having to save and restore the cursor.
df2064d implemented this for C, but it would be good to handle this for all the target languages.
I'm working on a stemmer for the Lithuanian language. The stemmer is going to be used in a Lucene-based search engine, therefore I need to use the generated Java class.
In my algorithm I'm using the length of the word, and I believe this causes me a problem.
I've been following this tutorial. The resulting Java class works fine with the example program. But the Java code does not compile for use in Lucene, with this problem:
[...] /src/main/java/org/tartarus/snowball/ext/LithuanianStemmer.java:[589,27] error: current has private access in SnowballProgram
I see that the code which was generated by snowball compiler tries to get length of the string in this way:
(current.length());
Unfortunately, the implementation of the SnowballProgram class that is used by Lucene defines the current field in this way:
// current string
private char current[];
So, if I want to use the generated class in Lucene, I have to manually change the call for string length in this way:
getCurrent().length();
Am I facing a bug? Or this is the expected behaviour?
I first want to thank everyone on the Snowball project for creating this software. It's great that we can use the software to build more sophisticated search capabilities for our users. However, when I was testing several Dutch words, I noticed there are actually quite a lot of mistakes. I'm not quite sure how to fix the problems in the Dutch stemmer, so I thought I'd mention them here and hope someone picks it up.
Not sure where to start, so I'll mention a couple that are incorrect (the last word is the correct one):
gevaren gevar -> gevaar
gevaar gevar -> gevaar
gevaarlijk gevar -> gevaarlijk
gevaarlijke gevar -> gevaarlijk
gevaarlijker gevaarlijker -> gevaarlijk
gevaarten gevaart -> gevaarte
gevallen gevall -> geval
geven gev -> geef
gevist gevist -> vis
gewasbescherming gewasbescherm -> gewasbescherming
gewassen gewass -> gewas
geweer gewer -> geweer
aanbellen aanbell -> bel aan (yes, Dutch is weird)
aandeel aandel -> aandeel
aaneen aanen -> aaneen (should really be excluded from stemming if possible, since there is no way that this word occurs in any other form)
aalmoezen aalmoez -> aalmoes
gangetje gangetj -> gang
gebaartje gebaartj -> gebaar
These are just a few, but there are quite a lot more. Should you need help verifying or testing the stemmer for the Dutch words, I'm happy to help :)
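Until the algorithm itself is improved, one pragmatic workaround (at the application layer, not part of Snowball itself) is an exception table consulted before the algorithmic stemmer. The entries below are hypothetical, taken from the list above:

```python
# Hypothetical exception table for irregular Dutch forms; a real one
# would need curation by a native speaker.
DUTCH_EXCEPTIONS = {
    "gevaren": "gevaar",
    "gevaarlijk": "gevaarlijk",
    "gevaarlijke": "gevaarlijk",
    "aaneen": "aaneen",   # invariant word: exclude from stemming entirely
}

def stem_with_exceptions(word, algorithmic_stem):
    # Look the word up first; only fall back to the algorithmic stemmer
    # when no curated entry exists.
    return DUTCH_EXCEPTIONS.get(word) or algorithmic_stem(word)
```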
In the java TestApp, we have the statement at https://github.com/snowballstem/snowball/blob/master/java/org/tartarus/snowball/TestApp.java#L31
SnowballStemmer stemmer = (SnowballStemmer) stemClass.newInstance();
However, the generated Java class inherits from org.tartarus.snowball.SnowballProgram rather than org.tartarus.snowball.SnowballStemmer.
By simply making the generated class inherit from org.tartarus.snowball.SnowballStemmer (already a subclass of org.tartarus.snowball.SnowballProgram), the TestApp works out of the box.
How do I compile on Windows?
Hi there,
I would like to know if anyone is interested in having support for a C# generator in this project. Generating C# code is actually quite different from what the Java generator does, as those languages, albeit similar, don't have the same constructs (e.g. labelled breaks/loops). As such, in order to create a generator for C#, it would be necessary to mix parts from the C/C++ generator and the Java one. Plus, it would also be interesting to have a .NET library for Snowball, in the same spirit as the Java one that is available in this repository.
As such, I just wanted to add that I have started a new fork to address those issues that is almost completely finished, with support for a new generator_csharp.c that can already generate all stemmer definition files that come with Snowball. The generated C# files could be seen, for example, here:
I have also added some unit tests for some languages, though not all of them yet. So, if any project maintainer would like this feature, I can send a pull request to merge my forked version. Otherwise I can just keep my fork around for anyone else who might be interested.
The fork is currently available at https://github.com/cesarsouza/snowball. In order to generate C# files, you can pass "-cs" to the command line compiler, as in
An example generated file can be found here.
Hope it can be useful! 😃
When generating Java from *.sbl files, in combination with the current SnowballProgram Java file (2016072500L), current.length is not accessible anymore, because current is now a private StringBuffer.
So things like "len" and "size" don't compile; should they be changed to getCurrent().length()?
(See the part with "Thanks to Wolfram Esser for spotting this problem." in SnowballProgram.java.)
Java: there is an issue with the generated stemmer code using a stemmer variable referencing the abstract grandparent SnowballProgram class.
I think I have the solution (see below), but would like to have some feedback.
For example, when compiling the Schinke Latin latin.sbl.txt stemmer
(see http://snowball.tartarus.org/otherapps/schinke/intro.html):
The generated concrete Java class LatinStemmer extends SnowballStemmer, which extends the abstract SnowballProgram.
In this class, the $noun_form and $verb_form variables are copies of the LatinStemmer class.
Generated code example for LatinStemmer, from lines 48 and 57 respectively:
48: SnowballProgram v_2 = new SnowballProgram(this);
...
57: SnowballProgram v_4 = new SnowballProgram(this);
The issue is that SnowballProgram is an abstract class, and thus cannot be instantiated.
My possible solution: change the SnowballStemmer class copy constructor to call the super class SnowballProgram:
package org.tartarus.snowball;
import java.lang.reflect.InvocationTargetException;
public abstract class SnowballStemmer extends SnowballProgram {
public abstract boolean stem();
public SnowballStemmer(SnowballStemmer other) {
super(other);
}
static final long serialVersionUID = 2016072500L;
}
Then also generate the copy constructor in the generated class:
public LatinStemmer(SnowballStemmer other) {
super(other);
}
Then the generated code can be changed as follows for "LatinStemmer":
LatinStemmer v_2 = new LatinStemmer(this);
I can't find a better solution on Stack Overflow, but maybe I'm missing something here?
Note that LatinStemmer is one of the few Snowball algorithms that uses a stemmer variable.
Is there a way to eliminate such variables?
NB: the import java.lang.reflect.InvocationTargetException; seems to be no longer used, since there is no try/catch for InvocationTargetException in SnowballStemmer?
The Java file generated from this source:
externals (null)
define null as ()
contains this invalid method definition:
public boolean null() {
// (, line 2
return true;
}
Hi there,
Is there a way to compile Snowball to a shared library, generating something like libsnowball.so or libsnowball.dll?
PyStemmer throws a UnicodeDecodeError on specific input involving certain emojis with the Danish stemmer (note the difference in inputs: the second has two 'a's in between the emojis):
> mkvirtualenv -p /usr/bin/python3 stemmer
...
> pip install PyStemmer
...
Successfully installed PyStemmer-1.3.0
> python3 --version
Python 3.6.6
> python3 -c "import Stemmer; print(Stemmer.Stemmer('da').stemWord(b'\xf0\x9f\x98\x98a\xf0\x9f\x98\x98'.decode('utf-8')))"
😘a😘
> python3 -c "import Stemmer; print(Stemmer.Stemmer('da').stemWord(b'\xf0\x9f\x98\x98aa\xf0\x9f\x98\x98'.decode('utf-8')))"
Traceback (most recent call last):
File "Stemmer.pyx", line 184, in Stemmer.Stemmer.stemWord (src/Stemmer.c:1669)
KeyError: b'\xf0\x9f\x98\x98aa\xf0\x9f\x98\x98'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "Stemmer.pyx", line 192, in Stemmer.Stemmer.stemWord (src/Stemmer.c:1772)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 6-8: unexpected end of data
Other tested languages work:
> python3 -c "import Stemmer; [Stemmer.Stemmer(lang).stemWord(b'\xf0\x9f\x98\x98aa\xf0\x9f\x98\x98'.decode('utf-8')) for lang in ('en', 'sv', 'fi', 'fr', 'de')]" && echo 'ok'
ok
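The traceback is consistent with the stemmer cutting the word at a byte rather than a character boundary. The error can be reproduced without PyStemmer by truncating the UTF-8 encoding of the same input mid-codepoint:

```python
text = b'\xf0\x9f\x98\x98aa\xf0\x9f\x98\x98'.decode('utf-8')  # the failing input
raw = text.encode('utf-8')   # 10 bytes: 4 (emoji) + 2 ('aa') + 4 (emoji)
truncated = raw[:-1]         # drop one byte: the final emoji is incomplete
try:
    truncated.decode('utf-8')
except UnicodeDecodeError as e:
    # Same reason as in the traceback: 'unexpected end of data',
    # starting at byte position 6 (matching "position 6-8" above).
    print(e.reason, e.start, e.end)
```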
After trying the Turkish stemmer, I found there might be some inconsistencies with the original paper 'An Affix Stripping Morphological Analyzer for Turkish'.
Here are some error examples:
Yazacağım -> Yazacak // should stem to Yaz; acak is a tense suffix
Yazıyorsunuz -> Yaziyor // should stem to Yazi; yor is a tense suffix
Both examples above derive from Yazi (write).
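The behaviour asked for in the examples can be sketched as longest-first suffix stripping. This is a hypothetical toy, not the published algorithm; the suffix list is invented for illustration:

```python
# Hypothetical tense/person suffixes, illustrative only.
TENSE_SUFFIXES = sorted(["acağım", "yorsunuz", "acak", "yor"],
                        key=len, reverse=True)

def strip_tense(word: str) -> str:
    stem = word.lower()
    for suffix in TENSE_SUFFIXES:
        # Keep at least a two-letter stem.
        if stem.endswith(suffix) and len(stem) - len(suffix) >= 2:
            return stem[: -len(suffix)]
    return stem

# "Yazacağım" -> "yaz", "Yazıyorsunuz" -> "yazı"
```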
Above all, thanks a lot for the hard work helping people analyse languages they don't even know much about.
Just copying a request posted from the mailing archive.
I'm trying to generate Java sources for the http://snowballstem.org/otherapps/schinke/ algorithm. I added stem.sbl from schinke.tgz to "snowball/algorithms/latin/stem.sbl". Then I updated the GNUmakefile:
diff --git a/GNUmakefile b/GNUmakefile
index d6c7606..08237fa 100644
--- a/GNUmakefile
+++ b/GNUmakefile
@@ -29,7 +29,7 @@ libstemmer_algorithms = arabic \
danish dutch english finnish french german hungarian \
italian \
norwegian porter portuguese romanian \
- russian spanish swedish tamil turkish
+ russian spanish swedish tamil turkish latin
KOI8_R_algorithms = russian
ISO_8859_1_algorithms = danish dutch english finnish french german italian \
Apparently the generated sources do not compile against
java version "1.8.0_92"
Java(TM) SE Runtime Environment (build 1.8.0_92-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.92-b14, mixed mode)
with error:
[error] ./src/main/java/org/tartarus/snowball/ext/latinStemmer.java:260: missing return statement
[error] }
Even if I stub the error with return true or return false, the stemmer produces weird results. When I launch TestApp latin in.txt -o out.txt for the input datum, it produces the string datum datum, but it should just be dat.
When generating Java from *.sbl, the class name is empty.
Note: I use a Windows command / batch file from the algorithms folder.
Example: snowball dutch\stem_ISO_8859_1.sbl -o dutchStemmer_ISO -j
Resulting file: dutchStemmer_ISO.java
The generated class name should be dutchStemmer_ISO but is empty. I put [missing] to indicate where:
// This file was generated automatically by the Snowball to Java compiler
// http://snowballstem.org
package org.tartarus.snowball.ext;
import org.tartarus.snowball.Among;
/**
 * This class was automatically generated by a Snowball to Java compiler
 * It implements the stemming algorithm defined by a snowball script.
 */
public class [missing] extends org.tartarus.snowball.SnowballProgram {
...
Also at the end of this file the same problem arises:
public boolean equals( Object o ) {
    return o instanceof [missing];
}
public int hashCode() {
    return [missing].class.getName().hashCode();
}
Systeme -> System
Systemen -> System
Systemes -> System
Systems -> System
but
System -> Syst
Please see the thread here as well: http://www.postgresql.org/message-id/[email protected]
The following Snowball code:
externals (stem)
define stem as not true
produces invalid Java code:
public boolean stem() {
// not, line 2
lab0: do {
return false;
} while (false);
return true;
}
which does not compile:
org/tartarus/snowball/ext/testStemmer.java:23: error: unreachable statement
return true;
^
When entering a number with 2 or more trailing zeroes into the Finnish stemmer, the last zero is removed.
Addition: this issue applies not only to zeroes but to all numbers, e.g. "555", "9999" etc.
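Until the algorithm is fixed, a generic guard of this shape (a sketch at the application layer, not a change to the stemmer itself) keeps purely numeric tokens away from it:

```python
def stem_token(token, stem):
    # Pass numeric tokens through untouched; only words reach the stemmer.
    if token.isdigit():
        return token
    return stem(token)

# With a stand-in stemmer that drops the last character:
# stem_token("9000", lambda w: w[:-1]) -> "9000" (protected)
# stem_token("taloa", lambda w: w[:-1]) -> "talo"
```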
We're currently seeing a number of these warnings from CI here and below:
https://travis-ci.org/snowballstem/snowball/jobs/581014461#L342
The Rust compiler seems to be suggesting to just insert dyn in every instance, but I know pretty much nothing about Rust and have no idea if that's actually the appropriate fix.
@JDemler Please could you take a look?
Also, is there an equivalent of GCC's -Werror which would cause new warnings to make CI fail so we can't miss them? I only spotted this while cleaning up the CI config.
It would be good to stem the feminine versions of nouns in the German stemmers. For example, "Verkäuferin" should be stemmed to "Verkauf", as is done for the masculine form "Verkäufer". This is especially important for occupations, for example on job boards.
Is there a reason not to do that, or has no one gotten around to it or come up with an appropriate algorithm? I might try to do a PR if there is any chance of getting it accepted.
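A naive pre-step in this spirit might strip "-in"/"-innen" before the existing suffix steps. This is purely a hypothetical sketch: a bare "-in" rule would mangle words like "Berlin", so the version below only strips after an "-er" agent-noun ending, and a real rule would still need more linguistic care.

```python
def strip_feminine(word: str) -> str:
    # Hypothetical pre-step: plural "-innen" before singular "-in".
    # Only strip when the remainder ends in "er", approximating
    # agent nouns (Verkäufer|in), so "Berlin" is left alone.
    for suffix in ("innen", "in"):
        if word.endswith(suffix) and word[: -len(suffix)].endswith("er"):
            return word[: -len(suffix)]
    return word

# "Verkäuferin" -> "Verkäufer", "Verkäuferinnen" -> "Verkäufer"
```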
*.o files in the examples folder should be removed on make clean. The patch:
diff --git a/GNUmakefile b/GNUmakefile
index d6c7606..8dbfefb 100644
--- a/GNUmakefile
+++ b/GNUmakefile
@@ -309,7 +309,7 @@ dist_libstemmer_c: \
echo 'stemwords: examples/stemwords.o libstemmer.o' >> $${dest}/Makefile && \
echo ' $$(CC) -o $$@ $$^' >> $${dest}/Makefile && \
echo 'clean:' >> $${dest}/Makefile && \
- echo ' rm -f stemwords *.o $(c_src_dir)/*.o runtime/*.o libstemmer/*.o' >> $${dest}/Makefile && \
+ echo ' rm -f stemwords *.o $(c_src_dir)/*.o runtime/*.o libstemmer/*.o examples/*.o' >> $${dest}/Makefile && \
(cd dist && tar zcf $${destname}.tgz $${destname}) && \
rm -rf $${dest}
I saw snowballstemmer being brought in as a dependency and was curious what such an unusually named package was. The PyPI page doesn't mention what stemming is or why Snowball algorithms are noteworthy for it 🤷♀️
The GitHub repository's README mentions the site snowballstem.org which, frustratingly, also fails to say what stemming is. A Google search and a Wikipedia entry later explains it 🤦♀️
Both the PyPI page and the README ought to at least give a one-line description of what stemming is, perhaps:
Stemming is the process of finding word "stems" (bases or root forms) from derived or inflected forms, such as fisher, fishing, fished → fish.
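For illustration, the suggested one-line description could even be paired with a toy example. The snippet below uses crude suffix stripping, not the real Porter/Snowball rules:

```python
def toy_stem(word: str) -> str:
    # Crude illustration of stemming: strip one common English
    # inflectional suffix, keeping at least a three-letter stem.
    for suffix in ("ing", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([toy_stem(w) for w in ("fisher", "fishing", "fished")])  # ['fish', 'fish', 'fish']
```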
The following Snowball code:
externals (stem)
define stem as (false or true)
produces invalid Java and Rust code: the return statements of stem are missing.
The Java error is:
org/tartarus/snowball/ext/testStemmer.java:27: error: missing return statement
}
^
1 error
The Rust error is:
error[E0308]: mismatched types
--> /Users/dcorbett/snowball/rust/target/debug/build/testapp-ca6d6d95758b7568/out/test_stemmer.rs:25:9
|
25 | break 'lab0;
| ^^^^^^^^^^^ expected (), found bool
|
= note: expected type `()`
found type `bool`
Should arabic and tamil be added to modules and modules_utf8?
@mitya57 This was held for moderation by the mailing list, but it didn't seem relevant for most subscribers (and I'm not sure if you're even subscribed to the list) so putting it in an issue here instead:
Hello!
You have recently released a new version of snowballstemmer. Thanks! It will probably make a lot of people happy. There is a tiny issue with your package on PyPI though: it is a bit difficult to guess the exact license you use.
It is very clear that you use one of the BSD licenses. PyPI allows you to use the following classifier:
License :: OSI Approved :: BSD License
But it is not very precise. You can also use the "license" field in setuptools, but there is no "fixed" format, so you could write "BSD2", "BSD-2", "BSD-2-Clause", etc. This makes it hard for automated tools to properly guess which BSD license you are using.
May I suggest using the aforementioned classifier and an SPDX identifier (https://spdx.org/licenses/) in the "license" field?
Thanks,
Cyril Roelandt
(This email was automatically generated, but you can answer it! A real human being will read your answer.)
(If you think it makes more sense to list your address as maintainer for PyPI rather than the mailing list, please do).
From: algorithms/french/stem_ISO_8859_1.sbl
stringdef a^ hex 'E2' // a-circumflex
stringdef a` hex 'E0' // a-grave
stringdef c, hex 'E7' // c-cedilla
stringdef e" hex 'EB' // e-diaeresis (rare)
stringdef e' hex 'E9' // e-acute
stringdef e^ hex 'EA' // e-circumflex
stringdef e` hex 'E8' // e-grave
stringdef i" hex 'EF' // i-diaeresis
stringdef i^ hex 'EE' // i-circumflex
stringdef o^ hex 'F4' // o-circumflex
stringdef u^ hex 'FB' // u-circumflex
stringdef u` hex 'F9' // u-grave
So far there is no UTF-8 version. Why?
Hello! I hope this is the right place to report a problem I found with an app that uses PostgreSQL 10's full-text search.
There is a class of French nouns that form their plural in x: jeux, hiboux, choux, aulx, baux, etc.
Testing with PG and reading the doc at https://snowballstem.org/algorithms/french/stemmer.html makes me think that these are not handled.
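A hypothetical rule in the stemmer's spirit would treat a final "x" after "au"/"eu"/"ou" as a plural marker. This is a sketch only: suppletive cases like aulx → ail are beyond any suffix rule, and even this overreaches (baux → bau, while the singular is bail).

```python
def strip_x_plural(word: str) -> str:
    # Hypothetical: nouns in -aux/-eux/-oux drop the plural "x".
    if word.endswith("x") and word[-3:-1] in ("au", "eu", "ou"):
        return word[:-1]
    return word

# "jeux" -> "jeu", "hiboux" -> "hibou", "choux" -> "chou"
```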
Hi,
It would be very helpful for packagers if you created release tags.
https://help.github.com/articles/about-releases
Thanks!
In #66, it was reported that the Finnish stemmer damaged numbers (e.g. 2000 -> 200). This is now fixed, but there are similar (though more subtle) issues elsewhere.
E.g. the Danish stemmer damages alphanumeric codes where the initial alpha part meets certain criteria - e.g. space1999 -> space199, hal9000 -> hal900, 0x0e00 -> 0x0e0. These are significantly less problematic in practice, as in many cases there probably isn't any unwanted conflation (there isn't a "space199" or a "hal900") and the stem is really just an opaque internal token in typical usage. It's more likely to be a problem for cases such as hex codes in error messages, inventory codes, etc., and I think it is worth addressing in a similar way to the fix for Finnish - i.e. by replacing a non-v check with a c check, where c is all the letters in Danish except those in v.
We also should review the other algorithms where non is used (dutch, english, french, german, hungarian, indonesian, irish, italian, kraaij_pohlmann, lithuanian, norwegian, porter, portuguese, romanian, russian, spanish, swedish, turkish), and add a generic test that feeds some numbers and alphanumerics into the stemmers and checks they aren't changed, to help ensure there aren't other issues of this sort now or in the future.
Hi
can you provide maven artifact and metadata for the Java stemming algorithm library?
Thanks in advance
Regards
Hi,
All source files are missing license headers.
Please confirm the licensing of the code and/or contents, and add license headers.
Thanks in advance
Regards
Hello,
I installed the library from PyPI
pip install snowballstemmer
There is a bug in https://github.com/snowballstem/snowball/blob/master/python/create_init.py#L42
----> 1 snowballstemmer.algorithms()
67 return Stemmer.language()
68 else:
---> 69 return list(_languages.key())
70
71 def stemmer(lang):
AttributeError: 'dict' object has no attribute 'key'
It should be _languages.keys()
I think it would make it far easier for people to use it.
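The fix reported above is one character plus an "s". With a toy _languages mapping standing in for the generated table:

```python
_languages = {"english": None, "french": None}   # stand-in for the real table

def algorithms():
    # dict has no .key() method; .keys() returns the view of language names
    return list(_languages.keys())

print(sorted(algorithms()))  # ['english', 'french']
```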
Hi, I would like to contribute a stemming algorithm for Faroese.
How do I get the required permissions to do a push?
Regards
Brandur
Currently, there is no simple way to determine which changes have been made between the released versions. Consider creating a changelog which would contain this information.
The web page for the French stemming algorithm does not mention other, rarely used letters with diacritics. For instance:
Are these rare diacritics in French orthography considered in the French stemming algorithm?
I noticed @mitya57 imported Python code generator from my code (#24).
If someone wants to use the following namespace, I will give the ownership of the PyPI namespace:
https://pypi.org/project/snowballstemmer/
Newer versions should be generated from this repository instead of mine.