Comments (5)

simplechris commented on May 23, 2024

Just lurkin' around, but yeah, I've ported all of the stemmers/tokenizers etc. (including PorterStemmer) from Lucene. I agree that it probably belongs in an 'extras' or other external package if you want tighter integration with Rubix.

andrewdalpino commented on May 23, 2024

@raijyan I think this is a great idea and I've considered it before myself

One of my concerns was with non-English use cases. I like the idea of a stemming tokenizer for the reason you've mentioned but also because it wouldn't require another argument to Word Count Vectorizer.

https://github.com/wamania/php-stemmer seems like it can be integrated into a tokenizer quite easily. We could have a single Stemmer tokenizer that wraps one of the other tokenizers (NGram, SkipGram, Word, etc.) and stems their output, or if that is not possible we could implement a stemming version of each tokenizer.
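
Roughly, something like the sketch below (the StemmingTokenizer name is just a placeholder and the exact namespaces may differ; stemming works most naturally on single-word tokens):

use Rubix\ML\Other\Tokenizers\Tokenizer;
use Rubix\ML\Other\Tokenizers\Word;
use Wamania\Snowball\StemmerFactory;

// Hypothetical wrapper: delegates tokenization to a base tokenizer,
// then stems each token with a Snowball stemmer from wamania/php-stemmer.
class StemmingTokenizer implements Tokenizer
{
    protected $base;

    protected $stemmer;

    public function __construct(Tokenizer $base, string $language = 'english')
    {
        $this->base = $base;
        $this->stemmer = StemmerFactory::create($language);
    }

    public function tokenize(string $string) : array
    {
        return array_map(function (string $token) {
            return $this->stemmer->stem($token);
        }, $this->base->tokenize($string));
    }
}

// It would then drop into Word Count Vectorizer like any other tokenizer.
$tokenizer = new StemmingTokenizer(new Word(), 'english');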

I am considering a 'Rubix ML Extras' repository and package that would include experimental features such as obscure transformers, neural network activation functions, and perhaps stemmers. The hope is that we have enough hardcore users who will install and experiment with these features before (and if) we include them in the main package.

We are currently in a 'feature freeze' until our first stable release (we just put out our first release candidate this week), which means we do not plan to add additional functionality until after then: only optimizations, bugfixes, and mayyyyyyybe a small feature. However, we are free to develop an 'Extras' package in the meantime.

I'd love to hear your thoughts

Do you or someone you know have proficiency with stemmers?

Thanks for the great recommendation and information!

raijyan commented on May 23, 2024

Tested adding Wamania\Snowball to my dependencies. On my product descriptions dataset (4,000 products), it went from:

$vectorizer = new WordCountVectorizer(50000, 3, new NGram(1, 1));
$dataset->apply($vectorizer);
$dataset->apply(new TfIdfTransformer());

echo 'Memory: ' . memory_get_usage() / 1024 / 1024 .'M'. PHP_EOL;
echo 'Tokens: ' . count($vectorizer->vocabularies()[0]) . PHP_EOL;

print_r(array_slice($vectorizer->vocabularies()[0], 0, 40));

Memory: 1448.11M
Tokens: 3575
Array
(
    [0] => style
    [1] => size
    [2] => this
...

to:

use Wamania\Snowball\StemmerFactory;
$vectorizer = new WordCountVectorizer(50000, 3, new NGram(1, 1), StemmerFactory::create('english'));
$dataset->apply($vectorizer);
$dataset->apply(new TfIdfTransformer());

echo 'Memory: ' . memory_get_usage() / 1024 / 1024 .'M'. PHP_EOL;
echo 'Tokens: ' . count($vectorizer->vocabularies()[0]) . PHP_EOL;

print_r(array_slice($vectorizer->vocabularies()[0], 0, 40));

Memory: 853.78M
Tokens: 2680
Array
(
    [0] => style
    [1] => size
    [2] => this
...

So there are some savings to be made, at least for my use case. Should cut a few hours off my training times on a Jaccard.

An Extras setup would be cool if you're pushing for a feature freeze. I'd probably look at putting in a lemmatizer/locality normaliser too, then (darn variants of English).

Loving the library so far though; working through moving my existing production NLP over to it, then time for some experiments >:)

raijyan commented on May 23, 2024

We could have a single Stemmer tokenizer that wraps one of the other tokenizers (NGram, SkipGram, Word, etc.) and stems their output, or if that is not possible we could implement a stemming version of each tokenizer.

Yeah, a wrapper might help; for now I've just tacked it in as:

public function tokenize(string $string, $stemmer = null) : array
...
    $nGram = $stemmer ? $stemmer->stem($word) : $word;
...

Not quite as clean as I'd like, but it has done the trick for getting it up and running. Getting some nice results using NGram over my old php-nlp/php-ai combo with single-word tokens.

A slight change to the structure would be cool if it allowed for fuller use of the multi-dictionary setup you've made. You could then set the token configuration per defined dictionary from the column picker.
E.g. being able to configure my tags/attributes as single-word tokens but my titles/descriptions as NGram(1, 3) when it iterates over them and builds the dictionaries used for the vectors would offer further performance improvements for... lazy... datasets.
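
Just to illustrate the idea, a rough sketch of how it could be approximated with the current API is below; the column layout, variable names, and the merge step are hypothetical glue code rather than anything Rubix provides.

use Rubix\ML\Datasets\Unlabeled;
use Rubix\ML\Transformers\WordCountVectorizer;
use Rubix\ML\Other\Tokenizers\Word;
use Rubix\ML\Other\Tokenizers\NGram;

// Hypothetical rows of [tags, title, description] strings.
$tagDataset = new Unlabeled(array_map(function ($row) {
    return [$row[0]];
}, $rows));

$textDataset = new Unlabeled(array_map(function ($row) {
    return [$row[1], $row[2]];
}, $rows));

// Single-word tokens for tags/attributes, 1-3 grams for titles/descriptions.
$tagDataset->apply(new WordCountVectorizer(10000, 1, new Word()));
$textDataset->apply(new WordCountVectorizer(50000, 3, new NGram(1, 3)));

// Merge the vectorized columns back into one sample matrix.
$samples = array_map('array_merge', $tagDataset->samples(), $textDataset->samples());

$dataset = new Unlabeled($samples);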

andrewdalpino commented on May 23, 2024

@raijyan @simplechris

We went ahead and created an Extras package that can be installed right now as dev-master (composer require rubix/extras).

Included is the Word Stemmer, which can be used alone or as the base tokenizer for either N-Gram or Skip Gram. Example below:

use Rubix\ML\Transformers\WordCountVectorizer;
use Rubix\ML\Other\Tokenizers\NGram;
use Rubix\ML\Other\Tokenizers\WordStemmer;

$transformer = new WordCountVectorizer(10000, 3, new NGram(1, 2, new WordStemmer('english')));

The changes to N-Gram and Skip Gram have not been released yet but you can install the latest dev-master to preview the features.

In addition, we've added the Delta TF-IDF Transformer, a supervised TF-IDF transformer that boosts term frequencies by how unique they are to a particular class rather than just the entire corpus.
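
For example, a quick sketch of a bag-of-words pipeline combining the two might look something like the below (class and namespace names may differ slightly in the released version; since Delta TF-IDF is supervised it needs a labeled dataset):

use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Transformers\WordCountVectorizer;
use Rubix\ML\Transformers\DeltaTfIdfTransformer;
use Rubix\ML\Other\Tokenizers\NGram;
use Rubix\ML\Other\Tokenizers\WordStemmer;

// Hypothetical labeled text dataset: one text column plus class labels.
$dataset = new Labeled($samples, $labels);

// Count stemmed 1- and 2-grams, then weight terms by class-boosted TF-IDF.
$dataset->apply(new WordCountVectorizer(10000, 3, new NGram(1, 2, new WordStemmer('english'))));
$dataset->apply(new DeltaTfIdfTransformer());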

Preliminary tests using the Sentiment example and the new Word Stemmer as the base tokenizer for N-Gram show no noticeable improvement in accuracy or training speed; however, your mileage may vary. Let me know how it works for you.

With that, we now have a standard way to introduce experimental features in Rubix ML. Feel free to suggest features or contribute to the development of the project if you are so willing.
