Comments (5)

simplechris commented on May 23, 2024

Just lurkin' around, but yeah, I've ported all of the stemmers/tokenizers etc. (including PorterStemmer) from Lucene. I agree that it probably belongs in an 'extras' or other external package if you want tighter integration with Rubix.

andrewdalpino commented on May 23, 2024

@raijyan I think this is a great idea and I've considered it before myself

One of my concerns was with non-English use cases. I like the idea of a stemming tokenizer for the reason you've mentioned but also because it wouldn't require another argument to Word Count Vectorizer.

https://github.com/wamania/php-stemmer seems like it can be integrated into a tokenizer quite easily. We could have a single Stemmer tokenizer that wraps one of the other tokenizers (NGram, SkipGram, Word, etc.) and stems their output, or if that is not possible we could implement a stemming version of each tokenizer.
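
Roughly, something like the sketch below (the StemmingTokenizer name is just a placeholder and the exact namespaces may differ; stemming works most naturally on single-word tokens):

use Rubix\ML\Other\Tokenizers\Tokenizer;
use Rubix\ML\Other\Tokenizers\Word;
use Wamania\Snowball\StemmerFactory;

// Hypothetical wrapper: delegates tokenization to a base tokenizer,
// then stems each token with a Snowball stemmer from wamania/php-stemmer.
class StemmingTokenizer implements Tokenizer
{
    protected $base;

    protected $stemmer;

    public function __construct(Tokenizer $base, string $language = 'english')
    {
        $this->base = $base;
        $this->stemmer = StemmerFactory::create($language);
    }

    public function tokenize(string $string) : array
    {
        return array_map(function (string $token) {
            return $this->stemmer->stem($token);
        }, $this->base->tokenize($string));
    }
}

// It would then drop into Word Count Vectorizer like any other tokenizer.
$tokenizer = new StemmingTokenizer(new Word(), 'english');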

I am considering a 'Rubix ML Extras' repository and package that would include experimental features such as obscure transformers, neural network activation functions, and perhaps stemmers. The hope is that we have enough hardcore users who will install and experiment with these features before (and if) we include them in the main package.

We are currently in a 'feature freeze' until our first stable release (we just put out our first release candidate this week), which means we do not plan to add additional functionality until after then: only optimizations, bugfixes, and mayyyyyyybe a small feature. However, we are free to develop an 'Extras' package in the meantime.

I'd love to hear your thoughts

Do you or someone you know have proficiency with stemmers?

Thanks for the great recommendation and information!

raijyan commented on May 23, 2024

Tested adding Wamania\Snowball to my dependencies. On my product descriptions dataset (4,000 products), it went from:

$vectorizer = new WordCountVectorizer(50000, 3, new NGram(1, 1));
$dataset->apply($vectorizer);
$dataset->apply(new TfIdfTransformer());

echo 'Memory: ' . memory_get_usage() / 1024 / 1024 .'M'. PHP_EOL;
echo 'Tokens: ' . count($vectorizer->vocabularies()[0]) . PHP_EOL;

print_r(array_slice($vectorizer->vocabularies()[0], 0, 40));

Memory: 1448.11M
Tokens: 3575
Array
(
    [0] => style
    [1] => size
    [2] => this
...

to:

use Wamania\Snowball\StemmerFactory;
$vectorizer = new WordCountVectorizer(50000, 3, new NGram(1, 1), StemmerFactory::create('english'));
$dataset->apply($vectorizer);
$dataset->apply(new TfIdfTransformer());

echo 'Memory: ' . memory_get_usage() / 1024 / 1024 .'M'. PHP_EOL;
echo 'Tokens: ' . count($vectorizer->vocabularies()[0]) . PHP_EOL;

print_r(array_slice($vectorizer->vocabularies()[0], 0, 40));

Memory: 853.78M
Tokens: 2680
Array
(
    [0] => style
    [1] => size
    [2] => this
...

So there are some savings to be made, at least for my use case. Should cut a few hours off my training times on a Jaccard.

An Extras setup would be cool if you're pushing for a feature freeze. I'd probably look at putting in a lemmatizer/locality normaliser too, then (darn variants of English).

Loving the library so far though; working through moving my existing production NLP over to it, then time for some experiments >:)

raijyan commented on May 23, 2024

We could have a single Stemmer tokenizer that wraps one of the other tokenizers (NGram, SkipGram, Word, etc.) and stems their output, or if that is not possible we could implement a stemming version of each tokenizer.

Yeah, a wrapper might help; for now I've just tacked it in as:

public function tokenize(string $string, $stemmer = null) : array
...
    $nGram = $stemmer ? $stemmer->stem($word) : $word;
...

Not quite as clean as I'd like, but it has done the trick for getting it up and running. Getting some nice results using NGram over my old php-nlp/php-ai combo with single-word tokens.

A slight change to the structure would be cool if it allowed for fuller use of the multi-dictionary setup you've made. You could then set the token configuration per defined dictionary from the column picker.
E.g. being able to configure my tags/attributes as single-word tokens but my titles/descriptions as NGram(1, 3) when it iterates over them and builds the dictionaries used for the vectors would offer further performance improvements for... lazy... datasets.
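
Just to illustrate the idea, a rough sketch of how it could be approximated with the current API is below; the column layout, variable names, and the merge step are hypothetical glue code rather than anything Rubix provides.

use Rubix\ML\Datasets\Unlabeled;
use Rubix\ML\Transformers\WordCountVectorizer;
use Rubix\ML\Other\Tokenizers\Word;
use Rubix\ML\Other\Tokenizers\NGram;

// Hypothetical rows of [tags, title, description] strings.
$tagDataset = new Unlabeled(array_map(function ($row) {
    return [$row[0]];
}, $rows));

$textDataset = new Unlabeled(array_map(function ($row) {
    return [$row[1], $row[2]];
}, $rows));

// Single-word tokens for tags/attributes, 1-3 grams for titles/descriptions.
$tagDataset->apply(new WordCountVectorizer(10000, 1, new Word()));
$textDataset->apply(new WordCountVectorizer(50000, 3, new NGram(1, 3)));

// Merge the vectorized columns back into one sample matrix.
$samples = array_map('array_merge', $tagDataset->samples(), $textDataset->samples());

$dataset = new Unlabeled($samples);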

andrewdalpino commented on May 23, 2024

@raijyan @simplechris

We went ahead and created an Extras package that can be installed right now as dev-master (composer require rubix/extras).

Included is the Word Stemmer, which can be used alone or as the base tokenizer for either N-Gram or Skip Gram. Example below:

use Rubix\ML\Transformers\WordCountVectorizer;
use Rubix\ML\Other\Tokenizers\NGram;
use Rubix\ML\Other\Tokenizers\WordStemmer;

$transformer = new WordCountVectorizer(10000, 3, new NGram(1, 2, new WordStemmer('english')));

The changes to N-Gram and Skip Gram have not been released yet but you can install the latest dev-master to preview the features.

In addition, we've added the Delta TF-IDF Transformer, a supervised TF-IDF transformer that boosts term frequencies by how unique they are to a particular class rather than just the entire corpus.
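
For example, a quick sketch of a bag-of-words pipeline combining the two might look something like the below (class and namespace names may differ slightly in the released version; since Delta TF-IDF is supervised it needs a labeled dataset):

use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Transformers\WordCountVectorizer;
use Rubix\ML\Transformers\DeltaTfIdfTransformer;
use Rubix\ML\Other\Tokenizers\NGram;
use Rubix\ML\Other\Tokenizers\WordStemmer;

// Hypothetical labeled text dataset: one text column plus class labels.
$dataset = new Labeled($samples, $labels);

// Count stemmed 1- and 2-grams, then weight terms by class-boosted TF-IDF.
$dataset->apply(new WordCountVectorizer(10000, 3, new NGram(1, 2, new WordStemmer('english'))));
$dataset->apply(new DeltaTfIdfTransformer());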

Preliminary tests using the Sentiment example and the new Word Stemmer as the base tokenizer for N-Gram show no noticeable improvement in accuracy or training speed; however, your mileage may vary. Let me know how it works for you.

With that, we now have a standard way to introduce experimental features in Rubix ML. Feel free to suggest features or contribute to the development of the project if you are so willing.
