Comments (5)
Just lurkin' around, but yeah, I've ported all of the stemmers/tokenizers, etc. (including PorterStemmer) from Lucene. I agree that it probably belongs in an 'extras' or other external package if you want tighter integration with Rubix.
from ml.
@raijyan I think this is a great idea, and I've considered it before myself.
One of my concerns was with non-English use cases. I like the idea of a stemming tokenizer for the reason you've mentioned, but also because it wouldn't require another argument to Word Count Vectorizer.
https://github.com/wamania/php-stemmer seems like it could be integrated into a tokenizer quite easily. We could have a single Stemmer tokenizer that wraps one of the other tokenizers (NGram, SkipGram, Word, etc.) and stems their output; if that's not possible, we could implement a stemming version of each tokenizer.
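For illustration, a decorator-style wrapper might look something like this minimal sketch. The class names and the naive suffix-stripping stemmer are assumptions for the example, not the actual Rubix ML or php-stemmer API:

```php
interface Tokenizer
{
    public function tokenize(string $text) : array;
}

// A bare-bones word tokenizer standing in for the real Word tokenizer.
class Word implements Tokenizer
{
    public function tokenize(string $text) : array
    {
        return preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    }
}

// Decorator that stems whatever tokens the wrapped tokenizer produces.
class StemmingTokenizer implements Tokenizer
{
    public function __construct(private Tokenizer $base)
    {
    }

    public function tokenize(string $text) : array
    {
        return array_map([$this, 'stem'], $this->base->tokenize($text));
    }

    protected function stem(string $word) : string
    {
        // Naive placeholder: a real implementation would delegate to a
        // Snowball stemmer from wamania/php-stemmer instead.
        return preg_replace('/(ing|ed|s)$/', '', $word);
    }
}
```

Wrapping keeps stemming orthogonal to tokenization, so the same decorator would work around NGram or SkipGram without duplicating each class.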
I am considering a 'Rubix ML Extras' repository and package that would include experimental features such as obscure transformers, neural network activation functions, and perhaps stemmers. The hope is that we have enough hardcore users who will install and experiment with these features before (and if) we include them in the main package.
We are currently in a 'feature freeze' until our first stable release (we just put out our first release candidate this week), which means we do not plan to add additional functionality until after then. Only optimizations, bugfixes, and mayyyyyyybe a small feature. However, we are free to develop an 'Extras' package in the meantime.
I'd love to hear your thoughts.
Do you or someone you know have proficiency with stemmers?
Thanks for the great recommendation and information!
Tested adding Wamania\Snowball to my dependencies. On my product descriptions dataset (4,000 products) it went from:

```php
$vectorizer = new WordCountVectorizer(50000, 3, new NGram(1, 1));

$dataset->apply($vectorizer);
$dataset->apply(new TfIdfTransformer());

echo 'Memory: ' . memory_get_usage() / 1024 / 1024 . 'M' . PHP_EOL;
echo 'Tokens: ' . count($vectorizer->vocabularies()[0]) . PHP_EOL;

print_r(array_slice($vectorizer->vocabularies()[0], 0, 40));
```

```
Memory: 1448.11M
Tokens: 3575
Array
(
    [0] => style
    [1] => size
    [2] => this
    ...
```
to:
```php
use Wamania\Snowball\StemmerFactory;

$vectorizer = new WordCountVectorizer(50000, 3, new NGram(1, 1), StemmerFactory::create('english'));

$dataset->apply($vectorizer);
$dataset->apply(new TfIdfTransformer());

echo 'Memory: ' . memory_get_usage() / 1024 / 1024 . 'M' . PHP_EOL;
echo 'Tokens: ' . count($vectorizer->vocabularies()[0]) . PHP_EOL;

print_r(array_slice($vectorizer->vocabularies()[0], 0, 40));
```

```
Memory: 853.78M
Tokens: 2680
Array
(
    [0] => style
    [1] => size
    [2] => this
    ...
```
So there are some savings to be made, at least for my use cases. Should cut a few hours off my training times with Jaccard.
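The token-count drop makes sense: stemming collapses inflected forms to a single vocabulary entry, so the vectorizer carries fewer columns per sample. A toy illustration, with a naive suffix stripper standing in for a real Snowball stemmer:

```php
// Naive stand-in for a real stemmer, for demonstration only.
function naiveStem(string $word) : string
{
    return preg_replace('/(ing|ed|s)$/', '', $word);
}

$words = ['style', 'styles', 'styled', 'styling', 'size', 'sizes'];

$rawVocab     = array_unique($words);                         // 6 entries
$stemmedVocab = array_unique(array_map('naiveStem', $words)); // 3 entries: style, styl, size
```

Each dropped entry removes a column from every sample vector, which is where the memory savings above come from.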
An Extras setup would be cool if you're pushing for a feature freeze. I'd probably look at putting in a lemmatizer and a locality normaliser too then (darn variants of English).
Loving the library so far though; working through moving my existing production NLP over to it, then time for some experiments >:)
> We could have a single Stemmer tokenizer that wraps one of the other tokenizers (NGram, SkipGram, Word, etc.) and stems their output, or if that is not possible we could implement a stemming version of each tokenizer.
Yeah, a wrapper might help. Currently I've just tacked it in as:

```php
public function tokenize(string $string, $stemmer = null) : array
{
    // ...
    $nGram = $stemmer ? $stemmer->stem($word) : $word;
    // ...
}
```
Not quite as clean as I'd like, but it has done the trick for getting it up and running. Getting some nice results using NGram over my old php-nlp/php-ai combo with single-word tokens.
A slight change to the structure would be cool if it allowed fuller use of the multi-dictionary setup you've made. You could then set the token configuration per defined dictionary from the column picker.
E.g. being able to configure that my tags/attributes are single-word tokens, but my titles/descriptions are NGram(1, 3) when it iterates over them and builds the dictionaries used for vectors, would offer a further performance improvement for... lazy... datasets.
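As a rough sketch of what per-column token configuration could look like (the column layout and the `ngrams()` helper are made up for the example; this is not an existing Rubix ML API):

```php
// Hypothetical per-column tokenizer map: column 0 holds tags,
// column 1 holds descriptions.
$tokenizers = [
    0 => fn (string $text) : array => preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY),
    1 => fn (string $text) : array => ngrams($text, 1, 3),
];

// Minimal n-gram helper (contiguous word n-grams from $min to $max words).
function ngrams(string $text, int $min, int $max) : array
{
    $words = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $out = [];

    for ($n = $min; $n <= $max; $n++) {
        for ($i = 0; $i + $n <= count($words); $i++) {
            $out[] = implode(' ', array_slice($words, $i, $n));
        }
    }

    return $out;
}

// Each column of a sample is tokenized with its own strategy
// before the dictionaries are built.
$sample = ['red cotton', 'a soft red shirt'];

$tokens = [];
foreach ($sample as $column => $text) {
    $tokens[$column] = $tokenizers[$column]($text);
}
```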
We went ahead and created an Extras package that can be installed right now as dev-master: `composer require rubix/extras`.
Included is the Word Stemmer, which can be used alone or as the base tokenizer for either N-Gram or Skip Gram. Example below:
```php
use Rubix\ML\Transformers\WordCountVectorizer;
use Rubix\ML\Other\Tokenizers\NGram;
use Rubix\ML\Other\Tokenizers\WordStemmer;

$transformer = new WordCountVectorizer(10000, 3, new NGram(1, 2, new WordStemmer('english')));
```
The changes to N-Gram and Skip Gram have not been released yet but you can install the latest dev-master to preview the features.
In addition, we've added the Delta TF-IDF Transformer, a supervised TF-IDF transformer that boosts term frequencies by how unique they are to a particular class, not just the entire corpus.
Preliminary tests using the Sentiment example with the new Word Stemmer as the base tokenizer for N-Gram show no noticeable improvement in accuracy or training speed; however, your mileage may vary. Let me know how it works for you.
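For anyone curious how the class-aware weighting differs from plain TF-IDF, here is a rough sketch of the idea. This is a common two-class formulation with add-one smoothing; the actual Rubix ML transformer may compute it differently:

```php
// Delta TF-IDF idea: weight a term by its count times the *difference*
// between its IDF within one class and its IDF within the other, so
// terms that discriminate between classes outscore corpus-wide ones.
function deltaTfIdf(int $tf, int $nPos, int $dfPos, int $nNeg, int $dfNeg) : float
{
    // Add-one smoothing guards against zero document frequencies.
    $idfPos = log(($nPos + 1) / ($dfPos + 1));
    $idfNeg = log(($nNeg + 1) / ($dfNeg + 1));

    return $tf * ($idfPos - $idfNeg);
}
```

A term spread evenly across both classes scores zero, while a term concentrated in one class keeps a large (signed) weight.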
With that, we now have a standard way to introduce experimental features in Rubix ML. Feel free to suggest features or contribute to the development of the project if you are so willing.