Comments (16)
Summarizing what we talked about in chat: this error is caused by a chain of silent failures, starting with numerical under/overflow due to a learning rate that is too high for the user's particular dataset. As a result, the network produces NaN values at the output layer, which in turn produce a prediction of false when run through the argmax function. That false value is then silently converted (thanks, PHP) to the integer 0 when used as the key of the array entry that accumulates false positives in the MCC and FBeta metrics.
The solution is to decrease the learning rate of the Gradient Descent optimizer to prevent the network from blowing up. To help users identify when the network has become unstable, we will catch NaN values before scoring the validation set and throw an informative exception.
Here is a good article on exploding gradients and why decreasing the learning rate stabilizes training: https://machinelearningmastery.com/exploding-gradients-in-neural-networks/
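The chain above can be reproduced in isolation. Here is a minimal sketch; the argmax idiom shown is an assumption for illustration, not the library's exact code:

```php
<?php

// An unstable network can emit NAN at the output layer.
$activations = [NAN, NAN];

// A common PHP argmax idiom: find the max, then search for its index.
// array_search() compares with ==, and NAN never compares equal to
// anything (not even itself), so the search fails and returns false.
$prediction = array_search(max($activations), $activations);

var_dump($prediction);  // bool(false)

// Used as an array key, false is silently cast to the integer 0, which
// is not one of the per-class counter keys -- hence the
// "Undefined offset: 0" errors raised from FBeta.php and MCC.php.
$falsePos = ['positive' => 0, 'non-positive' => 0];
++$falsePos[$prediction];  // Undefined offset: 0
```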
from ml.
Hi @DivineOmega thanks for the bug report!
By any chance, do any of the training sets contain 10 or fewer samples?
Also, here is line 107 of MCC:
For some reason, the integer 0 is showing up in a prediction ... do you have a directory named 0, or one that would evaluate to 0 or false under type coercion? (see https://www.php.net/manual/en/language.types.boolean.php#language.types.boolean.casting)
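For reference, PHP's casting rules (per the manual page linked above) treat only a handful of values as falsey:

```php
<?php

var_dump((bool) '0');    // bool(false) -- the string '0' is falsey
var_dump((bool) '');     // bool(false)
var_dump((bool) '0.0');  // bool(true)  -- only the exact string '0' is special
var_dump((int) false);   // int(0)      -- false becomes the array key 0
```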
Hi @andrewdalpino. Thanks for looking into this.
My two classes are positive with 527 samples and non-positive with 1035.
There's no directory named 0, and I do not think my buildLabeled function would generate a 0 class. If that were the case, I'd expect the metric to fail immediately every time, whereas it sometimes only happens after several epochs.
To confirm I've only got the two classes, I ran the following code.
$dataset = DatasetHelper::buildLabeled();
dd($dataset->possibleOutcomes());
And got the following array:
array:2 [
0 => "non-positive"
1 => "positive"
]
FYI, I'm using the dev-master d0872a0 version of rubix/ml via Composer, the latest version as of right now.
$ composer show | grep rubix/ml
rubix/ml dev-master d0872a0 A high-level machine learning and deep learning library for the PHP language.
@DivineOmega Hmmmm ... this is a mysterious one
How often does this error occur? For example, out of 100 trainings, how many of them would error, in your estimation?
Does training seem normal when these errors occur? Is the loss decreasing steadily? I'm wondering whether the network is outputting NaN values because training went awry for some reason.
@andrewdalpino So far, with that dataset it has failed 5/5 times (3 with FBeta, 2 with MCC).
$ cat storage/logs/laravel.log | grep "Undefined offset"
[2020-04-04 21:15:02] production.ERROR: Undefined offset: 0 {"exception":"[object] (ErrorException(code: 0): Undefined offset: 0 at /[REDACTED]/vendor/rubix/ml/src/CrossValidation/Metrics/FBeta.php:127)
[2020-04-04 21:34:52] production.ERROR: Undefined offset: 0 {"exception":"[object] (ErrorException(code: 0): Undefined offset: 0 at /[REDACTED]/vendor/rubix/ml/src/CrossValidation/Metrics/FBeta.php:127)
[2020-04-04 22:04:19] production.ERROR: Undefined offset: 0 {"exception":"[object] (ErrorException(code: 0): Undefined offset: 0 at /[REDACTED]/vendor/rubix/ml/src/CrossValidation/Metrics/FBeta.php:127)
[2020-04-04 22:32:21] production.ERROR: Undefined offset: 0 {"exception":"[object] (ErrorException(code: 0): Undefined offset: 0 at /[REDACTED]/vendor/rubix/ml/src/CrossValidation/Metrics/MCC.php:107)
[2020-04-05 08:24:02] production.ERROR: Undefined offset: 0 {"exception":"[object] (ErrorException(code: 0): Undefined offset: 0 at /[REDACTED]/vendor/rubix/ml/src/CrossValidation/Metrics/MCC.php:107)
$ cat storage/logs/laravel.log | grep "Undefined offset" | wc -l
5
Here's the log from the latest training session. Loss is decreasing.
$ php artisan ml:train
[2020-04-05 08:02:48] train-model.INFO: Fitted WordCountVectorizer
[2020-04-05 08:02:52] train-model.INFO: Fitted TfIdfTransformer
[2020-04-05 08:02:57] train-model.INFO: Fitted ZScaleStandardizer
[2020-04-05 08:02:59] train-model.INFO: Learner init hidden_layers=[0=Dense 1=PReLU 2=Dense 3=PReLU 4=Dense 5=PReLU 6=Dense 7=PReLU 8=Dense 9=PReLU] batch_size=100 optimizer=Adam alpha=0.0001 epochs=1000 min_change=0.0001 window=10 hold_out=0.1 cost_fn=CrossEntropy metric=MCC
[2020-04-05 08:02:59] train-model.INFO: Training started
[2020-04-05 08:07:14] train-model.INFO: Epoch 1 score=0.49118783277336 loss=0.30282976376964
[2020-04-05 08:11:25] train-model.INFO: Epoch 2 score=0.50325846378583 loss=0.18621958479585
[2020-04-05 08:15:38] train-model.INFO: Epoch 3 score=0.50114946967608 loss=0.1244821070199
[2020-04-05 08:19:52] train-model.INFO: Epoch 4 score=0.55362054479179 loss=0.096733785356479
ErrorException
Undefined offset: 0
at vendor/rubix/ml/src/CrossValidation/Metrics/MCC.php:107
103| ++$trueNeg[$class];
104| }
105| }
106| } else {
> 107| ++$falsePos[$prediction];
108| ++$falseNeg[$label];
109| }
110| }
111|
+5 vendor frames
6 app/Console/Commands/TrainModel.php:89
Rubix\ML\PersistentModel::train()
+14 vendor frames
21 artisan:37
Illuminate\Foundation\Console\Kernel::handle()
Hmm, everything seems normal up to epoch 5.
I'm going to give this some thought
from ml.
@andrewdalpino I've just re-ran the training with exact same dataset, and learner/metric configuration. In this case it fails after epoch 3.
$ php artisan ml:train
[2020-04-05 10:45:40] train-model.INFO: Fitted WordCountVectorizer
[2020-04-05 10:45:45] train-model.INFO: Fitted TfIdfTransformer
[2020-04-05 10:45:49] train-model.INFO: Fitted ZScaleStandardizer
[2020-04-05 10:45:51] train-model.INFO: Learner init hidden_layers=[0=Dense 1=PReLU 2=Dense 3=PReLU 4=Dense 5=PReLU 6=Dense 7=PReLU 8=Dense 9=PReLU] batch_size=100 optimizer=Adam alpha=0.0001 epochs=1000 min_change=0.0001 window=10 hold_out=0.1 cost_fn=CrossEntropy metric=MCC
[2020-04-05 10:45:51] train-model.INFO: Training started
[2020-04-05 10:49:51] train-model.INFO: Epoch 1 score=0.56864319578525 loss=0.28955885998412
[2020-04-05 10:53:56] train-model.INFO: Epoch 2 score=0.4425972854422 loss=0.15849405919593
[2020-04-05 10:58:04] train-model.INFO: Epoch 3 score=0.48331552310563 loss=0.13056189684066
ErrorException
Undefined offset: 0
at vendor/rubix/ml/src/CrossValidation/Metrics/MCC.php:107
103| ++$trueNeg[$class];
104| }
105| }
106| } else {
> 107| ++$falsePos[$prediction];
108| ++$falseNeg[$label];
109| }
110| }
111|
+5 vendor frames
6 app/Console/Commands/TrainModel.php:89
Rubix\ML\PersistentModel::train()
+14 vendor frames
21 artisan:37
Illuminate\Foundation\Console\Kernel::handle()
@andrewdalpino Tried again, and it failed after the 8th epoch. Training on this dataset does not appear to be succeeding at all at the moment.
I'm wondering if it is an error in the labeled dataset's stratifiedSplit method. I'll do some testing.
While attempting to debug this, training succeeded. The only difference was that I added 2 new samples (1 to each class). However, it did fail once with these additional samples as well, so I'd assume this is just a coincidence.
[2020-04-05 13:06:53] train-model.INFO: Epoch 7 score=0.52481517908426 loss=0.042373131883474
[2020-04-05 13:06:53] train-model.INFO: Parameters restored from snapshot at epoch 4.
[2020-04-05 13:06:53] train-model.INFO: Training complete
The change of dataset can definitely be ruled out, as training just completed on the original dataset after epoch 13.
[2020-04-05 14:05:29] train-model.INFO: Epoch 13 score=0.50379156418009 loss=0.028950132464314
[2020-04-05 14:05:29] train-model.INFO: Parameters restored from snapshot at epoch 3.
[2020-04-05 14:05:29] train-model.INFO: Training complete
The number of training rounds the algorithm executes should not matter; rather, I was looking at how the loss was steadily decreasing while the MCC was steadily increasing over time. In the quoted log we see a downward jump in the MCC at epoch 2; however, this may just be the algorithm escaping a local minimum, so no problems are apparent from the logs.
@andrewdalpino I've just re-ran the training with exact same dataset, and learner/metric configuration. In this case it fails after epoch 3.
[2020-04-05 10:49:51] train-model.INFO: Epoch 1 score=0.56864319578525 loss=0.28955885998412
[2020-04-05 10:53:56] train-model.INFO: Epoch 2 score=0.4425972854422 loss=0.15849405919593
[2020-04-05 10:58:04] train-model.INFO: Epoch 3 score=0.48331552310563 loss=0.13056189684066
ErrorException: Undefined offset: 0 at vendor/rubix/ml/src/CrossValidation/Metrics/MCC.php:107
I'm going to need to do some more digging to find out what's really going on
Thanks for the extra info @DivineOmega it's very helpful
After the latest trials, roughly what is the training success rate?
One thing you can try in the meantime is decreasing the learning rate of the Adam optimizer. You can also try using the non-adaptive Stochastic optimizer to rule out issues due to momentum.
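A sketch of those two suggestions, assuming the optimizer classes live under the Rubix\ML\NeuralNet\Optimizers namespace and that Adam's default learning rate is 0.001:

```php
<?php

use Rubix\ML\NeuralNet\Optimizers\Adam;
use Rubix\ML\NeuralNet\Optimizers\Stochastic;

// Try 10x lower than Adam's default learning rate.
$optimizer = new Adam(0.0001);

// Or rule out momentum/adaptive effects with plain SGD.
$optimizer = new Stochastic(0.001);
```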
Also, feel free to join our chat https://t.me/RubixML
@andrewdalpino Today and yesterday, on datasets of that size and above, I've seen 16 failures of this type and only maybe 2 or 3 successes (so around 11 - 16% success rate).
@andrewdalpino Interestingly, after lowering the Adam optimiser learning rate by 10x, the training completed first time without issue.
$ php artisan ml:train
[2020-04-05 22:27:14] train-model.INFO: Fitted WordCountVectorizer
[2020-04-05 22:27:18] train-model.INFO: Fitted TfIdfTransformer
[2020-04-05 22:27:22] train-model.INFO: Fitted ZScaleStandardizer
[2020-04-05 22:27:24] train-model.INFO: Learner init hidden_layers=[0=Dense 1=PReLU 2=Dense 3=PReLU 4=Dense 5=PReLU 6=Dense 7=PReLU 8=Dense 9=PReLU] batch_size=100 optimizer=Adam alpha=0.0001 epochs=1000 min_change=0.0001 window=10 hold_out=0.1 cost_fn=CrossEntropy metric=FBeta
[2020-04-05 22:27:24] train-model.INFO: Training started
[2020-04-05 22:31:49] train-model.INFO: Epoch 1 score=0.66972334473112 loss=0.3231510276241
[2020-04-05 22:36:49] train-model.INFO: Epoch 2 score=0.72543492395247 loss=0.20104000225191
[2020-04-05 22:41:36] train-model.INFO: Epoch 3 score=0.74259303923404 loss=0.11534214676067
[2020-04-05 22:46:22] train-model.INFO: Epoch 4 score=0.75292705742216 loss=0.08167593375029
[2020-04-05 22:50:53] train-model.INFO: Epoch 5 score=0.78349036224602 loss=0.062994060070852
[2020-04-05 22:55:24] train-model.INFO: Epoch 6 score=0.78586685471343 loss=0.053025888192447
[2020-04-05 22:59:44] train-model.INFO: Epoch 7 score=0.7640015718736 loss=0.049605014056035
[2020-04-05 23:04:15] train-model.INFO: Epoch 8 score=0.75258053059604 loss=0.043536833530061
[2020-04-05 23:08:41] train-model.INFO: Epoch 9 score=0.76595700309472 loss=0.040312908446744
[2020-04-05 23:12:55] train-model.INFO: Epoch 10 score=0.76807362257247 loss=0.037231249873399
[2020-04-05 23:17:15] train-model.INFO: Epoch 11 score=0.75719922146547 loss=0.034125440250398
[2020-04-05 23:21:37] train-model.INFO: Epoch 12 score=0.75719922146547 loss=0.035133840655248
[2020-04-05 23:25:57] train-model.INFO: Epoch 13 score=0.76033834586466 loss=0.03846565249604
[2020-04-05 23:30:19] train-model.INFO: Epoch 14 score=0.77751091875429 loss=0.032385857444644
[2020-04-05 23:34:40] train-model.INFO: Epoch 15 score=0.7703448072442 loss=0.032105124285677
[2020-04-05 23:39:04] train-model.INFO: Epoch 16 score=0.7703448072442 loss=0.032299122346931
[2020-04-05 23:39:04] train-model.INFO: Parameters restored from snapshot at epoch 6.
[2020-04-05 23:39:04] train-model.INFO: Training complete
@andrewdalpino Since lowering the Adam optimiser learning rate by 10x, the training has yet to fail once over around 5 or 6 training sessions.
This works around the problem for now, but does not solve the root cause. It might help to narrow it down, though?
Thanks again for the great bug report @DivineOmega
You can test out the fix on the latest dev-master or you can wait until the next release
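For anyone following along, the summary at the top describes the fix as catching NaN values before scoring the validation set. A guard like that might look roughly as follows; this is an illustration, not the library's actual code:

```php
<?php

/**
 * Throw an informative exception if any output probability is NaN,
 * instead of letting a failed argmax silently coerce to the array key 0
 * downstream in the metrics.
 *
 * @param list<float[]> $activations network outputs, one row per sample
 */
function guardAgainstNan(array $activations): void
{
    foreach ($activations as $i => $row) {
        foreach ($row as $value) {
            if (is_nan($value)) {
                throw new RuntimeException(
                    "Network output contains NaN at sample $i."
                    . ' Training has become unstable; try lowering the learning rate.'
                );
            }
        }
    }
}

guardAgainstNan([[0.9, 0.1], [0.4, 0.6]]);  // passes silently
```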