
Comments (16)

andrewdalpino commented on May 22, 2024

Summarizing what we talked about in chat ...

This error is caused by a chain of silent errors that starts with numerical under/overflow due to a learning rate that is too high for the user's particular dataset. As a result, the network produces NaN values at the output layer, which in turn produce a prediction of false when run through the argmax function. This false value is then silently converted (thanks, PHP) to the integer 0 when used as the key of the array entry that accumulates false positives in the MCC and FBeta metrics.
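
To make that chain concrete, here is a minimal sketch in plain PHP (a common argmax idiom, not the actual library internals):

// Output layer activations after the gradients have blown up.
$activations = ['positive' => NAN, 'non-positive' => NAN];

// A common argmax idiom: find the key of the maximum value. NAN never
// compares equal to anything (not even itself), so the search fails and
// array_search() returns false instead of a class label.
$prediction = array_search(max($activations), $activations);

var_dump($prediction); // bool(false)

// The metrics accumulate false positives keyed by class label. Using false
// as an array key silently coerces it to the integer 0, which was never
// initialized as a class key, hence the "Undefined offset: 0" notice below.
$falsePos = ['positive' => 0, 'non-positive' => 0];

++$falsePos[$prediction];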

The solution to this is to decrease the learning rate of the Gradient Descent optimizer to prevent the network from blowing up. To aid the user in identifying when the network has become unstable, we will catch NaN values before scoring the validation set and then throw an informative exception.
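
Something like this minimal sketch is what I have in mind (illustrative only, not the actual patch; variable names are made up):

// Illustrative output activations for one sample; NAN means the network blew up.
$activations = [NAN, 0.5];

// Fail loudly before the unstable output ever reaches the validation metric,
// rather than letting NaN silently corrupt the score.
foreach ($activations as $activation) {
    if (is_nan($activation)) {
        throw new RuntimeException(
            'Numerical instability detected, try lowering the learning rate.'
        );
    }
}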

Here is a good article on exploding gradients and why decreasing the learning rate has the effect of stabilizing training: https://machinelearningmastery.com/exploding-gradients-in-neural-networks/


andrewdalpino commented on May 22, 2024

Hi @DivineOmega thanks for the bug report!

By any chance, do any of the training sets contain 10 or fewer samples?

Also, here is line 107 of MCC:

[Screenshot: MCC.php line 107]

For some reason, the integer 0 is showing up in a prediction ... do you have a directory named 0 or one that would evaluate to 0 or false under type-coercion? (see https://www.php.net/manual/en/language.types.boolean.php#language.types.boolean.casting)


DivineOmega commented on May 22, 2024

Hi @andrewdalpino. Thanks for looking into this.

My two classes are positive with 527 samples and non-positive with 1035.

There's no directory named 0, and I do not think my buildLabeled function would generate a 0 class. If that were the case, I'd expect the metric to fail immediately and every time, whereas it sometimes only fails after several epochs.

To confirm I've only got the two classes, I ran the following code.

$dataset = DatasetHelper::buildLabeled();
dd($dataset->possibleOutcomes());

And got the following array:

array:2 [
  0 => "non-positive"
  1 => "positive"
]


DivineOmega commented on May 22, 2024

FYI, I'm using the dev-master d0872a0 version of rubix/ml via Composer, which is the latest version as of right now.

$ composer show | grep rubix/ml
rubix/ml                              dev-master d0872a0 A high-level machine learning and deep learning library for the PHP language.


andrewdalpino commented on May 22, 2024

@DivineOmega Hmmmm ... this is a mysterious one

How often does this error occur? For example, out of 100 training runs, roughly how many of them would error, in your estimation?

Does training seem normal when these errors occur? Is the loss decreasing steadily? I'm wondering whether the network is outputting NaN values because, for some reason, training went awry.


DivineOmega commented on May 22, 2024

@andrewdalpino So far, with that dataset it has failed 5/5 times (3 with FBeta, 2 with MCC).

$ cat storage/logs/laravel.log | grep "Undefined offset"
[2020-04-04 21:15:02] production.ERROR: Undefined offset: 0 {"exception":"[object] (ErrorException(code: 0): Undefined offset: 0 at /[REDACTED]/vendor/rubix/ml/src/CrossValidation/Metrics/FBeta.php:127)
[2020-04-04 21:34:52] production.ERROR: Undefined offset: 0 {"exception":"[object] (ErrorException(code: 0): Undefined offset: 0 at /[REDACTED]/vendor/rubix/ml/src/CrossValidation/Metrics/FBeta.php:127)
[2020-04-04 22:04:19] production.ERROR: Undefined offset: 0 {"exception":"[object] (ErrorException(code: 0): Undefined offset: 0 at /[REDACTED]/vendor/rubix/ml/src/CrossValidation/Metrics/FBeta.php:127)
[2020-04-04 22:32:21] production.ERROR: Undefined offset: 0 {"exception":"[object] (ErrorException(code: 0): Undefined offset: 0 at /[REDACTED]/vendor/rubix/ml/src/CrossValidation/Metrics/MCC.php:107)
[2020-04-05 08:24:02] production.ERROR: Undefined offset: 0 {"exception":"[object] (ErrorException(code: 0): Undefined offset: 0 at /[REDACTED]/vendor/rubix/ml/src/CrossValidation/Metrics/MCC.php:107)
$ cat storage/logs/laravel.log | grep "Undefined offset" | wc -l
5

Here's the log from the latest training session. Loss is decreasing.

$ php artisan ml:train
[2020-04-05 08:02:48] train-model.INFO: Fitted WordCountVectorizer
[2020-04-05 08:02:52] train-model.INFO: Fitted TfIdfTransformer
[2020-04-05 08:02:57] train-model.INFO: Fitted ZScaleStandardizer
[2020-04-05 08:02:59] train-model.INFO: Learner init hidden_layers=[0=Dense 1=PReLU 2=Dense 3=PReLU 4=Dense 5=PReLU 6=Dense 7=PReLU 8=Dense 9=PReLU] batch_size=100 optimizer=Adam alpha=0.0001 epochs=1000 min_change=0.0001 window=10 hold_out=0.1 cost_fn=CrossEntropy metric=MCC
[2020-04-05 08:02:59] train-model.INFO: Training started
[2020-04-05 08:07:14] train-model.INFO: Epoch 1 score=0.49118783277336 loss=0.30282976376964
[2020-04-05 08:11:25] train-model.INFO: Epoch 2 score=0.50325846378583 loss=0.18621958479585
[2020-04-05 08:15:38] train-model.INFO: Epoch 3 score=0.50114946967608 loss=0.1244821070199
[2020-04-05 08:19:52] train-model.INFO: Epoch 4 score=0.55362054479179 loss=0.096733785356479

   ErrorException 

  Undefined offset: 0

  at vendor/rubix/ml/src/CrossValidation/Metrics/MCC.php:107
    103|                         ++$trueNeg[$class];
    104|                     }
    105|                 }
    106|             } else {
  > 107|                 ++$falsePos[$prediction];
    108|                 ++$falseNeg[$label];
    109|             }
    110|         }
    111| 

      +5 vendor frames 
  6   app/Console/Commands/TrainModel.php:89
      Rubix\ML\PersistentModel::train()

      +14 vendor frames 
  21  artisan:37
      Illuminate\Foundation\Console\Kernel::handle()


andrewdalpino commented on May 22, 2024

Hmm everything seems normal up to epoch 5

I'm going to give this some thought


DivineOmega commented on May 22, 2024

@andrewdalpino I've just re-run the training with the exact same dataset and learner/metric configuration. In this case, it failed after epoch 3.

$ php artisan ml:train
[2020-04-05 10:45:40] train-model.INFO: Fitted WordCountVectorizer
[2020-04-05 10:45:45] train-model.INFO: Fitted TfIdfTransformer
[2020-04-05 10:45:49] train-model.INFO: Fitted ZScaleStandardizer
[2020-04-05 10:45:51] train-model.INFO: Learner init hidden_layers=[0=Dense 1=PReLU 2=Dense 3=PReLU 4=Dense 5=PReLU 6=Dense 7=PReLU 8=Dense 9=PReLU] batch_size=100 optimizer=Adam alpha=0.0001 epochs=1000 min_change=0.0001 window=10 hold_out=0.1 cost_fn=CrossEntropy metric=MCC
[2020-04-05 10:45:51] train-model.INFO: Training started
[2020-04-05 10:49:51] train-model.INFO: Epoch 1 score=0.56864319578525 loss=0.28955885998412
[2020-04-05 10:53:56] train-model.INFO: Epoch 2 score=0.4425972854422 loss=0.15849405919593
[2020-04-05 10:58:04] train-model.INFO: Epoch 3 score=0.48331552310563 loss=0.13056189684066

   ErrorException 

  Undefined offset: 0

  at vendor/rubix/ml/src/CrossValidation/Metrics/MCC.php:107
    103|                         ++$trueNeg[$class];
    104|                     }
    105|                 }
    106|             } else {
  > 107|                 ++$falsePos[$prediction];
    108|                 ++$falseNeg[$label];
    109|             }
    110|         }
    111| 

      +5 vendor frames 
  6   app/Console/Commands/TrainModel.php:89
      Rubix\ML\PersistentModel::train()

      +14 vendor frames 
  21  artisan:37
      Illuminate\Foundation\Console\Kernel::handle()


DivineOmega commented on May 22, 2024

@andrewdalpino Tried again, and it failed after the 8th epoch. Training on this dataset does not appear to be succeeding at all at the moment.

I'm wondering if it is an error in the Labeled dataset's stratifiedSplit() method. I'll do some testing.
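
For example, a quick sanity check along those lines (assuming the ratio argument of stratifiedSplit() is the proportion of the first subset) would be to confirm that both halves still contain both classes:

$dataset = DatasetHelper::buildLabeled();

// Mirror the learner's 0.1 hold-out: split 90/10 while preserving class
// proportions, then inspect the classes present in each half.
[$training, $validation] = $dataset->stratifiedSplit(0.9);

dd($training->possibleOutcomes(), $validation->possibleOutcomes());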


DivineOmega commented on May 22, 2024

While attempting to debug this, training succeeded. The only difference I made was to add 2 new samples (1 to each class). However, it did fail once with these additional samples as well, so I assume this is just a coincidence.

[2020-04-05 13:06:53] train-model.INFO: Epoch 7 score=0.52481517908426 loss=0.042373131883474
[2020-04-05 13:06:53] train-model.INFO: Parameters restored from snapshot at epoch 4.
[2020-04-05 13:06:53] train-model.INFO: Training complete


DivineOmega commented on May 22, 2024

The change of dataset can definitely be ruled out, as training just completed on the original dataset after epoch 13.

[2020-04-05 14:05:29] train-model.INFO: Epoch 13 score=0.50379156418009 loss=0.028950132464314
[2020-04-05 14:05:29] train-model.INFO: Parameters restored from snapshot at epoch 3.
[2020-04-05 14:05:29] train-model.INFO: Training complete


andrewdalpino commented on May 22, 2024

The number of training rounds that the algorithm executes should not matter; rather, I was looking at whether the loss was steadily decreasing and the MCC was steadily increasing over time. In the log below, we see a downward jump in the MCC at epoch 2; however, this may just be the algorithm escaping a local minimum, so no problems are evident from the logs.

@andrewdalpino I've just re-ran the training with exact same dataset, and learner/metric configuration. In this case it fails after epoch 3.

$ php artisan ml:train
[2020-04-05 10:45:40] train-model.INFO: Fitted WordCountVectorizer
[2020-04-05 10:45:45] train-model.INFO: Fitted TfIdfTransformer
[2020-04-05 10:45:49] train-model.INFO: Fitted ZScaleStandardizer
[2020-04-05 10:45:51] train-model.INFO: Learner init hidden_layers=[0=Dense 1=PReLU 2=Dense 3=PReLU 4=Dense 5=PReLU 6=Dense 7=PReLU 8=Dense 9=PReLU] batch_size=100 optimizer=Adam alpha=0.0001 epochs=1000 min_change=0.0001 window=10 hold_out=0.1 cost_fn=CrossEntropy metric=MCC
[2020-04-05 10:45:51] train-model.INFO: Training started
[2020-04-05 10:49:51] train-model.INFO: Epoch 1 score=0.56864319578525 loss=0.28955885998412
[2020-04-05 10:53:56] train-model.INFO: Epoch 2 score=0.4425972854422 loss=0.15849405919593
[2020-04-05 10:58:04] train-model.INFO: Epoch 3 score=0.48331552310563 loss=0.13056189684066

   ErrorException 

  Undefined offset: 0

  at vendor/rubix/ml/src/CrossValidation/Metrics/MCC.php:107
    103|                         ++$trueNeg[$class];
    104|                     }
    105|                 }
    106|             } else {
  > 107|                 ++$falsePos[$prediction];
    108|                 ++$falseNeg[$label];
    109|             }
    110|         }
    111| 

      +5 vendor frames 
  6   app/Console/Commands/TrainModel.php:89
      Rubix\ML\PersistentModel::train()

      +14 vendor frames 
  21  artisan:37
      Illuminate\Foundation\Console\Kernel::handle()

I'm going to need to do some more digging to find out what's really going on

Thanks for the extra info @DivineOmega it's very helpful

After the latest trials, roughly what is the training success rate?

One thing you can try in the meantime is decreasing the learning rate of the Adam optimizer. You can also try using the non-adaptive Stochastic optimizer to rule out issues due to momentum.
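
For reference, the swap looks roughly like this (the first constructor argument is the learning rate; treat the exact values as starting points):

use Rubix\ML\NeuralNet\Optimizers\Adam;
use Rubix\ML\NeuralNet\Optimizers\Stochastic;

// Lower Adam's learning rate by 10x (e.g. 0.001 -> 0.0001) to dampen the
// parameter updates that are blowing up the activations.
$optimizer = new Adam(0.0001);

// Or swap in plain, non-adaptive SGD to rule out momentum-related issues.
$optimizer = new Stochastic(0.001);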

Also, feel free to join our chat https://t.me/RubixML


DivineOmega commented on May 22, 2024

@andrewdalpino Today and yesterday, on datasets of that size and above, I've seen 16 failures of this type and only maybe 2 or 3 successes (so around 11 - 16% success rate).


DivineOmega commented on May 22, 2024

@andrewdalpino Interestingly, after lowering the Adam optimiser learning rate by 10x, the training completed the first time without issue.

$ php artisan ml:train
[2020-04-05 22:27:14] train-model.INFO: Fitted WordCountVectorizer
[2020-04-05 22:27:18] train-model.INFO: Fitted TfIdfTransformer
[2020-04-05 22:27:22] train-model.INFO: Fitted ZScaleStandardizer
[2020-04-05 22:27:24] train-model.INFO: Learner init hidden_layers=[0=Dense 1=PReLU 2=Dense 3=PReLU 4=Dense 5=PReLU 6=Dense 7=PReLU 8=Dense 9=PReLU] batch_size=100 optimizer=Adam alpha=0.0001 epochs=1000 min_change=0.0001 window=10 hold_out=0.1 cost_fn=CrossEntropy metric=FBeta
[2020-04-05 22:27:24] train-model.INFO: Training started
[2020-04-05 22:31:49] train-model.INFO: Epoch 1 score=0.66972334473112 loss=0.3231510276241
[2020-04-05 22:36:49] train-model.INFO: Epoch 2 score=0.72543492395247 loss=0.20104000225191
[2020-04-05 22:41:36] train-model.INFO: Epoch 3 score=0.74259303923404 loss=0.11534214676067
[2020-04-05 22:46:22] train-model.INFO: Epoch 4 score=0.75292705742216 loss=0.08167593375029
[2020-04-05 22:50:53] train-model.INFO: Epoch 5 score=0.78349036224602 loss=0.062994060070852
[2020-04-05 22:55:24] train-model.INFO: Epoch 6 score=0.78586685471343 loss=0.053025888192447
[2020-04-05 22:59:44] train-model.INFO: Epoch 7 score=0.7640015718736 loss=0.049605014056035
[2020-04-05 23:04:15] train-model.INFO: Epoch 8 score=0.75258053059604 loss=0.043536833530061
[2020-04-05 23:08:41] train-model.INFO: Epoch 9 score=0.76595700309472 loss=0.040312908446744
[2020-04-05 23:12:55] train-model.INFO: Epoch 10 score=0.76807362257247 loss=0.037231249873399
[2020-04-05 23:17:15] train-model.INFO: Epoch 11 score=0.75719922146547 loss=0.034125440250398
[2020-04-05 23:21:37] train-model.INFO: Epoch 12 score=0.75719922146547 loss=0.035133840655248
[2020-04-05 23:25:57] train-model.INFO: Epoch 13 score=0.76033834586466 loss=0.03846565249604
[2020-04-05 23:30:19] train-model.INFO: Epoch 14 score=0.77751091875429 loss=0.032385857444644
[2020-04-05 23:34:40] train-model.INFO: Epoch 15 score=0.7703448072442 loss=0.032105124285677
[2020-04-05 23:39:04] train-model.INFO: Epoch 16 score=0.7703448072442 loss=0.032299122346931
[2020-04-05 23:39:04] train-model.INFO: Parameters restored from snapshot at epoch 6.
[2020-04-05 23:39:04] train-model.INFO: Training complete


DivineOmega commented on May 22, 2024

@andrewdalpino Since lowering the Adam optimiser learning rate by 10x, the training has yet to fail once, over around 5 or 6 training sessions.

This works around the problem for now, but does not solve the root cause. It might help to narrow it down, though?


andrewdalpino commented on May 22, 2024

Thanks again for the great bug report @DivineOmega

You can test out the fix on the latest dev-master or you can wait until the next release
