
probsys / autogp.jl


Automated Bayesian model discovery for time series data

Home Page: https://probsys.github.io/AutoGP.jl/

License: Apache License 2.0

Julia 100.00%
bayesian-inference forecasting gaussian-process machine-learning structure-learning time-series icml

autogp.jl's Introduction

AutoGP.jl


This package contains the Julia reference implementation of AutoGP, a method for automatically discovering Gaussian process models of univariate time series data, as described in

Sequential Monte Carlo Learning for Time Series Structure Discovery.
Saad, F. A.; Patton, B. J.; Hoffmann, M. D.; Saurous, R. A.; Mansinghka, V. K.
ICML 2023: Proceedings of the 40th International Conference on Machine Learning.
Proceedings of Machine Learning Research vol. 202, pages 29473-29489, 2023.

Whereas traditional Gaussian process software packages focus on inferring the numeric parameters of a fixed (user-specified) covariance kernel function, AutoGP learns both the covariance kernel functions and their numeric parameters from a given dataset. The documentation shows two examples of online time series structure discovery using AutoGP, which discovers periodic components, trends, and smoothly varying temporal components.
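For orientation, here is a minimal fitting sketch assembled only from the API calls that appear in the issue reports further down this page (GPModel, seed!, fit_smc!, Schedule.linear_schedule, particle_weights, covariance_kernels); consult the documentation for authoritative usage:

import AutoGP

# Synthetic data: a noisy periodic signal on an evenly spaced grid.
xs = Vector(1.0:100.0)
ys = sin.(2pi .* xs ./ 12) .+ 0.1 .* randn(100)

AutoGP.seed!(6)                                   # fix the RNG for reproducibility
model = AutoGP.GPModel(xs, ys; n_particles=16)    # ensemble of SMC particles
AutoGP.fit_smc!(model;
    schedule=AutoGP.Schedule.linear_schedule(length(xs), 0.10),
    n_mcmc=75, n_hmc=10)

# Each particle holds a discovered kernel structure with a posterior weight.
for (k, w) in zip(AutoGP.covariance_kernels(model), AutoGP.particle_weights(model))
    println("weight = $(w)")
    Base.display(k)
end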

Installing

AutoGP can be installed using the Julia package manager. From the Julia REPL (version 1.8+), type ] to enter the Pkg REPL mode and run

pkg> add AutoGP

Alternatively, use the terminal command julia -e 'import Pkg; Pkg.add("AutoGP")'.

Tutorials

Please see https://probsys.github.io/AutoGP.jl

Developer Notes

Building Documentation

$ julia --project=. docs/make.jl
$ python3 -m http.server --directory docs/build/ --bind localhost 9090

Building From Clone

  1. Obtain Julia 1.8 or later.
  2. Clone this repository.
  3. Set environment variable: export JULIA_PROJECT=/path/to/AutoGP.jl
  4. Instantiate dependencies: julia -e 'using Pkg; Pkg.instantiate()'
  5. Build PyCall: PYTHON= julia -e 'using Pkg; Pkg.build("PyCall")'
  6. Verify import works: julia -e 'import AutoGP; import PyPlot; println("success!")'

Citation

@inproceedings{saad2023icml,
title        = {Sequential {Monte} {Carlo} Learning for Time Series Structure Discovery},
author       = {Saad, Feras A. and Patton, Brian J. and Hoffmann, Matthew D. and Saurous, Rif A. and Mansinghka, V. K.},
booktitle    = {Proceedings of the 40th International Conference on Machine Learning},
series       = {Proceedings of Machine Learning Research},
volume       = {202},
pages        = {29473--29489},
year         = {2023},
publisher    = {PMLR},
}

autogp.jl's People

Contributors

fsaad


autogp.jl's Issues

Model for interpolation?

@fsaad, great project! Could this model be used for interpolation, and if so, how? (A sketch follows after the question below.)

  • Could this also be extended to the multivariate case?
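Since each particle's GP posterior supports prediction at arbitrary time points, interpolation should amount to querying the fitted model at inputs inside the observed range. A sketch under that assumption follows; AutoGP.predict and its quantiles keyword are taken from the online documentation rather than from this page, so double-check the exact signature there:

import AutoGP

xs, ys = Vector(1.0:100.0), rand(100)
model = AutoGP.GPModel(xs, ys; n_particles=8)
AutoGP.fit_smc!(model;
    schedule=AutoGP.Schedule.linear_schedule(length(xs), 0.10),
    n_mcmc=75, n_hmc=10)

# Interpolation: query points strictly between observed inputs.
# NB: AutoGP.predict with a quantiles keyword is assumed from the docs.
xq = Vector(10.5:1.0:20.5)
forecasts = AutoGP.predict(model, xq; quantiles=[0.025, 0.975])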

Lighter dependencies?

Is there a reason this package needs such a heavy dependency list? It pulls in CSV, DataFrames, IJulia, and PyPlot, none of which seem necessary if a user wants to, say, read data from an Arrow file, store it in a NamedTuple, and plot it with UnicodePlots in the REPL. A sketch of that workflow is below.
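For concreteness, a minimal sketch of that lightweight workflow (assuming Arrow.jl and UnicodePlots.jl are installed; the file name data.arrow and columns x, y are hypothetical):

import AutoGP
using Arrow, UnicodePlots

# Load two columns from an Arrow file into a plain NamedTuple (no DataFrames).
tbl = Arrow.Table("data.arrow")
data = (x = collect(tbl.x), y = collect(tbl.y))

model = AutoGP.GPModel(data.x, data.y; n_particles=8)
AutoGP.fit_smc!(model;
    schedule=AutoGP.Schedule.linear_schedule(length(data.x), 0.10),
    n_mcmc=75, n_hmc=10)

# Plot in the REPL with UnicodePlots instead of PyPlot.
display(lineplot(data.x, data.y; title="observed data"))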

Simple examples

Cool project!

I'm kicking the tires and thought I'd collect some notes here. Apologies in advance if this is a misuse of the repo issues.

I'm using this small test function to look at the posterior distributions for some simple synthetic datasets:

import AutoGP

function top_models(xs, ys; n=3, verbose=false)
    AutoGP.seed!(6)  # seed before building the model for reproducibility
    model = AutoGP.GPModel(xs, ys; n_particles=16)
    AutoGP.fit_smc!(model;
        schedule=AutoGP.Schedule.linear_schedule(length(xs), 0.10),
        n_mcmc=75, n_hmc=10, verbose=verbose)
    weights = AutoGP.particle_weights(model)
    kernels = AutoGP.covariance_kernels(model)
    logml = AutoGP.log_marginal_likelihood_estimate(model)
    println("Log-ML: $(logml)\n")
    # zip returns an iterator, so collect it before sorting by weight.
    best = sort(collect(zip(kernels, weights)); by=x->x[2], rev=true)
    for (i, (k, w)) in enumerate(best[1:n])
        println("Model $(i), Weight $(w)")
        Base.display(k)
    end
end

y=x

A line without noise is inferred pretty quickly. The log-ML estimate should have domain (-∞, 0] though, right?

(I haven't learned to read the parameters in the kernels yet. I'm assuming they are somehow standardized and make sense for now.)
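One note on the domain question above, which is a general fact about continuous models rather than anything specific to AutoGP: the marginal likelihood of real-valued data is a probability density, not a probability, so its logarithm is not restricted to (-∞, 0]:

\log p(y_{1:n}) = \log \int p(y_{1:n} \mid \theta) \, p(\theta) \, d\theta \in (-\infty, \infty)

A noiseless line is exactly the kind of data on which the density concentrates far above 1, so the large positive Log-ML below is consistent rather than a bug.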

julia> @time top_models(Vector(1:100), Vector(1:100))
Log-ML: 282.2694850370343

Model 1, Weight 0.14821784601368354
LIN(0.37; 0.03, 0.58)

Model 2, Weight 0.08312684024116766
LIN(0.23; 0.20, 0.53)

Model 3, Weight 0.08073108954280904
LIN(0.38; 0.18, 1.48)

 56.960039 seconds (659.07 M allocations: 132.846 GiB, 24.89% gc time)

y=x^2, y=x^3

Polynomials likewise.

Should it be a surprise to see models with extra sum terms so heavily weighted at n=100 and without noise? Maybe not: I need to read about the default priors on model complexity.

julia> @time top_models(Vector(1:100), Vector(1:100).^2)
Log-ML: 271.9224429110566

Model 1, Weight 0.10509916636060361
×
├── +
│   ├── LIN(0.66; 0.38, 0.31)
│   └── LIN(0.16; 0.05, 0.24)
└── LIN(0.12; 0.21, 1.24)

Model 2, Weight 0.10355590097846587
×
├── LIN(0.46; 0.08, 0.13)
└── LIN(0.03; 0.33, 1.80)

Model 3, Weight 0.09826188261078113
×
├── LIN(0.09; 0.77, 0.61)
└── LIN(0.25; 0.11, 1.06)

160.866469 seconds (1.76 G allocations: 425.998 GiB, 26.03% gc time)
julia> @time top_models(Vector(1:100), Vector(1:100).^3)
Log-ML: 214.82244718394827

Model 1, Weight 0.20730654884622823
×
├── LIN(0.04; 0.26, 1.07)
└── ×
    ├── LIN(0.17; 1.12, 0.48)
    └── LIN(0.13; 0.14, 1.36)

Model 2, Weight 0.12859851725732885
×
├── LIN(0.10; 0.39, 0.54)
└── +
    ├── LIN(1.28; 0.38, 0.07)
    └── ×
        ├── LIN(0.17; 0.59, 0.28)
        └── ×
            ├── LIN(0.02; 0.62, 0.85)
            └── LIN(0.22; 0.07, 0.79)

Model 3, Weight 0.11513375122907542
×
├── LIN(0.20; 0.19, 0.35)
└── ×
    ├── LIN(0.19; 0.25, 0.66)
    └── LIN(0.06; 0.62, 0.53)

140.634843 seconds (1.42 G allocations: 383.902 GiB, 26.66% gc time, 0.05% compilation time)

Uniform random noise

Inference on uniform random noise takes longer, and ultimately there doesn't seem to be any suitable model on the GP menu. It would be nice to have a diagnostic for flagging this posterior as a poor fit; a crude baseline comparison is sketched after the transcript below.

julia> @time top_models(Vector(1:100), rand(100))
Log-ML: -12.138325663870692

Model 1, Weight 0.20464732968472485
×
├── GE(1.09, 0.92; 0.23)
└── ×
    ├── ×
    │   ├── PER(0.43, 0.16; 0.10)
    │   └── GE(0.05, 0.57; 0.13)
    └── PER(0.08, 0.11; 0.09)

Model 2, Weight 0.15086263315639292
PER(0.02, 0.51; 0.05)

Model 3, Weight 0.1454059350595094
PER(0.23, 0.90; 0.01)

746.373053 seconds (4.08 G allocations: 1.519 TiB, 18.19% gc time, 0.00% compilation time)
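One crude lack-of-fit heuristic (my own sketch, not an AutoGP facility): compare the SMC log-ML estimate against the log likelihood of a trivial iid Gaussian fit. If the structured GP barely beats that no-structure baseline, as the Log-ML of about -12.1 above suggests it might here, the discovered structure is probably spurious.

using Distributions, Statistics

# Log likelihood of an iid Normal fit to ys, as a no-structure baseline.
baseline(ys) = sum(logpdf.(Normal(mean(ys), std(ys)), ys))

# Compare AutoGP.log_marginal_likelihood_estimate(model) - baseline(ys);
# values near or below zero suggest the GP found no usable structure.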

y=k

Constant y values seem to be an awkward edge case that propagates NaNs through the particle weights (a guard sketch follows the transcript below):

julia> @time top_models(Vector(1:100), ones(100))
Log-ML: 0.0

Model 1, Weight NaN
GE(0.05, 0.39; 0.97)

Model 2, Weight NaN
LIN(0.13; 0.78, 0.25)

Model 3, Weight NaN
GE(0.08, 1.18; 0.18)

  3.341694 seconds (15.98 M allocations: 6.689 GiB, 14.88% gc time)
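A minimal pre-check sketch for this edge case (a hypothetical user-side workaround, not an AutoGP feature): detect zero-variance targets before fitting.

using Statistics

# Constant targets have zero sample variance, which appears to yield the
# NaN particle weights above; bail out (or add jitter) in that case.
function checked_top_models(xs, ys; kwargs...)
    iszero(std(ys)) && error("ys is constant; perturb it or skip structure search")
    return top_models(xs, ys; kwargs...)
end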

Big data

Finally, the default data annealing sampler seems to choke on larger datasets, somewhere in the thousands:

julia> @time top_models(Vector(1:10000), rand(10000), verbose=true)
Running SMC round 1000/10000
Particle Weights: [3.75e-224, 4.75e-205, 1.09e-109, 0.00e+00, 1.00e+00, 1.25e-88, 0.00e+00, 3.31e-320, 0.00e+00, 7.74e-21, 3.03e-47, 0.00e+00, 3.78e-91, 1.33e-91, 3.15e-94, 2.45e-29]
Particle ESS: 0.0625
resampled true
^C

I wonder how to approach the case where compute is scarcer than data. Maybe it is a textbook case for Bayesian optimization, with a kriging-style strategy choosing which data points to include in each batch so as to maximize the information gained from one SMC step to the next? (A crude subsampling baseline is sketched below.)
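In the meantime, a crude baseline sketch (uniform subsampling, an assumption on my part rather than anything AutoGP provides):

using Random

# Fit on a uniform random subsample so each SMC data-annealing round
# sees far fewer points; sort the indices to preserve temporal order.
n, m = 10_000, 500
xs, ys = Vector(1:n), rand(n)
idx = sort(randperm(n)[1:m])
@time top_models(xs[idx], ys[idx])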

(Thanks for indulging my thinking aloud here.)
