
probsys / autogp.jl


Automated Bayesian model discovery for time series data

Home Page: https://probsys.github.io/AutoGP.jl/

License: Apache License 2.0

Julia 100.00%
bayesian-inference forecasting gaussian-process machine-learning structure-learning time-series icml

autogp.jl's Introduction

AutoGP.jl


This package contains the Julia reference implementation of AutoGP, a method for automatically discovering Gaussian process models of univariate time series data, as described in

Sequential Monte Carlo Learning for Time Series Structure Discovery.
Saad, F. A.; Patton, B. J.; Hoffmann, M. D.; Saurous, R. A.; Mansinghka, V. K.
ICML 2023: Proceedings of the 40th International Conference on Machine Learning.
Proceedings of Machine Learning Research vol. 202, pages 29473-29489, 2023.

Whereas traditional Gaussian process software packages focus on inferring the numeric parameters of a fixed (user-specified) covariance kernel function, AutoGP learns both the covariance kernel functions and their numeric parameters from a given dataset. The documentation shows two examples of online time series structure discovery using AutoGP, which discovers periodic components, trends, and smoothly varying temporal components.
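For orientation, here is a minimal fitting sketch assembled only from the API calls that appear in the issue reports further down this page (GPModel, seed!, fit_smc!, Schedule.linear_schedule, particle_weights, covariance_kernels); consult the documentation for authoritative usage:

import AutoGP

# Synthetic data: a noisy periodic signal on an evenly spaced grid.
xs = Vector(1.0:100.0)
ys = sin.(2pi .* xs ./ 12) .+ 0.1 .* randn(100)

AutoGP.seed!(6)                                   # fix the RNG for reproducibility
model = AutoGP.GPModel(xs, ys; n_particles=16)    # ensemble of SMC particles
AutoGP.fit_smc!(model;
    schedule=AutoGP.Schedule.linear_schedule(length(xs), 0.10),
    n_mcmc=75, n_hmc=10)

# Each particle holds a discovered kernel structure with a posterior weight.
for (k, w) in zip(AutoGP.covariance_kernels(model), AutoGP.particle_weights(model))
    println("weight = $(w)")
    Base.display(k)
end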

Installing

AutoGP can be installed using the Julia package manager. From the Julia REPL (version 1.8+), type ] to enter the Pkg REPL mode and run

pkg> add AutoGP

Alternatively, use the terminal command julia -e 'import Pkg; Pkg.add("AutoGP")'.

Tutorials

Please see https://probsys.github.io/AutoGP.jl

Developer Notes

Building Documentation

$ julia --project=. docs/make.jl
$ python3 -m http.server --directory docs/build/ --bind localhost 9090

Building From Clone

  1. Obtain Julia 1.8 or later.
  2. Clone this repository.
  3. Set environment variable: export JULIA_PROJECT=/path/to/AutoGP.jl
  4. Instantiate dependencies: julia -e 'using Pkg; Pkg.instantiate()'
  5. Build PyCall: PYTHON= julia -e 'using Pkg; Pkg.build("PyCall")'
  6. Verify import works: julia -e 'import AutoGP; import PyPlot; println("success!")'

Citation

@inproceedings{saad2023icml,
title        = {Sequential {Monte} {Carlo} Learning for Time Series Structure Discovery},
author       = {Saad, Feras A. and Patton, Brian J. and Hoffmann, Matthew D. and Saurous, Rif A. and Mansinghka, V. K.},
booktitle    = {Proceedings of the 40th International Conference on Machine Learning},
series       = {Proceedings of Machine Learning Research},
volume       = {202},
pages        = {29473--29489},
year         = {2023},
publisher    = {PMLR},
}

autogp.jl's People

Contributors

fsaad


autogp.jl's Issues

Model for interpolation?

@fsaad, great project! Could this model be used for interpolation, and if so, how? (A sketch follows after the question below.)

  • Could this also be extended to the multivariate case?
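Since each particle's GP posterior supports prediction at arbitrary time points, interpolation should amount to querying the fitted model at inputs inside the observed range. A sketch under that assumption follows; AutoGP.predict and its quantiles keyword are taken from the online documentation rather than from this page, so double-check the exact signature there:

import AutoGP

xs, ys = Vector(1.0:100.0), rand(100)
model = AutoGP.GPModel(xs, ys; n_particles=8)
AutoGP.fit_smc!(model;
    schedule=AutoGP.Schedule.linear_schedule(length(xs), 0.10),
    n_mcmc=75, n_hmc=10)

# Interpolation: query points strictly between observed inputs.
# NB: AutoGP.predict with a quantiles keyword is assumed from the docs.
xq = Vector(10.5:1.0:20.5)
forecasts = AutoGP.predict(model, xq; quantiles=[0.025, 0.975])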

Lighter dependencies?

Is there a reason this package needs such a heavy dependency list? It pulls in CSV, DataFrames, IJulia, and PyPlot, none of which seem necessary if a user wants to, say, read data from an Arrow file, store it in a NamedTuple, and plot it with UnicodePlots in the REPL. A sketch of that workflow is below.
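For concreteness, a minimal sketch of that lightweight workflow (assuming Arrow.jl and UnicodePlots.jl are installed; the file name data.arrow and columns x, y are hypothetical):

import AutoGP
using Arrow, UnicodePlots

# Load two columns from an Arrow file into a plain NamedTuple (no DataFrames).
tbl = Arrow.Table("data.arrow")
data = (x = collect(tbl.x), y = collect(tbl.y))

model = AutoGP.GPModel(data.x, data.y; n_particles=8)
AutoGP.fit_smc!(model;
    schedule=AutoGP.Schedule.linear_schedule(length(data.x), 0.10),
    n_mcmc=75, n_hmc=10)

# Plot in the REPL with UnicodePlots instead of PyPlot.
display(lineplot(data.x, data.y; title="observed data"))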

Simple examples

Cool project!

I'm kicking the tires and thought I'd collect some notes here. Apologies in advance if this is a misuse of the repo issues.

I'm using this small test function to look at the posterior distributions for some simple synthetic datasets:

import AutoGP

function top_models(xs, ys; n=3, verbose=false)
    AutoGP.seed!(6)  # seed before building the model for reproducibility
    model = AutoGP.GPModel(xs, ys; n_particles=16)
    AutoGP.fit_smc!(model;
        schedule=AutoGP.Schedule.linear_schedule(length(xs), 0.10),
        n_mcmc=75, n_hmc=10, verbose=verbose)
    weights = AutoGP.particle_weights(model)
    kernels = AutoGP.covariance_kernels(model)
    logml = AutoGP.log_marginal_likelihood_estimate(model)
    println("Log-ML: $(logml)\n")
    # zip returns an iterator, so collect it before sorting by weight.
    best = sort(collect(zip(kernels, weights)); by=x->x[2], rev=true)
    for (i, (k, w)) in enumerate(best[1:n])
        println("Model $(i), Weight $(w)")
        Base.display(k)
    end
end

y=x

A line without noise is inferred pretty quickly. The log-ML estimate should have domain (-∞, 0] though, right?

(I haven't learned to read the parameters in the kernels yet. I'm assuming they are somehow standardized and make sense for now.)
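One note on the domain question above, which is a general fact about continuous models rather than anything specific to AutoGP: the marginal likelihood of real-valued data is a probability density, not a probability, so its logarithm is not restricted to (-∞, 0]:

\log p(y_{1:n}) = \log \int p(y_{1:n} \mid \theta) \, p(\theta) \, d\theta \in (-\infty, \infty)

A noiseless line is exactly the kind of data on which the density concentrates far above 1, so the large positive Log-ML below is consistent rather than a bug.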

julia> @time top_models(Vector(1:100), Vector(1:100))
Log-ML: 282.2694850370343

Model 1, Weight 0.14821784601368354
LIN(0.37; 0.03, 0.58)

Model 2, Weight 0.08312684024116766
LIN(0.23; 0.20, 0.53)

Model 3, Weight 0.08073108954280904
LIN(0.38; 0.18, 1.48)

 56.960039 seconds (659.07 M allocations: 132.846 GiB, 24.89% gc time)

y=x^2, y=x^3

Polynomials likewise.

Should it be a surprise to see models with extra sum terms so heavily weighted at n=100 and without noise? Maybe not: I need to read about the default priors on model complexity.

julia> @time top_models(Vector(1:100), Vector(1:100).^2)
Log-ML: 271.9224429110566

Model 1, Weight 0.10509916636060361
×
├── +
│   ├── LIN(0.66; 0.38, 0.31)
│   └── LIN(0.16; 0.05, 0.24)
└── LIN(0.12; 0.21, 1.24)

Model 2, Weight 0.10355590097846587
×
├── LIN(0.46; 0.08, 0.13)
└── LIN(0.03; 0.33, 1.80)

Model 3, Weight 0.09826188261078113
×
├── LIN(0.09; 0.77, 0.61)
└── LIN(0.25; 0.11, 1.06)

160.866469 seconds (1.76 G allocations: 425.998 GiB, 26.03% gc time)
julia> @time top_models(Vector(1:100), Vector(1:100).^3)
Log-ML: 214.82244718394827

Model 1, Weight 0.20730654884622823
×
├── LIN(0.04; 0.26, 1.07)
└── ×
    ├── LIN(0.17; 1.12, 0.48)
    └── LIN(0.13; 0.14, 1.36)

Model 2, Weight 0.12859851725732885
×
├── LIN(0.10; 0.39, 0.54)
└── +
    ├── LIN(1.28; 0.38, 0.07)
    └── ×
        ├── LIN(0.17; 0.59, 0.28)
        └── ×
            ├── LIN(0.02; 0.62, 0.85)
            └── LIN(0.22; 0.07, 0.79)

Model 3, Weight 0.11513375122907542
×
├── LIN(0.20; 0.19, 0.35)
└── ×
    ├── LIN(0.19; 0.25, 0.66)
    └── LIN(0.06; 0.62, 0.53)

140.634843 seconds (1.42 G allocations: 383.902 GiB, 26.66% gc time, 0.05% compilation time)

Uniform random noise

Inference on uniform random noise takes longer, and ultimately there doesn't seem to be any suitable model on the GP menu. It would be nice to have a diagnostic for flagging this posterior as a poor fit; a crude baseline comparison is sketched after the transcript below.

julia> @time top_models(Vector(1:100), rand(100))
Log-ML: -12.138325663870692

Model 1, Weight 0.20464732968472485
×
├── GE(1.09, 0.92; 0.23)
└── ×
    ├── ×
    │   ├── PER(0.43, 0.16; 0.10)
    │   └── GE(0.05, 0.57; 0.13)
    └── PER(0.08, 0.11; 0.09)

Model 2, Weight 0.15086263315639292
PER(0.02, 0.51; 0.05)

Model 3, Weight 0.1454059350595094
PER(0.23, 0.90; 0.01)

746.373053 seconds (4.08 G allocations: 1.519 TiB, 18.19% gc time, 0.00% compilation time)
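One crude lack-of-fit heuristic (my own sketch, not an AutoGP facility): compare the SMC log-ML estimate against the log likelihood of a trivial iid Gaussian fit. If the structured GP barely beats that no-structure baseline, as the Log-ML of about -12.1 above suggests it might here, the discovered structure is probably spurious.

using Distributions, Statistics

# Log likelihood of an iid Normal fit to ys, as a no-structure baseline.
baseline(ys) = sum(logpdf.(Normal(mean(ys), std(ys)), ys))

# Compare AutoGP.log_marginal_likelihood_estimate(model) - baseline(ys);
# values near or below zero suggest the GP found no usable structure.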

y=k

Constant y values seem to be an awkward edge case that propagates NaNs through the particle weights (a guard sketch follows the transcript below):

julia> @time top_models(Vector(1:100), ones(100))
Log-ML: 0.0

Model 1, Weight NaN
GE(0.05, 0.39; 0.97)

Model 2, Weight NaN
LIN(0.13; 0.78, 0.25)

Model 3, Weight NaN
GE(0.08, 1.18; 0.18)

  3.341694 seconds (15.98 M allocations: 6.689 GiB, 14.88% gc time)
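A minimal pre-check sketch for this edge case (a hypothetical user-side workaround, not an AutoGP feature): detect zero-variance targets before fitting.

using Statistics

# Constant targets have zero sample variance, which appears to yield the
# NaN particle weights above; bail out (or add jitter) in that case.
function checked_top_models(xs, ys; kwargs...)
    iszero(std(ys)) && error("ys is constant; perturb it or skip structure search")
    return top_models(xs, ys; kwargs...)
end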

Big data

Finally, the default data annealing sampler seems to choke on larger datasets, somewhere in the thousands:

julia> @time top_models(Vector(1:10000), rand(10000), verbose=true)
Running SMC round 1000/10000
Particle Weights: [3.75e-224, 4.75e-205, 1.09e-109, 0.00e+00, 1.00e+00, 1.25e-88, 0.00e+00, 3.31e-320, 0.00e+00, 7.74e-21, 3.03e-47, 0.00e+00, 3.78e-91, 1.33e-91, 3.15e-94, 2.45e-29]
Particle ESS: 0.0625
resampled true
^C

I wonder how to approach the case where compute is scarcer than data. Maybe it is a textbook case for Bayesian optimization, with a kriging-style strategy choosing which data points to include in each batch so as to maximize the information gained from one SMC step to the next? (A crude subsampling baseline is sketched below.)
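In the meantime, a crude baseline sketch (uniform subsampling, an assumption on my part rather than anything AutoGP provides):

using Random

# Fit on a uniform random subsample so each SMC data-annealing round
# sees far fewer points; sort the indices to preserve temporal order.
n, m = 10_000, 500
xs, ys = Vector(1:n), rand(n)
idx = sort(randperm(n)[1:m])
@time top_models(xs[idx], ys[idx])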

(Thanks for indulging my thinking aloud here.)
