
PyAutoFit: Classy Probabilistic Programming

Home Page: https://pyautofit.readthedocs.io/

License: MIT License

probabilistic-programming statistics bayesian-inference bayesian-methods model astronomy statistical-analysis stats mcmc graphical-models

pyautofit's Introduction

PyAutoFit: Classy Probabilistic Programming


Installation Guide | readthedocs | Introduction on Binder | HowToFit

PyAutoFit is a Python-based probabilistic programming language for model fitting and Bayesian inference of large datasets.

The basic PyAutoFit API allows a user to quickly compose a probabilistic model and fit it to data via a log likelihood function, using a range of non-linear search algorithms (e.g. MCMC, nested sampling).

Users can then set up a PyAutoFit scientific workflow, which enables streamlined modeling of small datasets with tools to scale up to large datasets.

PyAutoFit supports advanced statistical methods, most notably a big data framework for Bayesian hierarchical analysis.

Getting Started

The following links are useful for new starters:

Support

Support for installation issues, help with model fitting and using PyAutoFit is available by raising an issue on the GitHub issues page.

We also offer support on the PyAutoFit Slack channel, where we provide the latest updates on PyAutoFit. Slack is invitation-only, so if you'd like to join, send an email requesting an invite.

HowToFit

Users less familiar with Bayesian inference and scientific analysis may wish to read through the HowToFit lectures. These teach the basic principles of Bayesian inference, with content pitched at undergraduate level and above.

A complete overview of the lectures is provided on the HowToFit readthedocs page.

API Overview

To illustrate the PyAutoFit API, we use a toy model: fitting a one-dimensional Gaussian to noisy 1D data. Here's the data (black) and the model (red) we'll fit:

[image: the 1D data (black) and the model fit (red)]
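The data and noise_map used in the code below can be any 1D numpy arrays. As a minimal sketch (the numbers here are illustrative assumptions, not taken from the PyAutoFit docs), a noisy Gaussian dataset could be generated as follows:

import numpy as np

# Illustrative synthetic dataset: a 1D Gaussian signal plus Gaussian noise.
# All values here are assumptions chosen for this example.
xvalues = np.arange(100)
signal = 25.0 * np.exp(-0.5 * ((xvalues - 50.0) / 10.0) ** 2.0)
noise_map = np.full(shape=signal.shape, fill_value=2.0)  # RMS noise per pixel
data = signal + np.random.normal(scale=2.0, size=signal.shape)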

We define our model, a 1D Gaussian, by writing a Python class using the format below:

class Gaussian:

    def __init__(
        self,
        centre=0.0,        # <- PyAutoFit recognises these
        normalization=0.1, # <- constructor arguments are
        sigma=0.01,        # <- the Gaussian's parameters.
    ):
        self.centre = centre
        self.normalization = normalization
        self.sigma = sigma

    """
    An instance of the Gaussian class will be available during model fitting.

    This method will be used to fit the model to data and compute a likelihood.
    """

    def model_data_1d_via_xvalues_from(self, xvalues):

        transformed_xvalues = xvalues - self.centre

        return (self.normalization / (self.sigma * (2.0 * np.pi) ** 0.5)) * \
                np.exp(-0.5 * (transformed_xvalues / self.sigma) ** 2.0)

PyAutoFit recognises that this Gaussian may be treated as a model component whose parameters can be fitted via a non-linear search like emcee.

To fit this Gaussian to the data we create an Analysis object, which gives PyAutoFit the data and a log_likelihood_function describing how to fit the data with the model:

import autofit as af

class Analysis(af.Analysis):

    def __init__(self, data, noise_map):

        super().__init__()

        self.data = data
        self.noise_map = noise_map

    def log_likelihood_function(self, instance):

        """
        The 'instance' that comes into this method is an instance of the Gaussian class
        above, with the parameters set to values chosen by the non-linear search.
        """

        print("Gaussian Instance:")
        print("Centre = ", instance.centre)
        print("normalization = ", instance.normalization)
        print("Sigma = ", instance.sigma)

        """
        We fit the ``data`` with the Gaussian instance, using its
        "model_data_1d_via_xvalues_from" function to create the model data.
        """

        xvalues = np.arange(self.data.shape[0])

        model_data = instance.model_data_1d_via_xvalues_from(xvalues=xvalues)
        residual_map = self.data - model_data
        chi_squared_map = (residual_map / self.noise_map) ** 2.0
        log_likelihood = -0.5 * np.sum(chi_squared_map)

        return log_likelihood

We can now fit our model to the data using a non-linear search:

model = af.Model(Gaussian)

analysis = Analysis(data=data, noise_map=noise_map)

emcee = af.Emcee(nwalkers=50, nsteps=2000)

result = emcee.fit(model=model, analysis=analysis)

The result contains information on the model-fit, for example the parameter samples, maximum log likelihood model and marginalized probability density functions.
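For example, something like the following inspects the maximum log likelihood model (a sketch: the exact accessor names vary between PyAutoFit versions, so treat them as illustrative):

samples = result.samples

# The maximum log likelihood instance is a Gaussian object like the one
# defined above (accessor name is version-dependent).
instance = samples.max_log_likelihood_instance

print("Maximum Log Likelihood Gaussian:")
print("Centre = ", instance.centre)
print("Normalization = ", instance.normalization)
print("Sigma = ", instance.sigma)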

pyautofit's People

Contributors

arfon, dependabot[bot], dfm, jacob-hjortlund, jammy2211, jhod0, jonathanfrawley, matthewghgriffiths, rhayes777, victorforouhar


pyautofit's Issues

Clean up optimizer grid search phase path / phase name

I have made it so that in autofit, phase_name and phase_path are now separate. Previously, we used the phase name to include the full directory structure of the output phase results (e.g. 'example_data/pipeline_name/phase_name').

This doesn't play nicely with our use of named phase names to link previous results between pipelines, so phase_path now takes the folders before the phase_name, e.g.:

phase_path = 'example_data/pipeline_name/'
phase_name = 'phase_name'

Implementing this throughout autofit was straightforward; however, I wasn't able to get the optimizer_grid_search to do it properly. Specifically, at line 167, the phase of each optimizer grid search is still set up using:

            name_path = "{}/{}".format(self.phase_name, "_".join(labels))
            optimizer_instance = self.optimizer_class(model_mapper=model_mapper,
                                                      phase_path='', phase_name=name_path)

That is, the path is set up using the full phase name as it was before. This passes all tests, but doesn't mesh nicely with the new layout. If you can get this input to use phase_path as per usual we're good.

Don't resume a MultiNest run if model.results exists

Getting a fairly dodgy issue - MultiNest runs are resuming even after they have finished in a previous run (e.g. once they've made a model.results file). Not totally sure of the cause of this; I guess MultiNest has some randomness as to whether it decides it should finish or not.

This is problematic: if MultiNest resumes a run that it previously continued beyond, it will change the priors / hyper image / etc. of the subsequent run, which can really mess things up.

I think a good solution to prevent this from happening is to simply check if a model.results file exists, and if so, not start MultiNest.

We may also benefit from introducing some sort of checkpointing, so a run 'resumes' from exactly where it left off in a pipeline. That could be a nightmare with passing / loading data though...

Backup results in a non-sym-link directory.

It is common for 'ungraceful exits' to make the MultiNest results unresumable. It is also a pain having the results in a sym-link directory (especially in Docker, where this can't be accessed).

I think it would be wise to copy all results in a strategic way to a non-sym-link folder (e.g. optimizer_backup) so that they can be accessed there, and used to resume a run if the sym-link results are corrupted.

Add prior width option to use % of parameter value instead of float

Currently, when we link the priors of a parameter together in two phases, we use the float value in the priors/width config file. For example:

[AbstractEllipticalSersic]
intensity=1.0
effective_radius=1.0
sersic_index = 1.2

The sersic index assumes a sigma=1.2 for the next phase's Gaussian prior (unless the errors on the parameter are greater than 1.2).

This works well for parameters like the Sersic index, where we can generally anticipate that the range of values from one phase to another will be covered by a value of 1.2.

However, for a parameter like intensity, this isn't very useful. Sometimes the intensity of a galaxy is of order 0.0001, sometimes it's of order 1.0, sometimes it is of order 100.0. There is no sensible single value we can use to generically link two intensities for every feasible galaxy value.

Therefore, we should introduce a 'percent' option in this config which, if used, sets the link's sigma as a % of the best-fit value inferred from the previous phase. For example:

[AbstractEllipticalSersic]
intensity=p, 50.0
effective_radius=v, 1.0
sersic_index = v, 1.2

Here, 'v' stands for 'value', and signifies the link uses the value as usual.

The 'p' stands for 'percent' and indicates that the sigma used for the intensity is the best-fit value times 50%. For example, if the intensity was 0.01, sigma=0.005; if it were 100.0, sigma=50.0.
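A minimal sketch of how such a config entry could be interpreted (the helper below is hypothetical, not existing autofit code):

def sigma_from_width_entry(entry, best_fit_value):
    # Hypothetical parser for entries like 'v, 1.0' or 'p, 50.0'.
    kind, value = entry.split(",")
    value = float(value)
    if kind.strip() == "v":  # 'value': use the config number directly
        return value
    if kind.strip() == "p":  # 'percent': sigma is a fraction of the best fit
        return abs(best_fit_value) * value / 100.0
    raise ValueError(f"unknown width type: {kind}")

print(sigma_from_width_entry("p, 50.0", best_fit_value=0.01))   # 0.005
print(sigma_from_width_entry("p, 50.0", best_fit_value=100.0))  # 50.0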

model_mapper does not return PriorLimitException, but raises it

In model mapper, this function:

    def assert_within_limits(self, value):
        if not (self.lower_limit <= value <= self.upper_limit):
            raise exc.PriorLimitException(
                "The physical value {} for a prior was not within its limits {}, {}".format(value, self.lower_limit,
                                                                                            self.upper_limit))

It does not return the error. Thus, the error is not used in non_linear to resample a model outside the physical limits :(.

Tuples not showing up in model.info after being passed as a constant

This is most likely related to this issue #39

If an object is passed as a constant, e.g.

            self.galaxies.lens.light = results.from_phase(
                "phase_1__lens_sersic"
            ).constant.galaxies.lens.light

Any attribute that is a tuple is no longer output to the model.info file:

galaxies
    lens
        mass
            centre
                centre_0                                                                  GaussianPrior, mean = 0.0018718435852760728, sigma = 0.05
                centre_1                                                                  GaussianPrior, mean = -0.0014617864827305366, sigma = 0.05
            axis_ratio                                                                    GaussianPrior, mean = 0.9797092457492225, sigma = 0.2
            phi                                                                           GaussianPrior, mean = 97.00649382709771, sigma = 20.0
            einstein_radius                                                               GaussianPrior, mean = 1.9249201608992714, sigma = 0.09624600804496358

The attribute itself appears to carry over fine, and thus the analysis is not biased or impacted by this. It's simply the output to model.info that is the issue.

Generate model instance without a check on prior limits

When we create a model instance, all parameter values are compared to the limits in the limits config file.

However, I am now also creating a model instance of a parameter's errors (which typically take very low values near -0.1 to 1.0, not on the same 'scale' as the parameter itself). I am doing this so that when manipulating errors with the aggregator, the ModelInstance which contains the GalaxyModel can be manipulated.

However, these errors are typically not within the limits of the parameters. Can we get a bool in the instance_from_physical_vector method to turn this off, so that when I use the method for errors I can set the bool to False and not check limits?

New output for optimizer grid search

Currently, the results of an optimizer grid_search look like this:

gaussians_sub_gaussian_centre_0, gaussians_sub_gaussian_centre_1, figure_of_merit
-2.00, -2.00, 1612.75
-2.00, 0.00, 1612.63
0.00, -2.00, 1612.60
0.00, 0.00, 1642.22

It would be good if we could get a lot more information in this, something like this:

Grid Search Parameter 0 :                                  gaussians_sub_gaussian_centre_0
Grid Search Parameter 1:                                   gaussians_sub_gaussian_centre_1

Figure of Merit:                                                 Evidence

Search 1: [(-2.00:0.00)->-1.2, (-2.00:0.00)->-0.8]            1612.75
Search 2: [(-2.00:0.00)->-1.3, (0.00:2.00)->0.2]               1612.63
Search 3: [(0.00:2.00)->0.9, (-2.00:0.00)->-1.0]               1612.60
Search 4: [(0.00:2.00)->1.6, (0.00:2.00)->1.5]                 1642.22

Best-fit model -> Search 4

Most Likely Model:

gaussians
    sub_gaussian
        intensity                                                                      0.033
        sigma                                                                          0.165
        centre
            centre_0                                                                  -0.976
            centre_1                                                                  -0.757

The full prior ranges are explicitly written out next to each search in square brackets (e.g. [(-2.00:0.00)] corresponds to the range of the first parameter).

The values next to these, with a -> arrow, correspond to the best-fit inferred value for that search.

The best-fit model section at the bottom can be generated using the same code as the 'output_results' in multinest_output; we have standard text-formatter things for outputting that.

I can't find the code to do this, but would be happy to do most of it if you can get a basic template up for editing.

Optimizer pickle should be after pass_priors function

We made it so that an NLO instance is pickled to the phase folder, allowing our aggregator to load the NLO instances for us to manipulate the model.

I think the NLO instance that is pickled is from before the pass_priors function is called. This means that the NLO instance does not correspond to the model mapper actually used in the analysis.

The reason I think this is that I am using an NLO instance via the aggregator in autocti, and for the following code:

optimizers = agg.optimizers_with(pipeline=pipeline, phase=phase)

instance = optimizers[0].most_probable_model_instance

print(optimizers[0].variable.instance_from_physical_vector(physical_vector=[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]))

the physical vector should only correspond to 6 parameters, as two are fixed in the pass_priors function.

Automatically delete optimizer folder in phase when resuming a job

When I download the results from a supercomputer I get a set of optimizer folders which are empty (their sym-link doesn't point to a valid location on my laptop).

We created a scheme where the optimizer_backup folder would overwrite these folders, so that jobs could be restarted from the laptop. However, this crashes before the overwrite happens, as the optimizer folder exists.

Can we make it so that the optimizer folder is deleted (or overwritten) when a pipeline starts in this way?

Config file for format string of parameters

When we print parameters, we use a format string to display them, e.g. {:.2f}, {:.4e}. Currently, there are methods which use format strings to do this, but the format string is input by the user or assumes a default value.

For every parameter, there are different ways we want to display them. For example, some parameters may correspond to very small numbers like 0.0000001, so a format string {:.2e} is appropriate. Others may warrant a {:.2f}.

Basically, we just need a config file to accompany label.ini which lists these format strings.
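A sketch of what this could look like (the file contents and helper below are hypothetical; only label.ini exists today):

import configparser

config = configparser.ConfigParser()
config.read_string("""
[Gaussian]
centre = {:.2f}
normalization = {:.2e}
sigma = {:.4f}
""")

def format_parameter(cls_name, param, value):
    # Look up the per-parameter format string and apply it.
    return config[cls_name][param].format(value)

print(format_parameter("Gaussian", "normalization", 0.0000001))  # 1.00e-07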

Nothing being written to output.log

autofit v0.11.0
autolens v0.10.0

Nothing is being written to the output.log files during/after multinest runs.

The output section of my config file "general.ini" is:

[output]
log_interval = 50
log_file = output.log
number_of_accepted_samples_between_output = 10

Choose sym link path

Currently, the sym link goes to the $HOME folder, if not specified otherwise.

On the supercomputer, this puts it in the home directory of the user, which has limited hard disk space, thus filling the folder up with MultiNest files.

We use the $HOME environment variable to set up the runs before submitting them to the supercomputer, so we can't overwrite $HOME to choose our sym link path.

Could we get an option in the config file that specifies the path? E.g. if it's HOME it uses home, but it could also be a directory path.

Use NLO pickle to ensure resumes are identical state

If a MultiNest run is resumed, and any of the following are changed:

  • The model.
  • The priors on the model.
  • The NLO settings.

We should raise an error so that the run does not resume. Now we are pickling the NLO, we can do exactly that.

Parallel feature of GridSearch bugged

It's been a long time since we used this, and never since the promises API, so no surprise it's gone buggy. In PyAutoToy, using parallel=True gives this error:

Process 0:
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/jammy/PycharmProjects/PyAuto/PyAutoFit/autofit/optimize/grid_search.py", line 427, in run
    self.queue.put(job.perform())
  File "/home/jammy/PycharmProjects/PyAuto/PyAutoFit/autofit/optimize/grid_search.py", line 387, in perform
    result = self.optimizer_instance.fit(self.analysis, self.model)
  File "/home/jammy/PycharmProjects/PyAuto/PyAutoFit/autofit/optimize/non_linear/non_linear.py", line 293, in timed_function
    result = func(optimizer_instance, *args, **kwargs)
  File "/home/jammy/PycharmProjects/PyAuto/PyAutoFit/autofit/optimize/non_linear/multi_nest.py", line 182, in fit
    analysis.visualize(instance=instance, during_analysis=False)
  File "/home/jammy/PycharmProjects/PyAuto/PyAutoToy/toy_gaussian/src/pipeline/phase/imaging/analysis.py", line 55, in visualize
    fit = self.masked_imaging_fit_from_instance(instance=instance)
  File "/home/jammy/PycharmProjects/PyAuto/PyAutoToy/toy_gaussian/src/pipeline/phase/imaging/analysis.py", line 44, in masked_imaging_fit_from_instance
    instance.gaussians,
  File "/home/jammy/PycharmProjects/PyAuto/PyAutoArray/autoarray/structures/arrays.py", line 209, in in_1d_binned
    return self.mask.mapping.array_binned_from_sub_array_1d(sub_array_1d=self)
  File "/home/jammy/PycharmProjects/PyAuto/PyAutoArray/autoarray/mask/mapping.py", line 160, in array_binned_from_sub_array_1d
    self.mask.sub_fraction,
AttributeError: 'Mask' object has no attribute 'sub_fraction'
Process 1:
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/jammy/PycharmProjects/PyAuto/PyAutoFit/autofit/optimize/grid_search.py", line 427, in run
    self.queue.put(job.perform())
  File "/home/jammy/PycharmProjects/PyAuto/PyAutoFit/autofit/optimize/grid_search.py", line 387, in perform
    result = self.optimizer_instance.fit(self.analysis, self.model)
  File "/home/jammy/PycharmProjects/PyAuto/PyAutoFit/autofit/optimize/non_linear/non_linear.py", line 293, in timed_function
    result = func(optimizer_instance, *args, **kwargs)
  File "/home/jammy/PycharmProjects/PyAuto/PyAutoFit/autofit/optimize/non_linear/multi_nest.py", line 182, in fit
    analysis.visualize(instance=instance, during_analysis=False)
  File "/home/jammy/PycharmProjects/PyAuto/PyAutoToy/toy_gaussian/src/pipeline/phase/imaging/analysis.py", line 55, in visualize
    fit = self.masked_imaging_fit_from_instance(instance=instance)
  File "/home/jammy/PycharmProjects/PyAuto/PyAutoToy/toy_gaussian/src/pipeline/phase/imaging/analysis.py", line 44, in masked_imaging_fit_from_instance
    instance.gaussians,
  File "/home/jammy/PycharmProjects/PyAuto/PyAutoArray/autoarray/structures/arrays.py", line 209, in in_1d_binned
    return self.mask.mapping.array_binned_from_sub_array_1d(sub_array_1d=self)
  File "/home/jammy/PycharmProjects/PyAuto/PyAutoArray/autoarray/mask/mapping.py", line 160, in array_binned_from_sub_array_1d
    self.mask.sub_fraction,
AttributeError: 'Mask' object has no attribute 'sub_fraction'

Once this is running on your laptop, it would be good (if we have time) for me to test this on the Durham supercomputer.

Optimizer grid search does not work for redshift

Working for the centres of a profile, but not for a galaxy redshift :O. The error is:

Traceback (most recent call last):
File "/home/jammy/PyCharm/Projects/AutoLens/workspace_jam/runners/subhalo_runner.py", line 31, in
pipeline.run(data=ccd_data)
File "/home/jammy/PyCharm/Projects/AutoLens/autolens/pipeline/pipeline.py", line 48, in run
positions=positions))
File "/home/jammy/PyCharm/Projects/AutoLens/autolens/pipeline/phase.py", line 459, in run
result = self.run_analysis(analysis)
File "/home/jammy/PyCharm/VirtualEnvs/AutoLens/lib/python3.6/site-packages/autofit/tools/phase.py", line 139, in run_analysis
return self.optimizer.fit(analysis, self.grid_priors)
File "/home/jammy/PyCharm/VirtualEnvs/AutoLens/lib/python3.6/site-packages/autofit/optimize/grid_search.py", line 145, in fit
arguments = self.make_arguments(values, grid_priors)
File "/home/jammy/PyCharm/VirtualEnvs/AutoLens/lib/python3.6/site-packages/autofit/optimize/grid_search.py", line 109, in make_arguments
if float("-inf") == grid_prior.lower_limit or float('inf') == grid_prior.upper_limit:
AttributeError: 'Redshift' object has no attribute 'lower_limit'

Continue a pipeline after a NLO grid search

The NLO grid search works great!

We now have a need to continue a pipeline after the NLO grid search has run. Basically, we want the 'previous_results' of the next phase to have access to the models of the grid search. The most important result is the highest evidence model from the entire NLO search, so maybe something like:

self.model = previous_results[-1].best_grid_search_model.variable.model

Having access to all grid search models (i.e. an index from 0 -> 35 for a 6 x 6 grid search) would be nice too, but not necessary.

Overwriting a prior overwrites the upper / lower limits and resets them to +/- inf

In AutoLens, we have a pass_priors function as follows:

    class LensSourcePhase(phase.LensSourcePlanePhase):

        def pass_priors(self,previous_results):

            phase_1_results=previous_results[0]
            self.source_galaxies.source = phase_1_results.variable.source

The source GalaxyModel comes in with the correct lower / upper limits (e.g. 0.0 -> inf), but when it is overwritten by the phase_1_results source, the limits automatically reset to -inf -> inf.

This means that certain parameters for certain models are going to unphysical values (e.g. negative intensity, effective radius, etc.). When we use '.variable' to link two parameters, we should always default to the config lower / upper limits, or, at the very least, use the same values as the previous phase.

optimizer_with_all_constant_from_optimizer

In PyAutoLens, to set up the hyper galaxy phase I override the galaxies of the optimizer to ensure that only the HyperGalaxy parameters are varied:

            optimizer = phase.optimizer.copy_with_name_extension(extension=path[-1])

            optimizer.phase_tag = ""

            # TODO : This is a HACK :O

            optimizer.variable.galaxies = []

In PyAutoCTI, I have to do the same thing, but with each model component:

        phase.optimizer.variable.parallel_species = []
        phase.optimizer.variable.parallel_ccd_volume = []

Could we get a simple method which generates an 'empty' optimizer for us to manipulate in the phase extensions? Obviously the hacks above are dodgy if new attributes are added to the optimizer.

Promises need model_absolute and model_relative attributes

The following works:

    mass = af.PriorModel(al.mp.EllipticalPowerLaw)

    mass.centre = af.last.model.galaxies.lens.mass.centre
    mass.axis_ratio = af.last.model.galaxies.lens.mass.axis_ratio
    mass.phi = af.last.model.galaxies.lens.mass.phi
    mass.einstein_radius = af.last.model.galaxies.lens.mass.einstein_radius

The following does not, due to the 'model_absolute':

    mass = af.PriorModel(al.mp.EllipticalPowerLaw)

    mass.centre = af.last.model.galaxies.lens.mass.centre
    mass.axis_ratio = af.last.model.galaxies.lens.mass.axis_ratio
    mass.phi = af.last.model.galaxies.lens.mass.phi
    mass.einstein_radius = af.last.model_absolute(a=0.3).galaxies.lens.mass.einstein_radius

Aggregator needs quick access to Output class

After moving the model out of the NLO class, the aggregator is now a bit cumbersome for setting up a model output that allows one to manipulate results:


agg = af.Aggregator(
    directory=output_path + "/" + waveband + "/" + image
)

optimizers = agg.optimizers_with(pipeline=pipeline_meta, phase=phase)
phases = agg.phases_with(pipeline=pipeline_meta, phase=phase)

output = multi_nest_output.MultiNestOutput(model=phases[0].model, paths=optimizers[0].paths)

It'd be good to get a method that directly loads the MultiNestOutput class, as for aggregator use this is the main thing we'll probably ultimately use anyway.

Backup path now used for all MultiNest NLO output

The opt_path was used to load the MultiNest summary / weighted samples files. This meant that when I opened a pickled instance of a MultiNest class from the supercomputer, I could not load these files, as the path pointed to a sym-link.

I have made it so that the backup path is now used for everything, as this is a Python string and thus works regardless of the sym-link. Thus, the opt_path and sym_link are now only used during a MultiNest fit, which I think is what we want.

A consequence of this is that the self.backup() function that runs at the end of every NLO (MultiNest, DHS, Grid) has moved up a few lines, to happen before results are output and visualized. This ensures the necessary files are present in the optimizer_backup folder before they are used for output.

Output results of classic grid search

The classic grid search runs fine, but we need to output the results for quick visual inspections.

For a (y,x) grid search, can we make a model.results file that looks something like:

centre_0, centre_1, likelihood
-1.0 1.0 10.0
-0.5 1.0 15.0
-0.0 1.0 22.0

etc.
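A sketch of a writer that produces this format (hypothetical helper; the example rows are taken from above):

def write_grid_search_results(path, rows):
    # Write one header line, then one row of values per grid cell.
    with open(path, "w") as f:
        f.write("centre_0, centre_1, likelihood\n")
        for centre_0, centre_1, likelihood in rows:
            f.write(f"{centre_0:.1f} {centre_1:.1f} {likelihood:.1f}\n")

write_grid_search_results(
    "model.results",
    rows=[(-1.0, 1.0, 10.0), (-0.5, 1.0, 15.0), (0.0, 1.0, 22.0)],
)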

Unzip results via aggregator

Quick aggregator function that finds all results in a directory and unzips them, e.g. after downloading them from an HPC.

Aggregator With Tag

Once I add pipeline tagging to the code, the directory structure of a fit will be as follows:

-> Pipeline_1
   -> Pipeline_Tag__Does_Something_Cool
        -> Phase_1_Fit
            ->Phase_Tag__Sub_1
            -> Phase_Tag__Sub_2
        -> Phase_2_Another_Fit
            -> Phase_Tag__Sub_1
    -> Pipeline_Tag__Does_Something_Else_Cool
        -> Phase_1_Fit
            -> Phase_Tag__Sub_2
        -> Phase_2_Another_Fit
            -> Phase_Tag__Sub_1
-> Pipeline_2
   -> Pipeline_Tag__Does_Something_Cool
        -> Phase_1_Fit
            ->Phase_Tag__Sub_1

So, what we want with the aggregator is the ability to grab whatever results we want from this path structure. So, in the simplest case, we could supply all 4 strings:

pipeline_name = 'Pipeline_1'
pipeline_tag = 'Pipeline_Tag__Does_Something_Cool'
phase_name = 'Phase_2_Another_Fit'
phase_tag = 'Phase_Tag__Sub_1'

This will get the results of all images which are fitted with that specific set of pipelines, tags, phases, etc.

Next, let's say we don't specify a phase tag. What should the aggregator do? For each image, it should retrieve all fits using the input pipeline_name, pipeline_tag and phase_name. Thus, for each image it should retrieve a list of fits, one for each phase_tag in the phase_name folder.

If a pipeline_tag isn't specified it should do the same. That is, for every image, it should retrieve a list for all of the different pipelines it was run on, given the phase_name and phase_tag which specify which folder to look in.

I guess if pipeline_tag and phase_tag are both missing, we'll have a list of lists.

Should the aggregator by default return lists of lists of lists? E.g.:

optimizers = aggregator.opt_from_what

optimizers[pipeline_tag_index][phase_tag_index]

Can't pass results of phase extensions using new API

Can't access an extended phase's results using the new API, e.g.:

phase1.result.hyper_combined.constant.hyper_image_sky

This gives:


Traceback (most recent call last):
  File "/home/jammy/PycharmProjects/PyAutoLens/test/integration/tests/full_pipeline/hyper_with_lens_light_bg_new_api.py", line 287, in <module>
    runner.run(sys.modules[__name__])
  File "/home/jammy/PycharmProjects/PyAutoLens/test/integration/tests/runner.py", line 30, in run
    optimizer_class=optimizer_class,
  File "/home/jammy/PycharmProjects/PyAutoLens/test/integration/tests/full_pipeline/hyper_with_lens_light_bg_new_api.py", line 56, in make_pipeline
    hyper_galaxy=phase1.result.hyper_combined.constant.galaxies.lens.hyper_galaxy
AttributeError: 'Result' object has no attribute 'hyper_combined'

Promises not passing all information in results

I have a pipeline which, using the promises API, makes it through 7 phases including hyper functionality. However, at some point part of the results is not being fully passed, as in phase 7 the code is unable to make the hyper-images because the results required to do so are missing.

This is observed in phase 7 of the integration test 'full_pipeline/hyper_with_lens_light_bg_new_api.py'

Grid Searches

What is a grid search?

A grid search is a non-linear search which searches over each dimension of parameter space by sampling values of the parameters between a lower and upper limit, in a uniform (or log-uniform, etc.) manner. For example, if we had a 3 dimensional parameter space consisting of parameters p1, p2, p3, we may define a grid search where:

p1 = -0.1 -> 0.1 in bins of 0.01
p2 = 100 -> 200 in bins of 1.0
p3 = 5 -> 15 in bins of 5.0

In this case, p1 has 20 possible values, p2 100 and p3 3. This would mean our grid-search would require 20 x 100 x 3 = 6000 samples to complete.
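As a plain-Python sketch of this idea (not AutoFit code; the likelihood below is a stand-in):

import itertools
import numpy as np

# Uniform grids matching the example above: 20 x 100 x 3 = 6000 samples.
p1_values = -0.1 + 0.01 * np.arange(20)
p2_values = 100.0 + 1.0 * np.arange(100)
p3_values = np.array([5.0, 10.0, 15.0])

def log_likelihood(p1, p2, p3):
    # Stand-in likelihood for illustration only.
    return -0.5 * (p1 ** 2 + ((p2 - 150.0) / 10.0) ** 2 + (p3 - 10.0) ** 2)

samples = [
    (p1, p2, p3, log_likelihood(p1, p2, p3))
    for p1, p2, p3 in itertools.product(p1_values, p2_values, p3_values)
]
best = max(samples, key=lambda sample: sample[3])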

The benefits of a grid search are:

  1. You can specify exactly where in parameter space you want to search for the highest likelihood model, which means known systematic solutions (e.g. the high-mass lens models in AutoLens) can be avoided. Or, they can be included, but the non-linear search itself will not 'get stuck' searching those regions of parameter space the way something like MultiNest would.

  2. The results are returned on a uniform grid of points, which can make interpreting the nature of the parameter space a lot simpler and can facilitate fitting smooth 'fitting functions' to the behaviour of certain parameters. This is exactly what we do when we perform a sensitivity analysis of substructures in a strong lens, and thus why a grid search is necessary.

The downside is:

  1. It doesn't scale up to large dimensional problems, as the number of iterations quickly explodes. This is why we use MultiNest in general.

How will we implement a grid search?

A key difference between the simple example of a grid search above and one implemented in AutoFit is that, in AutoFit, a higher dimensional model may already have been fit in a previous phase. Therefore, we will want to perform a grid search on a sub-set of parameters, using a model whose many other parameters have already been initialized.

For example, in AutoLens, we may have initialized a lens model with 20+ parameters, corresponding to the lens and source galaxies' light and mass profiles. We may then want to perform a grid search on a sub-set of just three of these parameters, keeping the other 20+ parameters fixed to their most-likely values from the previous phase. The sensitivity analysis is the obvious example of this, where we fix all lens / source galaxy parameters and perform a 3D grid search over a substructure's x, y and mass parameters.

In terms of implementation, this probably means we need a 'GridSearch' class in non_linear.py, and a new set of prior config files which are called for this grid search. These config files specify, by default, whether each parameter of a given model class should:

  1. Use the previous phase's most likely value as a Constant.
  2. Be part of the grid search, specifying its range of values and spacing. We can reuse classes like UniformPrior, as the grid search will span unit-hypercube values from 0.0 to 1.0.

The grid search itself is then fairly straightforward: we just sample the specified models and output all of them (parameters, likelihoods, etc.) to a file on the hard disk. This file can follow the same format as MultiNest files.

MultiNest Grid Search

We in fact need a second type of grid search. In this case, for each parameter that we grid-search over, we don't just use a constant value of that parameter at each point, but perform a complete MultiNest search of that parameter using uniform priors spaced between its boundaries. For example, if we had two parameters:

p1 = 0.0 -> 1.0 in bins of 0.1
p2 = 10.0 -> 20.0 in bins of 5.0

This grid-search would perform 20 MultiNest searches, where the priors on each parameter are as follows:

MultiNest search 1:

p1 = UniformPrior(lower_limit=0.0, upper_limit=0.1)
p2 = UniformPrior(lower_limit=10.0, upper_limit=15.0)

MultiNest search 2:

p1 = UniformPrior(lower_limit=0.1, upper_limit=0.2)
p2 = UniformPrior(lower_limit=10.0, upper_limit=15.0)

MultiNest search 3:

p1 = UniformPrior(lower_limit=0.2, upper_limit=0.3)
p2 = UniformPrior(lower_limit=10.0, upper_limit=15.0)

... and so on up to MultiNest Search 10

MultiNest search 11:

p1 = UniformPrior(lower_limit=0.0, upper_limit=0.1)
p2 = UniformPrior(lower_limit=15.0, upper_limit=20.0)

... and repeat up to MultiNest search 20.
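A sketch (not autofit code) of generating the per-cell prior limits for these 20 searches, with p1 varying fastest as in the listing above:

import itertools

def subdivide(lower, upper, bins):
    # Split [lower, upper] into `bins` equal (lower, upper) sub-ranges.
    step = (upper - lower) / bins
    return [(lower + i * step, lower + (i + 1) * step) for i in range(bins)]

searches = [
    {"p1_limits": p1, "p2_limits": p2}
    for p2, p1 in itertools.product(
        subdivide(10.0, 20.0, 2), subdivide(0.0, 1.0, 10)
    )
]
# searches[0] -> {'p1_limits': (0.0, 0.1), 'p2_limits': (10.0, 15.0)}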

Obviously, this functionality should work the same way as above, whereby if we have an existing model with 20+ parameters, this MultiNest grid search can use constants for the majority of parameters, assuming the results of a previous phase.

This MultiNest grid search is required for substructure detection, whereby when we add sub-halos we want them to have some freedom in changing their position / mass so as to best fit the data.

variable link via absolute or relative value within pipeline

Currently, we link two parameters as:

param = results.from_phase('phase').variable

The width of the Gaussian prior that makes this link is set by the width config file. It'd be nice if we could set this width manually in the link, for super-customized pipelines. For example:

param = results.from_phase('phase').variable_absolute(a=1.0)
param = results.from_phase('phase').variable_relative(r=0.5)

Aggregator filtering for different phase folders

For my CTI runs, the following is an example of my directory structure of some runs:

-> ci_images_uniform
    -> parallel_x2
         -> low_resolution
                  -> pipeline_1
                           -> phase 1
                            -> phase 2
                   -> pipeline_2
                          -> phase 1
         -> mid_resolution
                  -> pipeline_1
                           -> phase 1
                            -> phase 2
                   -> pipeline_2
                          -> phase 1
    -> serial_x2
         -> low_resolution
                  -> pipeline_1
                           -> phase 1
                            -> phase 2
                   -> pipeline_2
                          -> phase 1
         -> mid_resolution
                  -> pipeline_1
                           -> phase 1
                            -> phase 2
                   -> pipeline_2
                          -> phase 1
-> ci_images_non_uniform
    -> parallel_x2
         -> low_resolution
                  -> pipeline_1
                           -> phase 1
                            -> phase 2
                   -> pipeline_2
                          -> phase 1
         -> mid_resolution
                  -> pipeline_1
                           -> phase 1
                            -> phase 2
                   -> pipeline_2
                          -> phase 1

When using the aggregator to load a list of NLOs from this directory structure, I cannot filter based on folder names. For example, I may want all results in folders named 'parallel_x2' and 'mid_resolution', to compare how these models change for the different image types (ci_images_uniform, ci_images_non_uniform). Thus, if we could pass a list of strings to filter based on directory names, that'd be swell.
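A hypothetical sketch of that filtering: keep only the output paths whose folder names include every requested string.

def filter_paths(paths, required_folders):
    # Keep paths containing all required folder names as path components.
    return [
        path for path in paths
        if set(required_folders).issubset(path.split("/"))
    ]

paths = [
    "ci_images_uniform/parallel_x2/mid_resolution/pipeline_1/phase_1",
    "ci_images_uniform/serial_x2/low_resolution/pipeline_2/phase_1",
]
print(filter_paths(paths, required_folders=["parallel_x2", "mid_resolution"]))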

Make NLO grid search output images of every individual NLO run

Currently, there is one image folder for the entire NLO grid search, e.g.:

workspace/output/pipeline_name/phase_nlo/image/*.png

It would be better if each NLO grid search cell had its images output separately, e.g.:

workspace/output/pipeline_name/phase_nlo/centre_0_0_centre_1_0/image/*.png

workspace/output/pipeline_name/phase_nlo/centre_0_0_centre_1_1/image/*.png

etc.

It should be noted that the 'pdf_triangles.pdf' file actually is output correctly in each phase's image folder; it is just the lens_fit / tracer images that are not.

Fast access to results of a pipeline

After we run a pipeline, we return the results of the pipeline, e.g.

pipeline = lens_light_and_source_inversion.make_pipeline(pipeline_path='/slacs/' + lens_name)
result = pipeline.run(data=ccd_data)

Normally, we don't want to do anything with the results, as the pipeline visualizes and outputs everything we need. However, if we want to customize the visualization of the results, the results object is handy, as it gives us access to everything we need, e.g.:


from autolens.model.inversion.plotters import inversion_plotters
inversion_plotters.plot_reconstructed_pixelization(inversion=result[4].most_likely_fit.inversion)

This allows us to customize the figures output via the plotters, e.g. in a Jupyter notebook.

from autolens.model.inversion.plotters import inversion_plotters

inversion_plotters.plot_reconstructed_pixelization(inversion=result[4].most_likely_fit.inversion, 
                                                   title='SLACS1430+4105 Reconstructed Source', 
                                                   xlabelsize=24, ylabelsize=24, xyticksize=24)

However, accessing these results is slow. We have to run the entire pipeline, which involves MultiNest running an optimizer and lots of visualization. It would be good if, when we ran the pipeline, we could tell it to go through the entire pipeline so as to generate the results, but skip running optimizers / visualization so that it is a lot faster.

Automated Latex table generation from model

Let's say we have a most probable model:

instance.thing.param1 = 1.0
instance.thing.param2 = 2.0
instance.thing2.param1 = 3.0

with upper errors (e.g. 3.0 sigma):

instance.thing.param1 = 0.1
instance.thing.param2 = 0.5
instance.thing2.param1 = 0.3

and lower errors:

instance.thing.param1 = 0.05
instance.thing.param2 = 0.3
instance.thing2.param1 = 0.67

And the parameters have labels:

instance.thing.param1 = \rho
instance.thing.param2 = \rho
instance.thing2.param1 = D

Using a model mapper, we want to generate a LaTeX table (e.g. for a paper) that displays these results. Something like:

$\rho_{1} = 1.0^{+0.1}_{-0.05}$ & $\rho_{2} = 2.0^{+0.5}_{-0.3}$ & $D = 3.0^{+0.3}_{-0.67}$

We may also want to add extra columns, e.g.:

Model A & 3 Species & $\rho_{1} = 1.0^{+0.1}_{-0.05}$ & $\rho_{2} = 2.0^{+0.5}_{-0.3}$ & $D = 3.0^{+0.3}_{-0.67}$ \\[4pt]

I think we have all the tools we need to do this with autofit, and now just need a routine that, given a model instance, produces a string with the above code.

I guess the hard part is that we need a clever way of making it so the order of the parameters in the LaTeX string can be specified, as by default the order will be weird based on prior passing. We probably also want this method to take a list of optimizers (and therefore model mappers) so that large LaTeX tables (e.g. of many lenses) can be produced.
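A minimal sketch of such a routine, using the example values above (the helper and its signature are hypothetical):

def latex_row(labels, values, upper_errors, lower_errors):
    # Format each parameter as $label = value^{+up}_{-low}$ and join with '&'.
    cells = [
        f"${label} = {value:.2f}^{{+{up:.2f}}}_{{-{low:.2f}}}$"
        for label, value, up, low in zip(labels, values, upper_errors, lower_errors)
    ]
    return " & ".join(cells) + r" \\[4pt]"

print(latex_row(
    labels=[r"\rho_{1}", r"\rho_{2}", "D"],
    values=[1.0, 2.0, 3.0],
    upper_errors=[0.1, 0.5, 0.3],
    lower_errors=[0.05, 0.3, 0.67],
))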

Raise error if 'result.model' or 'result.instance' is not called

I had a pipeline with a typo, which set up a prior as light=phase1.result.variale.galaxies.lens.light. This didn't raise an error, but instead led to an infinite loop in the recursion (the error raised was that the recursion infinite-loop limit was hit).

We should make sure typos are flagged.

Issue linking galaxies as constants through hyper phases

If I link a galaxy prior to a previous phase, using the following code:

            self.galaxies.lens = results.from_phase(
                "phase_3__lens_sersic_sie__source_sersic"
            ).constant.galaxies.lens

There will be a bug when it comes to the hyper combined phase. The lens galaxy will correctly be fitted for in the hyper-galaxy phase, and it will be linked to the hyper combined phase in the sense that its model.info displays correctly in the model.info file, implying the prior was passed. However, its parameters will not be passed to the hyper combined phase; that is, if the lens galaxy is the only galaxy, the hyper combined phase will have a dimensionality of 0.

I can hack a fix to this at the moment by linking galaxies to constants as follows:

            self.galaxies.lens.light = results.from_phase(
                "phase_3__lens_sersic_sie__source_sersic"
            ).constant.galaxies.lens.light

            self.galaxies.lens.mass = results.from_phase(
                "phase_3__lens_sersic_sie__source_sersic"
            ).constant.galaxies.lens.mass

            self.galaxies.lens.shear = results.from_phase(
                "phase_3__lens_sersic_sie__source_sersic"
            ).constant.galaxies.lens.shear

Stop CTI phase names littering output folder

When I run a phase, a lot of phase names are generated in the output folder that have nothing to do with my run, e.g.:

[image: extraneous phase folders in the output directory]

This happens in AutoLens too, so I guess it's some weird autofit thing.

model.info files are screwed

Here is a model.info file from AutoLens:

CollectionPriorModel

lens_galaxies_GalaxyModel

lens_galaxies_lens_EllipticalIsothermal

lens_galaxies_lens_mass_axis_ratio                                                  UniformPrior, lower_limit = 0.2, upper_limit = 1.0
lens_galaxies_lens_mass_phi                                                         UniformPrior, lower_limit = 0.0, upper_limit = 180.0
lens_galaxies_lens_mass_Length

lens_galaxies_lens_mass_einstein_radius_value                                                       UniformPrior, lower_limit = 0.0, upper_limit = 4.0

CollectionPriorModel

source_galaxies_GalaxyModel

source_galaxies_source_EllipticalSersic

source_galaxies_source_light_axis_ratio                                                  UniformPrior, lower_limit = 0.2, upper_limit = 1.0
source_galaxies_source_light_phi                                                         UniformPrior, lower_limit = 0.0, upper_limit = 180.0
source_galaxies_source_light_sersic_index                                                UniformPrior, lower_limit = 0.8, upper_limit = 8.0
source_galaxies_source_light_Luminosity

source_galaxies_source_light_intensity_value                                                       LogUniformPrior, lower_limit = 1e-06, upper_limit = 10.0
source_galaxies_source_light_Length

source_galaxies_source_light_effective_radius_value                                                       UniformPrior, lower_limit = 0.0, upper_limit = 4.0

After you went on holiday we managed to screw these up somehow. Probably not too tricky to fix.

Attributes of hyper combined phase need to default to None

If I run a pipeline with all hyper-mode features off, e.g.:

    phase1 = phase1.extend_with_multiple_hyper_phases(
        hyper_galaxy=False,
        include_background_sky=False,
        include_background_noise=False,
    )

This will not produce a 'hyper_combined' phase. We typically use this phase to set up these parts of the model in the next phase:

    phase2 = al.PhaseImaging(
        phase_name="phase_2__lens_sie__source_sersic",
        phase_folders=phase_folders,
        galaxies=dict(
            lens=al.GalaxyModel(
                redshift=redshift_lens,
                bulge=phase1.result.instance.galaxies.lens.bulge,
                disk=phase1.result.instance.galaxies.lens.disk,
                mass=mass,
                shear=shear,
                hyper_galaxy=phase1.result.hyper_combined.instance.galaxies.lens.hyper_galaxy,
            ),
            source=al.GalaxyModel(
                redshift=redshift_source, light=al.lp.EllipticalSersic
            ),
        ),
        hyper_image_sky=phase1.result.hyper_combined.instance.hyper_image_sky,
        hyper_background_noise=phase1.result.hyper_combined.instance.hyper_background_noise,
        sub_size=sub_size,
        signal_to_noise_limit=signal_to_noise_limit,
        bin_up_factor=bin_up_factor,
        positions_threshold=positions_threshold,
        inner_mask_radii=inner_mask_radii,
        pixel_scale_interpolation_grid=pixel_scale_interpolation_grid,
        optimizer_class=af.MultiNest,
    )

Hyper pipelines take optional inputs to determine if these features are used or not, so having them all "False" is a plausible run one could do. Would it be possible to make it so that if a hyper phase doesn't run, using result.hyper_combined will always return None irrespective of the attribute access?

This is currently leading to a bug in the pipeline where an infinite recursion error is raised, presumably because it cannot find the relevant hyper parameters.

Make sym links play nicely when mirroring data between a laptop and supercomputer

We are now going to begin running large analyses on the Durham supercomputer, for both AutoLens and AutoCTI. This effectively means we want to 'mirror' the output directory of each on our laptops and the supercomputer, so we can run large jobs on the supercomputer but analyse the output on our laptops. Furthermore, we may want to 'initialize' a model (e.g. phase 1) on our laptop to make sure it works and then send those results to the supercomputer for a more intensive analysis of the subsequent phases.

Passing data is straightforward; we can just make sure everyone has the appropriate rsync command-line invocations to pass data between the two.

However, I'm gonna guess that this is going to completely ruin our sym links to the MultiNest optimizer folder? YAY!

I'm not totally sure how we deal with this. I guess we could make it so that, by default, AutoLens always sets up a new optimizer folder and sym link on every single run, and copies the files in the optimizer_backup folder to this? It's ugly, but erm... yeah.

I haven't explicitly checked that rsync plays nicely with sym links from laptops to supercomputers either.

Packaging / unpackaging of results on the fly

We're running into file-count limits on the Durham supercomputer, where basically autolens / cti are producing way too many files per run. This is understandable, given each run has ~10 MultiNest outputs, ~10 other autofit files and 10-50 visualizations.

I think we need functionality that, if switched on, automatically packages an entire phase into a .zip or .tar file. I guess when we 'resume' a run the code would need to be able to unpackage results if they are in this format.

model.results mislabelled parameters

autofit v0.11.0
autolens v0.10.0

Some of the parameters and their values are mislabelled in the file model.results.

e.g

lens_mass_axis_ratio 2.007
lens_mass_phi 0.731
lens_mass_einstein_radius 144.182

should read

lens_mass_axis_ratio 0.731
lens_mass_phi 144.182
lens_mass_einstein_radius 2.007

How often do IntervalCounters count?

For visualization, backup and logging, we use IntervalCounters, which then perform the task depending on the config values:

    class Fitness(object):
        
        def __init__(self, nlo, analysis, image_path):

            self.nlo = nlo
            self.result = None
            self.max_likelihood = -np.inf
            self.image_path = image_path
            self.analysis = analysis
            visualise_interval = conf.instance.general.get('output', 'visualise_interval', int)
            log_interval = conf.instance.general.get('output', 'log_interval', int)
            backup_interval = conf.instance.general.get('output', 'backup_interval', int)

            self.should_log = IntervalCounter(log_interval)
            self.should_visualise = IntervalCounter(visualise_interval)
            self.should_backup = IntervalCounter(backup_interval)

        def fit_instance(self, instance):
            likelihood = self.analysis.fit(instance)

            if likelihood > self.max_likelihood:
                self.max_likelihood = likelihood
                self.result = Result(instance, likelihood)

                if self.should_visualise():
                    self.analysis.visualize(instance, image_path=self.image_path, during_analysis=True)

            if self.should_backup():
                self.nlo.backup()

            return likelihood

My question is: do these counters add 1 every time an accepted sample (e.g. one that increases the likelihood) is made, or for every single model evaluated in fit_instance?

If it is the latter, I think we should change it to the former, as it is leading to a large number of backups / logs taking place, which hurts us due to high IO on the supercomputer. If it isn't the latter, then something else is slowing down our supercomputer runs...
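For reference, a minimal sketch of an IntervalCounter of the kind used above (assuming it increments on every call, which is what the question hinges on):

class IntervalCounter:
    def __init__(self, interval):
        self.interval = interval
        self.count = 0

    def __call__(self):
        # Increment on every call; return True once per `interval` calls.
        if self.interval < 1:
            return False
        self.count += 1
        return self.count % self.interval == 0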

After a MultiNest run, delete the symlink folders

Again, to reduce the amount of file space used, we should delete the symlink folders after a MultiNest run finishes, including the optimizer and .al folders.

The optimizer_backup folder is then the primary store for results and is what is zipped up when compressing outputs.
