polarwandering / paleosampling

Quantitative Analysis of Paleomagnetic Sampling Strategies

Home Page: https://polarwandering.github.io/PaleoSampling/

License: MIT License

Languages: Jupyter Notebook 98.94%, Python 1.03%, Makefile 0.02%, TeX 0.02%
Topics: error-quantification, paleomagnetism, statistics

paleosampling's People

Contributors

allcontributors[bot], facusapienza21, lengallo, swanson-hysell


paleosampling's Issues

visualize distributions and the error to the true pole for the 95th percentile largest error

When I am evaluating a pole for use in a paleogeographic reconstruction, or designing a study to develop a pole, I want to have high confidence that the pole is a good estimate of the true pole direction.

The current approach of showing the mean square error between the simulated poles and the true pole is asking a slightly different question. The mean shows what the average deviation from the true pole is. But in practice I want to have a sense not of what this looks like on average, but of what it looks like elsewhere in the distribution (how often do I get a result that is far from the truth).

The histograms of Figure 3 give one window into this, but we need to think through a way to visualize this aspect of the result over more of the parameter space. One approach that I think could work would be to show the 95th percentile of the error to the true pole, in addition to the mean, in a plot that has the same structure as the plot showing the mean square error.

We can discuss other approaches like picking example parameter combinations for box/whisker or violin plots. Perhaps there could be an interactive widget where one could select a parameter combination and see the distribution in addition to the mean square error?

`TypeError` in the Figure 1 notebook that is rendered in the JupyterBook

There is currently a TypeError in the Figure 1 notebook that is rendered in the JupyterBook: https://polarwandering.github.io/PaleoSampling/figures/figure1/Figure1.html

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[6], line 1
----> 1 df_pole = df_filter[df_filter.n0 <= 20].pivot('N', 'n0', 'error_angle_S')
      2 df_scatter = df_filter[df_filter.n0 <= 20].pivot('N', 'n0', 'error_vgp_scatter')
      4 plot_angle_error(df_pole, 
      5                  df_scatter,
      6                  save_plot=True)

TypeError: pivot() takes 1 positional argument but 4 were given
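
This looks like the pandas >= 2.0 behavior where DataFrame.pivot only accepts keyword arguments. A minimal sketch of the likely fix, assuming the same df_filter and plot_angle_error defined earlier in the notebook:

# pandas >= 2.0 requires index/columns/values to be passed as keywords
df_pole = df_filter[df_filter.n0 <= 20].pivot(index='N', columns='n0', values='error_angle_S')
df_scatter = df_filter[df_filter.n0 <= 20].pivot(index='N', columns='n0', values='error_vgp_scatter')

plot_angle_error(df_pole,
                 df_scatter,
                 save_plot=True)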

Create intro tabs in JupyterBook

Add more navigation to the JupyterBook in the main notebooks. This should be very simple links that point to the respective notebooks for running the different experiments, especially those that are easy to execute.

Compute mean kappa in addition to mean angle error

These two are different (the mean of a function vs the function of the mean). This is important because the theoretical result holds for kappas, without information on the mean angle. In order to assess whether the result is actually working, we need to compute this quantity.
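
A minimal sketch of the distinction, assuming poles is an (n, 3) array of unit vectors for the simulated pole estimates and true_pole is the unit vector of the true pole (these names are placeholders, not smpsite variables):

import numpy as np

def mean_angle_error(poles, true_pole):
    # mean of the angular errors: the mean of a function of each pole
    angles = np.degrees(np.arccos(np.clip(poles @ true_pole, -1.0, 1.0)))
    return angles.mean()

def fisher_kappa(poles):
    # standard Fisher concentration estimate: kappa ~ (n - 1) / (n - R)
    n = len(poles)
    R = np.linalg.norm(poles.sum(axis=0))
    return (n - 1) / (n - R)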

make plot in Sampling_example.ipynb be site means

The example shown in Sampling_example.ipynb generates samples from 10 sites with 4 samples per site:

import smpsite as smp

params0 = smp.Params(N=10,
                     n0=4,
                     kappa_within_site=100,
                     site_lat=10,
                     site_long=0,
                     outlier_rate=0.10,
                     secular_method="G",
                     kappa_secular=None)

All of these samples are then plotted, which makes them look like independent samples. It is uncommon to plot sample data while skipping the site hierarchy as is done here. I think that it makes sense to either generate the sample with n0=1 or to calculate a site mean for each site. This would need to be done in directional space and then transformed to VGP space.

It also seems valuable to have this example show the angular error from the true pole for the sample. This would be illustrative.
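
A minimal sketch of the site-mean route (not code from the repository), assuming the samples end up in a DataFrame df_samples with 'site', 'dec', and 'inc' columns; the column names and the return order of pmag.dia_vgp are assumptions to be checked:

import pmagpy.ipmag as ipmag
import pmagpy.pmag as pmag

site_vgps = []
for site, group in df_samples.groupby("site"):
    # Fisher mean of the within-site directions, computed in directional space
    fmean = ipmag.fisher_mean(dec=group["dec"].values, inc=group["inc"].values)
    # convert the site-mean direction to a VGP (assumed return order: plon, plat, dp, dm)
    plon, plat, _, _ = pmag.dia_vgp(fmean["dec"], fmean["inc"], 0,
                                    params0.site_lat, params0.site_long)
    site_vgps.append((plon, plat))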

incorporate calculation of S parameter into the process

In addition to estimating $\mu$, we want to calculate the VGP scatter ($S$) as this is another parameter that paleomagnetists seek to determine.

VGP scatter is quantified by the parameter $S_p$ (e.g., 1969), defined as:

$$ S_p^2 = (N -1)^{-1} \sum_{i=1}^N (\Delta_i)^2 $$

where $N$ is the number of observations and $\Delta_i$ is the angle between the $i^{th}$ VGP and the spin axis.

To address the issue of within site scatter, some studies use a cutoff for $\kappa$ or $\alpha_{95}$ for inclusion in the calculation, while others adjust the value of $S_p$ to account for the within site scatter $S_w$.

McElhinny and McFadden (1997) defined a parameter $S_f$:

$$ S_f^2 = S_p^2 - (S_w^2/\bar n), $$

where $\bar n$ is the average number of samples per site.

The VGP scatter is calculated in https://github.com/PmagPy/PmagPy/blob/master/pmagpy/ipmag.py with the function sb_vgp_calc. This function is currently set up to accept a data frame, which should be generalized. It has a parameter for whether or not to apply the within-site scatter correction.
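
For reference, a minimal sketch of the two formulas above (this is not ipmag.sb_vgp_calc), where delta holds the angles $\Delta_i$ in degrees and s_w and n_bar are assumed to be given:

import numpy as np

def vgp_scatter(delta):
    # S_p = sqrt( (N - 1)^{-1} * sum(Delta_i^2) )
    delta = np.asarray(delta, dtype=float)
    return np.sqrt(np.sum(delta**2) / (len(delta) - 1))

def vgp_scatter_corrected(delta, s_w, n_bar):
    # S_f = sqrt( S_p^2 - S_w^2 / n_bar )   (McElhinny and McFadden, 1997)
    return np.sqrt(vgp_scatter(delta)**2 - s_w**2 / n_bar)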

Standardize notation in all code

Change all notation in the code to be consistent with the manuscript and the standard notation (a sketch of the renamed parameters follows the lists below):

  • N_sites: Total number of sites.
  • n_samples_site: Total number of samples per site.
  • n_samples: Total number of samples. This should be equal to N_sites * n_samples_site.

We can also use

  • kappa_insite: Kappa of the Fisher distribution for sampling within site.
  • outlier_rate: Proportion of expected outliers.
  • latitude, longitude: These correspond to the paleo-latitudes/longitudes used for going from (dec, inc) to (lat, lon).
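
A hypothetical sketch of what the renamed parameter container could look like (the actual smpsite.Params fields currently differ):

from dataclasses import dataclass

@dataclass
class Params:
    N_sites: int            # total number of sites
    n_samples_site: int     # number of samples per site
    kappa_insite: float     # kappa of the Fisher distribution for within-site sampling
    outlier_rate: float     # proportion of expected outliers
    latitude: float         # paleo-latitude used for going from (dec, inc) to (lat, lon)
    longitude: float        # paleo-longitude

    @property
    def n_samples(self) -> int:
        # total number of samples = N_sites * n_samples_site
        return self.N_sites * self.n_samples_site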

Reading files once `smpsite` is installed

The mapping from kappa to angular dispersion is currently computed based on an interpolation of a pre-computed file located in kappa_tabular. Now, in order to read this file once the package has been installed, I have added the following lines to define the relative path inside kappa.py:

import pathlib
import pandas as pd

_file_location = pathlib.Path(__file__).parent.joinpath("kappa_tabular/kappa2angular.csv")
df = pd.read_csv(_file_location, header=0)

However, even after installation, this path doesn't seem to be recognized on the Binder hub. I am not sure why this is the case, but I suspect this is not the best way of reading the file. This is the error message once we launch Binder:
[Screenshot of the Binder error message]
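
One possible alternative (an assumption, not tested here) is to resolve the file through importlib.resources, which locates data files relative to the installed package rather than the source tree, provided kappa_tabular/ is declared as package data for smpsite:

from importlib import resources

import pandas as pd

# requires Python >= 3.9 and kappa_tabular/ shipped as package data for smpsite
_file = resources.files("smpsite").joinpath("kappa_tabular/kappa2angular.csv")
with _file.open() as f:
    df = pd.read_csv(f, header=0)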

Check on transformation between directional and coordinate space

In the function generate_samples in smpsite/smpsite/sampling.py, I am converting the secular-variation poles from VGP space to directional (declination, inclination) coordinates. Then I add Fisher noise in this space. I want to check that the functions I am using from pmagpy are the right ones to do this transformation.

    # sample secular variations
    # lat, lon
    # here is where I can replace by TK03 model
    secular_declinations, secular_inclinations = ipmag.fishrot(k=params.kappa_secular,
                                                               n=N_sites, 
                                                               dec=0, 
                                                               inc=90, 
                                                               di_block=False)
    
    for i, nk in enumerate(design):
        outliers = np.random.binomial(1, params.outlier_rate, nk)
        outliers = sorted(outliers)
        n_outliers = np.sum(outliers)
        n_samples = nk - n_outliers
        
        # Transform VGP coordinates to directions (D, I) coordinates 
        # inc, dec
        dec_vgp, inc_vgp = pmag.vgp_di(plat=secular_inclinations[i], 
                                       plong=secular_declinations[i], 
                                       slat=params.site_lat, 
                                       slong=params.site_long)

    
        # Sample real samples (within-site)
        declinations, inclinations = ipmag.fishrot(k=params.kappa_vgp, 
                                                   n=n_samples,
                                                   dec=dec_vgp,
                                                   inc=inc_vgp,
                                                   di_block=False)

        # Convert specimens to geographical space
        for j in range(len(declinations)):
            trans_dec, trans_inc, _, _ = pmag.dia_vgp(declinations[j], inclinations[j], 0, params.site_lat, params.site_long)
            declinations[j] = trans_dec
            inclinations[j] = trans_inc

I would also like to include the TK03 sampling of the secular variations in the sampling, instead of sampling from a Fisher distribution with unknown kappa.

On the other hand, I wonder if I need to do the same when estimating the mean direction. That is, instead of taking averages of averages in coordinate space, should I compute the average of the site directions in directional space, transform these to coordinate space, and then compute the Fisher mean in coordinate space? That is, the mean site direction would be computed in (inc, dec) space, while the mean of VGPs would be computed in (lat, lon) space?

Thank you @LenGallo for the help! Feel free to open a pull request with changes, or open issues if there is something that needs fixing.

Optimize Sampling/Estimation time of code

Right now, the main bottleneck in all the simulations is the required runtime. The code should be profiled to see what is consuming the execution time. A quick scan revealed that the problem doesn't seem to come from the sampling or mean estimation per se, so there must be some other operation in the code that is time-consuming and can probably be replaced.

More exploration is needed.
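
As a starting point, a minimal profiling sketch with the standard library; run_one_simulation is a placeholder name for whatever function wraps a single sampling + estimation run, not an actual smpsite function:

import cProfile
import pstats

# profile one full simulation and dump the stats to a file
cProfile.run("run_one_simulation()", "profile.out")

# print the 20 calls with the largest cumulative time
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(20)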

Figure 2 for different algorithms' performance

For the second figure in the paper, we want to understand how the mean square error from the true pole compares when we use:

  1. A low number of samples per site (= 1) without an outlier detection algorithm
  2. A high number of samples per site (= 5) with a perfect outlier detection algorithm

in the case where the total number of samples is kept fixed. I was thinking of something like 1x50 vs 5x10 for this.

We are going to compare these two methods as we vary (i) the dispersion parameter kappa per site (from 10 to 1000, log scale) and (ii) the latitude (from 0 to 90). Here we want to compare the difference in MSE between 1 and 2 as we move (i) and (ii), possibly in a 2D colormap plot (see the sketch below).
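
A minimal sketch of the parameter grid described above (the MSE values themselves would come from the simulation code; the strategy labels follow the list above):

import numpy as np

kappas = np.logspace(1, 3, 20)        # within-site kappa from 10 to 1000, log scale
latitudes = np.linspace(0, 90, 19)    # site latitude from 0 to 90 degrees

# for each (kappa, latitude) pair, run strategy 1 (1x50, no outlier detection)
# and strategy 2 (5x10, perfect outlier detection), store mse_1 - mse_2 in a
# 2D array over this grid, and plot it as a colormap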

Thoughts on this @LenGallo @Swanson-Hysell @duserzym @matdomeier @bramvaes?

Make RMSE for VGP scatter its own figure

I think it would be more effective to make the RMSE for VGP scatter its own figure with the same overlay annotations (the black contours and the dashed white total sample numbers). It is hard to read when it is as small as it is.

In addition, while I understand that perceptually uniform color maps are in favor, they actually make it rather hard to tell the difference between numbers, which makes the contours all the more important.

Extend down to sample per site = 1 in n vs N plot

I realize that this figure is a placeholder:

[Screenshot of the placeholder n vs N figure]

But it makes sense to extend down to n = 1 (i.e., one sample per site) for every number of sites, rather than having a lower bound on the total number of samples as there appears to be here.

crop x-axis (samples per site) on Figure 1 panels

For both of these plots, it seems a bit silly to have the x-axis go all the way to 40. I don't know of any paleomagnetic study in the literature that has developed that many samples per site. Given that the improvement with the number of samples per site asymptotes well before then, not much information is being communicated. While it is interesting to see the asymptotic behavior, and also the improvement in the range of 1 to 5, having the x_max be 15 or 20 would be better.

Add `make figures` to Makefile

Add an action to the Makefile to automatically create all figures. The sequence of rules should be organized as follows:

figures : fig1 fig2a ...
fig1 : figures/figure1/fig1.png
figures/figure1/fig1.png : outputs/fig1a..
     <execute notebook>

We probably want to use a pattern rule to generate all the actions for the different figures with the same command.

interactive widgets

In order to make the calculations broadly accessible, it would be valuable to develop interactive widgets for calculating expected accuracy given varied parameters.
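
A minimal sketch with ipywidgets; the parameter names and the body of the callback are placeholders, not the actual smpsite calculation:

from ipywidgets import interact

def show_expected_accuracy(N_sites=10, n_samples_site=4, kappa_insite=100):
    # placeholder: run the simulation/estimation here and display the expected accuracy
    print(f"N_sites={N_sites}, n_samples_site={n_samples_site}, kappa={kappa_insite}")

interact(show_expected_accuracy,
         N_sites=(5, 100, 5),
         n_samples_site=(1, 10, 1),
         kappa_insite=(10, 1000, 10))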

applying vandamme cutoff on the one-sample-per-site strategy

Hi all,

As implemented now, the 'perfect outlier detection algorithm' detects outliers in the directions within the sites. This makes sense because one argument favouring taking several samples per site is that outliers can be spotted and discarded accordingly (there should be at least 3 samples...). When we compare any given strategy to one that has more than one sample per site, the latter should have ignore_outliers=True.

It is also fair, and sensible, to permit the 'one-sample-per-site' strategy to get rid of outliers as well, but in this case the outliers would be detected against the population of VGPs. This is what the Vandamme cutoff does.

Other cutoffs could be applied (there is an in-depth discussion in Gerritsen's paper), but I think Vandamme is just fine.

Looking forward to seeing the outcome! I have quickly gone through the simulations and I can say that the 'one-sample-per-site' strategy plus Vandamme always wins!
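
For reference, a rough sketch of the Vandamme (1994) variable cutoff as I understand it (to be checked against the original paper), assuming vgps is an (n, 3) array of unit vectors:

import numpy as np

def vandamme_cutoff(vgps):
    vgps = np.asarray(vgps, dtype=float)
    while True:
        # Fisher mean pole of the current VGP population
        pole = vgps.sum(axis=0)
        pole /= np.linalg.norm(pole)
        # angular distance of each VGP from the mean pole, in degrees
        delta = np.degrees(np.arccos(np.clip(vgps @ pole, -1.0, 1.0)))
        S = np.sqrt(np.sum(delta**2) / (len(delta) - 1))
        A = 1.8 * S + 5.0                     # variable cutoff angle
        if delta.max() <= A:
            return vgps, A                    # all VGPs fall within the cutoff
        vgps = np.delete(vgps, np.argmax(delta), axis=0)  # drop the farthest VGP and iterate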

Pmagpy in environment file

When exporting a conda environment that has pmagpy installed inside it, pmagpy doesn't show up in the environment.yml file or in the dependencies listed by conda. I wonder if I should manually add it at the end under the pip clause, or if there is a better way of doing this. Maybe exporting the dependencies with pip? How do you usually do this @Swanson-Hysell @duserzym ?

The Contributors

Tracking contributions throughout the project using the all-contributors bot.

Add scatter plot to Figure 1

To illustrate the idea, it would be useful to add a scatter plot of the sampled poles on top of some locations in Figure 1, just to give more context to the figure and make it easier to understand.
