kstoreyf / suave

This project forked from manodeep/corrfunc

The Continuous-Function Estimator for 2-point statistics (trash the bins!)

License: MIT License

suave's Introduction

`suave`: The Continuous-Function Estimator

This is an implementation of the Continuous-Function Estimator, a generalization of the standard (Landy-Szalay) estimator for the two-point correlation function. We call this tool suave, which means "smooth" in Spanish (pronounced swah-beh), because it can produce smooth (continuous) correlation functions. It is built within the Corrfunc package by Manodeep Sinha and Lehman Garrison; check out the full Corrfunc README at the original repo.

The 2-point correlation function measures the clustering of galaxies (or other tracers) as a function of scale. Traditionally, this is done by counting the pairs of galaxies in a given separation bin, and normalizing by the pairs in a uniform random catalog.

The Continuous-Function Estimator eliminates the need for binning, in separation or any other quantity. Instead, it projects the pairs onto any user-defined set of basis functions. The pair counts are replaced by vectors, and the random-random normalization term by a matrix, which together describe the contribution of the pairs to each basis function. The correlation function can then be evaluated directly at any separation, giving a continuous estimate.
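To make the projection idea concrete, here is a minimal plain-Python sketch. It is not suave's implementation or API: every function name is illustrative, the pair separations are synthetic, and it assumes a Landy-Szalay-style combination in which each term is normalized by its number of pairs. With a tophat basis the matrix is diagonal, so the result should match the standard binned estimator in each bin.

```python
import random

def tophat_basis(edges):
    """Return f(r) -> list of K tophat values, one per bin [edges[k], edges[k+1])."""
    def f(r):
        return [1.0 if lo <= r < hi else 0.0 for lo, hi in zip(edges[:-1], edges[1:])]
    return f

def project(seps, basis):
    """Average the basis vector f(r) and its outer product f f^T over all pairs."""
    k = len(basis(seps[0]))
    vec = [0.0] * k
    mat = [[0.0] * k for _ in range(k)]
    for r in seps:
        f = basis(r)
        for i in range(k):
            vec[i] += f[i] / len(seps)
            for j in range(k):
                mat[i][j] += f[i] * f[j] / len(seps)
    return vec, mat

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda i: abs(aug[i][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for i in range(col + 1, n):
            factor = aug[i][col] / aug[col][col]
            for j in range(col, n + 1):
                aug[i][j] -= factor * aug[col][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def continuous_xi(dd, dr, rr, basis):
    """Return xi(r) from amplitudes a = T_RR^-1 (v_DD - 2 v_DR + v_RR) (assumed form)."""
    v_dd, _ = project(dd, basis)
    v_dr, _ = project(dr, basis)
    v_rr, t_rr = project(rr, basis)
    rhs = [d - 2.0 * x + r for d, x, r in zip(v_dd, v_dr, v_rr)]
    amps = solve(t_rr, rhs)
    return lambda r: sum(a * f for a, f in zip(amps, basis(r)))

# Synthetic separations stand in for real DD, DR and RR pairs.
rng = random.Random(0)
edges = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
dd = [rng.uniform(0, 10) for _ in range(400)]
dr = [rng.uniform(0, 10) for _ in range(400)]
rr = [rng.uniform(0, 10) for _ in range(400)]
xi = continuous_xi(dd, dr, rr, tophat_basis(edges))
print([round(xi(c), 3) for c in (1.0, 3.0, 5.0, 7.0, 9.0)])
```

Swapping tophat_basis for any other function that returns a list of K values per separation leaves the rest of the sketch unchanged, which is the point of the projection.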

An example script for using the estimator is in example_theory.ipynb. The Continuous-Function Estimator is currently implemented in the DD(s, mu) pair counting statistic for both mock and theory data. Currently implemented bases are tophat and piecewise. General r-dependent basis functions can be read in from a file; helper routines for these include spline basis functions of any order and a baryon acoustic oscillation fitting function.

The paper presenting this method can be found at https://arxiv.org/abs/2011.01836 (Storey-Fisher & Hogg, accepted to ApJ). Feel free to email the author with any comments or questions, or submit an issue.

Installation

Pre-requisites

Suave has most of the same pre-reqs as Corrfunc, as well as a couple more:

make >= 3.80
An OpenMP-capable compiler such as icc, gcc >= 4.6, or clang >= 3.7. You should already have a system install, but on Mac/Linux you can install gcc with conda install gcc.
gsl >= 2.4. Use either conda install -c conda-forge gsl (Mac/Linux) or (sudo) port install gsl (Mac) to install gsl if necessary.
python >= 2.7 or python >= 3.4 for compiling the C extensions.
numpy >= 1.7 for compiling the C extensions.
scipy >= 1.6 for the spline basis functions for suave (lower versions may work but untested)
colossus >= 1.2 for the BAO basis functions for suave (lower versions may work but untested)
six >= 1.15 (colossus dependency, lower versions may work but untested)

Install with pip

You can install suave via pip. We recommend doing this into a clean conda environment. You can do this and install the dependencies with the following set of commands:

$ conda create -c conda-forge -n suaveenv python gsl
$ conda activate suaveenv
$ pip install suave

Install from source

You should also be able to install from source. Once again you can do this in a clean conda environment:

$ conda create -c conda-forge -n suaveenv python gsl
$ conda activate suaveenv
$ git clone https://github.com/kstoreyf/suave/
$ cd suave
$ make
$ make install
$ pip install . (--user)

Author & Maintainers

The suave package was implemented by Kate Storey-Fisher. It is built within Corrfunc, which was designed by Manodeep Sinha and is currently maintained by Lehman Garrison and Manodeep Sinha.

Citing

If you use or reference suave, please cite the ApJ paper with this bibtex entry (this will be updated once the accepted paper is published):

@misc{storeyfisher2020twopoint,
   title={Two-point statistics without bins: A continuous-function generalization of the correlation function estimator for large-scale structure},
   author={Kate Storey-Fisher and David W. Hogg},
   year={2020},
   eprint={2011.01836},
   archivePrefix={arXiv},
   primaryClass={astro-ph.CO}
}

If you use the code, please additionally cite the original MNRAS Corrfunc code paper with the following bibtex entry:

@ARTICLE{2020MNRAS.491.3022S,
    author = {{Sinha}, Manodeep and {Garrison}, Lehman H.},
    title = "{CORRFUNC - a suite of blazing fast correlation functions on
    the CPU}",
    journal = {\mnras},
    keywords = {methods: numerical, galaxies: general, galaxies:
    haloes, dark matter, large-scale structure of Universe, cosmology:
    theory},
    year = "2020",
    month = "Jan",
    volume = {491},
    number = {2},
    pages = {3022-3041},
    doi = {10.1093/mnras/stz3157},
    adsurl =
    {https://ui.adsabs.harvard.edu/abs/2020MNRAS.491.3022S},
    adsnote = {Provided by the SAO/NASA
    Astrophysics Data System}
}

Finally, if you benefit from the enhanced vectorised kernels in Corrfunc (not currently used in suave but likely used if you're also using out-of-the-box Corrfunc), then please also cite this paper:

@InProceedings{10.1007/978-981-13-7729-7_1,
    author="Sinha, Manodeep and Garrison, Lehman",
    editor="Majumdar, Amit and Arora, Ritu",
    title="CORRFUNC: Blazing Fast Correlation Functions with AVX512F SIMD Intrinsics",
    booktitle="Software Challenges to Exascale Computing",
    year="2019",
    publisher="Springer Singapore",
    address="Singapore",
    pages="3--20",
    isbn="978-981-13-7729-7",
    url={https://doi.org/10.1007/978-981-13-7729-7_1}
}

LICENSE

Suave is released under the MIT license. Basically, do what you want with the code, including using it in commercial applications.

Project URLs

Documentation (http://suave.rtfd.io/)
Source Repository (https://github.com/kstoreyf/suave)
Original Corrfunc Documentation (http://corrfunc.rtfd.io/)
Original Corrfunc Source Repository (https://github.com/manodeep/Corrfunc)

Support

This work was supported by a NASA FINESST grant under award 80NSSC20K1545.

suave's Issues

PYTHON variable issue when making pip-installable

General information

versions: smoothcorrfunc 0.0.0, Corrfunc: 2.3.2
platform: linux
installation method (pip/source/other?): source

Issue description

I am trying to make my package installable via pip. I have edited the setup.py and common.mk files to make a new distribution name and version number. When I upload the distribution to testpypi, it seems fine, but then when I try to install on another system, it hits an error because of the PYTHON variable in common.mk.

I figured out that the setup.py file checks that the python executable that ran the setup script is the same as the one in common.mk, and if it's not, it updates the common.mk one to the python path used. This means that my miniconda path ends up hardcoded, so when I try to install on another system it fails.

Actual behavior

This is the line that updates the PYTHON variable and causes the issue:
python setup.py sdist
where python points to my miniconda python installation, /home/users/ksf293/miniconda3/bin/python3.

Then I do
python -m twine upload --repository testpypi --skip-existing dist/*

The package is successfully uploaded to testpypi here:
https://test.pypi.org/project/smoothcorrfunc/0.0.0
and can be installed with:
pip install -i https://test.pypi.org/simple/ smoothcorrfunc==0.0.0

When I try this on a different system, I get the error:
RuntimeError: command = /home/users/ksf293/miniconda3/bin/python3 -c 'from __future__ import print_function; import sys; print(sys.executable)' failed with stdout = b'' stderr = b'/bin/sh: /home/users/ksf293/miniconda3/bin/python3: No such file or directory\n' status 127

Expected behavior

For now I edited setup.py to not update common.mk, and just set PYTHON:=python in common.mk.

I did this for an updated pip distribution, so doing
pip install -i https://test.pypi.org/simple/ smoothcorrfunc==0.0.1
installs with no errors.

Is there a better solution?
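One possible direction, sketched below, is to treat the recorded PYTHON path as a hint and fall back to the interpreter that is actually running the install when that path does not exist on the current machine. This is only an illustration: resolve_python is a hypothetical helper, not a function in setup.py or Corrfunc, and where such a check would belong in the build is left open.

```python
import os
import sys

def resolve_python(recorded_path):
    """Return recorded_path if it is an executable file on this machine,
    otherwise the interpreter currently running this script."""
    if recorded_path and os.path.isfile(recorded_path) and os.access(recorded_path, os.X_OK):
        return recorded_path
    # The recorded path came from another system (or is empty): use our own interpreter.
    return sys.executable

# The path below is the one from the error message; on other systems it will not exist.
print(resolve_python("/home/users/ksf293/miniconda3/bin/python3"))
```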

code comments

Congrats again on getting the paper submitted! :)

I took a look at the new code and had the following comments. Hope these are useful and understandable.

These lines might be faster if re-written as:

    for(int p=0; p<projdata->nsbins;p++){
        u[p] = (sqr_s >= projdata->supp_sqr[p] && sqr_s < projdata->supp_sqr[p+1]) ? ONE:ZERO;
    }

Regarding this comment in that file, the compile failure is caused by padding bytes that the compiler inserts to keep each member correctly aligned. On typical platforms, a variable must start at a memory address divisible by its width: 4-byte/32-bit integers must begin on addresses divisible by 4, and 8-byte/64-bit integers on addresses divisible by 8. This is referred to as "alignment". Structs and pointers are all 8-byte aligned (for practical purposes). If you look at the definition of the struct extra_options:

struct extra_options
{
    // Two possible weight_structs (at most we will have two loaded sets of particles)
    weight_struct weights0;
    weight_struct weights1;
    weight_method_t weight_method; // the function that will get called to give the weight of a particle pair
    proj_method_t proj_method;
    int nprojbins;
    char *projfn;
    uint8_t reserved[EXTRA_OPTIONS_HEADER_SIZE - 2*sizeof(weight_struct) - sizeof(weight_method_t)
                                              - sizeof(proj_method_t) - sizeof(int) - sizeof(char *)];
};

Since extra_options is a struct, its memory address must be divisible by 8. The first two elements are 8-byte aligned (they are 88 bytes each). weight_method_t and proj_method_t are both enums, i.e., 32-bit ints, so they are 4 bytes each; because there are two of them, the end of proj_method is still divisible by 8. But then int nprojbins appears, so the address at the end of nprojbins is only divisible by 4, whereas char *projfn is a pointer and needs an address divisible by 8. Therefore, the compiler inserts 4 padding bytes between nprojbins and projfn. You can see this as a warning if you enable -Wpadded. You can fix the error by changing nprojbins from int to int64_t and updating the calculation in reserved (sizeof(int) -> sizeof(int64_t)). Afterwards, you should also be able to reset the total size of EXTRA_OPTIONS back to 1024 bytes.
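The padding can be checked without recompiling the C code. The sketch below uses Python's ctypes, which lays out Structure fields with the platform's native alignment. It models only the last four fields of extra_options: the two enums are assumed to be plain C ints, the two weight_struct members are left out, and the class names are made up for this illustration. On a typical 64-bit platform it should report 4 padding bytes for the int version and none for the int64_t version.

```python
import ctypes

class TailWithInt(ctypes.Structure):
    # Hypothetical stand-in for the tail of extra_options, with int nprojbins.
    _fields_ = [
        ("weight_method", ctypes.c_int),  # enum, assumed to be a 32-bit int
        ("proj_method", ctypes.c_int),    # enum, assumed to be a 32-bit int
        ("nprojbins", ctypes.c_int),
        ("projfn", ctypes.c_char_p),      # pointer
    ]

class TailWithInt64(ctypes.Structure):
    # Same fields, but nprojbins widened to a 64-bit integer.
    _fields_ = [
        ("weight_method", ctypes.c_int),
        ("proj_method", ctypes.c_int),
        ("nprojbins", ctypes.c_int64),
        ("projfn", ctypes.c_char_p),
    ]

def padding_bytes(struct):
    """Bytes the layout adds beyond the summed sizes of the fields."""
    return ctypes.sizeof(struct) - sum(ctypes.sizeof(t) for _, t in struct._fields_)

for struct in (TailWithInt, TailWithInt64):
    print(struct.__name__,
          "projfn offset:", struct.projfn.offset,
          "padding:", padding_bytes(struct))
```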

You should also be able to condense these lines to

                const DOUBLE fac = need_weightavg ? pairweight:ONE;
                for(int p1=0;p1<nprojbins;p1++){
                    projpairs[p1] += u[p1]*fac;
                }
                if (need_tensor) { 
                    for(int p1=0;p1<nprojbins;p1++){
                        for(int p2=0;p2<nprojbins;p2++){
                            projpairs_tensor[ p1*nprojbins + p2] += u[p1] * u[p2] * fac;
                        }
                    }
                }

But that might fetch the data twice, so check that the runtime is not affected dramatically. If the runtime changes by a lot, then your original implementation would be better (but replace the if(need_weightavg) with the multiplication by fac as I have done here).

Similarly, these lines can be condensed to:

    if(need_proj) {
      //nsbin is number of edges, want number of bins
      int nsbins = nsbin-1;
      projdata->nsbins = nsbins;
      projdata->supp = supp;
      projdata->supp_sqr = supp_sqr;    
      for(int i=0; i<nprojbins; i++) {
        projpairs[i] = ZERO;
      }
      if(need_tensor) {
        for(int i=0; i<nprojbins*nprojbins; i++) {
          projpairs_tensor[i] = ZERO;
        }
      }
    }

Overall, nothing jumps out as obviously inefficient to me, but profiling will be essential to understand where the hotspots are. To improve performance, the next step would be to add in vectorised kernels for each of the filters.

(P.S. I edited the code directly in this GitHub issue, so it is quite likely to contain translation/implementation bugs.)

Can't set order to 0 in Corrfunc.bases.spline

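from Corrfunc.bases import spline  # import path inferred from the issue title

# rmin, rmax and nbins are assumed to be defined earlier
proj_type = 'generalr'
kwargs = {'order': 0} # requesting order 0 (order 1 would be a linear spline)
projfn = 'quadratic_spline.dat'
nprojbins = int(nbins/1)
spline.write_bases(rmin, rmax, nprojbins, projfn, ncont=1000, **kwargs)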

returns

No order given, defaulting to 1 (linear)
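One plausible cause, offered as a guess because the spline source is not shown here, is a truthiness check on the order argument: in Python, 0 is falsy, so a test like "if not order" cannot tell order=0 apart from a missing value. The two functions below are hypothetical and only illustrate the pitfall and the usual fix of comparing against None.

```python
def resolve_order_truthy(**kwargs):
    """Hypothetical check that mistakes order=0 for 'not given'."""
    order = kwargs.get("order")
    if not order:  # 0 is falsy, so it is handled like a missing value
        print("No order given, defaulting to 1 (linear)")
        order = 1
    return order

def resolve_order_explicit(**kwargs):
    """Same lookup, but only a missing (None) order triggers the default."""
    order = kwargs.get("order")
    if order is None:
        print("No order given, defaulting to 1 (linear)")
        order = 1
    return order

print(resolve_order_truthy(order=0))    # falls back to the default
print(resolve_order_explicit(order=0))  # keeps the requested order
```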

change name of pip import

i have changed the name of this repo to suave to reflect the fact that this is a new estimator with new features, though it is still a fork of the Corrfunc project so i can keep it in sync with upstream changes. i have also created a pip distribution under the name suave.

the issue is, the python subdirectory is still called Corrfunc, so the package must be imported with 'import Corrfunc'. i think there are a few options (cred to @dfm):

1. change all instances of Corrfunc in the repo to suave.
- could then do import suave
- would be annoying to keep repo in sync with the upstream
2. keep distribution name and repo name as suave, keep python directory as Corrfunc
- would then have to do import Corrfunc, and warn users about this
- users could not keep out-of-the-box Corrfunc separate from suave (should be one-way-compatible such that all regular Corrfunc functionality works the same when called from suave, but would be nice to have ability for them to be separate)
3. something fancy with the python directory named suave containing a subdirectory named Corrfunc, and changing the imports within the code to relative imports?
- same issues as 1.

@manodeep, do you have any other ideas, or opinions on which of these would be best? thank you!