DifferentiableUniverseInitiative / flowpm

Particle Mesh simulation in TensorFlow

Home Page: https://flowpm.readthedocs.io/
License: MIT License
The computational bottleneck of a Particle-Mesh simulation is the computation of the Fast Fourier Transform of the 3D mesh. In practice, when distributing over many nodes, inter-node communication is the main limitation.
Here is a breakdown of computation time for the recent Hidden Valley simulations described in Modi et al. 2019.
Even before having a full simulation code ready, we can investigate the scaling of the FFT operation on its own, as it is likely to be the main bottleneck.
Our point of comparison is PFFT (https://github.com/mpip/pfft), a parallel OpenMP+MPI distributed FFT built on FFTW. It is used in FastPM, our optimized C simulation code, and has scaled to half a million MPI ranks.
Here are the particular steps needed to measure scaling:
Because we host the two FlowPM presentations in this repo, it is quite large and slow to clone. We could move these talks elsewhere and erase them from the git history, making the whole repo much more lightweight.
@modichirag do you think there are some hardcoded links somewhere to the slides you presented at the NERSC meeting?
My talk I can move to my personal talks repo.
Travis is currently failing due to the latest TF 2.0 release.
This issue is to track the implementation of weak lensing raytracing within FlowPM. Our goal here is to add to the single GPU implementation of FlowPM the tools necessary to output a convergence lightcone. As I understand it, here are the rough steps that would be required:
In addition, we will want to implement the following tests:
I have created a lensing branch (https://github.com/modichirag/flowpm/tree/lensing) where we can start working on that implementation. Concretely, what I think we will need to do is modify the nbody function here:
https://github.com/modichirag/flowpm/blob/b79ecf1f6aa445ac73aa82d9f4b4cb8df16e0a9d/flowpm/tfpm.py#L311
to support outputting additional fields beyond just the final states.
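As a sketch of the kind of change this could be (all names here are hypothetical, not FlowPM's actual API), the evolution loop would keep copies of the state at requested scale factors:

```python
# Hypothetical sketch, not FlowPM's actual API: an nbody-like loop that
# returns intermediate snapshots in addition to the final state.
# `do_step` stands in for the real kick/drift/force updates.
def nbody_with_snapshots(state, stages, do_step, snapshot_stages=()):
    """Evolve `state` through `stages`, keeping copies at requested stages."""
    snapshots = []
    for a0, a1 in zip(stages[:-1], stages[1:]):
        state = do_step(state, a0, a1)
        if a1 in snapshot_stages:
            snapshots.append((a1, state))
    return state, snapshots
```

The lightcone construction would then paint each saved snapshot onto the appropriate lens plane.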
Because this is the first time I have tried to implement weak lensing ray tracing, I may have an oversimplified understanding of the steps and tests that will be required. @VMBoehm @Maxelee, feel free to comment on this issue if there are particularly tricky things we should be on the lookout for.
And I think for now we don't want to worry about PGD or distributed computation, we will take care of these things afterwards.
The same seed generates different initial conditions in this version of the code than in FastPM, whereas the old version of FlowPM was consistent with FastPM.
The following image shows this:
https://files.slack.com/files-pri/TD19RFPCM-FNUBPPCP7/image.png
I am finding that estimating gradients through the growth ODE is slow, taking ~15 seconds, which is about the same as estimating gradients of a 5-step PM simulation on a 128^3 grid.
This can prove quite cumbersome.
I have attached the code that estimates and compares time for various gradient estimations, including a case without growth ODE to highlight that this is the bottleneck.
If the tests are right, some thoughts:
Note:
time_grads2.py.txt
Not sure how to add code here, so I am attaching it as a txt file. Also, to run it, check the import paths; in particular, the power spectrum code needs to be imported separately since, as far as I know, it's not part of FlowPM yet.
We currently reshape the particle positions, removing the batch dimension, when computing the updates for cic painting and readout. Because of TPU memory padding rules, this leads to heavily inflated tensors, and we run out of memory. Here is a trace showing the problem:
E1111 07:35:00.272763 140408571025152 error_handling.py:81] Closing session due to error From /job:worker/replica:0/task:0:
Compilation failure: Ran out of memory in memory space hbm. Used 41.74G of 16.00G hbm. Exceeded hbm capacity by 25.74G.
Total hbm usage >= 41.74G:
reserved 528.00M
program 41.22G
arguments unknown size
Output size unknown.
Program hbm requirement 41.22G:
reserved 4.0K
global 29.82M
HLO temp 41.19G (24.2% utilization, 0.7% fragmentation (293.84M))
Largest program allocations in hbm:
1. Size: 32.00G
Operator: op_type="FloorMod" op_name="slicewise_79/cic_update/FloorMod"
Shape: s32[8388608,8,4]{2,1,0:T(8,128)}
Unpadded size: 1.00G
Extra memory due to padding: 31.00G (32.0x expansion)
XLA label: %copy.40488 = s32[8388608,8,4]{2,1,0:T(8,128)} copy(s32[8388608,8,4]{0,2,1:T(4,128)} %get-tuple-element.32429), metadata={op_type="FloorMod" op_name="slicewise_79/cic_update/FloorMod"}
Allocation type: HLO temp
==========================
2. Size: 1.00G
Operator: op_type="FloorMod" op_name="slicewise_80/cic_update/FloorMod"
Shape: s32[8388608,8,4]{0,2,1:T(4,128)}
Unpadded size: 1.00G
XLA label: %fusion.935 = (s32[8388608,8,4]{0,2,1:T(4,128)}, s32[8388608,8,4]{0,2,1:T(4,128)}, s32[8388608,8,4]{0,2,1:T(4,128)}, s32[8388608,8,4]{0,2,1:T(4,128)}, s32[8388608,8,4]{0,2,1:T(4,128)}, s32[8388608,8,4]{0,2,1:T(4,128)}, s32[8388608,8,4]{0,2,1:T(4,128)}, s32...
Allocation type: HLO temp
==========================
It's not a big problem; we just need to be more careful about the dimensions of these tensors.
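For intuition, the 32x expansion in allocation 1 is consistent with the (8, 128) tiling of the two minor dimensions shown in the trace: a trailing dimension of 4 gets padded up to 128. A back-of-the-envelope check, assuming that tiling rule:

```python
import math

def padded_bytes(shape, tile=(8, 128), dtype_bytes=4):
    """Rough HBM footprint assuming the two minor dims are padded up to the tile."""
    *lead, d0, d1 = shape
    p0 = math.ceil(d0 / tile[0]) * tile[0]
    p1 = math.ceil(d1 / tile[1]) * tile[1]
    n = dtype_bytes * p0 * p1
    for d in lead:
        n *= d
    return n

unpadded = 8388608 * 8 * 4 * 4            # 1 GiB, matching the trace
padded = padded_bytes([8388608, 8, 4])    # minor dim 4 -> 128: 32x expansion
```

Keeping the trailing dimensions at tile-friendly sizes (or keeping the batch dimension so the large particle axis is not reshaped into a tiny minor dimension) avoids the blow-up.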
We are running into some problems with the current API for the cosmology background computations:
The Cosmology object is just a dictionary; it will not properly compute derived parameters like omega_lambda when omega_k is fixed and you modify omega_m.
The same dictionary is currently used to store cached values, but not in a consistent way, and it's not clear that the caches are correctly propagated.
This is causing problems left and right :-/
This issue is to track the reworking of this cosmology background API to try to solve these problems. We probably want to keep a functional API for most cosmology computations, but we could adopt a tf.Module for the cosmology object.
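A minimal sketch of the direction this could take (names hypothetical; in FlowPM this would likely subclass tf.Module so the parameters are trainable tf.Variables):

```python
# Hypothetical sketch of a reworked cosmology object: derived parameters are
# computed on demand instead of stored, so updating Omega_m stays consistent,
# and the cache is explicit and cleared on every parameter update.
class Cosmology:
    def __init__(self, Omega_m=0.3, Omega_k=0.0, h=0.7):
        self.Omega_m = Omega_m
        self.Omega_k = Omega_k
        self.h = h
        self._cache = {}  # explicit cache, separate from the parameters

    @property
    def Omega_lambda(self):
        # Derived from the flatness constraint: never stored, never stale.
        return 1.0 - self.Omega_m - self.Omega_k

    def update(self, **params):
        """Update parameters and invalidate any cached computations."""
        self._cache.clear()
        for name, value in params.items():
            setattr(self, name, value)
```

The functional cosmology computations would then take such an object as their first argument.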
Our code base is a fine mess currently in terms of code style ^^'
I suggest we adopt the TensorFlow code conventions for uniformity. Here is a quick link to the documentation and style document for TF: https://www.tensorflow.org/community/contribute/code_style
As far as I can see, I don't think the current implementation trivially accommodates a generic catalog of positions, randomly shuffled, for painting and readout.
Currently mesh_utils.cic_paint takes a tensor part holding particle positions of shape [batch, nx, ny, nz, 3].
A generic paint or readout function needs [batch, npart, 3], along with associated functions to determine which rank each particle belongs to.
This needs to be implemented.
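For reference, a minimal single-process version of painting from a [batch, npart, 3] tensor might look like the NumPy sketch below (illustrative only; the distributed version additionally has to route each particle to the rank owning its slab):

```python
import numpy as np

def cic_paint(positions, nc):
    """Naive single-process CIC painting of positions of shape [batch, npart, 3]
    onto a periodic nc^3 mesh. Each particle deposits trilinear weights onto
    its 8 neighboring cells."""
    batch, npart, _ = positions.shape
    mesh = np.zeros((batch, nc, nc, nc), dtype=np.float32)
    floor = np.floor(positions).astype(np.int32)
    frac = positions - floor
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                # Trilinear weight for this corner of the cell
                w = (np.where(dx, frac[..., 0], 1 - frac[..., 0])
                     * np.where(dy, frac[..., 1], 1 - frac[..., 1])
                     * np.where(dz, frac[..., 2], 1 - frac[..., 2]))
                idx = (floor + [dx, dy, dz]) % nc  # periodic wrap
                for b in range(batch):
                    np.add.at(mesh[b],
                              (idx[b, :, 0], idx[b, :, 1], idx[b, :, 2]), w[b])
    return mesh
```

The weights for each particle sum to one, so painting conserves total mass regardless of how the particles are shuffled.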
Currently, we don't have any tools to export a simulation snapshot in a standard format. This issue is to track the addition of a few IO tools to export the output of a simulation as a standard FastPM file, which can then be read with various tools.
Note that this is also linked to #26: it'd be nice if, ultimately, the I/O we implement were compatible with a distributed setting.
We implemented a new test to validate our raytracing implementation.
Instead of running the N-body simulation, the snapshots at different scale factors are generated by using the same linear power spectrum at the scale factor a=0.5229969 (with the order set to 1).
The following image shows a comparison between the Jax-cosmo linear power spectrum (dashed line) and the 3D power spectrum computed for each snapshot:
In particular, we adopted the following settings:
The specific settings can be found here.
Below are the results of 4 setting combinations compared to the Jax-cosmo linear power spectrum.
The current prototype for the Mesh TensorFlow implementation of FlowPM is using B=1, i.e. a mesh resolution identical to the resolution of the initial particle grid.
This issue is to track the implementation of B>1, which can greatly improve the accuracy of the final field.
This issue is to track the work on benchmarking the new Horovod backend for GPU clusters and getting profiling information for FlowPM.
We want to do the following things:
To learn how to do this profiling, keep an eye on DifferentiableUniverseInitiative/IDRIS-hackathon#2
For now, the raytracing code we have developed with @dlanzieri is based on the non-distributed version of FlowPM. I'm opening this issue to document the porting of the necessary pieces to Mesh Tensorflow.
Here is the list of things I think we need:
Things that are optional at this stage but would be nice to have
Hi, thank you for your effort to utilize TF. Does your code perform better on GPU than on CPU?
If so, how should I run the code on GPUs on a cluster without a Slurm management system?
Our current implementation for the Mesh TensorFlow version is only doing 1LPT to evolve initial conditions down to a0. There is no particular difficulty in implementing 2LPT, I just didn't get around to it.
Here is where this functionality should go:
https://github.com/modichirag/flowpm/blob/3582e807fd3195dd20f7513ddd2abc8b28ac50e9/flowpm/mtfpm.py#L96
And here was our TensorFlow only implementation:
https://github.com/modichirag/flowpm/blob/9c867c506075938ebecedc2955eb2c992b534bf3/flowpm/tfpm.py#L159
@dlanzieri noticed that in some situations interp_tf can return NaN values. @dlanzieri can you find a MWE that reproduces your problem?
5-step PM runs with reconstruction are failing, even on 2 nodes, i.e. 16 GPUs.
This is surprising given that the following configurations are working well enough:
This seems to be a combination of OOM and communication issues. It effectively limits our reconstruction capability on GPU, since going to a larger number of nodes is going to be prohibitively slow in terms of lowering the graph, and the situation is likely to be worse for 512^3 and higher meshes.
Again, maybe it will not be so with SIMD placement and/or on TPUs, but it will still be good to understand this issue in the first place.
Attached is the log generated by the run.
To implement Gaussian smoothing and other transfer functions in Fourier space, we will need to Fourier transform, multiply with a kernel in Fourier space, and then transform back. This seems to not work as expected. For example, multiplying by a Gaussian smoothing kernel seems to change both the transfer function and the cross-correlation with the original field, even on large scales.
Example code to reproduce this is below:
import numpy as np
import tensorflow as tf  # TF1-style API (placeholders / sessions)
import matplotlib.pyplot as plt
import flowpm
import tools  # from examples/utils/tools.py on the <recon> branch; adjust path as needed

bs, nc = 200, 128
R = 1
dtype, cdtype = np.float32, np.complex64

# Get a Gaussian smoothing kernel
kk = flowpm.kernels.fftk([nc, nc, nc], symmetric=False)
kmesh = sum(ki**2 for ki in kk)**0.5
kmesh = np.expand_dims(np.expand_dims(kmesh, -1), 0)
kwts = np.exp(-0.5*kmesh**2*R**2).astype(cdtype)

tmp = np.random.normal(size=nc**3).reshape(nc, nc, nc).astype(dtype)

# First with TensorFlow
tf.reset_default_graph()
x = tf.placeholder(tf.float32, [None, None, None, None, 1])
x1f = tf.signal.fft3d(tf.cast(x, cdtype))
x1f = tf.multiply(x1f, kwts)
x1 = tf.cast(tf.signal.ifft3d(x1f), dtype)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    predtf = sess.run(tf.squeeze(x1), {x: tmp.reshape(1, nc, nc, nc, 1)})

# Now with numpy
x2f = np.fft.fftn(tmp)
x2f = x2f * np.squeeze(kwts)
predpy = np.fft.ifftn(x2f).astype(dtype)

# Plot and see; power() is in the <recon> branch, examples/utils/tools.py
k, p1 = tools.power(predpy, boxsize=bs)
k, p2 = tools.power(tmp, boxsize=bs)
k, p12x = tools.power(predpy, f2=tmp, boxsize=bs)
plt.plot(k, (p1/p2)**0.5, 'C0', label='Transfer Py')
plt.plot(k, p12x/(p2*p1)**0.5, 'C1', label='xcorrelation Py')

k, p1 = tools.power(predtf, boxsize=bs)
k, p2 = tools.power(tmp, boxsize=bs)
k, p12x = tools.power(predtf, f2=tmp, boxsize=bs)
plt.plot(k, (p1/p2)**0.5, 'C0--', label='Transfer TF')
plt.plot(k, p12x/(p2*p1)**0.5, 'C1--', label='xcorrelation TF')

plt.legend()
plt.semilogx()
plt.grid(which='both')
plt.ylim(0.5, 1.25)
When eventually scaling up mesh computations to very large volumes, we need to output the laid-out tensors to a file system.
For Cori and similar systems, we will need to have each process output its own slice of the data. Not sure yet what the best approach is.
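As a strawman (the file layout here is hypothetical, not a FastPM format), the simplest version is one file per process, reassembled on read:

```python
import os
import numpy as np

def write_slab(outdir, rank, slab):
    """Each process writes only its own slab; no gather onto a single node.
    Hypothetical layout: one .npy file per rank, reassembled by readers."""
    np.save(os.path.join(outdir, "field.%05d.npy" % rank), slab)

def read_field(outdir, nranks, axis=0):
    """Reassemble the full field by concatenating the per-rank slabs."""
    slabs = [np.load(os.path.join(outdir, "field.%05d.npy" % r))
             for r in range(nranks)]
    return np.concatenate(slabs, axis=axis)
```

A real implementation would more likely target MPI-IO or HDF5 with a collective write into a single file, which is closer to what FastPM-compatible readers expect.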
This issue is to track the development of a benchmark script to test the scaling of a distributed FFT in Mesh TensorFlow under different environments.
I have a prototype script for Cori in https://github.com/modichirag/flowpm/tree/mesh/scripts
But there is a lot of room for improvement, in particular:
Help is most welcome :-)
The flowpm_blog.ipynb file, which is linked from the main README, still uses TensorFlow 1.x rather than 2.6. I'm not sure what the highest 2.x flowpm has been tested on so far...
Hi @modichirag, I compared the matter power spectrum computed with and without PGD. I'd like to know what you think about these results!
Initially I used the kl, ks, mu, and alpha0 parameters obtained by running this notebook, with the following settings for the N-body simulation:
nc = 128
field_size = 5.
nsteps=40
B=2
This is what the power spectrum looks like for all the snapshots:
Because the alpha parameter we used for the PGD is scale-factor dependent, according to the relation
alpha = alpha0 * a**mu,
I tried to see how the amplitude of the power spectrum at small scales changes as a function of the alpha0 parameter:
I see that the power spectra that best approximate the analytical one are those obtained with alpha0=0.008, 0.009, and 0.01; for larger and smaller values they start to lose power.
This is what the power spectrum looks like for all the snapshots using alpha0=0.01:
Now, I think my scale translation is probably not right. The kl, ks obtained with Vanessa's notebook should be in h Mpc^(-1), but I'm not sure.
So, still using the alpha0 obtained from the notebook (alpha0=0.00269), I tried different scale translations:
kl_array = [kl*0.7*0.5/(nc*B/box_size), kl*0.5/(nc*B/box_size), kl*2/(nc*B/box_size), kl*2*0.7/(nc*B/box_size)]
ks_array = [ks*0.7*0.5/(nc*B/box_size), ks*0.5/(nc*B/box_size), ks*2/(nc*B/box_size), ks*2*0.7/(nc*B/box_size)]
Here are the peaks of the PGD kernel:
and the power spectrum of only one snapshot:
From this plot it looks like the right scaling for kl and ks is obtained by multiplying them by h and 0.5 (this depends on how we defined k).
Assuming this is the best rescaling, I used it to look again at how the power spectrum at small scales changes as a function of alpha0.
And alpha0=0.01 still seems like the best choice.
This issue is to track the development of a distributed version of our FlowPM lensing implementation.
We essentially need to create a distributed version of these two functions:
The first one should be easily adaptable from the distributed versions of cic_paint
which can be found here:
Line 125 in 9145d0c
The second one is a tiny bit more sophisticated, because it will involve a distributed interpolation operation, and this might require some thinking.
Currently in the distance computations, we have a very naive implementation using the BDF ODE integrator and an integrand that is not under a tf.function. As a result, there is a lot of retracing that happens for simple distance computations unless the entire call is under an external tf.function.
This wouldn't affect performance in an actual simulation script, but this is quite inconvenient in practice.
So, it'd be nice to rewrite rad_comoving_distance so that parts of it can be under a @tf.function and avoid unnecessary retracing when called eagerly.
In this notebook I try to find the correction terms to compensate for the PM approximations (like PGD in the n-body), following what has been done in this notebook.
The idea was to use the same correction parameters found with the JAX neural network, but to use the TF neural network to inject them.
To do that, I followed this example. The problem is that the Dormand-Prince solver for the neural ODE takes forever.
@EiffL any idea of what could have gone wrong?
So now any PR or issue should pop up in Slack, for easy notification.
Something happened with conda, and our tests no longer run automatically in GitHub Actions; this needs to be fixed!
We currently have tests to evaluate the TensorFlow version of FlowPM against our reference FastPM simulation code. They run automatically on Travis CI.
The problem now is that we want to be doing the same thing but with the Mesh implementation, which requires running the ops on a TensorFlow cluster.
We need to figure out how to run those tests automatically. The most likely answer will be to spawn a TF cluster on Travis, with like 4 CPU processes, and run the tests by connecting to this local cluster.
There are some caveats with that approach though, because as we have already seen, some ops behave differently on CPU and GPU, so...
In our current Mesh FlowPM, readout and painting are implemented as a series of slicewise and shift operations, not as intrinsic Mesh Ops. The problem is that to decide how much shifting needs to happen, you need to know about the particular mesh distribution. Right now, we are doing that by providing the split dims and number of splits directly to these operations, as can be seen here:
https://github.com/modichirag/flowpm/blob/3582e807fd3195dd20f7513ddd2abc8b28ac50e9/flowpm/mesh_utils.py#L113
One solution, adopted for instance by the block convolutions in Mesh TensorFlow is to include this information about the mesh by having specific dimensions indicating the topology of the mesh. i.e. a 2d feature map would have shape [batch_size, n_split_x, n_split_y, block_size_x, block_size_y, nfeatures]
That works, but I'm not a super fan of it. Another solution is to implement painting and readout ops as Mesh Ops, which have access to all the info they need about the specific tensor layout.
As noted in #14 distributed FFTs are slow and costly. In an effort to make distributed FlowPM more computationally efficient we are investigating the possibility of implementing a multi-resolution algorithm, where the distributed FFT is limited to handling the large scale modes, and we compute the short range interactions through a local FFT only.
Instead of going through the usual route of painting and processing independently separate grids at different resolutions, we are being much smarter and using a multiresolution pyramid of the density field to separate small and large scales, which can then be processed independently before being merged back together.
Experiments can be found here: https://github.com/modichirag/flowpm/blob/multigrid/notebooks/flowpm-MultiGrid.ipynb
Here are the necessary steps:
- Make sure scale separation and combination works in the non-distributed case, in particular with respect to periodicity of the large scales
- Implement pyramid decomposition in Mesh TensorFlow
- Implement force computation through slicewise FFT in Mesh TensorFlow
- Implement force combination and readout
- Profit! aka measure speed up
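The first step can be prototyped in plain NumPy; a toy version of the scale separation (a simple Fourier-space Gaussian low-pass, not the actual pyramid code) looks like:

```python
import numpy as np

def split_scales(field, R=4.0):
    """Split a periodic field into low-k and high-k parts with a Gaussian
    low-pass in Fourier space; summing the parts recovers the field exactly."""
    freqs = [np.fft.fftfreq(n) * 2 * np.pi for n in field.shape]
    k2 = sum(ki**2 for ki in np.meshgrid(*freqs, indexing="ij"))
    lowpass = np.exp(-0.5 * k2 * R**2)
    low = np.fft.ifftn(np.fft.fftn(field) * lowpass).real
    return low, field - low
```

In the distributed setting the low-k part would go through the distributed FFT while the high-k residual is handled by local FFTs only.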
For some reason, when increasing nc to 64, the field I get out of the nbody sim is equal to 1 everywhere. @modichirag this sounds very much like the issue I had on colab recently, you fixed it I don't know how, do you remember that?
Hello
The time for lowering operation for reconstruction graph with N-Body/PM forward model is too high.
It is also high for LPT forward model, but not unreasonably so.
The reconstruction graph is the same as the forward model graph with two extra computations: the gradient with respect to the linear field and the update operation for the linear field.
As of commit modichirag@5c3fe39, it can be found in the file https://github.com/modichirag/flowpm/blob/recon/examples/pyramid_recon.py (in case future commits change the graph).
For timing comparison:
Config: 1 node, 8 GPUs on cori with nx*ny=4x2 config of mesh, 64^3 PM grid, 5 step PM model and pyramid scheme
This issue is important, since the slow lowering makes it hard to experiment with N-body models for reconstruction.
As described in our first round of benchmarks, the very naive FFT op we implemented seems to be quite inefficient both on TPU and on Cori-GPU.
See here: https://github.com/modichirag/flowpm/issues/14#issuecomment-552263768
I'm opening this issue to track the development of a better FFT transform, with these two goals:
For reference our current strategy is implemented here:
https://github.com/modichirag/flowpm/blob/ca8eadb849fd688c28f9b63eb205378cac4f7862/flowpm/mesh_ops.py#L31
And the basic idea is to implement a 3D FFT by performing 1D transforms along the last array dimension, three times, transposing the array to bring a new dimension to the last position each time. Something like this:
for d in range(3):
    x = mtf.slicewise(tf.signal.ifft, [x],
                      output_dtype=tf.complex64,
                      splittable_dims=x.shape[:-1])
    x = mtf.transpose(x, new_shape=outer_dims + [y_dim, z_dim, x_dim])
    x = mtf.replace_dimensions(x, [y_dim, z_dim, x_dim], [x_dim, y_dim, z_dim])
The mtf.transpose is transposing the array, and mtf.replace_dimensions is there to reshape the array so that the last dimension is not split. These two steps could very probably be merged, but unless I'm missing something, not without implementing a specialized op.
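The transpose-then-transform idea is easy to sanity-check in single-process NumPy, without any of the Mesh TensorFlow layout concerns:

```python
import numpy as np

def fft3d_by_1d(x):
    """3D FFT via three 1D FFTs along the last axis, transposing to bring a
    new axis into the last position each time (the same idea as the Mesh
    TensorFlow loop above, minus the distributed layout bookkeeping)."""
    for _ in range(3):
        x = np.fft.fft(x, axis=-1)     # 1D transform along the last axis
        x = np.transpose(x, (2, 0, 1))  # rotate a fresh axis into last place
    # After three rotations the axes are back in their original order,
    # and each axis has been transformed exactly once.
    return x
```

Since 1D FFTs along distinct axes commute, the result matches np.fft.fftn.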
This issue is to document the process of updating most of FlowPM to full TF2. The Mesh TensorFlow parts still rely on the tf1 API but we already know that we need tf 2.1 to support large tensors with many dimensions.
One of the obvious things we want to be able to do with FlowPM is to get gradients with respect to cosmology. Currently, we are relying on the background.py file from the python implementation of FastPM, so we can't backpropagate with respect to cosmological parameters.
The goal of this issue is to add a pure TensorFlow implementation of cosmology computations that will allow us to replace all the calls to background.py. I have created a tfcosmo branch (https://github.com/modichirag/flowpm/tree/tfcosmo) to work on this implementation.
I have made a small list of things we want to be able to compute in pure tensorflow:
Other functions used in the nbody code can easily be derived once we have this in place.
Very concretely, we can start building a tfbackground.py file and add these features as we go, until we can replace the background.py file altogether.
For "inspiration" we can reuse code from the jax-cosmo project, where all of this is implemented, but alas in Jax and not TensorFlow : https://github.com/DifferentiableUniverseInitiative/jax_cosmo/blob/master/jax_cosmo/background.py
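To make the list concrete, here is a NumPy sketch of two of these quantities (the TF version would simply swap np for tf ops so that gradients flow to the cosmological parameters; formulas assume w = -1):

```python
import numpy as np

def E(a, Omega_m=0.3, Omega_k=0.0):
    """Dimensionless Hubble rate E(a) = H(a)/H0 for a LambdaCDM (w = -1) cosmology."""
    Omega_lambda = 1.0 - Omega_m - Omega_k
    return np.sqrt(Omega_m * a**-3 + Omega_k * a**-2 + Omega_lambda)

def rad_comoving_distance(a, nsteps=2048, **cosmo):
    """chi(a) = integral_a^1 da' / (a'^2 E(a')), in units of c/H0.

    Trapezoid rule on a fixed grid, as a stand-in for a proper ODE solve.
    """
    ap = np.linspace(a, 1.0, nsteps)
    f = 1.0 / (ap**2 * E(ap, **cosmo))
    return np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(ap))
```

Multiplying by c/H0 (~2997.9/h Mpc) converts the distance to physical units; the growth factor would follow the same pattern with its ODE.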
I'm happy to start looking at this with @dlanzieri
We currently have 0 documentation. This is bad. This issue is to track the deployment of a readthedocs documentation, and discussing the content we need.
I'm happy to start putting in place the tools for a first pass at a doc and then we can iterate here further.