DifferentiableUniverseInitiative / flowpm

Particle Mesh simulation in TensorFlow

Home Page: https://flowpm.readthedocs.io/
License: MIT License
The computational bottleneck of a Particle-Mesh simulation is the computation of the Fast Fourier Transform of the 3D mesh. In practice, when distributing over many nodes, inter-node communication is the main limitation.
Here is a breakdown of computation time for the recent Hidden Valley simulations described in Modi et al. 2019.
Even before having a full simulation code ready, we can investigate the scaling of the FFT operation on its own, as it is likely to be the main bottleneck.
Our point of comparison is PFFT (https://github.com/mpip/pfft), a parallel OpenMP+MPI distributed FFT built on FFTW. It is used in FastPM, our optimized C simulation code, and has scaled to half a million MPI ranks.
Here are the particular steps needed to measure scaling:
Because we host the two FlowPM presentations in this repo, it is quite large and slow to clone. We could move these talks elsewhere and erase them from the git history, making the whole repo much more lightweight.
@modichirag do you think there are some hardcoded links somewhere to the slides you presented at the NERSC meeting?
My talk I can move to my personal talks repo.
Travis is currently failing due to the latest TF 2.0 release.
This issue is to track the implementation of weak lensing raytracing within FlowPM. Our goal here is to add to the single GPU implementation of FlowPM the tools necessary to output a convergence lightcone. As I understand it, here are the rough steps that would be required:
In addition, we will want to implement the following tests:
I have created a lensing branch (https://github.com/modichirag/flowpm/tree/lensing) where we can start working on that implementation. Concretely, what I think we will need to do is modify the nbody function here:
https://github.com/modichirag/flowpm/blob/b79ecf1f6aa445ac73aa82d9f4b4cb8df16e0a9d/flowpm/tfpm.py#L311
to support outputting additional fields beyond just the final states.
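As a sketch of the kind of change this could be (all names here are hypothetical, not FlowPM's actual API), the evolution loop would keep copies of the state at requested scale factors:

```python
# Hypothetical sketch, not FlowPM's actual API: an nbody-like loop that
# returns intermediate snapshots in addition to the final state.
# `do_step` stands in for the real kick/drift/force updates.
def nbody_with_snapshots(state, stages, do_step, snapshot_stages=()):
    """Evolve `state` through `stages`, keeping copies at requested stages."""
    snapshots = []
    for a0, a1 in zip(stages[:-1], stages[1:]):
        state = do_step(state, a0, a1)
        if a1 in snapshot_stages:
            snapshots.append((a1, state))
    return state, snapshots
```

The lightcone construction would then paint each saved snapshot onto the appropriate lens plane.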
Because this is the first time I have tried to implement weak lensing ray tracing, I may have an oversimplified understanding of the steps and tests that will be required. @VMBoehm @Maxelee, feel free to comment on this issue if there are particularly tricky things we should be on the lookout for.
And I think for now we don't want to worry about PGD or distributed computation, we will take care of these things afterwards.
The same seed generates different initial conditions in this version of the code than in FastPM, whereas the old version of FlowPM was consistent with FastPM.
The following image shows this:
https://files.slack.com/files-pri/TD19RFPCM-FNUBPPCP7/image.png
I am finding that estimating gradients through the growth ODE is slow, taking ~15 seconds, which is about the same as estimating gradients of a 5-step PM simulation on a 128^3 grid.
This can prove quite cumbersome.
I have attached the code that estimates and compares time for various gradient estimations, including a case without growth ODE to highlight that this is the bottleneck.
If the tests are right, some thoughts:
Note:
time_grads2.py.txt
Not sure how to add code here, so I am attaching it as a txt file. Also, to run it, check the import paths; in particular, the power spectrum code needs to be imported separately since, as far as I know, it's not part of FlowPM yet.
We currently reshape the particle positions, removing the batch dimension, when computing the updates for cic painting and readout. Because of TPU memory padding rules, this leads to heavily inflated tensors, and we run out of memory. Here is a trace showing the problem:
E1111 07:35:00.272763 140408571025152 error_handling.py:81] Closing session due to error From /job:worker/replica:0/task:0:
Compilation failure: Ran out of memory in memory space hbm. Used 41.74G of 16.00G hbm. Exceeded hbm capacity by 25.74G.
Total hbm usage >= 41.74G:
reserved 528.00M
program 41.22G
arguments unknown size
Output size unknown.
Program hbm requirement 41.22G:
reserved 4.0K
global 29.82M
HLO temp 41.19G (24.2% utilization, 0.7% fragmentation (293.84M))
Largest program allocations in hbm:
1. Size: 32.00G
Operator: op_type="FloorMod" op_name="slicewise_79/cic_update/FloorMod"
Shape: s32[8388608,8,4]{2,1,0:T(8,128)}
Unpadded size: 1.00G
Extra memory due to padding: 31.00G (32.0x expansion)
XLA label: %copy.40488 = s32[8388608,8,4]{2,1,0:T(8,128)} copy(s32[8388608,8,4]{0,2,1:T(4,128)} %get-tuple-element.32429), metadata={op_type="FloorMod" op_name="slicewise_79/cic_update/FloorMod"}
Allocation type: HLO temp
==========================
2. Size: 1.00G
Operator: op_type="FloorMod" op_name="slicewise_80/cic_update/FloorMod"
Shape: s32[8388608,8,4]{0,2,1:T(4,128)}
Unpadded size: 1.00G
XLA label: %fusion.935 = (s32[8388608,8,4]{0,2,1:T(4,128)}, s32[8388608,8,4]{0,2,1:T(4,128)}, s32[8388608,8,4]{0,2,1:T(4,128)}, s32[8388608,8,4]{0,2,1:T(4,128)}, s32[8388608,8,4]{0,2,1:T(4,128)}, s32[8388608,8,4]{0,2,1:T(4,128)}, s32[8388608,8,4]{0,2,1:T(4,128)}, s32...
Allocation type: HLO temp
==========================
It's not a big problem; we just need to be more careful about the dimensions of these tensors.
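For intuition, the 32x expansion in allocation 1 is consistent with the (8, 128) tiling of the two minor dimensions shown in the trace: a trailing dimension of 4 gets padded up to 128. A back-of-the-envelope check, assuming that tiling rule:

```python
import math

def padded_bytes(shape, tile=(8, 128), dtype_bytes=4):
    """Rough HBM footprint assuming the two minor dims are padded up to the tile."""
    *lead, d0, d1 = shape
    p0 = math.ceil(d0 / tile[0]) * tile[0]
    p1 = math.ceil(d1 / tile[1]) * tile[1]
    n = dtype_bytes * p0 * p1
    for d in lead:
        n *= d
    return n

unpadded = 8388608 * 8 * 4 * 4            # 1 GiB, matching the trace
padded = padded_bytes([8388608, 8, 4])    # minor dim 4 -> 128: 32x expansion
```

Keeping the trailing dimensions at tile-friendly sizes (or keeping the batch dimension so the large particle axis is not reshaped into a tiny minor dimension) avoids the blow-up.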
We are running into some problems with the current API for the cosmology background computations:
The Cosmology object is just a dictionary; it will not properly compute derived parameters like omega_lambda when omega_k is fixed and you modify omega_m.
The same dictionary is currently used to store cached values, but not in a consistent way, and it's not clear that the caches are correctly propagated.
This is causing problems left and right :-/
This issue is to track the reworking of this cosmology background API to try to solve these problems. We probably want to keep a functional API for most cosmology computations, but we could adopt a tf.Module for the cosmology object.
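A minimal sketch of the direction this could take (names hypothetical; in FlowPM this would likely subclass tf.Module so the parameters are trainable tf.Variables):

```python
# Hypothetical sketch of a reworked cosmology object: derived parameters are
# computed on demand instead of stored, so updating Omega_m stays consistent,
# and the cache is explicit and cleared on every parameter update.
class Cosmology:
    def __init__(self, Omega_m=0.3, Omega_k=0.0, h=0.7):
        self.Omega_m = Omega_m
        self.Omega_k = Omega_k
        self.h = h
        self._cache = {}  # explicit cache, separate from the parameters

    @property
    def Omega_lambda(self):
        # Derived from the flatness constraint: never stored, never stale.
        return 1.0 - self.Omega_m - self.Omega_k

    def update(self, **params):
        """Update parameters and invalidate any cached computations."""
        self._cache.clear()
        for name, value in params.items():
            setattr(self, name, value)
```

The functional cosmology computations would then take such an object as their first argument.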
Our code base is a fine mess currently in terms of code style ^^'
I suggest we adopt the TensorFlow code conventions for uniformity. Here is a quick link to the documentation and style document for TF: https://www.tensorflow.org/community/contribute/code_style
As far as I can see, I don't think the current implementation trivially accommodates a generic catalog of positions, randomly shuffled, for painting and readout.
Currently mesh_utils.cic_paint takes a tensor part holding particle positions of shape [batch, nx, ny, nz, 3].
A generic paint or readout function needs [batch, npart, 3], along with associated functions to determine which rank each particle belongs to.
This needs to be implemented.
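For reference, a minimal single-process version of painting from a [batch, npart, 3] tensor might look like the NumPy sketch below (illustrative only; the distributed version additionally has to route each particle to the rank owning its slab):

```python
import numpy as np

def cic_paint(positions, nc):
    """Naive single-process CIC painting of positions of shape [batch, npart, 3]
    onto a periodic nc^3 mesh. Each particle deposits trilinear weights onto
    its 8 neighboring cells."""
    batch, npart, _ = positions.shape
    mesh = np.zeros((batch, nc, nc, nc), dtype=np.float32)
    floor = np.floor(positions).astype(np.int32)
    frac = positions - floor
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                # Trilinear weight for this corner of the cell
                w = (np.where(dx, frac[..., 0], 1 - frac[..., 0])
                     * np.where(dy, frac[..., 1], 1 - frac[..., 1])
                     * np.where(dz, frac[..., 2], 1 - frac[..., 2]))
                idx = (floor + [dx, dy, dz]) % nc  # periodic wrap
                for b in range(batch):
                    np.add.at(mesh[b],
                              (idx[b, :, 0], idx[b, :, 1], idx[b, :, 2]), w[b])
    return mesh
```

The weights for each particle sum to one, so painting conserves total mass regardless of how the particles are shuffled.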
Currently, we don't have any tools to export a simulation snapshot in a standard format. This issue is to track the addition of a few IO tools to export the output of a simulation as a standard FastPM file, which can then be read with various tools.
Note that this is also linked to #26: it'd be nice if, ultimately, the I/O we implement were compatible with a distributed setting.
We implemented a new test to validate our raytracing implementation.
Instead of running the N-body simulation, the snapshots at different scale factors are generated by using the same linear power spectrum at the scale factor a=0.5229969 (with the order set to 1).
The following image shows a comparison between the Jax-cosmo linear power spectrum (dashed line) and the 3D power spectrum computed for each snapshot:
In particular, we adopted the following settings:
The specific settings can be found here.
Below are the results of 4 setting combinations compared to the Jax-cosmo linear power spectrum.
The current prototype for the Mesh TensorFlow implementation of FlowPM is using B=1, i.e. a mesh resolution identical to the resolution of the initial particle grid.
This issue is to track the implementation of B>1, which can greatly improve the accuracy of the final field.
This issue is to track the work on benchmarking the new Horovod backend for GPU clusters and getting profiling information for FlowPM.
We want to do the following things:
To learn how to do this profiling, keep an eye on DifferentiableUniverseInitiative/IDRIS-hackathon#2
For now, the raytracing code we have developed with @dlanzieri is based on the non-distributed version of FlowPM. I'm opening this issue to document the porting of the necessary pieces to Mesh Tensorflow.
Here is the list of things I think we need:
Things that are optional at this stage but would be nice to have
Hi, thank you for your effort to utilize TF. Does your code perform better on GPU than on CPU?
If so, how should I run the code on GPUs on a cluster without a Slurm management system?
Our current implementation for the Mesh TensorFlow version is only doing 1LPT to evolve initial conditions down to a0. There is no particular difficulty in implementing 2LPT, I just didn't get around to it.
Here is where this functionality should go:
https://github.com/modichirag/flowpm/blob/3582e807fd3195dd20f7513ddd2abc8b28ac50e9/flowpm/mtfpm.py#L96
And here was our TensorFlow only implementation:
https://github.com/modichirag/flowpm/blob/9c867c506075938ebecedc2955eb2c992b534bf3/flowpm/tfpm.py#L159
@dlanzieri noticed that in some situations interp_tf can return NaN values. @dlanzieri can you find a MWE that reproduces your problem?
5-step PM runs with reconstruction are failing, even on 2 nodes, i.e. 16 GPUs.
This is surprising given that the following configurations are working well enough:
This seems to be a combination of OOM and communication issues. It effectively limits our reconstruction capability on GPU, since going to a larger number of nodes is going to be prohibitively slow in terms of lowering the graph, and the situation is likely to be worse for 512^3 and higher meshes.
Again, maybe it will not be so with SIMD placement and/or on TPUs, but it will still be good to understand this issue in the first place.
Attached is the log generated by the run.
To implement Gaussian smoothing and other transfer functions in Fourier space, we will need to Fourier transform, multiply with a kernel in Fourier space, and then transform back. This seems to not work as expected. For example, multiplying by a Gaussian smoothing kernel seems to change both the transfer function and the cross-correlation with the original field, even on large scales.
Example code to reproduce this is below:
import numpy as np
import tensorflow as tf  # TF1-style API (placeholders / sessions)
import matplotlib.pyplot as plt
import flowpm
import tools  # from examples/utils/tools.py on the <recon> branch; adjust path as needed

bs, nc = 200, 128
R = 1
dtype, cdtype = np.float32, np.complex64

# Get a Gaussian smoothing kernel
kk = flowpm.kernels.fftk([nc, nc, nc], symmetric=False)
kmesh = sum(ki**2 for ki in kk)**0.5
kmesh = np.expand_dims(np.expand_dims(kmesh, -1), 0)
kwts = np.exp(-0.5*kmesh**2*R**2).astype(cdtype)

tmp = np.random.normal(size=nc**3).reshape(nc, nc, nc).astype(dtype)

# First with TensorFlow
tf.reset_default_graph()
x = tf.placeholder(tf.float32, [None, None, None, None, 1])
x1f = tf.signal.fft3d(tf.cast(x, cdtype))
x1f = tf.multiply(x1f, kwts)
x1 = tf.cast(tf.signal.ifft3d(x1f), dtype)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    predtf = sess.run(tf.squeeze(x1), {x: tmp.reshape(1, nc, nc, nc, 1)})

# Now with numpy
x2f = np.fft.fftn(tmp)
x2f = x2f * np.squeeze(kwts)
predpy = np.fft.ifftn(x2f).astype(dtype)

# Plot and see; power() is in the <recon> branch, examples/utils/tools.py
k, p1 = tools.power(predpy, boxsize=bs)
k, p2 = tools.power(tmp, boxsize=bs)
k, p12x = tools.power(predpy, f2=tmp, boxsize=bs)
plt.plot(k, (p1/p2)**0.5, 'C0', label='Transfer Py')
plt.plot(k, p12x/(p2*p1)**0.5, 'C1', label='xcorrelation Py')

k, p1 = tools.power(predtf, boxsize=bs)
k, p2 = tools.power(tmp, boxsize=bs)
k, p12x = tools.power(predtf, f2=tmp, boxsize=bs)
plt.plot(k, (p1/p2)**0.5, 'C0--', label='Transfer TF')
plt.plot(k, p12x/(p2*p1)**0.5, 'C1--', label='xcorrelation TF')

plt.legend()
plt.semilogx()
plt.grid(which='both')
plt.ylim(0.5, 1.25)
When eventually scaling up mesh computations to very large volumes, we need to output the laid-out tensors to a file system.
For Cori and similar systems, we will need to have each process output its own slice of the data. Not sure yet what the best approach is.
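As a strawman (the file layout here is hypothetical, not a FastPM format), the simplest version is one file per process, reassembled on read:

```python
import os
import numpy as np

def write_slab(outdir, rank, slab):
    """Each process writes only its own slab; no gather onto a single node.
    Hypothetical layout: one .npy file per rank, reassembled by readers."""
    np.save(os.path.join(outdir, "field.%05d.npy" % rank), slab)

def read_field(outdir, nranks, axis=0):
    """Reassemble the full field by concatenating the per-rank slabs."""
    slabs = [np.load(os.path.join(outdir, "field.%05d.npy" % r))
             for r in range(nranks)]
    return np.concatenate(slabs, axis=axis)
```

A real implementation would more likely target MPI-IO or HDF5 with a collective write into a single file, which is closer to what FastPM-compatible readers expect.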
This issue is to track the development of a benchmark script to test the scaling of a distributed FFT in Mesh TensorFlow under different environments.
I have a prototype script for Cori in https://github.com/modichirag/flowpm/tree/mesh/scripts
But there is a lot of room for improvement, in particular:
Help is most welcome :-)
The flowpm_blog.ipynb file, which is linked from the main README, still uses TensorFlow 1.x rather than 2.6. I'm not sure what the highest 2.x flowpm has been tested on so far...
Hi @modichirag, I compared the matter power spectrum computed with and without PGD. I'd like to know what you think about these results!
Initially I used the kl, ks, mu, and alpha0 parameters obtained by running this notebook, with the following settings for the N-body simulation:
nc = 128
field_size = 5.
nsteps=40
B=2
This is what the power spectrum looks like for all the snapshots:
Because the alpha parameter we used for the PGD is scale-factor dependent, according to the relation
alpha = alpha0 * a**mu,
I tried to see how the amplitude of the power spectrum at small scales changes as a function of the alpha0 parameter:
I see that the power spectra that best approximate the analytical one are those obtained with alpha0=0.008, 0.009, and 0.01; for larger and smaller values they start to lose power.
This is what the power spectrum looks like for all the snapshots using alpha0=0.01:
Now, I think my scale translation is probably not right. The kl, ks obtained with Vanessa's notebook should be in h Mpc^(-1), but I'm not sure.
So, still using the alpha0 obtained from the notebook (alpha0=0.00269), I tried different scale translations:
kl_array = [kl*0.7*0.5/(nc*B/box_size), kl*0.5/(nc*B/box_size), kl*2/(nc*B/box_size), kl*2*0.7/(nc*B/box_size)]
ks_array = [ks*0.7*0.5/(nc*B/box_size), ks*0.5/(nc*B/box_size), ks*2/(nc*B/box_size), ks*2*0.7/(nc*B/box_size)]
Here are the peaks of the PGD kernel:
and the power spectrum of only one snapshot:
From this plot it looks like the right scaling for kl and ks is obtained by multiplying them by h and 0.5 (this depends on how we defined k).
Assuming this is the best rescaling, I used it to look again at how the power spectrum at small scales changes as a function of alpha0.
And alpha0=0.01 still seems like the best choice.
This issue is to track the development of a distributed version of our FlowPM lensing implementation.
We essentially need to create a distributed version of these two functions:
The first one should be easily adaptable from the distributed versions of cic_paint
which can be found here:
Line 125 in 9145d0c
The second one is a tiny bit more sophisticated, because it will involve a distributed interpolation operation, and this might require some thinking.
Currently in the distance computations, we have a very naive implementation using the BDF ODE integrator and an integrand that is not under a tf.function. As a result, there is a lot of retracing that happens for simple distance computations unless the entire call is under an external tf.function.
This wouldn't affect performance in an actual simulation script, but this is quite inconvenient in practice.
So, it'd be nice to rewrite rad_comoving_distance so that parts of it can be under a @tf.function and avoid unnecessary retracing when called eagerly.
In this notebook I try to find the correction terms to compensate for the PM approximations (like PGD in the n-body), following what has been done in this notebook.
The idea was to use the same correction parameters found with the JAX neural network, but to use the TF neural network to inject them.
To do that, I followed this example. The problem is that the Dormand-Prince solver for the neural ODE takes forever.
@EiffL any idea of what could have gone wrong?
So now any PR or issue should pop up in Slack, for easy notification.
Something happened with conda, and our tests no longer run automatically in GitHub Actions; this needs to be fixed!
We currently have tests to evaluate the TensorFlow version of FlowPM against our reference FastPM simulation code. They run automatically on Travis CI.
The problem now is that we want to be doing the same thing but with the Mesh implementation, which requires running the ops on a TensorFlow cluster.
We need to figure out how to run those tests automatically. The most likely answer will be to spawn a TF cluster on Travis, with like 4 CPU processes, and run the tests by connecting to this local cluster.
There are some caveats with that approach though, because as we have already seen, some ops behave differently on CPU and GPU, so...
In our current Mesh FlowPM, readout and painting are implemented as a series of slicewise and shift operations, not as intrinsic Mesh Ops. The problem is that to decide how much shifting needs to happen, you need to know about the particular mesh distribution. Right now, we are doing that by providing the split dims and number of splits directly to these operations, as can be seen here:
https://github.com/modichirag/flowpm/blob/3582e807fd3195dd20f7513ddd2abc8b28ac50e9/flowpm/mesh_utils.py#L113
One solution, adopted for instance by the block convolutions in Mesh TensorFlow is to include this information about the mesh by having specific dimensions indicating the topology of the mesh. i.e. a 2d feature map would have shape [batch_size, n_split_x, n_split_y, block_size_x, block_size_y, nfeatures]
That works, but I'm not a super fan of it. Another solution is to implement painting and readout ops as Mesh Ops, which have access to all the info they need about the specific tensor layout.
As noted in #14 distributed FFTs are slow and costly. In an effort to make distributed FlowPM more computationally efficient we are investigating the possibility of implementing a multi-resolution algorithm, where the distributed FFT is limited to handling the large scale modes, and we compute the short range interactions through a local FFT only.
Instead of going through the usual route of painting and processing independently separate grids at different resolutions, we are being much smarter and using a multiresolution pyramid of the density field to separate small and large scales, which can then be processed independently before being merged back together.
Experiments can be found here: https://github.com/modichirag/flowpm/blob/multigrid/notebooks/flowpm-MultiGrid.ipynb
Here are the necessary steps:
- Make sure scale separation and combination works in the non-distributed case, in particular with respect to periodicity of the large scales
- Implement pyramid decomposition in Mesh TensorFlow
- Implement force computation through slicewise FFT in Mesh TensorFlow
- Implement force combination and readout
- Profit! aka measure speed up
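The first step can be prototyped in plain NumPy; a toy version of the scale separation (a simple Fourier-space Gaussian low-pass, not the actual pyramid code) looks like:

```python
import numpy as np

def split_scales(field, R=4.0):
    """Split a periodic field into low-k and high-k parts with a Gaussian
    low-pass in Fourier space; summing the parts recovers the field exactly."""
    freqs = [np.fft.fftfreq(n) * 2 * np.pi for n in field.shape]
    k2 = sum(ki**2 for ki in np.meshgrid(*freqs, indexing="ij"))
    lowpass = np.exp(-0.5 * k2 * R**2)
    low = np.fft.ifftn(np.fft.fftn(field) * lowpass).real
    return low, field - low
```

In the distributed setting the low-k part would go through the distributed FFT while the high-k residual is handled by local FFTs only.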
For some reason, when increasing nc to 64, the field I get out of the nbody sim is equal to 1 everywhere. @modichirag this sounds very much like the issue I had on colab recently, you fixed it I don't know how, do you remember that?
Hello
The time for lowering operation for reconstruction graph with N-Body/PM forward model is too high.
It is also high for LPT forward model, but not unreasonably so.
The reconstruction graph is the same as the forward model graph with two extra computations: the gradient with respect to the linear field and the update operation for the linear field.
As of commit modichirag@5c3fe39, it can be found in the file https://github.com/modichirag/flowpm/blob/recon/examples/pyramid_recon.py (in case future commits change the graph).
For timing comparison:
Config: 1 node, 8 GPUs on cori with nx*ny=4x2 config of mesh, 64^3 PM grid, 5 step PM model and pyramid scheme
This issue is important, since the slow lowering makes it hard to experiment with N-body models for reconstruction.
As described in our first round of benchmarks, the very naive FFT op we implemented seems to be quite inefficient both on TPU and on Cori-GPU.
See here: https://github.com/modichirag/flowpm/issues/14#issuecomment-552263768
I'm opening this issue to track the development of a better FFT transform, with these two goals:
For reference our current strategy is implemented here:
https://github.com/modichirag/flowpm/blob/ca8eadb849fd688c28f9b63eb205378cac4f7862/flowpm/mesh_ops.py#L31
And the basic idea is to implement a 3D FFT by performing 1D transforms along the last array dimension, three times, transposing the array to bring a new dimension to the last position each time. Something like this:
for d in range(3):
    x = mtf.slicewise(tf.signal.ifft, [x],
                      output_dtype=tf.complex64,
                      splittable_dims=x.shape[:-1])
    x = mtf.transpose(x, new_shape=outer_dims + [y_dim, z_dim, x_dim])
    x = mtf.replace_dimensions(x, [y_dim, z_dim, x_dim], [x_dim, y_dim, z_dim])
The mtf.transpose is transposing the array, and mtf.replace_dimensions is there to reshape the array so that the last dimension is not split. These two steps could very probably be merged, but unless I'm missing something, not without implementing a specialized op.
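The transpose-then-transform idea is easy to sanity-check in single-process NumPy, without any of the Mesh TensorFlow layout concerns:

```python
import numpy as np

def fft3d_by_1d(x):
    """3D FFT via three 1D FFTs along the last axis, transposing to bring a
    new axis into the last position each time (the same idea as the Mesh
    TensorFlow loop above, minus the distributed layout bookkeeping)."""
    for _ in range(3):
        x = np.fft.fft(x, axis=-1)     # 1D transform along the last axis
        x = np.transpose(x, (2, 0, 1))  # rotate a fresh axis into last place
    # After three rotations the axes are back in their original order,
    # and each axis has been transformed exactly once.
    return x
```

Since 1D FFTs along distinct axes commute, the result matches np.fft.fftn.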
This issue is to document the process of updating most of FlowPM to full TF2. The Mesh TensorFlow parts still rely on the tf1 API but we already know that we need tf 2.1 to support large tensors with many dimensions.
One of the obvious things we want to be able to do with FlowPM is to get gradients with respect to cosmology. Currently, we are relying on the background.py file from the python implementation of FastPM, so we can't backpropagate with respect to cosmological parameters.
The goal of this issue is to add a pure TensorFlow implementation of cosmology computations that will allow us to replace all the calls to background.py. I have created a tfcosmo branch (https://github.com/modichirag/flowpm/tree/tfcosmo) to work on this implementation.
I have made a small list of things we want to be able to compute in pure tensorflow:
Other functions used in the nbody code can easily be derived once we have this in place.
Very concretely, we can start building a tfbackground.py file and add these features as we go, until we can replace the background.py file altogether.
For "inspiration" we can reuse code from the jax-cosmo project, where all of this is implemented, but alas in Jax and not TensorFlow : https://github.com/DifferentiableUniverseInitiative/jax_cosmo/blob/master/jax_cosmo/background.py
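To make the list concrete, here is a NumPy sketch of two of these quantities (the TF version would simply swap np for tf ops so that gradients flow to the cosmological parameters; formulas assume w = -1):

```python
import numpy as np

def E(a, Omega_m=0.3, Omega_k=0.0):
    """Dimensionless Hubble rate E(a) = H(a)/H0 for a LambdaCDM (w = -1) cosmology."""
    Omega_lambda = 1.0 - Omega_m - Omega_k
    return np.sqrt(Omega_m * a**-3 + Omega_k * a**-2 + Omega_lambda)

def rad_comoving_distance(a, nsteps=2048, **cosmo):
    """chi(a) = integral_a^1 da' / (a'^2 E(a')), in units of c/H0.

    Trapezoid rule on a fixed grid, as a stand-in for a proper ODE solve.
    """
    ap = np.linspace(a, 1.0, nsteps)
    f = 1.0 / (ap**2 * E(ap, **cosmo))
    return np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(ap))
```

Multiplying by c/H0 (~2997.9/h Mpc) converts the distance to physical units; the growth factor would follow the same pattern with its ODE.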
I'm happy to start looking at this with @dlanzieri
We currently have 0 documentation. This is bad. This issue is to track the deployment of a readthedocs documentation, and discussing the content we need.
I'm happy to start putting in place the tools for a first pass at a doc and then we can iterate here further.