hi-pace / hipace

Highly efficient Plasma Accelerator Emulation, quasistatic particle-in-cell code

Home Page: https://hipace.readthedocs.io

License: Other

CMake 4.33% C++ 82.60% Shell 5.84% Python 5.98% Jupyter Notebook 0.21% C 1.04%

hipace's Introduction

HiPACE++


HiPACE++ is an open-source, portable, GPU-capable, quasi-static particle-in-cell code for wakefield acceleration written in C++. It is a full rewrite of the legacy code HiPACE, the Highly efficient Plasma ACcelerator Emulator. Its main features are:

  • Multiple beam and plasma species to simulate beam-driven wakefield acceleration
  • A laser envelope solver to simulate laser-driven wakefield acceleration
  • An advanced explicit field solver for increased accuracy
  • Diagnostics compliant with the openPMD standard
  • Arbitrary profiles for the beams and the plasma
  • Readers for beam and laser profiles from file
  • Adaptive time step and sub-cycling
  • Additional physics (field ionization, binary collisions, temperature effects, radiation reactions)

HiPACE++ is built on the AMReX library, which provides the particle and field data structures.

Please have a look at our documentation and join the chat!

Announcement

On the 11th of July 2023, there will be a virtual HiPACE++ workshop from 4 pm to 7 pm CET. Feel free to sign up on the Indico webpage.

Copyright Notice

HiPACE++ Copyright (c) 2021, The Regents of the University of California, through Lawrence Berkeley National Laboratory (subject to receipt of any required approvals from the U.S. Dept. of Energy) and Deutsches Elektronen-Synchrotron (DESY). All rights reserved.

Please see the full license agreement and notices in license.txt.
Please see the notices in legal.txt.
The SPDX license identifier is BSD-3-Clause-LBNL.

hipace's People

Contributors

alexandersinn, angelfp, atmyers, ax3l, huixingjian, mathismewes, maxthevenet, remilehe, severindiederichs, weiqunzhang


hipace's Issues

Bx and By seem to be incorrect

It seems that Bx and By are off by a factor of -1.

In the particle pusher, the plasma electron particles were pulled inwards instead of being pushed away from the electron drive beam.
A comparison with HiPACE-C showed that the fields are flipped.
[Field plots from a HiPACE-C dummy simulation and from the HiPACE++ beam_in_vacuum example omitted.]

To fix this, the missing factor of -1 needs to be found. The current is correct. Additionally, the analysis script must be updated. It would also be helpful to check Bx along x and By along y and ensure that they are 0 (the values need to be averaged between the 2 grid points).
Afterwards, all checksum tests need to be reset.

Correct current deposition for SI units

At the moment, the current deposition only works in normalized units. In SI units, it currently deposits
Jx = e * ne * ux.

It should deposit (if I am not mistaken)
Jx = e * ne * c * ux / gamma

Besides the missing c, which can be added easily, we are also missing a factor of 1/gamma, which seems to cancel out in normalized units.
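For reference, a minimal sketch of the corrected SI deposition factor, assuming ux, uy, uz are the momenta normalized to m*c (so that v = c*u/gamma); the function and variable names are illustrative, not the actual HiPACE++ kernel:

#include <cmath>

// Illustrative only: with u = gamma*beta (momentum per m*c), the velocity is
// v = c*u/gamma, so the SI current density deposited by a particle is
// J = q * n * c * u / gamma.
double deposited_jx (double q, double ne, double ux, double uy, double uz, double c)
{
    const double gamma = std::sqrt(1.0 + ux*ux + uy*uy + uz*uz);
    return q * ne * c * ux / gamma;
}

In normalized units the factor c drops out and, as noted above, the 1/gamma seems to cancel, which is why the discrepancy only shows up in SI.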

Wrapper function for plasma particle pusher

For anticipated performance reasons, the plasma particle pusher should be implemented in the following way:

One wrapper function iterates over all particles.
Inside the particle loop, three functions are called (see the sketch after this list):

  1. Field Gather for each particle (see #76 )
  2. Update the force terms with the given field values.
  3. Update the position of the particles with the particle pusher (see #93 )
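
A minimal, self-contained sketch of that structure (illustrative types and names, not the actual HiPACE++ API; the real code would loop over AMReX particle tiles on the device):

#include <vector>

struct PlasmaParticle  { double x, y, ux, uy, psi; };
struct GatheredFields  { double ExmBy, EypBx, Ez, Bx, By, Bz; };

// Placeholder bodies: only the structure of the wrapper matters here.
GatheredFields GatherFields (const PlasmaParticle&) { return {}; }   // 1. field gather (#76)
void UpdateForceTerms (PlasmaParticle&, const GatheredFields&) {}    // 2. force terms
void AdvancePosition (PlasmaParticle&, double /*dz*/) {}             // 3. pusher (#93)

// The wrapper: one loop over all plasma particles, three calls per particle.
void PushPlasmaSlice (std::vector<PlasmaParticle>& particles, double dz)
{
    for (PlasmaParticle& p : particles) {
        const GatheredFields f = GatherFields(p);
        UpdateForceTerms(p, f);
        AdvancePosition(p, dz);
    }
}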

Option: slice MultiFab is an alias to 1 slice of 3D MultiFab

When looping over slices, the current 2D slice is copied from the 3D array Fields::m_F to the 2D slice Fields::m_slices, processed in the 2D slice, then copied back to the 3D array. We consider 2 approaches for the GPU porting (these would be two options in the final version of the code):

  1. The whole data lives in device memory. In that case, these copies are unnecessary and the slice could be processed directly in the 3D array.
  2. The 3D array lives on the host, the 2D slice on the device, and the data is copied for each slice.

One way to enable both approaches efficiently is to have an option to make the slice MultiFab an alias of the corresponding slice in the 3D array for (1), and then process it as we do now. This will only be relevant if we observe that these copies take a significant amount of time.

Transverse redistribute taking too long?

Running with this input file (could be a production simulation) and submission script on a single 32 GB V100 GPU, we observe that, on development, particle redistribution takes a significant amount of time. See the Tiny Profiler output for more information; here are the first few lines of exclusive timing:

---------------------------------------------------------------------------------------------------
Name                                                NCalls  Excl. Min  Excl. Avg  Excl. Max   Max %
---------------------------------------------------------------------------------------------------
Redistribute_partition                               20480      32.46      32.46      32.46  33.75%
UpdateForcePushParticles_PlasmaParticleContainer()   40960      23.03      23.03      23.03  23.95%
DepositCurrent_PlasmaParticleContainer()             20500      11.92      11.92      11.92  12.40%
Fields::ComputeRelBFieldError()                      20480      5.247      5.247      5.247   5.46%
DepositCurrent_BeamParticleContainer()                  20      4.325      4.325      4.325   4.50%
FFTPoissonSolverPeriodic::SolvePoissonEquation()     51200      2.462      2.462      2.462   2.56%

Handle plasma particles with longitudinal parallelization

In the loop over slices in Hipace::Evolve, when running in parallel, calls to Wait and Notify transfer field data from the last slice of a rank to the first slice of the rank downstream. An equivalent should be implemented for plasma particles. We could have WaitPlasma and NotifyPlasma function calls that copy all the particle data to the rank downstream.

Problem with I/O for host-device if 3D array stays on host.

PR #143 introduces the possibility to keep the 3D array on the host and copy single slices to the device. This makes it possible to run higher-resolution simulations that would otherwise exceed the limited device memory.

Unfortunately, there is a bug with I/O, which should be done only from the host and should not affect device memory.
Currently, I/O does affect the device memory.

Using the following input script (please remove the .txt)
inputs_normalized.txt

With I/O (hipace.output_period = 1):

Total GPU global memory (MB) spread across MPI: [7973 ... 7973]
Free  GPU global memory (MB) spread across MPI: [1998 ... 1998]

Without I/O (hipace.output_period = -1):

Total GPU global memory (MB) spread across MPI: [7973 ... 7973]
Free  GPU global memory (MB) spread across MPI: [6350 ... 6350]

This greatly reduces the applicability of this concept and needs to be fixed.

Our 3D field array is allocated in pinned memory. When writing the 3D field array to file via VisMF::Write (FabArray),
we noticed that Gpu::inLaunchRegion() = true, although it should be false in that case. This memory should be written from the host, not the device; accessing the 3D array as device memory may exceed the device memory.

Improve function TransverseDerivative

The function TransverseDerivative can currently only do operations like J(i,j,k) = df/dx(i,j,k). To solve for other fields, it should also support J(i,j,k) += coef*df/dx(i,j,k), which requires two improvements (see the sketch after this list):

  • have option to add (+=) instead of assign (=). An optional input parameter could handle that, default = assign.
  • have a multiplying factor coef. Another optional input parameter could handle that, default = 1.
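
A sketch of what the extended interface could look like (illustrative names and a plain-array implementation, not the actual HiPACE++ code):

#include <cstddef>
#include <vector>

enum struct DerivativeMode { Assign, Add };

// Central transverse derivative: J = coef * df/dx (Assign) or J += coef * df/dx (Add).
void TransverseDerivative (const std::vector<double>& f, std::vector<double>& J,
                           std::size_t nx, std::size_t ny, double dx,
                           DerivativeMode mode = DerivativeMode::Assign,
                           double coef = 1.0)
{
    for (std::size_t j = 0; j < ny; ++j) {
        for (std::size_t i = 1; i + 1 < nx; ++i) {
            const double dfdx = (f[j*nx + i + 1] - f[j*nx + i - 1]) / (2.0*dx);
            if (mode == DerivativeMode::Assign) { J[j*nx + i]  = coef * dfdx; }
            else                                { J[j*nx + i] += coef * dfdx; }
        }
    }
}

The two optional parameters keep the current call sites unchanged (Assign with coef = 1 reproduces the existing behavior).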

Plasma density deposition

At the moment, only the current densities are deposited. The plasma density should also be deposited in the current deposition.
This will help assess the correctness of #93

CUDA binary name

@MaxThevenet reported today that the name of the CUDA binary built with cmake ../ -DHiPACE_COMPUTE=CUDA -DCUDA_ARCH=Volta looks odd on CUDA 11.

I am trying to reproduce this, but for me it reads

$ ll bin/
hipace -> /home/axel/src/hipace-amrex/build/bin/hipace.MPI.CUDA.DP
hipace.MPI.CUDA.DP

I am using CUDA 11.0.194 and CMake 3.18.0.

@MaxThevenet Can you please share more info on this?

Add checksum regression tests for CI

Regression tests check that the code gives the same result as the previous version. In WarpX, this is done via automated checksum tests: we run one simulation, compute a reduced quantity (one number per field, a checksum, for instance), and compare it to a benchmark stored in the repo.

In order to avoid breaking the code, I think it would be good to add checksum tests in CI. We could copy-paste the python files in https://github.com/ECP-WarpX/WarpX/tree/development/Regression/Checksum into the hipace repo, and start setting this up. For now, let us simply setup a very simple test. Soon enough, I will submit a PR to setup more CI, and we could link this to checksum tests at this time.

Finally, in the long run, the goal is again to share these files with WarpX, not just copy-paste them. With that in mind, let us avoid making changes to the Hipace version of these scripts. If they could be improved (probably not a priority), this should be done in WarpX and then copied to Hipace, so that it stays easy to merge them back in the future.

The WarpX documentation is in https://warpx.readthedocs.io/en/latest/developers/checksum.html.

Get field gather from WarpX, and generalize it to 2,2,0 shape factor

WarpX has a field gather function. Ideally, we would simply re-use it, but (i) we don't have WarpX as a dependency for now (there is a plan to externalize a few WarpX functionalities so we can share them with Hipace), and (ii) the WarpX functions would need to be generalized anyway, because Hipace needs different shape factors in different directions, which WarpX does not support so far.

As a temporary solution, let us:

  1. copy the field gather (on 1 particle) function from WarpX https://github.com/ECP-WarpX/WarpX/blob/development/Source/Particles/Gather/FieldGather.H into src/particles
  2. Remove all non-relevant parts (RZ stuff etc.); it should end up being a much simpler and nicer function
  3. Modify it to allow different shape factors. For this, let us use the same method as for the current deposition, for consistency, i.e., template <int depos_order_xy, int depos_order_z> as in https://github.com/Hi-PACE/hipace/blob/development/src/particles/deposition/BeamDepositCurrentInner.H.

Note that the field gather function should be general enough to work for both Beam particles and Plasma particles, which I think should not be too difficult (the way it is written in WarpX is already general enough, I believe). Then, this field gather function will be called from within the particle pushers.
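
A rough sketch of the templated interface this would lead to (illustrative signature and placeholder body, not the actual WarpX or HiPACE++ code):

// Compile-time shape-factor orders, transverse (xy) and longitudinal (z),
// mirroring the template parameters of BeamDepositCurrentInner.H.
template <int depos_order_xy, int depos_order_z>
void doGatherShapeN (double xp, double yp, double zp,
                     double& Exp, double& Eyp, double& Ezp,
                     double& Bxp, double& Byp, double& Bzp)
{
    // depos_order_xy + 1 interpolation points in x and y,
    // depos_order_z  + 1 interpolation points in z.
    // ... evaluate the shape factors and interpolate the field arrays
    //     (which would be additional arguments) to (xp, yp, zp) ...
    Exp = Eyp = Ezp = Bxp = Byp = Bzp = 0.0; // placeholder
    (void) xp; (void) yp; (void) zp;
}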

Testing performance and accuracy of using psi instead of psi + 1

In PR #140, the particle attribute for the pseudo-potential psi was changed. Previously, psi actually represented psi + 1. After that PR, psi now actually represents psi. However, this has 2 effects:

  1. accuracy: to calculate vz in the plasma particle current deposition, one actually needs psi; therefore, it used to be 1 - (psi + 1) and now it is simply psi. This might positively affect accuracy, especially in single precision.
  2. performance: in UpdateForceTerms.H, using psi instead of psi + 1 results in 9 extra additions of +1, which might negatively affect performance.

This should be tested.

Assess cost of Tiny Profiler

We are instrumenting the code heavily. We should check that this does not impact performance on HPC GPUs (V100). This can be done by compiling with and without Tiny Profiler support and comparing the runtimes. This test should be done once the main functions are in place (after #133 we should already get a good idea), then checked regularly.

Dirichlet and Periodic BC work on GPU and CPU \o/ - let's test!

If PR #164 tests are positive, we can merge it, and run a bunch of tests. A few points:

  • We may have to re-initialize particles at each time step, otherwise running many time steps may not be relevant.
  • Let's keep the script somewhere nice (I can create a Confluence page), and organize this study (to run as few simulations as possible to answer our questions), so we can re-run it later (e.g., on A100 for the performance tests when we have access)
  • production-like simulation means we're taking most of the GPU memory (say, between 1/2 and 3/4); let's first discuss the parameters and a proper way to measure run time.

Correctness tests

  • Write an automated test, on CPU for the moment, to compare with theory in a linear case, both in SI and normalized units (SI still missing).
  • Run this test on GPU, and make sure it gives the same results (with same number of iterations).
  • Check that Hipace++ requires roughly the SAME NUMBER of iterations as Hipace in the predictor-corrector loop for each slice.
  • Run a similar test in the blowout regime. Once we see that it gives the same results as Hipace, let's just have a low-resolution automated test, with a comparison with theory (if theory is good enough...) or just the checksum test. Again, check that Hipace and Hipace++ roughly require the same number of iterations.

Performance tests

All these tests must be done on an HPC GPU (e.g., a V100 on Summit or similar). For production (or production-like) simulations, the setup has to be representative (relatively high-order deposition, fill the GPU, etc.). It can also be compared with a local GPU.

Performance for a production run

Low-level performance numbers

And then the fun begins: let's try to extract a time/particle, time/Poisson solve, etc.:

  • Cost and scaling of the Poisson solve: Simulation with no beam and no plasma (it'll just do the Poisson solve), and plot the time/Poisson solve for a 256^2, 512^2, 1024^2 transverse problem (provided all fit in GPU memory), with say 10 or 20 slices or whatever fits in memory.
  • Cost of particle iteration: Simulation with say 512x512x100 grid points, run with ppc = [0, 1, 2, 4, 8, etc. (until GPU memory overflow)], and plot the result as a function of ppc. Extract time/iteration/particle.
  • Cost of particle mixing: Run a production-like simulation in the blowout regime, and re-run the exact same one with a beam with no charge (so plasma particles don't move and remain well-ordered in memory).
  • Cost of particle shape factor: Run a production-like simulation with 0-charge beam, with shape factor of order 0, 1, 2, 3, and plot the time.
  • Cost of particle shape factor: Run a production-like simulation in the blowout regime, with shape factor of order 0, 1, 2, 3, and plot the time.
  • Cost of non-symmetric box: Compare a production simulation with 256x256, 256x512 and 512x512 grid points transversally, to see if there's any surprise.
  • Cost of Dirichlet BC: Production-like simulation, Dirichlet vs. periodic (I observed a ~20% difference, let's make this a nicer test).
  • Memory footprint of slices and particles: run the following tests, and compare with theory to check if there's a surprise
    • [beam=0ppc, plasma=0ppc, large box] -> size of 1 cell
    • [beam=100ppc, plasma=0ppc, small box] -> size of 1 beam particle
    • [beam=0ppc, plasma=100ppc, small box] -> size of 1 plasma particle
  • H2D copies: Run the same simulation with everything on the device, or with H2D/D2H copies. Then try a problem too big for the device, with the 3D array on the host.

QuickPIC vs. Hipace vs. Hipace++ benchmark

  • Ming ran a performance comparison between Hipace and QuickPIC. I believe it was on a 512^3 box. If we have everything up and ready, let's extract a time/step from Ming's study, and run the same simulation with Hipace++ to see if we have any speedup.

Predictor-corrector loop to compute transverse magnetic fields

Depends on #103.

Bx is obtained by solving the following equation on 1 transverse slice:

Delta Bx = mu_0 * (- d_y Jz + d_zeta Jy)

A simplified version of it is already implemented in Hipace, where the second term in the RHS (d_zeta Jy) is omitted; this issue proposes to include it.

The zeta derivative requires a predictor-corrector loop. Starting from an initial guess for Bx and By, the plasma particles are pushed to a future slice, where they deposit their currents, so that the zeta derivatives can be evaluated and Bx_new and By_new can be computed. This is repeated until Bx and Bx_new are close enough, to a certain tolerance.
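
A schematic of the iteration, reduced to scalars for brevity (the real quantities are 2D slices, and the names, relaxation factor and tolerance are illustrative):

#include <cmath>
#include <functional>
#include <utility>

// advance_and_solve stands for: push the plasma particles to the next slice
// with the guessed (Bx, By), deposit Jx and Jy there, evaluate the zeta
// derivatives, and solve for (Bx_new, By_new).
std::pair<double,double> PredictorCorrectorBxBy (
    double Bx, double By,
    const std::function<std::pair<double,double>(double,double)>& advance_and_solve,
    double tolerance = 1.e-4, int max_iterations = 30)
{
    for (int it = 0; it < max_iterations; ++it) {
        const auto [Bx_new, By_new] = advance_and_solve(Bx, By);
        const double error = std::abs(Bx_new - Bx) + std::abs(By_new - By);
        // mix the previous guess and the new solution for the next iteration
        Bx = 0.5*(Bx + Bx_new);
        By = 0.5*(By + By_new);
        if (error <= tolerance * (std::abs(Bx_new) + std::abs(By_new))) break;
    }
    return {Bx, By};
}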

Segfault with deposition order 0 0

I observed a segfault when running with a particle shape factor of order 0 transversally. This would actually be useful for the beam_in_vacuum automated test, but I had to set it to 1 because of this bug. It is worth trying to reproduce it and seeing what the code says in DEBUG mode (cmake .. -D CMAKE_BUILD_TYPE=Debug).

Bug in plasma deposition in parallel

When running in serial, the plasma current deposition looks correct (this was introduced and checked in PR #95, although we don't have an automated test for it). However, when running in parallel, the current deposition does not work. This can be reproduced by doing the following:

# Compile the code (with default options, i.e., MPI ON and OpenMP OFF)
# Open examples/can_beam/inputs and:
# - Add plasma.u_mean = 1.e3 2.e3 3.e3 # Otherwise the plasma does not deposit anything
# - set beam.density = 0. # Let the beam deposit 0, just to avoid confusion
cd tests
./can_beam.1Rank.sh ../build/bin/hipace .. # checksum test will show that jx, jy and jz are non-zero
./can_beam.2Rank.sh ../build/bin/hipace .. # checksum test will show that jx, jy and jz are zero

See below for a more detailed description.

Assess cost of FFT solver

Great problem to look into, but requires some understanding of GPU asynchronicity.

The function FFTPoissonSolver::SolvePoissonEquation in https://github.com/Hi-PACE/hipace/blob/development/src/fields/fft_poisson_solver/FFTPoissonSolver.cpp is properly instrumented: it contains BL_PROFILE("FFTPoissonSolver::SolvePoissonEquation()"); as well as calls to AnyFFT::Execute(m_backward_plan[mfi]);, which is also instrumented. However, when running the simulation on GPUs, AnyFFT::Execute(m_backward_plan[mfi]); (which actually does the FFT) seems to take a negligible amount of time, suggesting that the most time-consuming parts of FFTPoissonSolver::SolvePoissonEquation are the simple copies

amrex::ParallelFor( m_spectralspace_ba[mfi],
and
amrex::ParallelFor( mfi.validbox(),

As @RemiLehe mentioned, the Tiny Profiler probably shows inaccurate information because Execute only measures the kernel launch time, which differs from the execution time due to asynchronicity. Adding Gpu::streamSynchronize(); between each of these steps (the executes and the copies), as well as appropriate profiler regions, should help understand what the actual cost of the FFT is (see the sketch below). For an example of profiling a region, see
https://github.com/ECP-WarpX/WarpX/blob/57cc9a3397f3e89e57b3bbc15ba8215ad2144b4f/Source/Particles/PhysicalParticleContainer.cpp#L933
where WARPX_PROFILE_... does the same thing as BL_PROFILE_....
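
Inside FFTPoissonSolver::SolvePoissonEquation(), the instrumentation could then look roughly like the following sketch (using the standard AMReX TinyProfiler macros and amrex::Gpu::streamSynchronize(); the region names are made up, and the existing kernels are only indicated as comments):

BL_PROFILE_VAR("FFTPoissonSolver::copy_to_spectral_space", t_copy_in);
// amrex::ParallelFor( mfi.validbox(), ... );   // existing copy kernel
amrex::Gpu::streamSynchronize();  // wait for the kernel, not just its launch
BL_PROFILE_VAR_STOP(t_copy_in);

BL_PROFILE_VAR("FFTPoissonSolver::fft_execute", t_fft);
// AnyFFT::Execute(m_backward_plan[mfi]);       // existing FFT call
amrex::Gpu::streamSynchronize();
BL_PROFILE_VAR_STOP(t_fft);

With the synchronizations in place, the Tiny Profiler timings should reflect the actual execution time of each step.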

Charge deposition for ions

When ions can be assumed immobile, QSA still needs their charge to be deposited on the grid (i.e., on each slice, as done for electrons). There are several ways to do it:

Timon's way

  1. Allocate a new slice for ion charge rho_ion
  2. At the beginning of each time step, deposit electron charge density on the first slice (-1) into rho_ion
  3. when looping over slices, also add rho_ion to rho
  4. when all slices of current proc are done, communicate rho_ion to rank downstream

Analytical way

  1. Allocate a new slice for ion charge rho_ion
  2. At the beginning of the simulation (or, later, at the beginning of each time step if we want a non-uniform plasma profile), calculate the theoretical rho_ion (theoretical plasma profile convolved with the shape factor, with a dependence on ppc).
  3. when looping over slices, also add rho_ion to rho.

The second way is a bit nicer because rho_ion doesn't have to be communicated.

NOTE: this is not urgent, as we implemented a stupid version of the "Analytical way" that works (only) with order 0 deposition.

Implement Beam particle pusher

Section 3.3.3 in Timon's PhD manuscript https://bib-pubdb1.desy.de/record/191881/files/PUBDB-2014-04069.pdf describes the beam pusher. @severindie is this up-to-date? If so, we could start implementing it. This pusher would be called in

/* xxxxxxxxxx Gather and push beam particles xxxxxxxxxx */
, and could probably be a new member function in https://github.com/Hi-PACE/hipace/blob/development/src/particles/BeamParticleContainer.cpp

Note that this WILL have to call the field gather function, see issue #76.

Note that one can start from the WarpX pusher, see https://github.com/ECP-WarpX/WarpX/tree/development/Source/Particles/Pusher and where they are called.

Implement actual plasma current deposition

The current deposition for plasma particles is a dummy version for the moment. It is located in file https://github.com/Hi-PACE/hipace/blob/development/src/particles/deposition/PlasmaDepositCurrentInner.H and basically deposits a fixed charge (this was implemented for testing, just to assess performance).

Using Timon's PhD manuscript https://bib-pubdb1.desy.de/record/191881/files/PUBDB-2014-04069.pdf, in particular section 3.3.2.2 page 94 (at least in my version), we should write the actual current deposition routine.

Call Poisson solver to solve other fields

Depends on #103.

Currently, only a simplified version of Bx and By is calculated in Hipace, by solving a Poisson equation.
We should also implement solvers for the following fields:

  • Psi
  • Ex-cBy
  • Ey+cBx
  • Ez
  • Bz

This will give us all the final fields, except for Bx and By for which we will need a predictor-corrector loop.

Fixed-weight gaussian beam

For production simulations, we need the possibility to initialise a beam with a Gaussian distribution where all particles have a fixed weight. A solution with managed memory would probably be the easiest, but would only work on Nvidia GPUs (at least for now). One can have a look at WarpX https://github.com/ECP-WarpX/WarpX/blob/bb5e5e5b3b605f5650f6257af796f16ebdfcda9a/Source/Particles/PhysicalParticleContainer.cpp#L267 to see how it's done there.
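
A minimal host-side sketch of the fixed-weight sampling itself (plain C++ with std::normal_distribution; HiPACE++ would presumably use the AMReX random number generators and device or managed memory, so all names here are illustrative):

#include <random>
#include <vector>

struct BeamParticle { double x, y, z, w; };

// Sample n_macro macro-particles from a Gaussian density; every particle gets
// the same weight, chosen so that q * sum(w) reproduces the total charge.
std::vector<BeamParticle> MakeFixedWeightGaussianBeam (
    long n_macro, double total_charge, double q,
    double sigma_x, double sigma_y, double sigma_z, unsigned seed = 1234u)
{
    std::mt19937_64 rng(seed);
    std::normal_distribution<double> gx(0.0, sigma_x), gy(0.0, sigma_y), gz(0.0, sigma_z);
    const double w = total_charge / (q * static_cast<double>(n_macro));
    std::vector<BeamParticle> beam(n_macro);
    for (BeamParticle& p : beam) { p = {gx(rng), gy(rng), gz(rng), w}; }
    return beam;
}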

Another approach would be to write a framework to

  • Write particles to a file (we could write this file in the openPMD format, typically with a Python script)
  • Read this file from Hipace++, and send the particles to the device.

This would also make it easier to interface Hipace++ with experiments, for instance.

Accelerate Host-Device copies when 3D array stays on the host

PR #143 adds the option to keep the 3D array on the host (in pinned memory). However, data transfers between host and device then dominate the simulation time. Here are a few ways to accelerate this:

  • Only transfer the data we need, not all components in the slice
  • Accelerate the copy (using cudaMemcpy etc.)
  • Have a buffer (or several if needed) on the device, where/to data would be Async-copied.
  • Another approach would be to have multiple boxes (say 8, in the longitudinal direction) per GPU. The computation of each box would happen fully on the GPU, while the other 7 boxes stay on the host in the meantime. This would need more thinking to see if it could work (it could help if the copy is latency-bound, not if it is BW-limited, IMO).

This can be tested on PR #143 (once it is tested), with this input file. With a small problem size (1024 x 1024 x 10), everything should fit in the device memory of 1 GPU (e.g., 16 GB on Summit). With a larger problem size (1024 x 1024 x 200), the 3D data would exceed the device memory, so the option in #143 becomes relevant.

Plasma particle pusher

We should implement the plasma particle pusher. The pusher used in Hipace-C is not documented anywhere yet so Severin, who knows it, can take care of the implementation.

[NOT DECIDED YET] Slice deposition for the beam

Currently, the beam species lives on the device (which is the long-term plan anyway), but so does the 3D array of current density J, and the beam particles deposit directly into J.

For M2, we would like to investigate the possibility of storing the 3D arrays on the host and copying slices one by one onto the device, where they are processed. For this, we should be able to deposit the current of beam particles only on the current 2D slice, not on the full 3D array. I can see two options:

  1. The dirty one
    have an if statement directly in the current deposition: if the particle is not in the current slice, return

  2. The clean one

    • sort beam particles in the z direction
    • select the beam particle slice corresponding to the current field slice
    • deposit the current of this slice

Note that it is fine if the beam always lives on the device, all we need to do is to select the proper slice.

Also, I would be in favor of the dirty one. Recent tests showed that, even for a relatively large resolution (2048x2048x10, with 4x4 plasma ppc), we could consider putting the whole 3D array on the device. If that is the case, maybe we won't even have to play too much with host-device exchanges. @RemiLehe @ax3l, what do you think?
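
A sketch of what the dirty option could look like inside the per-particle deposition body (illustrative names: zp is the particle position, z_lo the lower edge of the domain and dz the slice spacing):

// Skip beam particles that do not fall into the slice currently processed.
void DepositBeamParticleOnSlice (double zp, int i_slice, double z_lo, double dz)
{
    const int particle_slice = static_cast<int>((zp - z_lo) / dz);
    if (particle_slice != i_slice) return;   // not in the current slice: deposit nothing
    // ... regular shape-factor deposition into the 2D slice arrays ...
}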

Assess correctness of SI and normalized code

Correctness of the SI Poisson solver can be assessed by running a simulation with examples/beam_in_vacuum/inputs, and analyzing the data with examples/beam_in_vacuum/analysis.py. In addition to the current test, we should:

  • run the test on GPU, and make sure the result is similar (this may require installing yt on Maxwell; see the WarpX documentation for yt)
  • increase the resolution and the transverse box size (on GPU because it is faster), to make sure the code converges to the analytical solution (may need NGP deposition to get closer to the physical value)
  • re-do these tests in normalized units (hipace.normalized_units = 1, and adapt the input file of course, see examples/beam_in_vacuum/inputs_normalized)

Fix test checking that input files are tested

A test checks that all input files in examples/ are tested. It is not activated for now, because it needs to be fixed (it is adapted to the WarpX CI syntax, but should be adjusted for the Hipace CI using ctest).

Introduced in PR #87.

HiPACE-C Units: Weight

Just as a follow-up to: #57 (comment)

I find that assigning a charge to the particle "weight" in HiPACE-C is quite confusing. When we finalize the comparison with the HiPACE-C version, we could maybe make this a unitless property (independent of its absolute scale) to avoid confusing the reader. Or we could just call it charge.

Correctness of update force terms for SI

In PR #108, the force term update was introduced.
However, it was only tested in normalized units. In SI units, it has to be tested whether the charge_mass_ratio has to be removed entirely or whether it has to be e/m_e.

Fixed-ppc Gaussian beam

While not the most useful for production runs, it would be nice (and relatively easy) to have the option to inject a Gaussian beam with a fixed number of particles per cell (ppc). In particular, a can beam has sharp boundaries, which makes it hard for the predictor-corrector loop to converge. Using a Gaussian beam instead would make things a bit easier and nicer.

In practice, I think we could rename BeamParticleContainer::InitCanBeam to BeamParticleContainer::InitBeam, remove the following arguments

             const amrex::RealVect& a_u_std,
             const amrex::RealVect& a_u_mean,
             const Real a_density,

and instead pass two functors: One that does amrex::Real weight = f(x, y, z), and one that does amrex::Real* momentum_3d = f(x, y, z).

To have an idea of what this is, one could look at this example in WarpX https://github.com/ECP-WarpX/WarpX/blob/development/Source/Particles/Filter/FilterFunctors.H, where each functor does bool whether_to_keep_particle = f(particle).
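
A sketch of the two functors (plain C++ with illustrative names; the actual version would presumably use amrex::Real and be callable on the device):

#include <array>
#include <cmath>

// Weight functor: returns the macro-particle weight at (x, y, z),
// here a Gaussian profile with peak weight w0.
struct GaussianWeight {
    double w0, x0, y0, z0, sx, sy, sz;
    double operator() (double x, double y, double z) const {
        const double ex = (x - x0)/sx, ey = (y - y0)/sy, ez = (z - z0)/sz;
        return w0 * std::exp(-0.5 * (ex*ex + ey*ey + ez*ez));
    }
};

// Momentum functor: returns the momentum at (x, y, z), here simply a constant.
struct ConstantMomentum {
    std::array<double,3> u;
    std::array<double,3> operator() (double, double, double) const { return u; }
};

InitBeam would then evaluate both functors at every particle position it generates, which keeps the fixed-ppc Gaussian beam and other profiles behind the same interface.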

Standardize way to access particle position

PR #108 introduces the modern way to get and set plasma particle positions. In other parts of the code (current deposition, maybe others) this is done differently. Let us standardize this and use the clean way with GetParticlePosition everywhere.

Error in rho after predictor corrector loop

It seems that the plasma particles deposit to rho in the predictor corrector loop, despite the statement

                    if (current_depo_type == ToSlice::This)
                    {
                        amrex::Gpu::Atomic::Add(
                            &jz_arr(lo.x+j_cell+ix, lo.y+k_cell+iy, 0),
                            sx_cell[ix]*sy_cell[iy]*wqz);
                        amrex::Gpu::Atomic::Add(
                            &rho_arr(lo.x+j_cell+ix, lo.y+k_cell+iy, 0),
                            sx_cell[ix]*sy_cell[iy]*wq);
                    }

It does work for jz though. This has to be fixed

Handle breakdown of QSA

In a simulation, it can happen that some plasma particles violate the QSA assumption. They should be handled (basically removed).

Improve naming for slices

Currently, the slice MultiFabs are accessed with something like m_slices[lev][i], where 0 <= i < 4.

Each slice has a meaning, so we could use an enum struct instead of an int i. It could be

WhichSlice::Next
WhichSlice::This
WhichSlice::Previous1
WhichSlice::Previous2

Then, the rho_ion MF mentioned in issue #122 could be just another slice, with index

WhichSlice::RhoIon
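
A sketch of what this could look like (illustrative; a plain enum nested in a struct keeps the implicit conversion to int, so the values can index m_slices[lev][...] directly, unlike a scoped enum struct):

struct WhichSlice {
    enum slice : int { Next = 0, This, Previous1, Previous2, N };
};

// usage, e.g.:
// amrex::MultiFab& slice_mf = m_slices[lev][WhichSlice::This];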

Dirichlet boundary conditions

First simulations showed weird results with periodic transverse boundary conditions. If the boundaries still significantly affect the result after #105 is implemented, we will have to implement Dirichlet BC soon after, as it would be the last piece of work needed to get physical results.

Parser should not require plasma density in input file

Currently, the parser requires the plasma density to be stated in the input file.

The can_beam example works fine. The beam_in_vacuum example crashes with

ParmParse::getval ParmParse::getval(): plasma.density not found in table

The quick and dirty fix is to simply add

plasma.density = 0
plasma.ppc = 0 0

Although most simulations will require a plasma density, this should be fixed in the parser (see the sketch below).
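
A sketch of the fix, assuming the standard amrex::ParmParse interface: use query, which leaves a default value in place when the parameter is absent, instead of get, which aborts.

#include <AMReX_ParmParse.H>
#include <AMReX_REAL.H>

void ReadPlasmaDensity (amrex::Real& density)
{
    amrex::ParmParse pp("plasma");
    density = 0.;                   // default: no plasma
    pp.query("density", density);   // optional parameter, no abort if missing
    // pp.get("density", density);  // current behavior: aborts if missing
}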

Consistent naming for pseudo-potential on plasma particles

Plasma particles currently have an attribute psi, which stands for the pseudo-potential at the particle position. The problem is that, in the code, it actually stands for psi + 1, because this is what is mostly used in the code, so we should do one of these:

  1. Rename the variable from psi to psi_p_1 (for "psi plus one") or something like that
  2. Store the pseudo-potential (and add 1 where it is needed)

I have a preference for the second option, as it is more straightforward. Note that this could affect accuracy, particularly in single precision (pushing psi and computing psi + 1 may be less accurate than pushing psi_p_1 directly, but there is a psi_p_1 - 1 that would be replaced by psi, improving accuracy).

Checksum benchmarks have too many digits

When resetting a benchmark for a good reason, we expect that only fields with a significant difference will be updated. This is however not the case, and we regularly end up with changes like

-    "Ez": 3.109040443014284e-39,
+    "Ez": 3.109040443014281e-39,

that are not meaningful (machine-precision noise) and just obfuscate the review and add spurious line changes. Benchmarks should be written with a bit fewer digits (maybe just 1 less). Note that checksums already handle a tolerance; this is just about how numbers are written to the json files.

CMake cleaning after PR 85

PR #85 slightly changes the syntax used for automated tests: now each test has its own bash script in tests/. This was originally done for flexibility (some tests just check that the simulation runs, others compare with a checksum, others have an analysis file, etc.), but as tests start to differ it also has the advantage that, by looking at this bash script, one can see exactly how the test was run (which is handy, in particular to reset the checksum benchmark properly, or to use different checksum tolerances, etc.).

If we adopt this way of doing things, we should:

  1. Check whether the syntax can be improved (for instance, all tests currently take the same input parameters $<TARGET_FILE:HiPACE> and ${HiPACE_SOURCE_DIR}, so maybe they could be put in environment variables accessed directly from the bash scripts, making the add_test lines and the bash scripts nicer)
  2. Adapt the CMake files accordingly (for instance, the variable MPI_TEST_EXE is not used anymore, and maybe neither is configure_mpiexec(2))

@ax3l could we discuss this when you have time?

Cleaning

I'll gather ideas for cleaning in the description of this PR, feel free to suggest new ideas:

  • Better accessor for things like m_fields.getSlices(lev, 1).boxArray(), e.g., m_fields.SliceBoxArray(lev)
  • We already have Fields::getSlices(lev) and Fields::getSlices(lev, icomp). We could add a Fields::getSlices(lev, icomp, fieldcomp) that would return an alias to the component requested. That way, for instance, we wouldn't have to pass the Bx component to ComputeRelBFieldError. It would be cleaner BUT we should first check that making this alias repeatedly doesn't cost performance.
  • Stop having all member variables public
  • Have const get functions (and const functions in general)

Reset plasma particles

Currently, the plasma particles are not reset, which means that at the beginning of the next time step they have the same position, momentum, etc. as they had at the end of the box in the previous time step. This allows for an infinite plasma wake which gets amplified with every passing beam! This is actually what we would want for the long-term plasma evolution (except that we would want only 1 beam in the first box, not another beam in every single box).

For meaningful simulations in the moving window of the beam, we need to reset the plasma particles at the beginning of the time step to the same values that they had at the initialization.
