Record TODO, Work-in-Progress and Done tasks in LLNL.
License: GNU Affero General Public License v3.0
They now take too much space (a few pages); we could put them side by side.
Instead of only checking the problem size, we would like to call three modeling functions to determine which device should be used.
Currently, only about 10 data points are collected to create the model. We should use many more, especially for smaller problem sizes and the region around the crossover point.
The AMR example has been fixed in Charm (charmplusplus/charm#2980).
An OpenMP CPU kernel has been added on top of it (ouankou/charm@33cfe33).
The next step would be to add omp target to the kernel for GPU offloading.
On Lassen, it's fine to load Clang 10.0 and compile the OpenMP code with GPU offloading.
However, on Pascal (NVIDIA Tesla P100), profiling the stencil program with nvprof ./pascal.out gave the error:
======== Warning: CUDA device error, GPU profiling skipped
GPU time(ms): 0.4
CPU time(ms): 0.3
verify dif =0.000000
======== Warning: No profile data collected.
It seems the code wasn't offloaded to the GPU. Instead of providing basic GPU information, nvidia-smi showed:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
A native CUDA program can be compiled, but running it threw the following error:
CUDA error at ../../Common/helper_cuda.h:791 code=100(cudaErrorNoDevice) "cudaGetDeviceCount(&device_count)"
All the commands run fine on Lassen. I'm contacting LC Hotline for help.
On the Fornax server at UNCC, the performance of running Stencil on the CPU and GPU has a crossover point, indicating that the CPU and GPU each have an advantage at different problem sizes.
On Lassen, no matter what the problem size is, the GPU is always faster. Lassen's CPU is an IBM Power9, which might be the reason it is consistently slower than the NVIDIA Tesla V100.
The papers have some more advanced models to guide loop execution.
Test Stencil on Pascal, using the number of CPU threads and problem size as parameters.
Mabrouk, L., Huet, S., Houzet, D. et al. Efficient adaptive load balancing approach for compressive background subtraction algorithm on heterogeneous CPU–GPU platforms. J Real-Time Image Proc (2019). https://doi.org/10.1007/s11554-019-00916-4
Since the subproblems are independent, they can be processed concurrently by the CPU and GPU. The question is how to distribute them to minimize the total time.
We can use multiple metadirective constructs for those portions, and the created model will guide the offloading.
It should generate a table automatically for most of the results.
https://github.com/OpenMP/Examples/tree/master/sources
https://github.com/LLNL/dataracebench/tree/master/micro-benchmarks
The result looks like the following table.
https://github.com/LLNL/dataracebench/wiki/Summary-report-Apr-2019#evaluation-platform
Summarize the papers when their terms are introduced for the first time.
Carina or Lassen.
The results should be added to both the Google sheet and Overleaf with an explanation.
Jeannot, E., Žilinskas, J., Ilic, A. and Sousa, L. (2014). Efficient Multilevel Load Balancing on Heterogeneous CPU + GPU Systems. In High‐Performance Computing on Complex Environments (eds E. Jeannot and J. Žilinskas). https://doi.org/10.1002/9781118711897.ch14
The original research was for load balancing, but its idea may also be suitable for metadirective.
In the AMR cell, a flag and a performance record could be set. The first 20-50 iterations could be used for profiling, during which the CPU and GPU performance data are stored.
The flag is set to 0 initially. After profiling, it is set to 1 or 2, corresponding to the CPU or GPU (or another value for any other computing device). This profiling step is overhead, but considering that an AMR application usually has thousands of iterations or more, 20 iterations is not significant.
To be safe, the file should be placed in /tmp and never be executed. The user can then download the file from the frontend.
The AMR example in Charm now works as a complete application. It demonstrates that the optimization from metadirective can be isolated inside a cell without affecting the outer layers, such as MPI. OpenMP support has been added, and the computing can be performed on either the CPU or the GPU in OpenMP.
However, due to a bug in Charm's AMR library, a segmentation fault is triggered when the Jacobi problem size exceeds roughly 600. This issue has nothing to do with the computing kernel, so I'll create a separate program to test the Jacobi kernel without the original framework. Whether the testing is conducted in the framework or standalone does not affect our conclusion. If the bug is fixed in the future, this separate program can be integrated back with minor changes.
With this Jacobi program, we can have two kernels for evaluation. It's probably enough.
Given a series of fixed problem sizes: 16, 32, 64, ...
For each problem size, it runs 5000 iterations or more.
Three groups are compared: pure CPU, pure GPU, and CPM-based metadirective.
Surface has 36 CPU cores. Assuming a compute node handles 4 AMR grids, 8 CPU threads are used in OpenMP for each grid on a single node.
Given a series of problem sizes: 16, 32, 64, ...
For each problem size, it runs 2000 iterations. All the iterations together are considered one test.
Four groups are compared: pure CPU, pure GPU, CPM-based metadirective, and FPM-based metadirective.
Jeannot, E., Žilinskas, J., Ilic, A. and Sousa, L. (2014). Efficient Multilevel Load Balancing on Heterogeneous CPU + GPU Systems. In High‐Performance Computing on Complex Environments (eds E. Jeannot and J. Žilinskas). https://doi.org/10.1002/9781118711897.ch14
Constant performance modeling (CPM) is discussed in the paper above. It is only suitable for a group of discrete problem sizes, so it is not as adaptive as functional performance modeling (FPM). However, it fits the AMR example we discussed well, since inside an AMR grid the problem size can be a constant value or a small set of constant values.
CPM requires far fewer resources to create a model than FPM because it only covers a finite number of problem sizes; in some cases, only one.
The initial description of this model has been added to Overleaf.
AMR frameworks mostly use a complex class to store the data and methods. However, OpenMP doesn't allow specifying a class member directly in a map clause.
In the AMR example of Charm, the local data in a cell is stored in the class members dataGrid and newDataGrid. Mapping them directly leads to the following error:
jacobi2DAMR.C: In member function ‘virtual void Jacobi2DAMR::doComputation()’:
jacobi2DAMR.C:64:137: error: ‘Jacobi2DAMR::cellSize’ is not a variable in ‘map’ clause
#pragma omp target teams distribute parallel for map(to: cellSize, dataGrid[0:N]) map(tofrom: newDataGrid[0:N]) collapse(2)
jacobi2DAMR.C:64:81: error: ‘Jacobi2DAMR::dataGrid’ is not a variable in ‘map’ clause
#pragma omp target teams distribute parallel for map(to: cellSize, dataGrid[0:N]) map(tofrom: newDataGrid[0:N]) collapse(2)
...
In this case, the data structure is already very simple: just a local array. Other codes, such as Gamer or 3D-LBM-AMR, have even more complicated data structures spread across multiple locations for the computing task.
I think this is a design issue and not directly related to our focus, which is how to use metadirective to speed up the computation itself. For Charm, I manually copy the data from the class members to regular variables and copy them back after computing. This step doesn't count as overhead and is excluded from the time measurement. For now, I don't plan to address this kind of problem. Note that the time cost of transferring the created variables between host and device is counted as part of the GPU offloading.
Commit ouankou/charm@5bf78db shows my workaround.
In 3D-LBM-AMR and Gamer, omp parallel for is widely used for computing on the CPU. If the involved data are properly mapped, we can offload the computing to the GPU.
However, both of them use complex data structures that hold both the data and the computing methods.
https://github.com/ouankou/3D-LBM-AMR/blob/master/src/Node.h
class Node
{
public:
double u0,u1,u2,usqr,rho,xcoord,ycoord,zcoord,nu0;
// double u0_prev,u1_prev,u_error,u_sum;
// double m1eq,m2eq,m4eq,m6eq,m7eq,m8eq;//moments (starting from 0, to 8)
// double m1,m2,m4,m6,m7,m8;//moments (starting from 0, to 8)
// double Q11,Q12,Q21,Q22,Q,Smag,OmegaStar,Cs;
double Smag,OmegaStar,Cs;
// double PI11, PI22, PI33, PI12, PI13, PI23;
// double PI11,PI22,PI12;
double u0avg, u1avg, u2avg;
int image,level,outnumb,edge;
bool feqflag;//1 if MeqToFeq was used
Node * Parent;
std::vector<double> f,feq,ftemp,delta,m,meq;
std::vector<Node *> nb;
Node * child[8];
Node();
Node(Node& parent);
void Initialize(double u0_ini,double u1_ini,double u2_ini,double rho_ini,int x,int y,int z);
void SetImage(int Image_val);
void ComputeM();//compute M from f and meq
void ComputeFeq();
void Collide();
void ComputeMacros();
void VelAverage(int tStep);//compute time averaged velocity
void Symmetry();
void Periodic(Node *);
void BounceBack();
void CurvedB();
void BBForce(double& ForceX,double& ForceY,double& ForceZ,int BBimage);
void BoundaryMacro();
void InletOutlet();
void RegularizedBoundary();
void DeleteNode();
void Refine();
void ChildNeighbor();
void Stream1();
void Stream2();
void SpatialInterp1();
void SpatialInterp2();
void SpatialAverage();
void TemporalInterp();
void TemporalInterp2();
void MeqToFeq();
void MToF();
void ComputeDelta(double obX,double obY,double obX2,double obY2,double R);
// void Outlet2();
};
The program has a 3D array of this object. We could either figure out a way to map this 3D array to the GPU, which would need far fewer changes to the code, or extract the involved data into a new data array and rewrite a kernel function based on the current computing methods.
It may have 4 subfigures.
A grid covers multiple cells. Different levels of grids have different numbers of cells.
This procedure is called regridding. After this, a grid is fed to the GPU or CPU to calculate the values of its cells.
parallel, metadirective, for, parallel for
It's used to help FreeCompilerCamp create some tutorials. Since EDG is not open source, it's difficult to add new nodes in ROSE without the EDG source code. Using Clang as the frontend instead may solve the problem.
Currently, Clang support in ROSE is quite limited. For example, the Clang:Var type, which is related to variable declarations, is not supported yet. At the moment, only Clang 9.x works with the ROSE development branch.
Add more tech details of multiple server/docker images support.
Add more description of ROSE compiler.
https://github.com/ouankou/3D-LBM-AMR
With OpenMP disabled, the program runs fine.
With OpenMP enabled, it gives the following error:
...
Iter: 145
Iter: 146
Iter: 147
Iter: 148
Iter: 149
Iter: 150
Iter: 151
Velocity explosion124, 0, 29
Negative density at124, 0, 29
Velocity explosion124, 0, 31
Velocity explosion125, 0, 30
Error has occured. Exiting at timestep: 151
There might be a data race caused by some OpenMP code.
Source code:
https://github.com/chunhualiao/homp-demo/tree/master/xomp-gen
Currently, the example has to be copied manually to a separate folder, and the input parameter file has to be changed manually for each problem size.
The output is always overwritten by later tests.
From Dr. Liao:
The goal is to consider both model accuracy and overhead. There may be some existing metrics like this. Please find them out.
In investment, there is a concept of risk-adjusted return. https://www.investopedia.com/terms/r/riskadjustedreturn.asp
You can borrow this concept and apply it to models, by defining something like overhead-adjusted accuracy.
The -fopenmp option should be added; otherwise, the OpenMP information gets lost before ROSE iterates over the Clang AST.
2-page work.
First, list the bullet points and a summary; add the details later.
The text is too long and hard to understand. A figure should be added to help the explanation.
The baseline should be the pure GPU version. The speedup of each model over the baseline should be listed.
In our research, only the OpenMP part is studied. We need to make sure the proposal could be applied to an MPI+X architecture without much code change. It should not conflict with the parallelism handled by the MPI layer.