Record TODO, Work-in-Progress and Done tasks in LLNL.
License: GNU Affero General Public License v3.0
They now take too much space (a few pages); we could put them side by side.
Instead of only checking the problem size, we would like to call three modeling functions to determine which device should be used.
Currently, only about 10 data points are collected to create the model. We should use many more, especially for smaller problem sizes and the region around the crossover point.
The AMR example has been fixed in Charm (charmplusplus/charm#2980).
An OpenMP CPU kernel has been added on top of it (ouankou/charm@33cfe33).
The next step would be to add omp target to the kernel for GPU offloading.
On Lassen, it's fine to load Clang 10.0 and compile the OpenMP code with GPU offloading.
However, on Pascal (NVIDIA Tesla P100), profiling the stencil program with nvprof ./pascal.out gave the error:
======== Warning: CUDA device error, GPU profiling skipped
GPU time(ms): 0.4
CPU time(ms): 0.3
verify dif =0.000000
======== Warning: No profile data collected.
It seems the code wasn't offloaded to the GPU. Instead of providing basic GPU information, nvidia-smi showed:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
A native CUDA program can be compiled, but running it threw the following error:
CUDA error at ../../Common/helper_cuda.h:791 code=100(cudaErrorNoDevice) "cudaGetDeviceCount(&device_count)"
All the commands run fine on Lassen. I'm contacting LC Hotline for help.
On the Fornax server at UNCC, the performance of running Stencil on the CPU and GPU has a crossover point, indicating that the CPU and GPU each have an advantage at different problem sizes.
On Lassen, no matter what the problem size is, the GPU is always faster. Lassen's CPU is an IBM Power9, which might be the reason it is consistently slower than the NVIDIA Tesla V100.
The papers have some more advanced models to guide loop execution.
Test Stencil on Pascal, using the number of CPU threads and problem size as parameters.
Mabrouk, L., Huet, S., Houzet, D. et al. Efficient adaptive load balancing approach for compressive background subtraction algorithm on heterogeneous CPU–GPU platforms. J Real-Time Image Proc (2019). https://doi.org/10.1007/s11554-019-00916-4
Since the subproblems are independent, they can be processed concurrently by the CPU and GPU. The question is how to distribute them to minimize the total time.
We can use multiple metadirective constructs for those portions, and the created model will guide the offloading.
It should generate a table automatically for most of the results.
https://github.com/OpenMP/Examples/tree/master/sources
https://github.com/LLNL/dataracebench/tree/master/micro-benchmarks
The result looks like the following table.
https://github.com/LLNL/dataracebench/wiki/Summary-report-Apr-2019#evaluation-platform
Summarize the papers when their terms are introduced for the first time.
Carina or Lassen.
The results should be added to both the Google sheet and Overleaf with an explanation.
Jeannot, E., Žilinskas, J., Ilic, A. and Sousa, L. (2014). Efficient Multilevel Load Balancing on Heterogeneous CPU + GPU Systems. In High‐Performance Computing on Complex Environments (eds E. Jeannot and J. Žilinskas). https://doi.org/10.1002/9781118711897.ch14
The original research was for load balancing, but its idea may also be suitable for metadirective.
In the AMR cell, a flag and a performance record could be set. The first 20-50 iterations could be used for profiling, during which the CPU and GPU performance data are stored.
The flag is set to 0 initially. After profiling, it is set to 1 or 2, corresponding to the CPU or GPU (or another value for any other computing device). This profiling step is overhead, but considering that an AMR application usually has thousands of iterations or more, 20 iterations is not significant.
To be safe, the file should be placed in /tmp and never be executed. The user can then download the file from the frontend.
The AMR example in Charm now works as a complete application. It demonstrates that the optimization from metadirective can be isolated inside a cell without affecting the outer layers, such as MPI. OpenMP support has been added, and the computing can be performed on either the CPU or the GPU in OpenMP.
However, due to a bug in Charm's AMR library, a segmentation fault is triggered when the Jacobi problem size exceeds roughly 600. This issue has nothing to do with the computing kernel, so I'll create a separate program to test the Jacobi kernel without the original framework. Whether the testing is conducted in the framework or standalone does not affect our conclusion. If the bug is fixed in the future, this separate program can be integrated back with minor changes.
With this Jacobi program, we can have two kernels for evaluation. It's probably enough.
Given a series of fixed problem sizes: 16, 32, 64, ...
For each problem size, it runs 5000 iterations or more.
Three groups are compared: pure CPU, pure GPU, and CPM-based metadirective.
Surface has 36 CPU cores. Assuming a compute node handles 4 AMR grids, 8 CPU threads are used in OpenMP for each grid on a single node.
Given a series of problem sizes: 16, 32, 64, ...
For each problem size, it runs 2000 iterations. All the iterations together are considered one test.
Four groups are compared: pure CPU, pure GPU, CPM-based metadirective, and FPM-based metadirective.
Jeannot, E., Žilinskas, J., Ilic, A. and Sousa, L. (2014). Efficient Multilevel Load Balancing on Heterogeneous CPU + GPU Systems. In High‐Performance Computing on Complex Environments (eds E. Jeannot and J. Žilinskas). https://doi.org/10.1002/9781118711897.ch14
Constant performance modeling (CPM) is discussed in the paper above. It is only suitable for a group of discrete problem sizes, so it is not as adaptive as functional performance modeling (FPM). However, it fits the AMR example we discussed well, since inside an AMR grid the problem size can be a constant value or a small set of constant values.
CPM requires far fewer resources to create a model than FPM because it only covers a finite number of problem sizes; in some cases, only one.
The initial description of this model has been added to Overleaf.
AMR frameworks mostly use a complex class to store the data and methods. However, OpenMP doesn't allow specifying a class member directly in a map clause.
In the AMR example of Charm, the local data in a cell is stored in the class members dataGrid and newDataGrid. Mapping them directly leads to the following error:
jacobi2DAMR.C: In member function ‘virtual void Jacobi2DAMR::doComputation()’:
jacobi2DAMR.C:64:137: error: ‘Jacobi2DAMR::cellSize’ is not a variable in ‘map’ clause
#pragma omp target teams distribute parallel for map(to: cellSize, dataGrid[0:N]) map(tofrom: newDataGrid[0:N]) collapse(2)
jacobi2DAMR.C:64:81: error: ‘Jacobi2DAMR::dataGrid’ is not a variable in ‘map’ clause
#pragma omp target teams distribute parallel for map(to: cellSize, dataGrid[0:N]) map(tofrom: newDataGrid[0:N]) collapse(2)
...
In this case, the data structure is already very simple: just a local array. Other codes, such as Gamer or 3D-LBM-AMR, have even more complicated data structures spread across multiple locations for the computing task.
I think this is a design issue and not directly related to our focus, which is how to use metadirective to speed up the computation itself. For Charm, I manually copy the data from the class members to regular variables and copy them back after computing. This step doesn't count as overhead and is excluded from the time measurement. For now, I don't plan to address this kind of problem. Note that the time cost of transferring the created variables between host and device is counted as part of the GPU offloading.
Commit ouankou/charm@5bf78db shows my workaround.
In 3D-LBM-AMR and Gamer, omp parallel for is widely used for computing on the CPU. If the involved data are properly mapped, we can offload the computing to the GPU.
However, both of them use complex data structures that hold both the data and the computing methods.
https://github.com/ouankou/3D-LBM-AMR/blob/master/src/Node.h
class Node
{
public:
double u0,u1,u2,usqr,rho,xcoord,ycoord,zcoord,nu0;
// double u0_prev,u1_prev,u_error,u_sum;
// double m1eq,m2eq,m4eq,m6eq,m7eq,m8eq;//moments (starting from 0, to 8)
// double m1,m2,m4,m6,m7,m8;//moments (starting from 0, to 8)
// double Q11,Q12,Q21,Q22,Q,Smag,OmegaStar,Cs;
double Smag,OmegaStar,Cs;
// double PI11, PI22, PI33, PI12, PI13, PI23;
// double PI11,PI22,PI12;
double u0avg, u1avg, u2avg;
int image,level,outnumb,edge;
bool feqflag;//1 if MeqToFeq was used
Node * Parent;
std::vector<double> f,feq,ftemp,delta,m,meq;
std::vector<Node *> nb;
Node * child[8];
Node();
Node(Node& parent);
void Initialize(double u0_ini,double u1_ini,double u2_ini,double rho_ini,int x,int y,int z);
void SetImage(int Image_val);
void ComputeM();//compute M from f and meq
void ComputeFeq();
void Collide();
void ComputeMacros();
void VelAverage(int tStep);//compute time averaged velocity
void Symmetry();
void Periodic(Node *);
void BounceBack();
void CurvedB();
void BBForce(double& ForceX,double& ForceY,double& ForceZ,int BBimage);
void BoundaryMacro();
void InletOutlet();
void RegularizedBoundary();
void DeleteNode();
void Refine();
void ChildNeighbor();
void Stream1();
void Stream2();
void SpatialInterp1();
void SpatialInterp2();
void SpatialAverage();
void TemporalInterp();
void TemporalInterp2();
void MeqToFeq();
void MToF();
void ComputeDelta(double obX,double obY,double obX2,double obY2,double R);
// void Outlet2();
};
The program has a 3D array of this object. We could either figure out a way to map this 3D array to the GPU, which would need far fewer changes to the code, or extract the involved data into a new data array and rewrite a kernel function based on the current computing methods.
It may have 4 subfigures.
A grid covers multiple cells. Different levels of grids have different numbers of cells.
This procedure is called regridding. After this, a grid is fed to the GPU or CPU to calculate the values of its cells.
parallel, metadirective, for, parallel for
It's used to help FreeCompilerCamp create some tutorials. Since EDG is not open source, it's difficult to add new nodes in ROSE without the EDG source code. Using Clang as the frontend instead may solve the problem.
Currently, Clang support in ROSE is quite limited. For example, the Clang:Var type, which is related to variable declarations, is not supported yet. At the moment, only Clang 9.x works with the ROSE development branch.
Add more tech details of multiple server/docker images support.
Add more description of ROSE compiler.
https://github.com/ouankou/3D-LBM-AMR
With OpenMP disabled, the program runs fine.
With OpenMP enabled, it gives the following error:
...
Iter: 145
Iter: 146
Iter: 147
Iter: 148
Iter: 149
Iter: 150
Iter: 151
Velocity explosion124, 0, 29
Negative density at124, 0, 29
Velocity explosion124, 0, 31
Velocity explosion125, 0, 30
Error has occured. Exiting at timestep: 151
There might be a data race caused by some OpenMP code.
Source code:
https://github.com/chunhualiao/homp-demo/tree/master/xomp-gen
Currently, the example has to be copied manually to a separate folder, and the input parameter file has to be changed manually for each problem size.
The output is always overwritten by later tests.
From Dr. Liao:
The goal is to consider both model accuracy and overhead. There may be some existing metrics like this. Please find them out.
In investment, there is a concept of risk-adjusted return. https://www.investopedia.com/terms/r/riskadjustedreturn.asp
You can borrow this concept and apply it to models, by defining something like overhead-adjusted accuracy.
The -fopenmp option should be added; otherwise, the OpenMP information gets lost before ROSE iterates over the Clang AST.
2-page work.
First, list the bullet points and a summary; add the details later.
The text is too long and hard to understand. A figure should be added to help the explanation.
The baseline should be the pure GPU version. The speedup of each model over the baseline should be listed.
In our research, only the OpenMP part is studied. We need to make sure the proposal could be applied to an MPI+X architecture without much code change. It should not conflict with the parallelism handled by the MPI layer.