
llnl-work-notes's Introduction

Hi there 👋



llnl-work-notes's Issues

OpenMP implementation in 3D-LBM-AMR caused incorrect result

https://github.com/ouankou/3D-LBM-AMR
With OpenMP disabled, the program runs fine.

With OpenMP enabled, it gives the following error.

...
Iter: 145
Iter: 146
Iter: 147
Iter: 148
Iter: 149
Iter: 150
Iter: 151
Velocity explosion124, 0, 29
Negative density at124, 0, 29
Velocity explosion124, 0, 31
Velocity explosion125, 0, 30
Error has occured. Exiting at timestep: 151

There might be a data race caused by some of the OpenMP code.

CPU performance on Lassen is too low

On the Fornax server at UNCC, the CPU and GPU performance curves for Stencil cross over at some problem size, indicating that each device has an advantage for different problem sizes.
On Lassen, no matter what the problem size is, the GPU is always faster. Lassen's CPU is an IBM Power9, which might be the reason it is consistently slower than the NVIDIA Tesla V100.

Can't map class members via map clause in OpenMP

AMR frameworks mostly use a complex class to store both data and methods. However, OpenMP does not allow a class member to be specified in a map clause.

In the AMR example of Charm, the local data in a cell is stored in the class members dataGrid and newDataGrid. Mapping them directly leads to the following error.

jacobi2DAMR.C: In member function ‘virtual void Jacobi2DAMR::doComputation()’:
jacobi2DAMR.C:64:137: error: ‘Jacobi2DAMR::cellSize’ is not a variable in ‘map’ clause
 #pragma omp target teams distribute parallel for map(to: cellSize, dataGrid[0:N]) map(tofrom: newDataGrid[0:N]) collapse(2)
jacobi2DAMR.C:64:81: error: ‘Jacobi2DAMR::dataGrid’ is not a variable in ‘map’ clause
 #pragma omp target teams distribute parallel for map(to: cellSize, dataGrid[0:N]) map(tofrom: newDataGrid[0:N]) collapse(2)
...

In this case, the data structure is already very simple: each member is just a local array. Other applications, such as Gamer or 3D-LBM-AMR, have even more complicated data structures spread across multiple locations for a computing task.

I think this is a design issue that is not directly related to our focus, which is how to use metadirective to speed up the computation itself. For Charm, I manually copy the data from the class members to regular variables and copy them back after computing. This step does not count as overhead and is excluded from the time measurement. For now, I don't plan to address this kind of problem. Note that the time cost of transferring the created variables between host and device is counted as part of GPU offloading.

Commit ouankou/charm@5bf78db shows my workaround.

Compare FPM performance

Given a series of problem sizes: 16, 32, 64, ...

For each problem size, it runs 2000 iterations. All the iterations together are considered one test.
Four groups are compared: pure CPU, pure GPU, CPM-based metadirective, and FPM-based metadirective.

Test Stencil on Pascal

Test Stencil on Pascal, using the number of CPU threads and problem size as parameters.

Compare the performance of CPM

Given a series of fixed problem sizes: 16, 32, 64, ...

For each problem size, it runs 5000 iterations or more.
Three groups are compared: pure CPU, pure GPU, and CPM-based metadirective.

Add omp target to omp parallel for in AMR applications

In 3D-LBM-AMR and Gamer, omp parallel for is widely used for computing on the CPU. If the involved data are properly mapped, we can offload the computation to the GPU.
However, both of them use complex data structures that hold both data and computing methods.
https://github.com/ouankou/3D-LBM-AMR/blob/master/src/Node.h

class Node
{
public:
    double u0, u1, u2, usqr, rho, xcoord, ycoord, zcoord, nu0;
//  double u0_prev, u1_prev, u_error, u_sum;
//  double m1eq, m2eq, m4eq, m6eq, m7eq, m8eq; // moments (starting from 0, to 8)
//  double m1, m2, m4, m6, m7, m8; // moments (starting from 0, to 8)
//  double Q11, Q12, Q21, Q22, Q, Smag, OmegaStar, Cs;
    double Smag, OmegaStar, Cs;
//  double PI11, PI22, PI33, PI12, PI13, PI23;
//  double PI11, PI22, PI12;
    double u0avg, u1avg, u2avg;
    int image, level, outnumb, edge;
    bool feqflag; // 1 if MeqToFeq was used
    Node* Parent;
    std::vector<double> f, feq, ftemp, delta, m, meq;
    std::vector<Node*> nb;
    Node* child[8];
    Node();
    Node(Node& parent);
    void Initialize(double u0_ini, double u1_ini, double u2_ini, double rho_ini, int x, int y, int z);
    void SetImage(int Image_val);
    void ComputeM(); // compute M from f and meq
    void ComputeFeq();
    void Collide();
    void ComputeMacros();
    void VelAverage(int tStep); // compute time-averaged velocity
    void Symmetry();
    void Periodic(Node*);
    void BounceBack();
    void CurvedB();
    void BBForce(double& ForceX, double& ForceY, double& ForceZ, int BBimage);
    void BoundaryMacro();
    void InletOutlet();
    void RegularizedBoundary();
    void DeleteNode();
    void Refine();
    void ChildNeighbor();
    void Stream1();
    void Stream2();
    void SpatialInterp1();
    void SpatialInterp2();
    void SpatialAverage();
    void TemporalInterp();
    void TemporalInterp2();
    void MeqToFeq();
    void MToF();
    void ComputeDelta(double obX, double obY, double obX2, double obY2, double R);
//  void Outlet2();
};

It has a 3D array of this object. We could either figure out a way to map this 3D array to the GPU, which would require far fewer code changes, or pick out the involved data to create new data arrays and rewrite a kernel function based on the current computing methods.

Test Jacobi kernel on CPU and GPU

Now the AMR example in Charm works as a complete application. It demonstrates that the optimization from metadirective can be isolated inside a cell without affecting the outer layers, such as MPI. OpenMP support has been added, and the computation can be performed on either CPU or GPU through OpenMP.

However, due to a bug in Charm's AMR library, a segmentation fault is triggered when the Jacobi problem size exceeds roughly 600. This issue has nothing to do with the computing kernel, so I'll create a separate program to test the Jacobi kernel without the original framework. Whether the testing is conducted inside the framework or standalone does not affect our conclusion. If the bug is fixed in the future, this separate program can be integrated back with minor changes.

With this Jacobi program, we can have two kernels for evaluation. It's probably enough.

Create a metadirective guiding model named constant performance modeling

Jeannot, E., Žilinskas, J., Ilic, A. and Sousa, L. (2014). Efficient Multilevel Load Balancing on Heterogeneous CPU + GPU Systems. In High‐Performance Computing on Complex Environments (eds E. Jeannot and J. Žilinskas). https://doi.org/10.1002/9781118711897.ch14

Constant performance modeling (CPM) is discussed in the paper above. It is only suitable for a group of discrete problem sizes, so it is not as adaptive as functional performance modeling (FPM). However, it fits the AMR example we discussed well, since inside an AMR grid the problem size can be a constant value or a small set of constant values.
CPM requires far fewer resources to build a model than FPM because it only covers a finite number of problem sizes; in some cases, only one.

The initial description of this model has been added to Overleaf.

Cannot access the GPU on Pascal system

On Lassen, it's fine to load Clang 10.0 and compile the OpenMP code with GPU offloading.

However, on Pascal (NVIDIA Tesla P100), profiling the stencil program with nvprof ./pascal.out gave the error:

======== Warning: CUDA device error, GPU profiling skipped
GPU time(ms): 0.4
CPU time(ms): 0.3
verify dif =0.000000
======== Warning: No profile data collected.

It seems the code wasn't offloaded to GPU.

Instead of providing basic GPU information, nvidia-smi showed:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

A native CUDA program compiled, but running it threw the following error.

CUDA error at ../../Common/helper_cuda.h:791 code=100(cudaErrorNoDevice) "cudaGetDeviceCount(&device_count)"

All the commands run fine on Lassen. I'm contacting LC Hotline for help.

Revise ODDC description

The text is too long and hard to understand. A figure should be added to aid the explanation.

Test 2D stencil on Surface

Surface has 36 CPU cores. Assume this compute node will handle 4 AMR grids, so 8 CPU threads are used in OpenMP for each grid.

Support a small set of OpenMP directives in ROSE with Clang

  1. parallel
  2. metadirective
  3. for
  4. parallel for

It's used to help FreeCompilerCamp create some tutorials. Since EDG is not open source, it is difficult to add new nodes in ROSE without the EDG source code. Using Clang as the frontend instead may solve the problem.
Currently, the Clang support in ROSE is quite limited. For example, the Clang:Var type, which is related to variable declarations, is not supported yet. At this moment, only Clang 9.x works with the ROSE development branch.

Create a model for the computing that has multiple independent stages or portions

Mabrouk, L., Huet, S., Houzet, D. et al. Efficient adaptive load balancing approach for compressive background subtraction algorithm on heterogeneous CPU–GPU platforms. J Real-Time Image Proc (2019). https://doi.org/10.1007/s11554-019-00916-4

Since the subproblems are independent, they can be processed concurrently by the CPU and GPU. The question is how to distribute them so that the total time is minimized.
We can have multiple metadirectives for those portions, and the created model will guide the offloading.

Check the role of MPI in MPI+X

In our research, only the OpenMP part is studied. We need to make sure the proposal can be applied to MPI+X architectures without many code changes. It should not conflict with the parallelism handled by the MPI layer.

Create a figure to explain AMR

It may have 4 subfigures.

A grid covers multiple cells. Different levels of grids have different numbers of cells.

  1. State 0: grid 0 has two children, grids 1 and 2. Grid 1 then has its own children, grids 3 and 4.
  2. Tree 0: 5 grids in a tree.
  3. State 1: at some point, grid 3 is no longer needed, so we remove it and coarsen that section of grid 1. Meanwhile, a section of grid 2 is getting hot, so we refine it to add a child grid 5 to grid 2.
  4. Tree 1: 5 grids in a tree; grid 3 is removed and grid 5 is added.

This procedure is called regridding. Afterwards, each grid is fed to the GPU or CPU to calculate the values of its cells.

Create a model to evaluate CPU and GPU performance dynamically for metadirective branching

Jeannot, E., Žilinskas, J., Ilic, A. and Sousa, L. (2014). Efficient Multilevel Load Balancing on Heterogeneous CPU + GPU Systems. In High‐Performance Computing on Complex Environments (eds E. Jeannot and J. Žilinskas). https://doi.org/10.1002/9781118711897.ch14

The original research was for load balancing, but its idea may also be suitable for metadirective.
In each AMR cell, a flag and a performance record could be kept. The first 20-50 iterations could be used for profiling, storing the measured CPU and GPU performance.
The flag is set to 0 initially. After profiling, it is set to 1 or 2, corresponding to the CPU, the GPU, or any other computing device. The profiling step is overhead, but considering that an AMR application usually has thousands of iterations or more, 20 iterations is not significant.
