xtra-computing / thundergp Goto Github PK

View Code? Open in Web Editor NEW

136.0 10.0 34.0 7.26 MB

HLS-based Graph Processing Framework on FPGAs

License: Apache License 2.0

Makefile 5.87% C 29.22% C++ 62.80% Shell 1.47% MATLAB 0.58% Tcl 0.06%

graph-processing fpgas graph-analytic-algorithm xilinx hls programmability accelerators

thundergp's Introduction

ThunderGP: HLS-based Graph Processing Framework on FPGAs

What's new?

ThunderGP for HBM-Enabled FPGA platforms has been released. Please see the branch.

ThunderGP has been published in ACM Transactions on Reconfigurable Technology and Systems (TRETS), as one of "Best Papers in FPGA 2021".

ThunderGP won the third place in 2020 Xilinx Adaptive Computing Developer Contest, top 9 out of 79 teams.

ThunderGP is accepted to be FPGA 2021. Read the paper.

ThunderGP is featured at Xilinx Apps and Libraries.

ThunderGP was presented at XACC@NUS Workshop Series 2020: Reconfigurable Computing Systems. see Slides, Video/Youtube, Video/bilibili.

Introduction

ThunderGP enables data scientists to enjoy the performance of FPGA-based graph processing without compromising programmability. To our best knowledge and experiments, this is the fastest graph processing framework on HLS-based FPGAs.

Two aspacts make the ThunderGP deliver superior performance. On the one hand, ThunderGP embraces an improved execution flow to better exploit the pipeline parallelism of FPGA and alleviate the data access amount to the global memory. On the other hand, the memory accesses are highly optimized to fully utilize the memory bandwidth capacity of the hardware platforms.

On Xilinx multi-SLR based FPGAs, it is running at 250Mhz, and the performance can be up to 6400 MTEPS (million traversed edges per second), or a 2.9 times speedup over the state-of-the-art.

Prerequisites

The gcc-9.3
Tools:
- SDAccel 2018.3 Design Suit
- SDAccel 2019.1 Design Suit
Evaluated platforms from Xilinx:
- Xilinx Virtex UltraScale+ FPGA VCU1525 Acceleration Development Kit (SDAccel 2018.3)
- Alveo U200 Data Center Accelerator Card (SDAccel 2019.1)
- Alveo U250 Data Center Accelerator Card (SDAccel 2019.1)

Run the code

ThunderGP currently has seven build-in graph algorithms: PageRank (PR), Sparse Matrix-Vector Multiplication (SpMV), Breadth-First Search (BFS), Single Source Shortest Path (SSSP), Closeness Centrality (CC), ArticleRank (AR), and Weakly Connected Component (WCC). The desired application can be implemented by passing argument app=[the algorithm] to make command. The below table is for quick reference.

Argument	Accelerated algorithm
`app=pr`	PageRank (PR)
`app=spmv`	Sparse Matrix-vector Multiplication (SpMV)
`app=bfs`	Breadth First Search (BFS)
`app=sssp`	Single Source Shortest Path (SSSP)
`app=cc`	Closeness Centrality (CC)
`app=ar`	ArticleRank (AR)
`app=wcc`	Weakly Connected Component (WCC)

Here is the example of implementing the accelerator for PageRank on Alveo U250 platform with SDAccel 2019.1.

$ git clone https://github.com/Xtra-Computing/ThunderGP.git
$ cd ./
$ vim ThunderGP.mk 
$ # configure the DEVICE as DEVICES := xilinx_u250_xdma_201830_2; configure TARGETS := hw
$ make app=pr all # make the host execution program and the FPGA bitstream. It takes time :)
# For execution on real hardware. The path of graph dataset needs to be provided by the user. 
$ ./host_graph_fpga_pr xclbin_pr/*.xclbin ./dataset/rmat-14-32.txt

More details: Compiling ThunderGP

Results (performance)

Throughput (MTEPS) of different graph processing algorithms over datasets on VCU1525 platform.

App.	PR	SPMV	BFS	SSSP	CC	AR	WCC
R21	5,015	4,190	5,417	3,901	4,623	4,848	4,584
R24	4,599	3,781	3,437	3,430	4,339	4,486	4,328
G24	5,039	4,037	5,216	3,725	4,752	4,927	4,704
G25	4,464	3,615	4,660	3,343	4,344	4,389	4,356
WT	2,884	2,874	2,717	2,427	2,776	2,833	2,776
MG	4,454	3,883	4,939	3,699	4,077	4,285	4,088
PK	4,001	3,729	4,251	3,169	3,833	3,909	3,716
WP	3,030	2,994	3,112	2,491	2,993	2,931	2,894
LJ	3,186	3,003	3,408	2,623	3,113	3,081	3,099
TW	2,938	2,801	2,120	2,425	2,962	2,853	2,894

Throughput (MTEPS) of different graph processing algorithms over datasets on U250 platform.

App.	PR	SPMV	BFS	SSSP	CC	AR	WCC
R21	4,669	5,056	6,028	4,879	4,783	4,667	4,901
R24	4,732	4,946	5,897	4,285	4,939	4,732	4,988
G24	5,040	5,305	5,772	4,428	3,705	5,040	5,303
G25	4,978	4,072	4,974	3,864	3,661	4,984	5,254
WT	2,251	2,938	2,630	2,583	2,369	2,253	2,405
MG	3,756	4,195	4,949	4,378	3,914	3,737	3,891
PK	3,630	4,372	4,629	3,927	3,865	3,662	3,841
WP	3,255	3,652	4,058	3,417	3,341	3,259	3,432
LJ	3,342	3,693	4,329	3,614	3,557	3,328	3,708
TW	3,538	3,959	3,671	3,585	3,759	3,533	3,806

Datasets and More Results

APIs (programmability)

Benefiting from the high level abstraction of HLS, our APIs natively support C/C++ languages.
ThunderGraph covers three levels of API for implementation or further exploration. APIs in L1 and L2 are for building the accelerators, and APIs of L3 are for host program. Details are as below:

L1 is used to construct the basic modules to build the compute kernels and the dataflow.
L2 provides hooks for mapping graph processing algorithms.
- Mapping new graph analytic algorithms
L3 provides the high-level APIs on host side to deploy or control graph processing accelerator. Since recent FPGAs usually consist of multiple (SLRs), L3 also wraps the partition scheduling and memory management interface for multiple SLRs.
More details: ThunderGP APIs

Framework Overview

The Adopted Computation Model

The Gather-Apply-Scatter (GAS) model is widely used for FPGA-based graph processing frameworks as computation model due to its extensibility to various graph processing algorithms. ThunderGP adopts a simplified version of GAS model by following work On-the-fly-data-shuffling-for-OpenCL-based-FPGAs. This model updates the vertex property by propagating from source vertex to destination vertex. The input for the model is an unordered set of directed edges of the graph. Undirected edges in a graph can be represented by a pair of directed edges.

The process per iteration mainly contains three stages: Scatter, Gather, and Apply.

In the Scatter stage (shown in line 2 to 6), for each input edge with format <src, dst, weight>, an update tuple is generated for the destination vertex of the edge. The update tuple is of the format <dst, value>, where the dst is the destination vertex of the edge and value is generated by processing the vertex properties and edge weights.
In the Gather stage (shown in line 7 to 9), all the update tuples generated in the Scatter stage are accumulated to update destination vertices.
The final Apply stage (shown in line 10 to 12) executes an apply function on all the vertices of the graph.

The Execution Flow of ThunderGP

As shown in the above diagram, The edges in one partition are streamed into Scatter stage, For each edges, the property of source vertices will be fetched from the global memory by the per-fetching and the cache module, at the same time, the property of corresponding edge, or the weight of edge is loaded from global memory in stream, then these two value go through an algorithm-specific processing which return an update of the property of the destination vertex, finally, at the end of scatter stage, this update value and the destination of this edge is combined to create a update tuple. The update tuples are streamed into the shuffle stage which dispatches the tuples to corresponding gather processing engines(PEs). The Gather PEs accumulates the update value in local on-chip memory which is caching the property of destination vertices. After all the edges in this partition are processed, the cached data in gather PEs will be aggregated to the global memory. and the Apply stage which calls algorithm-specific function updates all the vertices for the next iteration.

Future Work

Application wrapper for high level platform (Spark, etc.)
Hardware-accelerated query engine.
Cycle-precision software simulation for the verification of dynamic modules(Cache, etc.) and channel depth tuning.
Optimization for large scale graph. (distributed processing or HBM-based memory hierarchy)

How to cite ThunderGP

If you use ThunderGP in your paper, please cite our work (full version).

@article{10.1145/3517141,
author = {Chen, Xinyu and Cheng, Feng and Tan, Hongshi and Chen, Yao and He, Bingsheng and Wong, Weng-Fai and Chen, Deming},
title = {ThunderGP: Resource-Efficient Graph Processing Framework on FPGAs with HLS},
year = {2022},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
issn = {1936-7406},
url = {https://doi.org/10.1145/3517141},
doi = {10.1145/3517141},
note = {Just Accepted},
journal = {ACM Trans. Reconfigurable Technol. Syst.},
month = {feb},
keywords = {FPGA; HBM; HLS; Graph Processing; Framework}
}

@inbook{10.1145/3431920.3439290,
author = {Chen, Xinyu and Tan, Hongshi and Chen, Yao and He, Bingsheng and Wong, Weng-Fai and Chen, Deming},
title = {ThunderGP: HLS-Based Graph Processing Framework on FPGAs},
year = {2021},
url = {https://doi.org/10.1145/3431920.3439290},
booktitle = {The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays},
pages = {69–80},
numpages = {12}
}

Related publications

Xinyu Chen*, Ronak Bajaj^, Yao Chen, Jiong He, Bingsheng He, Weng-Fai Wong and Deming Chen. On-The-Fly Parallel Data Shuffling for Graph Processing on OpenCL-based FPGAs. FPL, 2019.
Xinyu Chen*, Yao Chen^, Ronak Bajaj, Jiong He, Bingsheng He, Weng-Fai Wong and Deming Chen. Is FPGA useful for hash joins. CIDR 2020: Conference on Innovative Data Systems Research

Related systems

Graph systems on GPU: G3 | Medusa
Other Thunder-series systems in Xtra NUS: ThunderGBM | ThunderSVM

Key members

Hongshi Tan, Xinyu Chen (NUS)
Advisor: Bingsheng He and Weng-Fai Wong, NUS
Collaborators: Yao Chen (ADSC) and Deming Chen (UIUC)

Acknowledgement

Xilinx Adaptive Compute Clusters (XACC) program
Singapore MoE Tier 2 grant (MOE2017-T2-1-122).

thundergp's People

Contributors

Stargazers

Watchers

thundergp's Issues

Possible bug with specific application, CModel gives correct results

I believe there may be a bug running this specific application, which is meant to double the property of all vertices (via edge properties).

When executing a superstep, the vertex properties remain unaltered. CModel reports what I would expect to be the correct result, but additionally seems to think the vertex properties are all 0 when they are not (since they are unaltered!).

Note that I set the vertex property to i+1 (so vertex 0 has property 1).

See below the results of a run. After the CModel verification I print the first 10 entries for MEM_ID_PUSHIN_PROP and MEM_ID_PUSHIN_PROP_MAPPED. Note that some vertices are also missing from MEM_ID_PUSHIN_PROP_MAPPED as per #8 but this is not the issue at hand here. None of the vertices have had their property doubled.

I am running on a U250 and using SDAccel 2018.3.

$ ./host_graph_fpga_dc1 xclbin_dc1/*.xclbin dataset/rmat-14-32.txt 
Found 1 platforms!
Found 1 devices!
file is xclbin_dc1/graph_fpga.hw.xilinx_u250_xdma_201830_2.xclbin xclbin_dc1/graph_fpga.hw.xilinx_u250_xdma_201830_2.xclbin 
INFO: Importing xclbin_dc1/graph_fpga.hw.xilinx_u250_xdma_201830_2.xclbin
INFO: Loaded file
INFO: Created Binary
INFO: Built Program
Graph dataset/rmat-14-32.txt is loaded.
vertex num: 16384
edge num: 524288

unpad edge_tuple_range 524288
ratio 12501 / 16384 is 0.763000 
ratio 12501 / 16384 is 0.763000 
ratio 12501 / 16384 is 0.763000 
ratio 12501 / 16384 is 0.763000 
[EST]: 1 is expected to exe in 0.401833ms
[EST]: 0 is expected to exe in 0.401825ms
[EST]: 2 is expected to exe in 0.401783ms
[EST]: 3 is expected to exe in 0.401734ms
[EST]: finalOrder 3 total exe: 0.401734ms
[EST]: finalOrder 2 total exe: 0.401783ms
[EST]: finalOrder 0 total exe: 0.401825ms
[EST]: finalOrder 1 total exe: 0.401833ms
[SIZE] 526336 cur_edge_num sub 131072
[SIZE] 526336 cur_edge_num sub 131072
[SIZE] 526336 cur_edge_num sub 131072
[SIZE] 526336 cur_edge_num sub 131072

----------------------------------------------------------------------------------
[PART] subPartitions 0 info :
[PART] 	 edgelist from 0 to 131072
[PART] 	 dst. vertex from 0 to 12500
[PART] 	 src. vertex from 9600 to 12500
[PART] 	 dump: 9600 - 12500
[PART] scatter cache ratio 0.000000 
[PART] v/e 0.095367 
[PART] v: 12500 e: 131072 
[PART] est. efficient 10.484921
[PART] compressRatio 0.763000 

[SCHE] 0 with 526336 @ 0 
transfer base mem start
transfer base mem
transfer subPartitions mem
transfer cu mem
data transfer 5.678447 
cmodel error 0 0x00000004 hw: 0x00000000  diff 0x00000004 !!!!
cmodel error 1 0x00000006 hw: 0x00000000  diff 0x00000006 !!!!
cmodel error 2 0x0000000a hw: 0x00000000  diff 0x0000000a !!!!
cmodel error 3 0x0000000e hw: 0x00000000  diff 0x0000000e !!!!
cmodel error 5 0x00000012 hw: 0x00000000  diff 0x00000012 !!!!
cmodel error 6 0x00000016 hw: 0x00000000  diff 0x00000016 !!!!
cmodel error 7 0x00000018 hw: 0x00000000  diff 0x00000018 !!!!
cmodel error 8 0x0000001a hw: 0x00000000  diff 0x0000001a !!!!
cmodel error 9 0x0000001c hw: 0x00000000  diff 0x0000001c !!!!
cmodel error 10 0x0000001e hw: 0x00000000  diff 0x0000001e !!!!
cmodel error 11 0x00000020 hw: 0x00000000  diff 0x00000020 !!!!
cmodel error 12 0x00000022 hw: 0x00000000  diff 0x00000022 !!!!
cmodel error 15 0x0000002a hw: 0x00000000  diff 0x0000002a !!!!
cmodel error 16 0x0000002c hw: 0x00000000  diff 0x0000002c !!!!
cmodel error 17 0x0000002e hw: 0x00000000  diff 0x0000002e !!!!
cmodel error 19 0x00000032 hw: 0x00000000  diff 0x00000032 !!!!
cmodel error 20 0x00000034 hw: 0x00000000  diff 0x00000034 !!!!
cmodel error 22 0x0000003a hw: 0x00000000  diff 0x0000003a !!!!
cmodel error 23 0x0000003c hw: 0x00000000  diff 0x0000003c !!!!
cmodel error 24 0x00000040 hw: 0x00000000  diff 0x00000040 !!!!
cmodel error 26 0x00000044 hw: 0x00000000  diff 0x00000044 !!!!
cmodel error 27 0x00000048 hw: 0x00000000  diff 0x00000048 !!!!
cmodel error 28 0x0000004a hw: 0x00000000  diff 0x0000004a !!!!
cmodel error 29 0x0000004c hw: 0x00000000  diff 0x0000004c !!!!
cmodel error 31 0x00000052 hw: 0x00000000  diff 0x00000052 !!!!
cmodel error 32 0x00000054 hw: 0x00000000  diff 0x00000054 !!!!
cmodel error 34 0x00000058 hw: 0x00000000  diff 0x00000058 !!!!
cmodel error 35 0x0000005a hw: 0x00000000  diff 0x0000005a !!!!
cmodel error 36 0x0000005c hw: 0x00000000  diff 0x0000005c !!!!
cmodel error 37 0x0000005e hw: 0x00000000  diff 0x0000005e !!!!
cmodel error 38 0x00000060 hw: 0x00000000  diff 0x00000060 !!!!
cmodel error 39 0x00000062 hw: 0x00000000  diff 0x00000062 !!!!
cmodel error 40 0x00000064 hw: 0x00000000  diff 0x00000064 !!!!
cmodel error 41 0x00000068 hw: 0x00000000  diff 0x00000068 !!!!
cmodel error 42 0x0000006a hw: 0x00000000  diff 0x0000006a !!!!
cmodel error 43 0x0000006c hw: 0x00000000  diff 0x0000006c !!!!
cmodel error 44 0x0000006e hw: 0x00000000  diff 0x0000006e !!!!
cmodel error 45 0x00000072 hw: 0x00000000  diff 0x00000072 !!!!
cmodel error 47 0x0000007a hw: 0x00000000  diff 0x0000007a !!!!
cmodel error 48 0x0000007c hw: 0x00000000  diff 0x0000007c !!!!
cmodel error 50 0x00000080 hw: 0x00000000  diff 0x00000080 !!!!
cmodel error 51 0x00000082 hw: 0x00000000  diff 0x00000082 !!!!
cmodel error 52 0x00000084 hw: 0x00000000  diff 0x00000084 !!!!
cmodel error 53 0x00000086 hw: 0x00000000  diff 0x00000086 !!!!
cmodel error 54 0x00000088 hw: 0x00000000  diff 0x00000088 !!!!
cmodel error 55 0x0000008c hw: 0x00000000  diff 0x0000008c !!!!
cmodel error 56 0x0000008e hw: 0x00000000  diff 0x0000008e !!!!
cmodel error 57 0x00000090 hw: 0x00000000  diff 0x00000090 !!!!
cmodel error 58 0x00000092 hw: 0x00000000  diff 0x00000092 !!!!
total cmodel error: 11352
Property for "first 10" vertices:
1
2
3
4
5
6
7
8
9
10
Number of vertices with non-null property:16384

Property for "first 10" vertices:
2
3
5
7
8
9
11
12
13
14

The exploration of resources in the u50 version

Hi~ I wonder why SubpartitionNum(KernelNum) is a fixed value in the u50 version, is it because there is something in the vitis version that limits the exploration of resources?

Buffer Allocation Failed on U280 when Running on Larger Graphs

Description
Hi! I cloned the code from branch 'v_HBM' and attempted to execute BFS on U280 FPGA. The execution went smoothly with the provided dataset (rmat-14-32.txt). However, when I tried to run it with larger graphs——specifically, using soc-LiveJournal1.txt downloaded from the Stanford Network Dataset Collection——the program crashed during buffer allocation on the U280. Below is the error log:

[XRT] ERROR: std::bad_alloc
./main.cpp:86 Error calling edge_buffers.emplace_back(cl::Buffer(context, flags, edgesHeadArrays[i].size, &(edgesHeadArrays[i].ext_attr), &status)), error code is: -5

I've made a few adjustments to the host code by rewriting it using OpenCL2, but it's not the cause of the error. It appears that the issue lies in XRT's inability to allocate a buffer larger than 256MB with HBM.

Referrence:

So, I'm wondering it may be hard for the accelerator to support large graph processing as long as the size of edgesHeadArray exceeds 256 MB.

To Reproduce
Steps to reproduce the behavior:

Clone the v_HBM code.
Compile the project.
Download graph soc-LiveJournal from SNAP, extract it to dataset folder.
Run command ./host_graph_fpga_bfs xclbin_hw_bfs/graph_fpga.hw.xilinx_u280_gen3x16_xdma_1_202211_1.xclbin dataset/soc-LiveJournal1.txt.
See error.

Platform:

Device: U280 FPGA
OS: Ubuntu 18.04.6 LTS

Your assistance in resolving this issue is greatly appreciated!

Accessing Properties of Vertices with Null Out-Degree

Hi everyone.

I'm wondering about accessing the property of vertices with out-degree equal to 0.
I understand these are somehow filtered out of certain computations since they generate no updates, but they should still receive updates so I'd like to access their property after some number of supersteps.

I would also appreciate an explanation on the various MEM_ID pointers.

Supporting Vitis and XRT recent versions

As indicated in the README, the current platform and tools under evaluation exclusively support the 2019 version of SDAccel. However, I am interested in understanding whether there are plans in place to update this setup to accommodate Vitis and more recent versions of XRT.

My attempts to install SDAccel 2019 have proven unsuccessful, primarily due to the fact that the platform I'm using is setup for XRT 2022, a version that SDAccel 2019 does not recognize.

Is there any indication or possibility of recompiling the code using Vitis 2022? I've made an attempt in this direction, but I encountered errors related to some of the generated IPCores. It appears that a comprehensive and thorough revision of the code may be necessary to align it with the new environment.

I appreciate any hint and help,
Best regards

Example of implementing the accelerator for PageRank on Alveo U250 platform

While running the commands as listed in the github page for pagerank application on u250 platform, the make process fails
A clear and concise description of what the bug is.

To Reproduce
Steps to reproduce the behavior:

Run the commands as :
git clone https://github.com/Xtra-Computing/ThunderGP.git
$ cd ./
$ vim ThunderGP.mk
$ # configure the DEVICE as DEVICES := xilinx_u250_xdma_201830_2; configure TARGETS := hw
$ make app=pr all
3
error.txt
. See error

Expected behavior
The make process should pass.

Screenshots
The resulting logs have been attached to this issue.

Desktop :

OS: Ubuntu 18.04
Xilinx SDx 2018.3
Gcc version 7.5(this might be the problem that might be causing since required is 9.3)
Device-Alveo u250

Thank you

Questions on EDGE_NUM and PE_NUM.

Unless I'm misunderstanding, ThunderGP accesses EDGE_NUM edges and shuffles them to PE_NUM PEs in each cycle.

Is there any reason that the number of PEs is set as twice as that of edges?

Moreover, I also wonder whether the DRAM bandwidth can be fully utilized with 8 edges and 16 PEs since U200 provides up to 77GB/s bandwidth.