License: Academic Free License v3.0


Parallel-Matrix-Multiplication-FOX-Algorithm

☕ Implementation of Parallel Matrix Multiplication Using the Fox Algorithm on Peking University's High-Performance Computing System


Contents

  1. Reference Documents
    • Thomas Anastasio, Example of Matrix Multiplication by Fox Method
    • Jaeyoung Choi, A New Parallel Matrix Multiplication Algorithm on Distributed-Memory Concurrent Computers
    • Ned Nedialkov, Communicators and Topologies: Matrix Multiplication Example
  2. Source Codes
  3. Code Tests
    • Dell XPS8900
      • Code Test on Dell XPS8900 Workstation (Intel® Core™ i7-6700K Processor)
      • Analyzing MPI Performance Using Intel Trace Analyzer
    • PKU-HPC
      • Lenovo X8800 Supercomputer Platform
      • Code Performance Tests on X8800 Supercomputer Platform's CPU Node (Intel® Xeon® Processor E5-2697A v4)
      • Code Performance Tests on X8800 Supercomputer Platform's MIC Node (Intel® Xeon Phi™ Processor 7250)
    • Code Tests' Contents
  4. Reports
    • 1801111621_洪瑶_并行程序报告.pdf
    • 并行程序报告.docx
    • 洪瑶_1801111621并行程序设计报告.pptx
    • Parallel FOX Algorithm Project Report.pptx (will be added in the future)
    • Parallel FOX Algorithm Project Report Paper.tex (will be added in the future)
    • Parallel FOX Algorithm Project Report Paper.pdf (will be added in the future)
    • Reports' Contents
  5. Images
    • FOX.png
    • FOX Stage Whole.JPG
    • FOX Stage Loading Balance.png

Brief Introduction to Parallel Matrix Multiplication FOX Algorithm

Basic Concepts

  • Reduction
  • Owner-computes rule
  • Pipeline parallelism:
    • On a single process, the matrix computation is divided into P stages (P supercomputing steps per process)
  • Data parallelism:
    • At each computing step, every process computes its local matrix product simultaneously

Serial Matrix Multiplication

  • Mathematical Modeling of Matrix Multiplication

    • C_{ij}=\sum_{k=0}^{K-1} A_{ik}B_{kj}; \quad (i=0,\dots,M-1), \quad (j=0,\dots,N-1)
  • Time Complexity

    • O\left ( N^{3} \right )
  • Storage Complexity

    • O\left ( N^{2} \right ) (three N \times N matrices)
  • Example Implementation in C Language

for (i = 0; i < n; i++)                  /* C must be zero-initialized first */
    for (j = 0; j < n; j++)
        for (k = 0; k < n; k++)
            C(i,j) = C(i,j) + A(i,k)*B(k,j);

Parallel Computing Modeling Design

  1. Basic Flow
  • Matrix \mathbf{A}'s dimension is M \times K, and matrix \mathbf{B}'s dimension is K \times N.
  • Compute matrix \mathbf{C} = \mathbf{A}\mathbf{B} in parallel.
  • Let p be the number of processors, and let q=\sqrt{p} be an integer that divides M, K and N.
  • Create a Cartesian topology with process mesh P_{ij}, i=0..q-1, j=0..q-1.
  • Denote \hat{M} = \frac{M}{q}, \hat{K}=\frac{K}{q}, \hat{N}=\frac{N}{q}.
  • Distribute \mathbf{A} and \mathbf{B} by blocks over the p processes such that A_{ij} is an \hat{M} \times \hat{K} block and B_{ij} is a \hat{K} \times \hat{N} block, both stored on process P_{ij}.
  2. Details
  • Partitions of matrices A, B and C (index notation in the mathematical formulas starts from 1)

    • Matrix A

    • Matrix B

    • Matrix C

  • Data Distribution on the 2-D Cartesian Topology Process Mesh (index notation in the mathematical formulas starts from 1)

    • Data Mapping

      • Data Mesh Mapping Process Mesh
    • The partition may not be perfect, so not every sub-matrix is a square matrix. That is not a problem for correctness, but it does cause load imbalance across processes.

    • Unbalanced Partition

      (table: data partition vs. process mesh for the unbalanced case)
    • Mathematical Modeling of Sub-Matrix Multiplication
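With the block partition above, the block-level product takes the same form as the scalar formula; written out with 1-based block indices, as in the partitions above:

```latex
\mathbf{C}_{ij} \;=\; \sum_{k=1}^{q} \mathbf{A}_{ik}\,\mathbf{B}_{kj},
\qquad i = 1,\dots,q, \quad j = 1,\dots,q,
```

where each product \mathbf{A}_{ik}\mathbf{B}_{kj} multiplies an \hat{M} \times \hat{K} block by a \hat{K} \times \hat{N} block, yielding an \hat{M} \times \hat{N} contribution to \mathbf{C}_{ij}.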

Parallel Algorithm Design on BSP

Parallelism type: data parallelism combined with pipeline parallelism

  1. Rewrite the sub-matrix multiplication formula as q supersteps (supercomputing steps), numbered 0 to q-1. At stage s, process P_{ij} uses \mathbf{A_{i,(i+s) \bmod q}} and \mathbf{B_{(i+s) \bmod q,\,j}}:

    • Stage Mathematical Operation
      0 \mathbf{C_{ij}}=\mathbf{A_{i,i}}\mathbf{B_{i,j}}
      1 \mathbf{C_{ij}}=\mathbf{C_{ij}}+\mathbf{A_{i,i+1}}\mathbf{B_{i+1,j}}
      2 \mathbf{C_{ij}}=\mathbf{C_{ij}}+\mathbf{A_{i,i+2}}\mathbf{B_{i+2,j}}
      ... ...
      q-1-i \mathbf{C_{ij}}=\mathbf{C_{ij}}+\mathbf{A_{i,q-1}}\mathbf{B_{q-1,j}}
      q-i \mathbf{C_{ij}}=\mathbf{C_{ij}}+\mathbf{A_{i,0}}\mathbf{B_{0,j}} (indices wrap modulo q)
      ... ...
      q-1 \mathbf{C_{ij}}=\mathbf{C_{ij}}+\mathbf{A_{i,i-1}}\mathbf{B_{i-1,j}}
    • Data parallelism: the local matrix multiplication runs in every process at each superstep.

  2. Parallel algorithm operations at each superstep:

    • Stage Algorithm Operation
      0
        1. Process P_{ij} has \mathbf{A_{ij}} and \mathbf{B_{ij}} but needs \mathbf{A_{ii}} (for each row index i)
        2. Process P_{ii} broadcasts \mathbf{A_{ii}} across process mesh row i
        3. Process P_{ij} computes \mathbf{C_{ij}=A_{ii}B_{ij}}
      1
        1. P_{ij} has \mathbf{A_{ij}} and \mathbf{B_{ij}} but needs \mathbf{A_{i,i+1}} and \mathbf{B_{i+1,j}}
          1.1 Shift the j-th block column of \mathbf{B} up by one block (block 0 wraps to block q-1; periodic)
          1.2 P_{i,i+1} broadcasts \mathbf{A_{i,i+1}} across process mesh row i
        2. Process P_{ij} computes \mathbf{C_{ij}=C_{ij}+A_{i,i+1}B_{i+1,j}}
      2
        1. P_{ij} has \mathbf{A_{ij}} and \mathbf{B_{ij}} but needs \mathbf{A_{i,i+2}} and \mathbf{B_{i+2,j}}
          1.1 Shift the j-th block column of \mathbf{B} up by one block (block 0 wraps to block q-1; periodic)
          1.2 P_{i,i+2} broadcasts \mathbf{A_{i,i+2}} across process mesh row i
        2. Process P_{ij} computes \mathbf{C_{ij}=C_{ij}+A_{i,i+2}B_{i+2,j}}
      ...
      q-1
        The same shift-and-broadcast pattern, ending with \mathbf{C_{ij}=C_{ij}+A_{i,i-1}B_{i-1,j}} (indices modulo q)
    • Pipeline parallelism: the q computing steps (0 \to q-1) executed in sequence by each process P_{ij}.

Algorithm Analysis

  1. Algorithm analysis per superstep

    • Stage 0 (broadcast only)
      Operations:
        1. Process P_{ij} has \mathbf{A_{ij}} and \mathbf{B_{ij}} but needs \mathbf{A_{ii}} (for each row index i)
        2. Process P_{ii} broadcasts \mathbf{A_{ii}} across process mesh row i
        3. Process P_{ij} computes \mathbf{C_{ij}=A_{ii}B_{ij}}
      Communication in the broadcast operation:
        \left( q-1 \right) \times q \times \frac{M}{q} \times \frac{K}{q}
      Computing in each process:
        \frac{M}{q} \times \frac{K}{q} \times \frac{N}{q}
      Computing in total:
        \left( \frac{M}{q} \times \frac{K}{q} \times \frac{N}{q} \right) \times q \times q
    • Stages s = 1, \dots, q-1 (shift, then broadcast; indices modulo q)
      Operations:
        1. P_{ij} has \mathbf{A_{ij}} and \mathbf{B_{ij}} but needs \mathbf{A_{i,i+s}} and \mathbf{B_{i+s,j}}
          1.1 Shift the j-th block column of \mathbf{B} up by one block (block 0 wraps to block q-1; periodic)
          1.2 P_{i,i+s} broadcasts \mathbf{A_{i,i+s}} across process mesh row i
        2. Process P_{ij} computes \mathbf{C_{ij}=C_{ij}+A_{i,i+s}B_{i+s,j}}
      Communication in the shift operation:
        \left( q \times q \right) \times \left( \frac{K}{q} \times \frac{N}{q} \right)
      Communication in the broadcast operation:
        \left[ \left( q-1 \right) \times q \right] \times \left( \frac{M}{q} \times \frac{K}{q} \right)
      Communication in total:
        \left( q \times q \right) \times \left( \frac{K}{q} \times \frac{N}{q} \right) + \left[ \left( q-1 \right) \times q \right] \times \left( \frac{M}{q} \times \frac{K}{q} \right)
      Computing in each process:
        \frac{M}{q} \times \frac{K}{q} \times \frac{N}{q}
      Computing in total:
        \left( \frac{M}{q} \times \frac{K}{q} \times \frac{N}{q} \right) \times q \times q
  2. Communication in total

  3. Computing in total
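Summing the per-stage counts over all q supersteps (the broadcast happens at every stage; the shift only in the q-1 stages after stage 0) gives, measured in matrix elements:

```latex
T_{\mathrm{comm}}
  = \underbrace{q \cdot (q-1)\,q\,\frac{M}{q}\frac{K}{q}}_{\text{broadcasts}}
  + \underbrace{(q-1) \cdot q^{2}\,\frac{K}{q}\frac{N}{q}}_{\text{shifts}}
  = (q-1)\,K\,(M+N),
\qquad
T_{\mathrm{comp}}
  = q \cdot q^{2}\,\frac{M}{q}\frac{K}{q}\frac{N}{q}
  = MKN .
```

So the aggregate computation matches the serial operation count MKN, i.e. MKN/p per process, while the total communication volume grows as (q-1)K(M+N).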

FOX Kernel in the Parallel MPI-C Program

        n_bar = n/grid->q;
        Set_to_zero(local_C);

        /* B blocks shift up their column: receive from the row below,
           send to the row above (periodic). */
        source = (grid->my_row + 1) % grid->q;
        dest   = (grid->my_row + grid->q - 1) % grid->q;

        temp_A = Local_matrix_allocate(n_bar);

        for (stage = 0; stage < grid->q; stage++) {
            bcast_root = (grid->my_row + stage) % grid->q;
            if (bcast_root == grid->my_col) {
                MPI_Bcast(local_A, 1, local_matrix_mpi_t,
                          bcast_root, grid->row_comm);
                Local_matrix_multiply(local_A, local_B, local_C);
            } else {
                MPI_Bcast(temp_A, 1, local_matrix_mpi_t,
                          bcast_root, grid->row_comm);
                Local_matrix_multiply(temp_A, local_B, local_C);
            }
            MPI_Sendrecv_replace(local_B, 1, local_matrix_mpi_t,
                                 dest, 0, source, 0, grid->col_comm, &status);
        }

Analysis

  • Intel® Trace Analyzer statistics results while the Fox kernel executes
  • Intel® Trace Analyzer load-balance analysis while the Fox kernel executes

Warranty

There may be mistakes in both the documents and the code, owing to the limits of our knowledge and resources. As a result: THESE DOCUMENTS AND CODES ARE PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND. I MAKE NO WARRANTIES, EXPRESS OR IMPLIED, THAT THEY ARE FREE OF ERROR.

Copyright

You may use and copy these works for any academic purpose, except copying them merely to complete your homework, or republishing them without properly crediting the original authors.

parallel-matrix-multiplication-fox-algorithm's People

Contributors

hong-yao, yangyang14641

