License: Academic Free License v3.0


Parallel-Matrix-Multiplication-FOX-Algorithm

☕ Implementation of Parallel Matrix Multiplication Using the Fox Algorithm on Peking University's High-Performance Computing System


Contents

  1. Reference Documents
    • Thomas Anastasio, Example of Matrix Multiplication by Fox Method
    • Jaeyoung Choi, A New Parallel Matrix Multiplication Algorithm on Distributed-Memory Concurrent Computers
    • Ned Nedialkov, Communicators and Topologies: Matrix Multiplication Example
  2. Source Codes
  3. Code Tests
    • Dell XPS8900
      • Code Test on Dell XPS8900 Workstation (Intel® Core™ i7-6700K Processor)
      • Analyzing MPI Performance Using Intel Trace Analyzer
    • PKU-HPC
      • Lenovo X8800 Supercomputer Platform
      • Code Performance Tests on X8800 Supercomputer Platform's CPU Node (Intel® Xeon® Processor E5-2697A v4)
      • Code Performance Tests on X8800 Supercomputer Platform's MIC Node (Intel® Xeon Phi™ Processor 7250)
    • Code Tests' Contents
  4. Reports
    • 1801111621_洪瑶_并行程序报告.pdf
    • 并行程序报告.docx
    • 洪瑶_1801111621并行程序设计报告.pptx
    • Parallel FOX Algorithm Project Report.pptx (will be added in the future)
    • Parallel FOX Algorithm Project Report Paper.tex (will be added in the future)
    • Parallel FOX Algorithm Project Report Paper.pdf (will be added in the future)
    • Reports' Contents
  5. Images
    • FOX.png
    • FOX Stage Whole.JPG
    • FOX Stage Loading Balance.png

Brief Introduction to Parallel Matrix Multiplication FOX Algorithm

Basic Concepts

  • Reduction
  • Owner-computes rule
  • Pipeline parallelism:
    • On a single process, the matrix computation is divided into P stages (P supercomputing steps per process)
  • Data parallelism:
    • At each computing step, every process computes its local matrix product simultaneously

Serial Matrix Multiplication

  • Mathematical Modeling of Matrix Multiplication

    • C_{ij}=\sum_{k=0}^{K-1} A_{ik}B_{kj}; \quad (i=0,\dots,M-1), \quad (j=0,\dots,N-1)
  • Time Complexity

    • O\left ( N^{3} \right )
  • Storage Complexity

    • O\left ( N^{2} \right ) (three N \times N matrices)
  • Example Implementation in C Language

for (i = 0; i < n; i++)                  /* C must be zero-initialized first */
    for (j = 0; j < n; j++)
        for (k = 0; k < n; k++)
            C(i,j) = C(i,j) + A(i,k)*B(k,j);

Parallel Computing Modeling Design

  1. Basic Flow
  • Matrix \mathbf{A}'s dimension is M \times K, and matrix \mathbf{B}'s dimension is K \times N.
  • Compute matrix \mathbf{C} = \mathbf{A}\mathbf{B} in parallel.
  • Let p be the number of processors, and let q=\sqrt{p} be an integer that divides M, K and N.
  • Create a Cartesian topology with process mesh P_{ij}, i=0..q-1, j=0..q-1.
  • Denote \hat{M} = \frac{M}{q}, \hat{K}=\frac{K}{q}, \hat{N}=\frac{N}{q}.
  • Distribute \mathbf{A} and \mathbf{B} by blocks over the p processes such that A_{ij} is an \hat{M} \times \hat{K} block and B_{ij} is a \hat{K} \times \hat{N} block, both stored on process P_{ij}.
  2. Details
  • Partitions of matrices A, B and C (index notation in the mathematical formulas starts from 1)

    • Matrix A

    • Matrix B

    • Matrix C

  • Data Distribution on the 2-D Cartesian Topology Process Mesh (index notation in the mathematical formulas starts from 1)

    • Data Mapping

      • Data Mesh Mapping Process Mesh
    • The partition may not be perfect, so not every sub-matrix is a square matrix. That is not a problem for correctness, but it does cause load imbalance across processes.

    • Unbalanced Partition

      (table: data partition vs. process mesh for the unbalanced case)
    • Mathematical Modeling of Sub-Matrix Multiplication
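With the block partition above, the block-level product takes the same form as the scalar formula; written out with 1-based block indices, as in the partitions above:

```latex
\mathbf{C}_{ij} \;=\; \sum_{k=1}^{q} \mathbf{A}_{ik}\,\mathbf{B}_{kj},
\qquad i = 1,\dots,q, \quad j = 1,\dots,q,
```

where each product \mathbf{A}_{ik}\mathbf{B}_{kj} multiplies an \hat{M} \times \hat{K} block by a \hat{K} \times \hat{N} block, yielding an \hat{M} \times \hat{N} contribution to \mathbf{C}_{ij}.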

Parallel Algorithm Design on BSP

Parallelism type: data parallelism combined with pipeline parallelism

  1. Rewrite the sub-matrix multiplication formula as q supersteps (supercomputing steps), numbered 0 to q-1. At stage s, process P_{ij} uses \mathbf{A_{i,(i+s) \bmod q}} and \mathbf{B_{(i+s) \bmod q,\,j}}:

    • Stage Mathematical Operation
      0 \mathbf{C_{ij}}=\mathbf{A_{i,i}}\mathbf{B_{i,j}}
      1 \mathbf{C_{ij}}=\mathbf{C_{ij}}+\mathbf{A_{i,i+1}}\mathbf{B_{i+1,j}}
      2 \mathbf{C_{ij}}=\mathbf{C_{ij}}+\mathbf{A_{i,i+2}}\mathbf{B_{i+2,j}}
      ... ...
      q-1-i \mathbf{C_{ij}}=\mathbf{C_{ij}}+\mathbf{A_{i,q-1}}\mathbf{B_{q-1,j}}
      q-i \mathbf{C_{ij}}=\mathbf{C_{ij}}+\mathbf{A_{i,0}}\mathbf{B_{0,j}} (indices wrap modulo q)
      ... ...
      q-1 \mathbf{C_{ij}}=\mathbf{C_{ij}}+\mathbf{A_{i,i-1}}\mathbf{B_{i-1,j}}
    • Data parallelism: the local matrix multiplication runs in every process at each superstep.

  2. Parallel algorithm operations at each superstep:

    • Stage Algorithm Operation
      0
        1. Process P_{ij} has \mathbf{A_{ij}} and \mathbf{B_{ij}} but needs \mathbf{A_{ii}} (for each row index i)
        2. Process P_{ii} broadcasts \mathbf{A_{ii}} across process mesh row i
        3. Process P_{ij} computes \mathbf{C_{ij}=A_{ii}B_{ij}}
      1
        1. P_{ij} has \mathbf{A_{ij}} and \mathbf{B_{ij}} but needs \mathbf{A_{i,i+1}} and \mathbf{B_{i+1,j}}
          1.1 Shift the j-th block column of \mathbf{B} up by one block (block 0 wraps to block q-1; periodic)
          1.2 P_{i,i+1} broadcasts \mathbf{A_{i,i+1}} across process mesh row i
        2. Process P_{ij} computes \mathbf{C_{ij}=C_{ij}+A_{i,i+1}B_{i+1,j}}
      2
        1. P_{ij} has \mathbf{A_{ij}} and \mathbf{B_{ij}} but needs \mathbf{A_{i,i+2}} and \mathbf{B_{i+2,j}}
          1.1 Shift the j-th block column of \mathbf{B} up by one block (block 0 wraps to block q-1; periodic)
          1.2 P_{i,i+2} broadcasts \mathbf{A_{i,i+2}} across process mesh row i
        2. Process P_{ij} computes \mathbf{C_{ij}=C_{ij}+A_{i,i+2}B_{i+2,j}}
      ...
      q-1
        The same shift-and-broadcast pattern, ending with \mathbf{C_{ij}=C_{ij}+A_{i,i-1}B_{i-1,j}} (indices modulo q)
    • Pipeline parallelism: the q computing steps (0 \to q-1) executed in sequence by each process P_{ij}.

Algorithm Analysis

  1. Algorithm analysis per superstep

    • Stage 0 (broadcast only)
      Operations:
        1. Process P_{ij} has \mathbf{A_{ij}} and \mathbf{B_{ij}} but needs \mathbf{A_{ii}} (for each row index i)
        2. Process P_{ii} broadcasts \mathbf{A_{ii}} across process mesh row i
        3. Process P_{ij} computes \mathbf{C_{ij}=A_{ii}B_{ij}}
      Communication in the broadcast operation:
        \left( q-1 \right) \times q \times \frac{M}{q} \times \frac{K}{q}
      Computing in each process:
        \frac{M}{q} \times \frac{K}{q} \times \frac{N}{q}
      Computing in total:
        \left( \frac{M}{q} \times \frac{K}{q} \times \frac{N}{q} \right) \times q \times q
    • Stages s = 1, \dots, q-1 (shift, then broadcast; indices modulo q)
      Operations:
        1. P_{ij} has \mathbf{A_{ij}} and \mathbf{B_{ij}} but needs \mathbf{A_{i,i+s}} and \mathbf{B_{i+s,j}}
          1.1 Shift the j-th block column of \mathbf{B} up by one block (block 0 wraps to block q-1; periodic)
          1.2 P_{i,i+s} broadcasts \mathbf{A_{i,i+s}} across process mesh row i
        2. Process P_{ij} computes \mathbf{C_{ij}=C_{ij}+A_{i,i+s}B_{i+s,j}}
      Communication in the shift operation:
        \left( q \times q \right) \times \left( \frac{K}{q} \times \frac{N}{q} \right)
      Communication in the broadcast operation:
        \left[ \left( q-1 \right) \times q \right] \times \left( \frac{M}{q} \times \frac{K}{q} \right)
      Communication in total:
        \left( q \times q \right) \times \left( \frac{K}{q} \times \frac{N}{q} \right) + \left[ \left( q-1 \right) \times q \right] \times \left( \frac{M}{q} \times \frac{K}{q} \right)
      Computing in each process:
        \frac{M}{q} \times \frac{K}{q} \times \frac{N}{q}
      Computing in total:
        \left( \frac{M}{q} \times \frac{K}{q} \times \frac{N}{q} \right) \times q \times q
  2. Communication in total

  3. Computing in total
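Summing the per-stage counts over all q supersteps (the broadcast happens at every stage; the shift only in the q-1 stages after stage 0) gives, measured in matrix elements:

```latex
T_{\mathrm{comm}}
  = \underbrace{q \cdot (q-1)\,q\,\frac{M}{q}\frac{K}{q}}_{\text{broadcasts}}
  + \underbrace{(q-1) \cdot q^{2}\,\frac{K}{q}\frac{N}{q}}_{\text{shifts}}
  = (q-1)\,K\,(M+N),
\qquad
T_{\mathrm{comp}}
  = q \cdot q^{2}\,\frac{M}{q}\frac{K}{q}\frac{N}{q}
  = MKN .
```

So the aggregate computation matches the serial operation count MKN, i.e. MKN/p per process, while the total communication volume grows as (q-1)K(M+N).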

FOX Kernel in the Parallel MPI-C Program

        n_bar = n/grid->q;
        Set_to_zero(local_C);

        /* B blocks shift up their column: receive from the row below,
           send to the row above (periodic). */
        source = (grid->my_row + 1) % grid->q;
        dest   = (grid->my_row + grid->q - 1) % grid->q;

        temp_A = Local_matrix_allocate(n_bar);

        for (stage = 0; stage < grid->q; stage++) {
            bcast_root = (grid->my_row + stage) % grid->q;
            if (bcast_root == grid->my_col) {
                MPI_Bcast(local_A, 1, local_matrix_mpi_t,
                          bcast_root, grid->row_comm);
                Local_matrix_multiply(local_A, local_B, local_C);
            } else {
                MPI_Bcast(temp_A, 1, local_matrix_mpi_t,
                          bcast_root, grid->row_comm);
                Local_matrix_multiply(temp_A, local_B, local_C);
            }
            MPI_Sendrecv_replace(local_B, 1, local_matrix_mpi_t,
                                 dest, 0, source, 0, grid->col_comm, &status);
        }

Analysis

  • Intel® Trace Analyzer statistics results while the Fox kernel executes
  • Intel® Trace Analyzer load-balance analysis while the Fox kernel executes

Warranty

There may be mistakes in both the documents and the code, owing to the limits of our knowledge and resources. As a result: THESE DOCUMENTS AND CODES ARE PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND. I MAKE NO WARRANTIES, EXPRESS OR IMPLIED, THAT THEY ARE FREE OF ERROR.

Copyright

You may use and copy these works for any academic purpose, except copying them merely to complete your homework, or republishing them without properly crediting the original authors.

parallel-matrix-multiplication-fox-algorithm's People

Contributors

hong-yao, yangyang14641

