intelpython / dpbench

Benchmark suite to evaluate Data Parallel Extensions for Python

License: Apache License 2.0

Languages: Python 92.41%, CMake 0.91%, C++ 6.18%, Mako 0.05%, Batchfile 0.29%, Shell 0.16%
Topics: benchmark, dpctl, dpnp, numba, numba-dpex, numpy, performance

dpbench's People

Contributors

adarshyoga, akharche, alexanderkalistratov, andresguzman-ballen, beckermr, chudur-budur, denisscherbakov, diptorupd, hardcode84, jharlow-intel, jtchilders, mingjie-intel, npolina4, oleksandr-pavlyk, pokhodenkosa, reazulhoque, samaid, zzeekkaa

dpbench's Issues

The given compilation instructions for numba-dpex don't work

If I follow the numba-dpex compilation instructions in dpbench's README, the build fails with this error:

python setup.py develop
Compiling numba_dpex/dpnp_iface/dpnp_fptr_interface.pyx because it changed.
[1/1] Cythonizing numba_dpex/dpnp_iface/dpnp_fptr_interface.pyx
running develop
/localdisk/work/$USER/.dpbench/lib/python3.9/site-packages/setuptools/command/easy_install.py:144: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
/localdisk/work/$USER/.dpbench/lib/python3.9/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
running egg_info
creating numba_dpex.egg-info
writing numba_dpex.egg-info/PKG-INFO
writing dependency_links to numba_dpex.egg-info/dependency_links.txt
writing entry points to numba_dpex.egg-info/entry_points.txt
writing requirements to numba_dpex.egg-info/requires.txt
writing top-level names to numba_dpex.egg-info/top_level.txt
writing manifest file 'numba_dpex.egg-info/SOURCES.txt'
reading manifest file 'numba_dpex.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
adding license file 'LICENSE'
writing manifest file 'numba_dpex.egg-info/SOURCES.txt'
running build_ext
building 'numba_dpex._usm_allocators_ext' extension
creating build
creating build/temp.linux-x86_64-3.9
creating build/temp.linux-x86_64-3.9/numba_dpex
creating build/temp.linux-x86_64-3.9/numba_dpex/dpctl_iface
gcc -pthread -B /localdisk/work/$USER/.dpbench/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -O2 -Wall -fPIC -O2 -isystem /localdisk/work/$USER/.dpbench/include -I/localdisk/work/$USER/.dpbench/include -fPIC -O2 -isystem /localdisk/work/$USER/.dpbench/include -fPIC -I/localdisk/work/$USER/.dpbench/lib/python3.9/site-packages -I/localdisk/work/$USER/.dpbench/lib/python3.9/site-packages/dpctl/include -I/localdisk/work/$USER/.dpbench/include/python3.9 -c numba_dpex/dpctl_iface/usm_allocators_ext.c -o build/temp.linux-x86_64-3.9/numba_dpex/dpctl_iface/usm_allocators_ext.o
In file included from /localdisk/work/$USER/.dpbench/lib/python3.9/site-packages/dpctl/include/dpctl_sycl_interface.h:35,
                 from numba_dpex/dpctl_iface/usm_allocators_ext.c:35:
/localdisk/work/$USER/.dpbench/lib/python3.9/site-packages/dpctl/include/syclinterface/dpctl_sycl_device_interface.h:253:1: error: expected identifier or ‘(’ before ‘[’ token
  253 | [[deprecated("Use DPCTLDevice_WorkItemSizes3d instead")]] DPCTL_API
      | ^
error: command '/usr/bin/gcc' failed with exit code 1

However, if I follow the instructions from the numba-dpex page, like this:

conda create -n dpbench-dev -c /opt/intel/oneapi/conda_channel python=3.9 dpctl dpnp numba spirv-tools llvm-spirv llvmdev cython pytest
conda activate dpbench-dev
python setup.py develop

It works.

I think the compilation instructions in the dpbench README need to be updated.

Possible improvements to framework

  • It would be nice to use Python's logging module instead of print statements for outputting benchmarking information.

  • The need to keep benchmark implementation argument names the same across benchmarks is odd. It stems from passing them as keyword arguments when, in reality, they are all positional. Perhaps we could add a positional_args entry to the benchmark config that specifies the list of named arguments.
    We can then pop these from the dictionary and call the implementation as impl_fn(*posargs, **kwargs) (see the sketch below).
    Otherwise anybody adding a new benchmark will invariably run into argument-name mismatches.

Originally posted by @oleksandr-pavlyk in #115 (comment)
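A minimal sketch of the proposed dispatch, assuming a hypothetical bench_config dict with a positional_args entry listing argument names in call order (the entry name comes from the suggestion above; the surrounding helper names are illustrative, not dpbench's actual API):

def call_impl(impl_fn, bench_config, data):
    # data maps argument names to the initialized arrays/scalars
    kwargs = dict(data)
    # Pop the declared positional arguments in the configured order ...
    posargs = [kwargs.pop(name) for name in bench_config.get("positional_args", [])]
    # ... and pass everything else by keyword, so implementations are free to
    # choose their own names for the remaining arguments.
    return impl_fn(*posargs, **kwargs)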

Reference configuration

New developers and users need to understand how to get started. This should be documented, and helper scripts could be provided.

  • How to configure environment
    • How to create conda environment
    • How to install dpbench
    • How to install oneAPI
    • How to install VTune
    • ...
  • How to run workloads
  • Where to search for reference results
  • ...

DBSCAN dpex_k implementation does not work under compute follows data.

The dpex_k implementation for DBSCAN currently fails execution with a rather cryptic message stating:

"Datatypes of array passed to @numba_dpex.kernel has to be the same. Passed datatypes: "...

The error message needs fixing and I am working on a dpex PR to address that.

What the error message is really saying is that the implementation does not follow the "compute follows data" programming model. Under compute follows data, the execution queue for the kernel should be discoverable from the input array arguments.

There are two problems with the current implementation.

  • In the dpex_k implementation, the arguments to the kernel are n_samples, min_pts, assignments, sizes, indices_list. Of these, sizes and indices_list are not allocated in the initialize function and therefore are never copied to usm_ndarray. The kernel inputs are a mix of numpy.ndarray and dpctl.tensor.usm_ndarray, and there is no way to infer the execution queue using compute follows data; hence the dpex error. To fix the issue, the creation of these two arrays needs to be moved into the initialize call (see the sketch after this list).

  • Fixing the first problem will expose the next issue, which is currently hidden by the first failure. Only the get_neighborhood function is a kernel; the compute_cluster function is an njit function. Currently, njit functions cannot consume usm_ndarray, so to make it work we would have to copy data back to the host after the get_neighborhood call. Doing so will distort the timing measurement. Moreover, implementing dbscan_dpex_k this way makes comparing the kernel implementation with other implementations inaccurate, as the benchmark as a whole never runs on a device/GPU. If implemented this way, comparing the timing of the kernel with any other implementation is not an apples-to-apples comparison. We either need to implement compute_cluster as a kernel or remove the dbscan_dpex_k module.
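A minimal sketch of the proposed fix for the first problem, assuming dpctl.tensor is used for allocation. The initialize signature, dtypes and shapes are illustrative, not the actual dpbench code; the point is that every kernel argument becomes a usm_ndarray allocated on the same queue, so the execution queue can be inferred under compute follows data:

import dpctl
import dpctl.tensor as dpt

def initialize(n_samples, queue=None):
    # Allocate *all* kernel arguments as usm_ndarray on one queue so that
    # numba_dpex can infer the execution queue from the inputs.
    q = queue if queue is not None else dpctl.SyclQueue()
    assignments = dpt.zeros(n_samples, dtype=dpt.int64, sycl_queue=q)
    sizes = dpt.zeros(n_samples, dtype=dpt.int64, sycl_queue=q)
    indices_list = dpt.zeros(n_samples * n_samples, dtype=dpt.int64, sycl_queue=q)
    return assignments, sizes, indices_list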

Refactoring tracker

Suggestions for improving and consolidating workload execution, time measurement, and testing:

  • Common functionality for time measurement (for Python and native workloads). Measurement should cover repeated launches of the workload, and the reported time should be the median of all runs (see the sketch after this list).
  • Separate time measurements for the kernel and for data transfer.
  • Common functionality for data generation and writing to file: a single method in "utils".
  • Documentation
  • #38
  • Add packaging (i.e. for utils)
  • Separate WL core from infrastructure
  • CI
    • #30
    • #43
    • Use clang-format for C/C++ code formatting
    • #36
  • Use config files for WL parameters (as an alternative to command-line parameters), e.g. configparser
  • #37
  • Eliminate the two separate versions of the DPC++ (and other) implementations: remove code duplication and measure data manipulation and kernel execution in a single code path
  • ...
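A minimal sketch of the proposed common timing helper (names are illustrative): the workload is re-launched a fixed number of times and the reported time is the median of all runs.

import statistics
import time

def measure_median(run_workload, *args, repeat=10, **kwargs):
    # Re-launch the workload `repeat` times and report the median wall time.
    times = []
    for _ in range(repeat):
        start = time.perf_counter()
        run_workload(*args, **kwargs)
        times.append(time.perf_counter() - start)
    return statistics.median(times)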

Add the npbench benchmarks to dpbench

The new npbench-based infrastructure makes it possible to move all npbench benchmarks into dpbench. The following steps are to be completed:

  • We should move the benchmarks from https://github.com/spcl/npbench/tree/main/npbench/benchmarks to a benchmarks/npbench directory under dpbench. Only numba and numpy versions of the npbench benchmarks should be moved from npbench upstream.
  • The bench_info for all the npbench tests should be added to dpbench/configs/bench_info/npbench.
  • The runner.py should be updated to run npbench tests.

Reorganize folders tree to `WL/API/HW`

Currently the code is organized in the following tree:

- API
  - WL
    - HW

Example:

- dpnp
  - blackscholes
    - CPU

I propose to reorganize the tree in the following way:

- WL
  - API
    - HW

Example:

- workloads
  - blackscholes
    - dpnp
      - CPU

I think it is more natural to start from the workload and then go into the details of the implementation.
Moreover, we would have a single folder per workload where we can place a README explaining the algorithm.

It could be implemented by first creating symlinks, assessing whether this layout is more convenient, and then copying the code later.

  • #39
  • Copy code for new tree order

dpcpp versions of benchmarks should be runnable via dpbench module

A dpcpp version of a dpbench benchmark should be part of the dpbench Python package:

  • dpbench.runner should run these programs
  • the programs should be built via setup.py
  • timing for these should be added to the sqlite3 database

#75 and #72 touch upon aspects of these requirements (and possibly part of the solution space).

We may also need a basic ctypes wrapper for each dpcpp implementation, record the execution time in the dpcpp program itself, and report it back to Python via a ctypes callback (see the sketch below).
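A minimal sketch of the Python side of such a wrapper. The shared library name, entry point and signature below are hypothetical; the idea is that the dpcpp program times itself and reports the result back through a ctypes callback, which the runner can then store in the sqlite3 database.

import ctypes

# Callback type: void (*)(double elapsed_seconds)
TIMER_CB = ctypes.CFUNCTYPE(None, ctypes.c_double)

timings = []

@TIMER_CB
def record_time(elapsed):
    # Called from the dpcpp program with its self-measured kernel time.
    timings.append(elapsed)

lib = ctypes.CDLL("./libblack_scholes_sycl.so")           # hypothetical library
lib.run_black_scholes.argtypes = [ctypes.c_size_t, TIMER_CB]
lib.run_black_scholes.restype = ctypes.c_int

status = lib.run_black_scholes(2**20, record_time)        # hypothetical entry point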

Evaluation of Gaussian elimination

To evaluate the performance of Numba workloads against corresponding dpcpp implementations we need to ensure that both implementations read the same input data and have a mechanism to validate the results of the computation automatically. The following steps need to be performed to add the common infrastructure.

  • Consolidate dpcpp and numba versions
  • Add common data generation and naïve python implementation to utils directory
  • Update numba implementation to read data from common data generation infrastructure
  • Add "--test" option to validate the results of the numba_dppy implementation with that of the naïve python implementation
  • Add python driver to run the native dpcpp implementation. Also add call into the common data generation module to write input data to a file
  • Update dpcpp implementation to read data from binary files
  • Add validation checks for dpcpp implementation through python driver program

Improvements to implementation summary output to make it more informative

Dpbench prints the implementation summary for the available benchmarks by calling the dpbench.infrastructure.datamodel.print_implementation_summary(conn=conn) function. The function gets called inside dpbench.runner.run_benchmarks.

The output of the function needs to be improved by printing additional information.

Implementation name | Description
numba_dpex_k        | Dpex kernel implementation
numba_dpex_n        | Dpex numpy-based implementation

  • Print a table with the platform/device the benchmark ran on. The platform info can be extracted using py-cpuinfo for numpy and numba; for dpex, dpnp, and dpcpp we can use dpctl.device_info (see the sketch at the end of this section).
    (#187)

Implementation name | Platform  | Parallel | Number of threads
numba_dpex_k        | Intel CPU | Yes      | 4
numba_dpex_n        | Intel CPU | Yes      | 4
numpy               | Intel CPU | No       | 4

  • Replace the problem size acronym with the actual size in MB/GB (#144)

Thus, the output of the print_implementation_summary should be something like:

Implementation name | Description
numba_dpex_k        | Dpex kernel implementation
numba_dpex_n        | Dpex numpy-based implementation

Implementation name | Platform  | Parallel | Number of threads
numba_dpex_k        | Intel CPU | Yes      | 4
numba_dpex_n        | Intel CPU | Yes      | 4
numpy               | Intel CPU | No       | 4

As_of               | benchmark     | problem_size | numba_dpex_k | numba_dpex_p
09.29.2022_23.00.12 | black_scholes | 50MB         | Success      | Success
09.29.2022_23.00.12 | dbscan        | 50MB         | Success      | Success
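A minimal sketch of gathering the extra platform information (the exact table layout and the keys dpbench would use may differ): py-cpuinfo covers the numpy/numba rows, and dpctl covers the dpex/dpnp/dpcpp devices.

import cpuinfo   # py-cpuinfo
import dpctl

cpu = cpuinfo.get_cpu_info()
print("CPU platform:", cpu.get("brand_raw"), "| threads:", cpu.get("count"))

dev = dpctl.select_default_device()
print("SYCL device:", dev.name, "| max compute units:", dev.max_compute_units)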

Add a C++ init and validation module

The initialization of the input data and the validation of the output data from the dpcpp versions of the benchmarks are currently done via data files generated by numpy and C++ respectively. We should opt for a better design that keeps initialization and validation fast and avoids file I/O overhead:

Proposed design:

  • Use a small C++ library to generate random numbers. Add a Pybind11 wrapper for the library so that it may be called from Python, and use the same library from both Python and DPC++.
  • Do the reverse for validation: call Python from the C++ library to generate the reference results and validate the C++ results against the Python implementation's output.

Add Python wrappers for all DPC++ implementations of benchmarks using dpctl

Along with directly calling the DPC++ implementations of the benchmarks, we should add a Python wrapper for each such DPC++ version. Once we have the wrappers, the native benchmarks can be evaluated from Python. Another benefit is the ability to measure any overhead introduced by the dpctl bindings.

  • black-scholes
  • pairwise-distance
  • knn
  • kmeans
  • l2-norm
  • dbscan
  • rambo

An initial attempt to implement a native wrapper for the black-scholes benchmark is in #61

Report problem size in MB/GB rather than preset values such as "S", "M"

dpbench stores the problem sizes in the results table as "S", "M", "L", which are defined in the JSON files in the bench_info directory. The problem size is reported in the implementation summary table in that form.

However, the acronyms (S/M/L) are hard to understand and do not give a clear idea of the actual data footprint of a benchmark. The data footprint is needed to understand whether a benchmark is running out of main memory, HBM, L3 cache, etc.

Instead, dpbench should store the actual data size, calculated as the sum over every input and output array argument of (number of elements * element size). That data size should be stored in the database and reported back (see the sketch below).
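A minimal sketch of the proposed calculation, assuming the input and output arguments are available as numpy or usm_ndarray arrays (both expose size and itemsize):

def data_footprint_mb(*arrays):
    # sum over every input/output array of (number of elements * element size)
    total_bytes = sum(a.size * a.itemsize for a in arrays)
    return total_bytes / (1024 ** 2)

For example, three float64 arrays of 2**20 elements each give 3 * 8 MiB = 24 MiB.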

Evaluation of l2_distance

To evaluate the performance of Numba workloads against corresponding dpcpp implementations we need to ensure that both implementations read the same input data and have a mechanism to validate the results of the computation automatically. The following steps need to be performed to add the common infrastructure.

  • Consolidate dpcpp and numba versions
  • Add common data generation and naïve python implementation to utils directory
  • Update numba implementation to read data from common data generation infrastructure
  • Add "--test" option to validate the results of the numba_dppy implementation with that of the naïve python implementation
  • Add python driver to run the native dpcpp implementation. Also add call into the common data generation module to write input data to a file
  • Update dpcpp implementation to read data from binary files
  • Add validation checks for dpcpp implementation through python driver program

Occasional hang when running benchmarks

Occasionally dpbench.run_benchmarks() will freeze up before completing execution. The freeze can happen arbitrarily in different benchmarks; I have personally seen it occur in knn, kmeans, l2 and gpairs.

@oleksandr-pavlyk helped narrow down the problem to a deadlock in a DPCTLQueue_Wait. The issue needs proper investigation and a resolution.

Offload gen_rand_data in the Rambo numba_dpex implementation

At the moment, gen_rand_data runs on the CPU only, due to a compilation error. The compiler complains that it cannot find a matching rand() function when the numpy call is replaced with the dpnp one:

Failed in dpex_nopython mode pipeline (step: nopython frontend)
No implementation of function Function(<built-in method rand of numpy.random.mtrand.RandomState object at 0x7f2767ccd640>) found for signature:

rand()

There are 2 candidate implementations:
- Of which 2 did not match due to:
Overload in function 'rand': File: numba/cpython/randomimpl.py: Line 1328.
With argument(s): '()':
Rejected as the implementation raised a specific error:
TypingError: Failed in dpex_nopython mode pipeline (step: nopython frontend)
No implementation of function Function(<class 'numba_dpex.dpnp_iface.stubs.dpnp.random'>) found for signature:

   >>> random()

Running on the CPU reduces the overall performance of the numba_dpex implementation. This ticket is to offload gen_rand_data (see the sketch below).
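A minimal sketch of one possible direction (shapes and names are illustrative, not the actual Rambo code): generate the random inputs with dpnp outside the jitted function, so the data is produced on the device instead of calling numpy.random inside the compiled gen_rand_data.

import dpnp

def gen_rand_data(nevts, nout):
    # dpnp.random.rand allocates and fills the arrays on the device.
    C1 = dpnp.random.rand(nevts, nout)
    F1 = dpnp.random.rand(nevts, nout)
    Q1 = dpnp.random.rand(nevts, nout) * dpnp.random.rand(nevts, nout)
    return C1, F1, Q1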

Build a driver program in C++ to run dppy.kernel generated SPIR-V

The arguments to dppy-kernel are flattened out. We need to convert C++ arrays into the flattened representation prior to submitting the numba-dppy generated SPIR-V kernel to a device.

It would be good to have a script that generates the driver function to call a numba-dppy generated kernel.

Evaluation of pairwise_distance

To evaluate the performance of Numba workloads against corresponding dpcpp implementations we need to ensure that both implementations read the same input data and have a mechanism to validate the results of the computation automatically. The following steps need to be performed to add the common infrastructure.

  • Consolidate dpcpp and numba versions
  • Add common data generation and naïve python implementation to utils directory
  • Update numba implementation to read data from common data generation infrastructure
  • Add "--test" option to validate the results of the numba_dppy implementation with that of the naïve python implementation
  • Add python driver to run the native dpcpp implementation. Also add call into the common data generation module to write input data to a file
  • Update dpcpp implementation to read data from binary files
  • Add validation checks for dpcpp implementation through python driver program

Improve validation in the new framework

Currently in the new framework, results are validated by comparing two frameworks: numpy against numba or numba-dpex. It would be better to make sure that only the results or outputs of the tests are compared (see the sketch below).
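A minimal sketch of what output-only validation could look like (the function and parameter names are illustrative, not the framework's actual API): only the arrays declared as outputs are compared, not every object the two frameworks happen to produce.

import numpy as np

def validate(ref_results, test_results, output_names, rtol=1e-5, atol=1e-8):
    # Compare only the declared outputs of the two implementations.
    return all(
        np.allclose(ref_results[name], test_results[name], rtol=rtol, atol=atol)
        for name in output_names
    )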

Evaluate DPC++ Kmeans version against numba-dppy

  • Get Rodinia_SYCL's Kmeans to compile with oneAPI latest DPC++
  • Get Rodinia_SYCL's Kmeans to compile with public DPC++ (https://github.com/intel/llvm)
  • Performance evaluation
    • Measure performance on ATS - SDP Dev Cloud and JLSE
    • Compare performance with numba-dppy's dppy.kernel version of Kmeans on ATS
  • Compare numba-dppy to oneDAL implementation of Kmeans on ATS.

Compatibility of different dependencies (Numba, dpctl, Cuda)

This benchmark suite uses multiple runtimes (Numba, Cuda) to schedule and execute computations on multiple types of devices. Some improvement is required for users to be able to understand the version compatibility of the runtimes. Suggested improvements:

  1. Tags to identify different versions of this benchmark.
  2. Each implementation directory containing a README that specifies the runtime version that this version of the benchmark supports.

The goal is to allow users to understand whether they are running the benchmark with compatible runtime dependencies.

Not all benchmarks have NumPy implementations

The reference implementation for some benchmarks is in pure Python or even uses another library, such as sklearn in the case of DBSCAN. Yet all the reference implementation files have the “_numpy” suffix. The naming convention is followed because we inherited npbench’s infrastructure. However, adding a “_numpy” suffix to non-NumPy implementations can be confusing.

I propose updating the infrastructure to also support a “_baseline” framework. If no NumPy version is available, the baseline version is used as the reference for validation and performance comparison (see the sketch below).
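A minimal sketch of the proposed fallback (the module layout and naming are illustrative, not dpbench's actual structure): prefer a "_numpy" implementation as the reference and fall back to a "_baseline" one when no NumPy version exists.

import importlib

def get_reference_impl(bench_name, package="dpbench.benchmarks"):
    # Try the NumPy reference first, then the baseline reference.
    for suffix in ("_numpy", "_baseline"):
        try:
            return importlib.import_module(f"{package}.{bench_name}.{bench_name}{suffix}")
        except ImportError:
            continue
    raise RuntimeError(f"no reference implementation found for {bench_name}")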

Need fix "Unable to join threads to shut down before fork(). This can break multithreading in child processes." warning from numba_dpex runs

The numba_dpex test cases are printing warning message:

Unable to join threads to shut down before fork(). This can break multithreading in child processes.

The warning seems to be related only to TBB; setting the numba threading layer to omp or workqueue avoids it:
export NUMBA_THREADING_LAYER=workqueue   # or omp
python -c "import dpbench; dpbench.run_benchmarks()"

An example dpbench log is attached.
dpbench_warning.log

Create common evaluation infrastructure for workloads

To evaluate the performance of Numba workloads against corresponding dpcpp implementations we need to ensure that both implementations read the same input data and have a mechanism to validate the results of the computation automatically. The following steps need to be performed to add the common infrastructure.

  1. Add common data generation and a naïve python implementation to the utils directory (a sketch of this common flow follows the list of steps).
  2. Update numba implementation to read data from common data generation infrastructure
  3. Add "--test" option to validate the results of the numba_dppy implementation with that of the naïve python implementation.
  4. Add python driver to run the native dpcpp implementation. Also add call into the common data generation module to write input data to a file
  5. Update dpcpp implementation to read data from binary files
  6. Add validation checks for dpcpp implementation through python driver program.
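A minimal sketch of the intended flow (function names, the data distribution and the file format are illustrative): one shared generator produces the input that both the Numba and dpcpp versions consume, and a naïve Python implementation provides the reference for the "--test" option.

import numpy as np

def generate_input(n, seed=7777777, path="input.bin"):
    rng = np.random.default_rng(seed)
    data = rng.random(n, dtype=np.float64)
    data.tofile(path)       # the dpcpp version reads the same binary file
    return data             # the numba version consumes the in-memory copy

def naive_python_impl(data):
    return np.sort(data)    # placeholder for the naïve reference computation

def run_test(impl_output, data, rtol=1e-5):
    # "--test" compares the implementation's output against the naïve reference.
    return np.allclose(impl_output, naive_python_impl(data), rtol=rtol)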

List of workloads:

  • blackscholes
  • knn
  • gpairs
  • kmeans
  • dbscan
  • pairwise distance
  • L2 distance
  • rambo
  • pathfinder
  • gaussian elim

Add Dpnp implementations to dpbench

Port dpnp implementations from main_old to main.

The following benchmarks are available in main_old:

  • blackscholes
  • pairwise_distance
  • pca

Note: The l2_distance benchmark was rewritten in main and the existing dpnp implementation is no longer applicable.

Improvements to runner.py

  • Provide a way to run a benchmark for a single framework. Currently, run_benchmark runs a benchmark for all frameworks (see the sketch after this list).
  • Running a benchmark and validating it should be separate steps
  • Updating the results database should be a separate step
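A minimal sketch of the proposed split (all names are hypothetical, not dpbench's actual API): running a benchmark, validating the results and updating the database become independent steps, and a subset of frameworks can be selected.

def run_benchmark(impls, frameworks=None):
    # impls maps framework name -> callable; frameworks optionally restricts the run.
    selected = impls if frameworks is None else {k: impls[k] for k in frameworks}
    return {name: fn() for name, fn in selected.items()}

def validate_results(results, compare, reference="numpy"):
    # Separate validation step: compare every framework against the reference.
    ref = results[reference]
    return {name: compare(ref, out) for name, out in results.items() if name != reference}

# Updating the sqlite3 results database would be a third, independent step.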

L2 distance benchmark fails validation

The L2 distance numba_dpex_k implementation fails the data validation step on CPU devices. Incidentally, the reference dpcpp implementation also fails data validation. The issue may point to the use of atomic references. It needs further investigation, and the issue can be migrated to numba-dpex or dpcpp once we know the root cause.

dpbench open problems to fix

dpbench reliability

  • #79
  • A benchmark uses the same version for the kernel and non-kernel runs if one of them is absent, instead of raising an explicit error. There are no checks that we actually have a kernel or parfor version; the same code is used in both cases if not.
  • The quality of the numpy reference implementations has not been evaluated yet; they might have the same issues found in the numba code.
  • Perf mode doesn't check accuracy.
  • Different folders for CPU and GPU, hence too much duplication; bad code-wise and conceptually. Merging the CPU and GPU implementations would give users an understanding of how complicated it is to offload.
  • Lack of automation.
  • Tiny (7 workloads / 11 kernels + parfor versions).

workloads

  • L2 distance kernel mode doesn't work for dppy (inconsistent timing and incorrect results).
  • blackscholes uses a context for the CPU version as well, so it is the only benchmark that tests CPU SPIR-V offload.
  • rambo is benchmarked including data generation while the other workloads are not.
  • kmeans offloads only the first kernel out of three to the GPU with a context.
  • Too much dead/commented-out code; inconsistent naming; abandoned experimental files.
  • What is our license for dpbench?
  • dbscan (and others) mixes offloaded and non-offloaded code.
  • Different data types in different workloads make them hard to compare.
  • Workloads extensively use atomics, which seem to be slow.
  • Do we have a rationale for the problem sizes?
    • Time should grow with problem size
    • What is the resolution of the timer?
    • What are those MOPS for?
    • npbench's approach to problem size selection (S, M, L)
  • How do we support mixed CPU/GPU execution?

dppy

  • No checks that we actually offload to the GPU; this may result in misleading numbers.
  • Cannot run both dpcomp and dppy in the same environment, which is strange; dppy seems to somehow change the default Numba flow.

dpcomp

  • no debug info about actual offload
  • macro for dumping IR to file/directory

benchmark summary

# workload kernel API Runs, dppy kernel API Passes, dppy Njit, dppy # kernels (+ jit functions) kernel, dpcomp Njit, dpcomp Access pattern/ Dwarf
1 blackscholes Yes Yes 1 no lower Yes
2 dbscan Yes Yes 1 (+1) cannot offload compute_clusters Yes
3 kmeans Yes Yes 5 Yes Yes
4 knn Yes Yes No impl 1 Yes No impl
5 l2_distance Yes False 1 Yes Yes Reduction
6 pairwise_distance Yes Yes 1 Yes Yes
7 pca Yes No test 1 (+2) No test
8 rambo Yes Yes 1 (+5) no lower for random Yes
9 gpairs Yes Yes No impl 1 Yes No impl
10 pathfinder disabled N/A N/A N/A N/A N/A N/A
11 gaussian_elim disabled N/A N/A N/A N/A N/A N/A

Branch to reproduce

dppy reported problems


Types of benchmarks

  1. Performance case study / microbenchmarks: stress architecture capabilities, learn the architecture by experiment, roofline analysis.
  2. Python performance quality
    • should express the same algorithm in dpcpp and Numba / expressiveness of Numba / quality of SPIR-V codegen
    • should produce similar SPIR-V / highlight missing features, architecture problems, frontend inexpressiveness
  3. Compiler benchmark: compare Numba with different backends (DPPY and dpcomp) - faster or slower than dppy and why; knowing where we are is important for feature planning; it turns the task of creating an ideal compiler into the task of creating a production compiler that works.

Future plans

  • finish PR
  • USM memory
  • npbench
  • publish scripts for running
  • performance key study for some workload

Add an automation to generate the __init__.py file for each benchmark

As benchmark directories are now Python modules, any new benchmark requires an __init__.py file. The requirement is an impediment to adding the npbench benchmarks to dpbench, as too much manual work would be required.

The __init__.py can instead be mechanically generated by going through the directory and looking at configuration JSON files.

We need to add such a script, to be run manually after new benchmarks are added (see the sketch below).
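A minimal sketch of such a generator (paths and the config key are assumptions, not the actual dpbench layout): walk the bench_info JSON files and write an __init__.py that re-exports the implementation modules found in each benchmark directory.

import json
import pathlib

CONFIG_DIR = pathlib.Path("dpbench/configs/bench_info")   # assumed location
BENCH_ROOT = pathlib.Path("dpbench/benchmarks")            # assumed location

for cfg_file in CONFIG_DIR.glob("*.json"):
    cfg = json.loads(cfg_file.read_text())
    bench = cfg["benchmark"]["module_name"]                # assumed config key
    bench_dir = BENCH_ROOT / bench
    if not bench_dir.is_dir():
        continue
    # Re-export every implementation module, e.g. black_scholes_numpy.py.
    modules = sorted(p.stem for p in bench_dir.glob(f"{bench}_*.py"))
    lines = [f"from .{m} import *  # noqa: F401,F403" for m in modules]
    (bench_dir / "__init__.py").write_text("\n".join(lines) + "\n")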
