intelpython / dpbench

Benchmark suite to evaluate Data Parallel Extensions for Python

License: Apache License 2.0

Languages: Python 92.41%, CMake 0.91%, C++ 6.18%, Mako 0.05%, Batchfile 0.29%, Shell 0.16%
Topics: benchmark, dpctl, dpnp, numba, numba-dpex, numpy, performance

dpbench's People

Contributors

adarshyoga, akharche, alexanderkalistratov, andresguzman-ballen, beckermr, chudur-budur, denisscherbakov, diptorupd, hardcode84, jharlow-intel, jtchilders, mingjie-intel, npolina4, oleksandr-pavlyk, pokhodenkosa, reazulhoque, samaid, zzeekkaa

dpbench's Issues

The given compilation instructions for numba-dpex don't work

If I follow the numba-dpex compilation instructions in dpbench's README, the build fails with this error:

python setup.py develop
Compiling numba_dpex/dpnp_iface/dpnp_fptr_interface.pyx because it changed.
[1/1] Cythonizing numba_dpex/dpnp_iface/dpnp_fptr_interface.pyx
running develop
/localdisk/work/$USER/.dpbench/lib/python3.9/site-packages/setuptools/command/easy_install.py:144: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
/localdisk/work/$USER/.dpbench/lib/python3.9/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
running egg_info
creating numba_dpex.egg-info
writing numba_dpex.egg-info/PKG-INFO
writing dependency_links to numba_dpex.egg-info/dependency_links.txt
writing entry points to numba_dpex.egg-info/entry_points.txt
writing requirements to numba_dpex.egg-info/requires.txt
writing top-level names to numba_dpex.egg-info/top_level.txt
writing manifest file 'numba_dpex.egg-info/SOURCES.txt'
reading manifest file 'numba_dpex.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
adding license file 'LICENSE'
writing manifest file 'numba_dpex.egg-info/SOURCES.txt'
running build_ext
building 'numba_dpex._usm_allocators_ext' extension
creating build
creating build/temp.linux-x86_64-3.9
creating build/temp.linux-x86_64-3.9/numba_dpex
creating build/temp.linux-x86_64-3.9/numba_dpex/dpctl_iface
gcc -pthread -B /localdisk/work/$USER/.dpbench/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -O2 -Wall -fPIC -O2 -isystem /localdisk/work/$USER/.dpbench/include -I/localdisk/work/$USER/.dpbench/include -fPIC -O2 -isystem /localdisk/work/$USER/.dpbench/include -fPIC -I/localdisk/work/$USER/.dpbench/lib/python3.9/site-packages -I/localdisk/work/$USER/.dpbench/lib/python3.9/site-packages/dpctl/include -I/localdisk/work/$USER/.dpbench/include/python3.9 -c numba_dpex/dpctl_iface/usm_allocators_ext.c -o build/temp.linux-x86_64-3.9/numba_dpex/dpctl_iface/usm_allocators_ext.o
In file included from /localdisk/work/$USER/.dpbench/lib/python3.9/site-packages/dpctl/include/dpctl_sycl_interface.h:35,
                 from numba_dpex/dpctl_iface/usm_allocators_ext.c:35:
/localdisk/work/$USER/.dpbench/lib/python3.9/site-packages/dpctl/include/syclinterface/dpctl_sycl_device_interface.h:253:1: error: expected identifier or ‘(’ before ‘[’ token
  253 | [[deprecated("Use DPCTLDevice_WorkItemSizes3d instead")]] DPCTL_API
      | ^
error: command '/usr/bin/gcc' failed with exit code 1

However, if I follow the instructions from the numba-dpex page, like this:

conda create -n dpbench-dev -c /opt/intel/oneapi/conda_channel python=3.9 dpctl dpnp numba spirv-tools llvm-spirv llvmdev cython pytest
conda activate dpbench-dev
python setup.py develop

It works.

I think the compilation instructions in the dpbench README need to be updated.

Possible improvements to framework

  • It would be nice to use Python's logging module instead of print statements for outputting benchmarking information.

  • The need to keep benchmark implementation argument names the same across benchmarks is odd. It stems from passing them as keyword arguments when, in reality, they are all positional. Perhaps we could add a positional_args entry to the benchmark config that specifies the list of named arguments.
    We can then pop these from the dictionary and call the implementation as impl_fn(*posargs, **kwargs) (see the sketch below).
    Otherwise anybody adding a new benchmark will invariably run into argument-name mismatches.

Originally posted by @oleksandr-pavlyk in #115 (comment)
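A minimal sketch of the proposed dispatch, assuming a hypothetical bench_config dict with a positional_args entry listing argument names in call order (the entry name comes from the suggestion above; the surrounding helper names are illustrative, not dpbench's actual API):

def call_impl(impl_fn, bench_config, data):
    # data maps argument names to the initialized arrays/scalars
    kwargs = dict(data)
    # Pop the declared positional arguments in the configured order ...
    posargs = [kwargs.pop(name) for name in bench_config.get("positional_args", [])]
    # ... and pass everything else by keyword, so implementations are free to
    # choose their own names for the remaining arguments.
    return impl_fn(*posargs, **kwargs)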

Reference configuration

New developers and users need to understand how to get started. This should be documented, and helper scripts could be provided.

  • How to configure environment
    • How to create conda environment
    • How to install dpbench
    • How to install oneAPI
    • How to install VTune
    • ...
  • How to run workloads
  • Where to search for reference results
  • ...

DBSCAN dpex_k implementation does not work under compute follows data.

The dpex_k implementation for DBSCAN currently fails execution with a rather cryptic message stating:

"Datatypes of array passed to @numba_dpex.kernel has to be the same. Passed datatypes: "...

The error message needs fixing and I am working on a dpex PR to address that.

What the error message is really saying is that the implementation does not follow the "compute follows data" programming model. Under compute follows data, the execution queue for the kernel should be discoverable from the input array arguments.

There are two problems with the current implementation.

  • In the dpex_k implementation, the arguments to the kernel are n_samples, min_pts, assignments, sizes, indices_list. Of these, sizes and indices_list are not allocated in the initialize function and therefore are never copied to usm_ndarray. The kernel inputs are a mix of numpy.ndarray and dpctl.tensor.usm_ndarray, and there is no way to infer the execution queue using compute follows data; hence the dpex error. To fix the issue, the creation of these two arrays needs to be moved into the initialize call (see the sketch after this list).

  • Fixing the first problem will expose the next issue, which is currently hidden by the first failure. Only the get_neighborhood function is a kernel; the compute_cluster function is an njit function. Currently, njit functions cannot consume usm_ndarray, so to make it work we would have to copy data back to the host after the get_neighborhood call. Doing so will distort the timing measurement. Moreover, implementing dbscan_dpex_k this way makes comparing the kernel implementation with other implementations inaccurate, as the benchmark as a whole never runs on a device/GPU. If implemented this way, comparing the timing of the kernel with any other implementation is not an apples-to-apples comparison. We either need to implement compute_cluster as a kernel or remove the dbscan_dpex_k module.
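A minimal sketch of the proposed fix for the first problem, assuming dpctl.tensor is used for allocation. The initialize signature, dtypes and shapes are illustrative, not the actual dpbench code; the point is that every kernel argument becomes a usm_ndarray allocated on the same queue, so the execution queue can be inferred under compute follows data:

import dpctl
import dpctl.tensor as dpt

def initialize(n_samples, queue=None):
    # Allocate *all* kernel arguments as usm_ndarray on one queue so that
    # numba_dpex can infer the execution queue from the inputs.
    q = queue if queue is not None else dpctl.SyclQueue()
    assignments = dpt.zeros(n_samples, dtype=dpt.int64, sycl_queue=q)
    sizes = dpt.zeros(n_samples, dtype=dpt.int64, sycl_queue=q)
    indices_list = dpt.zeros(n_samples * n_samples, dtype=dpt.int64, sycl_queue=q)
    return assignments, sizes, indices_list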

Refactoring tracker

Suggestions for improving and consolidating workload execution, time measurement, and testing:

  • Common functionality for time measurement (for Python and native workloads). Measurement should cover repeated launches of the workload, and the reported time should be the median of all runs (see the sketch after this list).
  • Separate time measurements for the kernel and for data transfer.
  • Common functionality for data generation and writing to file: a single method in "utils".
  • Documentation
  • #38
  • Add packaging (i.e. for utils)
  • Separate WL core from infrastructure
  • CI
    • #30
    • #43
    • Use clang-format for C/C++ code formatting
    • #36
  • Use config files for WL parameters (as an alternative to command-line parameters), e.g. configparser
  • #37
  • Eliminate the two separate versions of the DPC++ (and other) implementations: remove code duplication and measure data manipulation and kernel execution in a single code path
  • ...
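A minimal sketch of the proposed common timing helper (names are illustrative): the workload is re-launched a fixed number of times and the reported time is the median of all runs.

import statistics
import time

def measure_median(run_workload, *args, repeat=10, **kwargs):
    # Re-launch the workload `repeat` times and report the median wall time.
    times = []
    for _ in range(repeat):
        start = time.perf_counter()
        run_workload(*args, **kwargs)
        times.append(time.perf_counter() - start)
    return statistics.median(times)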

Add the npbench benchmarks to dpbench

The new npbench-based infrastructure makes it possible to move all npbench benchmarks into dpbench. The following steps are to be completed:

  • We should move the benchmarks from https://github.com/spcl/npbench/tree/main/npbench/benchmarks to a benchmarks/npbench directory under dpbench. Only numba and numpy versions of the npbench benchmarks should be moved from npbench upstream.
  • The bench_info for all the npbench tests should be added to dpbench/configs/bench_info/npbench.
  • The runner.py should be updated to run npbench tests.

Reorganize folders tree to `WL/API/HW`

Currently the code is organized in the following tree:

- API
  - WL
    - HW

Example:

- dpnp
  - blackscholes
    - CPU

I propose to reorganize the tree in the following way:

- WL
  - API
    - HW

Example:

- workloads
  - blackscholes
    - dpnp
      - CPU

I think it is more natural to start from the workload and then go into the details of the implementation.
Moreover, we would have a single folder per workload where we can place a README explaining the algorithm.

It could be implemented by first creating symlinks, assessing whether this layout is more convenient, and then copying the code later.

  • #39
  • Copy code for new tree order

dpcpp versions of benchmarks should be runnable via dpbench module

A dpcpp version of a dpbench benchmark should be part of the dpbench Python package:

  • dpbench.runner should run these programs
  • the programs should be built via setup.py
  • timing for these should be added to the sqlite3 database

#75 and #72 touch upon aspects of these requirements (and possibly part of the solution space).

We may also need a basic ctypes wrapper for each dpcpp implementation, record the execution time in the dpcpp program itself, and report it back to Python via a ctypes callback (see the sketch below).
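A minimal sketch of the Python side of such a wrapper. The shared library name, entry point and signature below are hypothetical; the idea is that the dpcpp program times itself and reports the result back through a ctypes callback, which the runner can then store in the sqlite3 database.

import ctypes

# Callback type: void (*)(double elapsed_seconds)
TIMER_CB = ctypes.CFUNCTYPE(None, ctypes.c_double)

timings = []

@TIMER_CB
def record_time(elapsed):
    # Called from the dpcpp program with its self-measured kernel time.
    timings.append(elapsed)

lib = ctypes.CDLL("./libblack_scholes_sycl.so")           # hypothetical library
lib.run_black_scholes.argtypes = [ctypes.c_size_t, TIMER_CB]
lib.run_black_scholes.restype = ctypes.c_int

status = lib.run_black_scholes(2**20, record_time)        # hypothetical entry point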

Evaluation of Gaussian elimination

To evaluate the performance of Numba workloads against corresponding dpcpp implementations we need to ensure that both implementations read the same input data and have a mechanism to validate the results of the computation automatically. The following steps need to be performed to add the common infrastructure.

  • Consolidate dpcpp and numba versions
  • Add common data generation and naïve python implementation to utils directory
  • Update numba implementation to read data from common data generation infrastructure
  • Add "--test" option to validate the results of the numba_dppy implementation with that of the naïve python implementation
  • Add python driver to run the native dpcpp implementation. Also add call into the common data generation module to write input data to a file
  • Update dpcpp implementation to read data from binary files
  • Add validation checks for dpcpp implementation through python driver program

Improvements to implementation summary output to make it more informative

Dpbench prints the implementation summary for the available benchmarks by calling the dpbench.infrastructure.datamodel.print_implementation_summary(conn=conn) function. The function gets called inside dpbench.runner.run_benchmarks.

The output of the function needs to be improved by printing additional information.

Implementation name | Description
numba_dpex_k        | Dpex kernel implementation
numba_dpex_n        | Dpex numpy-based implementation

  • Print a table with the platform/device the benchmark ran on. The platform info can be extracted using py-cpuinfo for numpy and numba; for dpex, dpnp, and dpcpp we can use dpctl.device_info (see the sketch at the end of this section).
    (#187)

Implementation name | Platform  | Parallel | Number of threads
numba_dpex_k        | Intel CPU | Yes      | 4
numba_dpex_n        | Intel CPU | Yes      | 4
numpy               | Intel CPU | No       | 4

  • Replace the problem size acronym with the actual size in MB/GB (#144)

Thus, the output of the print_implementation_summary should be something like:

Implementation name | Description
numba_dpex_k        | Dpex kernel implementation
numba_dpex_n        | Dpex numpy-based implementation

Implementation name | Platform  | Parallel | Number of threads
numba_dpex_k        | Intel CPU | Yes      | 4
numba_dpex_n        | Intel CPU | Yes      | 4
numpy               | Intel CPU | No       | 4

As_of               | benchmark     | problem_size | numba_dpex_k | numba_dpex_p
09.29.2022_23.00.12 | black_scholes | 50MB         | Success      | Success
09.29.2022_23.00.12 | dbscan        | 50MB         | Success      | Success
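A minimal sketch of gathering the extra platform information (the exact table layout and the keys dpbench would use may differ): py-cpuinfo covers the numpy/numba rows, and dpctl covers the dpex/dpnp/dpcpp devices.

import cpuinfo   # py-cpuinfo
import dpctl

cpu = cpuinfo.get_cpu_info()
print("CPU platform:", cpu.get("brand_raw"), "| threads:", cpu.get("count"))

dev = dpctl.select_default_device()
print("SYCL device:", dev.name, "| max compute units:", dev.max_compute_units)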

Add a C++ init and validation module

The initialization of the input data and the validation of the output data from the dpcpp versions of the benchmarks are currently done via data files generated by numpy and C++ respectively. We should opt for a better design that keeps initialization and validation fast and avoids file I/O overhead:

Proposed design:

  • Use a small C++ library to generate random numbers. Add a Pybind11 wrapper for the library so that it may be called from Python, and use the same library from both Python and DPC++.
  • Do the reverse for validation: call Python from the C++ library to generate the reference results and validate the C++ results against the Python implementation's output.

Add Python wrappers for all DPC++ implementations of benchmarks using dpctl

Along with directly calling the DPC++ implementations of the benchmarks, we should add a Python wrapper for each such DPC++ version. Once we have the wrappers, the native benchmarks can be evaluated from Python. Another benefit is the ability to measure any overhead introduced by the dpctl bindings.

  • black-scholes
  • pairwise-distance
  • knn
  • kmeans
  • l2-norm
  • dbscan
  • rambo

An initial attempt to implement a native wrapper for the black-scholes benchmark is in #61

Report problem size in MB/GB rather than preset values such as "S", "M"

dpbench stores the problem sizes in the results table as "S", "M", "L", which are defined in the JSON files in the bench_info directory. The problem size is reported in the implementation summary table in that form.

However, the acronyms (S/M/L) are hard to understand and do not give a clear idea of the actual data footprint of a benchmark. The data footprint is needed to understand whether a benchmark is running out of main memory, HBM, L3 cache, etc.

Instead, dpbench should store the actual data size, calculated as the sum over every input and output array argument of (number of elements * element size). That data size should be stored in the database and reported back (see the sketch below).
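A minimal sketch of the proposed calculation, assuming the input and output arguments are available as numpy or usm_ndarray arrays (both expose size and itemsize):

def data_footprint_mb(*arrays):
    # sum over every input/output array of (number of elements * element size)
    total_bytes = sum(a.size * a.itemsize for a in arrays)
    return total_bytes / (1024 ** 2)

For example, three float64 arrays of 2**20 elements each give 3 * 8 MiB = 24 MiB.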

Evaluation of l2_distance

To evaluate the performance of Numba workloads against corresponding dpcpp implementations we need to ensure that both implementations read the same input data and have a mechanism to validate the results of the computation automatically. The following steps need to be performed to add the common infrastructure.

  • Consolidate dpcpp and numba versions
  • Add common data generation and naïve python implementation to utils directory
  • Update numba implementation to read data from common data generation infrastructure
  • Add "--test" option to validate the results of the numba_dppy implementation with that of the naïve python implementation
  • Add python driver to run the native dpcpp implementation. Also add call into the common data generation module to write input data to a file
  • Update dpcpp implementation to read data from binary files
  • Add validation checks for dpcpp implementation through python driver program

Occasional hang when running benchmarks

Occasionally dpbench.run_benchmarks() will freeze up before completing execution. The freeze can happen arbitrarily in different benchmarks; I have personally seen it occur in knn, kmeans, l2 and gpairs.

@oleksandr-pavlyk helped narrow down the problem to a deadlock in a DPCTLQueue_Wait. The issue needs proper investigation and a resolution.

Offload gen_rand_data in the Rambo numba_dpex implementation

At the moment, gen_rand_data runs on the CPU only, due to a compilation error. The compiler complains that it cannot find a matching rand() function when the numpy call is replaced with the dpnp one:

Failed in dpex_nopython mode pipeline (step: nopython frontend)
No implementation of function Function(<built-in method rand of numpy.random.mtrand.RandomState object at 0x7f2767ccd640>) found for signature:

rand()

There are 2 candidate implementations:
- Of which 2 did not match due to:
Overload in function 'rand': File: numba/cpython/randomimpl.py: Line 1328.
With argument(s): '()':
Rejected as the implementation raised a specific error:
TypingError: Failed in dpex_nopython mode pipeline (step: nopython frontend)
No implementation of function Function(<class 'numba_dpex.dpnp_iface.stubs.dpnp.random'>) found for signature:

   >>> random()

Running on the CPU reduces the overall performance of the numba_dpex implementation. This ticket is to offload gen_rand_data (see the sketch below).
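A minimal sketch of one possible direction (shapes and names are illustrative, not the actual Rambo code): generate the random inputs with dpnp outside the jitted function, so the data is produced on the device instead of calling numpy.random inside the compiled gen_rand_data.

import dpnp

def gen_rand_data(nevts, nout):
    # dpnp.random.rand allocates and fills the arrays on the device.
    C1 = dpnp.random.rand(nevts, nout)
    F1 = dpnp.random.rand(nevts, nout)
    Q1 = dpnp.random.rand(nevts, nout) * dpnp.random.rand(nevts, nout)
    return C1, F1, Q1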

Build a driver program in C++ to run dppy.kernel generated SPIR-V

The arguments to dppy-kernel are flattened out. We need to convert C++ arrays into the flattened representation prior to submitting the numba-dppy generated SPIR-V kernel to a device.

It would be good to have a script that generates the driver function to call a numba-dppy generated kernel.

Evaluation of pairwise_distance

To evaluate the performance of Numba workloads against corresponding dpcpp implementations we need to ensure that both implementations read the same input data and have a mechanism to validate the results of the computation automatically. The following steps need to be performed to add the common infrastructure.

  • Consolidate dpcpp and numba versions
  • Add common data generation and naïve python implementation to utils directory
  • Update numba implementation to read data from common data generation infrastructure
  • Add "--test" option to validate the results of the numba_dppy implementation with that of the naïve python implementation
  • Add python driver to run the native dpcpp implementation. Also add call into the common data generation module to write input data to a file
  • Update dpcpp implementation to read data from binary files
  • Add validation checks for dpcpp implementation through python driver program

Improve validation in the new framework

Currently in the new framework, results are validated by comparing two frameworks: numpy against numba or numba-dpex. It would be better to make sure that only the results or outputs of the tests are compared (see the sketch below).
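A minimal sketch of what output-only validation could look like (the function and parameter names are illustrative, not the framework's actual API): only the arrays declared as outputs are compared, not every object the two frameworks happen to produce.

import numpy as np

def validate(ref_results, test_results, output_names, rtol=1e-5, atol=1e-8):
    # Compare only the declared outputs of the two implementations.
    return all(
        np.allclose(ref_results[name], test_results[name], rtol=rtol, atol=atol)
        for name in output_names
    )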

Evaluate DPC++ Kmeans version against numba-dppy

  • Get Rodinia_SYCL's Kmeans to compile with oneAPI latest DPC++
  • Get Rodinia_SYCL's Kmeans to compile with public DPC++ (https://github.com/intel/llvm)
  • Performance evaluation
    • Measure performance on ATS - SDP Dev Cloud and JLSE
    • Compare performance with numba-dppy's dppy.kernel version of Kmeans on ATS
  • Compare numba-dppy to oneDAL implementation of Kmeans on ATS.

Compatibility of different dependencies (Numba, dpctl, Cuda)

This benchmark suite uses multiple runtimes (Numba, Cuda) to schedule and execute computations on multiple types of devices. Some improvement is required for users to be able to understand the version compatibility of the runtimes. Suggested improvements:

  1. Tags to identify different versions of this benchmark.
  2. Each implementation directory containing a README that specifies the runtime version that this version of the benchmark supports.

The goal is to allow users to understand whether they are running the benchmark with compatible runtime dependencies.

Not all benchmarks have NumPy implementations

The reference implementation for some benchmarks is in pure Python or even uses another library, such as sklearn in the case of DBSCAN. Yet all the reference implementation files have the “_numpy” suffix. The naming convention is followed because we inherited npbench’s infrastructure. However, adding a “_numpy” suffix to non-NumPy implementations can be confusing.

I propose updating the infrastructure to also support a “_baseline” framework. If no NumPy version is available, the baseline version is used as the reference for validation and performance comparison (see the sketch below).
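A minimal sketch of the proposed fallback (the module layout and naming are illustrative, not dpbench's actual structure): prefer a "_numpy" implementation as the reference and fall back to a "_baseline" one when no NumPy version exists.

import importlib

def get_reference_impl(bench_name, package="dpbench.benchmarks"):
    # Try the NumPy reference first, then the baseline reference.
    for suffix in ("_numpy", "_baseline"):
        try:
            return importlib.import_module(f"{package}.{bench_name}.{bench_name}{suffix}")
        except ImportError:
            continue
    raise RuntimeError(f"no reference implementation found for {bench_name}")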

Need fix "Unable to join threads to shut down before fork(). This can break multithreading in child processes." warning from numba_dpex runs

The numba_dpex test cases are printing warning message:

Unable to join threads to shut down before fork(). This can break multithreading in child processes.

The warning seems to be related only to TBB; setting the numba threading layer to omp or workqueue avoids it:
export NUMBA_THREADING_LAYER=workqueue   # or omp
python -c "import dpbench; dpbench.run_benchmarks()"

An example dpbench log is attached.
dpbench_warning.log

Create common evaluation infrastructure for workloads

To evaluate the performance of Numba workloads against corresponding dpcpp implementations we need to ensure that both implementations read the same input data and have a mechanism to validate the results of the computation automatically. The following steps need to be performed to add the common infrastructure.

  1. Add common data generation and a naïve python implementation to the utils directory (a sketch of this common flow follows the list of steps).
  2. Update numba implementation to read data from common data generation infrastructure
  3. Add "--test" option to validate the results of the numba_dppy implementation with that of the naïve python implementation.
  4. Add python driver to run the native dpcpp implementation. Also add call into the common data generation module to write input data to a file
  5. Update dpcpp implementation to read data from binary files
  6. Add validation checks for dpcpp implementation through python driver program.
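A minimal sketch of the intended flow (function names, the data distribution and the file format are illustrative): one shared generator produces the input that both the Numba and dpcpp versions consume, and a naïve Python implementation provides the reference for the "--test" option.

import numpy as np

def generate_input(n, seed=7777777, path="input.bin"):
    rng = np.random.default_rng(seed)
    data = rng.random(n, dtype=np.float64)
    data.tofile(path)       # the dpcpp version reads the same binary file
    return data             # the numba version consumes the in-memory copy

def naive_python_impl(data):
    return np.sort(data)    # placeholder for the naïve reference computation

def run_test(impl_output, data, rtol=1e-5):
    # "--test" compares the implementation's output against the naïve reference.
    return np.allclose(impl_output, naive_python_impl(data), rtol=rtol)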

List of workloads:

  • blackscholes
  • knn
  • gpairs
  • kmeans
  • dbscan
  • pairwise distance
  • L2 distance
  • rambo
  • pathfinder
  • gaussian elim

Add Dpnp implementations to dpbench

Port dpnp implementations from main_old to main.

The following benchmarks are available in main_old:

  • blackscholes
  • pairwise_distance
  • pca

Note: The l2_distance benchmark was rewritten in main and the existing dpnp implementation is no longer applicable.

Improvements to runner.py

  • Provide a way to run a benchmark for a single framework. Currently, run_benchmark runs a benchmark for all frameworks (see the sketch after this list).
  • Running a benchmark and validating it should be separate steps
  • Updating the results database should be a separate step
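A minimal sketch of the proposed split (all names are hypothetical, not dpbench's actual API): running a benchmark, validating the results and updating the database become independent steps, and a subset of frameworks can be selected.

def run_benchmark(impls, frameworks=None):
    # impls maps framework name -> callable; frameworks optionally restricts the run.
    selected = impls if frameworks is None else {k: impls[k] for k in frameworks}
    return {name: fn() for name, fn in selected.items()}

def validate_results(results, compare, reference="numpy"):
    # Separate validation step: compare every framework against the reference.
    ref = results[reference]
    return {name: compare(ref, out) for name, out in results.items() if name != reference}

# Updating the sqlite3 results database would be a third, independent step.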

L2 distance benchmark fails validation

The L2 distance numba_dpex_k implementation fails the data validation step on CPU devices. Incidentally, the reference dpcpp implementation also fails data validation. The issue may point to the use of atomic references. It needs further investigation, and the issue can be migrated to numba-dpex or dpcpp once we know the root cause.

dpbench open problems to fix

dpbench reliability

  • #79
  • A benchmark uses the same version for the kernel and non-kernel runs if one of them is absent, instead of raising an explicit error. There are no checks that we actually have a kernel or parfor version; the same code is used in both cases if not.
  • The quality of the numpy reference implementations has not been evaluated yet; they might have the same issues found in the numba code.
  • Perf mode doesn't check accuracy.
  • Different folders for CPU and GPU, hence too much duplication; bad code-wise and conceptually. Merging the CPU and GPU implementations would give users an understanding of how complicated it is to offload.
  • Lack of automation.
  • Tiny (7 workloads / 11 kernels + parfor versions).

workloads

  • L2 distance kernel mode doesn't work for dppy (inconsistent timing and incorrect results).
  • blackscholes uses a context for the CPU version as well, so it is the only benchmark that tests CPU SPIR-V offload.
  • rambo is benchmarked including data generation while the other workloads are not.
  • kmeans offloads only the first kernel out of three to the GPU with a context.
  • Too much dead/commented-out code; inconsistent naming; abandoned experimental files.
  • What is our license for dpbench?
  • dbscan (and others) mixes offloaded and non-offloaded code.
  • Different data types in different workloads make them hard to compare.
  • Workloads extensively use atomics, which seem to be slow.
  • Do we have a rationale for the problem sizes?
    • Time should grow with problem size
    • What is the resolution of the timer?
    • What are those MOPS for?
    • npbench's approach to problem size selection (S, M, L)
  • How do we support mixed CPU/GPU execution?

dppy

  • No checks that we actually offload to the GPU; this may result in misleading numbers.
  • Cannot run both dpcomp and dppy in the same environment, which is strange; dppy seems to somehow change the default Numba flow.

dpcomp

  • no debug info about actual offload
  • macro for dumping IR to file/directory

benchmark summary

# workload kernel API Runs, dppy kernel API Passes, dppy Njit, dppy # kernels (+ jit functions) kernel, dpcomp Njit, dpcomp Access pattern/ Dwarf
1 blackscholes Yes Yes 1 no lower Yes
2 dbscan Yes Yes 1 (+1) cannot offload compute_clusters Yes
3 kmeans Yes Yes 5 Yes Yes
4 knn Yes Yes No impl 1 Yes No impl
5 l2_distance Yes False 1 Yes Yes Reduction
6 pairwise_distance Yes Yes 1 Yes Yes
7 pca Yes No test 1 (+2) No test
8 rambo Yes Yes 1 (+5) no lower for random Yes
9 gpairs Yes Yes No impl 1 Yes No impl
10 pathfinder disabled N/A N/A N/A N/A N/A N/A
11 gaussian_elim disabled N/A N/A N/A N/A N/A N/A

Branch to reproduce

dppy reported problems


Types of benchmarks

  1. Performance case study / microbenchmarks: stress architecture capabilities, learn the architecture by experiment, roofline analysis.
  2. Python performance quality
    • should express the same algorithm in dpcpp and Numba / expressiveness of Numba / quality of SPIR-V codegen
    • should produce similar SPIR-V / highlight missing features, architecture problems, frontend inexpressiveness
  3. Compiler benchmark: compare Numba with different backends (DPPY and dpcomp) - faster or slower than dppy and why; knowing where we are is important for feature planning; it turns the task of creating an ideal compiler into the task of creating a production compiler that works.

Future plans

  • finish PR
  • USM memory
  • npbench
  • publish scripts for running
  • performance key study for some workload

Add an automation to generate the __init__.py file for each benchmark

As benchmark directories are now Python modules, any new benchmark requires an __init__.py file. The requirement is an impediment to adding the npbench benchmarks to dpbench, as too much manual work would be required.

The __init__.py can instead be mechanically generated by going through the directory and looking at configuration JSON files.

We need to add such a script, to be run manually after new benchmarks are added (see the sketch below).
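A minimal sketch of such a generator (paths and the config key are assumptions, not the actual dpbench layout): walk the bench_info JSON files and write an __init__.py that re-exports the implementation modules found in each benchmark directory.

import json
import pathlib

CONFIG_DIR = pathlib.Path("dpbench/configs/bench_info")   # assumed location
BENCH_ROOT = pathlib.Path("dpbench/benchmarks")            # assumed location

for cfg_file in CONFIG_DIR.glob("*.json"):
    cfg = json.loads(cfg_file.read_text())
    bench = cfg["benchmark"]["module_name"]                # assumed config key
    bench_dir = BENCH_ROOT / bench
    if not bench_dir.is_dir():
        continue
    # Re-export every implementation module, e.g. black_scholes_numpy.py.
    modules = sorted(p.stem for p in bench_dir.glob(f"{bench}_*.py"))
    lines = [f"from .{m} import *  # noqa: F401,F403" for m in modules]
    (bench_dir / "__init__.py").write_text("\n".join(lines) + "\n")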
