intelpython / dpbench
Benchmark suite to evaluate Data Parallel Extensions for Python
License: Apache License 2.0
If I follow the numba-dpex compilation instructions in dpbench's readme, it doesn't work. I get this error:
python setup.py develop
Compiling numba_dpex/dpnp_iface/dpnp_fptr_interface.pyx because it changed.
[1/1] Cythonizing numba_dpex/dpnp_iface/dpnp_fptr_interface.pyx
running develop
/localdisk/work/$USER/.dpbench/lib/python3.9/site-packages/setuptools/command/easy_install.py:144: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
warnings.warn(
/localdisk/work/$USER/.dpbench/lib/python3.9/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
warnings.warn(
running egg_info
creating numba_dpex.egg-info
writing numba_dpex.egg-info/PKG-INFO
writing dependency_links to numba_dpex.egg-info/dependency_links.txt
writing entry points to numba_dpex.egg-info/entry_points.txt
writing requirements to numba_dpex.egg-info/requires.txt
writing top-level names to numba_dpex.egg-info/top_level.txt
writing manifest file 'numba_dpex.egg-info/SOURCES.txt'
reading manifest file 'numba_dpex.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
adding license file 'LICENSE'
writing manifest file 'numba_dpex.egg-info/SOURCES.txt'
running build_ext
building 'numba_dpex._usm_allocators_ext' extension
creating build
creating build/temp.linux-x86_64-3.9
creating build/temp.linux-x86_64-3.9/numba_dpex
creating build/temp.linux-x86_64-3.9/numba_dpex/dpctl_iface
gcc -pthread -B /localdisk/work/$USER/.dpbench/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -O2 -Wall -fPIC -O2 -isystem /localdisk/work/$USER/.dpbench/include -I/localdisk/work/$USER/.dpbench/include -fPIC -O2 -isystem /localdisk/work/$USER/.dpbench/include -fPIC -I/localdisk/work/$USER/.dpbench/lib/python3.9/site-packages -I/localdisk/work/$USER/.dpbench/lib/python3.9/site-packages/dpctl/include -I/localdisk/work/$USER/.dpbench/include/python3.9 -c numba_dpex/dpctl_iface/usm_allocators_ext.c -o build/temp.linux-x86_64-3.9/numba_dpex/dpctl_iface/usm_allocators_ext.o
In file included from /localdisk/work/$USER/.dpbench/lib/python3.9/site-packages/dpctl/include/dpctl_sycl_interface.h:35,
from numba_dpex/dpctl_iface/usm_allocators_ext.c:35:
/localdisk/work/$USER/.dpbench/lib/python3.9/site-packages/dpctl/include/syclinterface/dpctl_sycl_device_interface.h:253:1: error: expected identifier or ‘(’ before ‘[’ token
253 | [[deprecated("Use DPCTLDevice_WorkItemSizes3d instead")]] DPCTL_API
| ^
error: command '/usr/bin/gcc' failed with exit code 1
However, if I follow the instructions from the numba-dpex page, like this:
conda create -n dpbench-dev -c /opt/intel/oneapi/conda_channel python=3.9 dpctl dpnp numba spirv-tools llvm-spirv llvmdev cython pytest
conda activate dpbench-dev
python setup.py develop
It works.
I think the compilation instructions in the dpbench readme need to be changed.
The PCA and RAMBO benchmarks have to be migrated to the new infrastructure.
It would be nice to use Python's logger as opposed to print statements for outputting benchmarking information.
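A minimal sketch of what that could look like (the logger name and message format are illustrative, not dpbench's actual setup):

```python
import logging

# Hypothetical module-level logger for the dpbench infrastructure.
logger = logging.getLogger("dpbench")
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s: %(message)s",
)

# Instead of: print(f"{bench} ({impl}): {t:.6f} s")
bench, impl, t = "black_scholes", "numba_dpex_k", 0.012  # example values
logger.info("%s (%s): %.6f s", bench, impl, t)
# Verbose details can go to DEBUG so they can be filtered out:
logger.debug("problem size: %s", "M")
```

A benefit of this design is that verbosity can be controlled per run via log levels instead of editing print statements.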
The need to keep benchmark implementation argument names the same across benchmarks is odd. It stems from passing them using keyword arguments. In reality these are all positional. Perhaps we could add a positional_args entry to the benchmark config which specifies the list of named arguments. We can then pop these from the dictionary and call the implementation as impl_fn(*posargs, **kwargs).
Otherwise anybody adding new benchmarks will invariably run into the issue of argument name mismatch.
Originally posted by @oleksandr-pavlyk in #115 (comment)
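A sketch of how the proposed call could work (the helper name and config shape are illustrative, not dpbench's actual API):

```python
def call_impl(impl_fn, args: dict, positional_args: list[str]):
    """Pop the configured positional arguments out of the arg dict,
    in order, and pass whatever remains as keyword arguments."""
    kwargs = dict(args)
    posargs = [kwargs.pop(name) for name in positional_args]
    return impl_fn(*posargs, **kwargs)

# Example: an implementation whose first two parameters are positional,
# with names that differ from other benchmarks.
def impl_fn(a, b, scale=1.0):
    return (a + b) * scale

print(call_impl(impl_fn, {"x": 1, "y": 2, "scale": 3.0}, ["x", "y"]))  # 9.0
```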
Most of the code samples in the numba folder are missing code comments. Can you please add detailed code comments where necessary, so the examples are easier to understand?
To make it easy for users of dpbench to profile performance, the VTune and Advisor tooling should be provided directly via a submodule in dpbench. Users should not have to invoke separate scripts.
The deep copy can be removed, since validation is done only on the output array.
v_to_c_i = votes_to_classes[i] will work now.
Originally posted by @mingjie-intel in #111 (comment)
New developers and users need to understand how to get started. This should be documented, and helper scripts could be provided.
The dpex_k implementation for DBSCAN currently fails execution with a rather cryptic message stating:
"Datatypes of array passed to @numba_dpex.kernel has to be the same. Passed datatypes: "...
The error message needs fixing and I am working on a dpex PR to address that.
What the error message is really saying is that the implementation does not follow the "compute follows data" programming model. Under compute follows data, the execution queue for the kernel should be discoverable from the input array arguments.
There are two problems with the current implementation.
In the dpex_k implementation, the arguments to the kernel are n_samples, min_pts, assignments, sizes, and indices_list. Of these, sizes and indices_list are not allocated in the initialize function and therefore are never copied to usm_ndarray. The kernel inputs are a mix of numpy.ndarray and dpctl.tensor.usm_ndarray, so there is no way to infer the execution queue using compute follows data; hence the dpex error. To fix the issue, the creation of these two arrays needs to be moved into the initialize call.
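For illustration, a minimal sketch of the fix (variable names come from the issue; the shapes and dtypes are assumptions):

```python
import numpy as np
import dpctl.tensor as dpt

n_samples = 1024  # assumed problem size

# Allocate in initialize() and copy to USM memory, so that all kernel
# arguments are usm_ndarray objects on the same queue and the execution
# queue can be inferred under compute follows data.
sizes = dpt.asarray(np.zeros(n_samples, dtype=np.int64))
indices_list = dpt.asarray(np.zeros(n_samples, dtype=np.int64))
```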
Fixing the first problem will expose the next issue, which is hidden by the first failure right now. Only the get_neighborhood function is a kernel. The compute_cluster function is an njit function. Currently, njit functions cannot consume usm_ndarray, so to make it work we will have to copy data back to the host after the get_neighborhood call. Doing so will mess up the timing measurement. Moreover, implementing dbscan_dpex_k in this fashion is inaccurate in terms of comparing the kernel implementation with other implementations, as the whole benchmark never runs on a device/GPU. If implemented this way, comparing the timing of the kernel with any other implementation is not an apples-to-apples comparison. We either need to implement compute_cluster as a kernel or remove the dbscan_dpex_k module.
Suggestions for improving and consolidating workload execution, time measurement, and testing:
With the larger problem size of "M", the gpairs implementation using numba_dpex_k no longer passes validation.
The new npbench-based infrastructure now makes it possible to move all npbench benchmarks into dpbench. The following steps are to be completed:
- A benchmarks/npbench directory should be created under dpbench. Only the numba and numpy versions of the npbench benchmarks should be moved from npbench upstream.
- A bench_info entry for all the npbench tests should be added to dpbench/configs/bench_info/npbench.

Now the code is organized in the following tree:
- API
  - WL
    - HW
Example:
- dpnp
  - blackscholes
    - CPU
I propose to reorganize the tree in the following way:
- WL
  - API
    - HW
Example:
- workloads
  - blackscholes
    - dpnp
      - CPU
I think it is more natural to start from the workload and then go to the details of the implementation.
Moreover, we would have a single folder per workload where we can place a README with an explanation of the algorithm.
It could be implemented by first creating symlinks, assessing whether the new layout is more convenient, and only later copying the code.
A dpcpp version of a dpbench benchmark should be part of the dpbench Python package:
#75 #72 touch upon aspects of these requirements (possibly going into a solution space).
Maybe we also need a basic ctypes wrapper for the dpcpp implementation, record the execution time in the dpcpp program itself, and then report it back to Python via a ctypes callback.
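A sketch of what such a wrapper could look like (the library name, function name, and C signature are all hypothetical):

```python
import ctypes

# Hypothetical shared library built from the dpcpp benchmark sources.
lib = ctypes.CDLL("./libblack_scholes_sycl.so")

# Assumed C signature: void run_benchmark(void (*report)(double));
REPORT_CB = ctypes.CFUNCTYPE(None, ctypes.c_double)

timings = []

@REPORT_CB
def on_time_reported(seconds):
    # Invoked from the dpcpp side with the measured execution time,
    # so the timing excludes Python call overhead.
    timings.append(seconds)

lib.run_benchmark.argtypes = [REPORT_CB]
lib.run_benchmark.restype = None
lib.run_benchmark(on_time_reported)
print(f"device execution time: {timings[-1]:.6f} s")
```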
It would be nice to have an automate_run.py option to return the list of workloads, but it is out of scope for this PR.
Originally posted by @Hardcode84 in #43 (comment)
Right now we generate a new SQLite3 database for each run. There should be a way to update an existing database so that trend analysis can be done.
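A minimal sketch of the idea (the file name and schema are illustrative, not dpbench's actual data model):

```python
import sqlite3
from datetime import datetime

# Reuse one database across runs: create the table only if it does not
# already exist, and append a timestamped row per result.
conn = sqlite3.connect("dpbench_results.db")  # hypothetical file name
conn.execute(
    """CREATE TABLE IF NOT EXISTS results (
           as_of TEXT, benchmark TEXT, problem_size TEXT,
           implementation TEXT, time_s REAL)"""
)
conn.execute(
    "INSERT INTO results VALUES (?, ?, ?, ?, ?)",
    (datetime.now().isoformat(), "black_scholes", "M", "numba_dpex_k", 0.012),
)
conn.commit()
conn.close()
```

Appending rows keyed by timestamp is what makes trend analysis across runs possible.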
To evaluate the performance of Numba workloads against corresponding dpcpp implementations we need to ensure that both implementations read the same input data and have a mechanism to validate the results of the computation automatically. The following steps need to be performed to add the common infrastructure.
A dimension is not taken into account in the data generation for l2_distance:
https://github.com/IntelPython/dpbench/blob/main/native_dpcpp/l2_distance/GPU/data_gen.cpp#L27
We can build a small driver function that uses dpctl.program to submit DPC++-generated SPIR-V kernels to a device.
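A sketch of such a driver (the SPIR-V file name and kernel name are hypothetical; the exact argument-passing rules depend on the dpctl version):

```python
import dpctl
import dpctl.memory as dpmem
import dpctl.program as dppr

q = dpctl.SyclQueue("gpu")

# Load a SPIR-V module produced by the DPC++ compiler (hypothetical file).
with open("l2_distance.spv", "rb") as f:
    spirv = f.read()

prog = dppr.create_program_from_spirv(q, spirv)
krn = prog.get_sycl_kernel("l2_distance_kernel")  # hypothetical kernel name

n = 1024
buf = dpmem.MemoryUSMDevice(n * 4, queue=q)  # n float32 elements
# Submit with a 1-D global range of n work items.
ev = q.submit(krn, [buf], [n])
ev.wait()
```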
Dpbench prints the implementation summary for the available benchmarks by calling the dpbench.infrastructure.datamodel.print_implementation_summary(conn=conn) function. The function gets called inside dpbench.runner.run_benchmarks.
The output of the function needs to be improved by printing some additional information.
| Implementation name | Description |
|---|---|
| numba_dpex_k | Dpex kernel implementation |
| numba_dpex_n | Dpex numpy-based implementation |
Py-cpuinfo can be used for numpy and numba; for dpex, dpnp, and dpcpp we can use dpctl.device_info.

| Implementation name | Platform | Parallel | Number of threads |
|---|---|---|---|
| numba_dpex_k | Intel CPU | Yes | 4 |
| numba_dpex_n | Intel CPU | Yes | 4 |
| numpy | Intel CPU | No | 4 |
Thus, the output of the print_implementation_summary function should be something like:

| Implementation name | Description |
|---|---|
| numba_dpex_k | Dpex kernel implementation |
| numba_dpex_n | Dpex numpy-based implementation |

| Implementation name | Platform | Parallel | Number of threads |
|---|---|---|---|
| numba_dpex_k | Intel CPU | Yes | 4 |
| numba_dpex_n | Intel CPU | Yes | 4 |
| numpy | Intel CPU | No | 4 |

| As_of | benchmark | problem_size | numba_dpex_k | numba_dpex_p |
|---|---|---|---|---|
| 09.29.2022_23.00.12 | black_scholes | 50MB | Success | Success |
| 09.29.2022_23.00.12 | dbscan | 50MB | Success | Success |
The initialization of the input data and the validation of the output data for the dpcpp versions of the benchmarks are done via data files generated by numpy and C++, respectively. We should opt for a better design that keeps the validation and initialization processes quick and avoids file I/O overheads:
Proposed design.
Along with directly calling the DPC++ implementations of the benchmarks, we should also add a Python wrapper for all such DPC++ versions. Once we have the wrappers, the native benchmarks can be evaluated via Python. Another benefit is catching any overhead introduced by the dpctl bindings.
An initial attempt to implement a native wrapper for the black-scholes benchmark is in #61.
@oleksandr-pavlyk I think it would be a good idea to vendor onetrace with dpbench and capture the device information for the warmup run. What do you think? Otherwise the question always remains of how to verify whether offload actually worked.
The validation step in the GitHub Action is disabled for now, as the database integration is broken. Once a new database design is added, the step should be reinstated.
Originally posted by @mingjie-intel in #115 (comment)
Validation for the gpairs benchmark implementation using Numba fails on the current main.
knn_numba_dpex_k.py has been added to the new benchmarks.
Array arguments to the workloads are reset between repeated executions of a workload. But currently, only those args that are output by the workload are reset. This causes a validation error in some workloads when certain array args are not reset. The infrastructure should instead reset all array args.
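A sketch of resetting all array arguments between runs (assuming host copies of the initial values are kept; names are illustrative):

```python
import numpy as np

def snapshot_args(args: dict):
    """Keep a host copy of every array argument before the first run."""
    return {k: np.copy(v) for k, v in args.items() if isinstance(v, np.ndarray)}

def reset_args(args: dict, snapshot: dict):
    """Restore every array argument in place, not just the workload's
    outputs, so repeated executions start from identical inputs."""
    for name, saved in snapshot.items():
        np.copyto(args[name], saved)
```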
dpbench stores the problem sizes in the results table as "S", "M", "L", as defined in the JSON files in the bench_info directory. The problem size is reported in the implementation summary table in that form.
However, the acronyms (S/M/L) are hard to understand and do not give a clear idea of the actual data footprint of a benchmark. The data footprint is needed to understand whether a benchmark is running out of main memory, HBM, L3, etc.
Instead, dpbench should calculate the actual data size as the sum, over each input and output array argument, of (number of elements * element size). The data size should be stored in the database and reported back.
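A sketch of the calculation (assuming array arguments expose size and itemsize, as both numpy.ndarray and dpctl.tensor.usm_ndarray do):

```python
import numpy as np

def data_footprint_bytes(arrays):
    """Sum of (number of elements * element size) over all input and
    output array arguments of a benchmark."""
    return sum(a.size * a.itemsize for a in arrays)

# Example: two 2**20-element float64 arrays -> 16 MiB.
args = [np.empty(2**20, dtype=np.float64), np.empty(2**20, dtype=np.float64)]
print(data_footprint_bytes(args))  # 16777216
```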
Due to this issue: IntelPython/numba-dpex#782
pairwise_distance_numba_dpex_n will be removed for now. It will be added back once dpex#782 is fixed.
Occasionally dpbench.run_benchmarks() will freeze up before completing the execution. The freeze can happen arbitrarily in different benchmarks, and I have personally seen it occur in knn, kmeans, l2, and gpairs.
@oleksandr-pavlyk helped narrow down the problem to a deadlock in a DPCTLQueue_Wait. The issue needs proper investigation and a resolution.
At the moment, gen_rand_data runs on the CPU only, due to a compilation error. The compiler complains that it cannot find a matching rand() function when the numpy call is replaced with the dpnp function:
Failed in dpex_nopython mode pipeline (step: nopython frontend)
No implementation of function Function(<built-in method rand of numpy.random.mtrand.RandomState object at 0x7f2767ccd640>) found for signature:rand()
There are 2 candidate implementations:
- Of which 2 did not match due to:
Overload in function 'rand': File: numba/cpython/randomimpl.py: Line 1328.
With argument(s): '()':
Rejected as the implementation raised a specific error:
TypingError: Failed in dpex_nopython mode pipeline (step: nopython frontend)
No implementation of function Function(<class 'numba_dpex.dpnp_iface.stubs.dpnp.random'>) found for signature:>>> random()
Running on the CPU reduces the overall performance of the numba-dpex implementation. This ticket is to offload gen_rand_data.
The arguments to a dppy-kernel are flattened out. We need to convert C++ arrays into the flattened representation prior to submitting the numba-dppy-generated SPIR-V kernel to a device.
It would be good to have a script that generates the driver function to call a numba-dppy-generated kernel.
Currently, in the new framework, results are validated by comparing two frameworks: numpy against numba or numba-dpex. It would be better to ensure that only the results or outputs of the tests are compared.
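A sketch of output-only validation (the function name is illustrative; dpbench's actual validator may differ):

```python
import numpy as np

def validate(ref_outputs, test_outputs, rtol=1e-5, atol=1e-8):
    """Compare only the output arrays of two implementations,
    ignoring inputs and any intermediate state."""
    return all(
        np.allclose(ref, test, rtol=rtol, atol=atol)
        for ref, test in zip(ref_outputs, test_outputs)
    )
```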
dppy.kernel version of Kmeans on ATS
The dpex_n implementation should not use prange and should use only numpy vector expressions and numpy calls. Either the implementation needs to be updated or it should be removed.
This benchmark suite uses multiple runtimes (Numba, CUDA) to schedule and execute computations on multiple types of devices. It requires some improvement for users to be able to understand the version compatibility of the runtimes. Suggested improvements:
The goal is to allow users to understand whether they are running the benchmark with compatible runtime dependencies.
The reference implementation for some benchmarks is in pure Python, or even in some other library such as sklearn in the case of DBSCAN. Yet all the reference implementation files have the “_numpy” suffix. The naming convention is followed because we inherited npbench’s infrastructure. However, adding the “_numpy” suffix to non-NumPy implementations can be confusing.
I propose updating the infrastructure to also support a “_baseline” framework. If no NumPy version is available, the baseline version is used as the reference for validation and performance comparison.
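A sketch of the selection logic (the dict layout and suffix names are assumptions, not dpbench's actual API):

```python
def get_reference_impl(impls):
    """Pick the implementation used as the reference for validation and
    performance comparison: prefer "numpy", else fall back to "baseline".

    impls maps a framework suffix (e.g. "numpy", "baseline", "dpnp")
    to the corresponding implementation callable.
    """
    for suffix in ("numpy", "baseline"):
        if suffix in impls:
            return impls[suffix]
    raise ValueError("no reference implementation available")
```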
The numba-dpex and dpnp versions of the benchmark need to be moved from their previous location to the dpbench module.
The auto-fallback feature of numba-dpex can suppress execution failure and lead to incorrect assumptions regarding benchmark execution. The feature should be disabled by explicitly setting the config inside NumbaDpexBenchmark.
The numba_dpex test cases print a warning message:
Unable to join threads to shut down before fork(). This can break multithreading in child processes.
The warning seems to be related only to tbb; setting the numba threading layer to omp or workqueue avoids it:
export NUMBA_THREADING_LAYER=workqueue/omp; python -c "import dpbench; dpbench.run_benchmarks()"
An example dpbench log is attached.
dpbench_warning.log
To evaluate the performance of Numba workloads against corresponding dpcpp implementations we need to ensure that both implementations read the same input data and have a mechanism to validate the results of the computation automatically. The following steps need to be performed to add the common infrastructure.
List of workloads:
Port dpnp implementations from main_old to main.
The following benchmarks are available in main_old:
Note: The l2_distance benchmark was rewritten in main and the existing dpnp implementation is no longer applicable.
run_benchmark runs a benchmark for all frameworks.
The L2 distance numba_dpex_k implementation fails the data validation step on CPU devices. Incidentally, the reference dpcpp implementation also fails data validation. The issue may point to the use of atomic references. It needs further investigation, and a fix can be migrated to numba-dpex or dpcpp once we know the root cause.
We will target implementing all Rodinia benchmarks using dpex and dpnp. As a starting point, a folder needs to be created under benchmarks to store the two Rodinia benchmarks that are already implemented for dpex.
| # | workload | kernel API Runs, dppy | kernel API Passes, dppy | Njit, dppy | # kernels (+ jit functions) | kernel, dpcomp | Njit, dpcomp | Access pattern / Dwarf |
|---|---|---|---|---|---|---|---|---|
| 1 | blackscholes | Yes | Yes | | 1 | no lower | Yes | |
| 2 | dbscan | Yes | Yes | | 1 (+1) | cannot offload compute_clusters | Yes | |
| 3 | kmeans | Yes | Yes | | 5 | Yes | Yes | |
| 4 | knn | Yes | Yes | No impl | 1 | Yes | No impl | |
| 5 | l2_distance | Yes | False | | 1 | Yes | Yes | Reduction |
| 6 | pairwise_distance | Yes | Yes | | 1 | Yes | Yes | |
| 7 | pca | Yes | No test | | 1 (+2) | No test | | |
| 8 | rambo | Yes | Yes | | 1 (+5) | no lower for random | Yes | |
| 9 | gpairs | Yes | Yes | No impl | 1 | Yes | No impl | |
| 10 | pathfinder | disabled | N/A | N/A | N/A | N/A | N/A | N/A |
| 11 | gaussian_elim | disabled | N/A | N/A | N/A | N/A | N/A | N/A |
Types of benchmarks
As benchmark directories are now Python modules, any new benchmark requires an __init__.py file. The requirement is an impediment to adding npbench benchmarks into dpbench, as too much manual work would be required.
The __init__.py can instead be generated mechanically by going through the directory and looking at the configuration JSON files.
We need to add such a script, to be run manually after new benchmarks are added; a sketch is given below.
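A sketch of such a generator (the directory layout and JSON keys are assumptions based on the paths mentioned in these issues):

```python
import json
from pathlib import Path

BENCH_ROOT = Path("dpbench/benchmarks")            # assumed layout
CONFIG_ROOT = Path("dpbench/configs/bench_info")   # from the issues above

for cfg in sorted(CONFIG_ROOT.rglob("*.json")):
    info = json.loads(cfg.read_text())
    # Assume the config names the benchmark module; the key is hypothetical.
    module = info.get("benchmark", {}).get("module_name", cfg.stem)
    init = BENCH_ROOT / module / "__init__.py"
    if init.parent.is_dir() and not init.exists():
        init.write_text(f'"""Auto-generated for the {module} benchmark."""\n')
        print("generated", init)
```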