beehive-lab / tornadovm Goto Github PK

TornadoVM: A practical and efficient heterogeneous programming framework for managed languages

License: Apache License 2.0

Makefile 0.05% Shell 0.12% Python 1.55% Java 95.34% C 1.09% CMake 0.10% C++ 1.68% JavaScript 0.04% Ruby 0.01% Batchfile 0.02%

opencl java multi-core graalvm gpus gpu-computing spirv cuda gpu-acceleration levelzero

tornadovm's Introduction

TornadoVM

TornadoVM is a plug-in to OpenJDK and GraalVM that allows programmers to automatically run Java programs on heterogeneous hardware. TornadoVM targets OpenCL, PTX and SPIR-V compatible devices which include multi-core CPUs, dedicated GPUs (Intel, NVIDIA, AMD), integrated GPUs (Intel HD Graphics and ARM Mali), and FPGAs (Intel and Xilinx).

TornadoVM has three backends that generate OpenCL C, NVIDIA CUDA PTX assembly, and SPIR-V binary. Developers can choose which backends to install and run.

Website: tornadovm.org

Documentation: https://tornadovm.readthedocs.io/en/latest/

For a quick introduction please read the following FAQ.

Latest Release: TornadoVM 1.0.6 - 27/06/2024 : See CHANGELOG.

1. Installation

In Linux and macOS, TornadoVM can be installed automatically with the installation script. For example:

$ ./bin/tornadovm-installer
usage: tornadovm-installer [-h] [--version] [--jdk JDK] [--backend BACKEND] [--listJDKs] [--javaHome JAVAHOME]

TornadoVM Installer Tool. It will install all software dependencies except the GPU/FPGA drivers

optional arguments:
  -h, --help           show this help message and exit
  --version            Print version of TornadoVM
  --jdk JDK            Select one of the supported JDKs. Use --listJDKs option to see all supported ones.
  --backend BACKEND    Select the backend to install: { opencl, ptx, spirv }
  --listJDKs           List all JDK supported versions
  --javaHome JAVAHOME  Use a JDK from a user directory

NOTE Select the desired backend:

opencl: Enables the OpenCL backend (requires OpenCL drivers)
ptx: Enables the PTX backend (requires NVIDIA CUDA drivers)
spirv: Enables the SPIRV backend (requires Intel Level Zero drivers)

Example of installation:

# Install the OpenCL backend with OpenJDK 21
$ ./bin/tornadovm-installer --jdk jdk21 --backend opencl

# It is also possible to combine different backends:
$ ./bin/tornadovm-installer --jdk jdk21 --backend opencl,spirv,ptx

Alternatively, TornadoVM can be installed either manually from source or by using Docker.

If you are planning to use Docker with TornadoVM on GPUs, you can also follow these guidelines.

You can also run TornadoVM on Amazon AWS CPUs, GPUs, and FPGAs following the instructions here.

2. Usage Instructions

TornadoVM is currently being used to accelerate machine learning and deep learning applications, computer vision, physics simulations, financial applications, computational photography, and signal processing.

Featured use-cases:

kfusion-tornadovm: Java application for accelerating a computer-vision application using the Tornado-APIs to run on discrete and integrated GPUs.
Java Ray-Tracer: Java application accelerated with TornadoVM for real-time ray-tracing.

We also have a set of examples that includes NBody, DFT, KMeans computation and matrix computations.

Additional Information

3. Programming Model

TornadoVM exposes to the programmer task-level, data-level and pipeline-level parallelism via a light Application Programming Interface (API). In addition, TornadoVM uses single-source property, in which the code to be accelerated and the host code live in the same Java program.

Compute-kernels in TornadoVM can be programmed using two different approaches (APIs):

a) Loop Parallel API

Compute kernels are written in a sequential form (tasks programmed for a single thread execution). To express parallelism, TornadoVM exposes two annotations that can be used in loops and parameters: a) @Parallel for annotating parallel loops; and b) @Reduce for annotating parameters used in reductions.

The following code snippet shows a full example to accelerate Matrix-Multiplication using TornadoVM and the loop-parallel API:

public class Compute {
    private static void mxmLoop(Matrix2DFloat A, Matrix2DFloat B, Matrix2DFloat C, final int size) {
        for (@Parallel int i = 0; i < size; i++) {
            for (@Parallel int j = 0; j < size; j++) {
                float sum = 0.0f;
                for (int k = 0; k < size; k++) {
                    sum += A.get(i, k) * B.get(k, j);
                }
                C.set(i, j, sum);
            }
        }
    }

    public void run(Matrix2DFloat A, Matrix2DFloat B, Matrix2DFloat C, final int size) {

        // Create a task-graph with multiple tasks. Each task points to an exising Java method 
        // that can be accelerated on a GPU/FPGA
        TaskGraph taskGraph = new TaskGraph("myCompute")
                .transferToDevice(DataTransferMode.FIRST_EXECUTION, A, B) // Transfer data from host to device only in the first execution
                .task("mxm", Compute::mxmLoop, A, B, C, size)             // Each task points to an existing Java method
                .transferToHost(DataTransferMode.EVERY_EXECUTION, C);     // Transfer data from device to host
        
        // Create an immutable task-graph
        ImmutableTaskGraph immutableTaskGraph = taskGraph.snaphot();

        // Create an execution plan from an immutable task-graph
        try (TornadoExecutionPlan executionPlan = new TornadoExecutionPlan(immutableTaskGraph)) {

            // Run the execution plan on the default device
            TorandoExecutionResult executionResult = executionPlan.execute();

        } catch (TornadoExecutionPlanException e) {
            // handle exception 
            // ...
        }
    }
}

b) Kernel API

Another way to express compute-kernels in TornadoVM is via the Kernel API. To do so, TornadoVM exposes the KernelContext data structure, in which the application can directly access the thread-id, allocate memory in local memory (shared memory on NVIDIA devices), and insert barriers. This model is similar to programming compute-kernels in SYCL, oneAPI, OpenCL and CUDA. Therefore, this API is more suitable for GPU/FPGA expert programmers that want more control or want to port existing CUDA/OpenCL compute kernels into TornadoVM.

The following code-snippet shows the Matrix Multiplication example using the kernel-parallel API:

public class Compute {
    private static void mxmKernel(KernelContext context, Matrix2DFloat A, Matrix2DFloat B, Matrix2DFloat C, final int size) {
        int idx = context.globalIdx
        int jdx = context.globalIdy;
        float sum = 0;
        for (int k = 0; k < size; k++) {
            sum += A.get(idx, k) * B.get(k, jdx);
        }
        C.set(idx, jdx, sum);
    }

    public void run(Matrix2DFloat A, Matrix2DFloat B, Matrix2DFloat C, final int size) {
        // When using the kernel-parallel API, we need to create a Grid and a Worker
        WorkerGrid workerGrid = new WorkerGrid2D(size, size);    // Create a 2D Worker
        GridScheduler gridScheduler = new GridScheduler("myCompute.mxm", workerGrid);  // Attach the worker to the Grid
        KernelContext context = new KernelContext();             // Create a context
        workerGrid.setLocalWork(16, 16, 1);                      // Set the local-group size
  
        TaskGraph taskGraph = new TaskGraph("myCompute")
                .transferToDevice(DataTransferMode.FIRST_EXECUTION, A, B) // Transfer data from host to device only in the first execution
                .task("mxm", Compute::mxmKernel, context, A, B, C, size)   // Each task points to an existing Java method
                .transferToHost(DataTransferMode.EVERY_EXECUTION, C);     // Transfer data from device to host

        // Create an immutable task-graph
        ImmutableTaskGraph immutableTaskGraph = taskGraph.snapshot();

        // Create an execution plan from an immutable task-graph
        try (TornadoExecutionPlan executionPlan = new TornadoExecutionPlan(immutableTaskGraph)) {
            // Run the execution plan on the default device
            // Execute the execution plan
            TorandoExecutionResult executionResult = executionPlan
                        .withGridScheduler(gridScheduler)
                        .execute();
        } catch (TornadoExecutionPlanException e) {
            // handle exception 
            // ...
        }    
    }
}

Additionally, the two modes of expressing parallelism (kernel and loop parallelization) can be combined in the same task graph object.

4. Dynamic Reconfiguration

Dynamic reconfiguration is the ability of TornadoVM to perform live task migration between devices, which means that TornadoVM decides where to execute the code to increase performance (if possible). In other words, TornadoVM switches devices if it can detect that a specific device can yield better performance (compared to another).

With the task-migration, the TornadoVM's approach is to only switch device if it detects an application can be executed faster than the CPU execution using the code compiled by C2 or Graal-JIT, otherwise it will stay on the CPU. So TornadoVM can be seen as a complement to C2 and Graal JIT compilers. This is because there is no single hardware to best execute all workloads efficiently. GPUs are very good at exploiting SIMD applications, and FPGAs are very good at exploiting pipeline applications. If your applications follow those models, TornadoVM will likely select heterogeneous hardware. Otherwise, it will stay on the CPU using the default compilers (C2 or Graal).

To use the dynamic reconfiguration, you can execute using TornadoVM policies. For example:

// TornadoVM will execute the code in the best accelerator.
executionPlan.withDynamicReconfiguration(Policy.PERFORMANCE, DRMode.PARALLEL)
             .execute();

Further details and instructions on how to enable this feature can be found here.

Dynamic reconfiguration: https://dl.acm.org/doi/10.1145/3313808.3313819

5. How to Use TornadoVM in your Projects?

To use TornadoVM, you need two components:

a) The TornadoVM jar file with the API. The API is licensed as GPLV2 with Classpath Exception. b) The core libraries of TornadoVM along with the dynamic library for the driver code (.so files for OpenCL, PTX and/or SPIRV/Level Zero).

You can import the TornadoVM API by setting this the following dependency in the Maven pom.xml file:

<repositories>
    <repository>
        <id>universityOfManchester-graal</id>
        <url>https://raw.githubusercontent.com/beehive-lab/tornado/maven-tornadovm</url>
    </repository>
</repositories>

<dependencies>
<dependency>
    <groupId>tornado</groupId>
    <artifactId>tornado-api</artifactId>
    <version>1.0.6</version>
</dependency>
<dependency>
    <groupId>tornado</groupId>
    <artifactId>tornado-matrices</artifactId>
    <version>1.0.6</version>
</dependency>
</dependencies>

To run TornadoVM, you need to either install the TornadoVM extension for GraalVM/OpenJDK, or run with our Docker images.

6. Additional Resources

Here you can find videos, presentations, tech-articles and artefacts describing TornadoVM, and how to use it.

7. Academic Publications

If you are using TornadoVM >= 0.2 (which includes the Dynamic Reconfiguration, the initial FPGA support and CPU/GPU reductions), please use the following citation:

@inproceedings{Fumero:DARHH:VEE:2019,
 author = {Fumero, Juan and Papadimitriou, Michail and Zakkak, Foivos S. and Xekalaki, Maria and Clarkson, James and Kotselidis, Christos},
 title = {{Dynamic Application Reconfiguration on Heterogeneous Hardware.}},
 booktitle = {Proceedings of the 15th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments},
 series = {VEE '19},
 year = {2019},
 doi = {10.1145/3313808.3313819},
 publisher = {Association for Computing Machinery}
}

If you are using Tornado 0.1 (Initial release), please use the following citation in your work.

@inproceedings{Clarkson:2018:EHH:3237009.3237016,
 author = {Clarkson, James and Fumero, Juan and Papadimitriou, Michail and Zakkak, Foivos S. and Xekalaki, Maria and Kotselidis, Christos and Luj\'{a}n, Mikel},
 title = {{Exploiting High-performance Heterogeneous Hardware for Java Programs Using Graal}},
 booktitle = {Proceedings of the 15th International Conference on Managed Languages \& Runtimes},
 series = {ManLang '18},
 year = {2018},
 isbn = {978-1-4503-6424-9},
 location = {Linz, Austria},
 pages = {4:1--4:13},
 articleno = {4},
 numpages = {13},
 url = {http://doi.acm.org/10.1145/3237009.3237016},
 doi = {10.1145/3237009.3237016},
 acmid = {3237016},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {Java, graal, heterogeneous hardware, openCL, virtual machine},
}

Selected publications can be found here.

8. Acknowledgments

This work is partially funded by Intel corporation. In addition, it has been supported by the following EU & UKRI grants (most recent first):

EU Horizon Europe & UKRI AERO 101092850.
EU Horizon Europe & UKRI INCODE 101093069.
EU Horizon Europe & UKRI ENCRYPT 101070670.
EU Horizon Europe & UKRI TANGO 101070052.
EU Horizon 2020 ELEGANT 957286.
EU Horizon 2020 E2Data 780245.
EU Horizon 2020 ACTiCLOUD 732366.

Furthermore, TornadoVM has been supported by the following EPSRC grants:

9. Contributions and Collaborations

We welcome collaborations! Please see how to contribute to the project in the CONTRIBUTING page.

Write your questions and proposals:

Additionally, you can open new proposals on the GitHub discussions page.

Alternatively, you can share a Google document with us.

Collaborations:

For Academic & Industry collaborations, please contact here.

10. TornadoVM Team

Visit our website to meet the team.

11. Licenses

To use TornadoVM, you can link the TornadoVM API to your application which is under Apache 2.

Each Java TornadoVM module is licensed as follows:

Module	License
Tornado-API
Tornado-Runtime	+ CLASSPATH Exception
Tornado-Assembly
Tornado-Drivers	+ CLASSPATH Exception
Tornado-Drivers-OpenCL-Headers
Tornado-scripts
Tornado-Annotation
Tornado-Unittests
Tornado-Benchmarks
Tornado-Examples
Tornado-Matrices

tornadovm's People

Contributors

Stargazers

Watchers

Forkers

gstrato mikepapadim jjfumero diffblue-benchmarks e2data wileylou chriscummins achilleasatha guanjibaopingan cybernetics mbrukman jew-gang neoxharsh moneytech andromem tpterovtt zeta1999 stratika mgiammarco stiansols yechen3 dalexandrov ashishgupta-mca xcodeuuuuu66699 5gapp wiltonlazary diegopacheco luisa-poeschl badalsarkar kostasbitsakos zoltanvadasz yijunwu rossonet neocdtv grfrost rishi-s8 ncppd vsilaev sudonone sshyran gigiblender elegant-h2020 ilibx mrcodechef lhongjum fowlerlee blackjyn wen501271303 hooji sednai chenyu2016 mfkiwl kimjs2939 laplacekorea javalibrary m-yadav007 cyberflamego cxz hellblazer metavai lj12306 mirekkrenc exiahan mmaywald05 kabutz aaronms1 icodein yuenxq 188xuhe reloadbrain pankajshet aero-project-eu xcleancode stjordanis fkellner taisz syswu maxstreese jnorthrup iotamudelta shalireza sirywell gogo2 org-tekeli-borisp tango-eu-project mairooni gauravssnl anietieakpan clownsw murodin conmi-whitejoker jbhateja shahinsharifi otabuzzman acproject justinsanta wilson226q11ee electrostat-lab ayazhassan coldav

tornadovm's Issues

Segmentation fault for OOO execution on Intel CPU Windows 10

Run Tornado 0.8 on Windows 10 with OOO-enabled on Intel CPU:

$ tornado --printBytecodes --debug -Dtornado.ooo-execution.enable=true -Ds0.t0.device=0:1 uk.ac.manchester.tornado.examples.compute.MatrixMultiplication2D 128
Computing MxM of 128x128
task info: s0.t0
        platform          : Intel(R) OpenCL
        device            : Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz CL_DEVICE_TYPE_CPU (available)
        dims              : 2
        global work offset: [0, 0]
        global work size  : [12, 1]
        local  work size  : null

/d/HUPLOAD/0.8/TornadoVM/bin/bin/tornado: line 269:  1367 Segmentation fault      ${JAVA_CMD} ${JAVA_FLAGS} $@

This is reproducible for any example and test.

OS: Windows 10 Pro (20H2 19042.746)
OpenCL: 2.0 (Intel CPU Runtime for OpenCL 21.191)
TornadoVM 0.8 release

As a side note, I have NVIDIA CUDA installed as well, so there are 2 devices on different platforms. Without OOO enabled (default) TornadoVM works fine on either Intel CPU or NVIDIA GTX GPU.

Handling char Java primitives in kernel.

Describe the bug
It looks like char Java primitive type is not handled well in kernels, and the higher byte is set to some random value.
Discovered in test uk.ac.manchester.tornado.unittests.arrays.TestArrays#testVectorChars

How To Reproduce
Run arrays test from std. test package:

$ tornado -Dtornado.unittests.verbose=True -Xmx6g -Dtornado.recover.bailout=False  uk.ac.manchester.tornado.unittests.tools.TornadoTestRunner  uk.ac.manchester.tornado.unittests.arrays.TestArrays

In the output under some rare conditions (not always) the following output is shown:

        Running test: testVectorChars            ................  [FAILED]
                \_[REASON] expected:<102> but was:<16486>

        Running test: testVectorChars            ................  [FAILED]
                \_[REASON] expected:<102> but was:<25958>

If you pay attention, 102 = 0x66, but 16486 = 0x4066 and 25958=0x6566.
You may notice the pattern -- the lowest byte in word is set correctly (0x66 corresponds to char 'f' assigned in kernel code), but the highest byte takes some random value.

Interesting, I changed 'f' to '\u3333' in code to re-run the test -- but same error appears periodically. So even if char takes two bytes it's still processed incorrectly in kernel.

The next interesting point is that testVectorCharsMessage always works correctly - but ther char is summed up with int.

Expected behavior
Test should pass and the char value must be assigned correctly.

Computing system setup (please complete the following information):

OS: Windows 10
OpenCL 1.2 (NVIDIA or AMD)
TornadoVM 0.9

Additional context
Tested with both NVIDIA & AMD GPU-s

Generated Kernel is NULL

Hello,
In a fresh git clone, I'm getting the following error when I run "make tests":

Test: class uk.ac.manchester.tornado.unittests.TestHello
	Running test: testHello                  ................  [FAILED] 
		\_[REASON] Task was not executed.
	Running test: testVectorAddition         ................  [FAILED] 
		\_[REASON] [ERROR] Generated Kernel is NULL. 
Please report this issue to https://github.com/beehive-lab/TornadoVM
	Running test: testSimpleCompute          ................  [FAILED] 
		\_[REASON] [ERROR] Generated Kernel is NULL. 
Please report this issue to https://github.com/beehive-lab/TornadoVM
	Running test: testSimpleCompute2         ................  [FAILED] 
		\_[REASON] [ERROR] Generated Kernel is NULL. 
Please report this issue to https://github.com/beehive-lab/TornadoVM
	Running test: testSimpleInOut            ................  [FAILED] 
		\_[REASON] [ERROR] Generated Kernel is NULL. 
Please report this issue to https://github.com/beehive-lab/TornadoVM
Test ran: 5, Failed: 5

And also more similar output for the rest of the unittests.
Also, on a simple test program of mine, adapted from the testHello unittest:

uk.ac.manchester.tornado.api.exceptions.TornadoRuntimeException: [ERROR] Generated Kernel is NULL. 
Please report this issue to https://github.com/beehive-lab/TornadoVM
	at [email protected]/uk.ac.manchester.tornado.drivers.opencl.graal.OCLInstalledCode.checkKernelNotNull(OCLInstalledCode.java:353)
	at [email protected]/uk.ac.manchester.tornado.drivers.opencl.graal.OCLInstalledCode.submitWithoutEvents(OCLInstalledCode.java:359)
	at [email protected]/uk.ac.manchester.tornado.drivers.opencl.graal.OCLInstalledCode.launchWithoutDeps(OCLInstalledCode.java:389)
	at [email protected]/uk.ac.manchester.tornado.runtime.TornadoVM.execute(TornadoVM.java:498)
	at [email protected]/uk.ac.manchester.tornado.runtime.TornadoVM.execute(TornadoVM.java:194)
	at [email protected]/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskSchedule.scheduleInner(TornadoTaskSchedule.java:421)
	at [email protected]/uk.ac.manchester.tornado.runtime.tasks.TornadoTaskSchedule.schedule(TornadoTaskSchedule.java:642)
	at [email protected]/uk.ac.manchester.tornado.api.TaskSchedule.execute(TaskSchedule.java:230)
	at TestHello.testHello(TestHello.java:69)
	at TestHello.main(TestHello.java:33)
Exception in thread "main" java.lang.AssertionError: Task was not executed.
	at [email protected]/org.junit.Assert.fail(Assert.java:88)
	at [email protected]/org.junit.Assert.assertTrue(Assert.java:41)
	at TestHello.testHello(TestHello.java:73)
	at TestHello.main(TestHello.java:33)

'tornado --devices' shows:

Number of Tornado drivers: 1
Total number of devices  : 1
Tornado device=0:0
	Clover -- AMD Radeon RX 5700 XT (NAVI10, DRM 3.36.0, 5.6.13-arch1-1, LLVM 10.0.0)
		Global Memory Size: 8.0 GB
		Local Memory Size: 32.0 KB
		Workgroup Dimensions: 3
		Max WorkGroup Configuration: [256, 256, 256]
		Device OpenCL C version: OpenCL C 1.1

Plans for Windows support?

Hi,

Loving the project - something I've been looking for a long time to learn GPGPU programming. I'm very surprised it took me so long to hear about the work you guys have done.

I have a (long-term) goal to use TornadoVM for development of a videogame and simulations (I'm only a hobbyist for now, nothing commercial) so naturally I'm curious if there's any plans to bring Tornado to Windows since most of my intended audience would be using it.

Are there any known roadblocks for Windows or is it just something that hasn't been looked at yet?

Just to clarify - I only use Windows myself as a secondary OS so this is by no means urgent - just curious on if it's an option for future.

Thanks!

Altering default device not working with reduction tests

It looks like #61 PR conflicts with recent logic for calculating / checking GridInfo by Thanos Stratikopoulos.

I've tested code in PR on multiple devices and it worked ok, however, after merging develop branch with recent changes by Thanos, it stops working if default device is changed via tornado.properties.

Try to run any reduction tests with tornado.driver=0 and tornado.device=1 (i.e. non default, assuming you have at least 2 OpenCL devices) -- all such tests fail now.

Regulate the amount of memory depending on the current target device present in the system for the unittests

Some of the unittests consume a lot of memory (on purpose). Regulate the amount of memory depending on the current target device that is present in the system.

By default, TornadoVM allocates a heap of 1GB on the target device. Therefore, it assumes the device has at least 1GB. Regulate the space to be allocated if the memory available on the target device is less than the default value (e.g., by allocating 512MB).

The device heap is controlled using the following flag:
-Dtornado.heap.allocation=1GB

Return flatten arrays from TornadoVM custom types

Return the internal primitive array to the user for the custom types such as ImageByte2, ImageByte3, etc.

uk.ac.manchester.tornado.drivers.opencl.CLEvent incorrectly uses static field "buffer"

Describe the bug
uk.ac.manchester.tornado.drivers.opencl.CLEvent incorrectly uses static field buffer.

This static field is used from multiple instance methods like readEventTime and getCLStatus.
When multiple threads trying to execute separate TaskSchedule events are conflicting accessing this field. The race leads to clearing buffer right before reading value returned from JNI and code fails with empty buffer error.

The buffer should be a local variable of method rather than static field.

How To Reproduce
Run several Java threads each executing own TaskSchedule.execute()

Expected behavior
Running several separate TaskSchedule-s from different threads should work ok.

Computing system setup (please complete the following information):

OS: any
OpenCL Version 1.2
TornadoVM commit id: develop branch of 0.9 version

Runtime error: Generated Kernel is NULL

Hello TornadoVM team!

I'm in the process of implementing a branch-and-bound algorithm in TornadoVM.
It seems like all my TaskSchedules report that the generated kernel is null, and I haven't found a way to resolve it yet.

The first TaskSchedule is called as follows:
new TaskSchedule("Internal-UB-SV13-Calculation") //
.streamIn(ratios) // .task("taskSV13",ParallelBBModel::loopInternalSV13,internalUBs,ratios,currWeight,currProfit,currSize,capacity) //
.streamOut(internalUBs) //
.executeWithProfiler(Policy.PERFORMANCE);

(Part of) the error message is as follows:
uk.ac.manchester.tornado.api.exceptions.TornadoRuntimeException: [ERROR] Generated Kernel is NULL.
Please report this issue to https://github.com/beehive-lab/TornadoVM
at uk.ac.manchester.tornado.drivers.opencl.graal.OCLInstalledCode.checkKernelNotNull(OCLInstalledCode.java:353)
at uk.ac.manchester.tornado.drivers.opencl.graal.OCLInstalledCode.submitWithoutEvents(OCLInstalledCode.java:359)
at uk.ac.manchester.tornado.drivers.opencl.graal.OCLInstalledCode.launchWithoutDependencies(OCLInstalledCode.java:389)
at uk.ac.manchester.tornado.runtime.TornadoVM.execute(TornadoVM.java:507)
at uk.ac.manchester.tornado.runtime.TornadoVM.execute(TornadoVM.java:197)
at uk.ac.manchester.tornado.runtime.tasks.TornadoTaskSchedule.scheduleInner(TornadoTaskSchedule.java:436)
at uk.ac.manchester.tornado.runtime.tasks.TornadoTaskSchedule.schedule(TornadoTaskSchedule.java:666)
at uk.ac.manchester.tornado.api.TaskSchedule.execute(TaskSchedule.java:230)
at uk.ac.manchester.tornado.runtime.tasks.TornadoTaskSchedule.lambda$runParallelTaskSchedules$1(TornadoTaskSchedule.java:919)
at java.lang.Thread.run(Thread.java:748)
Bailout from LAUNCH Bytecode:
Reason: uk.ac.manchester.tornado.api.exceptions.TornadoRuntimeException: [ERROR] Generated Kernel is NULL.
Please report this issue to https://github.com/beehive-lab/TornadoVM
uk.ac.manchester.tornado.runtime.TornadoVM.execute(TornadoVM.java:517)
uk.ac.manchester.tornado.runtime.TornadoVM.execute(TornadoVM.java:197)
uk.ac.manchester.tornado.runtime.tasks.TornadoTaskSchedule.scheduleInner(TornadoTaskSchedule.java:436)
uk.ac.manchester.tornado.runtime.tasks.TornadoTaskSchedule.schedule(TornadoTaskSchedule.java:666)
uk.ac.manchester.tornado.api.TaskSchedule.execute(TaskSchedule.java:230)
uk.ac.manchester.tornado.runtime.tasks.TornadoTaskSchedule.lambda$runParallelTaskSchedules$1(TornadoTaskSchedule.java:919)
java.lang.Thread.run(Thread.java:748)
task info: XXX0.taskSV13
platform : NVIDIA CUDA
device : GeForce RTX 2060 CL_DEVICE_TYPE_GPU (available)
dims : 1
global work offset: [0]
global work size : [250]
local work size : [250, 1, 1]
uk.ac.manchester.tornado.drivers.opencl> notify error:
uk.ac.manchester.tornado.drivers.opencl> CL_INVALID_KERNEL_ARGS error executing CL_COMMAND_NDRANGE_KERNEL on GeForce RTX 2060 (Device 0).

I do hope you could shed some light on the culprit.

Kind regards,
Victor Schouten

OpenCL script support ?

Hi , is there any high level library/example code that imports existing OpenCL scripts ?
Also for shaders ?
I want to import my old OpenCL Libraries to TornadoVM.
So i can migrate my old C++ / OpenCL apps to TornadoVM easy...

Fix TornadoVM crashes when no OpenCL devices are available

When there are no OpenCL devices available, throw an exception, and inform the user that an OpenCL device is required.

Execute the following command to obtain a full list of OpenCL devices:

$ tornado --devices

Settings from SDK-DIR/etc/tornado.properties are ignored for default platform/device

Settings specified in SDK-DIR/etc/tornado.properties file are ignored.
In particular, setting tornado.platform and tornado.device has no effect.

There is a method OpenCL.defaultDevice that uses these properties, but the method itself is not used anywhere.

TestArrays#testInitParallel on AMD - failed with '[REASON] [ERROR] reset() was called after warmup()'

When running TestArrays#testInitParallel I get '[REASON] [ERROR] reset() was called after warmup()' exception thrown from TornadoVM.compileTaskFromBytecodeToBinary.

In fact, TornadoTestBase.before indeed reset ALL devices, but TornadoVM.warmup validates only one (VM-default ???) device. And if this device is different from SCHEDULE.TASK-device than it leads to an error (task device is changed via s0.t0.device=0:1 command arg).

Port all python scripts from version 2 to 3

Since Python 2.7 is no longer supported, rewrite all python script to use Python 3.X instead of 2.X.

ACCESS_VIOLATION in TestArrays.testAdd and testWarmUp with several OpenCL devices

In my configuration I have 3 OpenCL devices (listed in the same order as in Tornado)

GeForce CTX 1080Ti (GPU)
AMD FirePro W5000 (GPU)
Intel i7 4930K (CPU)

I'm trying to run test suite on AMD device. I see no option to re-order devices or specify a default device, so I set tornado args as following:

TORNADO_FLAGS="-Djava.library.path=${TORNADO_SDK}/lib -Ds0.t0.device=0:1 -Dtornado.opencl.compiler.options=\"-w\""

When running tests mentioned is subject, I've got ACCESS_VIOLATION, because Tornado switched between devices when scheduling additional tasks:

$ tornado -Dtornado.unittests.verbose=True -Xmx6g -Dtornado.recover.bailout=False  --debug uk.ac.manchester.tornado.unittests.tools.TornadoTestRunner  uk.ac.manchester.tornado.unittests.arrays.TestArrays
task info: s0.t0
        platform          : AMD Accelerated Parallel Processing
        device            : Pitcairn CL_DEVICE_TYPE_GPU (available)
        dims              : 1
        global work offset: [0]
        global work size  : [4096]
        local  work size  : [1024, 1, 1]

task info: s0.t0
        platform          : AMD Accelerated Parallel Processing
        device            : Pitcairn CL_DEVICE_TYPE_GPU (available)
        dims              : 1
        global work offset: [0]
        global work size  : [128]
        local  work size  : [128, 1, 1]

[TornadoVM-OCL-JNI] ERROR : clEnqueueWriteBuffer -> Returned: -58
task info: s0.t1
        platform          : NVIDIA CUDA
        device            : GeForce GTX 1080 Ti CL_DEVICE_TYPE_GPU (available)
        dims              : 1
        global work offset: [0]
        global work size  : [128]
        local  work size  : [128, 1, 1]

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x00007ffb4d6a00ef, pid=23784, tid=0x0000000000005420
#
# JRE version: OpenJDK Runtime Environment (8.0_262-b10) (build 1.8.0_262-b10)
# Java VM: OpenJDK 64-Bit Server VM GraalVM CE 20.2.0 (25.262-b10-jvmci-20.2-b03 mixed mode windows-amd64 )
# Problematic frame:
# C  [nvopencl64.dll+0x200ef]
#
# Failed to write core dump. Minidumps are not enabled by default on client versions of Windows
#
# An error report file with more information is saved as:
# D:\Projects\TornadoVM\hs_err_pid23784.log
#
# If you would like to submit a bug report, please visit:
#   https://github.com/oracle/graal/issues
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

See how AMD switched to NVIDIA for s0:t1 after s0:t0. And error -58 means just this -- wrong event, coming from another device.

Probably there is an undocumented option to specify default device, but I guess it's wrong to change device in this case: data is streamed-in on another device, so why ever another device should be selected?

[Support] Parallelize loops with constant values as loop-bounds

Currently, TornadoVM only parallelizes loops in which loop bounds are taken as array-length.
Add support for constant-values.

for (@Parallel i = 0; i < CONSTANT; i++) {
   ... 
}

To support this, the compiler phases TornadoAutoParalleliser and TornadoParallelScheduler need to be updated.
This change also influences the batch processing, in which loop-bounds need to be updated according to the batch size.
This change will also provide a set of unittests to check this feature.

Usage of Unsafe class in TornadoVMConfig

After recent sync sources I see the following warnings about usage of sun.misc.Unsafe.

Did you try your code with Graal JDK 11? As far as I know, Unsafe was removed in Java 11.

I'm not sure how to replace offset, but scale seems to be just Byte.BYTES, Double.BYTES, Long.BYTES etc...

[WARNING] /D:/Projects/TornadoVM/runtime/src/main/java/uk/ac/manchester/tornado/runtime/TornadoVMConfig.java:[31,16] sun.misc.Unsafe is internal proprietary API and may be removed in a future release
[WARNING] /D:/Projects/TornadoVM/runtime/src/main/java/uk/ac/manchester/tornado/runtime/TornadoVMConfig.java:[31,16] sun.misc.Unsafe is internal proprietary API and may be removed in a future release
[WARNING] /D:/Projects/TornadoVM/runtime/src/main/java/uk/ac/manchester/tornado/runtime/TornadoVMConfig.java:[31,16] sun.misc.Unsafe is internal proprietary API and may be removed in a future release
[WARNING] /D:/Projects/TornadoVM/runtime/src/main/java/uk/ac/manchester/tornado/runtime/TornadoVMConfig.java:[53,24] sun.misc.Unsafe is internal proprietary API and may be removed in a future release
[WARNING] /D:/Projects/TornadoVM/runtime/src/main/java/uk/ac/manchester/tornado/runtime/TornadoVMConfig.java:[55,24] sun.misc.Unsafe is internal proprietary API and may be removed in a future release
[WARNING] /D:/Projects/TornadoVM/runtime/src/main/java/uk/ac/manchester/tornado/runtime/TornadoVMConfig.java:[57,24] sun.misc.Unsafe is internal proprietary API and may be removed in a future release
[WARNING] /D:/Projects/TornadoVM/runtime/src/main/java/uk/ac/manchester/tornado/runtime/TornadoVMConfig.java:[59,24] sun.misc.Unsafe is internal proprietary API and may be removed in a future release
[WARNING] /D:/Projects/TornadoVM/runtime/src/main/java/uk/ac/manchester/tornado/runtime/TornadoVMConfig.java:[61,24] sun.misc.Unsafe is internal proprietary API and may be removed in a future release
[WARNING] /D:/Projects/TornadoVM/runtime/src/main/java/uk/ac/manchester/tornado/runtime/TornadoVMConfig.java:[63,24] sun.misc.Unsafe is internal proprietary API and may be removed in a future release
[WARNING] /D:/Projects/TornadoVM/runtime/src/main/java/uk/ac/manchester/tornado/runtime/TornadoVMConfig.java:[65,24] sun.misc.Unsafe is internal proprietary API and may be removed in a future release
[WARNING] /D:/Projects/TornadoVM/runtime/src/main/java/uk/ac/manchester/tornado/runtime/TornadoVMConfig.java:[67,24] sun.misc.Unsafe is internal proprietary API and may be removed in a future release
[WARNING] /D:/Projects/TornadoVM/runtime/src/main/java/uk/ac/manchester/tornado/runtime/TornadoVMConfig.java:[69,24] sun.misc.Unsafe is internal proprietary API and may be removed in a future release
[WARNING] /D:/Projects/TornadoVM/runtime/src/main/java/uk/ac/manchester/tornado/runtime/TornadoVMConfig.java:[80,24] sun.misc.Unsafe is internal proprietary API and may be removed in a future release
[WARNING] /D:/Projects/TornadoVM/runtime/src/main/java/uk/ac/manchester/tornado/runtime/TornadoVMConfig.java:[82,24] sun.misc.Unsafe is internal proprietary API and may be removed in a future release
[WARNING] /D:/Projects/TornadoVM/runtime/src/main/java/uk/ac/manchester/tornado/runtime/TornadoVMConfig.java:[84,24] sun.misc.Unsafe is internal proprietary API and may be removed in a future release
[WARNING] /D:/Projects/TornadoVM/runtime/src/main/java/uk/ac/manchester/tornado/runtime/TornadoVMConfig.java:[86,24] sun.misc.Unsafe is internal proprietary API and may be removed in a future release
[WARNING] /D:/Projects/TornadoVM/runtime/src/main/java/uk/ac/manchester/tornado/runtime/TornadoVMConfig.java:[88,24] sun.misc.Unsafe is internal proprietary API and may be removed in a future release
[WARNING] /D:/Projects/TornadoVM/runtime/src/main/java/uk/ac/manchester/tornado/runtime/TornadoVMConfig.java:[90,24] sun.misc.Unsafe is internal proprietary API and may be removed in a future release
[WARNING] /D:/Projects/TornadoVM/runtime/src/main/java/uk/ac/manchester/tornado/runtime/TornadoVMConfig.java:[92,24] sun.misc.Unsafe is internal proprietary API and may be removed in a future release
[WARNING] /D:/Projects/TornadoVM/runtime/src/main/java/uk/ac/manchester/tornado/runtime/TornadoVMConfig.java:[94,24] sun.misc.Unsafe is internal proprietary API and may be removed in a future release
[WARNING] /D:/Projects/TornadoVM/runtime/src/main/java/uk/ac/manchester/tornado/runtime/TornadoVMConfig.java:[96,24] sun.misc.Unsafe is internal proprietary API and may be removed in a future release

Throw an exception when having tasks within the same task-schedule with the same name

Task-names are unique strings within the Task-Schedules. This is used for scheduling across different devices.

Currently, there is no check to guarantee that tasks should have different names.

TaskSchedule ts = new TaskSchedule("foo") 
                              .task("bar", ...)
                              .task("bar", ...);       <<- Throw an error when having this type of code

Support a generic kernel (without specialization) for generic block/grid size.

Breaking unit-tests on AMD GPU

Describe the bug
There are 15 unit-tests in the develop branch failing when AMD is used as the default device.

How To Reproduce
In my setup the AMD is device 0:2. I use the thunder machine which has an AMD gfx900 GPU.

git checkout develop
make
tornado-test.py --ea --verbose -J"-Dtornado.unittests.device=0:2"

List of failing unit-tests

tornado -ea -Dtornado.unittests.verbose=True -Xmx6g -Dtornado.recover.bailout=False -Dtornado.unittests.device=0:2 uk.ac.manchester.tornado.unittests.tools.TornadoTestRunner  uk.ac.manchester.tornado.unittests.functional.TestLambdas

tornado -ea -Dtornado.unittests.verbose=True -Xmx6g -Dtornado.recover.bailout=False -Dtornado.unittests.device=0:2 uk.ac.manchester.tornado.unittests.tools.TornadoTestRunner  uk.ac.manchester.tornado.unittests.vectortypes.TestFloats

tornado -ea -Dtornado.unittests.verbose=True --debug -Xmx6g -Dtornado.recover.bailout=False -Dtornado.unittests.device=0:2 uk.ac.manchester.tornado.unittests.tools.TornadoTestRunner  uk.ac.manchester.tornado.unittests.reductions.TestReductionsFloats#testComputePi

tornado -ea -Dtornado.unittests.verbose=True -Xmx6g -Dtornado.recover.bailout=False -Dtornado.unittests.device=0:2 uk.ac.manchester.tornado.unittests.tools.TornadoTestRunner  uk.ac.manchester.tornado.unittests.profiler.TestProfiler

tornado -ea -Dtornado.unittests.verbose=True -Xmx6g -Dtornado.recover.bailout=False -Dtornado.unittests.device=0:2 uk.ac.manchester.tornado.unittests.tools.TornadoTestRunner  uk.ac.manchester.tornado.unittests.grid.TestGrid

tornado -ea -Dtornado.unittests.verbose=True -Xmx6g -Dtornado.recover.bailout=False -Dtornado.unittests.device=0:2 -Dtornado.device.desc=/home/thanos/repositories/tornadoVM/bin/sdk/examples/virtual-device-GPU.json -Dtornado.print.kernel=True -Dtornado.virtual.device=True -Dtornado.print.kernel.dir=/home/thanos/repositories/tornadoVM/bin/sdk/virtualKernelOut.out uk.ac.manchester.tornado.unittests.tools.TornadoTestRunner  uk.ac.manchester.tornado.unittests.virtual.TestVirtualDeviceKernel#testVirtualDeviceKernelGPU

tornado -ea -Dtornado.unittests.verbose=True -Xmx6g -Dtornado.recover.bailout=False -Dtornado.unittests.device=0:2 -Dtornado.device.desc=/home/thanos/repositories/tornadoVM/bin/sdk/examples/virtual-device-CPU.json -Dtornado.print.kernel=True -Dtornado.virtual.device=True -Dtornado.print.kernel.dir=/home/thanos/repositories/tornadoVM/bin/sdk/virtualKernelOut.out uk.ac.manchester.tornado.unittests.tools.TornadoTestRunner  uk.ac.manchester.tornado.unittests.virtual.TestVirtualDeviceKernel#testVirtualDeviceKernelCPU

tornado -ea -Dtornado.unittests.verbose=True -Xmx6g -Dtornado.recover.bailout=False -Dtornado.unittests.device=0:2 -Dtornado.device.desc=/home/thanos/repositories/tornadoVM/bin/sdk/examples/virtual-device-GPU.json -Dtornado.virtual.device=True -Dtornado.feature.extraction=True -Dtornado.features.dump.dir=/home/thanos/repositories/tornadoVM/bin/sdk/virtualFeaturesOut.out uk.ac.manchester.tornado.unittests.tools.TornadoTestRunner  uk.ac.manchester.tornado.unittests.virtual.TestVirtualDeviceFeatureExtraction#testVirtualDeviceFeaturesGPU

tornado -ea -Dtornado.unittests.verbose=True -Xmx6g -Dtornado.recover.bailout=False -Dtornado.unittests.device=0:2 -Dtornado.device.desc=/home/thanos/repositories/tornadoVM/bin/sdk/examples/virtual-device-CPU.json -Dtornado.virtual.device=True -Dtornado.feature.extraction=True -Dtornado.features.dump.dir=/home/thanos/repositories/tornadoVM/bin/sdk/virtualFeaturesOut.out uk.ac.manchester.tornado.unittests.tools.TornadoTestRunner  uk.ac.manchester.tornado.unittests.virtual.TestVirtualDeviceFeatureExtraction#testVirtualDeviceFeaturesCPU

Resulting in:

==================================================
              Unit tests report 
==================================================

{'[FAILED]': 15, '[PASS]': 321}
Coverage: 95.54%

==================================================

Enhance Reduction Operators for TornadoMath functions

I tried in TestReductionsDoubles to replace Math.max() with TornadoMath.max().

private static void maxReductionAnnotation(double[] input, @Reduce double[] result) {
        for (@Parallel int i = 0; i < input.length; i++) {
            result[0] = Math.max(result[0], input[i]);
        }
}

private static void maxReductionAnnotation(double[] input, @Reduce double[] result) {
        for (@Parallel int i = 0; i < input.length; i++) {
            result[0] = TornadoMath.max(result[0], input[i]);
        }
}

This replacement resulted in the following error.

Test: class uk.ac.manchester.tornado.unittests.reductions.TestReductionsDoubles#testMaxReduction
	Running test: testMaxReduction           ................  [FAILED] 
		\_[REASON] [ERROR] Automatic reduce operation not supported yet: 17|OCLFPBinaryIntrinsic
	uk.ac.manchester.tornado.api.exceptions.TornadoRuntimeException: [ERROR] Automatic reduce operation not supported yet: 17|OCLFPBinaryIntrinsic
	at uk.ac.manchester.tornado.runtime.analyzer.ReduceCodeAnalysis.getReduceOperation(ReduceCodeAnalysis.java:117)

The same issue is shown for other operations (e.g. min, pow...)

So, the feature is to update the ReduceCodeAnalysis to support the operators in TornadoMath package.

How to test:

Change the Math.max() with TornadoMath.max() and run:

$ make
$ tornado -ea -Dtornado.unittests.verbose=True -Xmx6g -Dtornado.recover.bailout=False uk.ac.manchester.tornado.unittests.tools.TornadoTestRunner  uk.ac.manchester.tornado.unittests.reductions.TestReductionsDoubles#testMaxReduction

[DOC] Update and complete Javadoc for the TornadoVM-API

Update the existing Javadoc and complete with the latest features in the TornadoVM API.

This involves:

TornadoVM API types
TornadoVM Task-Schedule API
TornadoVM Annotations
TornadoVM Runtime API

TornadoVM gets stuck during sketch compilation in static context

Describe the bug

Creating a TaskSchedule and Task in a static block (during clinit) can cause TornadoVM to get stuck. This will happen if the method passed to the task will have a non-inlined callee (invokestatic bytecode).

How To Reproduce

Run the code provided with the compiler flag -XX:CompileCommand="dontinline Main *"

public class Main {

    public static void iNeedAnInvokeStaticBytecode() {
    }

    public static void testMethod(int[] in) {
        iNeedAnInvokeStaticBytecode();
    }

    static int[] inTor = new int[] { 0 };
    static TaskSchedule ts = new TaskSchedule("s0");
    static {
        ts.task("t0", Main::testMethod, inTor);
    }

    public static void main(String[] args) {
        ts.execute();
    }
}

I tested the code and it seems to happen with both JDK 8 & 11 -- have not tested on other versions.

The sketch compilation for the task method (testMethod) gets stuck during bytecode parsing, inside JVMCI.

TornadoVM does not fully respect the SSA form during LIR building

In case of if conditions we respect the SSA:

z_295  =  i_159 == i_4;
  if(z_295)
  {
    ....

In case of for loops, I am not sure we do:

i_4  =  i_3;
  for(;i_4 < 8;)  {
    ....
    i_15  =  i_14 + i_4;
    i_4  =  i_15;
  }

I am not sure this is an actual problem and open this issue for discussion.

TornadoVM::dumpProfiles has incorrect printf format specifier

$ tornado --version
version=0.6
branch=master
commit=c0b3304

To reproduce: runBenchmarks.sh

profile.getExecutionTime() returns a long but the format specifier is for a float

--- ./runtime/src/main/java/uk/ac/manchester/tornado/runtime/TornadoVM.java	2020-03-13 12:30:21.000000000 -0400
***************
*** 635,641 ****
                      TornadoAcceleratorDevice device = (TornadoAcceleratorDevice) eventset.getDevice();
                      final Event profile = device.resolveEvent(i);
                      if (profile.getStatus() == COMPLETE) {
!                         System.out.printf("task: %s %s %.9f %9d %9d %9d\n", device.getDeviceName(), meta.getId(), profile.getExecutionTime(), profile.getSubmitTime(), profile.getStartTime(),
                                  profile.getEndTime());
                      }
                  }
--- 635,641 ----
                      TornadoAcceleratorDevice device = (TornadoAcceleratorDevice) eventset.getDevice();
                      final Event profile = device.resolveEvent(i);
                      if (profile.getStatus() == COMPLETE) {
!                         System.out.printf("task: %s %s %9d %9d %9d %9d\n", device.getDeviceName(), meta.getId(), profile.getExecutionTime(), profile.getSubmitTime(), profile.getStartTime(),
                                  profile.getEndTime());
                      }
                  }

Support Kotlin

I would like Tornado to support Kotlin. The reason behind that is:
. Kotlin is expressive and concise, hence a good fit for data intensive / machine learning tasks, better fit than Java
. Kotlin is JVM language, might be relatively easy to support given Java is already supported

Make OCLKind and PTXKind consistent for LocalArrayNodes

Describe the feature
There is an inconsistency between the OpenCL and the PTX backends.
The inconsistency is observed in the generate method of the LocalArrayNode. Prior to the TornadoVMContext features, this node was used only from the snippets. So, we could use the constructor that uses the ResolvedJavaType field. Now, that we append this node from the CompilerGraphPlugins, we use a second constructor that uses the JavaKind instead of the ResolvedJavaType.

This change in the OpenCL backend was ok, as the generate method used the following line to obtain the LIRKind.

// line in the generate method
LIRKind lirKind = LIRKind.value(gen.getLIRGeneratorTool().target().arch.getWordKind());

However, in the PTX backend the LIRKind was obtained by the kind that was initialized in the constructor from the ResolvedJavaType.

// line in the constructor
this.kind = PTXKind.fromResolvedJavaType(elementType)

// line in the generate method
LIRKind lirKind = LIRKind.value(kind);

In order to make it work for the TornadoVMContext, I made the PTX generate method to use the gen.getLIRGeneratorTool().target().arch.getWordKind() similar to the OpenCL method. This seems to work for all unit-tests.

Feature description:
After discussion with @jjfumero and @gigiblender, we decided to merge the new API, and open this issue for this feature:
Make the LIRKind to be obtained by the kind (OCLKind or PTXKind).

To do this, we can create a new method in the OCLKind, PTXKind classes that resolves the backend kind from the JavaKind. For example:

public static OCLKind fromResolvedJavaKind(JavaKind javaKind);

public static PTXKind fromResolvedJavaKind(JavaKind javaKind);

Different (incorrect) results when running with GPU compared to sequential

Describe the bug
When testing an implementation of an FFT with TornadoVM I found that I am getting different (and incorrect) results when running the exact same method on the GPU as compared to calling it directly. I narrowed the code down to a simple test case which sets up an array of numbers, retrieves them in sets of 8 values, multiplies them by 10, and puts them back into shuffled locations relative to where they are read from. In the output below it appears that line numbered 2 (and 6) in GPU mode have 10x the number that has been written into the location it reads from rather than 10x the number that was originally there. So it is 200 (10 x 20) rather than 10 (10 x 1). Note that the output is in pairs as original FFT code is to support real/imaginary pairs in alternate locations.

Output in sequential mode:

0   0.000   0.000
1  20.000 160.000
2  10.000  80.000
3  30.000 240.000
4  40.000 320.000
5  60.000 480.000
6  50.000 400.000
7  70.000 560.000

Output in gpu mode

 0   0.000   0.000
 1  20.000 160.000
 2 200.000 1600.000
 3  30.000 240.000
 4  40.000 320.000
 5  60.000 480.000
 6 600.000 4800.000
 7  70.000 560.000

How To Reproduce
Compile the attached ShuffleTest.java file and run:
tornado -cp target/example-1.0-SNAPSHOT.jar org.nmrfx.tornado.ShuffleTest seq > seq.txt
tornado -cp target/example-1.0-SNAPSHOT.jar org.nmrfx.tornado.ShuffleTest gpu > gpu.txt
Compare the two output files

ShuffleTest.txt

Expected behavior
The output in gpu.txt and seq.txt should be the same.

Computing system setup (please complete the following information):

OS: Ubuntu 20
OpenCL Version : Using PTX
TornadoVM commit id: 5022bea

Additional context

Number of Tornado drivers: 1
Total number of PTX devices : 3
Tornado device=0:0
PTX -- GeForce RTX 2080 SUPER
Global Memory Size: 7.8 GB
Local Memory Size: 48.0 KB
Workgroup Dimensions: 3
Max WorkGroup Configuration: [1024, 1024, 64]
Device OpenCL C version: N/A

Wrong default compiler options: OpenCL 2.0 enforced

From documentation:

-Dtornado.opencl.compiler.options=LIST_OF_OPTIONS:
It allows to pass the compile options specified by the OpenCL CLBuildProgram specification to TornadoVM at runtime. By default it doesn't enable any.

From code (AbstractMetaData constructor):

openclCompilerOptions = (getProperty("tornado.opencl.compiler.options") == null) ? "-w -cl-std=CL2.0" : getProperty("tornado.opencl.compiler.options");

This prevents AMD OpenCL devices 1.1 / 1.2 from running (NVIDIA ignores this regardless of OpenCL 1.2 ver).

Refactor the TimeProfiler

Describe the bug
Currently, this method (addValueToMetric) expects a taskName to be passed.
The addValueToMetric method is called for the TASK_COPY_IN_SIZE_BYES and TASK_COPY_OUT_SIZE_BYTES profile type, which is not correct. Objects passed as parameters to tasks belong to the whole task schedule context (i.e multiple tasks that belong to the same task schedule and use the same object parameter will result in a single COPY_IN/STREAM_IN).

Therefore, I think the TASK_COPY_IN/OUT_SIZE_BYTES should be profiled per task schedule, and not individual tasks.

How To Reproduce
To reproduce, run the test below with the -Dtornado.profiler=True flag

    public static void add(int[] a, int[] b) {
        for (@Parallel int i = 0; i < a.length; i++) {
            b[i] = a[i] + a[i];
        }
    }

    public static void mult(int[] a, int[] b) {
        for (@Parallel int i = 0; i < b.length; i++) {
            b[i] = b[i] + a[i] * 3;
        }
    }

    public static void main(String[] args) {
        int n = 32;
        int[] a = new int[n];
        int[] b = new int[n];

        TaskSchedule ts = new TaskSchedule("s0")
                .task("t0", Main::add, a, b)
                .task("t1", Main::mult, a, b)
                .streamOut(b);

        ts.execute();
    }

The output produced by the profiler is:

{
    "s0": {
        "TOTAL_DISPATCH_DATA_TRANSFERS_TIME": "51936",
        "TOTAL_TASK_SCHEDULE_TIME": "298600160",
        "TOTAL_DRIVER_COMPILE_TIME": "169628101",
        "TOTAL_GRAAL_COMPILE_TIME": "51983174",
        "TOTAL_KERNEL_TIME": "18176",
        "TOTAL_DISPATCH_KERNEL_TIME": "15200",
        "TOTAL_BYTE_CODE_GENERATION": "5949780",
        "COPY_IN_TIME": "4512",
        "COPY_OUT_TIME": "1888",
        "s0.t0": {
            "METHOD": "Main.add",
            "DEVICE_ID": "0:0",
            "DEVICE": "GeForce GTX 1650",
            "TASK_COPY_OUT_SIZE_BYTES": "152",
            "TASK_COPY_IN_SIZE_BYTES": "344",
            "TASK_COMPILE_GRAAL_TIME": "35695419",
            "TASK_KERNEL_TIME": "9984",
            "TASK_COMPILE_DRIVER_TIME": "88543309"
        }, 
        "s0.t1": {
            "METHOD": "Main.mult",
            "DEVICE_ID": "0:0",
            "DEVICE": "GeForce GTX 1650",
            "TASK_COPY_IN_SIZE_BYTES": "40",
            "TASK_COMPILE_GRAAL_TIME": "16287755",
            "TASK_KERNEL_TIME": "8192",
            "TASK_COMPILE_DRIVER_TIME": "81084792"
        }
    }
}

Even though objects a and b are used by both t0 and t1, the TASK_COPY_IN_SIZE_BYTES and TASK_COPY_OUT_SIZE_BYTES are only reported for t0.

Expected behavior
The expected output for the test above would be:


{
    "s0": {
        "TOTAL_DISPATCH_DATA_TRANSFERS_TIME": "51936",
        "TOTAL_TASK_SCHEDULE_TIME": "298600160",
        "TOTAL_DRIVER_COMPILE_TIME": "169628101",
        "TOTAL_GRAAL_COMPILE_TIME": "51983174",
        "TOTAL_KERNEL_TIME": "18176",
        "TOTAL_DISPATCH_KERNEL_TIME": "15200",
        "TOTAL_BYTE_CODE_GENERATION": "5949780",
        "COPY_IN_TIME": "4512",
        "COPY_OUT_TIME": "1888",
        "COPY_OUT_SIZE_BYTES": "XXXX",
        "COPY_IN_SIZE_BYTES": "XXXX",
        "s0.t0": {
            "METHOD": "Main.add",
            "DEVICE_ID": "0:0",
            "DEVICE": "GeForce GTX 1650",
            "TASK_COMPILE_GRAAL_TIME": "35695419",
            "TASK_KERNEL_TIME": "9984",
            "TASK_COMPILE_DRIVER_TIME": "88543309"
        }, 
        "s0.t1": {
            "METHOD": "Main.mult",
            "DEVICE_ID": "0:0",
            "DEVICE": "GeForce GTX 1650",
            "TASK_COMPILE_GRAAL_TIME": "16287755",
            "TASK_KERNEL_TIME": "8192",
            "TASK_COMPILE_DRIVER_TIME": "81084792"
        }
    }
}

Additional context
Also, Javadoc for the datastructures of the profiler should be added.

Adding asynchronous execution to TaskSchedule

Currently TaskSchedule API contains only blocking versions of execute methods:

void execute() ;
void execute(GridTask gridTask);
void executeWithProfiler(Policy policy);
void executeWithProfilerSequential(Policy policy);
void executeWithProfilerSequentialGlobal(Policy policy);

All these methods block currently executing Java thread till all computations are done.
However, computations are typically off-loaded to GPU and CPU at this moment just keeps waiting.

I think it should be both beneficial and possible to add asynchronous versions of the same methods with the following signatures:

CompletableFuture executeAsync() ;
CompletableFuture executeAsync(GridTask gridTask);
CompletableFuture executeWithProfilerAsync(Policy policy);
CompletableFuture executeWithProfilerSequentialAsyn(Policy policy); // Not sure about "sequential async"
CompletableFuture executeWithProfilerSequentialGlobalAsync(Policy policy); // Not sure about "sequential async"

Thoughts about implementation

Per my understanding, TaskSchedule delegates back to TornadoTaskSchedule

And this objects waits on Event event object (driver-specific). There are specific classes in each driver- CLEvent and PTXEvent

OpenCL provides clSetEventCallback, CUDA has cudaLaunchHostFunc -- so it's possible to get async notifications from both OpenCL and PTX drivers.

So it should be possible to extend CLEvent and PTXEvent + PTXStream to add some form of listeners, where concrete listener inside TornadoTaskSchedule can settle CompletableFuture returned from the proposed TaskSchedule.executeAsync().

Thought?

Extend the API with missing Vector Types

Currently the TornadoVM API provides limited support for vector types.
In the current state, the following vector types are supported:

VectorFloat3
VectorFloat4

Extending the API to provide support for a number of vector types can allow us the following. 1) to be able to generate code and IR using vector types, so we can access the performance benefits of using auto-verctorization with tornado., 2) for architectures like multi-core CPUs, intel HD graphics or co-processors like Xeon Phi that the provide wider vector units (up to 512bits witdth) to be able to tune application from the API to merit from vectorization.

A few indicative numbers for using opencl vector types are presented her:
Evaluating vector data type usage in OpenCL kernels

[Proposal] Pass all method parameters through the call stack and improve code cache strategy

Currently, we have two different strategies for caching compilation results, one for each backend (PTX, OpenCL).

For the PTX backend, we rely on the identity of the function parameters passed to the task. The issue with this is that a recompilation will be triggered every time a parameter is changed.

For the OpenCL backend, the key to the code cache is scheduleNo.taskNo-methodName. This can cause conflicts when multiple task schedules with the same name are created in different scopes.

For example with the code below:

    static class Data {
        int[] inTor;
        int[] outTor;
        int[] inSeq;
        int[] outSeq;

        public Data(int inTorSize, int outTorSize) {
            Random random = new Random();

            inTor = new int[inTorSize];
            outTor = new int[outTorSize];
            for (int i = 0; i < inTorSize; i++) {
                inTor[i] = random.nextInt();
            }
            for (int i = 0; i < outTorSize; i++) {
                outTor[i] = random.nextInt();
            }

            inSeq = inTor.clone();
            outSeq = outTor.clone();
        }
    }

    public static void testMethod(int[] in, int[] out) {
        for (@Parallel int i = 0; i < in.length; i++) {
            out[i] = in[i];
        }

    }

    public static void testMethod2(int[] in, int[] out) {
        for (@Parallel int i = 0; i < in.length; i++) {
            out[i] = in[i];
        }

    }

    public static void main(String[] args) {
        int N1 = 1024;

        // // FIRST SCOPE
        {
            Data data = new Data(N1, N1 * N1);
            TaskSchedule ts = new TaskSchedule("s0")
                    .task("t0", Main::testMethod, data.inTor, data.outTor)
                    .task("t1", Main::testMethod2, data.inTor, data.outTor)
                    .streamOut(data.inTor, data.outTor);

            ts.execute();
        }

        // SECOND SCOPE
        {
            N1 = N1 / 2;                                          // <---------- Use different input objects and size
            Data data = new Data(N1, N1 * N1);
            TaskSchedule ts = new TaskSchedule("s0")
                    .task("t0", Main::testMethod, data.inTor, data.outTor)
                    .task("t1", Main::testMethod2, data.inTor, data.outTor)
                    .streamOut(data.inTor, data.outTor);

            ts.execute();
        }
    }

The OpenCL backend will not recompile for the first task t0 in the second scope and therefore use the wrong inlined array length in.length value in the kernel. The reason the second task t1 is recompiled is a side effect from here
The PTX backend will trigger 4 compilations in total.

I think the way to solve this is to pass all the task parameters (primitives and object references) through the call stack.
We also might need to stop inlining array lengths in the @Parallel annotated loops (for(;i_3 < 1024;)).

Can the whole repo be shared with ASL2.0/MIT

Is your feature request related to a problem? Please describe.
Can part which is GPL be relicensed under ASL2.0/MIT

Describe the solution you'd like
Whole repo under ASL2.0/MIT

[Task] Prepare TornadoVM 0.9 release

Update documentation and CHANGELOGs
Update version to 0.9
Checks with JDK 8, 11,
Build and update maven links
Build and check new docker images

Time estimation: ~4-6h

Release: ~15th April 2021

Several benchmarks fail when running on MacOS due to ambiguous functions

Describe the bug
Running tornado-benchmarks.py gives failures with calls to ambiguous functions:

nbody : call to '__cl_sqrt' is ambiguous
dgemm: call to '__cl_fma' is ambiguous
dft: call to '__fast_relax_sin' is ambiguous
call to '__fast_relax_cos' is ambiguous

There are a couple other failures (see output) , but I thought I would limit this bug report to the ambiguous function failures

How To Reproduce
Run tornado-benchmarks.py

A clear and concise description of what you expected to happen.
All benchmarks should run
benchmark.txt

Computing system setup (please complete the following information):

OS: MacOS Catalina 10.15.7 MacBook Pro 6-Core Intel I7, 16Gb memory, Radeo Pro 560X 4 GB
OpenCL Version : 1.2
TornadoVM commit id : e8c3161

Additional context

Android and IOS ( RoboVM ) support ?

Hi , it seems this library works on all Desktops.
But i think Android is also Linux fork.So it wont be hard to migrate this library.
Also for IOS/Metal (RoboVM) support will be also good.

Invalid binary in AWS EC2 F1 instance, while AFI is not ready

Is your feature request related to a problem? Please describe.
Due to the AWS FPGA toolchain, once the Xilinx bitstream is generated it cannot be immediately executed. It requires first to create an Amazon FPGA Image (AFI).

In TornadoVM, the user invokes the aws_post_processing.sh script which creates an Amazon FPGA Image (AFI), updates the link of the FPGA binary and uploads a description to an AWS S3 bucket. The creation of the AFI takes around 30-40 minutes.

As a result, TornadoVM will crash the first time it aims to execute the AFI with an error in clBuildProgramWithBinary, which is expected, as AFI is not ready. What do you think about this case? I guess it is not important, as users can query the status of the AFI and then execute on the FPGA. But, I would like to listen any opinions.

Describe the solution you'd like
I was thinking of a solution that has the following:

The aws_post_processing.sh script to be automatically triggered by the TornadoVM runtime.
During the time that the AFI is not ready, the VM to be shutdown or somehow make it hold till the AFI status is ready.

[FEAT] Disable deoptimizations when running unittests

Add an option to disable automatic deoptimizations and use the new option when running the unittests.

Add Javadoc and document the examples in TornadoVM

Add Javadoc to document the examples in TornadoVM.

This affects the packages under the examples module:

https://github.com/beehive-lab/TornadoVM/tree/master/tornado-examples/src/main/java/uk/ac/manchester/tornado/examples

The documentation is at the class-level and it will contain a description of how the TornadoVM API is used for each example. Additionally, it contains how to run the example from the command line.

This class contains an example of documentation of what we are looking for:

https://github.com/beehive-lab/TornadoVM/blob/master/tornado-examples/src/main/java/uk/ac/manchester/tornado/examples/polyglot/HelloPython.java#L26-L38

Summary

Javadoc at the class-level
High-level description of the problem/example
Description of the TornadoVM API calls used and how are they composed
How to run from the command line.

Bitset benchmarks failing - MBP 16 MacOS 10.15.3, TornadoVM SDK 0.6, GraalVM Java 8 19.3.0

tornado --devices                                                

Number of Tornado drivers: 1
Total number of devices  : 3
Tornado device=0:0
	Apple -- Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
		Global Memory Size: 32.0 GB
		Local Memory Size: 32.0 KB
		Workgroup Dimensions: 3
		Max WorkGroup Configuration: [1024, 1, 1]
		Device OpenCL C version: OpenCL C 1.2

Tornado device=0:1
	Apple -- Intel(R) UHD Graphics 630
		Global Memory Size: 1.5 GB
		Local Memory Size: 64.0 KB
		Workgroup Dimensions: 3
		Max WorkGroup Configuration: [256, 256, 256]
		Device OpenCL C version: OpenCL C 1.2

Tornado device=0:2
	Apple -- AMD Radeon Pro 5500M Compute Engine
		Global Memory Size: 8.0 GB
		Local Memory Size: 64.0 KB
		Workgroup Dimensions: 3
		Max WorkGroup Configuration: [256, 256, 256]
		Device OpenCL C version: OpenCL C 1.2

Environment vars:

TORNADO_SDK=/usr/local/tornado-sdk/tornado-sdk-0.6-c0b3304
JAVA_HOME=/usr/local/graalvm-ce-java8/19.3.0/Contents/Home
DEVICES=0:0,0:1,0:2

tornado-benchmarks.py

Stack trace:

bm=bitset-131-8192, id=java-reference      , average=1.871792e+03, median=1.492000e+03, firstIteration=5.065290e+05, best=1.128000e+03
uk.ac.manchester.tornado.drivers.opencl> notify error:
uk.ac.manchester.tornado.drivers.opencl> [CL_DEVICE_NOT_AVAILABLE] : OpenCL Error : Error: build program driver returned (-2)
uk.ac.manchester.tornado.drivers.opencl> notify error:
uk.ac.manchester.tornado.drivers.opencl> OpenCL Warning : clBuildProgram failed: could not build program for 0xffffffff (Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz) (err:-2)
uk.ac.manchester.tornado.drivers.opencl> notify error:
uk.ac.manchester.tornado.drivers.opencl> [CL_BUILD_ERROR] : OpenCL Build Error : Compiler build log:
<program source>:33:3: error: expected expression
  else
  ^


Exception in thread "main" uk.ac.manchester.tornado.api.exceptions.TornadoRuntimeException: [ERROR] Generated Kernel is NULL. 
Please report this issue to https://github.com/beehive-lab/TornadoVM
	at uk.ac.manchester.tornado.drivers.opencl.graal.OCLInstalledCode.checkKernelNotNull(OCLInstalledCode.java:353)
	at uk.ac.manchester.tornado.drivers.opencl.graal.OCLInstalledCode.submitWithoutEvents(OCLInstalledCode.java:359)
	at uk.ac.manchester.tornado.drivers.opencl.graal.OCLInstalledCode.launchWithoutDeps(OCLInstalledCode.java:389)
	at uk.ac.manchester.tornado.runtime.TornadoVM.execute(TornadoVM.java:498)
	at uk.ac.manchester.tornado.runtime.TornadoVM.execute(TornadoVM.java:194)
	at uk.ac.manchester.tornado.runtime.tasks.TornadoTaskSchedule.scheduleInner(TornadoTaskSchedule.java:421)
	at uk.ac.manchester.tornado.runtime.tasks.TornadoTaskSchedule.schedule(TornadoTaskSchedule.java:642)
	at uk.ac.manchester.tornado.api.TaskSchedule.execute(TaskSchedule.java:230)
	at uk.ac.manchester.tornado.benchmarks.bitset.BitsetTornado.code(BitsetTornado.java:71)
	at uk.ac.manchester.tornado.benchmarks.BenchmarkDriver.benchmark(BenchmarkDriver.java:96)
	at uk.ac.manchester.tornado.benchmarks.BenchmarkRunner.benchmarkAll(BenchmarkRunner.java:113)
	at uk.ac.manchester.tornado.benchmarks.BenchmarkRunner.run(BenchmarkRunner.java:82)
	at uk.ac.manchester.tornado.benchmarks.BenchmarkRunner.main(BenchmarkRunner.java:154)

Endianness of OpenCL driver

This is not actually a bug but rather a question - as long as I can't get necessary hardware/software to test.
Per my understanding of the code, the OpenCL driver directly "projects" java arrays as OpenCL buffers (with elements larger than a byte like long or double).

As long as Java always stores primitives in little endian format, does it mean that TornadoVM only support environments (combo of hardware & software) with little endian encoding?

Did you test on any platform with Big Endian? Like ones mentioned here - https://en.wikipedia.org/wiki/Endianness (OpenRISC, IBM z/Architecture, SPARC + corresponding OS like AIX / Solaris)? Or all such platforms are considered obsolete?

Or do you use some OpenCL-specific annotations in generated kernels to enforce little-endianness?

Deprecate passing local and global dimensions for tasks as a runtime flag

Currently, we can specify the local/global work for a task through a runtime flag: <ts>.<taskName>.global/local.dims=x,x,x, bypassing the scheduler.
Since we now have the Grid API, I think we can deprecate these options and remove the extra logic in the TaskMetaData.
Please note that slambench uses the runtime flags and needs refactoring as well.

At the suggestion of @stratika, we can also rename the GridTask to GridScheduler.

MonteCarloDynamic kernel failing on the Xilinx FPGA

Describe the bug
I noticed a problem with the Montecarlo kernel in the dynamic package for all the sizes, when executing on the Xilinx KCU1500 FPGA. There is no error in the compilation, but the kernel does not finish and it causes failures at the driver level regarding the dma. The problem seems like this:

[  815.440478] xocl:engine_status_dump: SG engine 0-H2C1-MM status: 0x00000000:
[  815.440480] xocl:engine_status_dump: SG engine 0-H2C0-MM status: 0x00000001: BUSY
[  815.440483] xocl:transfer_abort: abort transfer 0x000000009584ae00, desc 11, engine desc queued 0.
[  815.440487] xocl:transfer_abort: abort transfer 0x00000000d2360335, desc 1, engine desc queued 0.
[  815.440505] xocl_mm_xdma mm_dma.v5.u.256: xdma_migrate_bo: DMA failed, Dumping SG Page Table
[  815.440508] xocl_mm_xdma mm_dma.v5.u.256: xdma_migrate_bo: DMA failed, Dumping SG Page Table
[  815.440516] xocl_mm_xdma mm_dma.v5.u.256: xdma_migrate_bo: 0, 0xf3ce7c000
[  815.440521] xocl_mm_xdma mm_dma.v5.u.256: xdma_migrate_bo: 1, 0xf3d800000
[  815.440526] xocl_mm_xdma mm_dma.v5.u.256: xdma_migrate_bo: 2, 0xf3d400000
[  815.440531] xocl_mm_xdma mm_dma.v5.u.256: xdma_migrate_bo: 3, 0xf3f000000
[  815.440536] xocl_mm_xdma mm_dma.v5.u.256: xdma_migrate_bo: 4, 0xf7d000000
[  815.440540] xocl_mm_xdma mm_dma.v5.u.256: xdma_migrate_bo: 5, 0xf4f800000
[  815.440545] xocl_mm_xdma mm_dma.v5.u.256: xdma_migrate_bo: 6, 0xf54800000
[  815.440550] xocl_mm_xdma mm_dma.v5.u.256: xdma_migrate_bo: 7, 0xf60400000
[  815.440554] xocl_mm_xdma mm_dma.v5.u.256: xdma_migrate_bo: 8, 0xf61c00000
[  815.440559] xocl_mm_xdma mm_dma.v5.u.256: xdma_migrate_bo: 9, 0xf3b800000
[  815.440568] xocl_mm_xdma mm_dma.v5.u.256: xdma_migrate_bo: 0, 0xf3821f000

This problem occurs only on the Xilinx KCU1500 FPGA. The Intel Nallatech Arria 10 FPGA is working both in emulation mode and the other two modes (Full Jit and AoT).

So, I did some work around and compared the previous kernel that was working (about 2 months old) and the current one. I took the body of the old kernel and applied two changes that we introduced in the latest version:
a) altered the number regarding the frame number from 6 to 0.
b) removed the private region parameter.

The modified kernel seems to be working. So, the main difference between the two kernels is shown in the figure (Left kernel is the old one that is working, Right kernel is the new one that causes the problem):

How To Reproduce
tornado -Ds0.t0.device=0:1 -Xmx20g -Xms20g --printKernel --debug uk.ac.manchester.tornado.examples.dynamic.MontecarloDynamic 65536 default 1

Note that device 0:1 is the xilinx_kcu1500_dynamic_5_0 CL_DEVICE_TYPE_ACCELERATOR

Computing system setup (please complete the following information):

OS: Ubuntu 18.04.02 LTS
OpenCL Version: 1.0
TornadoVM commit id: ed243aa

Any ideas? I am not familiar with this change about the fma.

[BUG] Multiple conditions evaluation results in wrong code gen

Describe the bug

The generated code omits a condition when having a complex if condition:

The result code for OpenCL is as follows:

#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable  
__kernel void testIfInt6(__global uchar *_heap_base, ulong _frame_base, __constant uchar *_constant_region, __local uchar *_local_region, __global int *_atomics)
{
  int i_9, i_8, i_7, i_11, i_1, i_2; 
  bool z_10; 
  ulong ul_6, ul_0; 
  long l_3, l_5, l_4; 

  __global ulong *_frame = (__global ulong *) &_heap_base[_frame_base];


  // BLOCK 0
  ul_0  =  (ulong) _frame[3];
  i_1  =  get_global_id(0);
  // BLOCK 1 MERGES [0 5 ]
  i_2  =  i_1;
  for(;i_2 < 256;)  {
    // BLOCK 2
    l_3  =  (long) i_2;
    l_4  =  l_3 << 2;
    l_5  =  l_4 + 24L;
    ul_6  =  ul_0 + l_5;
    i_7  =  *((__global int *) ul_6);
    i_8  =  get_global_size(0);
    i_9  =  i_8 + i_2;
    z_10  =  i_7 < 2;
    if(z_10)
    {
      // BLOCK 3
      *((__global int *) ul_6)  =  100;
      i_7  =  100;
    }  // B3
    else
    {
      // BLOCK 4
      *((__global int *) ul_6)  =  200;
      i_7  =  200;
    }  // B4
    // BLOCK 5 MERGES [3 4 ]
    i_11  =  i_9;
    i_2  =  i_11;
  }  // B5
  // BLOCK 6
  return;
}  //  kernel

How To Reproduce

This test case reproduces the error.

    public static void testIfInt6(int[] a) {
        for (@Parallel int i = 0; i < a.length; i++) {
            if (a[i] >= 0 && a[i] <= 1) {
                a[i] = 100;
            } else {
                a[i] = 200;
            }
        }
    }


 @Test
    public void test06() {
        final int numElements = 256;
        int[] a = new int[numElements];
        int[] expectedResult = new int[numElements];

        Arrays.fill(a, -1);
        Arrays.fill(expectedResult, 200);

        new TaskSchedule("s0") //
                .task("t0", TestKernels::testIfInt6, a) //
                .streamOut(a) //
                .execute(); //

        assertArrayEquals(expectedResult, a);
    }

The problem is that during one of the low-tier compiler phases the multiple conditions is removed. I suspect this is during one of the canonicalization phases.

Printing VectorFloat3 of size <4 throws IllegalFormatConversionException

Describe the bug
Printing VectorFloat3 of size <4 throws IllegalFormatConversionException

How To Reproduce
Attempt to print a VectorFloat3 of size less than four, for example :

VectorFloat3 circleCenters = new VectorFloat3(2);

circleCenters.set(0, new Float3(0,0,10));
circleCenters.set(1, new Float3(1000,1000,2000));

System.out.println(circleCenters);

Expected behavior

Output:
<[0, 0, 10], [1000, 1000, 2000]>

Computing system setup (please complete the following information):

OS: Arch Linux
OpenCL Version 1.2
TornadoVM commit id 29b4554

Additional context

This is caused by the following code in uk.ac.manchester.tornado.api.collections.types.VectorFloat3:

public String toString() {
        return this.numElements > 3 ? String.format("VectorFloat3 <%d>", this.numElements) : this.toString("{%.3f,%.3f,%.3f}");
    }

Which fails as the elements are of type Float3 and not float. Instead, it should be:

public String toString() {
        if (this.numElements > 3)
            return String.format("VectorFloat3 <%d>", this.numElements);
        String tempString = "";
        for (int i = 0; i<numElements; i++){
            tempString += (" " + this.get(i).toString)
        }
        return tempString
    }

Join a Foundation like ASF

Many research projects do not have continuity.

Joining a foundation like ASF may help the long term stability and viability of TornadoVM

Develop use-cases for TornadoVM

The current version of the TornadoVM repository contains a module with several examples. The purpose of this issue is to expand the variety of use cases with more application domains.

The potential new examples and cases to include with TornadoVM are (not limited to): GUI NBODY, computational photogrtaphy, block-chain. crypto, etc.

Note, before working on this and being able to merge it into the TornadoVM repo, please contact the TornadoVM team.

[bug] Kfusion throws CL_OUT_OF_RESOURCES error on Nvidia GPUs

Describe the bug
There is an issue when I test kfusion on Nvidia GTX 1050 GPU of my laptop. The error is the following:

: Reading configuration file: /home/thanos/.kfusion_tornado/living_room_traj2_loop.raw
frame	acquisition	preprocessing	tracking	integration	raycasting	rendering	computation	total    	X          	Y          	Z         	tracked   	integrated
0	0.008163	0.137233	0.321119	0.059884	0.000001	0.068803	0.518238	0.595204	0.000000	0.000000	0.000000	0	1
1	0.000622	0.003541	0.041345	0.010988	0.000000	0.000000	0.055875	0.056497	0.000000	0.000000	0.000000	0	1
2	0.000421	0.004126	0.041017	0.010680	0.000000	0.000000	0.055822	0.056243	0.000000	0.000000	0.000000	0	1
[TornadoVM-OCL-JNI] ERROR : clEnqueueNDRangeKernel -> Returned: -5
[JNI] uk.ac.manchester.tornado.drivers.opencl> notify error:
[JNI] uk.ac.manchester.tornado.drivers.opencl> CL_OUT_OF_RESOURCES error executing CL_COMMAND_NDRANGE_KERNEL on GeForce GTX 1050 Ti with Max-Q Design (Device 0).

I tried to corner the source of the problem, as v0.9 is working and found that the commit point in which this error is thrown.

commit id: 7c5b36fdf and in particular the FixReadsPhase as shown here.

The interesting part is that although this phase breaks kfusion on the Nvidia GPU of my laptop, it is working for the Intel CPU and the Intel Integrated GPU.

As a next step, after discussion with @gigiblender and @jjfumero, I dumped the kernels of the following configurations:

Using the FixReadsPhase:
slambench_kernel_not_working_nvidia.txt
Not using the FixReadsPhase:
slambench_kernel_working_nvidia.txt

How To Reproduce

Build TornadoVM:

$ git checkout 7c5b36fdf
$ source etc/sources.env
$ make
$ tornadoLocalInstallMaven

Configure and run kfusion:

$ source sources.env
$ mvn clean install -DskipTests
$ kfusion kfusion.tornado.Benchmark conf/bm-traj2.settings

Expected behavior
The expected output should not throw the CL_OUT_OF_RESOURCES error. The following output is obtained by running v0.9:

: Reading configuration file: /home/thanos/.kfusion_tornado/living_room_traj2_loop.raw
frame	acquisition	preprocessing	tracking	integration	raycasting	rendering	computation	total    	X          	Y          	Z         	tracked   	integrated
0	0.007167	0.130851	0.328048	0.061200	0.000001	0.068857	0.520100	0.596123	0.000000	0.000000	0.000000	0	1
1	0.000535	0.003574	0.041041	0.010718	0.000000	0.000000	0.055334	0.055869	0.000000	0.000000	0.000000	0	1
2	0.000440	0.004301	0.041231	0.010020	0.000000	0.000000	0.055552	0.055993	0.000000	0.000000	0.000000	0	1

Computing system setup (please complete the following information):

OS: [Ubuntu 18.04]
OpenCL Version: [1.2]
OpenCL Nvidia Driver: [450.119.03]
TornadoVM commit id: [7c5b36fdf]

Can't compile internal While loop or For loop with break

Hi,

I used Aparapi before for GPU calculations Java on GPU: Pricing options with Monte Carlo simulation.
I was excited to hear about Torando VM and wanted to repeat my calculations on FPGA.

I noticed an issue when moving from Aparapi to Tornado VM.
I have a Monte Carlo simulation (n elements with m calculations/simulations per each).
Seems that Tornado can't compile simulations with breaks (while loop or for loop with break).

Maybe Tornado thinks that I modify incoming arrays, but this isn't the case.

Do you have any suggestion about how to fix or debug this issue, please?

public static void mapCalcMc(float[] arrS, float[] arrB, int iterPerElement, int logicIterNum, float[] results) {
        int arrLength = arrS.length;
        int currentIter = 0;

        for (@Parallel int i = 0; i < arrLength; i++) {
            for (@Parallel int j = 0; j < iterPerElement; j++) {
                int k = 0;
                float curA = arrS[i];
                float b = arrB[i];
                while (k < logicIterNum
                        && curA > b//Condition that can't be compiled
                ) {
                    k++;
                }

                results[currentIter] = 0.0f;
                currentIter++;
            }
        }
    }

UPD:

I run code on AMD 4Gb GPU locally. Tornado works for me both on CPU and GPU with other tasks.
Code above is "Minimum Breaking Example". I tried to remove all non-essential logic.
Here is the code to trigger task:

TaskSchedule task = new TaskSchedule("s0")
    .task("map", MonteCarloTornadoGPU::mapCalcMc, new float[10], new float[10],
      10,
      100, // It looks crazy but if I replace 100 wit 10 code works. 
           // This param is used for number of iterations in no-op loop 
       new float[100])
     .streamOut(gpuInterimMapResults);
task.execute();

I found a strange behavior, when I increase number of iterations in no-op while loop, app crashes (see comment to task code).

Attaching kernel out.
kernel.txt

beehive-lab / tornadovm Goto Github PK

tornadovm's Introduction

TornadoVM

1. Installation

2. Usage Instructions

3. Programming Model

a) Loop Parallel API

b) Kernel API

4. Dynamic Reconfiguration

5. How to Use TornadoVM in your Projects?

6. Additional Resources

7. Academic Publications

8. Acknowledgments

9. Contributions and Collaborations

Write your questions and proposals:

Collaborations:

10. TornadoVM Team

11. Licenses

tornadovm's People

Contributors

Stargazers

Watchers

Forkers

tornadovm's Issues

Thoughts about implementation

Summary

Recommend Projects

Recommend Topics

Recommend Org