gpuocelot's Introduction

  • 👋 Hi, I’m @jigardhaval8
  • 🌱 I’m currently learning python, raspberry pi, sensor interfacing and ...
  • 💞️ I’m looking to collaborate on any electronics related problems
  • 📫 How to reach me - jigardhaval8.wordpress.com

gpuocelot's Issues

Review LLVM IR

Purpose of code changes on this branch:
Make sure that the proposed IR is suitable as a translation target for the
PTX IR.

When reviewing my code changes, please focus on:

1) Make sure that the IR is able to express all LLVM instructions.
2) Suggest ways to modify the IR to make it easier to use/understand.
3) Check to make sure that the code generation functions and error checking
functions are complete and correct.




Original issue reported on code.google.com by [email protected] on 29 Jul 2009 at 9:59

Add an AMD IL backend

Describe the New Feature:

At a high level this would be an IR, translator, and executive kernel for
AMD's IL language.

Which milestone does the feature belong to?
2.0.0


Which branch does the new feature go in?
Trunk

Original issue reported on code.google.com by [email protected] on 8 Oct 2009 at 5:47

Add an OpenCL front-end

Describe the New Feature:
1. Add an OpenCL front-end to ocelot.
2. Completely re-implement the OpenCL 1.0 specification using the open
source Nvidia Open64 compiler to generate PTX code.
3. Use as much of the existing Executive class as possible to implement
OpenCL functionality.

Which milestone does the feature belong to?
1.0.0


Which branch does the new feature go in?
Branch

Original issue reported on code.google.com by [email protected] on 5 Aug 2009 at 4:04

error ---- 'ir::PTXOperand::AddressMode' is not a class or namespace

What steps will reproduce the problem?
1. ./build.py --install

What is the expected output? What do you see instead?


What version of the product are you using? On what operating system?
2.0

Please provide any additional information below.
in ocelot/analysis/implementation/DivergenceAnalysis.cpp

ir::PTXOperand::AddressMode::Special
should be
ir::PTXOperand::Special

A similar error occurs in ocelot/analysis/implementation/SyncEliminationPass.cpp.
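
The underlying issue is plain C++: AddressMode is presumably an unscoped enum
nested inside the PTXOperand class, and pre-C++11 compilers do not allow an
enumerator to be qualified with the enum's own name, only with the enclosing
scope. A minimal illustration of the pattern (the enumerators shown here are
hypothetical, not the actual PTXOperand definition):

{{{
// Minimal illustration of the reported error; the enum body is hypothetical.
namespace ir {
    class PTXOperand {
    public:
        enum AddressMode { Register, Immediate, Special };
    };
}

int main() {
    // Pre-C++11 compilers reject qualification through the enum's own name:
    //   ir::PTXOperand::AddressMode::Special;
    //   error: 'ir::PTXOperand::AddressMode' is not a class or namespace
    ir::PTXOperand::AddressMode mode = ir::PTXOperand::Special; // accepted form
    return static_cast<int>(mode);
}
}}}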

Original issue reported on code.google.com by [email protected] on 31 Mar 2011 at 4:28

Virtualize the CUDA runtime API

Describe the New Feature:
 We want support for easily swapping between multiple implementations of
the CUDA runtime.  OcelotRuntimeAPI should be replaced by a pure virtual
class with one pure virtual method per CUDA API call.  
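
A minimal sketch of what such a pure virtual interface could look like, with a
handful of representative calls; the class name and return types are
illustrative, and the real interface would have one method per CUDA runtime
entry point:

{{{
#include <cstddef>

// Illustrative pure virtual interface: one pure virtual method per API call.
class CudaRuntimeInterface {
public:
    virtual ~CudaRuntimeInterface() {}
    virtual int cudaMalloc(void** devPtr, std::size_t size) = 0;
    virtual int cudaFree(void* devPtr) = 0;
    virtual int cudaMemcpy(void* dst, const void* src, std::size_t count, int kind) = 0;
    virtual int cudaLaunch(const char* entry) = 0;
    // ... one method per remaining runtime call
};

// Each backend (emulator, GPU pass-through, trace generator, ...) derives from
// this interface, and the dispatch layer holds a pointer to the active one.
}}}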

Which milestone does the feature belong to?
0.8.0


Which branch does the new feature go in?
Trunk

Original issue reported on code.google.com by [email protected] on 5 Sep 2009 at 7:11

Triage LLVM Failures

What steps will reproduce the problem?
1. Compile ocelot with llvm support enabled.
2. Select the LLVM JIT device.
3. Run the SDK samples.

What is the expected output? What do you see instead?

Very few of the SDK samples will execute and produce correct results. We
need to identify exactly which programs fail and which complete.  Then we
need to begin stepping through individual examples and modifying the
LLVMEmulatedKernel, translator, and PTX optimization passes to fix any
problems.

Please use labels and text to provide additional information.

Original issue reported on code.google.com by [email protected] on 22 Sep 2009 at 11:01

Define LLVM IR

Describe the New Feature:
1. Create an LLVMInstruction class that inherits from Instruction and is
able to represent any valid LLVM instruction.
2. Create a toString function that converts an instruction to a parsable
assembly language representation.

Which milestone does the feature belong to?
Milestone-Release0.6


Which branch does the new feature go in?
Trunk

Original issue reported on code.google.com by [email protected] on 22 Jun 2009 at 8:24

Add a driver level API implementation

Describe the New Feature:
Having a driver level API implementation would allow applications that use
it exclusively to use Ocelot...

Which milestone does the feature belong to?
1.0.0


Which branch does the new feature go in?
Trunk

Original issue reported on code.google.com by [email protected] on 14 Jul 2009 at 12:28

Implement Dead Code Elimination

Describe the New Feature:
1. After the data flow graph has been used to compute live sets, remove
instructions that produce registers with no consumers.
2. Repeat this until no more instructions can be removed (a sketch of the
fixed-point pass follows below).
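
A minimal sketch of the intended fixed-point pass, assuming a simplified
instruction record with one destination register and a list of source
registers (these structures are illustrative, not Ocelot's DataflowGraph API):

{{{
#include <set>
#include <vector>

// Illustrative stand-in for an instruction in the dataflow graph.
struct Instr {
    int dest;                  // register defined (-1 if none)
    std::vector<int> sources;  // registers read
    bool hasSideEffects;       // stores, barriers, etc. must always be kept
};

// Remove instructions whose results are never consumed, repeating until a
// fixed point is reached.
void deadCodeEliminate(std::vector<Instr>& code) {
    bool changed = true;
    while (changed) {
        changed = false;
        std::set<int> consumed;
        for (const Instr& i : code)
            for (int r : i.sources) consumed.insert(r);
        std::vector<Instr> kept;
        for (const Instr& i : code) {
            if (i.hasSideEffects || i.dest < 0 || consumed.count(i.dest))
                kept.push_back(i);
            else
                changed = true;  // dropped a dead definition; iterate again
        }
        code.swap(kept);
    }
}
}}}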

Which milestone does the feature belong to?
0.5.0


Which branch does the new feature go in?
Trunk

Original issue reported on code.google.com by [email protected] on 30 Jun 2009 at 7:30

Problems Linking to Ocelot While Using CUBLAS and CUFFT

What steps will reproduce the problem?
1. Create a program that uses CUBLAS
2. Replace -lcudart with -locelot
3. Run program

What is the expected output? What do you see instead?
I expect the emulator to be invoked.  Instead cudart gets linked in.

What version of the product are you using? On what operating system?
1.1.56 (actually a Subversion copy from around Oct. 18, 2010)

Please provide any additional information below.
I deliberately introduced a memory read error,  but it wasn't caught by ocelot. 
 Looks like my linker still found cudart instead of ocelot.

I found these instructions from a third party site:
"
To use Ocelot with any pre-compiled CUDA libraries (such as CUFFT or CUBLAS), 
the libraries must be compiled as shared objects and must be linked in the 
correct order. The order is PRECOMPILED_LIBRARIES OCELOT_LIBRARIES YOUR_PROGRAM 
. This is REQUIRED to ensure that global constructors are called in the correct 
order. It is an artifact of how CUDA is designed and impossible for us to 
change. "


I tried that build order in the g++ command, but it seemed to make no
difference. An example of linking in CUFFT or CUBLAS while using the Ocelot
emulator would be very helpful. I'm trying to use the memory checker.

Original issue reported on code.google.com by [email protected] on 20 Oct 2010 at 10:53

Multithread the emulator!

Describe the New Feature:
1) Devise a work queue approach where the executive class spawns one thread
per CPU core and assigns CTAs to threads as they become free (a work-queue
sketch follows below).
2) For atomic ops, rather than locking, asynchronously push data into a
local queue; when it overflows, lock and do a bulk update. Also do a bulk
update when the CTA completes (to eliminate stragglers) and when a fence
instruction is called.
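
A minimal sketch of the CTA work queue under these assumptions, using plain
std::thread workers that pull CTA indices from a shared atomic counter (none
of the actual executive class names are used here):

{{{
#include <algorithm>
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

// Distribute CTA indices [0, ctaCount) over one worker thread per CPU core.
void runCtas(int ctaCount, const std::function<void(int)>& executeCta) {
    std::atomic<int> next(0);
    unsigned workers = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&]() {
            // Each worker grabs the next unassigned CTA as soon as it is free.
            for (int cta = next.fetch_add(1); cta < ctaCount;
                 cta = next.fetch_add(1))
                executeCta(cta);
        });
    }
    for (std::thread& t : pool) t.join();
}
}}}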

Which milestone does the feature belong to?
1.0.0

Which branch does the new feature go in?
Branch

Original issue reported on code.google.com by [email protected] on 30 Jul 2009 at 3:27

Install Compile Failure on ArchLinux 64 bit Boost 1.43

What steps will reproduce the problem?
1. install boost
2. install glew
3. ocelot make fails

What is the expected output? What do you see instead?
compiler error

What version of the product are you using? On what operating system?
Archlinux circa Oct. 2010
boost library 1.43.0-1

Please provide any additional information below.

During make the following error from a boost include file is observed:
libtool: compile:  g++ -DHAVE_CONFIG_H -I. -I ./ocelot/cuda/include -Wall -ansi 
-Werror -std=c++0x -g -O2 -MT libocelot_la-KernelEntry.lo -MD -MP -MF 
.deps/libocelot_la-KernelEntry.Tpo -c 
ocelot/trace/implementation/KernelEntry.cpp  -fPIC -DPIC -o 
.libs/libocelot_la-KernelEntry.o
In file included from /usr/include/boost/interprocess/sync/file_lock.hpp:24:0,
                 from ocelot/trace/implementation/KernelEntry.cpp:22:
/usr/include/boost/interprocess/detail/move.hpp: In function ‘typename 
boost::remove_reference<T>::type&& boost::interprocess::move(T&&) [with T = 
boost::interprocess::file_lock&, typename boost::remove_reference<T>::type = 
boost::interprocess::file_lock]’:
/usr/include/boost/interprocess/sync/file_lock.hpp:68:52:   instantiated from 
here
/usr/include/boost/interprocess/detail/move.hpp:342:11: error: invalid 
initialization of reference of type 
‘boost::remove_reference<boost::interprocess::file_lock&>::type&&’ from 
expression of type ‘boost::interprocess::file_lock’
make[1]: *** [libocelot_la-KernelEntry.lo] Error 1
make[1]: Leaving directory `/opt/ocelot-1.1.560'
make: *** [all] Error 2


Original issue reported on code.google.com by [email protected] on 14 Oct 2010 at 5:51

LLVM Runtime

Describe the New Feature:
Add an LLVM device for which kernels are launched using the LLVM JIT.


Which milestone does the feature belong to?
0.8.0


Which branch does the new feature go in?
Trunk

Original issue reported on code.google.com by [email protected] on 7 Sep 2009 at 8:03

Add SSA Analysis Module

Describe the New Feature:

1. Add an SSA control flow graph to the code analysis modules.
 a. The SSA graph should be composed of blocks of instructions.
 b. Registers should be represented by integers.
 c. Instructions should be represented by either pointers to instruction
objects or indices into a vector of all instructions.
 d. There should be three types of instructions: generic, branch, and phi
(a data-structure sketch follows this list).
2. Write a unit test that builds the SSA CFG for all of the PTX test files.
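
A minimal sketch of the data structures this describes, under the stated
constraints (registers as integers, instructions referenced by index, three
instruction kinds); these are illustrative types, not the eventual Ocelot
classes:

{{{
#include <cstddef>
#include <vector>

// Registers are plain integer identifiers.
typedef int Register;

struct SsaInstruction {
    enum Kind { Generic, Branch, Phi } kind;
    std::vector<Register> destinations;  // registers defined
    std::vector<Register> sources;       // registers read (phi: one per predecessor)
    std::size_t target;                  // Branch only: index of the target block
};

struct SsaBlock {
    std::vector<std::size_t> instructions;  // indices into SsaGraph::instructions
    std::vector<std::size_t> predecessors;  // block indices
    std::vector<std::size_t> successors;    // block indices
};

struct SsaGraph {
    std::vector<SsaInstruction> instructions;  // all instructions in the kernel
    std::vector<SsaBlock> blocks;              // basic blocks of the CFG
};
}}}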

Which milestone does the feature belong to?
0.5.0


Which branch does the new feature go in?
Trunk


Original issue reported on code.google.com by [email protected] on 22 Jun 2009 at 7:36

Can't check-out svn repos using the command line provided in wiki

What steps will reproduce the problem?
1. svn checkout http://gpuocelot.googlecode.com/svn/trunk/ gpuocelot 

It failed with the error information below:
A    gpuocelot/tests/ptx/sdk/MonteCarlo_SM10.ptx
A    gpuocelot/tests/ptx/sdk/bandwidthTest.ptx
svn: In directory 'gpuocelot/tests/ptx/sdk'
svn: Can't open file
'gpuocelot/tests/ptx/sdk/.svn/tmp/text-base/particleSystem.ptx.svn-base':
No such file or directory

I tried svn 1.6.5 on Mac OS X 10.6.2 and svn 1.6.6 on Windows XP. 

It seems like something is wrong with this file on the server. Any ideas? Thanks.

Original issue reported on code.google.com by [email protected] on 22 Dec 2009 at 9:24

Lazy Evaluation of PTX Kernels in the CUDA Runtime

Describe the New Feature:
The current implementation loads and parses all PTX kernels declared within
a program upon kernel registration.  Make this lazily evaluated instead.

Ideally, registering a kernel should add an entry with a flag saying that
it has not yet been parsed.  Upon the first execution, it should be
translated and then executed and the flag should be updated.
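
A minimal sketch of the lazy registration idea, assuming a simple registry
keyed by kernel name (the types and the parsePtx stub are hypothetical
placeholders, not the CUDA runtime classes in Ocelot):

{{{
#include <map>
#include <string>

// Hypothetical stand-in for a parsed/translated kernel.
struct ParsedKernel { };
inline ParsedKernel parsePtx(const std::string& /*ptx*/) { return ParsedKernel(); } // stub

struct KernelEntry {
    std::string ptxSource;  // stored verbatim at registration time
    bool parsed;            // flag: has the PTX been parsed/translated yet?
    ParsedKernel kernel;    // valid only once 'parsed' is true
};

class KernelRegistry {
public:
    // Registration only records the source; no parsing happens here.
    void registerKernel(const std::string& name, const std::string& ptx) {
        KernelEntry entry;
        entry.ptxSource = ptx;
        entry.parsed = false;
        kernels_[name] = entry;
    }

    // The first launch parses and caches; later launches reuse the result.
    ParsedKernel& getForLaunch(const std::string& name) {
        KernelEntry& entry = kernels_.at(name);
        if (!entry.parsed) {
            entry.kernel = parsePtx(entry.ptxSource);
            entry.parsed = true;
        }
        return entry.kernel;
    }

private:
    std::map<std::string, KernelEntry> kernels_;
};
}}}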



Which milestone does the feature belong to?
1.0.0


Which branch does the new feature go in?
Trunk

Original issue reported on code.google.com by [email protected] on 13 Aug 2009 at 8:15

Add Shared Memory Race Detection

Describe the New Feature:

From SPWorley:

Just had a little brainstorm... not a feature request, but something
that may inspire some more memory debugging/error checking for
Ocelot's emulator.

It could be straightforward for Ocelot to detect all shared memory
thread race conditions.  These happen when one thread writes to shared
memory and a different thread reads, but with unspecified thread
ordering, that read may have occurred either before or after the write
so your results become uncertain. These bugs are usually hard to track
down since they often work and break only later when some other
innocuous change is made.

Ocelot could detect these shared memory ordering races pretty easily
with a little overhead in memory and time.

Allocate two 16K int (or short) buffers: these are basically two
tracking variables per shared memory location, recording which thread
has read or written a particular byte of shared memory. At the
start of a block and at every threadsync(), initialize the two buffers to
0xFFFF, meaning "no access yet."

If a thread reads from a shared memory location, mark the "read"
buffer with the thread ID. If a thread writes to a shared memory
location, mark the "write" buffer with that thread ID.

Typical race errors will be detected by checking that the read and
write arrays never hold two different thread IDs for the same location.
It's OK if one thread reads and writes. It's OK if lots of threads read
and nobody writes. It's even OK if lots of threads write and nobody
reads. But once you detect that you have a reader and a writer with
different thread IDs, you fire off the "warning, potential race
condition detected."

If multiple threads read, then you could mark the array with a
"multiple readers" flag instead of the thread ID, and then *ANY* write
before or after would be a race. Same for multiple writers.

This idea may not be too practical or important but I thought I'd
share it while it was still fresh in my mind.
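
A minimal emulator-side sketch of the tracking scheme described above, with
per-byte reader/writer thread IDs, a sentinel for "no access yet" and another
for "multiple threads"; the constants and names are illustrative, not Ocelot's:

{{{
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

const uint16_t kNoAccess = 0xFFFF;  // no thread has touched this byte yet
const uint16_t kMultiple = 0xFFFE;  // more than one distinct thread has

class SharedMemoryRaceDetector {
public:
    explicit SharedMemoryRaceDetector(std::size_t sharedBytes)
        : readers_(sharedBytes, kNoAccess), writers_(sharedBytes, kNoAccess) {}

    // Called at CTA start and at every barrier (bar.sync).
    void reset() {
        std::fill(readers_.begin(), readers_.end(), kNoAccess);
        std::fill(writers_.begin(), writers_.end(), kNoAccess);
    }

    void onRead(std::size_t byte, uint16_t tid)  { mark(readers_[byte], tid); check(byte); }
    void onWrite(std::size_t byte, uint16_t tid) { mark(writers_[byte], tid); check(byte); }

private:
    static void mark(uint16_t& slot, uint16_t tid) {
        if (slot == kNoAccess) slot = tid;
        else if (slot != tid)  slot = kMultiple;
    }
    void check(std::size_t byte) const {
        uint16_t r = readers_[byte], w = writers_[byte];
        if (r == kNoAccess || w == kNoAccess) return;  // only one side touched it
        if (r == w && r != kMultiple) return;          // one thread both read and wrote
        std::printf("warning: potential race condition detected at shared byte %lu\n",
            (unsigned long)byte);
    }
    std::vector<uint16_t> readers_, writers_;
};
}}}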


Which milestone does the feature belong to?
2.0.x


Which branch does the new feature go in?
Trunk

Original issue reported on code.google.com by [email protected] on 29 Oct 2009 at 3:03

Add External Trace Generator API

Describe the New Feature:
There should be a high level API call available to CUDA programs that
specifies a trace generator to be attached to the next launched kernel.

For example:

{{{
BranchTraceGenerator generator;
ocelotAddTraceGenerator( generator );
somekernel<<< ctas, threads, memory >>>(parameter);
}}}

Which milestone does the feature belong to?
0.5.0


Which branch does the new feature go in?
Trunk

Original issue reported on code.google.com by [email protected] on 15 Jul 2009 at 4:01

Function Calls are not handled correctly in the emulator or parser

What steps will reproduce the problem?
1. Compile any CUDA device function with the directive __noinline__

What is the expected output? What do you see instead?

Before we were under the impression that the compiler never generated any
function calls.  It turns out that it does.  So we should provide support
for this in the parser and the emulator.

We need to add unit tests for the lexer/parser that contain device function
calls. We also need to add functionality to the emulator's code generator to
include the code of all referenced functions in the kernel binary.
Finally, we need to add support for recursive function calls since they are
likely to come out in future releases...

Original issue reported on code.google.com by [email protected] on 7 Jul 2009 at 2:08

Add a code generator for PTX and a GPU target

Describe the New Feature:

We can already create a kernel of PTX instructions.  However, we do not
currently have a way of launching it on a GPU.  We need the following new
features:

1) A new executable kernel class that launches a PTX kernel on a GPU device
using the CUDA driver level api.
2) Additions to the Executive class to support detection of NVIDIA GPUs,
translation to executable GPU kernels, and memory allocation and copies
into the GPU address space.
3) Unit tests to make sure that this functionality works.

Which milestone does the feature belong to?
0.9.0


Which branch does the new feature go in?
Trunk

Original issue reported on code.google.com by [email protected] on 16 Sep 2009 at 9:48

Compilation fails on x86_64

Hi,
compilation of r150 fails with:
cc1plus: warnings being treated as errors
ocelot/ir/implementation/LLVMInstruction.cpp: In member function 
'std::string ir::LLVMInstruction::Operand::toString() const':
ocelot/ir/implementation/LLVMInstruction.cpp:147: error: dereferencing 
type-punned pointer will break strict-aliasing rules
ocelot/ir/implementation/LLVMInstruction.cpp:206: error: dereferencing 
type-punned pointer will break strict-aliasing rules

It seems that the 'classical' solution to this problem is to introduce a union
of uint32/64 and a float/double.
It also seems that there exists something called std::hexfloat (like std::hex),
but unfortunately the compiler can't find it.
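
A minimal sketch of the union approach mentioned above, which avoids the
strict-aliasing warning when printing the bit pattern of a double (illustrative
code, not the actual LLVMInstruction::Operand::toString implementation):

{{{
#include <cstdint>
#include <iomanip>
#include <iostream>
#include <sstream>
#include <string>

// Print the raw bits of a double without casting through an incompatible
// pointer type, which is what triggers the strict-aliasing error.
std::string hexBitsOf(double value) {
    union {
        double   f;
        uint64_t u;
    } pun;
    pun.f = value;
    std::stringstream stream;
    stream << "0x" << std::hex << std::setw(16) << std::setfill('0') << pun.u;
    return stream.str();
}

int main() {
    std::cout << hexBitsOf(1.0) << std::endl;  // prints 0x3ff0000000000000
    return 0;
}
}}}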

Original issue reported on code.google.com by [email protected] on 1 Oct 2009 at 6:51

1.1.560 doesn't build

I downloaded the 1.1.560 release, but:

/bin/sh ./libtool  --tag=CXX   --mode=compile g++ -DHAVE_CONFIG_H -I.    
-I/usr/include  -DNDEBUG -D_GNU_SOURCE -D__STDC_LIMIT_MACROS 
-D__STDC_CONSTANT_MACROS -I ./ocelot/cuda/include -Wall -ansi -Werror 
-std=c++0x -g -O2 -MT libocelot_la-ControlFlowGraph.lo -MD -MP -MF 
.deps/libocelot_la-ControlFlowGraph.Tpo -c -o libocelot_la-ControlFlowGraph.lo 
`test -f 'ocelot/ir/implementation/ControlFlowGraph.cpp' || echo 
'./'`ocelot/ir/implementation/ControlFlowGraph.cpp
libtool: compile:  g++ -DHAVE_CONFIG_H -I. -I/usr/include -DNDEBUG 
-D_GNU_SOURCE -D__STDC_LIMIT_MACROS -D__STDC_CONSTANT_MACROS -I 
./ocelot/cuda/include -Wall -ansi -Werror -std=c++0x -g -O2 -MT 
libocelot_la-ControlFlowGraph.lo -MD -MP -MF 
.deps/libocelot_la-ControlFlowGraph.Tpo -c 
ocelot/ir/implementation/ControlFlowGraph.cpp  -fPIC -DPIC -o 
.libs/libocelot_la-ControlFlowGraph.o
cc1plus: warnings being treated as errors
ocelot/ir/implementation/ControlFlowGraph.cpp: In member function 
'std::list<ir::ControlFlowGraph::BasicBlock::Edge>::const_iterator 
ir::ControlFlowGraph::BasicBlock::get_edge(std::list<ir::ControlFlowGraph::Basic
Block>::const_iterator) const':
ocelot/ir/implementation/ControlFlowGraph.cpp:94:1: error: control reaches end 
of non-void function
ocelot/ir/implementation/ControlFlowGraph.cpp: In member function 
'std::list<ir::ControlFlowGraph::BasicBlock::Edge>::iterator 
ir::ControlFlowGraph::BasicBlock::get_edge(std::list<ir::ControlFlowGraph::Basic
Block>::iterator)':
ocelot/ir/implementation/ControlFlowGraph.cpp:85:1: error: control reaches end 
of non-void function
ocelot/ir/implementation/ControlFlowGraph.cpp: In member function 
'std::list<ir::ControlFlowGraph::BasicBlock::Edge>::const_iterator 
ir::ControlFlowGraph::BasicBlock::get_branch_edge() const':
ocelot/ir/implementation/ControlFlowGraph.cpp:76:1: error: control reaches end 
of non-void function
ocelot/ir/implementation/ControlFlowGraph.cpp: In member function 
'std::list<ir::ControlFlowGraph::BasicBlock::Edge>::iterator 
ir::ControlFlowGraph::BasicBlock::get_branch_edge()':
ocelot/ir/implementation/ControlFlowGraph.cpp:67:1: error: control reaches end 
of non-void function
ocelot/ir/implementation/ControlFlowGraph.cpp: In member function 
'std::list<ir::ControlFlowGraph::BasicBlock::Edge>::const_iterator 
ir::ControlFlowGraph::BasicBlock::get_fallthrough_edge() const':
ocelot/ir/implementation/ControlFlowGraph.cpp:58:1: error: control reaches end 
of non-void function
ocelot/ir/implementation/ControlFlowGraph.cpp: In member function 
'std::list<ir::ControlFlowGraph::BasicBlock::Edge>::iterator 
ir::ControlFlowGraph::BasicBlock::get_fallthrough_edge()':
ocelot/ir/implementation/ControlFlowGraph.cpp:49:1: error: control reaches end 
of non-void function
make[1]: *** [libocelot_la-ControlFlowGraph.lo] Error 1
make[1]: Leaving directory `/home/realnc/tmp/ocelot-1.1.560'
make: *** [all] Error 2

Original issue reported on code.google.com by [email protected] on 15 Jan 2011 at 7:11

memoryChecker reports bad global memory access

What steps will reproduce the problem?
1. allocate device memory
2. attempt to access the allocated memory within a kernel


What is the expected output? What do you see instead?

I expect the program to succeed with no output. Instead, the memory checker 
claims one of my kernels is accessing memory that is not allocated or mapped. 
However, it clearly is allocated, as its own list of device allocations shows 
it is within the fifth allocation:

terminate called after throwing an instance of 'hydrazine::Exception'
  what():  [PC 26] [thread 0] [cta 0] ld.global.u8 %r24, [%r23 + 0] - Global memory access 0xb79d0f is not within any allocated or mapped range.

Nearby Device Allocations
[0xa66fc0] - [0xa67020] (96 bytes)
[0xa670a0] - [0xa674d8] (1080 bytes)
[0xa67620] - [0xa67680] (96 bytes)
[0xa67700] - [0xa67b38] (1080 bytes)
[0xb79d00] - [0xb7bd00] (8192 bytes)
[0xb7be20] - [0xb7de20] (8192 bytes)
[0xb7df40] - [0xb7e740] (2048 bytes)
[0xb7e860] - [0xb7f060] (2048 bytes)
[0xb7f180] - [0xb7f980] (2048 bytes)
[0xb7faa0] - [0xb802a0] (2048 bytes)
[0xb803c0] - [0xb80bc0] (2048 bytes)
[0xb80ce0] - [0xb814e0] (2048 bytes)
[0xb83520] - [0xb85520] (8192 bytes)
[0xb87660] - [0xb89660] (8192 bytes)
[0xb89780] - [0xb8e780] (20480 bytes)
[0xb8e8a0] - [0xb938a0] (20480 bytes)
[0xb939c0] - [0xb989c0] (20480 bytes)
[0xb98ae0] - [0xb9dae0] (20480 bytes)
[0xb9dc00] - [0xba2c00] (20480 bytes)
[0xba2d20] - [0xba7d20] (20480 bytes)
[0xba7e40] - [0xbace40] (20480 bytes)
[0xbacf60] - [0xbb1f60] (20480 bytes)
[0xbb2080] - [0xbb2880] (2048 bytes)
[0xbb29a0] - [0xbb31a0] (2048 bytes)
[0xbb32c0] - [0xbb3ac0] (2048 bytes)
[0xbb3be0] - [0xbb43e0] (2048 bytes)


What version of the product are you using? On what operating system?

-Ocelot version: SVN r634
-compiled with gcc 4.5 with the macro to disable c++0x feature in 
KernelEntry.cpp as described in 
http://groups.google.com/group/gpuocelot/msg/55339e218bc5bdaa?pli=1
-modifications to type checker described in 
http://groups.google.com/group/gpuocelot/browse_thread/thread/186dcb0bee10ed8b

Cuda version: 2.3


Please provide any additional information below.

Original issue reported on code.google.com by [email protected] on 4 Aug 2010 at 4:21

error --- no matching function for call to 'std::basic_ofstream'

What steps will reproduce the problem?
1. build.py --install

What is the expected output? What do you see instead?


What version of the product are you using? On what operating system?
2.0

Please provide any additional information below.

In ocelot/executive/implementation/PassThroughDevice.cpp, line 604:
std::ofstream file(stream.str());
should be
std::ofstream file((stream.str()).c_str());

The same fix applies in ocelot/graphs/implementation/DivergenceDrawer.cpp:187.
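
The reason the reported fix works: before C++11, std::ofstream only provided a
const char* constructor, and the std::string overload was added in C++11, so on
older standard libraries the conversion through c_str() is required. A minimal
illustration (the file name here is hypothetical, not Ocelot code):

{{{
#include <fstream>
#include <sstream>

int main() {
    std::stringstream stream;
    stream << "kernel_" << 42 << ".txt";   // hypothetical file name

    // Portable on pre-C++11 standard libraries: pass a C string.
    std::ofstream file(stream.str().c_str());

    // std::ofstream file(stream.str());   // requires C++11 library support
    file << "example output" << std::endl;
    return 0;
}
}}}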

Original issue reported on code.google.com by [email protected] on 31 Mar 2011 at 5:35

Linear Texture Interpolation on 32-bit installs

What steps will reproduce the problem?
1. Install Ocelot on Ubuntu 8.10-32-bit
2. Run the Dxt8x8 regression test. 

What is the expected output? What do you see instead?
Note that the first example does not match the reference.

Please use labels and text to provide additional information.

Original issue reported on code.google.com by [email protected] on 12 Jul 2009 at 4:40

Add a CUDA debugger

Describe the New Feature:

See this thread:
http://groups.google.com/group/gpuocelot/browse_thread/thread/e4964a46d419623d

Which milestone does the feature belong to?
3.0


Which branch does the new feature go in?
Trunk

Original issue reported on code.google.com by [email protected] on 8 Jun 2010 at 6:38

Implement Register Allocation

Describe the New Feature:
1. The current register allocation scheme simply maps PTX register
variables to unique identifiers.
2. Implement a graph coloring register allocator.
3. Implement a linear scan register allocator.

Which milestone does the feature belong to?
0.5.0


Which branch does the new feature go in?
Trunk

Original issue reported on code.google.com by [email protected] on 29 Jun 2009 at 12:17

Add Windows and MacOS support

Now that you have LLVM multicore support working (I don't know if you are
aware of Nvidia's similar effort on CUDA for CPUs:
llvm.org/devmtg/2009-10/Grover_PLANG.pdf, and they also have a video; I hope
they have it ready for CUDA 3.0, which seems to be getting a beta by SC09,
and I hope they release the PTX spec v1.5), and assuming CUDA multicore
doesn't get released in that beta, I'm interested in porting your project to
Windows and MacOS, assuming it can be done with no major changes: perhaps
getting rid of the configure/automake stuff and fixing some system-specific
code. It would be good if you could point me to any big issues you expect me
to hit that I haven't already planned for.

So perhaps the plan is:
First test on MacOS, which would catch very Linux-specific issues.
Then try to use CMake as the build system.
Assuming all goes well on Linux and MacOS, attempt the Windows port.
This will initially expose two kinds of portability issues:
* GCC vs. Visual Studio issues: are you using C99 or other GCC-specific code
that Visual Studio doesn't support?
* OS-specific API usage.
I expect I will have to learn to build:
Boost
Pthreads
LLVM
Original issue reported on code.google.com by [email protected] on 31 Oct 2009 at 10:13

  • Merged into: #51

Add Dataflow Graph Analysis Module

Describe the New Feature:
 1. A dataflow graph is an augmented control flow graph where the live
registers going into and out of each block are annotated in the cfg.
 2. Create a separate DFG class where each block is augmented with a list
of live registers in and a list of live registers out.

Which milestone does the feature belong to?
0.5.0


Which branch does the new feature go in?
Trunk


Original issue reported on code.google.com by [email protected] on 22 Jun 2009 at 7:52

config.ocelot has mistakes

Hi,
when reading config.ocelot, these two lines are executed:
fi.descend("ocelot");
fi.descend("OcelotRuntime");

but the example config.ocelot in svn doesn't have OcelotRuntime tags;
adding them causes the Ocelot runtime to load its settings.

Also, there is an unneeded 0 in between tags.

Attaching a corrected file.

Original issue reported on code.google.com by [email protected] on 16 Aug 2009 at 7:11

ocelot segfaults

Hi, I got the newest Ocelot, and running memoryErrors from the wiki crashes:

[Thread debugging using libthread_db enabled]
[New Thread 0x7f3822d42760 (LWP 21763)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f3822d42760 (LWP 21763)]
0x00007f382269066b in executive::Executive::getSelectedISA (this=0xdec9d8) 
at ocelot/executive/implementation/Executive.cpp:358
358             }
(gdb) bt
#0  0x00007f382269066b in executive::Executive::getSelectedISA 
(this=0xdec9d8) at ocelot/executive/implementation/Executive.cpp:358
#1  0x00007f38226921c2 in executive::Executive::loadModule (this=0xdec9d8, 
path=<value optimized out>, translateToSelected=true, 
    stream=0x7fff2ad6d7d0) at ocelot/executive/implementation/
Executive.cpp:209
#2  0x00007f38217010f8 in cuda::CudaRuntime::registerFatBinary 
(this=0xdec4f8, binary=@0x6151a0)
    at ocelot/cuda/implementation/CudaRuntime.cpp:604
#3  0x00007f382170f5ab in cuda::CudaRuntimeBase::cudaRegisterFatBinary 
(this=<value optimized out>, fatCubin=0x6151a0)
    at ocelot/cuda/implementation/CudaRuntimeBase.cpp:1482
#4  0x000000000040a485 in 
__sti____cudaRegisterAll_47_tmpxft_0000112c_00000000_4_memoryErrors_cpp1_ii_41d2
9f55 
()
    at /tmp/tmpxft_0000112c_00000000-1_memoryErrors.cudafe1.stub.c:29
#5  0x000000000040f8c6 in ?? ()
#6  0x00007f3822d6e000 in ?? ()
#7  0x000000000040f7f0 in ?? ()
#8  0x0000000000000000 in ?? ()



Additionally, compiling LLVMExecutableKernel.cpp fails with:

 g++ -DHAVE_CONFIG_H -I. -Wall -ansi -pedantic -Werror -std=c++0x -g -O2 -
MT libOcelotExecutive_la-LLVMExecutableKernel.lo -MD -MP -MF .deps/
libOcelotExecutive_la-LLVMExecutableKernel.Tpo -c ocelot/executive/
implementation/LLVMExecutableKernel.cpp  -fPIC -DPIC -o .libs/
libOcelotExecutive_la-LLVMExecutableKernel.o
cc1plus: warnings being treated as errors
ocelot/executive/implementation/LLVMExecutableKernel.cpp: In destructor 
'virtual executive::LLVMExecutableKernel::~LLVMExecutableKernel()':
ocelot/executive/implementation/LLVMExecutableKernel.cpp:47: error: 
possible problem detected in invocation of delete operator:
ocelot/executive/implementation/LLVMExecutableKernel.cpp:47: error: 
invalid use of incomplete type 'struct llvm::Module'
./ocelot/executive/interface/LLVMExecutableKernel.h:16: error: forward 
declaration of 'struct llvm::Module'
ocelot/executive/implementation/LLVMExecutableKernel.cpp:47: note: neither 
the destructor nor the class-specific operator delete will be called, even 
if they are declared when the class is defined.
make[1]: *** [libOcelotExecutive_la-LLVMExecutableKernel.lo] Error 1


Commenting out this line allowed me to compile, although obviously this is
not a fix :).

Original issue reported on code.google.com by [email protected] on 8 Sep 2009 at 9:45

Atomic CAS has incorrect semantics when interleaved with non-atomic stores

What steps will reproduce the problem?
Consider the example:

a = 25

(thread 1) st a, 5    (thread 2) atomic cas a, 25, 0


possible outcomes:

case 0:
(thread 1)
(thread 2)
(thread 2)
a = 0

case 1:
(thread 2)
(thread 2)
(thread 1)
a = 5

case 2:
(thread 2)
(thread 1)
(thread 2)
a = 25

What is the expected output? What do you see instead?

Case 2 is a possible ordering in our implementation, but would not be
possible if the operation was actually performed atomically rather than
using locks. It is very unlikely that a program would ever rely on this
behavior, but it is one more reason to abandon the current locking
implementation and move to an entirely atomic implementation. 

Please use labels and text to provide additional information.

We should consider replacing the current implementation with the upcoming
cstdatomics library.
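
For reference, the library that cstdatomics eventually became (<atomic> in
C++11) expresses the operation as a single indivisible step; a minimal sketch
of a lock-free CAS matching PTX atom.cas semantics (a hypothetical wrapper,
not the emulator's code):

{{{
#include <atomic>
#include <cstdint>
#include <iostream>

// PTX atom.cas.b32 d, [a], b, c:  t = *a; *a = (t == b) ? c : t;  d = t;
uint32_t atomCas(std::atomic<uint32_t>& a, uint32_t compare, uint32_t value) {
    uint32_t expected = compare;
    // The compare and the swap happen as one indivisible step, so another
    // thread's store can never be observed between them.
    a.compare_exchange_strong(expected, value);
    return expected;  // observed old value on failure, 'compare' on success
}

int main() {
    std::atomic<uint32_t> a(25);
    uint32_t old = atomCas(a, 25, 0);
    std::cout << "old=" << old << " new=" << a.load() << std::endl;  // old=25 new=0
    return 0;
}
}}}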

Original issue reported on code.google.com by [email protected] on 10 Dec 2009 at 5:44

Incorrect results in unrolled loops

What steps will reproduce the problem?
1. Run the mri-fhd parboil benchmark with and without loop unrolling in the
inner loop.

What is the expected output? What do you see instead?
1. With unrolling, the kernel produces incorrect outputs starting with the
first ld.const compared to the same kernel without unrolling.


What version of the product are you using? On what operating system?
Ubuntu 9.04, r107

Please provide any additional information below.

This bug is also causing incorrect results in the mandelbrot 2.2 sdk
example without manual rolling of loop bodies.  We need a simple test case
to reproduce this before we can start diagnosing the problem in detail.  It
is not obvious from examining the dataflow traces from either example.

Original issue reported on code.google.com by [email protected] on 15 Aug 2009 at 6:57

Add Support for CUDA 2.3

Describe the New Feature:
 1. Download the new toolkit and SDK.
 2. Create a new parser if there is a new version of PTX.
 3. Dump the .ptx files from each sdk sample into the test directory.
 4. Create a test suite for the 2.3 sdk examples.  Make sure that it passes.

Which milestone does the feature belong to?
0.4.0

Which branch does the new feature go in?
Trunk

Original issue reported on code.google.com by [email protected] on 22 Jun 2009 at 8:21

Mistake in lexer

From the newest release available here, ptx.lpp has this definition for
octal values:
OCT_CONSTANT (0[0123456]*)
The character class omits 7, but 07 is a proper octal value, so the pattern
should presumably be (0[0-7]*).

Original issue reported on code.google.com by [email protected] on 10 Aug 2009 at 9:49

CUDA API Trace Generator

Describe the New Feature:
Add an implementation of the CUDA runtime that records a trace of every
call made.

Which milestone does the feature belong to?
0.9.0


Which branch does the new feature go in?
Trunk

Original issue reported on code.google.com by [email protected] on 5 Sep 2009 at 7:13

Detect Global Memory Access Violations

Describe the New Feature:
Instrument all LD/ST/TEX instructions to verify that there is a valid
device memory region allocated before doing the access.
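
A minimal sketch of the check each instrumented LD/ST/TEX would perform,
assuming the device allocations are kept in a map ordered by base address (an
illustrative structure, not the Executive's actual bookkeeping):

{{{
#include <cstddef>
#include <cstdint>
#include <map>

// Allocated device regions: key = base address, value = size in bytes.
typedef std::map<uint64_t, std::size_t> AllocationMap;

// Returns true if [address, address + bytes) lies inside one allocated region.
bool accessIsValid(const AllocationMap& allocations,
    uint64_t address, std::size_t bytes) {
    // Find the first region starting strictly after 'address', then step back
    // to the region that could contain it.
    AllocationMap::const_iterator it = allocations.upper_bound(address);
    if (it == allocations.begin()) return false;
    --it;
    uint64_t base = it->first;
    uint64_t end  = base + it->second;
    return address >= base && address + bytes <= end;
}
}}}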

Which milestone does the feature belong to?
0.5.0


Which branch does the new feature go in?
Trunk

Original issue reported on code.google.com by [email protected] on 9 Jul 2009 at 8:06

LLVM Translator

Describe the New Feature:
 1. Implement a high level translator interface for moving between
different Instruction classes.  

 2. Implement a specific translator that examines a vector of PTX
instructions and produces an equivalent vector of LLVM instructions.

Note: This first version will use naive translation where each PTX
instruction maps to one or more LLVM instructions.  It should not pay
attention to automatic vectorization at all.

Which milestone does the feature belong to?
0.7.0


Which branch does the new feature go in?
Trunk

Original issue reported on code.google.com by [email protected] on 29 Jul 2009 at 10:14

failure to build against current hydrazine

What steps will reproduce the problem?
1. check out a fresh hydrazine using:

 svn checkout http://hydrazine.googlecode.com/svn/trunk/ hydrazine-read-only

as instructed in the Installation Guide

2. check out a fresh copy of gpuocelot
3. build and install hydrazine. Only libhydrazine.a is installed using the
standard "make install" target.
4. attempt to build gpuocelot. the build fails looking for
hydrazine/implementation/debug.h:

make  all-am
make[1]: Entering directory `/home/dank/local/gpuocelot-read-only/ocelot'
/bin/bash ./libtool  --tag=CXX   --mode=compile g++ -DHAVE_CONFIG_H -I.   
 -I ./ocelot/cuda/include -Wall -ansi -Werror -std=c++0x -g -O2 -MT
libocelot_la-DataflowGraph.lo -MD -MP -MF
.deps/libocelot_la-DataflowGraph.Tpo -c -o libocelot_la-DataflowGraph.lo
`test -f 'ocelot/analysis/implementation/DataflowGraph.cpp' || echo
'./'`ocelot/analysis/implementation/DataflowGraph.cpp
libtool: compile:  g++ -DHAVE_CONFIG_H -I. -I ./ocelot/cuda/include -Wall
-ansi -Werror -std=c++0x -g -O2 -MT libocelot_la-DataflowGraph.lo -MD -MP
-MF .deps/libocelot_la-DataflowGraph.Tpo -c
ocelot/analysis/implementation/DataflowGraph.cpp  -fPIC -DPIC -o
.libs/libocelot_la-DataflowGraph.o
In file included from ocelot/analysis/implementation/DataflowGraph.cpp:10:
./ocelot/analysis/interface/DataflowGraph.h:15:44: error:
hydrazine/implementation/debug.h: No such file or directory
ocelot/analysis/implementation/DataflowGraph.cpp: In constructor
‘analysis::DataflowGraph::NoProducerException::NoProducerException(unsigned
int)’:
ocelot/analysis/implementation/DataflowGraph.cpp:56: error: aggregate
‘std::stringstream message’ has incomplete type and cannot be defined
ocelot/analysis/implementation/DataflowGraph.cpp: In member function
‘analysis::DataflowGraph::Instruction
analysis::DataflowGraph::convert(ir::PTXInstruction&)’:


What is the expected output? What do you see instead?

A successful build.

What version of the product are you using? On what operating system?

SVN as of 2010-04-13 on Debian Linux Unstable, using the llvm-snapshot LLVM
packages.

Please provide any additional information below.

Hit me up on AIM or gtalk (nickblackandmild, [email protected]) if you'd
like, and we ought to be able to track it down pretty quickly. Thanks.

Original issue reported on code.google.com by dankamongmen on 13 Apr 2010 at 9:44

barrier deadlock on __syncthreads?

What steps will reproduce the problem?

The error below occurs when encountering the first __syncthreads(); in my 
CUDA kernel code.

What is the expected output? What do you see instead?

==Ocelot== Emulator failed to run kernel "_Z18chiSquaredDistancePfS_S_i" 
with exception:
==Ocelot== [PC 91] [thread 0] [cta 0] bar.sync 0 - barrier deadlock at: 
precomputeMatrix_chikernel.cu:53:0
terminate called after throwing an instance of 
'executive::RuntimeException'
Aborted

What version of the product are you using? On what operating system?

OpenSuse 11.1/GCC4.3.2/CUDA Toolkit 2.3/Ocelot SVN r271 (2009-12-22)

Please provide any additional information below.


__global__ void
chiSquaredDistance(float* C, float* A, float* B, int slabSizeA)
{
    // Thread index
    int tx = threadIdx.x;

    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;

    float temp = 0.0f;

    for(int vectorChunk = 0; vectorChunk < VECTORCHUNK_COUNT; vectorChunk++)
    {
        temp += 1.0f;  // compute some value for temp, not removed
    }

    __shared__ float sum[VECTORCHUNK_SIZE];
    sum[tx] = temp;
    __syncthreads();
    for(int bit = VECTORCHUNK_SIZE / 2; bit > 0; bit /= 2)
    {
        float t = sum[tx] + sum[tx ^ bit];
        __syncthreads();
        sum[tx] = t;
        __syncthreads();
    }

    // write to global memory
    if(tx == 0)
        C[by * slabSizeA + bx] = sum[tx] / 2;
}


Original issue reported on code.google.com by [email protected] on 15 Jan 2010 at 2:04

PTX-Emulator: Warp Scheduler

Describe the New Feature:

Add the concept of a warp to the PTX emulator.  
 * The warp size should be configurable on a per-cta basis.  
 * The CooperativeThreadArray class should be extended with a set of Warps,
each containing a stack of CTAContexts. 
 * The CooperativeThreadArray class should include a callback interface to
a warp scheduler function object that picks the next warp to execute out of
a pool of ready warps (a sketch of such an interface follows this list).
 * The branch divergence mechanism should be refined to operate on a
per-warp basis rather than a per-cta basis.
 * Each eval_* function should be modified to only execute instructions for
the currently selected warp.
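
A minimal sketch of the scheduler callback interface described above, with
per-warp reconvergence stacks and a pluggable policy object; the class and
method names are hypothetical, not the existing CooperativeThreadArray API:

{{{
#include <cstddef>
#include <stack>
#include <vector>

// Hypothetical per-warp execution context (simplified stand-in for CTAContext).
struct WarpContext {
    std::size_t pc;                // next instruction for this warp
    std::vector<bool> activeMask;  // branch-divergence mask within the warp
};

struct Warp {
    std::stack<WarpContext> contexts;  // reconvergence stack, one entry per divergence level
    bool atBarrier;                    // waiting at bar.sync?
};

// Callback object that picks the next warp to execute from the ready pool.
class WarpScheduler {
public:
    virtual ~WarpScheduler() {}
    // 'ready' holds indices of warps not blocked at a barrier; return one of them.
    virtual std::size_t pickNext(const std::vector<std::size_t>& ready) = 0;
};

// Simplest possible policy: round-robin over the ready warps.
class RoundRobinScheduler : public WarpScheduler {
public:
    RoundRobinScheduler() : last_(0) {}
    std::size_t pickNext(const std::vector<std::size_t>& ready) {
        // Assumes 'ready' is non-empty.
        std::size_t choice = ready[last_ % ready.size()];
        ++last_;
        return choice;
    }
private:
    std::size_t last_;
};
}}}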

Which milestone does the feature belong to?
2.0.0


Which branch does the new feature go in?
Trunk

Original issue reported on code.google.com by [email protected] on 18 May 2010 at 2:36

2-Element Vectors of floats are broken in the llvm backend on 32-bit platforms

What steps will reproduce the problem?
See this bug report from llvm: http://hlvm.llvm.org/bugs/show_bug.cgi?id=3287

What is the expected output? What do you see instead?
Loads to 2-element vectors of floats randomly produce nan values.

What version of the product are you using? On what operating system?
32-bit platforms using LLVM.

Original issue reported on code.google.com by [email protected] on 20 Feb 2010 at 8:15

Software texture sampling off by one pixel

To reproduce problem:

1. Execute 'SimpleTexture' and 'BicubicTexture' applications compiled using
the native CUDA toolchain executing on the GPU.
2. Copy their outputs into the 'data/' directory to be used as reference
inputs for GPU Ocelot.
3. Execute SimpleTexture and BicubicTexture with emulated and LLVM devices. 

The expected output should match the reference inputs. Instead, the output
consists of images shifted by approximately one pixel.

The current reference inputs were produced by GPU Ocelot and not provided
by the CUDA toolchain. We should see this as a defect in the texture
sampling procedures used by GPU Ocelot's emulated and translated devices.


Original issue reported on code.google.com by [email protected] on 5 Nov 2009 at 7:06

Memory checker is not running.

Hi,
i downloaded ocelot from svn today, ran
libtoolize
aclocal
autoconf
automake
./configure --prefix=SOMEPATH
make
make install

and then used your example from wiki, for memory checker
Compiled it with 
g++ -o mem mem.cu.cpp -L /usr/local/cuda/lib64/ -L SOME_PATH/lib -lcudart -lOcelotIr -lOcelotParser -lOcelotExecutive -lOcelotTrace -lOcelotAnalysis -lhydrazine
and running it by
./mem 1
doesn't print anything.


For a simple test whether ocelot is running at all, i did:
export CUDA_PROFILE=1
and ran
./mem 3
which did not produce cuda_profile.log; that might suggest that Ocelot
took over normal execution.

Also, running ./mem 1 doesn't produce any files.

CHECK_GLOBAL_ACCESSES is defined to 1.

I am using CUDA 2.3, but as far as I understand that shouldn't be an
issue, as this kernel shouldn't produce unknown PTX instructions.


Original issue reported on code.google.com by [email protected] on 16 Aug 2009 at 12:58

Error during make install

What steps will reproduce the problem?
1. Download the ocelot-1.3.967 package from 
http://code.google.com/p/gpuocelot/downloads/detail?name=ocelot-1.3.967.tar.bz2&
can=2&q=
2. Run ./configure; make; sudo make install

What is the expected output? What do you see instead?

Below is the error I get. It looks like the TestLLVMKernels.h file is included
twice in the install list.

 /usr/bin/install -c -m 644  ocelot/executive/test/TestGPUKernel.h ocelot/executive/test/TestLLVMKernels.h ocelot/executive/test/TestEmulator.h ocelot/executive/test/TestLLVMKernels.h ocelot/executive/test/sequence.ptx ocelot/executive/test/kernels.ptx '/usr/local/include/ocelot/executive/test'                                                                 
/usr/bin/install: will not overwrite just-created 
`/usr/local/include/ocelot/executive/test/TestLLVMKernels.h' with 
`ocelot/executive/test/TestLLVMKernels.h'                       
make[2]: *** [install-nobase_includeHEADERS] Error 1                            


make[2]: Leaving directory `/home/animus/Work/simulators/ocelot-1.3.967'        


make[1]: *** [install-am] Error 2                                               


make[1]: Leaving directory `/home/animus/Work/simulators/ocelot-1.3.967'        


make: *** [install] Error 2 


What version of the product are you using? On what operating system?
I am using ocelot-1.3.967 on Ubuntu 10.04. Make version is 3.81. Install 
version is 8.5.

Please provide any additional information below.
This doesn't seem to be a fatal error for NVIDIA GPU emulation. The libraries
are installed and they work. I am just not sure whether this error has any
effect on the working of the simulator.

Original issue reported on code.google.com by [email protected] on 8 Mar 2011 at 6:11

on 32-bit platforms gcc and nvcc disagree about the size of tuples of pairs

What steps will reproduce the problem?
This test program:

#include <iostream>
#include <tr1/tuple>
#include <utility>

int main()
{
  std::cerr << "sizeof(long long): " << sizeof(long long) << std::endl;

  using namespace std;
  using namespace std::tr1;

  typedef pair<long long, long long> p;
  typedef tuple<p, unsigned int> t;

  std::cerr << "sizeof(tuple<pair<long long, long long> >, unsigned int>):
" << sizeof(t) << std::endl;

  return 0;
}

What is the expected output? What do you see instead?
The size is 20 on gcc4.4.1 32-bit, 24 on gcc4.4.1 64-bit, and 24 on nvcc3.0b


Original issue reported on code.google.com by [email protected] on 6 Feb 2010 at 8:31
