stanfordaha / lassen
The PE for the second generation CGRA (garnet).
umin, umax, smin, smax
Lines 77 to 84 in 008a0da
Someone needs to write more exhaustive tests on every op with multiple inputs.
Here we have a PE configured as mult (1, reg), where data0 is configured as a const register and data1 is configured as DELAY.
0. Signal from the data0 connection box. Since data0 is in CONST mode, as indicated by signal 2, this is expected.
1. Signal from the data1 connection box. This is the input to the PE core.
2. Register mode for data0. 0 is CONST, so this is expected.
3. Register const value for data0. This is expected since it's a multiply by 1.
4. Register mode for data1. Binary 11 is 3, which is DELAY mode. This is also correct.
5. This is the signal going to the ALU unit. Notice that there is no delay!
6. This is the output from the ALU unit. No delay either.
Mode flags are here:
Lines 9 to 17 in 5ed30d6
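For comparison, the behaviour one would expect from the waveform can be sketched as a small Python model. This is a hypothetical model, not lassen's code: the class and method names are made up, the register is assumed to start at 0, and only the two mode encodings quoted above (0 for CONST, 3 for DELAY) come from this issue.

```python
CONST = 0  # mode encoding quoted in the issue
DELAY = 3  # binary 11, as quoted in the issue


class InputRegister:
    """Hypothetical model of one PE input register (not lassen's code)."""

    def __init__(self, mode, const_value=0):
        self.mode = mode
        # Register contents: the configured constant, or 0 (assumed reset value).
        self.value = const_value

    def step(self, cb_input):
        """Advance one clock cycle and return what the ALU sees this cycle."""
        if self.mode == CONST:
            # The connection-box input is ignored; the constant is held.
            return self.value
        if self.mode == DELAY:
            # The ALU sees last cycle's input; the new input is captured
            # for the next cycle.
            out = self.value
            self.value = cb_input
            return out
        raise ValueError(f"mode {self.mode} is not modelled in this sketch")


# mult(1, reg): data0 is the constant 1, data1 is delayed by one cycle.
data0 = InputRegister(CONST, const_value=1)
data1 = InputRegister(DELAY)

inputs = [5, 7, 9]
products = [data0.step(0) * data1.step(x) for x in inputs]
```

Under these assumptions each product lags its input by one cycle, which is the delay that is missing from signals 5 and 6.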
When I was running Jeff's end-to-end tests, the waveform showed that asr is not implemented correctly in RTL. See the waveform below.
Then I went ahead and poked around inside lassen. Even after I added asr to RTL, the tests were still passing. That's odd. I then asked pytest to print out everything, and here is what I found:
fault is not reporting an error even though verilator fails. Here is the link to the printout: https://travis-ci.org/StanfordAHA/lassen/builds/532629001#L1217
You can reproduce the individual test on the test_asr branch with the following command:
pytest tests/test_rtl.py -k "test_rtl[False-mode0-asr]" -s
You should see that the verilator assertion failed but pytest passed.
I'd call for a thorough manual examination of every single op we have, and for testing each of them manually, to prevent these kinds of mistakes from happening again.
In lut.py:16, dynamically constructing a new bitvector out of an array of Bits is not supported.
@cdonovick, is this something you want to support, or is there a better way to write this code?
Probably the same issue as with asr.
I haven't tested smin yet, but I suspect it's the same error.
See the smax branch.
See: https://travis-ci.com/StanfordAHA/lassen/builds/111969902
Looks like some mapper tests and rtl tests are failing.
Mapper tests fixed by #24
I will work on a solution that will block merge if the PR fails GarnetFlow.
I have a preliminary branch implementing IRQ in lassen called 'irq'
The features currently are that for both the alu output and the single bit output, you can enable irq along with a comparison value. If at any time during running the comparison is true, the PE will output a 1 on the irq output along with latching that 1 in a register in the PE. The reason for this latch is so that the SOC can easily determine which PE triggered the interrupt.
Open questions for lassen features:
- Should the IRQ be output in the same cycle in which the interrupt occurs?
- Should we continue to output the IRQ until it is cleared by software?
- Should we support multiple comparison operations? (==, !=, <, >, etc.)
Let me know your thoughts or any other assumptions I missed.
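To make the questions above concrete, here is a minimal behavioural sketch of the feature as described. It is not the code on the 'irq' branch: the class and method names are hypothetical, and it assumes one possible set of answers (the irq output is asserted only in the cycle of the match, the latch stays set until software clears it, and the only comparison is ==).

```python
class IrqUnit:
    """Hypothetical behavioural model of the IRQ feature described above."""

    def __init__(self, enable, compare_value):
        self.enable = enable
        self.compare_value = compare_value
        self.latched = 0  # sticky bit the SoC could read to find the PE

    def step(self, alu_output):
        """Process one cycle; return the value driven on the irq output."""
        hit = self.enable and alu_output == self.compare_value
        if hit:
            self.latched = 1
        # Assumption: irq is asserted in the same cycle as the match only.
        return 1 if hit else 0

    def clear(self):
        """Assumption: software clears the latch explicitly."""
        self.latched = 0


unit = IrqUnit(enable=True, compare_value=42)
irq_trace = [unit.step(v) for v in [1, 42, 3]]
```

Holding irq high until `clear()` is called would only require returning `self.latched` from `step` instead.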
The goal of this is to use CoSA to formally prove that changes to lassen do not change the generated RTL for the PE tile.
#135 contains the gold coreir json file which we should be comparing against.
The bfloat implementation in Halide seems to be consistent with the IP with rnd=1. It would be much easier if we could use that mode instead.
Currently there is no way to know what ops the PE bitstream corresponds to. In other words, it's impossible to debug the PE. Can someone add a way to comment the bitstream so that a human programmer can understand what's going on?
One major issue we have seen is that we are missing a lot of RTL unit tests in lassen. Due to the nature of Peak we should be able to parameterize the functional tests (test_pe.py) to also do RTL tests.
Right now the carry-in for the normal add op is always set to 0 (see https://github.com/StanfordAHA/lassen/blob/master/lassen/sim.py#L60). IIRC jade set the cin using a one-bit input (see https://github.com/StanfordAHA/CGRAGenerator/blob/master/hardware/generator_z/pe_new/pe/rtl/test_pe_comp.svp#L507), which enables the construction of a carry chain adder (e.g. for a 32 bit add).
Did we intend to drop this feature?
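As an illustration of why the one-bit carry-in matters, here is a sketch of a 32-bit add built from two narrower adds. The function names are hypothetical and the 16-bit data width is an assumption made for the example; this is not the code in sim.py.

```python
MASK16 = 0xFFFF  # assumed PE data width of 16 bits


def add16(a, b, cin=0):
    """16-bit add with a one-bit carry-in; returns (sum, carry_out)."""
    total = (a & MASK16) + (b & MASK16) + (cin & 1)
    return total & MASK16, total >> 16


def add32(a, b):
    """32-bit add as a two-stage carry chain of 16-bit adds."""
    lo, carry = add16(a & MASK16, b & MASK16)
    # The carry-out of the low half feeds the carry-in of the high half.
    hi, cout = add16(a >> 16, b >> 16, carry)
    return (hi << 16) | lo, cout
```

With cin tied to 0, the second stage cannot see the low half's carry, so the chain needs extra ops to fold it in.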
Continuing on testing float pointwise. Created a new issue since the old one is too long: #111.
Build log: https://buildkite.com/stanford-aha/lassen/builds/194#65c1f0d0-c89a-47d4-97b9-5064d65ceaf5/80-1018
I would like to use name_outputs as a decorator on the call function. Currently this causes rtl generation to break, so I have to do a workaround.
See here: https://travis-ci.com/StanfordAHA/GarnetFlow/builds/110987402
and here: https://travis-ci.com/StanfordAHA/lassen/builds/110816257
In the future people should run the garnet flow to make sure that nothing is broken...
See https://buildkite.com/stanford-aha/lassen/builds/1#f7512cd6-f12f-455e-980a-3f25eafa2a84
Just a side note:
From now on, every push and PR in lassen will trigger a Buildkite build, which tests the floating point ops with CW files. If you want to debug, please use kiwi to test locally.
EDIT:
I will improve the rtl_tester to intelligently choose a different simulator based on the environment it's in. GarnetFlow is already doing that.
I propose a dedicated configuration bit that can flip the sign bit of floating point values. This would avoid burning an additional PE when doing FPSUB. (Today you can flip the sign bit by first doing an XOR in a separate PE.)
@alexcarsello, thoughts?
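A small sketch of the idea, under stated assumptions: values are 16-bit bfloat16 words with the sign in the msb, `fp_add` is a hypothetical stand-in for the PE's FP add (it truncates rather than rounds), and `negate_b` models the proposed configuration bit. None of these names come from lassen.

```python
import struct

SIGN_BIT = 0x8000  # msb of a 16-bit float word


def to_bf16(x):
    """Truncate a Python float to a bfloat16 bit pattern (no rounding)."""
    return struct.unpack(">I", struct.pack(">f", x))[0] >> 16


def from_bf16(bits):
    """Expand a bfloat16 bit pattern back to a Python float."""
    return struct.unpack(">f", struct.pack(">I", (bits & 0xFFFF) << 16))[0]


def fp_add(a_bits, b_bits, negate_b=False):
    """Stand-in for the PE's FP add; negate_b models the proposed config bit."""
    if negate_b:
        # Same effect as XOR-ing with 0x8000 in a separate PE beforehand.
        b_bits ^= SIGN_BIT
    return to_bf16(from_bf16(a_bits) + from_bf16(b_bits))


difference = from_bf16(fp_add(to_bf16(1.5), to_bf16(2.25), negate_b=True))
```

With the bit set, a single PE computes a - b; without it, the XOR has to happen upstream.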
Halide requires the use of the following complex floating point operations:
div, rem, log, exp, pow, sqrt, sin, cos, tan, asin, acos, atan2, tanh
We currently have a bunch of microps that are required to do the floating point divide algorithm. They are as follows:
FGetMant
FAddIExp
FSubExp
FCnvExp2F
FGetFInt
FGetFFrac
Are these microps sufficient to be able to do similar algorithms for the rest of these complex operations?
We need Round, Floor and Ceil. If we have one of these ops, the other two are very easy to create by just adding or subtracting 0.5 appropriately.
@nikhilbhagdikar, could you add these to our complex ops (in lassen/stdlib/) and add some tests?
For the tests, please use hwtypes.FPVector in order to construct and manipulate python floating point values
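As a sketch of how the other two ops can be derived from one primitive, the example below assumes floor is the op that exists. `floor_op` is a hypothetical stand-in backed by `math.floor`, not a lassen op. Round is obtained by adding 0.5 (ties then go toward +infinity, which may not be the tie rule we want), and ceil is obtained by negation, because a plain 0.5 offset gets exact integers wrong.

```python
import math


def floor_op(x):
    """Stand-in for a hardware floor primitive."""
    return math.floor(x)


def ceil_from_floor(x):
    """ceil(x) == -floor(-x) for every finite x."""
    return -floor_op(-x)


def round_from_floor(x):
    """Round half up: add 0.5, then floor. Ties go toward +infinity."""
    return floor_op(x + 0.5)
```

In real floating point hardware the `x + 0.5` step is itself rounded, so values very close to a tie would need separate tests.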
We need the flags to be correct in order to do all the floating point comparison operations (<, <=, ==, !=, etc.).
Ideally this can be done using just the existing flags, so that cond.py does not actually depend on the instruction.
It seems like the irq signal gets propagated into canal as a valid output. As a result, the irq signal goes to all the switchbox muxes, potentially increasing the area. What is this signal doing? Based on the code here:
Lines 257 to 261 in 5909514
is irq a way to output constant 0 to the application network? If not, can someone explain to me which of the applications we have so far need it? Is it a form of premature optimization?
Floating point numbers should have their fractional bits as their lsbs, and their sign bit as the msb.
@nikhilbhagdikar, I suspect you have done the opposite in your lassen implementation for most of the fp ops.
I have a branch called 'float-test' that contains your floating point changes along with some more parameterized tests.
On this branch, you can run:
>pytest -k get_mant
and you will see an assertion error.
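For reference, the expected layout can be sketched as follows. The example assumes a 16-bit bfloat16 word (1 sign bit, 8 exponent bits, 7 fraction bits); the helper names are hypothetical and are not part of lassen or hwtypes.

```python
import struct


def to_bf16(x):
    """Truncate a Python float to a bfloat16 bit pattern (no rounding)."""
    return struct.unpack(">I", struct.pack(">f", x))[0] >> 16


def bf16_fields(bits):
    """Split a bfloat16 word into (sign, exponent, fraction)."""
    sign = (bits >> 15) & 0x1       # msb
    exponent = (bits >> 7) & 0xFF   # next 8 bits
    fraction = bits & 0x7F          # 7 lsbs
    return sign, exponent, fraction


# -1.5 is -1.1b * 2**0: sign 1, biased exponent 127, fraction 1000000b.
fields = bf16_fields(to_bf16(-1.5))
```

An op such as get_mant should therefore read the low 7 bits, not the high ones.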
@rdaly525
Peak core needs to output the following ports:
config_addr -> this is 8 bits.
config_data -> this is 32 bits wide.
read_config_data -> this can be any size.
config_en -> this is the write signal.
reset -> 1-bit signal that sets everything to zero.
Notice that there is no read signal (sorry I lied). The core should always return the values to read_config_data given the addr.
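The port list above can be read as the following behavioural sketch. It is a hypothetical model, not Peak or lassen code: the class name is made up, read_config_data is assumed to be 32 bits wide, and a write is assumed to be visible on read_config_data in the same step.

```python
class ConfigModel:
    """Hypothetical behavioural model of the config ports listed above."""

    ADDR_BITS = 8
    DATA_BITS = 32

    def __init__(self):
        self.regs = {}  # unwritten addresses read back as 0

    def step(self, config_addr, config_data, config_en, reset):
        """Apply one cycle of inputs and return read_config_data."""
        addr = config_addr & ((1 << self.ADDR_BITS) - 1)
        if reset:
            # reset sets everything to zero
            self.regs.clear()
        elif config_en:
            # config_en is the write signal
            self.regs[addr] = config_data & ((1 << self.DATA_BITS) - 1)
        # No read strobe: the addressed register is always driven out.
        return self.regs.get(addr, 0)


model = ConfigModel()
written = model.step(3, 0xDEADBEEF, config_en=1, reset=0)
read_back = model.step(3, 0, config_en=0, reset=0)
```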
@nikhilbhagdikar, Can you do a PR off of 'float-test' and transfer your complex op algorithms into an inherited peak class? There is an explicit example of how to do this in the test test_complex.py with the FMA.
Resolving/Missing Features
Missing randomized Tests
"micro" FP ops, Nikhil, Ross (branch 'float-test')
Complex Ops, Nikhil, Ross (Branch 'complex-float')
Register Modes (for each of the 5 inputs), Pat:
Carry tests, Ross ('add32' -> master)
IRQ
Reading and writing Data and bit registers (Ross/Keyi)
Stall
Asynchronous reset on Data/Bit Registers (Lenny)
RTL
test_pe
so we're fairly confident in their functionality:
I have a branch called 'mem' where I began specifying the memory using Peak.
I need the following fleshed out in the Peak representation of this memory tile.
Currently I have only loosely specified the RAM/ROM mode in lassen/mem/*.py
Let me know if you have any questions.
FP_Mult Error mode issue
- Problem: RTL mismatched FPVector and Halide due to wrapper
- Solution: Use DW FP IP instead
- Verification: Lassen RTL tests
- Can be resolved today
SLT/SGT Bug
- Problem: SMT cannot infer SLT and SGT, but the random tests are passing
- Solution: Still debugging (Caleb)
- Verification: Automapper finding SLT and SGT
SubExp Bug
- Problem: A small portion of random tests (~5%) are failing the SubExp micro op RTL
- Solution: Still debugging (Lenny/anyone else wants to help?)
- Verification: Lassen RTL tests
During configuration, when the chip is clock-gated, using VALID mode will prevent the register from taking values. This mode proved to be critical when I was testing the Jade chip.
So the question is: did VALID get renamed to DELAY mode, or can we never clock-gate the register in the future? If that's the case, there should be extra graph analysis to make sure that a register used in a counter is never placed inside the PE, since you can't clock-gate it.
Being able to do an FP round will give us round, ceil, and floor, which are required by the Halide applications.
pytest --workes=auto test_micro.py and pytest --workes=auto test_pe.py fail.
Given how long test_pe.py takes, it would be nice to run it in parallel.
@nikhilbhagdikar I have isolated your complex ops tests into a separate branch called 'complex-float'. Can you change them to inherit from peak in that branch?
Again, in general please do small pull requests that do not have multiple orthogonal changes.
Two problems:
1. The rounding mode, which is round to nearest even, does not match between the functional model and the RTL. See: https://buildkite.com/stanford-aha/lassen/builds/109#712c5179-b555-4ddb-8fd3-6da54ebba32b
The functional model is using mpfr, which I believe should also be correct.
2. When using -v with pytest, fault doesn't catch the error, yet running the test individually does. There is something wrong with either the test bench setup or fault. See the successful build: https://buildkite.com/stanford-aha/lassen/builds/108#ef4d8b94-f985-4e9e-9873-cdf64ba87be3
@leonardt can you take a closer look?
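For anyone comparing the two models by hand, here is a sketch of what round to nearest even means when a float32 value is narrowed to a 16-bit bfloat16 word. It is a generic illustration, not the mpfr-based functional model or the RTL; the function names are hypothetical, and NaN, infinity and overflow are deliberately not handled.

```python
import struct


def f32_bits(x):
    """Return the IEEE-754 single-precision bit pattern of x."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def bf16_round_nearest_even(x):
    """Narrow a float32 value to bfloat16 bits, rounding to nearest even."""
    bits = f32_bits(x)
    lsb = (bits >> 16) & 1  # lsb of the part that is kept
    # Adding 0x7FFF + lsb rounds up when the dropped half is above a tie,
    # and on an exact tie only when the kept lsb is odd.
    rounded = bits + 0x7FFF + lsb
    return (rounded >> 16) & 0xFFFF
```

The two tie cases 1 + 2**-8 and 1 + 3 * 2**-8 are useful probes: the first rounds down to 0x3F80 and the second rounds up to 0x3F82, so a model that always rounds ties in one direction will disagree on one of them.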