Comments (38)
For 1., the output shows:
Failed on action=5 checking port O0. Expected 4117, got 4116
If I'm not mistaken, that means the functional model expects 4117 and the RTL got 4116. If we're rounding to nearest even, 4117 doesn't seem right?
from lassen.
For 2., I suspect this is an issue with the TB rather than fault, because fault correctly identifies the error in the single-run case; perhaps it's an issue with how fault generates the TBs for multiple tests. It may be that all the tests run in the same test_dir and ncsim is not recompiling something when it should (perhaps it's using an old test bench rather than the newly generated one for the next op). I'm going to see if running each test in its own tempdir catches it, which would suggest the above issue.
from lassen.
My educated guess for 1. is that the rounding depends on how many extra precision bits you have. In the test, we have
3 * 3.140625 = 0x4116c000
If we have more than 1 bit for rounding, then clearly it should round up, hence 0x4117. If we only have 1 bit for rounding, then we shouldn't round up, since 0x4117 is odd.
from lassen.
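To sanity-check the numbers above, here is a quick Python sketch using the standard struct module. The round-to-nearest-even logic here is my own illustration, not lassen's functional model:

```python
import struct

# float32 bit pattern of the exact product 3 * 3.140625
bits = struct.unpack(">I", struct.pack(">f", 3 * 3.140625))[0]
assert bits == 0x4116C000

# BFloat16 keeps the top 16 bits; the dropped low 16 bits drive the rounding.
kept, dropped = bits >> 16, bits & 0xFFFF
# Round to nearest even: round up when the dropped bits exceed half an ulp,
# or on an exact tie when the kept value is odd. Here dropped = 0xC000.
if dropped > 0x8000 or (dropped == 0x8000 and kept & 1):
    kept += 1
assert kept == 0x4117
```

With all 16 dropped bits visible the round-up to 0x4117 is unambiguous; only if a single rounding bit survives does the tie-to-even case (keeping 0x4116) come into play.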
Ah, I forgot we were considering FP numbers; I was interpreting the value as an integer.
from lassen.
Curiously, on kiwi, running pytest tests/test_pe.py -xv does result in the failure on fp_mult. I'm going to confirm the error isn't present when doing pytest tests/ -xv (I believe that's what I saw on my initial local checkout, but I'll double check). This might suggest that another test file is causing some sort of conflict.
from lassen.
Do the tests depend at all on Python floating-point implementation, and thus could Python/C++ version/lib differences on kiwi vs. other machines explain the difference?
from lassen.
I believe they use the libmpfr implementation, but the issue is that the output from our functional model (based on libmpfr) is not consistent with the RTL simulation (which uses the CW primitives with rnd('h0)).
from lassen.
Could the issue be related to how the multiply promotes the inputs (concatenating 0s to the end) and truncates the output (see the use of result_x)? Also the fact that frac_bits has 3 added to it? These are the configuration parameters I see that might be related to the issue. Also ieee_compliance?
module mul #(parameter exp_bits=1, parameter frac_bits=1) (
    input  [exp_bits+frac_bits:0] in0,
    input  [exp_bits+frac_bits:0] in1,
    output [exp_bits+frac_bits:0] out
);
    wire [2:0] result_x;
    wire [7:0] status;
    CW_fp_mult #(.sig_width(frac_bits+3), .exp_width(exp_bits), .ieee_compliance(0)) mul1 (.a({in0,3'h0}),.b({in1,3'h0}),.rnd('h0),.z({out,result_x}),.status(status));
endmodule // mul
from lassen.
@leonardt Yes, we believe this is the issue; we're still investigating. There are more problems with the IP stub.
from lassen.
Update: we have located the bug, which is due to the extra 3 bits in the instantiation. However, there is another bug in CW_fp_mult.v that prevents us from using a 7-bit mantissa.
from lassen.
Okay, update on issue 2: weirdly enough, running pytest tests -v misses the failure, but running pytest tests -xv catches it. This is odd.
from lassen.
I forgot about the extra precision bits. I think the appropriate solution is to modify the coreir stub to use exactly the correct bit widths. If we want higher precision, we can specify that at the peak/magma level.
from lassen.
The vanilla CW_fp_mult.v doesn't allow you to use the correct precision bits. Due to the copyright I can't post it here, but if you look at lines 60-62 you'll see the bug.
from lassen.
For me locally, it seems to reliably catch the failure when running pytest -xv tests and reliably miss the failure when running pytest -v tests. Perhaps we can use -x as a temporary unblocker (not great in some cases because you'll only see the first failure, whereas it can be useful to see all failures; but in other cases it's useful to kill the run early on the first failure). It's not ideal, but it's a strange issue that I suspect is related to some weird interaction between pytest and our code.
from lassen.
@Kuree, are you saying there is also a bug in the IP itself? Currently coreir instantiates the verilog stub with an extra 3 bits of fractional precision, so I know that is likely part of the issue.
I will work on a coreir fix for this
from lassen.
@rdaly525 Yes, I believe so. You can check it out and read the lines I pointed out above.
In the short run we can write a sed script to fix the bug.
from lassen.
I have a coreir branch 'float-fix' that updates how CoreIR instantiates the multiply. @Kuree, does this fix the issue?
from lassen.
Also, if we do actually want the multiply to have 3 extra bits of precision, we need to update the lassen description to concat 3 bits to the inputs, put the result into the appropriate BFloat[8,7+3] multiply operator, and then update the tests to match these semantics.
from lassen.
I don't think we need the extra 3 bits of precision. You have to round somewhere anyway, and it's difficult to match with other tools.
from lassen.
Okay, even stranger: I just had a run with pytest tests -v catch the error. @Kuree, is there any possibility of this test failing non-deterministically? Or is the bug deterministic?
from lassen.
This bug is deterministic since it's caused by improper instantiation of the IP.
from lassen.
Weird, I seem to be getting consistent failures (2 in a row at least) with pytest tests -v now. I'm not exactly sure where to continue debugging the issue; perhaps it's related to the buildkite environment? I'll look at the buildkite script to see if there's anything potentially interesting there. Maybe it has to do with the docker flow, so I can try recreating that.
It's strange that I did see the pass locally earlier, but now I can't seem to reproduce it.
from lassen.
I could reproduce the same problem inside the buildkite docker environment last night, so I doubt it has anything to do with buildkite.
I think, in the interest of saving time, we can put the -x flag there and monitor whether the problem resolves in the future.
from lassen.
Oh, I found something interesting in the buildkite logs.
This run https://buildkite.com/stanford-aha/lassen/builds/104#eeab0d8b-a96e-45ed-aa84-ef2e392fb4de/6 from ea7f165 enables the -s flag to see the stdout.
If we search the log for fp_mul, we see:
tests/test_pe.py::test_fp_binary_op[args15-op1] Running command: irun -top WrappedPE_tb -timescale 1ns/1ns -access +rwc -notimingchecks -input WrappedPE_cmd.tcl WrappedPE_tb.sv WrappedPE.v CW_fp_mult.v CW_fp_add.v
--
|
| irun(64): 15.20-s022: (c) Copyright 1995-2017 Cadence Design Systems, Inc.
| Recompiling... reason: file './WrappedPE_tb.sv' is newer than expected.
| expected: Tue Jun 4 22:42:23 2019
| actual: Tue Jun 4 22:42:24 2019
| file: WrappedPE_tb.sv
| module worklib.WrappedPE_tb:sv
| errors: 0, warnings: 0
| file: CW_fp_mult.v
| ncvlog: *W,WARIPR: warning within protected source code.
| Elaborating the design hierarchy:
| Caching library 'worklib' ....... Done
| Top level design units:
| WrappedPE_tb
| ncelab: *W,DSEMEL: This SystemVerilog design will be simulated as per IEEE 1800-2009 SystemVerilog simulation semantics. Use -disable_sem2009 option for turning off SV 2009 simulation semantics.
| CW_fp_add #(.sig_width(frac_bits), .exp_width(exp_bits), .ieee_compliance(0)) add1 (.a(in0),.b(in1),.rnd('h0),.z(out),.status(status));
| |
| ncelab: *W,CUVMPW (./WrappedPE.v,18|107): port sizes differ in port connection (32/3).
| CW_fp_mult #(.sig_width(frac_bits+3), .exp_width(exp_bits), .ieee_compliance(0)) mul1 (.a({in0,3'h0}),.b({in1,3'h0}),.rnd('h0),.z({out,result_x}),.status(status));
| |
| ncelab: *W,CUVMPW (./WrappedPE.v,8|124): port sizes differ in port connection (32/3).
| Building instance overlay tables: .................... Done
| Generating native compiled code:
| worklib.WrappedPE_tb:sv <0x40ef9b4e>
| streams: 8, words: 4345
| Building instance specific data structures.
| Loading native compiled code: .................... Done
| Design hierarchy summary:
| Instances Unique
| Modules: 1041 71
| Primitives: 1 1
| Registers: 63 114
| Scalar wires: 2668 -
| Expanded wires: 51 5
| Vectored wires: 1669 -
| Always blocks: 18 16
| Initial blocks: 8 4
| Cont. assignments: 1716 2028
| Pseudo assignments: 11 11
| Simulation timescale: 1ns
| Writing initial simulation snapshot: worklib.WrappedPE_tb:sv
| Loading snapshot worklib.WrappedPE_tb:sv .................... Done
| ncsim: *W,DSEM2009: This SystemVerilog design is simulated as per IEEE 1800-2009 SystemVerilog simulation semantics. Use -disable_sem2009 option for turning off SV 2009 simulation semantics.
| ncsim> source /cad/cadence/INCISIVE15.20.022/tools/inca/files/ncsimrc
| ncsim>
| ncsim> database -open -vcd vcddb -into verilog.vcd -default -timescale ps
| Created default VCD database vcddb
| ncsim> probe -create -all -vcd -depth all
| Created probe 1
| ncsim> run 10000ns
| Simulation complete via $finish(1) at time 40 NS + 0
| ./WrappedPE_tb.sv:50 #20 $finish;
| ncsim> quit
| PASSED
| tests/test_pe.py::test_fp_mul Running command: irun -top WrappedPE_tb -timescale 1ns/1ns -access +rwc -notimingchecks -input WrappedPE_cmd.tcl WrappedPE_tb.sv WrappedPE.v CW_fp_mult.v CW_fp_add.v
|
| irun(64): 15.20-s022: (c) Copyright 1995-2017 Cadence Design Systems, Inc.
| Loading snapshot worklib.WrappedPE_tb:sv .................... Done
| ncsim: *W,DSEM2009: This SystemVerilog design is simulated as per IEEE 1800-2009 SystemVerilog simulation semantics. Use -disable_sem2009 option for turning off SV 2009 simulation semantics.
| ncsim> source /cad/cadence/INCISIVE15.20.022/tools/inca/files/ncsimrc
| ncsim>
| ncsim> database -open -vcd vcddb -into verilog.vcd -default -timescale ps
| Created default VCD database vcddb
| ncsim> probe -create -all -vcd -depth all
| Created probe 1
| ncsim> run 10000ns
| Simulation complete via $finish(1) at time 40 NS + 0
| ./WrappedPE_tb.sv:50 #20 $finish;
| ncsim> quit
| PASSED
Notice that in the previous test (tests/test_pe.py::test_fp_binary_op[args15-op1]), ncsim says "Recompiling... reason: file './WrappedPE_tb.sv' is newer than expected.", which makes sense: with the new test there should be a new test bench, forcing a recompile.
Now, the output of the fp_mul test doesn't show this recompile message, which would imply that it's actually running the test bench from the previous test. Any ideas why this might be happening? I'll look to see if there's any place where fault might skip generating the test bench, but maybe there's some strange condition here causing ncsim to miss the recompile?
from lassen.
Hmm, can you force fault to touch the test bench file to make sure the modified time has changed?
from lassen.
This is from Nikhil's email:
I think the bug is in the way the multiplier is instantiated. Since CW_mult does not support a 7 bit mantissa (minimum is 10 bits), we had to instantiate with a 10 bit mantissa. So CW rounds to nearest even for that precision, and not for the 7 bit precision we need. I have pasted verilog code below that should round it to nearest even for our precision (note that CW outputs to int_out, not out). Can you please try with this added to the RTL? I have set the CW rounding mode to truncate now.
module mul #(parameter exp_bits=1, parameter frac_bits=1) (
    input  [exp_bits+frac_bits:0] in0,
    input  [exp_bits+frac_bits:0] in1,
    output [exp_bits+frac_bits:0] out
);
    wire [exp_bits+frac_bits:0] int_out;
    wire [2:0] result_x;  // the 3 extra precision bits dropped from z
    reg sign;
    reg [exp_bits-1:0] exp;
    reg [frac_bits:0] frac;
    CW_fp_mult #(.sig_width(frac_bits+3), .exp_width(exp_bits), .ieee_compliance(0)) mul1 (.a({in0,3'h0}),.b({in1,3'h0}),.rnd('h1),.z({int_out,result_x}),.status());
    always @(*) begin
        sign = int_out[exp_bits+frac_bits];
        exp  = int_out[exp_bits+frac_bits-1:frac_bits];
        frac = {1'b0,int_out[frac_bits-1:0]};
        // round to nearest even: round up on guard & (round | sticky),
        // or on a tie when the kept LSB is odd
        if ((result_x[2]&(result_x[1] | result_x[0])) | (int_out[0] & result_x[2])) begin
            frac = frac + 1'd1;
            if (~&exp) begin
                exp = exp + frac[frac_bits];  // carry fraction overflow into exponent
            end
        end
    end
    assign out = {sign, exp, frac[frac_bits-1:0]};
endmodule
He said that using a 7-bit mantissa won't be synthesized properly, so we have to use 10-bit here.
If we are going to go with the above code, let's explicitly write this in lassen so we can easily verify the high-risk rounding code using our functional testbench. @Kuree or @nikhilbhagdikar, can you do a lassen PR using the multiply operator of FPVector[8,7+3,RNE,False] in sim.py, then use normal BFloat16 for the testbench?
from lassen.
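For reference, the round-to-nearest-even step in the Verilog above can be modeled in Python as a minimal sketch (the function name and default widths are mine, for illustration; it takes the kept bits and the 3 dropped precision bits separately, as the RTL does):

```python
def rne_round(int_out: int, result_x: int, exp_bits: int = 8, frac_bits: int = 7) -> int:
    """Round-to-nearest-even given the kept bits (int_out = sign|exp|frac)
    and the 3 dropped precision bits (result_x), mirroring the Verilog."""
    sign = (int_out >> (exp_bits + frac_bits)) & 1
    exp = (int_out >> frac_bits) & ((1 << exp_bits) - 1)
    frac = int_out & ((1 << frac_bits) - 1)
    guard = (result_x >> 2) & 1
    lower = result_x & 0b011
    # Round up on guard & (round | sticky), or on a tie with an odd kept LSB.
    if guard and (lower or (int_out & 1)):
        frac += 1
        if frac >> frac_bits:  # fraction overflowed: carry into the exponent
            frac = 0
            if exp != (1 << exp_bits) - 1:  # don't bump past an all-ones exponent
                exp += 1
    return (sign << (exp_bits + frac_bits)) | (exp << frac_bits) | frac
```

For example, rne_round(0x4116, 0b110) gives 0x4117 (the test case from this thread), while rne_round(0x4116, 0b100) stays at 0x4116 because the tie resolves to the even value.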
There are a couple of ways to address the issue, but what I did for now is have the lassen test remove the generated tb file if it exists, to hopefully force ncsim to recompile (see 1705e47). I also changed the test script back to run them all (dc58b42) so we can see if we get the expected failure.
We could add this to fault, but this was a faster workaround, because otherwise we'd need to get the change merged into fault and released. The thing is, fault creates the test bench file through Python's file IO interface, see https://github.com/leonardt/fault/blob/master/fault/system_verilog_target.py#L272. We could add logic after this to touch the created file, but it seems to me that this should already result in a new timestamp every test. Maybe there's something weird going on with the way ncsim tracks file modifications? I'm hesitant to integrate this special case into fault, but if it's a general issue with how ncsim handles recompilation, then it does make sense to have fault automatically avoid the issue by doing some timestamp magic (we could use Python's os.utime https://docs.python.org/3.7/library/os.html#os.utime to try to force a recompilation).
from lassen.
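For what it's worth, that os.utime workaround might look like this (touch_newer is a hypothetical helper, not fault's actual API; the 2-second bump is an arbitrary choice):

```python
import os
import time

def touch_newer(path: str) -> None:
    """Bump the file's mtime a couple of seconds into the future so a
    timestamp-based recompile check always sees it as newer than any
    cached snapshot (os.utime takes (atime, mtime))."""
    future = time.time() + 2
    os.utime(path, (future, future))
```

Calling this on the generated test bench right after fault writes it would guarantee the mtime differs from the previous run, even if the two runs complete within the same second.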
And it doesn't... https://buildkite.com/stanford-aha/lassen/builds/114#18608758-0e21-40a8-bdad-f92fe698ff65
from lassen.
I'll run with -s to see if it's a similar message from ncsim (skipping the recompile step).
from lassen.
@rdaly525 What's the plan for the RTL? Are you going to hard-code the verilog fix? (I will verify the fix later today.)
from lassen.
In my opinion, it would be easier to verify the fix directly in peak by running functional tests and comparing the semantics of the code that Nikhil wrote vs BFloat16.
from lassen.
Weird. I added a print to make sure that the test bench file was being removed before fault regenerated it; the output looks like:
tests/test_pe.py::test_fp_mul removing tb file
Running command: irun -top WrappedPE_tb -timescale 1ns/1ns -access +rwc -notimingchecks -input WrappedPE_cmd.tcl WrappedPE_tb.sv WrappedPE.v CW_fp_mult.v CW_fp_add.v
irun(64): 15.20-s022: (c) Copyright 1995-2017 Cadence Design Systems, Inc.
Loading snapshot worklib.WrappedPE_tb:sv .................... Done
ncsim: *W,DSEM2009: This SystemVerilog design is simulated as per IEEE 1800-2009 SystemVerilog simulation semantics. Use -disable_sem2009 option for turning off SV 2009 simulation semantics.
ncsim> source /cad/cadence/INCISIVE15.20.022/tools/inca/files/ncsimrc
ncsim>
ncsim> database -open -vcd vcddb -into verilog.vcd -default -timescale ps
Created default VCD database vcddb
ncsim> probe -create -all -vcd -depth all
Created probe 1
ncsim> run 10000ns
Simulation complete via $finish(1) at time 40 NS + 0
./WrappedPE_tb.sv:50 #20 $finish;
ncsim> quit
PASSED
So it seems that even if I force-remove that file before generating a new _tb, ncsim still uses the cached worklib. I could try deleting the intermediate worklib files.
from lassen.
@leonardt If you remove worklib, then it's the same thing as using a temp folder, since ncsim won't be able to use the cached version anymore. This will make the runs much slower, but it should avoid the bug.
from lassen.
I'll try that first to see if it avoids the error, so we know the issue; then we can investigate how to get ncsim to force-recompile our _tb.
from lassen.
Ah, I saw this in the output:
Recompiling... reason: file './WrappedPE_tb.sv' is newer than expected.
--
| expected: Wed Jun 5 15:15:35 2019
| actual: Wed Jun 5 15:15:36 2019
so maybe it has to do with whether a full test run takes under 1 second; I'm going to try adding a time.sleep(1)
from lassen.
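If ncsim's check really does compare timestamps at whole-second granularity (an assumption based on the "expected/actual" lines in the log above, which only differ by one second), then two test benches generated within the same wall-clock second would be indistinguishable. A small sketch of that failure mode:

```python
import os
import tempfile

with tempfile.NamedTemporaryFile(delete=False) as f:
    path = f.name

# Pretend the first compile recorded mtime = 100.2s, and the test bench
# was then regenerated 0.7s later, within the same wall-clock second.
os.utime(path, (100.2, 100.2))
first = int(os.stat(path).st_mtime)
os.utime(path, (100.9, 100.9))
second = int(os.stat(path).st_mtime)

# At 1-second resolution the two versions look identical, so a coarse
# "is it newer?" check would skip the recompile.
assert first == second
os.unlink(path)
```

That would be consistent with both observations: sleeping 1 second between tests fixes it, and fast runs (under a second per test) are exactly the ones that miss the recompile.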
Adding time.sleep(1) also fixes the issue (see https://buildkite.com/stanford-aha/lassen/builds/121#a3c84cda-fb2c-4133-adda-7ce4e3a31e33), but it makes the tests run quite a bit slower (because most of them take under a second). I'm not sure what the best workaround is here; we could benchmark nuking INCA_libs (forcing a full recompile) versus time.sleep(1).
from lassen.
Issue seems to be resolved using fault's timestamp-edit branch (leonardt/fault#116), see https://buildkite.com/stanford-aha/lassen/builds/131#b4859b6e-ad64-47da-8104-98a5b23ef0ad
from lassen.
Moved the discussion to #120, since this issue also covers another bug.
from lassen.