Coder Social home page Coder Social logo

Comments (14)

BrunoLevy avatar BrunoLevy commented on June 17, 2024

Hi @jeras, thank you very much for the feedback, I'll double check and test.
Best,
-- Bruno

from learn-fpga.

jeras avatar jeras commented on June 17, 2024

Internal state register aluShamt is missing a reset, causing simulation to fail on WAIT_ALU_OR_MEM state.
I did not analyze what a random value in aluShamt would do.

On FPGA this would not affect power-on reset, since all registers are initialized to 0, but it could cause issues if the CPU is reset in the middle of a shift.
ASIC would be affected.

from learn-fpga.

jeras avatar jeras commented on June 17, 2024

x0 is not initialized to zero, which again would be fine for FPGA, but not for an ASIC.
Some FPGA tools allow you to provide the initial value of a register, which would work for both synthesis and simulation.
An ASIC would need either a register reset, or another mechanism.

from learn-fpga.

jeras avatar jeras commented on June 17, 2024

The instr register does not have a reset, which probably should not be an issue.
A possible issue is mixing registers with and without reset in the same always statement.
This mixing means, the reset signal will be used to implement the instr logic.
I am not sure what the effect would be with a synchronous reset, on an ASIC with an asynchronous reset, this would add a multiplexer at the input into instr with reset as the select signal and a loop from instr output to one of the inputs into the mux. This would assure instr would keep its value while reset is active.

So by splitting the always statement into one with reset for state and PC and one for the other registers might reduce logic consumption a bit.

A related issue is descibed in this document section "3.1 Synchronous reset flip-flops with non reset follower flip-flops":
http://www.sunburst-design.com/papers/CummingsSNUG2003Boston_Resets.pdf

from learn-fpga.

Mecrisp avatar Mecrisp commented on June 17, 2024

Thank you for the observations and hints! Are you going to produce an ASIC with the Quark in it? The Quark is designed around using BRAMs as its register set, which would be implemented with actual registers and muxes that gives hardwired zero for x0. Also absolutely go for a barrel shifter in ASIC! In terms of size, the Quark-with-barrel-shifter still fits in HX1K FPGA, and it is a real gain. The official Quark does not have it because we needed every possible LUT for adding peripherals, but beyond the constraints of the tiny HX1K FPGA, much better performance/area ratio can be achieved.

from learn-fpga.

Mecrisp avatar Mecrisp commented on June 17, 2024

By the way, if I were about to design an ASIC, I would shoot for some very cool analog chip with special optical features :-) Digital design can be done with a vanilla FPGA next to it.

from learn-fpga.

jeras avatar jeras commented on June 17, 2024

First I must say I noticed some initialization code at the end of the RTL under a BENCH ifdef I did not notice before, this should handle many simulation issues I mentioned above.
I still think the reset value of state is problematic.
aluShamt seems to relay on the reset value of state being WAIT_ALU_OR_MEM and multiple clock cycles of active reset to clear to zero. This would at least have to be documented, but would be much easier to just add a reset.

I do not plan to make an ASIC with quark, although I have seen some synthesis attempts:
https://bitlog.it/20220118_asic_roundup_of_open_source_riscv_cpu_cores.html
I needed a very small RISC-V implementation to compare against mine, since I have some issues making my instruction decoder small due to an experimental coding style.

Before you continue reading please note, I did not go through the episodic design notes yet. I will try to get through before asking too many further questions. An open a separate issue if I will have further questions or suggestions.

@Mecrisp, you mentioned quark was designed to use BRAM on FPGA (if I understood the sentence correctly). But the RTL is written like a FlipFlop implementation with a write port and 2 asynchronous read ports (data multiplexers). A short name for this port configuration is 2R1W. Yesterday I compiled it with Vivado for the Arty board and apparently a BRAM block was used for the register file.

I am perplexed how BRAM was inferred from the given RTL. The Artix BRAM has true dual port support, 2 independent read/write ports in short 2RW. So one port would have to be shared between a read and a write, which could not be done simultaneously. Since this sharing is not explicit in the RTL, I am not sure how the synthesis tool was able to infer a BRAM. In my experience inference is a very delicate process often resulting in suboptimal implementations.

I only tried to synthesize for Lattice HX1K using your makefiles (with yosys), and there it is even less clear to me if and how a BRAM was used. The HX1K ports are 1W1R and the largest data width configuration is 16, so at least 4 blocks would be needed to implement a 1W2R 32-bit register file.
On the other hand if the register file was synthesized into Flip-Flops, this would be 1024 FF or about 80% of the available flops on the smallest device.

If (when) I would start a CPU design for HX1K, I would just store the register file in the main memory (probably the end), this would result in the smallest logic consumption. I think a similar approach might be used in 8-bit AVR, since load/store instructions can access the GPRs.
I was also looking at GOWIN FPGA devices, which have distributed RAM with asynchronous read support, which is a better fit for my current CPU design, for now I am synthesizing for Artix and Cyclone V.

from learn-fpga.

Mecrisp avatar Mecrisp commented on June 17, 2024

When aluShamt has a non-zero value after coming fresh from a short reset pulse, shifting is active, but as the processor starts in WAIT_ALU_OR_MEM state, it will perform up to 31 dummy shift cycles and then resume operation. I know, this is not the most elegant solution, but we were aggressively optimising for minimum LUT usage, and therefore we added reset initialisations only when absolutely necesssary. BENCH section will cover your needs.

Most of the paths we explored during design are described in #1

https://bitlog.it/20220118_asic_roundup_of_open_source_riscv_cpu_cores.html

Cool!

There is no asynchronous read of the register file, reads are actually clocked as well. See this snipplet of the relevant lines copied from Quark source:

   reg [31:0] rs1;
   reg [31:0] rs2;
   reg [31:0] registerFile [31:0];

   always @(posedge clk) begin
     if (writeBack)
       if (rdId != 0)
         registerFile[rdId] <= writeBackData;
...
        state[WAIT_INSTR_bit]: begin
           if(!mem_rbusy) begin // may be high when executing from SPI flash
              rs1 <= registerFile[mem_rdata[19:15]];
              rs2 <= registerFile[mem_rdata[24:20]];

@Mecrisp, you mentioned quark was designed to use BRAM on FPGA (if I understood the sentence correctly).

Correct.

We are specifying a 2R1W implementation on a FPGA that only offers 1R1W block RAMs in hardware. In order to achieve 32 bit wide access, actually 4 blocks of memory are used on HX1K for register file.

Like you, we were totally surprised by the marvellous RAM interference capabilities Yosys has to offer.

Keeping the register file in memory is possible; the SERV core is working that way, but it decreases performance when compared to a separate register file. The Quark as-is needs 3 cycles for regular and 5 cycles for load/store instructions (up to 34 cycles for shifts); Quark-Bicycle achieves 2 cycles for regular instructions and 4 cycles load/store with a barrel shifter and additional address bus multiplexers in execute state. Cycles given when running from RAM, no memory busy.

from learn-fpga.

jeras avatar jeras commented on June 17, 2024

Thanks for all the precious details.

I still need to read through the SERV documentation, I would like to understand how much is serialized, since probably it is not everything (instruction decoder?).

There is probably still some middle ground between SERV and quark which is worth exploring. I think it should be possible to get a CPI around 3~5 while storing instruction/data and registers in the same memory, which is close to the current quark. But memory utilization would be much better, so some memory could be available to peripherals, or there would just be more memory for CPU instruction/data.

I had a look at the Quark-Bicycle source code. I will definitely borrow the idea of sharing a single shifter for left/right.

I apologize in advance for the following rant, my social skills are not good enough for me to be educational but not condescending.

Code where every bit is coded individually rubs me the wrong way (it took me 15 min just to find an appropriate phrase):
https://github.com/BrunoLevy/learn-fpga/blob/master/FemtoRV/RTL/PROCESSOR/femtorv32_quark_bicycle.v#L131-L137
While SystemVerilog provides "11.4.14 Streaming operators (pack/unpack)" which can be used for bit reversal, I have never encountered them used in RTL.
A good Verilog-2005 alternative is to use a function with a for loop, something like the Verilog equivalent of this SystemVerilog code:

function logic [XLEN-1:0] reverse (logic [XLEN-1:0] value);
    for (int i=0; i<32; i++)  reverse[i] = value[XLEN-i-1];
endfunction: reverse

This function can be then used for both reversal operations.
The same approach is great for Gray encoder/decoder.

I have a special disdain for generated CRC code. When the same can be achieved with totally generic XOR and a shift on a vector inside a loop.
https://bues.ch/cms/hacking/crcgen
I wanted to give you and example, I have written many, but never for an open source project, so I have nothing public yet. The following is an example I found online, but it has an extra loop compared to my code which applies XOR directly to vectors (data and polynomial):
https://github.com/hellgate202/crc_calc/blob/master/src/crc_calc.sv

from learn-fpga.

Mecrisp avatar Mecrisp commented on June 17, 2024

How will you handle memory busy signals when you aim for similiar CPI but with registers in main memory?

from learn-fpga.

jeras avatar jeras commented on June 17, 2024

Yesterday I wrote a draft spec for a single memory CPU:
https://github.com/jeras/rp32/tree/master/hdl/rtl/r5p_1mem
At the end is a short comment on handling memory busy with a bus based on tightly integrated memory interfaces (also in early draft state):
https://github.com/jeras/rp32/blob/master/doc/Sysbus.md
There are some waveforms available, but they are still in wavedrom json format.
Read (1 cycle fixed delay for read data):
Read
Write:
Write

I find all standard low resource buses inadequate.
Actually the only option is APB and it is really bad, it requires 2 clock cycles for each read/write without any wait states.
The basic idea of my proposal is to handle busy in the first clock period of a transfer for both read and write. Instead of providing busy for a read in the same cycle as read data. Read data is than available after a fixed delay, for simple systems (like the proposed processor) this delay is 1 clock cycle. APB peripherals would have to be rewritten, for now I have only written a GPIO.

from learn-fpga.

Mecrisp avatar Mecrisp commented on June 17, 2024

You could try a combinatorial piece of logic for short-circuiting the wait for busy/ready signaling depending on the address range. Bruno inserted a short-circuit for sw/sh/sb working this way; if you have asynchronous read of the register file, you might do similiar for fetch. Quark is currently limited to a minimum of 2 CPI per instruction because the register file needs one cycle after the opcode arrives.

from learn-fpga.

jeras avatar jeras commented on June 17, 2024

I already took advantage off all possible combinational bypasses in the draft design.
The longest phase sequence is 4 (fetch, rs1 read, rs2 read, write-back for most instructions), and all are memory accesses. So at least for 1RW memories this sequence is optimal.
For dual port memories it would be possible to combine at least some memory accesses and have a shorter sequence.

I made this design with ASIC in mind. On an ASIC you can usually choose what kind of memories you would like to use, and single port memories are about 30% smaller than dual port. So on ASIC a processor without a GPR register file would not make much sense with dual port memories.

It would probably be possible to create some king of pipeline for quark with at least some instructions executing in a single clock cycle, but I am not sure it would be worth the effort.

Do all processors in the FemtoRV32 family use a synchronous GPR register file?

My original R5P processor design has a CPI of 1 for all instructions. This requires an asynchronous GPR register file. So my focus was on FPGA devices with 1RW1R distributed memories where the 1R port can be asynchronous. Some of this FPGA are Artix, EPS5, Cyclone 5, GOWIN, ... I only started thinking about a CPU without a register file after talking to you and thinking, it could be very small in an ASIC.

As I understand cost was the main reason for focusing on Lattice HX1K, did you consider low cost solutions from GOWIN and other low cost vendors?
https://www.dialog-semiconductor.com/products/greenpak/low-power-low-cost-forgefpga
https://www.quicklogic.com/products/fpga/fpgas-sram/

I personally prefer the major vendors due to mature tools, but with improvements in open source tools low cost FPGA might become more attractive.

from learn-fpga.

Mecrisp avatar Mecrisp commented on June 17, 2024

Great! I'll keep an eye on your ongoing development efforts.

Do all processors in the FemtoRV32 family use a synchronous GPR register file?

At this point in time, yes.

As I understand cost was the main reason for focusing on Lattice HX1K, did you consider low cost solutions from GOWIN and other low cost vendors?

We are strong FOSS advocates here, and Project Apicula for Gowin synthesis wasn't available back then. It all started with Lattice HX1K.

I personally prefer the major vendors due to mature tools, but with improvements in open source tools low cost FPGA might become more attractive.

I do FPGA firmware development for aerospace as my main job, using commercial tools, and I still strongly prefer Yosys.

from learn-fpga.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.