
prof-braino,propforth

Comments (23)

GoogleCodeExporter commented on September 26, 2024

Also, you keep mentioning MultiChannel High Speed Synchronous Serial 
communication as a feature. Nothing wrong with that, but it would help if 
you put a number on it. From the above code it looks like 20 cycles/bit (raw 
speed).

Thanks,
Marko

Original comment by [email protected] on 18 Jun 2011 at 12:19

from propforth.

GoogleCodeExporter commented on September 26, 2024

The way this code should function is:

\ _treg5 tx reg , _treg3 rx reg
__7trreg
        mov _treg6 , # 20
__Flp
            shl _treg5  , # 1 wc
            muxc    outa , v_pinout
            test    v_pinin , ina wc
            rcl _treg3  , # 1  
        djnz    _treg6  , # __Flp wz
__8trregret
        ret

But remember, we are transmitting and receiving at the same time, with both
sides running in sync.

This puts the data-valid time and the data-read time about 4 cycles apart. Any
capacitance on the line could slow down the edge and make this a little tight,
as could any slight clock differences between props.

By sacrificing 2 longs and 8 cycles per 32 bits, we add another 4 cycles for
the data to stabilize before it is read by the other prop chip, so in theory we
have 8 cycles, or about 100 ns on an 80 MHz clock.

This should be good enough for most applications.
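
As a quick sanity check of the 100 ns figure, here is a minimal Python sketch of the cycles-to-time conversion. The function name and the 80 MHz default are assumptions made for illustration; nothing here comes from mcs.f itself.

```python
def cycles_to_ns(cycles: int, clock_hz: float = 80e6) -> float:
    """Convert a count of system clock cycles to nanoseconds."""
    return cycles * 1e9 / clock_hz

# Settling window before and after adding the 4 extra cycles.
tight_ns = cycles_to_ns(4)
relaxed_ns = cycles_to_ns(8)
print(tight_ns, relaxed_ns)
```

The same helper can be reused for other clock rates by passing a different `clock_hz`.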

The raw bit speed is about 20 cycles/bit. However, the protocol stack must be
factored in.

The goal of this code is not to send 32 bits, but to provide 8 synchronous
full-duplex channels between props.

Original comment by [email protected] on 19 Jun 2011 at 1:58


GoogleCodeExporter commented on September 26, 2024


Will dig up the test code over the next week, clean it up and post.

Original comment by [email protected] on 19 Jun 2011 at 2:25


GoogleCodeExporter commented on September 26, 2024

This is just from cursory inspection (correct me if I'm wrong).

1. The slave waits for the master to drop the comm line and then does the same 
in opposite direction to unblock the master.

a_mastermcs
__Flp
    jmpret  __Aloadregsret , # __9loadregs
    mov _treg1 , # 1 wz
    muxz    outa , v_pinout              ' release slave
    waitpne v_pinin , v_pinin            ' wait for slave
    jmpret  __Ctxrxret , # __Btxrx       ' main transfer loop
    jmp # __Flp

a_slavemcs
__Flp
    jmpret  __Aloadregsret , # __9loadregs
    mov _treg1 , # 1 wz
    waitpne v_pinin , v_pinin            ' wait for master
    muxz    outa , v_pinout              ' unblock master
    jmpret  __Ctxrxret , # __Btxrx       ' main transfer loop
    jmp # __Flp

This puts the master 3 cycles behind the slave (ideal timing). From that point 
both entities run the same code.

2. The transfer loop is structured like this: test-idle-idle-send-loop (T--S+). 
Which means master and slave run like this:

  M:  -T--S+T--S+T--S+
  S:  T--S+T--S+T--S+

This clearly shows that the first test instruction is simply wasted (which was 
why I raised the issue). Maybe that's fine with you, it just feels wrong in a 
way :)

Original comment by [email protected] on 20 Jun 2011 at 1:21


GoogleCodeExporter commented on September 26, 2024

> The raw bit speed is about 20 cycles / bit. However, the protocol stack must
> be factored.
>
> The goal of this code is not to send 32 bits, but to provide 8 synchronous
> full duplex channels between props.

I'm aware of that but "MultiChannel High Speed Synchronous Serial 
communication" as stated now doesn't really mean anything. A few numbers don't 
hurt.

Original comment by [email protected] on 20 Jun 2011 at 1:26


GoogleCodeExporter commented on September 26, 2024

My apologies, I have been traveling and did not have systems with me.
Looking at my original notes, the "optimal" code structure was not reliable:
different configurations were not yielding the same results.

A more detailed analysis shows the timing is actually tighter, because of how
the pipeline works. The write of one instruction's result is followed
immediately by the read of the next instruction's source.

The master-write to slave-read path has lots of time, but the slave-write to
master-read path is much closer. Cutting 4 cycles here to get to the "optimal"
code structure gave master CRC errors in some configurations.



This is a LogicAnalyzer capture, not a real scope trace, and the sample time is 
one clock cycle. The setup is MCS loopback (all running on one prop).

After looking at the tests, some exact timings:


0 - Execute N-1 Fetch N                    muxc outa , v_pinout / djnz 
1 -  Write Result N-1                      muxc outa , v_pinout
2 -   Fetch Source N                       djnz _treg6  , # __Flp wz
3 -    Fetch Dest N                        djnz _treg6  , # __Flp wz             

4 -     Execute N Fetch N+1                djnz _treg6  , # __Flp wz / test
6 -      Write Result N                    djnz _treg6  , # __Flp wz 
7 -       Fetch Source N+1                 test v_pinin , ina wc
8 -        Fetch Dest N+1                  test v_pinin , ina wc
9 -         Execute N+1 Fetch N+2
10-          Write Result N+1
11 -          Fetch Source N+2
12 -           Fetch Dest N+2
13 -            Execute N+2 Fetch N+3
14 -             Write Result N+2

                               11111
                     012345678901234                      
                      |Master Sets Data on pin 21
                      |        
                      |  |Slave Sets Data on pin 20
                      |  |     
                      |  |  |Master Reads Data on pin 20
                      |  |  |  
                      |  |  |  | Slave Reads Data on pin 21
                      |  |  |  |
21____________________--------------------____________________----------
20---____________________--------------------____________________-------
                 |               |               |               |       
         200.0 nS|       400.0 nS|       600.0 nS|       800.0 nS|

So the raw data throughput is on the order of 20 clocks/bit (full duplex),
about 4 Mbit/s.

On top of that is the CRC/framing, and then the acknowledge handshake to provide
a byte-by-byte synchronous transfer, which is the goal.

The result is much slower than 4 Mbit/s; those are the throughput timing tests
I will clean up and publish in the near future.

Original comment by [email protected] on 23 Jun 2011 at 5:08


GoogleCodeExporter commented on September 26, 2024

First I'd like to clear up some confusion. I listed two code fragments in 
comment #4. The cog executing a_mastermcs will be the one running 3 cycles 
behind a_slavemcs. I assumed that a_mastermcs designates the master. Are you 
saying that the roles are reversed?

As for the timing, I have to disagree (slightly) (Ir---- omitted, 
muxc/djnz/test)
      muxcdjnztest
M: ---SDeRSDeRSDeR--- (cog running a_mastermcs)
S: SDeRSDeRSDeR------
      |  |   |  |
      0  1   2  3

0. R: slave updates pin
1. R: master updates pin
2. e: slave samples pin
3. e: master samples pin

i.e. the timing is a bit more relaxed.

Original comment by [email protected] on 24 Jun 2011 at 12:27


GoogleCodeExporter commented on September 26, 2024

Just thinking aloud here. Delay the slave by 4 cycles (or get both M & S 
completely in sync). This gets us away (down) from the current 3 cycles 
(anything is an improvement).

Then grab a counter (NCO) but leave it in idle mode (frqx = 0).

    movi    ctra, #%0_00100_000

The transfer loop will then look like this:

    mov     phsa, data       ' send MSb
    mov     lcnt, #32        ' loop count
    test    mask, ina wc     ' sample pin
    rcl     phsa, #1         ' store bit and send next one
    djnz    lcnt, #$-2       ' for all 32 bit

Even with the current delay (3) the timing doesn't change with the code above. 
All you get is 12 cycles/bit instead of 20.

       mov mov testrcl djnztestrcl
M:  ---SDeRSDeRSDeRSDeRSDeRSDeRSDeR---
S:  SDeRSDeRSDeRSDeRSDeRSDeRSDeR------
       |  |   |  | |  |   |  | |
S:     Tx |   Rx | Tx |   Rx | Tx
M:        Tx  |  Rx   Tx     Rx
          +-+-+
            |
            critical timing which can only improve when the master sends earlier
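
To make the data flow of that loop easier to follow, here is a minimal Python model of two 32-bit shift registers swapping words MSb first. It is a simplified sketch only: it ignores the pipeline timing and line delay discussed above, and the names `rcl` and `exchange` are made up for illustration.

```python
MASK32 = 0xFFFFFFFF

def rcl(value: int, carry: int) -> int:
    """Rotate-carry-left by one on 32 bits: shift left, carry goes into bit 0."""
    return ((value << 1) | (carry & 1)) & MASK32

def exchange(word_a: int, word_b: int, bits: int = 32) -> tuple:
    """Swap two words bit by bit, MSb first, one bit per loop pass."""
    phs_a, phs_b = word_a & MASK32, word_b & MASK32
    for _ in range(bits):
        # With the counter idle (frqx = 0), each output pin is modelled as
        # simply following bit 31 of that side's phsx register.
        pin_a, pin_b = phs_a >> 31, phs_b >> 31
        # "test ... wc" samples the other side's pin into carry; "rcl" stores
        # that bit and moves the next bit to send up into bit 31.
        phs_a, phs_b = rcl(phs_a, pin_b), rcl(phs_b, pin_a)
    return phs_a, phs_b

a, b = exchange(0xDEADBEEF, 0x12345678)
print(hex(a), hex(b))
```

After 32 passes each register holds the word the other side started with, which is why the same register can serve as both transmit and receive buffer.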

Original comment by [email protected] on 24 Jun 2011 at 12:51


GoogleCodeExporter commented on September 26, 2024

Minor omission: the counter requires pin A to be set. Anyway, that's done only 
once for the lifetime of the cog.

Original comment by [email protected] on 24 Jun 2011 at 12:54


GoogleCodeExporter commented on September 26, 2024

My mistake, I mislabelled Master and Slave on the timing diagram, so I think 
we agree. The corrected diagram:

0 - Execute N-1 Fetch N                    muxc outa , v_pinout / djnz 
1 -  Write Result N-1                      muxc outa , v_pinout
2 -   Fetch Source N                       djnz _treg6  , # __Flp wz
3 -    Fetch Dest N                        djnz _treg6  , # __Flp wz             

4 -     Execute N Fetch N+1                djnz _treg6  , # __Flp wz / test
6 -      Write Result N                    djnz _treg6  , # __Flp wz 
7 -       Fetch Source N+1                 test v_pinin , ina wc
8 -        Fetch Dest N+1                  test v_pinin , ina wc
9 -         Execute N+1 Fetch N+2
10-          Write Result N+1
11 -          Fetch Source N+2
12 -           Fetch Dest N+2
13 -            Execute N+2 Fetch N+3
14 -             Write Result N+2

                               11111
                     012345678901234                      
                      |Slave Sets Data on pin 21
                      |        
                      |  |Master Sets Data on pin 20
                      |  |     
                      |  |  |Slave Reads Data on pin 20
                      |  |  |  
                      |  |  |  | Master Reads Data on pin 21
                      |  |  |  |
21____________________--------------------____________________----------
20---____________________--------------------____________________-------
                 |               |               |               |       
         200.0 nS|       400.0 nS|       600.0 nS|       800.0 nS|


The timing results including protocol:

Will update mcs.f to include the test code in a few days, once I verify the
test runs on the released code.

Assuming 80 MHz props, the raw wire transmission sends and receives a bit every 
20 cycles, so the full-duplex raw bit rate is 4 Mbit/s.

Communication is packed into a 96-bit frame, which has 8 bytes, CRC, and flow 
control. So the theoretical maximum number of frames per second is 41,667. But 
since one cog is running both the flow control protocol and the wire protocol, 
the real maximum throughput is about 21,000 frames per second. So about an 
equal amount of time is spent on the protocol as on the wire.

If we factor the code to run on 2 cogs, this should double throughput.
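
The frame-rate arithmetic above can be reproduced with a short Python sketch. The constant names are made up for illustration, and the 21,000 frames/s value is the approximate figure quoted in the text, not an exact measurement.

```python
CLOCK_HZ = 80_000_000      # assumed 80 MHz system clock
CYCLES_PER_BIT = 20        # raw wire speed of the original loop
FRAME_BITS = 96            # 8 bytes + CRC + flow control

raw_bps = CLOCK_HZ / CYCLES_PER_BIT        # raw full-duplex bit rate
max_frames = raw_bps / FRAME_BITS          # frames/s if the wire never idled
measured_frames = 21_000                   # approximate figure quoted above
wire_share = measured_frames / max_frames  # fraction of time spent on the wire

print(round(raw_bps), round(max_frames), round(wire_share, 2))
```

A wire share of about one half is what "an equal amount of time on the protocol as on the wire" means in numbers.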

Some real-world results from the test: the throughput per channel goes down
from 10527 bytes/sec/channel with only one channel running to 9381
bytes/sec/channel with 8 channels running. This is about half the frame rate,
since each byte is synchronously acknowledged.

This decline is due to the frame rate going down slightly as more time is spent 
on protocol.

If comparing this to RS-232, factor the bit rates up by 25%, as a minimum of
one start bit and one stop bit is required for each byte for RS-232.
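
The 25% factor follows from RS-232 framing: 8 data bits travel as 10 line bits. A minimal Python sketch of that conversion, with a hypothetical helper name, applied to the single-channel "XMT bits/sec/ch" figure from the test output in this comment:

```python
def rs232_equivalent_baud(payload_bps: float, data_bits: int = 8,
                          start_bits: int = 1, stop_bits: int = 1) -> float:
    """Line rate an RS-232 link would need to carry the same payload."""
    return payload_bps * (data_bits + start_bits + stop_bits) / data_bits

single_channel_bps = 84216   # one-channel run, payload bits per second
print(rs232_equivalent_baud(single_channel_bps))
```

With the default 8N1 framing the result is 1.25 times the payload rate; parity or extra stop bits would raise it further.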

"test" loops the channels on the slave cog, so that each byte received is 
transmitted. The display of cog? indicates the channels of the master cog are 
routed to cog 5 channel 1; this is really an artifact of test routing them to a 
dummy location for testing.

The xmt/rec code is written in assembler, to simulate the fastest possible 
source/sink of bytes. 


Prop0 Cog6 ok
test

                a # # - set mcs pins
                b # # - set mcs cogs
                c #   - set xmt/rec cog
                d     - start mcs
                e #   - set number active channels
                f     - stats
                g     - cog?
                q     - quit
d

CON:Prop0 Cog0 RESET - last status: 0 ok

CON:Prop0 Cog1 RESET - last status: 0 ok

CON:Prop0 Cog2 RESET - last status: 0 ok
Cog:0  #io chan:8                              MCS  0(0)->5(1)  0(1)->5(1)  
0(2)->5(1)  0(3)->5(1)  0(4)->5(1)  0(5)->5(1)  0(6)->5(1)  0(7)->5(1)
Cog:1  #io chan:8                              MCS  1(0)->1(0)  1(1)->1(1)  
1(2)->1(2)  1(3)->1(3)  1(4)->1(4)  1(5)->1(5)  1(6)->1(6)  1(7)->1(7)
Cog:2  #io chan:1 PropForth v4.5 2011MAY31 17:30 0
Cog:3  #io chan:1 PropForth v4.5 2011MAY31 17:30 0
Cog:4  #io chan:1 PropForth v4.5 2011MAY31 17:30 0
Cog:5  #io chan:1 PropForth v4.5 2011MAY31 17:30 0
Cog:6  #io chan:1 PropForth v4.5 2011MAY31 17:30 0  6(0)->7(0)
Cog:7  #io chan:1                           SERIAL  7(0)->6(0)
Master Pin:      0
Slave Pin:       1
Master Cog:      0
Slave Cog:       1

Xmt/Rec Cog:     2
Master Errors:   0
Slave Errors:    0

Master frames/s: 21055
Master bps:      2021280

Slave  frames/s: 21054
Slave  bps:      2021184

Num Channels:    1

XMT byte/sec:    10527
XMT bits/sec:    84216

XMT byte/sec/ch: 10527
XMT bits/sec/ch: 84216

REC byte/sec:    10527
REC bits/sec:    84216

REC byte/sec/ch: 10527
REC bits/sec/ch: 84216

e 8
f
Master Pin:      0
Slave Pin:       1
Master Cog:      0
Slave Cog:       1

Xmt/Rec Cog:     2
Master Errors:   0
Slave Errors:    0

Master frames/s: 18763
Master bps:      1801248

Slave  frames/s: 18763
Slave  bps:      1801248

Num Channels:    8

XMT byte/sec:    75049
XMT bits/sec:    600392

XMT byte/sec/ch: 9381
XMT bits/sec/ch: 75048

REC byte/sec:    75048
REC bits/sec:    600384

REC byte/sec/ch: 9381
REC bits/sec/ch: 75048

Original comment by [email protected] on 24 Jun 2011 at 12:58


GoogleCodeExporter commented on September 26, 2024

And somehow you omitted cycle 5 in the diagram ...

Original comment by [email protected] on 24 Jun 2011 at 1:02


GoogleCodeExporter commented on September 26, 2024

Also - based on your diagram - ina is sampled during S (cycle 7). This is *not* 
the case. Live registers (cnt, ina, phsx) are sampled during e (S+2).

Original comment by [email protected] on 24 Jun 2011 at 1:22


GoogleCodeExporter commented on September 26, 2024

Ok updated diagram, which should be correct, numerically and semantically :)

0 - Execute N-1 Fetch N                    muxc outa , v_pinout / djnz 
1 -  Write Result N-1                      muxc outa , v_pinout
2 -   Fetch Source N                       djnz _treg6  , # __Flp wz
3 -    Fetch Dest N                        djnz _treg6  , # __Flp wz
4 -     Execute N Fetch N+1                djnz _treg6  , # __Flp wz / test
5 -      Write Result N                    djnz _treg6  , # __Flp wz 
6 -       Fetch Source N+1                 test v_pinin , ina wc
7 -        Fetch Dest N+1                  test v_pinin , ina wc
8 -         Execute N+1 Fetch N+2 Fetch live
9-           Write Result N+1
10 -          Fetch Source N+2
11 -           Fetch Dest N+2
12 -            Execute N+2 Fetch N+3
13 -             Write Result N+2

                               11111
                     012345678901234                      
                      |Slave Sets Data on pin 21
                      |        
                      |  |Master Sets Data on pin 20
                      |  |     
                      |  |   |Slave Reads Data on pin 20
                      |  |   |  
                      |  |   |  | Master Reads Data on pin 21
                      |  |   |  |
21____________________--------------------____________________----------
20---____________________--------------------____________________-------
                 |               |               |               |       
         200.0 nS|       400.0 nS|       600.0 nS|       800.0 nS|

Looking at the code in comments 8/9:

The critical timing components are unchanged, just with a shorter cycle time.
I cannot see why it would not work, so I will try, in order:

1. change the send/rec loop, re-run the timing test and LogicAnalyzer traces
2. try delaying the slave by 4 clock cycles, re-run the timing tests and
LogicAnalyzer

I expect the overall improvement, including the protocol stack, to be around
20%, which is worthwhile.

Should have time Saturday, will post results.

Original comment by [email protected] on 24 Jun 2011 at 3:08


GoogleCodeExporter commented on September 26, 2024

All changes done, went very smoothly. 21% overall throughput improvement, 
sample is almost exactly in the middle of the bit time.

2 files attached:

mcs.f - changes + test program
LogicAnalyzer.f - bug fix: sampling at 1 cycle was failing

results: (feel free to find errors)


Under the Hood:

Assuming 80 MHz props, the raw wire transmission sends and receives a bit every 
12 cycles, so the full-duplex raw bit rate is 6.7 Mbit/s.

Communication is packed into a 96-bit frame, which has 8 bytes, CRC, and flow 
control. So the theoretical maximum number of frames per second is 69,444. But
since one cog is running both the flow control protocol and the wire protocol,
the real maximum throughput is about 22,730 frames per second. So about twice
the time is spent on the protocol as on the wire.

If we factor the code to run on 2 cogs, this should increase throughput.
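
The "twice the time on the protocol" estimate can be checked per frame with a short Python sketch. The variable names are made up for illustration, and the 22,730 frames/s value is the 8-channel figure quoted above.

```python
CLOCK_HZ = 80_000_000      # assumed 80 MHz system clock
CYCLES_PER_BIT = 12        # raw wire speed of the new loop
FRAME_BITS = 96            # 8 bytes + CRC + flow control

wire_us = FRAME_BITS * CYCLES_PER_BIT / CLOCK_HZ * 1e6   # wire time per frame
measured_frames = 22_730                                  # 8-channel figure
total_us = 1e6 / measured_frames                          # wall time per frame
protocol_us = total_us - wire_us                          # everything else
ratio = protocol_us / wire_us

print(round(wire_us, 1), round(total_us, 1), round(ratio, 2))
```

A ratio slightly above 2 matches the statement that the protocol now dominates the wire time.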

Some real-world results from the test: the throughput per channel goes down
from 13298 bytes/sec/channel with only one channel running to 11364 
bytes/sec/channel with 8 channels running. This is about half the frame rate, 
since each byte is synchronously acknowledged.

This decline is due to the frame rate going down slightly as more time is spent 
on protocol.
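
Both claims, the "half the frame rate" relation and the overall gain, can be checked against the master-side numbers printed in the test output of this thread. This is a minimal Python sketch; the dictionary layout and function names are assumptions for illustration.

```python
# (frames/s, bytes/s/channel) keyed by number of active channels,
# copied from the master-side test output in this thread.
old_loop = {1: (21055, 10527), 8: (18763, 9381)}    # 20 cycles/bit
new_loop = {1: (26599, 13298), 8: (22730, 11364)}   # 12 cycles/bit

def bytes_per_frame(frames_per_s, bytes_per_s_per_ch):
    """Bytes delivered per channel for every frame sent."""
    return bytes_per_s_per_ch / frames_per_s

def speedup(channels):
    """Frame-rate ratio of the new loop over the old one."""
    return new_loop[channels][0] / old_loop[channels][0]

for ch in (1, 8):
    print(ch, round(bytes_per_frame(*new_loop[ch]), 3), round(speedup(ch), 3))
```

The 8-channel speedup is where the 21% figure comes from; with a single channel the gain is somewhat larger.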

If comparing this to RS-232, factor the bit rates up by 25%, as a minimum of
one start bit and one stop bit is required for each byte for RS-232.

"test" loops the channels on the slave cog, so that each byte received is 
transmitted. The display of cog? indicates the channels of the master cog are 
routed to cog 5 channel 1; this is really an artifact of test routing them to a 
dummy location for testing.

The xmt/rec code is written in assembler, to simulate the fastest possible 
source/sink of bytes. 


Prop0 Cog6 ok
test

                a # # - set mcs pins
                b # # - set mcs cogs
                c #   - set xmt/rec cog
                d     - start mcs
                e #   - set number active channels
                f     - stats
                g     - cog?
                q     - quit
d

CON:Prop0 Cog0 RESET - last status: 0 ok

CON:Prop0 Cog1 RESET - last status: 0 ok

CON:Prop0 Cog2 RESET - last status: 0 ok
Cog:0  #io chan:8                              MCS  0(0)->5(1)  0(1)->5(1)  
0(2)->5(1)  0(3)->5(1)  0(4)->5(1)  0(5)->5(1)  0(6)->5(1)  0(7)->5(1)
Cog:1  #io chan:8                              MCS  1(0)->1(0)  1(1)->1(1)  
1(2)->1(2)  1(3)->1(3)  1(4)->1(4)  1(5)->1(5)  1(6)->1(6)  1(7)->1(7)
Cog:2  #io chan:1 PropForth v4.5 2011MAY31 17:30 0
Cog:3  #io chan:1 PropForth v4.5 2011MAY31 17:30 0
Cog:4  #io chan:1 PropForth v4.5 2011MAY31 17:30 0
Cog:5  #io chan:1 PropForth v4.5 2011MAY31 17:30 0
Cog:6  #io chan:1 PropForth v4.5 2011MAY31 17:30 0  6(0)->7(0)
Cog:7  #io chan:1                           SERIAL  7(0)->6(0)
Master Pin:      0
Slave Pin:       1
Master Cog:      0
Slave Cog:       1

Xmt/Rec Cog:     2
Master Errors:   0
Slave Errors:    0

Master frames/s: 26599
Master bps:      2553504

Slave  frames/s: 26597
Slave  bps:      2553312

Num Channels:    1

XMT byte/sec:    13298
XMT bits/sec:    106384

XMT byte/sec/ch: 13298
XMT bits/sec/ch: 106384

REC byte/sec:    13298
REC bits/sec:    106384

REC byte/sec/ch: 13298
REC bits/sec/ch: 106384

e 8
f
Master Pin:      0
Slave Pin:       1
Master Cog:      0
Slave Cog:       1

Xmt/Rec Cog:     2
Master Errors:   0
Slave Errors:    0

Master frames/s: 22730
Master bps:      2182080

Slave  frames/s: 22729
Slave  bps:      2181984

Num Channels:    8

XMT byte/sec:    90912
XMT bits/sec:    727296

XMT byte/sec/ch: 11364
XMT bits/sec/ch: 90912

REC byte/sec:    90912
REC bits/sec:    727296

REC byte/sec/ch: 11364
REC bits/sec/ch: 90912



Low level timing:

pin 21 is the slave pin
pin 20 is the master pin

LogicAnalyzer trace, sample every cycle


            test    v_pinin , ina wc
            rcl phsa , # 1
        djnz    _treg6  , # __Flp wz


0 - Execute N-1 Fetch N                         rcl phsa , # 1  / djnz 
1 -  Write Result N-1                           rcl phsa , # 1  / djnz
2 -   Fetch Source N                            djnz    _treg6  , # __Flp wz
3 -    Fetch Dest N                             djnz    _treg6  , # __Flp wz
4 -     Execute N Fetch N+1                     djnz    _treg6  , # __Flp wz / test
5 -      Write Result N                         djnz    _treg6  , # __Flp wz 
6 -       Fetch Source N+1                      test    v_pinin , ina wc
7 -        Fetch Dest N+1                       test    v_pinin , ina wc
8 -         Execute N+1 Fetch N+2 FetchLiveReg      test    v_pinin , ina wc
9-           Write Result N+1
10 -          Fetch Source N+2
11 -           Fetch Dest N+2
12 -            Execute N+2 Fetch N+3
13 -             Write Result N+2

                               11111
                 012345678901234                      
                  |Master Sets Data on pin 20
                  |        
                  ||Slave Sets Data on pin 21
                  ||     
                  ||     |Master Reads Data on pin 21
                  ||     |  
                  ||     ||Slave Reads Data on pin 20
                  ||     ||
21_________________------------____________------------____________-----
20________________-------------___________-------------___________------
                 |               |               |               |     
         200.0 nS|       400.0 nS|       600.0 nS|       800.0 nS|

Original comment by [email protected] on 25 Jun 2011 at 3:41

Attachments:


GoogleCodeExporter commented on September 26, 2024

[deleted comment]


GoogleCodeExporter commented on September 26, 2024

                               11111
                 012345678901234                      
                  |Master Sets Data on pin 20
                  |        
                  ||Slave Sets Data on pin 21
                  ||     
                  ||     |Master Reads Data on pin 21
                  ||     |  
                  ||     ||Slave Reads Data on pin 20
                  ||     ||
21_________________------------____________------------____________-----
20________________-------------___________-------------___________------
                 |               |               |               |     
         200.0 nS|       400.0 nS|       600.0 nS|       800.0 nS|

Original comment by [email protected] on 25 Jun 2011 at 3:44


GoogleCodeExporter commented on September 26, 2024

Nice! If you have the space you could unroll the loop (8 cycles/bit). Probably 
not as big a change as going from 20 to 12 cycles though.
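
A rough Python projection of that expectation, not a measurement: it assumes 4 clocks per instruction and that the per-frame protocol overhead stays at the value implied by the measured 22730 frames/s (8 channels, 12 cycles/bit). All names are made up for illustration.

```python
CLOCK_HZ = 80_000_000   # assumed 80 MHz system clock
FRAME_BITS = 96         # bits per frame, as described earlier in the thread
INSTR_CYCLES = 4        # assumption: each loop instruction takes 4 clocks

def wire_us(instr_per_bit):
    """Wire time per frame in microseconds for a given loop length."""
    return FRAME_BITS * instr_per_bit * INSTR_CYCLES / CLOCK_HZ * 1e6

# Per-frame protocol overhead implied by the measured 8-channel frame rate
# with the 3-instruction loop (test / rcl / djnz).
overhead_us = 1e6 / 22730 - wire_us(3)

def projected_frames(instr_per_bit):
    """Estimated frames/s if only the wire loop length changes."""
    return 1e6 / (wire_us(instr_per_bit) + overhead_us)

for n in (5, 3, 2):   # 20, 12 and 8 cycles/bit
    print(n * INSTR_CYCLES, round(projected_frames(n)))
```

Under these assumptions the unrolled loop lands somewhere around 25,500 frames/s, a gain of roughly 12% over the 12-cycle loop, which is indeed smaller than the step from 20 to 12 cycles.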

Original comment by [email protected] on 26 Jun 2011 at 12:31


GoogleCodeExporter commented on September 26, 2024

Original comment by [email protected] on 26 Jun 2011 at 9:07

Changed state: Started
Added labels: Performance


GoogleCodeExporter commented on September 26, 2024

Original comment by [email protected] on 26 Jun 2011 at 9:41


GoogleCodeExporter commented on September 26, 2024

Unfortunately there is insufficient space for unrolling the loop, maybe next 
kernel release.

Original comment by [email protected] on 27 Jun 2011 at 5:18


GoogleCodeExporter commented on September 26, 2024

Changes have been incorporated into mcs for release 4.6. About a 20% 
overall performance increase on the protocol stack. 

Suggest we close this issue.

Original comment by [email protected] on 5 Jul 2011 at 2:49


GoogleCodeExporter commented on September 26, 2024

Be my guest. I'm happy with the code now so I consider this issue resolved :)

Original comment by [email protected] on 6 Jul 2011 at 1:07


GoogleCodeExporter commented on September 26, 2024

Original comment by [email protected] on 8 Jul 2011 at 6:53

Changed state: Verified


mcs.f - 32 + 1 bits in loop about propforth HOT 23 CLOSED
