Comments (23)
Also, you keep mentioning MultiChannel High Speed Synchronous Serial
communication as a feature. Nothing wrong with that, but it would help if
you put a number on it. From the above code it looks like 20 cycles/bit (raw
speed).
Thanks,
Marko
Original comment by [email protected]
on 18 Jun 2011 at 12:19
from propforth.
The way this code should function is:
\ _treg5 tx reg , _treg3 rx reg
__7trreg
mov _treg6 , # 20
__Flp
shl _treg5 , # 1 wc
muxc outa , v_pinout
test v_pinin , ina wc
rcl _treg3 , # 1
djnz _treg6 , # __Flp wz
__8trregret
ret
But remember, we are transmitting and receiving at the same time, with both
sides running in sync.
This puts the data-valid time and the data-read time about 4 cycles apart. Any
capacitance on the line could slow down the edge and make this a little tight,
as could any slight clock difference between props.
By sacrificing 2 longs and 8 cycles per 32 bits, we add another 4 cycles for the
data to stabilize before it is read by the other prop chip, so in theory we have
8 cycles, or about 100 ns on an 80 MHz clock.
This should be good enough for most applications.
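The margin quoted above is plain cycle arithmetic. A minimal sketch in Python, assuming the 80 MHz clock mentioned in the comment; `cycles_to_ns` is an illustrative helper, not something from mcs.f:

```python
CLOCK_HZ = 80_000_000  # system clock assumed in the comment above

def cycles_to_ns(cycles, clock_hz=CLOCK_HZ):
    """Convert a count of system clock cycles to nanoseconds."""
    return cycles * 1e9 / clock_hz

# Settling margin of the tighter loop versus the loop with 4 extra cycles.
print("4 cycles:", cycles_to_ns(4), "ns")
print("8 cycles:", cycles_to_ns(8), "ns")
```

At a different crystal frequency the same cycle counts give a proportionally different margin, so pass `clock_hz` explicitly when checking another board.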
The raw bit speed is about 20 cycles/bit. However, the protocol stack must be
factored in.
The goal of this code is not to send 32 bits, but to provide 8 synchronous
full-duplex channels between props.
Original comment by [email protected]
on 19 Jun 2011 at 1:58
Will dig up the test code over the next week, clean it up and post.
Original comment by [email protected]
on 19 Jun 2011 at 2:25
This is just from cursory inspection (correct me if I'm wrong).
1. The slave waits for the master to drop the comm line and then does the same
in opposite direction to unblock the master.
a_mastermcs
__Flp
jmpret __Aloadregsret , # __9loadregs
mov _treg1 , # 1 wz
muxz outa , v_pinout ' release slave
waitpne v_pinin , v_pinin ' wait for slave
jmpret __Ctxrxret , # __Btxrx ' main transfer loop
jmp # __Flp
a_slavemcs
__Flp
jmpret __Aloadregsret , # __9loadregs
mov _treg1 , # 1 wz
waitpne v_pinin , v_pinin ' wait for master
muxz outa , v_pinout ' unblock master
jmpret __Ctxrxret , # __Btxrx ' main transfer loop
jmp # __Flp
This puts the master 3 cycles behind the slave (ideal timing). From that point
both entities run the same code.
2. The transfer loop is structured like this: test-idle-idle-send-loop (T--S+).
Which means master and slave run like this:
M: -T--S+T--S+T--S+
S: T--S+T--S+T--S+
This clearly shows that the first test instruction is simply wasted (which was
why I raised the issue). Maybe that's fine with you, it just feels wrong in a
way :)
Original comment by [email protected]
on 20 Jun 2011 at 1:21
> The raw bit speed is about 20 cycles / bit. However, the protocol stack must
> be factored.
>
> The goal of this code is not to send 32 bits, but to provide 8 synchronous
> full duplex channels between props.
I'm aware of that but "MultiChannel High Speed Synchronous Serial
communication" as stated now doesn't really mean anything. A few numbers don't
hurt.
Original comment by [email protected]
on 20 Jun 2011 at 1:26
My apologies, I have been traveling and did not have systems with me.
Looking at my original notes, the "optimal" code structure was not reliable.
Different configurations were not yielding the same results.
On more detailed analysis, the timing is actually tighter than it looks, because
of how the pipeline works: the write of one instruction's result is followed
immediately by the read of the next instruction's source.
The master write to slave read has lots of time, but the slave write to master
read is much closer. Cutting 4 cycles here to get to the "optimal" code
structure gave master CRC errors in some configurations.
This is a LogicAnalyzer capture, not a real scope trace, and the sample time is
one clock cycle. MCS loopback (all running on one prop).
Some exact timings, after looking at the tests:
0 - Execute N-1 Fetch N muxc outa , v_pinout / djnz
1 - Write Result N-1 muxc outa , v_pinout
2 - Fetch Source N djnz _treg6 , # __Flp wz
3 - Fetch Dest N djnz _treg6 , # __Flp wz
4 - Execute N Fetch N+1 djnz _treg6 , # __Flp wz / test
6 - Write Result N djnz _treg6 , # __Flp wz
7 - Fetch Source N+1 test v_pinin , ina wc
8 - Fetch Dest N+1 test v_pinin , ina wc
9 - Execute N+1 Fetch N+2
10- Write Result N+1
11 - Fetch Source N+2
12 - Fetch Dest N+2
13 - Execute N+2 Fetch N+3
14 - Write Result N+2
          11111
012345678901234
|Master Sets Data on pin 21
|
| |Slave Sets Data on pin 20
| |
| | |Master Reads Data on pin 20
| | |
| | | | Slave Reads Data on pin 21
| | | |
21____________________--------------------____________________----------
20---____________________--------------------____________________-------
| | | |
200.0 nS| 400.0 nS| 600.0 nS| 800.0 nS|
So the raw data throughput is on the order of 20 clocks/bit (full duplex),
about 4 Mbit/s.
On top of that is the CRC/framing, and then the acknowledge handshake to provide
a byte-by-byte synchronous transfer, which is the goal.
That is much slower than 4 Mbit/s; those are the throughput timing tests I will
clean up and publish in the near future.
Original comment by [email protected]
on 23 Jun 2011 at 5:08
First I'd like to clear up some confusion. I listed two code fragments in
comment #4. The cog executing a_mastermcs will be the one running 3 cycles
behind a_slavemcs. I assumed that a_mastermcs designates the master. Are you
saying that the roles are reversed?
As for the timing, I have to disagree (slightly) (Ir---- omitted,
muxc/djnz/test)
muxcdjnztest
M: ---SDeRSDeRSDeR--- (cog running a_mastermcs)
S: SDeRSDeRSDeR------
| | | |
0 1 2 3
0. R: slave updates pin
1. R: master updates pin
2. e: slave samples pin
3. e: master samples pin
i.e. the timing is a bit more relaxed.
Original comment by [email protected]
on 24 Jun 2011 at 12:27
Just thinking aloud here. Delay the slave by 4 cycles (or get both M & S
completely in sync). This gets us away (down) from the current 3 cycles
(anything is an improvement).
Then grab a counter (NCO) but leave it in idle mode (frqx = 0).
movi ctra, #%0_00100_000
The transfer loop will then look like this:
mov phsa, data ' send MSb
mov lcnt, #32 ' loop count
test mask, ina wc ' sample pin
rcl phsa, #1 ' store bit and send next one
djnz lcnt, #$-2 ' for all 32 bit
Even with the current delay (3) the timing doesn't change with the code above.
All you get is 12 cycles/bit instead of 20.
mov mov testrcl djnztestrcl
M: ---SDeRSDeRSDeRSDeRSDeRSDeRSDeR---
S: SDeRSDeRSDeRSDeRSDeRSDeRSDeR------
| | | | | | | | |
S: Tx | Rx | Tx | Rx | Tx
M: Tx | Rx Tx Rx
+-+-+
|
critical timing which can only improve when the master sends earlier
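To make the idea easier to follow: with the counter in NCO mode driving the pin from phsa bit 31, the test/rcl loop behaves like two 32-bit shift registers wired back to back. Below is a rough behavioral sketch in Python. It ignores the 3-cycle skew and all pin timing, and `exchange` is a hypothetical helper for illustration, not code from mcs.f:

```python
MASK32 = 0xFFFFFFFF  # cog registers are 32 bits wide

def exchange(master_word, slave_word, bits=32):
    """Model two shift registers clocked in lockstep over crossed data lines.

    Each step mirrors one pass of the test/rcl pair: sample the bit the other
    side is currently driving (its bit 31), then rotate it in at bit 0.
    """
    m = master_word & MASK32
    s = slave_word & MASK32
    for _ in range(bits):
        m_out = (m >> 31) & 1            # bit the master is driving
        s_out = (s >> 31) & 1            # bit the slave is driving
        m = ((m << 1) | s_out) & MASK32  # rcl: shift left, sampled bit into bit 0
        s = ((s << 1) | m_out) & MASK32
    return m, s

m, s = exchange(0xDEADBEEF, 0x12345678)
print(hex(m), hex(s))  # after 32 bits the two words have traded places
```

The point of the model is that one register serves as both transmit and receive buffer, which is why the shl/muxc pair drops out of the loop.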
Original comment by [email protected]
on 24 Jun 2011 at 12:51
Minor omission: the counter requires pin A to be set. Anyway, that's done only
once for the lifetime of the cog.
Original comment by [email protected]
on 24 Jun 2011 at 12:54
My mistake, I mis-labelled Master and Slave on the timing diagram, so I think
we agree, the correct diagram:
0 - Execute N-1 Fetch N muxc outa , v_pinout / djnz
1 - Write Result N-1 muxc outa , v_pinout
2 - Fetch Source N djnz _treg6 , # __Flp wz
3 - Fetch Dest N djnz _treg6 , # __Flp wz
4 - Execute N Fetch N+1 djnz _treg6 , # __Flp wz / test
6 - Write Result N djnz _treg6 , # __Flp wz
7 - Fetch Source N+1 test v_pinin , ina wc
8 - Fetch Dest N+1 test v_pinin , ina wc
9 - Execute N+1 Fetch N+2
10- Write Result N+1
11 - Fetch Source N+2
12 - Fetch Dest N+2
13 - Execute N+2 Fetch N+3
14 - Write Result N+2
          11111
012345678901234
|Slave Sets Data on pin 21
|
| |Master Sets Data on pin 20
| |
| | |Slave Reads Data on pin 20
| | |
| | | | Master Reads Data on pin 21
| | | |
21____________________--------------------____________________----------
20---____________________--------------------____________________-------
| | | |
200.0 nS| 400.0 nS| 600.0 nS| 800.0 nS|
The timing results including protocol:
Will update mcs.f to include the test code in a few days, once I verify the
test runs on the released code.
Assuming 80 MHz props, the raw wire transmission sends and receives a bit every
20 cycles, so the full-duplex raw bit rate is 4 Mbit/s.
Communication is packed into a 96-bit frame, which has 8 bytes, CRC, + flow
control. So the theoretical maximum number of frames per second is 41,667. But
since one cog is running both the flow control protocol and the wire protocol,
the real maximum throughput is about 21,000 frames per second. So about an
equal amount of time is spent on the protocol as on the wire.
If we factor the code to run on 2 cogs, this should double throughput.
Some real-world results from the test: the throughput per channel goes down from
10527 bytes/sec/channel with only one channel running to 9381 bytes/sec/channel
with 8 channels running. This is about half the frame rate, since each byte is
synchronously acknowledged.
The decline is due to the frame rate going down slightly as more time is spent
on protocol.
If comparing this to RS-232, factor the bit rates up by 25%, as a minimum of
one start bit and one stop bit is required for each byte for RS-232.
"test" loops the channels on the slave cog, so that each byte received is
transmitted. The display of cog? indicates the channels of the master cog are
routed to cog 5 channel 1; this is really an artifact of test routing them to a
dummy location for testing.
The xmt/rec code is written in assembler, to simulate the fastest possible
source/sink of bytes.
Prop0 Cog6 ok
test
a # # - set mcs pins
b # # - set mcs cogs
c # - set xmt/rec cog
d - start mcs
e # - set number active channels
f - stats
g - cog?
q - quit
d
CON:Prop0 Cog0 RESET - last status: 0 ok
CON:Prop0 Cog1 RESET - last status: 0 ok
CON:Prop0 Cog2 RESET - last status: 0 ok
Cog:0 #io chan:8 MCS 0(0)->5(1) 0(1)->5(1)
0(2)->5(1) 0(3)->5(1) 0(4)->5(1) 0(5)->5(1) 0(6)->5(1) 0(7)->5(1)
Cog:1 #io chan:8 MCS 1(0)->1(0) 1(1)->1(1)
1(2)->1(2) 1(3)->1(3) 1(4)->1(4) 1(5)->1(5) 1(6)->1(6) 1(7)->1(7)
Cog:2 #io chan:1 PropForth v4.5 2011MAY31 17:30 0
Cog:3 #io chan:1 PropForth v4.5 2011MAY31 17:30 0
Cog:4 #io chan:1 PropForth v4.5 2011MAY31 17:30 0
Cog:5 #io chan:1 PropForth v4.5 2011MAY31 17:30 0
Cog:6 #io chan:1 PropForth v4.5 2011MAY31 17:30 0 6(0)->7(0)
Cog:7 #io chan:1 SERIAL 7(0)->6(0)
Master Pin: 0
Slave Pin: 1
Master Cog: 0
Slave Cog: 1
Xmt/Rec Cog: 2
Master Errors: 0
Slave Errors: 0
Master frames/s: 21055
Master bps: 2021280
Slave frames/s: 21054
Slave bps: 2021184
Num Channels: 1
XMT byte/sec: 10527
XMT bits/sec: 84216
XMT byte/sec/ch: 10527
XMT bits/sec/ch: 84216
REC byte/sec: 10527
REC bits/sec: 84216
REC byte/sec/ch: 10527
REC bits/sec/ch: 84216
e 8
f
Master Pin: 0
Slave Pin: 1
Master Cog: 0
Slave Cog: 1
Xmt/Rec Cog: 2
Master Errors: 0
Slave Errors: 0
Master frames/s: 18763
Master bps: 1801248
Slave frames/s: 18763
Slave bps: 1801248
Num Channels: 8
XMT byte/sec: 75049
XMT bits/sec: 600392
XMT byte/sec/ch: 9381
XMT bits/sec/ch: 75048
REC byte/sec: 75048
REC bits/sec: 600384
REC byte/sec/ch: 9381
REC bits/sec/ch: 75048
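For anyone who wants to check the figures in this comment, here is a small Python sketch of the arithmetic. The clock, cycles/bit and frame size are the ones stated above, the measured values are copied from the one-channel stats, and the helper names are made up for illustration:

```python
CLOCK_HZ = 80_000_000   # 80 MHz props, as assumed in the comment
CYCLES_PER_BIT = 20     # raw wire loop
FRAME_BITS = 96         # 8 bytes + CRC + flow control

def raw_bit_rate(clock_hz=CLOCK_HZ, cycles_per_bit=CYCLES_PER_BIT):
    """Raw full-duplex wire rate in bit/s."""
    return clock_hz / cycles_per_bit

def max_frame_rate(bit_rate, frame_bits=FRAME_BITS):
    """Frame rate if the cog did nothing but shift bits."""
    return bit_rate / frame_bits

def rs232_equivalent(payload_bps):
    """Line rate an 8N1 RS-232 link needs for the same payload (10 bits per byte)."""
    return payload_bps * 10 / 8

raw = raw_bit_rate()
ceiling = max_frame_rate(raw)
measured_frames = 21055   # "Master frames/s", 1 channel
measured_payload = 84216  # "XMT bits/sec", 1 channel

print(f"raw bit rate:        {raw:,.0f} bit/s")
print(f"frame ceiling:       {ceiling:,.0f} frames/s")
print(f"wire share of time:  {measured_frames / ceiling:.0%}")
print(f"bytes/s at 1 ack per byte: {measured_frames / 2:,.1f}")
print(f"RS-232 equivalent:   {rs232_equivalent(measured_payload):,.0f} baud")
```

The wire share coming out near one half is what "about an equal amount of time is spent on the protocol as on the wire" refers to.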
Original comment by [email protected]
on 24 Jun 2011 at 12:58
And somehow you omitted cycle 5 in the diagram ...
Original comment by [email protected]
on 24 Jun 2011 at 1:02
Also - based on your diagram - ina is sampled during S (cycle 7). This is *not*
the case. Live registers (cnt, ina, phsx) are sampled during e (S+2).
Original comment by [email protected]
on 24 Jun 2011 at 1:22
Ok updated diagram, which should be correct, numerically and semantically :)
0 - Execute N-1 Fetch N muxc outa , v_pinout / djnz
1 - Write Result N-1 muxc outa , v_pinout
2 - Fetch Source N djnz _treg6 , # __Flp wz
3 - Fetch Dest N djnz _treg6 , # __Flp wz
4 - Execute N Fetch N+1 djnz _treg6 , # __Flp wz / test
5 - Write Result N djnz _treg6 , # __Flp wz
6 - Fetch Source N+1 test v_pinin , ina wc
7 - Fetch Dest N+1 test v_pinin , ina wc
8 - Execute N+1 Fetch N+2 Fetch live
9- Write Result N+1
10 - Fetch Source N+2
11 - Fetch Dest N+2
12 - Execute N+2 Fetch N+3
13 - Write Result N+2
          11111
012345678901234
|Slave Sets Data on pin 21
|
| |Master Sets Data on pin 20
| |
| | |Slave Reads Data on pin 20
| | |
| | | | Master Reads Data on pin 21
| | | |
21____________________--------------------____________________----------
20---____________________--------------------____________________-------
| | | |
200.0 nS| 400.0 nS| 600.0 nS| 800.0 nS|
Looking at the code in comments 8/9:
The critical timing components are unchanged, just with a shorter cycle time. I
cannot see why it would not work, so I will try, in order:
- change the send/rec loop, re-run the timing test and LogicAnalyzer traces
- try delaying the slave by 4 clock cycles
- re-run the timing tests and LogicAnalyzer
I expect the overall improvement including protocol stack should be around 20%,
which is worthwhile.
Should have time Saturday, will post results.
Original comment by [email protected]
on 24 Jun 2011 at 3:08
All changes done, went very smoothly. 21% overall throughput improvement, and
the sample is almost exactly in the middle of the bit time.
2 files attached:
mcs.f - changes + test program
LogicAnalyzer.f - bug fix, sampling at 1 cycle was failing
Results (feel free to find errors):
Under the Hood:
Assuming 80 MHz props, the raw wire transmission sends and receives a bit every
12 cycles, so the full-duplex raw bit rate is 6.7 Mbit/s.
Communication is packed into a 96-bit frame, which has 8 bytes, CRC, + flow
control. So the theoretical maximum number of frames per second is 69,444. But
since one cog is running both the flow control protocol and the wire protocol,
the real maximum throughput is about 22,730 frames per second. So about twice
as much time is spent on the protocol as on the wire.
If we factor the code to run on 2 cogs, this should increase throughput.
Some real-world results from the test: the throughput per channel goes down from
13298 bytes/sec/channel with only one channel running to 11364
bytes/sec/channel with 8 channels running. This is about half the frame rate,
since each byte is synchronously acknowledged.
The decline is due to the frame rate going down slightly as more time is spent
on protocol.
If comparing this to RS-232, factor the bit rates up by 25%, as a minimum of
one start bit and one stop bit is required for each byte for RS-232.
"test" loops the channels on the slave cog, so that each byte received is
transmitted. The display of cog? indicates the channels of the master cog are
routed to cog 5 channel 1; this is really an artifact of test routing them to a
dummy location for testing.
The xmt/rec code is written in assembler, to simulate the fastest possible
source/sink of bytes.
Prop0 Cog6 ok
test
a # # - set mcs pins
b # # - set mcs cogs
c # - set xmt/rec cog
d - start mcs
e # - set number active channels
f - stats
g - cog?
q - quit
d
CON:Prop0 Cog0 RESET - last status: 0 ok
CON:Prop0 Cog1 RESET - last status: 0 ok
CON:Prop0 Cog2 RESET - last status: 0 ok
Cog:0 #io chan:8 MCS 0(0)->5(1) 0(1)->5(1)
0(2)->5(1) 0(3)->5(1) 0(4)->5(1) 0(5)->5(1) 0(6)->5(1) 0(7)->5(1)
Cog:1 #io chan:8 MCS 1(0)->1(0) 1(1)->1(1)
1(2)->1(2) 1(3)->1(3) 1(4)->1(4) 1(5)->1(5) 1(6)->1(6) 1(7)->1(7)
Cog:2 #io chan:1 PropForth v4.5 2011MAY31 17:30 0
Cog:3 #io chan:1 PropForth v4.5 2011MAY31 17:30 0
Cog:4 #io chan:1 PropForth v4.5 2011MAY31 17:30 0
Cog:5 #io chan:1 PropForth v4.5 2011MAY31 17:30 0
Cog:6 #io chan:1 PropForth v4.5 2011MAY31 17:30 0 6(0)->7(0)
Cog:7 #io chan:1 SERIAL 7(0)->6(0)
Master Pin: 0
Slave Pin: 1
Master Cog: 0
Slave Cog: 1
Xmt/Rec Cog: 2
Master Errors: 0
Slave Errors: 0
Master frames/s: 26599
Master bps: 2553504
Slave frames/s: 26597
Slave bps: 2553312
Num Channels: 1
XMT byte/sec: 13298
XMT bits/sec: 106384
XMT byte/sec/ch: 13298
XMT bits/sec/ch: 106384
REC byte/sec: 13298
REC bits/sec: 106384
REC byte/sec/ch: 13298
REC bits/sec/ch: 106384
e 8
f
Master Pin: 0
Slave Pin: 1
Master Cog: 0
Slave Cog: 1
Xmt/Rec Cog: 2
Master Errors: 0
Slave Errors: 0
Master frames/s: 22730
Master bps: 2182080
Slave frames/s: 22729
Slave bps: 2181984
Num Channels: 8
XMT byte/sec: 90912
XMT bits/sec: 727296
XMT byte/sec/ch: 11364
XMT bits/sec/ch: 90912
REC byte/sec: 90912
REC bits/sec: 727296
REC byte/sec/ch: 11364
REC bits/sec/ch: 90912
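A quick cross-check of the numbers in this comment against the earlier 20 cycles/bit run, sketched in Python. All measured values are the "Master frames/s" lines from the two sets of stats in this thread; the function names are illustrative only:

```python
CLOCK_HZ = 80_000_000  # 80 MHz props
FRAME_BITS = 96        # 8 bytes + CRC + flow control

def frame_ceiling(cycles_per_bit, clock_hz=CLOCK_HZ, frame_bits=FRAME_BITS):
    """Frames/s if the cog did nothing but shift bits."""
    return clock_hz / cycles_per_bit / frame_bits

def improvement(new, old):
    """Fractional gain of new over old."""
    return new / old - 1.0

# Measured "Master frames/s", keyed by number of active channels.
old = {1: 21055, 8: 18763}  # 20 cycles/bit loop
new = {1: 26599, 8: 22730}  # 12 cycles/bit loop

print(f"frame ceiling at 12 cycles/bit: {frame_ceiling(12):,.0f} frames/s")
for channels in (1, 8):
    gain = improvement(new[channels], old[channels])
    print(f"{channels} channel(s): {gain:.1%} more frames/s")

# Time spent on protocol relative to time spent on the wire, 8 channels.
print(f"protocol/wire ratio: {frame_ceiling(12) / new[8] - 1:.2f}")
```

On these figures the 21% quoted above corresponds to the 8-channel run; the single-channel run gains somewhat more.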
Low level timing:
pin 21 is the slave pin
pin 20 is the master pin
LogicAnalyzer trace, sample every cycle
test v_pinin , ina wc
rcl phsa , # 1
djnz _treg6 , # __Flp wz
0 - Execute N-1 Fetch N rcl phsa , # 1 / djnz
1 - Write Result N-1 rcl phsa , # 1 / djnz
2 - Fetch Source N djnz _treg6 , # __Flp wz
3 - Fetch Dest N djnz _treg6 , # __Flp wz
4 - Execute N Fetch N+1 djnz _treg6 , # __Flp wz / test
5 - Write Result N djnz _treg6 , # __Flp wz
6 - Fetch Source N+1 test v_pinin , ina wc
7 - Fetch Dest N+1 test v_pinin , ina wc
8 - Execute N+1 Fetch N+2 FetchLiveReg test v_pinin , ina wc
9- Write Result N+1
10 - Fetch Source N+2
11 - Fetch Dest N+2
12 - Execute N+2 Fetch N+3
13 - Write Result N+2
          11111
012345678901234
|Master Sets Data on pin 20
|
||Slave Sets Data on pin 21
||
|| |Master Reads Data on pin 21
|| |
|| ||Slave Reads Data on pin 20
|| ||
21_________________------------____________------------____________------------
20________________-------------___________-------------
| | | |
200.0 nS| 400.0 nS| 600.0 nS| 800.0 nS| 001.0
Original comment by [email protected]
on 25 Jun 2011 at 3:41
Attachments:
          11111
012345678901234
|Master Sets Data on pin 20
|
||Slave Sets Data on pin 21
||
|| |Master Reads Data on pin 21
|| |
|| ||Slave Reads Data on pin 20
|| ||
21_________________------------____________------------____________-----
20________________-------------___________-------------___________------
| | | |
200.0 nS| 400.0 nS| 600.0 nS| 800.0 nS|
Original comment by [email protected]
on 25 Jun 2011 at 3:44
Nice! If you have the space you could unroll the loop (8 cycles/bit). Probably
not as much a change as going from 20 to 12 cycles though.
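A back-of-envelope estimate of what unrolling might buy, sketched in Python. The big assumption, which is mine and not measured, is that the per-frame protocol time stays fixed and only the wire time changes; the measured frame rates are the 8-channel figures from the stats above:

```python
CLOCK_HZ = 80_000_000  # 80 MHz props
FRAME_BITS = 96        # bits per frame

def wire_time(cycles_per_bit):
    """Seconds spent shifting one frame over the wire."""
    return FRAME_BITS * cycles_per_bit / CLOCK_HZ

def estimated_frame_rate(measured_fps, old_cpb, new_cpb):
    """Estimate frames/s after changing cycles/bit, protocol time held fixed."""
    protocol_time = 1.0 / measured_fps - wire_time(old_cpb)
    return 1.0 / (protocol_time + wire_time(new_cpb))

# Check the model against the change already measured (20 -> 12 cycles/bit).
predicted_12 = estimated_frame_rate(18763, 20, 12)
print(f"predicted at 12 cycles/bit: {predicted_12:,.0f} (measured 22730)")

# Apply the same model to an unrolled loop at 8 cycles/bit.
unrolled = estimated_frame_rate(22730, 12, 8)
print(f"estimated at 8 cycles/bit:  {unrolled:,.0f} frames/s")
print(f"estimated gain:             {unrolled / 22730 - 1:.1%}")
```

The model lands within about 1% of the measured 20 to 12 result, and for 12 to 8 it suggests a gain of around 12%, consistent with the expectation that unrolling helps less than the first change did.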
Original comment by [email protected]
on 26 Jun 2011 at 12:31
Original comment by [email protected]
on 26 Jun 2011 at 9:07
- Changed state: Started
- Added labels: Performance
Original comment by [email protected]
on 26 Jun 2011 at 9:41
Unfortunately there is insufficient space for unrolling the loop, maybe next
kernel release.
Original comment by [email protected]
on 27 Jun 2011 at 5:18
Changes have been incorporated into mcs for the release 4.6. About a 20%
overall performance increase on the protocol stack.
Suggest we close this issue.
Original comment by [email protected]
on 5 Jul 2011 at 2:49
Be my guest. I'm happy with the code now so I consider this issue resolved :)
Original comment by [email protected]
on 6 Jul 2011 at 1:07
Original comment by [email protected]
on 8 Jul 2011 at 6:53
- Changed state: Verified