spinnakermanchester / spinnaker_pdp2
Cognitive systems modelling on SpiNNaker
License: GNU General Public License v3.0
With the latest changes, the number of missing ticks reported by 'output_mon_lens' has increased significantly.
This appears to be a problem of SDP messages being dropped, not a problem with actual computation on SpiNNaker.
A delay was inserted before sending data to the host to try to reduce the problem [process_t.c/line 308]. This seems to be working but should only be a temporary measure. We may need to stop using output_mon_lens and use a different approach to report data.
Reported by @lplana
Replicated with: https://github.com/SpiNNakerManchester/IntegrationTests/tree/less_java
With Java enabled, it works as normal.
Currently, only the SDRAM region requirements are included. This may result in a run-time error if cores run out of SDRAM.
The following features of Lens-style training sets are currently not supported:
Some of these features are required for the example networks provided, such as the simple visual-semantic-phonological network visSemPhon.
Network congestion is present in large networks, and congestion leads to packet dropping.
Occasionally, dropped packets cannot be picked up by the reinjector and are permanently lost, leading to deadlock, as every packet is required to complete the computation.
The current implementation does not have a way of dealing with permanently lost packets.
This issue has also been raised in SpiNNakerManchester/SpiNNFrontEndCommon
Compiling the SpiNNaker_PDP2 C code with armcc produces the following error:
armcc -c --c99 --cpu=5te --apcs interwork --min_array_alignment=4 -I /home/plana/scratch/gfe_tests/pdp2/spinnaker_tools/include -Ofast -Wall -Wextra -DPRODUCTION_CODE -Otime -DAPPLICATION_NAME_HASH=0xa43c1b0a -g -o build/input.o input.c
Fatal error: C3900U: Unrecognized option '-all'.
Warning: C3910W: Old syntax, please use '-E'.
Fatal error: C3900U: Unrecognized option '-xtra'.
The errors seem to be caused by the gcc-style options '-Wall' and '-Wextra', which armcc appears to parse as '-W' followed by the unrecognized options '-all' and '-xtra'. Additionally, using both '-Ofast' and '-Otime' does not seem right.
The SpiNNaker_PDP2 Makefile does not set any compilation flags or options. It #includes 'Makefile.SpiNNFrontEndCommon'.
Compilation completes correctly with arm-none-eabi-gcc.
Currently, a w core is created in the machine graph for every possible (group, group) pair, even if a link between the pair does not exist. This results in an all-zero weight matrix, which is wasteful. Unfortunately, these cores cannot simply be removed, because they contribute to system synchronisation.
Currently, each group is transformed into a single [w, s, i, t] core pipeline, irrespective of the group size (in terms of units). This will not scale to any arbitrary size.
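One way to scale beyond a single pipeline per group would be to partition each group's units into subgroups, each mapped to its own [w, s, i, t] core pipeline. A minimal sketch of such a partition (the helper name and per-core limit are illustrative, not the actual PDP2 mapping code):

```python
# Sketch: partition a group's units into subgroups that each fit on
# one [w, s, i, t] core pipeline. Names and the per-core limit are
# illustrative, not part of the current PDP2 implementation.
def split_group(num_units, max_units_per_core):
    """Return (start, end) unit ranges, one per core pipeline."""
    return [(start, min(start + max_units_per_core, num_units))
            for start in range(0, num_units, max_units_per_core)]
```

Each (start, end) range would then be handed to its own pipeline at graph-construction time.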
It is possible to test implementation correctness by comparing the output files generated by the examples in the repository with reference output files.
The reference files are attached here. Please note that the extension has been changed from '.out' to '.txt' due to repository file type restrictions.
example rand10x40:
REF_rand10x40_test_20e.txt
REF_rand10x40_train.txt
REF_rand10x40_train_test_20e.txt
example rogers-basic:
REF_rogers-all-links.txt
example simple_past_tense:
REF_simple_past_tense_train_test.txt
Currently, a global timeout is used, which assumes that there is an upper bound on program execution time. This is not adequate.
A better alternative is to time out on lack of progress. Given that all cores need to send and receive packets "continuously", an upper bound can be set for lack of progress. This has to be done carefully as both deadlock and livelock must be catered for.
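The idea can be sketched as a watchdog that is reset on progress rather than armed with a global deadline. This is a host-side Python model only; the class name and API are hypothetical, and "progress" must be defined carefully (e.g. completed simulation ticks rather than raw packet counts) so that livelock, where packets keep flowing without progress, also trips the timeout:

```python
import threading
import time

class ProgressWatchdog:
    """Sketch: time out on lack of progress instead of total run time.

    Hypothetical helper, not PDP2 API. progress() should be called on
    each unit of real progress (e.g. a completed tick); a stall longer
    than stall_limit then indicates deadlock or livelock.
    """
    def __init__(self, stall_limit):
        self.stall_limit = stall_limit          # seconds without progress
        self._last = time.monotonic()
        self._lock = threading.Lock()

    def progress(self):
        """Record that real progress has been made."""
        with self._lock:
            self._last = time.monotonic()

    def stalled(self):
        """True if no progress has been reported within stall_limit."""
        with self._lock:
            return time.monotonic() - self._last > self.stall_limit
```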
After the update to align PDP2 with SpiNNTools version 6, PacketGatherer-related warnings appear when running the examples. Some of the warnings are included below.
Selected warnings:
2021-05-05 20:36:35 WARNING: The transmission buffer for SYSTEM:PacketGatherer(0,0) on 0,0,2 was blocked on 402924800 occasions. This is often a sign that the system is experiencing back pressure from the communication fabric. Please either: 1. spread the load over more cores, 2. reduce your peak transmission load, or 3. adjust your mapping algorithm.
2021-05-05 20:36:35 WARNING: The callback queue for SYSTEM:PacketGatherer(0,0) on 0,0,2 overloaded on 2560 occasions. This is often a sign that the system is running too quickly for the number of neurons per core. Please increase the machine time step or time_scale_factor or decrease the number of neurons per core.
2021-05-05 20:36:35 WARNING: The DMA queue for SYSTEM:PacketGatherer(0,0) on 0,0,2 overloaded on 278530 occasions. This is often a sign that the system is running too quickly for the number of neurons per core. Please increase the machine time step or time_scale_factor or decrease the number of neurons per core.
2021-05-05 20:36:35 WARNING: A Timer tick callback in SYSTEM:PacketGatherer(0,0) on 0,0,2 was still executing when the next timer tick callback was fired off 71565312 times. This is a sign of the system being overloaded and therefore the results are likely incorrect. Please increase the machine time step or time_scale_factor or decrease the number of neurons per core
2021-05-05 20:36:35 WARNING: The timer for SYSTEM:PacketGatherer(0,0) on 0,0,2 fell behind by up to 402655296 ticks. This is a sign of the system being overloaded and therefore the results are likely incorrect. Please increase the machine time step or time_scale_factor or decrease the number of neurons per core
We have had to set the integration test on simple_past_tense.py to convert SpinnmanTimeoutException or SpiNNManCoresNotInStateException to SkipTest, as this happens much too often.
Currently, the only mechanism supported is to read a Lens weights file.
Currently, routing keys are assigned by the GFE using the SpiNNTools default key assignment algorithm. This works correctly but a targeted algorithm could result in a more efficient use of the key space.
Currently, each vertex requests a part of the key space in which to indicate the unit being processed and, additionally, encode functionality. The added features normally include: packet type and colour, execution phase and group/subgroup data. These could be encoded efficiently in the routing key, saving key space and also saving decoding effort in the receiving core.
The assignment must be done carefully so that packets are sent only to where they are needed, as is done currently. This requires correct key/mask combinations.
Two possible approaches were suggested by the SpiNNaker software team, each with pros and cons:
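Whichever approach is chosen, a targeted encoding could pack the extra features at fixed bit positions of the key, so receiving cores decode them with a mask instead of arithmetic. The field widths below are purely illustrative, not the actual PDP2/SpiNNTools assignment:

```python
# Sketch of a targeted key layout: group id plus per-packet fields
# packed at fixed bit positions. Widths are illustrative only.
PHASE_BITS, COLOUR_BITS, TYPE_BITS, UNIT_BITS = 1, 1, 2, 12

def make_key(group, phase, colour, pkt_type, unit):
    """Pack group id and per-packet fields into one multicast key."""
    key = group
    key = (key << PHASE_BITS) | phase
    key = (key << COLOUR_BITS) | colour
    key = (key << TYPE_BITS) | pkt_type
    key = (key << UNIT_BITS) | unit
    return key

def unit_of(key):
    """Receiving cores recover the unit with a single mask."""
    return key & ((1 << UNIT_BITS) - 1)
```

With fields at fixed positions, the same layout yields the key/mask combinations needed to route packets only where they are needed.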
The binaries are currently stored in a directory that is not inside (i.e., not a child directory of) the main spinn_pdp2 directory.
This results in the code only working if installed in developer/editable mode.
In developer/editable mode, the spinn_pdp2 directory is only referenced, not copied into site-packages, therefore the code
# path to binary files
binaries_path = os.path.join(os.path.dirname(__file__), "..", "binaries")
works.
In a normal install, the spinn_pdp2 directory is copied into site-packages; however, the "binaries" directory is not.
It could be, but with a generic name like "binaries" this is not recommended.
Also, the build would fail if you need sudo access to site-packages.
The PyPA recommends that any data files you wish to be accessible at run time be included inside the package.
ref: https://setuptools.pypa.io/en/latest/userguide/datafiles.html
The fix is to move binaries under spinn_pdp2 and change the code that references it.
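With the binaries moved to spinn_pdp2/binaries, the path is resolved inside the package itself and so survives both editable and normal installs. A minimal sketch (the helper name is illustrative):

```python
import os

# Sketch of the proposed fix: once the binaries live under
# spinn_pdp2/binaries, the path is package-relative and works for
# editable and normal installs alike. Helper name is illustrative.
def binaries_path(module_file):
    """module_file is the __file__ of a module inside spinn_pdp2."""
    return os.path.join(os.path.dirname(os.path.abspath(module_file)),
                        "binaries")
```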
From pull request #29, output_mon_lens is no longer used. The generation of Lens-style output files is handled by function write_Lens_output_file() in mlp_networks.py.
Group types such as BIAS (bias clamp) or INPUT (hard clamp) do not require a [w, s, i, t] core pipeline. They can be optimised.
The arrival of the following packets is not verified before moving to the next processing tick:
Arrival of these packets is difficult to verify due to their multicast/broadcast nature.
These packets are transmitted during periods of quiet network traffic, so they are unlikely to be dropped and, if dropped, are very likely to be picked up and successfully reinjected.
Lens supports the use of different example sets for training and testing. Also, multiple sets can be used in each stage.
Often, testing is done on a different example set or sets (possibly a subset of the original, or sometimes something completely new, if generalisation performance is being assessed).
Lens has an option for loading all the example sets at the beginning and then switching between them, which is generally more efficient.
The MAX_CRIT group criterion is described in the Lens Manual.
This group criterion, implemented in function max_stop_crit() [file process_t.c] needs re-writing as the current implementation only works correctly if only one unit has the largest target, which is not usually the case.
Additionally, as this criterion is based on group-wide values rather than on individual unit ones, the function needs to make a correct distributed decision when the output group is split across multiple subgroups.
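A tie-aware version of the check can be sketched as follows. This is a Python model of one reasonable interpretation only, namely that the criterion is met when the unit(s) with the largest output are among the units sharing the largest target; the real implementation is max_stop_crit() in process_t.c and must additionally make a distributed decision across subgroups:

```python
# Sketch of a tie-aware MAX_CRIT check (Python model, illustrative
# interpretation: the criterion is met when every unit with the largest
# output is among the units sharing the largest target).
def max_stop_crit(outputs, targets):
    max_t = max(targets)
    max_o = max(outputs)
    winners = {i for i, t in enumerate(targets) if t == max_t}
    top_out = {i for i, o in enumerate(outputs) if o == max_o}
    return top_out <= winners  # all top outputs sit on max-target units
```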
@joannavioletmoy reports result differences with respect to Lens when training network rand10x40 for 300 epochs, testing after every 10 epochs.
Although differences are expected due to fixed-point (PDP2) vs double (Lens) numeric representation, further verification is needed because implementation issues could also be the cause.
Weight fixed-point representation was changed to s16.5. With larger weights, partial nets (s4.27 representation) can get outside the [-16.0, 16.0) range. They may need a longer type and saturation.
This may also be the case for the backpropagation of error deltas.
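The required saturation can be modelled as clamping the raw accumulator instead of letting it wrap. This is a Python sketch of the arithmetic only; the widths come from the s4.27 format mentioned above (4 integer + 27 fraction bits plus sign, raw range [-2^31, 2^31 - 1], i.e. approximately [-16.0, 16.0)):

```python
# Python model of saturating addition for the s4.27 partial-net format.
INT_BITS, FRAC_BITS = 4, 27
RAW_MAX = (1 << (INT_BITS + FRAC_BITS)) - 1   # just under +16.0
RAW_MIN = -(1 << (INT_BITS + FRAC_BITS))      # exactly -16.0

def sat_add(a, b):
    """Add two raw s4.27 values, clamping instead of wrapping."""
    return max(RAW_MIN, min(RAW_MAX, a + b))
```

On ARM, the equivalent in C would use a wider intermediate type (or the DSP saturating instructions) before clamping back to 32 bits.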