nvlabs / timeloop

Timeloop performs modeling, mapping and code-generation for tensor algebra workloads on various accelerator architectures.

Home Page: https://timeloop.csail.mit.edu/

License: BSD 3-Clause "New" or "Revised" License

Python 3.33% C++ 96.45% Shell 0.23%

timeloop's Introduction

Timeloop

About

Timeloop is an infrastructure that aims to provide modeling, mapping and code-generation for dense- and sparse-tensor algebra workloads on a range of accelerator architectures. It is built from two modular components:

  • A fast analytical model that can emulate a range of architecture designs and provide performance and energy projections
  • A mapper that searches the space of mappings of a tensor-algebra problem on a given architecture for an optimal mapping

Documentation

Timeloop documentation is hosted at https://timeloop.csail.mit.edu/. The guides there cover installation, usage and examples. For a deeper understanding of Timeloop's internals, please read our ISPASS 2019 paper.

Timeloop version 2.0 (a.k.a. Sparseloop) provides stochastic modeling of compressed-sparse tensor algebra. This work is described in our MICRO 2022 paper.

Timeloop version 3.0 (a.k.a. Ruby) adds support for imperfectly-factorized mappings (described in our ISPASS 2022 paper), in addition to support for spatial skews and flattened mappings.

Tutorial

New users are strongly encouraged to complete the Timeloop tutorial. Working through the exercises in order serves as an essential hands-on introduction to the tool.

Dependencies

Timeloop depends on the isl and barvinok libraries. In particular, barvinok version 0.41.6 (along with its pre-packaged isl library) has been tested to build successfully with this version of Timeloop. Instructions for installing barvinok can be found at this link.

timeloop's People

Contributors

angshuman-parashar, charleshong3, gilbertmike, hqjenny, hqjennynv, ianboyanzhang, jsemer, kaustubhdighe, lpentecost-nvidia, nellie-wu, poant, rengzhengcodes, tanner-andrulis, todemuyiwa-nvidia

timeloop's Issues

mapped instances exceeds available hardware instances

Hi, I've come across an issue similar to "Exceeds buffer size", shown below. I'm also attaching my config file. Could you help explain how the mapped instance count of 588 is computed?

ERROR: couldn't map level PsumRegFile: mapped instances 588 exceeds available hardware instances 256.

arch : 
{
  arithmetic : 
  {
    name = "MACs";
    instances = 256;
    word-bits = 16;
    meshX = 16;
  };
  storage = ( 
    {
      name = "PsumRegFile";
      entries = 16;
      instances = 256;
      meshX = 16;
      word-bits = 16;
      read_bandwidth = 2;
      write_bandwidth = 2;
    }, 
    {
      name = "WeightRegFile";
      entries = 192;
      instances = 256;
      meshX = 16;
      word-bits = 16;
      read_bandwidth = 2;
      write_bandwidth = 2;
    }, 
    {
      name = "InputRegFile";
      entries = 12;
      instances = 256;
      meshX = 16;
      word-bits = 16;
      read_bandwidth = 2;
      write_bandwidth = 2;
    }, 
    {
      name = "DummyBuffer";
      entries = 0;
      instances = 16;
      meshX = 16;
      word-bits = 16;
    }, 
    {
      name = "GlobalBuffer";
      sizeKB = 128;
      instances = 1;
      meshX = 1;
      word-bits = 16;
      block-size = 4;
      read_bandwidth = 16;
      write_bandwidth = 16;
    }, 
    {
      name = "DRAM";
      technology = "DRAM";
      instances = 1;
      word-bits = 16;
    } );
};

problem : 
{
  R = 7;
  S = 7;
  P = 112;
  Q = 112;
  C = 3;
  K = 64;
  N = 1;
  Wstride = 2;
  Hstride = 2;
};

mapping = (
    {
      target = 0;
      type = "datatype";
      keep = [ "Outputs" ];
      bypass = [ "Weights", "Inputs" ];
    }, 
    {
      target = 1;
      type = "datatype";
      keep = [ "Weights" ];
      bypass = [ "Inputs", "Outputs" ];
    }, 
    {
      target = 2;
      type = "datatype";
      keep = [ "Inputs" ];
      bypass = [ "Weights", "Outputs" ];
    }, 
    {
      target = 3;
      type = "datatype";
      keep = [ ];
      bypass = [ "Weights", "Inputs", "Outputs" ];
    }, 
    {
      target = 4;
      type = "datatype";
      keep = [ "Inputs", "Outputs" ];
      bypass = [ "Weights" ];
    }, 
    {
      target = 5;
      type = "datatype";
      keep = [ "Weights", "Inputs", "Outputs" ];
      bypass = [ ];
    }, 
    {
      target = 0;
      type = "temporal";
      factors = "R1 S1 P1 Q1 C1 K16 N1";
      permutation = "KRSPQCN";
    }, 
    {
      target = 1;
      type = "temporal";
      factors = "R7 S1 P1 Q1 C1 K1 N1";
      permutation = "RSPQCKN";
    }, 
    {
      target = 2;
      type = "temporal";
      factors = "R1 S1 P1 Q1 C1 K1 N1";
      permutation = "RSPQCKN";
    }, 
    {
      target = 3;
      type = "spatial";
      factors = "R1 S7 P1 Q1 C3 K1 N1";
      permutation = "SKRPQCN";
      split = 0;
    }, 
    {
      target = 3;
      type = "temporal";
      factors = "R1 S1 P1 Q1 C1 K1 N1";
      permutation = "RSPQCKN";
    }, 
    {
      target = 4;
      type = "spatial";
      factors = "R1 S1 P1 Q28 C1 K1 N1";
      permutation = "QKRSPCN";
      split = 2;
    }, 
    {
      target = 4;
      type = "temporal";
      factors = "R1 S1 P112 Q1 C1 K4 N1";
      permutation = "PRSQCKN";
    }, 
    {
      target = 5;
      type = "temporal";
      factors = "R1 S1 P1 Q4 C1 K1 N1";
      permutation = "CQKRSPN";
    }
);
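For what it's worth, the 588 appears to be reproducible from the spatial factors in the mapping above, under my assumption (not an official statement of Timeloop's rule) that the mapped instance count at a level is the product of all spatial fanout factors at levels above it:

```python
# Back-of-the-envelope check (assumption: mapped instances at a level =
# product of all spatial fanouts at levels above it in the mapping).
spatial_fanouts = {
    3: 7 * 3,  # target 3 (DummyBuffer level): S7 * C3 = 21
    4: 28,     # target 4 (GlobalBuffer level): Q28
}

mapped_psum_instances = 1
for fanout in spatial_fanouts.values():
    mapped_psum_instances *= fanout

print(mapped_psum_instances)  # 588, exceeding the 256 available PsumRegFile instances
```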

Possible to eyeball the type of dataflow from the config file?

Hi there,

It's really nice that in Timeloop we can implement different types of dataflow, such as weight / output / row stationary. I was wondering if there's a clear sign that tells me which dataflow a config file corresponds to? For example, the one below from tutorial exercise 4 is cp-ws. It's pretty clear that it is partitioned spatially along the C dimension, but which key factors determine that it keeps the weights around in the register file? Put another way, how should I go about changing it to output / row stationary? Thanks!

  - target: MainMemory
    type: temporal
    factors: R=1 P=1 K=1 C=1
    permutation: PRKC
    
  - target: GlobalBuffer
    type: temporal
    factors: R=3 P=1 K=32 C=2
    permutation: PRKC

  - target: GlobalBuffer
    type: spatial
    factors: R=1 P=1 K=1 C=16
    permutation: PRKC
    
  - target: RegisterFile
    type: temporal
    factors: R=1 P=16 K=1 C=1
    permutation: RPKC
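One heuristic sketch of how to "eyeball" stationarity (my assumption, not an official Timeloop rule): a tensor is kept stationary at a level when none of the dimensions iterated by that level's temporal loops appear in the tensor's projection. Applied to the cp-ws mapping above:

```python
# Tensor projections for a conv1d+oc+ic-style problem (dimensions R, P, K, C).
projections = {
    "Weights": {"R", "K", "C"},
    "Inputs":  {"R", "P", "C"},
    "Outputs": {"P", "K"},
}

# Innermost (RegisterFile) temporal factors from the cp-ws mapping above.
regfile_factors = {"R": 1, "P": 16, "K": 1, "C": 1}
iterated = {dim for dim, f in regfile_factors.items() if f > 1}

# A tensor is "stationary" here if the iterated dims miss its projection.
stationary = [t for t, proj in projections.items() if not (proj & iterated)]
print(stationary)  # ['Weights'] -> weight-stationary at the register file
```

By the same reading, making it output-stationary would mean choosing innermost factors that iterate only dimensions absent from the Outputs projection (R and C here).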

Handling unspecified block-size and cluster-size

If a user specifies the word_size and row_width for a buffer, Timeloop automatically assigns the vector width (row_width / word_size) to cluster_size, with block_size being 1. Instead, the vector width should be assigned to block_size (which is more intuitive), with cluster_size defaulting to 1. A warning should be emitted when Timeloop makes this choice.
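The proposed defaulting rule can be sketched as follows (illustrative only; derive_widths is a hypothetical helper, not a Timeloop function):

```python
import warnings

def derive_widths(word_size_bits, row_width_bits):
    # Proposed behavior: vector width goes to block_size, cluster_size defaults to 1.
    block_size = row_width_bits // word_size_bits
    cluster_size = 1
    warnings.warn("block-size derived as row_width/word_size; cluster-size defaulted to 1")
    return block_size, cluster_size

print(derive_widths(16, 64))  # (4, 1)
```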

Scons build error with gcc 9.3/7.5

Hello,

When trying to build accelergy and/or timeloop with gcc 9.3 (whether with scons --accelergy or scons -j4), I get an error:

scons: Reading SConscript files ...
Using dynamic linking.
scons: done reading SConscript files.
scons: Building targets ...
g++ -o build/mapspaces/mapspace-base.o -c -g -O3 -Werror -Wall -Wextra -fmax-errors=1 -std=c++14 -pthread -DBUILD_BASE_DIR=\"/home/egiacomi/timeloop-dev/timeloop\" -DUSE_ACCELERGY -Ibuild/src/include -Isrc/src/include -Ibuild -Isrc src/mapspaces/mapspace-base.cpp
In file included from src/mapspaces/mapspace-base.cpp:28:
src/mapspaces/mapspace-base.hpp:30:10: fatal error: boost/multiprecision/cpp_int.hpp: No such file or directory
   30 | #include <boost/multiprecision/cpp_int.hpp>
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
scons: *** [build/mapspaces/mapspace-base.o] Error 1
scons: building terminated because of errors.

When switching to gcc 7.5 (I did not try other versions), scons --accelergy runs successfully.
However, I still get an error when running scons -j4:

g++ -o build/applications/model/main.o -c -g -O3 -Werror -Wall -Wextra -fmax-errors=1 -std=c++14 -pthread -DBUILD_BASE_DIR=\"/home/egiacomi/timeloop-dev/timeloop\" -Ibuild/src/include -Isrc/src/include -Ibuild -Isrc src/applications/model/main.cpp
In file included from src/applications/metrics/main.cpp:32:0:
src/applications/metrics/metrics.hpp: In constructor ‘Application::Application(config::CompoundConfig*)’:
src/applications/metrics/metrics.hpp:63:32: error: ‘arch.config::CompoundConfigNode::cConfig’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
   config::CompoundConfigNode arch;
                              ^~~~
cc1plus: all warnings being treated as errors
scons: *** [build/applications/metrics/main.o] Error 1
scons: building terminated because of errors.

The build runs successfully if I remove the "-Werror" flag in the SConscript file, but I just wanted to let you know. Maybe an older gcc version (5?) is required to compile the code? If so, it would be worth mentioning in the README.

Small typo in README

Hello,

Just wanted to report a small typo in the readme:

cd configs/timeloop
../../build/timeloop-mapper ./sample.yaml > sample.out

Should be:

cd configs/mapper
../../build/timeloop-mapper ./sample.yaml > sample.out

couldn't map level DummyBuffer

Hi there,

I got the following error with the config file below. Could you help explain what the DummyBuffer does and where this constraint violation comes from?

ERROR: couldn't map level DummyBuffer: mapped instances 28 exceeds available hardware instances 16

arch : 
{
  arithmetic : 
  {
    name = "MACs";
    instances = 256;
    word-bits = 16;
    meshX = 16;
  };
  storage = ( 
    {
      name = "PsumRegFile";
      entries = 16;
      instances = 256;
      meshX = 16;
      word-bits = 16;
      read_bandwidth = 2;
      write_bandwidth = 2;
    }, 
    {
      name = "WeightRegFile";
      entries = 192;
      instances = 256;
      meshX = 16;
      word-bits = 16;
      read_bandwidth = 2;
      write_bandwidth = 2;
    }, 
    {
      name = "InputRegFile";
      entries = 12;
      instances = 256;
      meshX = 16;
      word-bits = 16;
      read_bandwidth = 2;
      write_bandwidth = 2;
    }, 
    {
      name = "DummyBuffer";
      entries = 0;
      instances = 16;
      meshX = 16;
      word-bits = 16;
    }, 
    {
      name = "GlobalBuffer";
      sizeKB = 128;
      instances = 1;
      meshX = 1;
      word-bits = 16;
      block-size = 4;
      read_bandwidth = 16;
      write_bandwidth = 16;
    }, 
    {
      name = "DRAM";
      technology = "DRAM";
      instances = 1;
      word-bits = 16;
    } );
};

problem : 
{
  R = 7;
  S = 7;
  P = 112;
  Q = 112;
  C = 3;
  K = 64;
  N = 1;
  Wstride = 2;
  Hstride = 2;
};

mapping = (
    {
      # PsumRegFile
      target = 0;
      type = "datatype";
      keep = [ "Outputs" ];
      bypass = [ "Weights", "Inputs" ];
    },
    {
      # WeightRegFile
      target = 1;
      type = "datatype";
      keep = [ "Weights" ];
      bypass = [ "Inputs", "Outputs" ];
    }, 
    {
      # InputRegFile
      target = 2;
      type = "datatype";
      keep = [ "Inputs" ];
      bypass = [ "Weights", "Outputs" ];
    }, 
    {
      # DummyBuffer
      target = 3;
      type = "datatype";
      keep = [ ];
      bypass = [ "Weights", "Inputs", "Outputs" ];
    }, 
    {
      # GlobalBuffer
      target = 4;
      type = "datatype";
      keep = [ "Inputs", "Outputs" ];
      bypass = [ "Weights" ];
    }, 
    {
      # DRAM
      target = 5;
      type = "datatype";
      keep = [ "Weights", "Inputs", "Outputs" ];
      bypass = [ ];
    }, 
    {
      # PsumRegFile
      target = 0;
      type = "temporal";
      factors = "R1 S1 P1 Q1 C1 K16 N1";
      permutation = "KRSPQCN";
    }, 
    {
      # WeightRegFile
      target = 1;
      type = "temporal";
      factors = "R7 S1 P1 Q1 C1 K1 N1";
      permutation = "CRSPQKN";
    }, 
    {
      # InputRegFile
      target = 2;
      type = "temporal";
      factors = "R1 S1 P1 Q1 C1 K1 N1";
      permutation = "RSPQCKN";
    }, 
    {
      # DummyBuffer
      target = 3;
      type = "spatial";
      factors = "R1 S7 P1 Q1 C1 K1 N1";
      permutation = "CSKRPQN";
      split = 0;
    }, 
    {
      # DummyBuffer
      target = 3;
      type = "temporal";
      factors = "R1 S1 P1 Q1 C1 K1 N1";
      permutation = "RSPQCKN";
    }, 
    {
      # GlobalBuffer
      target = 4;
      type = "spatial";
      factors = "R1 S1 P1 Q28 C1 K1 N1";
      permutation = "QKRSPCN";
      split = 2;
    }, 
    {
      # GlobalBuffer
      target = 4;
      type = "temporal";
      factors = "R1 S1 P16 Q1 C1 K2 N1";
      permutation = "KPRSQCN";
    }, 
    {
      # DRAM
      target = 5;
      type = "temporal";
      factors = "R1 S1 P7 Q4 C3 K2 N1";
      permutation = "QKCPRSN";
    }
);
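For what it's worth, the 28 in the error seems to follow from the mapping above, under my assumption (not an authoritative answer) that the mapped instance count at a level is the product of spatial fanouts at levels above it. DummyBuffer sits below GlobalBuffer, whose spatial split Q28 demands 28 instances of every level beneath it, while the architecture only provides 16 DummyBuffers:

```python
# Hypothetical check mirroring the error message above.
globalbuffer_spatial_fanout = 28  # Q28 spatial split at target 4 (GlobalBuffer)
dummybuffer_instances = 16        # from the arch spec: instances = 16

print(globalbuffer_spatial_fanout > dummybuffer_instances)  # True -> mapping rejected
```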

confusion with MeshX and clarity on split commands

MeshX:

From the tutorial slides I understand that meshX is used to divide the total number of components across the X and Y dimensions.
For example, on slide 41 of the tutorial (exercise 6), meshX is used to arrange RegFile[0..11] into a 3x4 XY grid. But in the provided Eyeriss architecture .yaml file (attached image below), meshX appears inside the attributes of the local InputRegFile/WeightRegFile/PsumRegFile but not in the PE attributes. Does that mean meshX=14 is applied to each individual InputRegFile/WeightRegFile/PsumRegFile? If so, how are the 256 PEs split across the XY dimensions?

(attached image: eyeriss_Arch_snip)
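My assumed interpretation (not an authoritative answer): instances gives the total count of a component, and meshX fixes its extent in the X dimension, so the Y extent is instances / meshX. A sketch, with mesh_shape a hypothetical helper:

```python
def mesh_shape(instances, meshX):
    # Y extent is whatever remains after fixing the X extent.
    assert instances % meshX == 0
    return meshX, instances // meshX

print(mesh_shape(256, 16))  # (16, 16): 256 register files as a 16x16 grid
print(mesh_shape(12, 3))    # (3, 4): RegFile[0..11] arranged 3x4 as in the tutorial
```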

Split:

What does split do in the constraint file? The explanation is missing in the video.

  • target: DummyBuffer
    type: spatial
    split: 4
    permutation: NPQRSCM
    factors: N=1 P=1 Q=1 R=1 S=0
    # only allow fanout of M, Q out from glb
  • target: shared_glb
    type: spatial
    split: 7
    permutation: NCPRSQM
    factors: N=1 C=1 P=1 R=1 S=1
    # one ofmap position but of different output channels

Thank you.

Timeloop-mapper + Plug-In Table

Hi,

Thank you for offering this tool. I am writing because I hit a problem when using plug-in tables with timeloop-mapper. I can define my memory read/write values, and Accelergy correctly generates the ERT table; however, timeloop-mapper then uses only the write energy to compute the per-scalar-access energy. This is odd because that single value is used for Scalar Reads, Scalar Updates, and Scalar Fills, which involve both reading and writing. I can see this in timeloop-mapper.stats, where for the Global Buffer (the memory I am modifying) only the write energy appears in "Vector access energy : 1.20 pJ".

I could just use the counting and my energy values to get the total energy, but I will not be sure that the dataflow is optimal for my memory parameters, given that only one energy value was used for the optimization.

Please find attached a zipped file with the complete setup of the experiment. It can be run by running the code:
timeloop-mapper arch/system.yaml arch/components/*.yaml mapper/mapper.yaml constraints/*.yaml example_layer.yaml

The Plug-in table is in the folder EM_tables, and in the generated ERT file (timeloop-mapper.ERT_summary), you will see that the read and write energy are correctly defined.

Please consider that to run the example you will have to add the EM_tables to your config file (accelergy_config) that is located at:
\Docker\workspace\.config\accelergy

I would want to know if there is a way to include the read and write energy in the optimization process of Timeloop-mapper.

Thanks,

Jorge

Test_Read_Write.zip

hardware extension

Hi, I can see that the /arch/ directory within Timeloop contains descriptions of the hardware architecture. However, I do not see any details about latency and energy; how can we get the cycle count? Also, if I want to extend the hardware, how should I modify the files?

Thanks.

Is it possible to evaluate TPU-like systolic array?

Hi, thank you very much for offering this wonderful tool. I would like to know whether it is possible to evaluate a TPU-like systolic array, such as the output-stationary, weight-stationary and input-stationary dataflows in the paper 'A Systematic Methodology for Characterizing Scalability of DNN Accelerators using SCALE-Sim' (ISPASS 2020).
Please note that the output-stationary and weight-stationary dataflows here differ from those defined in the original Eyeriss paper; these dataflows are specific to GEMM computation.
I want to get the energy and performance of a particular dataflow on a specific architecture. How should I define the constraints on the mapspace to get these dataflows?
Thank you very much!

best mapping found by mapper

Hello,

When working on the timeloop and timeloop-accelergy exercise, I have a question for the best mapping found by the mapper.
As far as I know, the mapper uses multi-threading to reduce Timeloop's run time, so I assume Timeloop picks the best value among the results of the threads. The table below shows the mapper's output.
The mapper selected the result of TID 0 (7.405 pJ) as the best mapping. However, as you can see from the results below, the Opt.energy of TID 1 (6.045 pJ), TID 2 (6.372 pJ), and TID 7 (6.228 pJ) are all lower than the best mapping found by the mapper.
Could you tell me why the mapper selected the result of TID0 as the best mapping?


================================================================================
                                TIMELOOP MAPPER
================================================================================
TID      Total    Invalid      Valid    Consec.       Last   Opt.util Opt.energy
                                        invalid     update
--------------------------------------------------------------------------------
- 0       7579       6837        742          0        500    100.00%      7.405
- 1       5716       5193        523          0        500    100.00%      6.045
- 2       8820       8199        621          0        500    100.00%      6.372
- 3       8804       8114        690          0        500    100.00%     16.086
- 4       7167       6586        581          0        500    100.00%     17.626
- 5       6600       6069        531          0        500    100.00%     12.765
- 6       9461       8951        510          0        500    100.00%     12.759
- 7       9426       8918        508          0        500    100.00%      6.228

Summary stats for best mapping found by mapper:
  Utilization = 1.00 | pJ/MACC =    7.405

Timeloop crashes when "diagnostics: True" in the mapper.yaml file

Hello,

I am currently using the latest Timeloop commit, and when setting "diagnostics: True" in the mapper file, Timeloop crashes:

`================================================================================
TIMELOOP MAPPER

TID Total Invalid Valid Consec. Last Opt.util Opt.energy
invalid update

0 1 1 0 1 0
1 4 4 0 4 0terminate called after throwing an instance of 'std::out_of_range' 4 0
3 4 4 0 terminate called recursively
4 1 1 0 1 0 Aborted (core dumped)4 4 0 4 0`

When setting the value to "False", Timeloop runs fine.
This was not the case a few commits ago (I think I was running commit e3dbf09, or the one just after, and did not see this behavior).
I attached a testcase in case you want to reproduce the issue on your end. Let me know if you have the same problem as well, or if I am missing something.

testcase_files.zip

config format

Hi,

Timeloop supports both libconfig and YAML formats for input configurations; will they be unified?

Thanks,
Cai Yu

number of PEs in eyeriss-256.yaml

The number of MACs in the Eyeriss config file is 256, while in the original Eyeriss paper it is 168.
Is there a specific reason for this?

Area Details not present in stats.txt file

Hi @angshuman-parashar
Thanks for the great tool.
I am using Timeloop with Accelergy, which gives me ART.yaml and ART_summary.yaml files, but the final stats file does not include the overall area my architecture consumes.
The stats file always reports Area = 0 mm^2.
Is this normal, or do I need to do something to make it work?

Exceeds buffer capacity

Hi there,

As I'm playing with different configurations, I've run into ERROR: couldn't map level GlobalBuffer: mapped tile size 428201 exceeds buffer capacity 65536. I've been trying to dig through the codebase to figure out why this happens, but it is hard to work out from the code alone.

Could you briefly explain (ideally with the math) how the mapped tile size and buffer capacity are computed from the problem shape (R, S, P, Q, C, K, N) and the arch specs (sizeKB, entries, word-bits, instances, etc.)?

Here is my setup:
problem shape: R = 7; S = 7; P = 112; Q = 112; C = 3; K = 64; N = 1; Wstride = 2; Hstride = 2
factors: "R1 S1 P112 Q1 C1 K1 N1"
arch spec: sizeKB = 128; instances = 1; meshX = 1; word-bits = 16; block-size = 4; read_bandwidth = 16; write_bandwidth = 16
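The capacity side of the message, at least, can be reproduced from the arch spec alone: 128 KB at 16 bits per word gives 65536 words. (The mapped-tile-size side depends on the full mapping and dataspace projections, which I won't attempt to recompute here.) A sketch:

```python
def buffer_capacity_words(sizeKB, word_bits):
    # Capacity in words = total bits / bits per word.
    return sizeKB * 1024 * 8 // word_bits

print(buffer_capacity_words(128, 16))  # 65536, matching the error message
```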

Thanks in advance!

Memory Temporal reduction

Hello,

I have a quick question about memory temporal reductions:
Looking at a basic example with 1 MAC + 1 SRAM array and a 1D convolution where R=3 and P=16 (https://github.com/Accelergy-Project/timeloop-accelergy-exercises/tree/master/exercises/timeloop/00-model-conv1d-1level/ref-output), how is "Temporal reductions (per-instance)" calculated?

For instance, in (https://github.com/Accelergy-Project/timeloop-accelergy-exercises/blob/master/exercises/timeloop/00-model-conv1d-1level/ref-output/timeloop-model.stats.txt), line 100, there are 32 temporal reductions, and these memory accesses are not counted in the final memory access number.
Since P=16 and R=3, we will have 16*3 = 48 output memory writes (32 partial sums and 16 ofmaps) to the SRAM and 16*2 = 32 memory reads (for the partial sums) from the SRAM. Shouldn't the final memory access number be 48+32 here? Or is there some kind of optimization I am missing?
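For what it's worth, the numbers in the question can be reproduced under the usual accounting assumption (my reading, not an authoritative statement of Timeloop's model) that the first accumulation into each output needs no prior read, and that "temporal reductions" counts the read-modify-write accumulations into already-written partial sums:

```python
# Worked numbers for the conv1d example (R = 3, P = 16).
R, P = 3, 16
macs = R * P                 # 48 multiply-accumulates in total
output_updates = macs        # 48 writes: 32 partial sums + 16 final ofmap values
output_reads = (R - 1) * P   # 32 reads of previously written partial sums

print(macs, output_updates, output_reads)  # 48 48 32
```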

Thanks!

Issue while building Timeloop

I ran into a problem while building Timeloop. How can I resolve it? I have already installed all prerequisites. I am using Ubuntu 16.04.

/usr/include/c++/5/bits/hashtable_policy.h:85:34: error: no match for call to ‘(const std::hash<boost::multiprecision::number<boost::multiprecision::backends::cpp_int_backend<128u, 128u, (boost::multiprecision::cpp_integer_type)0u, (boost::multiprecision::cpp_int_check_type)0u, void> > >) (const boost::multiprecision::number<boost::multiprecision::backends::cpp_int_backend<128u, 128u, (boost::multiprecision::cpp_integer_type)0u, (boost::multiprecision::cpp_int_check_type)0u, void> >&)’
noexcept(declval<const _Hash&>()(declval<const _Key&>()))>

Deconvolution layer

Hi,

Can deconvolution / transposed-convolution layers be modeled in Timeloop?
Is there an example of specifying this problem in YAML?

Deconvolution is essentially convolution with upsampling. The input feature maps are upsampled by inserting zeros in between original values and then the feature map is convolved with a filter.

So, my plan is to use the upsampled feature map as input and consider it as a normal convolution. Then I can either use the construct_workloads.py script from timeloop-accelergy-exercises or manually generate the yaml file.

Is this the right approach?
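The zero-insertion upsampling step described above can be sketched as follows (an illustrative helper, not part of Timeloop), so that the resulting feature-map size can be fed into a normal convolution problem spec:

```python
def upsample_1d(x, stride):
    # Insert (stride - 1) zeros between consecutive input values.
    out = []
    for i, v in enumerate(x):
        out.append(v)
        if i < len(x) - 1:
            out.extend([0] * (stride - 1))
    return out

print(upsample_1d([1, 2, 3], 2))  # [1, 0, 2, 0, 3]
```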

Running gemm simulation on architecture

Hi there,

I see there's a problem descriptor for GEMM (attached below). Is there a template I can use to specify my shape (M, N and K) and run it on any of the architectures?

shape =
{
  name = "gemm";
  dimensions = [ "M", "N", "K"];
  data-spaces =
  (
    {
      name = "A";
      projection = [ "M", "K" ];
    },
    {
      name = "B";
      projection = [ "N", "K"];
    },
    {
      name = "C";
      projection = [ "M", "N" ];
    },
    {
      name = "D";
      projection = [ "M", "N"];
      read-write = True;
    }
  );
};
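One plausible way to instantiate the shape above is a problem block that references it and supplies the dimension sizes. This is a hedged sketch in the same libconfig style; I am assuming the key names and the M/N/K values, so please check the Timeloop documentation for the exact syntax:

```
problem =
{
  shape = "gemm";   # assumed to reference the shape definition above
  M = 1024;         # example sizes; substitute your own
  N = 512;
  K = 256;
};
```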

Recommendations on the type of search algorithm

Hi,

I need some insight into which search algorithm to use when. I have observed that linear-pruned is much faster than random-pruned for some architectures, and hence I am inclined to use linear-pruned.

Can you please let me know the advantages and disadvantages of these?

Problem shape for group convolution

Hi Sir,
Kindly suggest a way to modify the problem YAML file for group convolution.

Also, could you explain the difference between the latency obtained for a flattened architecture and the cycle count reported in mapper.stats?

Thanks in advance :)

Different solutions depending on the number of threads

Hello,

I am currently using Timeloop (latest commit) on a 2-core machine. I was experimenting a bit with the option "num-threads: 16" in the mapper.yaml file and found some behavior I am doubtful about:

1- When using 2, 8 or 24 threads, the mapper does not find any solution to my particular problem.
2- When using 16 (or even 18) threads, a solution is found. That does not make much sense, as I only have 2 cores (4 threads) available, but a solution is indeed found in this particular case.

Even with 8 threads, I tried increasing the timeout condition by 10x (so the mapper tries more combinations before stopping), but it still did not find any solution. What does not make sense to me is why no solution is found with 24 threads while 16 threads can find one.

Thanks.
testcase_threads.zip

Assertion throws with empty constraint

Hi there,

When I run `timeloop-mapper eyeriss-256.cfg` and get rid of all constraints, it throws the assertion below. Is there a basic set of constraints that must be specified? If so, what are they? More generally, how should we go about removing all constraints (or any subset of them)?

Thanks!

WARNING: found neither a problem shape description nor a string corresponding to a to a pre-existing shape description. Assuming shape: cnn-layer.
MESSAGE: attempting to read problem shape from file: /scratch/cluster/zshi17/timeloop/problem-shapes/cnn-layer.yaml
Problem configuration complete.
Architecture configuration complete.
Using all available hardware threads = 24
Mapper configuration complete.
Initializing Index Factorization subspace.
  Factorization options along problem dimension R = 8
  Factorization options along problem dimension S = 8
  Factorization options along problem dimension P = 960
  Factorization options along problem dimension Q = 960
  Factorization options along problem dimension C = 6435
  Factorization options along problem dimension K = 6435
  Factorization options along problem dimension N = 1
Mapspace Dimension [IndexFactorization] Size: 2442415472640000
Mapspace Dimension [LoopPermutation] Size: 3252016064102400000
Mapspace Dimension [Spatial] Size: 64
Mapspace Dimension [DatatypeBypass] Size: 32768
timeloop-mapper: src/mapspaces/uber.hpp:142: void mapspace::Uber::Init(config::CompoundConfigNode): Assertion `de_cumulative_prod == 1' failed.
zsh: abort (core dumped)  timeloop-mapper eyeriss-256.cfg

Plan to support other architecture configuration

Hi!
First of all, thank you for your great work; I think Timeloop is a useful simulator.

In the ISPASS 2019 paper, you compared the performance and energy of several architectures, such as DianNao, NVDLA, and Eyeriss. However, there is only an Eyeriss architecture example in the 'config/timeloop' folder.

Is there any plan to provide DianNao or NVDLA configuration files? Also, can Timeloop support a systolic-array architecture like the TPU?

Thank you!

entries and mesh-X and block-size

Hi there, could you explain a bit about the physical meaning of these three parameters: entries (vs. instances), meshX and block-size?

As I understand it, instances at every level refers to the number of PEs, and each PE has 16, 192 and 12 entries/registers for outputs, weights and inputs respectively; is that correct? For meshX, I believe it is the number of rows in the 2D PE array. I don't have much idea about block-size in the global buffer; is it a concept similar to banks in memory?

Thanks!

arch =
{
    arithmetic =
    {
        name                    =   "MACs";
        instances               =   256;
        word-bits               =   16;
        meshX                   =   16;
    };
    storage =
    (
        {
            name                =   "PsumRegFile";
            entries             =   16;
            instances           =   256;
            meshX               =   16;
            word-bits           =   16;
            read_bandwidth      =   2; # bytes/cycle
            write_bandwidth     =   2; # bytes/cycle
        },
        {
            name                =   "WeightRegFile";
            entries             =   192;
            instances           =   256;
            meshX               =   16;
            word-bits           =   16;
            read_bandwidth      =   2; # bytes/cycle
            write_bandwidth     =   2; # bytes/cycle
        },
        {
            name                =   "InputRegFile";
            entries             =   12;
            instances           =   256;
            meshX               =   16;
            word-bits           =   16;
            read_bandwidth      =   2; # bytes/cycle
            write_bandwidth     =   2; # bytes/cycle
        },
        {
            name                =   "DummyBuffer";
            entries             =   0;
            instances           =   16;
            meshX               =   16;
            word-bits           =   16;
        },
        {
            name                =   "GlobalBuffer";
            sizeKB              =   128;
            instances           =   1;
            meshX               =   1;
            word-bits           =   16;
            block-size          =   4;
            read_bandwidth      =   16; # bytes/cycle (8 for inputs and 8 for psums)
            write_bandwidth     =   16; # bytes/cycle (8 for inputs and 8 for psums)
        },
        {
            name                =   "DRAM";
            technology          =   "DRAM";
            instances           =   1;
            word-bits           =   16;
        }
    );
};

Config file for real-world CNNs

Hi there,

I'm trying to evaluate mappings for real-world CNN benchmarks. Do you have any plan to provide layer specifications (problem configs) for real-world CNNs such as AlexNet, VGG and ResNet (17 or 50)? If they already exist somewhere, can you point me to them?

Zhan

Support for 3D CNN

Hi,

Do you have some preliminary support for 3D CNNs in timeloop?

thanks

Different NoC topologies simulation with timeloop?

Dear @angshuman-parashar,
Hello
As far as I know, different NoC topologies can affect the overall performance, power and area of accelerators. For example, the Microswitch fabric (a novel custom topology) incurs much less latency and power than a mesh topology [https://doi.org/10.1145/3130218.3130230].
So I'm really wondering whether it is possible to simulate different NoC topologies (other than a 2D mesh) with Timeloop. If not, would it be worth contributing such a feature to the project?
Thanks a lot for this nice project.

Kind Regards,
Manili

Error in scons --accelergy

Hi,

I am getting the following error while running scons --accelergy. The libconfig++-dev version is 1.4.9. Ubuntu version is 14.04.

src/applications/design-space/../mapper/mapper.hpp:580:39: error: ‘class libconfig::Setting’ has no member named ‘lookup’; did you mean ‘lookupValue’?
libconfig::Setting& mapper = root.lookup("mapper");
^~~~~~
lookupValue
scons: *** [build/applications/design-space/main.o] Error 1

Can you guide me regarding this error?

How to define bypassing in map file?

Hi,

Thank you for providing the wonderful tool! I have a question about the map file fed to the model.

If I want to use only the model (without the mapper) to predict the energy cost of a fixed HW+layer+mapping combination, is there a way to define bypassing in the map file, like the one provided in EX04 (conv1d+oc+ic-3levelspatial-cp-ws.map.yaml)?

mapping (without bypassing):

  • target: MainMemory
    type: temporal
    factors: R=1 P=1 K=1 C=1
    permutation: PRKC

  • target: GlobalBuffer
    type: temporal
    factors: R=3 P=1 K=32 C=2
    permutation: PRKC

  • target: GlobalBuffer
    type: spatial
    factors: R=1 P=1 K=1 C=16
    permutation: PRKC

  • target: RegisterFile
    type: temporal
    factors: R=1 P=16 K=1 C=1
    permutation: RPKC

How do I change it to add bypassing information inside? Could you provide an example?

Thank you very much!
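
For reference, recent Timeloop versions accept bypass directives alongside the loop-nest directives in a mapping. A hedged sketch of what such an entry might look like (the `type: bypass` directive and the `keep`/`bypass` key names should be verified against your version's documentation):

```yaml
# Hypothetical bypass directive: keep Inputs/Outputs in GlobalBuffer but
# stream Weights past it. Verify directive and key names against your
# Timeloop version's mapping documentation.
- target: GlobalBuffer
  type: bypass
  keep: [Inputs, Outputs]
  bypass: [Weights]
```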

Is it possible to describe an array of "arrays"?

Hello,

I am trying to describe an array of arrays in Timeloop. As an example, what if I want to study the ISAAC architecture considering several IMAs instead of a single one (as specified here: https://github.com/Accelergy-Project/timeloop-accelergy-exercises/blob/master/baseline_designs/example_designs/simple_pim/arch/system_PIM.yaml)?

Right now, for a single IMA (128*128 array), the arch. description is:

      subtree:
        - name: PE[0..16383]
          local:                 # "local" contains physical instantiations
            - name: scratchpad
              class: memcell_storage  # definitions of the compound classes can be found under "components" folder
              attributes:
                width: 16       # width in bits
                depth: 1
                meshX: 128        # number of components in the X dimension (PE rows)
                meshY: 128        # number of components in the Y dimension (PE cols)
            - name: mac
              class: memcell_compute
              attributes:
                datawidth: 16   # datawidth in bits
                meshX: 128        # number of components in the X dimension (PE rows)
                meshY: 128        # number of components in the Y dimension (PE cols)

I'd like to consider a 4*4 IMA array, so I am wondering how to use the meshX/meshY parameters, since there are two levels of mesh (one for the grid of IMAs and one for the elements within each IMA).
I was thinking of something like the following, but I run into errors (meshX/meshY have to correspond to the PE_Array dimensions, whereas I'd also like to specify per-IMA meshX/meshY dimensions):

      - name: PE_Array[0..3]
        subtree:
        - name: PE[0..16383]
          local:                 # "local" contains physical instantiations
            - name: scratchpad
              class: memcell_storage  # definitions of the compound classes can be found under "components" folder
              attributes:
                width: 16       # width in bits
                depth: 1
                meshX: 128        # number of components in the X dimension (PE rows)
                meshY: 128        # number of components in the Y dimension (PE cols)
            - name: mac
              class: memcell_compute
              attributes:
                datawidth: 16   # datawidth in bits
                meshX: 128        # number of components in the X dimension (PE rows)
                meshY: 128        # number of components in the Y dimension (PE cols)

Thanks!
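
One hedged workaround, assuming Timeloop flattens the spatial hierarchy: describe the leaf components with a single global mesh, so that meshX/meshY cover the IMA grid multiplied by the per-IMA grid. For a 2x2 arrangement of four 128x128 IMAs that would mean a 256x256 global mesh. The names and semantics below are assumptions to verify against the architecture documentation:

```yaml
# Hypothetical flattened description of 4 IMAs (arranged 2x2), each with
# 128x128 PEs: the leaf meshX/meshY describe the global 256x256 mesh,
# not one IMA.
subtree:
  - name: IMA[0..3]
    subtree:
      - name: PE[0..16383]
        local:
          - name: scratchpad
            class: memcell_storage
            attributes:
              width: 16
              depth: 1
              meshX: 256   # 2 IMAs x 128 PE columns
              meshY: 256   # 2 IMAs x 128 PE rows
          - name: mac
            class: memcell_compute
            attributes:
              datawidth: 16
              meshX: 256
              meshY: 256
```

Per-IMA boundaries would then be expressed through spatial mapping constraints rather than a second mesh level.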

Reproducing the result from search

Hi there,

Is there a way to reproduce the optimal mapping found by timeloop-mapper with timeloop-model? For example, among the output files that describe the optimal mapping (timeloop-mapper.map+stats.xml, timeloop-mapper.map.txt, timeloop-mapper.stats.txt), is there one that I can use to generate the input file for timeloop-model?

Thanks!

Using accelergy with timeloop

Hi,

I installed accelergy and "which accelergy" works fine. Now when I install timeloop, I provided the pat symbolic link as given in the readme and then built with scons --accelergy and scons -j4.

I also have timeloop running on another machine where I followed the same steps except scons --accelergy.

Both builds report the same energy per MACC, so I am unsure whether the first one is actually using Accelergy or falling back to the default PAT model.

Can you please guide me here? Is the PAT symbolic link still needed when running with Accelergy?

More sample configs?

Hi @angshuman-parashar,

Great simulator! Thanks for your hard work. It seems that the current sample config specifies a weight-stationary dataflow. I am interested in exploring other dataflows as well. Could you provide some more sample configs for other popular dataflows, like output-stationary and row-stationary? Or more documentation on how particular map constraints encode particular dataflows?

Thanks!

Dataflows of eyeriss, simba and chen-asplos2014

Hi there,

I see the architectures and constraint sets for Eyeriss, Simba and chen-asplos2014, but how can I get their default dataflows without running the mapper? Basically, in addition to the dataflow found by Timeloop's search, I'm trying to compare against the original dataflows of those accelerators as baselines, for different problems / layers. I assume the search model does not necessarily reproduce their original dataflows, right?

A related question: without a mapper, how do these accelerators pick a valid and efficient dataflow for each problem / layer? I assume the dataflow changes with the problem, at least in its factors.

Thanks!!!

Cannot compile the codes

I got the following error while compiling the source with "scons -j4":
g++: error: unrecognized command line option ‘-std=c++14’
I use g++ 4.5.2.

How can I compile?

Thanks,
-Mustafa

How Timeloop takes latency of different types of memories into account while measuring performance?

In cycle-accurate simulators such as gem5, we can replace one memory technology (say SRAM) with another (say STT-RAM). The difference in their latency and capacity affects the overall performance (measured as cycle count). Can we do something similar in Timeloop, and will it affect the reported performance?
Consider an analogy: how would the cycle count change if I replaced an RF with a larger RF? This would surely change performance, because accesses to DRAM are reduced, but each RF access now incurs a higher latency.
Thank you in advance!

Unnecessary writes on weights buffer

Hello,
Thanks for all this work!
I have a question about some unnecessary memory writes that the system seems to be doing. I defined a memory buffer to store only the weights. This buffer is much larger than all the weights combined. I also constrained it to keep only the weights, and I use a second buffer for the inputs and outputs. However, after Timeloop runs, this memory is written many more times than the total number of weights (which shouldn't be necessary).
In the attached example, the weight buffer size is 512000 B, the number of weights is 147456, and the number of fills is 4718592 (exactly 32 times the total number of weights; I am probably missing something :) ).
Attach is an example that can be run using the command:
timeloop-mapper arch/system.yaml arch/components/*.yaml mapper/mapper.yaml constraints/*.yaml Out_Conv_layer_VGG.yaml
I am also attaching my config file a folder with a primitive that is needed.
Thanks!
Tests.zip

scons build error

Hi, Thank you for the awesome tool!
When building Timeloop with scons -j4, the following error message is displayed.
I installed the dependencies and linked the PAT model (the public one).

My env as follows:

$ python 
Python 2.7.16 |Anaconda, Inc.| (default, Mar 14 2019, 21:00:58) 
[GCC 7.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys 
>>> sys.version, sys.platform
('2.7.16 |Anaconda, Inc.| (default, Mar 14 2019, 21:00:58) \n[GCC 7.3.0]', 'linux2')

OS: ubuntu 16.04

Please help me...
Thanks in advance!

error message

shkim at cnn in /home/shkim/timeloop-dev/timeloop on master $
→scons -j4
scons: Reading SConscript files ...
scons: done reading SConscript files.
scons: Building targets ...
g++ -o build/applications/mapper/main.o -c -g -O3 -Werror -Wall -Wextra -fmax-errors=1 -std=c++14 -pthread -DBUILD_BASE_DIR=\"/home/shkim/timeloop-dev/timeloop\" -Ibuild/src/include -Isrc/src/include -Ibuild -Isrc src/applications/mapper/main.cpp
g++ -o build/data/cnn/cnn-layers.o -c -g -O3 -Werror -Wall -Wextra -fmax-errors=1 -std=c++14 -pthread -DBUILD_BASE_DIR=\"/home/shkim/timeloop-dev/timeloop\" -Ibuild/src/include -Isrc/src/include -Ibuild -Isrc src/data/cnn/cnn-layers.cpp
g++ -o build/loop-analysis/tiling.o -c -g -O3 -Werror -Wall -Wextra -fmax-errors=1 -std=c++14 -pthread -DBUILD_BASE_DIR=\"/home/shkim/timeloop-dev/timeloop\" -Ibuild/src/include -Isrc/src/include -Ibuild -Isrc src/loop-analysis/tiling.cpp
g++ -o build/loop-analysis/nest-analysis.o -c -g -O3 -Werror -Wall -Wextra -fmax-errors=1 -std=c++14 -pthread -DBUILD_BASE_DIR=\"/home/shkim/timeloop-dev/timeloop\" -Ibuild/src/include -Isrc/src/include -Ibuild -Isrc src/loop-analysis/nest-analysis.cpp
In file included from /usr/include/c++/5/bits/hashtable.h:35:0,
                 from /usr/include/c++/5/unordered_set:47,
                 from src/search/random.hpp:31,
                 from src/search/search-factory.hpp:32,
                 from src/applications/mapper/mapper.hpp:44,
                 from src/applications/mapper/main.cpp:32:
/usr/include/c++/5/bits/hashtable_policy.h: In instantiation of ‘struct std::__detail::__is_noexcept_hash<boost::multiprecision::number<boost::multiprecision::backends::cpp_int_backend<128u, 128u, (boost::multiprecision::cpp_integer_type)0u, (boost::multiprecision::cpp_int_check_type)0u, void> >, std::hash<boost::multiprecision::number<boost::multiprecision::backends::cpp_int_backend<128u, 128u, (boost::multiprecision::cpp_integer_type)0u, (boost::multiprecision::cpp_int_check_type)0u, void> > > >’:
/usr/include/c++/5/type_traits:137:12:   required from ‘struct std::__and_<std::__is_fast_hash<std::hash<boost::multiprecision::number<boost::multiprecision::backends::cpp_int_backend<128u, 128u, (boost::multiprecision::cpp_integer_type)0u, (boost::multiprecision::cpp_int_check_type)0u, void>, (boost::multiprecision::expression_template_option)0u> > >, std::__detail::__is_noexcept_hash<boost::multiprecision::number<boost::multiprecision::backends::cpp_int_backend<128u, 128u, (boost::multiprecision::cpp_integer_type)0u, (boost::multiprecision::cpp_int_check_type)0u, void>, (boost::multiprecision::expression_template_option)0u>, std::hash<boost::multiprecision::number<boost::multiprecision::backends::cpp_int_backend<128u, 128u, (boost::multiprecision::cpp_integer_type)0u, (boost::multiprecision::cpp_int_check_type)0u, void>, (boost::multiprecision::expression_template_option)0u> > > >’
/usr/include/c++/5/type_traits:148:38:   required from ‘struct std::__not_<std::__and_<std::__is_fast_hash<std::hash<boost::multiprecision::number<boost::multiprecision::backends::cpp_int_backend<128u, 128u, (boost::multiprecision::cpp_integer_type)0u, (boost::multiprecision::cpp_int_check_type)0u, void>, (boost::multiprecision::expression_template_option)0u> > >, std::__detail::__is_noexcept_hash<boost::multiprecision::number<boost::multiprecision::backends::cpp_int_backend<128u, 128u, (boost::multiprecision::cpp_integer_type)0u, (boost::multiprecision::cpp_int_check_type)0u, void>, (boost::multiprecision::expression_template_option)0u>, std::hash<boost::multiprecision::number<boost::multiprecision::backends::cpp_int_backend<128u, 128u, (boost::multiprecision::cpp_integer_type)0u, (boost::multiprecision::cpp_int_check_type)0u, void>, (boost::multiprecision::expression_template_option)0u> > > > >’
/usr/include/c++/5/bits/unordered_set.h:95:63:   required from ‘class std::unordered_set<boost::multiprecision::number<boost::multiprecision::backends::cpp_int_backend<128u, 128u, (boost::multiprecision::cpp_integer_type)0u, (boost::multiprecision::cpp_int_check_type)0u, void> > >’
src/search/random.hpp:56:33:   required from here
/usr/include/c++/5/bits/hashtable_policy.h:85:34: error: no match for call to ‘(const std::hash<boost::multiprecision::number<boost::multiprecision::backends::cpp_int_backend<128u, 128u, (boost::multiprecision::cpp_integer_type)0u, (boost::multiprecision::cpp_int_check_type)0u, void> > >) (const boost::multiprecision::number<boost::multiprecision::backends::cpp_int_backend<128u, 128u, (boost::multiprecision::cpp_integer_type)0u, (boost::multiprecision::cpp_int_check_type)0u, void> >&)’
  noexcept(declval<const _Hash&>()(declval<const _Key&>()))>
                                  ^
compilation terminated due to -fmax-errors=1.
scons: *** [build/applications/mapper/main.o] Error 1
scons: building terminated because of errors.



Some questions on the memory energy methodology

Hello,

I am currently using Timeloop & Accelergy to evaluate some CNN architectures.
I am only using Table based energy for the computation, but I am having troubles understanding how Timeloop & Accelergy are using these energy numbers:

1- When using a specific layer (for the problem) description, I can see in the .stats.txt file (I am only showing the important lines here):

Level 2
-------
=== MainMemory ===
....
        Vector access energy : 8.14 pJ
...
    Weights:
...
        Energy (per-scalar-access)               : 1.02 pJ

The vector access energy (8.144 pJ) is the one I used in the .csv table file, so up to there everything works. The per-scalar-access energy is 8.144/8 = 1.02 pJ, since I use an 8-bit data width in my design.

However, 1.02 pJ is used for all weights, input activations and outputs, while I have separate read and write values in my .csv file. When only the "read" line or only the "write" line is present in my .csv, Timeloop uses that value, but when both are present, Timeloop uses only the higher one. Am I missing something here? I am using the Docker container (https://github.com/Accelergy-Project/timeloop-accelergy-tutorial), so maybe it's outdated?
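
As a sanity check on the arithmetic in point 1, the per-scalar figure is just the vector access energy divided by the number of scalars per vector access (the divisor of 8 is an assumption based on the 8-bit word width quoted above):

```python
# Sanity check on the per-scalar-access arithmetic from the stats excerpt.
# The divisor (scalars per vector access) is an assumption based on the
# 8-bit data width mentioned above.
vector_access_energy_pj = 8.144   # pJ per vector access, from the .csv table
scalars_per_vector = 8

per_scalar_pj = vector_access_energy_pj / scalars_per_vector
print(round(per_scalar_pj, 2))  # → 1.02
```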

2- When using the exact same design, but only changing the problem parameters (like only having 1 filter 3x3 and 1 image 3x3), the .map.stats file looks like:

Level 2
-------
=== MainMemory ===
....
        Vector access energy : 8.14 pJ
...
    Weights:
...
        Energy (per-scalar-access)               : 1.81 pJ

    Inputs:
...
        Energy (per-scalar-access)               : 1.30 pJ

So this time, the 8.144 pJ is used for the vector access, but the energy per scalar access is different (while the design is exactly the same). Is that expected?

Thanks!

Mapper Diagnostic Mode results in Segfault

When the mapper configuration flag "diagnostics: True" is set, Mapping::PrettyPrint() throws an uncaught std::out_of_range exception and the run aborts:

"terminate called after throwing an instance of 'std::out_of_range'
what(): vector::_M_range_check: __n (which is 8) >= this->size() (which is 0)
Aborted (core dumped)"

Question about problem-shapes/cnn-layer.yaml

Hi, Thank you for the awesome tool!
I have a question about problem-shapes/cnn-layer.yaml.
How do I change the projection to model the depthwise separable convolution of MobileNet instead of a standard convolution?

Like this PyTorch code:

class DWConv3x3(nn.Module):
    def __init__(self, nin, nout):
        super(DWConv3x3, self).__init__()
        self.depthwise = nn.Conv2d(nin, nin, kernel_size=3, padding=1, groups=nin)
    def forward(self, x):
        out = self.depthwise(x)
        return out

Thank you in advance!
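
For what it's worth, a depthwise layer can be expressed by writing a problem shape in which a single channel dimension C appears in the projections of all three data spaces, so there is no independent K. Below is a hedged sketch modeled on the projection syntax of the bundled cnn-layer shape; the key names and coefficient handling are assumptions to check against your Timeloop version:

```yaml
# Hypothetical problem shape for 3x3 depthwise convolution: C is shared by
# Weights, Inputs and Outputs, so each input channel feeds exactly one
# filter and one output channel. Stride/padding coefficients are omitted.
shape:
  name: depthwise-conv
  dimensions: [ C, R, S, P, Q, N ]
  data-spaces:
    - name: Weights
      projection:
        - [ [C] ]
        - [ [R] ]
        - [ [S] ]
    - name: Inputs
      projection:
        - [ [C] ]
        - [ [N] ]
        - [ [R], [P] ]   # input H spanned by R + P
        - [ [S], [Q] ]   # input W spanned by S + Q
    - name: Outputs
      projection:
        - [ [C] ]
        - [ [N] ]
        - [ [P] ]
        - [ [Q] ]
      read-write: True
```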

Design Space Example Configuration

Hi,

I am just trying to use the design-space application part of Timeloop.
It seems that it just needs multiple architecture and problem files. What about the mapping and constraints files in that case?
Is there a configuration example available for design-space exploration that could help?
