stanfordaha / aha
Home Page: https://stanfordaha.github.io/aha-wiki-page/
Following the guidance on the website, I installed the aha Docker image and ran "garnet", "halide", and "map" successfully, but when I try to run "aha test apps/pointwise" I get some errors:
Running command: verilator -Wall -Wno-INCABSPATH -Wno-DECLFILENAME -Wno-fatal --cc Interconnect.v --exe Interconnect_driver.cpp --top-module Interconnect
%Warning-PINMISSING: Interconnect.v:2112: Cell has missing pin: ren_out
%Warning-PINMISSING: Use "/* verilator lint_off PINMISSING */" and lint_on around source to disable this message.
%Warning-PINMISSING: Interconnect.v:2112: Cell has missing pin: wen_out
%Warning-PINMISSING: Interconnect.v:10384: Cell has missing pin: parallel_out
%Error: Specified --top-module 'Interconnect' isn't at the top level, it's under another cell 'Garnet'
%Error: Exiting due to 1 error(s)
%Error: Command Failed /usr/bin/verilator_bin -Wall -Wno-INCABSPATH -Wno-DECLFILENAME -Wno-fatal --cc Interconnect.v --exe Interconnect_driver.cpp --top-module Interconnect
Found 1 error(s):
I don't have Cadence Incisive, so I installed Verilator instead.
The full sequence of commands was:
aha garnet --width 16 --height 4 --verilog --interconnect-only --no-pd
aha halide apps/pointwise
aha map apps/pointwise --width 16 --height 4 --interconnect-only --no-pd
aha test apps/pointwise
Thanks a lot! Happy new year.
Hello. Using the currently provided Docker image, I configured the CGRA, compiled the example Halide code, and performed RTL simulation.
Where can I find the source code for the provided aha binary? The code provided in the 'cgra_pnr' directory and the PnR results of the binary tend to differ, and I am wondering how I can compile the binary anew.
Our CI 'aha-flow' has stopped working because of a Python-package version requirement in the Docker build. In particular, aha/Dockerfile (line 95) pins the "packaging" package to a specific version:
pip install packaging==21.3
This worked fine for a long time, but it now causes an error later in the build (https://buildkite.com/stanford-aha/aha-flow/builds/10718):
#56 12.73 ERROR: requirements-parser 0.11.0 has requirement packaging>=23.2, but you'll have packaging 21.3 which is incompatible.
I will be submitting a change removing the packaging==21.3 version requirement. My question for the group:
I was using garnet to test the demosaic_complex app, which has three parallel output ports (RGB). It could not identify the pixel size from the gold.raw file. Keyi suggested we take the .pgm file from Halide instead, since it carries the pixel-width information.
For an example of a failing build: https://buildkite.com/stanford-aha/aha-flow/builds/940
Dependabot encountered the following error when parsing your .dependabot/config.yml:
Automerging is not enabled for this account. You can enable it from the [account settings](https://app.dependabot.com/accounts/hofstee/settings) screen in your Dependabot dashboard.
Please update the config file to conform with Dependabot's specification using our docs and online validator.
Currently, it seems that even if the builds succeed, the PR is reported as a failure, e.g. cdonovick/peak#137.
I've been trying to go through the basic flow in the readthedocs page just to get started. Everything works well until I run:
aha test apps/pointwise
At which point I get a long error printout from the simulator which ends with:
file: flop_unq1.sv
module worklib.flop_unq1:sv
errors: 0, warnings: 0
Elaborating the design hierarchy:
Caching library 'worklib' ....... Done
.config_config_addr(config_config_addr),
|
ncelab: *E,CUVPOM (./Interconnect_tb.sv,88|26): Port name 'config_config_addr' is invalid or has multiple connections.
.config_config_data(config_config_data),
|
ncelab: *E,CUVPOM (./Interconnect_tb.sv,89|26): Port name 'config_config_data' is invalid or has multiple connections.
.config_read(config_read),
|
ncelab: *E,CUVPOM (./Interconnect_tb.sv,90|19): Port name 'config_read' is invalid or has multiple connections.
.config_write(config_write),
|
ncelab: *E,CUVPOM (./Interconnect_tb.sv,91|20): Port name 'config_write' is invalid or has multiple connections.
.stall(stall),
|
ncelab: *E,CUVPOM (./Interconnect_tb.sv,94|13): Port name 'stall' is invalid or has multiple connections.
irun: *E,ELBERR: Error during elaboration (status 1), exiting.
</STDOUT>
Found 1 error(s):
1) Got return code 1.
Traceback (most recent call last):
File "tbg.py", line 390, in <module>
test.test()
File "tbg.py", line 321, in test
directory=tempdir)
File "/aha/fault/fault/tester/staged_tester.py", line 276, in compile_and_run
self._compile_and_run(target=target, **kwargs)
File "/aha/fault/fault/tester/staged_tester.py", line 261, in _compile_and_run
self.run(target)
File "/aha/fault/fault/tester/staged_tester.py", line 251, in run
target_obj.run(self.actions)
File "/aha/fault/fault/system_verilog_target.py", line 715, in run
err_str=sim_err_str, disp_type=self.disp_type)
File "/aha/fault/fault/subprocess_run.py", line 163, in subprocess_run
raise AssertionError
AssertionError
Traceback (most recent call last):
File "/aha/bin/aha", line 11, in <module>
load_entry_point('aha', 'console_scripts', 'aha')()
File "/aha/aha/aha.py", line 51, in main
args.dispatch(args, extra_args)
File "/aha/aha/util/test.py", line 27, in dispatch
cwd=args.aha_dir / "garnet",
File "/usr/lib/python3.7/subprocess.py", line 363, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/aha/bin/python', 'tbg.py', 'garnet.v', 'garnet_stub.v', PosixPath('/aha/Halide-to-Hardware/apps/hardware_benchmarks/apps/pointwise/bin/pointwise.bs.json')]' returned non-zero exit status 1.
Am I doing something wrong here? Do either @hofstee or @Kuree have any idea what might be going wrong?
Hey Steve, the retry strategy implemented in garnet.py and regress.py (line 49 in 251bd0e) catches subprocess.CalledProcessError, but since there is no else branch on the if 'SIGSEGV' in str(e): check, any other error is silently swallowed and the regression just continues on.
We probably want an else: raise Exception("Exception in mapping/garnet") or something like that.
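As a sketch (the function name and retry count here are my assumptions, not the actual garnet.py/regress.py code), the point is simply to re-raise anything that isn't a SIGSEGV instead of falling through:

```python
# Sketch of the suggested fix: only SIGSEGV failures are retried;
# any other CalledProcessError is re-raised instead of being swallowed.
import subprocess

def run_with_retry(cmd, max_retries=3):
    for _ in range(max_retries):
        try:
            subprocess.check_call(cmd)
            return
        except subprocess.CalledProcessError as e:
            if 'SIGSEGV' in str(e):
                continue  # segfaults are treated as flaky; retry
            else:
                # the missing else branch: don't let other errors pass silently
                raise Exception("Exception in mapping/garnet") from e
    raise Exception("SIGSEGV persisted after retries")
```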
Hey @hofstee I'm trying to run the quick start described in the docs on kiwi, but I get the following error:
dhuff@kiwi:~/aha$ pip install -e .
Defaulting to user installation because normal site-packages is not writeable
Obtaining file:///home/dhuff/aha
ERROR: Package 'aha' requires a different Python: 3.6.9 not in '>=3.7'
Any idea what I'm doing wrong?
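In case it helps anyone hitting the same thing, a quick way to confirm which interpreter pip will use (the >= 3.7 floor is taken from the error above):

```python
# Sanity check: the aha package declares python_requires >= 3.7, so the
# interpreter that runs pip must be at least that new.
import sys

if sys.version_info < (3, 7):
    raise SystemExit(f"Python {sys.version.split()[0]} is too old; need >= 3.7")
print("Python version OK:", sys.version.split()[0])
```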
It seems like it gets to the point where a testbench is compiled in Verilator and then begins running, but after a while the process running Verilator drops to 0% CPU and appears to hang. Not sure if this is because a testbench failed to terminate or something else.
Previously, using the aha Docker image on a generic Linux server with Verilator installed, I was able to run garnet generation, Halide compilation, map/PnR, and the testbench sequentially on a subset of the applications in the Halide-to-Hardware repo.
The problem I now have is that in the last few months a lot of the feature work related to clockwork has moved forward, and I am no longer able to coax the commands (garnet->halide->map->test) into running in the Docker container. I suspect that the accumulated changes have fully deprecated the ability to run anything outside the specific dev environment the team uses, which includes Xcelium and Cadence IP.
Here's a series of commands that used to run for harris/resnet_layer_gen/others, but no longer does:
# within aha docker container
apt update && apt install verilator
...
python garnet.py --width 32 --height 16 --verilog --interconnect-only --no-pd
aha halide apps/harris
aha map apps/harris --width 32 --height 16
aha test apps/harris
Previously I had applied single-line edits in garnet.py and tbg.py inspired by
Those no longer seem to help get past the errors.
Is the glb / global buffer related to, or required for, running these types of tests? #1277
Executive summary: I'm not sure we are properly testing dense resnet layers.
Details:
I'm not sure what to do with this information (yet). And I have not done the work to thoroughly, under controlled conditions, verify what's really happening (yet). But here's what it looks like:
I tried using the "aha regress" command to run 'conv1' on its own, as a dense test, and it failed.
But! If I include a non-layer test e.g. gaussian in the test suite, e.g. "gaussian" followed by "conv1", then both tests pass.
I run the tests with the garnet daemon turned ON, which means that in the case where both tests pass, the "conv1" test (re)uses the Verilog that was built for the "gaussian" test. I think this is relevant, but I'm not sure what it means, since both are dense tests (I think?) and both should be using the same Verilog anyway (right?)
The next thing I would probably try: What happens if we run "gaussian" + "conv1" with daemon OFF? I'm guessing that conv1 fails.
Until I have better information, I guess I will file this as both a "garnet" issue and an "aha" issue...
I will include @kalhankoul96 as an assignee, because I think he'll be interested, and because he can remove himself and/or add more assignees if there's anyone else that might obviously want to know about this...
Garnet issue is here: StanfordAHA/garnet#1070
Dependabot was set up to create pull requests against the branch coreir-dev, but couldn't find it.
If the branch has been permanently deleted, you can update Dependabot's target branch in the .dependabot/config.yml file in this repo.
Hello, I have a question regarding Amber CGRA mapping.
Using the provided Gaussian application example, we obtained the final PnR result using the aha map and pnr commands. Upon inspection, the placement of the PE tiles used was relatively spread out. Can you tell me why?
I understand that, to solve the placement problem, the thunder library used in the AHA PnR process computes HPWL (sum of squared distances from the center of the net to each block/cluster), a spreading potential, legalization energy, etc. Given that, shouldn't the PE tiles in the resulting placement end up close to each other?
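To illustrate why I expected clustering, here is a toy wirelength-style cost (my own sketch, not thunder's actual objective or code): a placer minimizing it is rewarded for moving connected blocks closer together.

```python
# Toy half-perimeter wirelength (HPWL) cost: the half-perimeter of each
# net's bounding box, summed over all nets. Tighter packing scores lower.
def hpwl(net):
    """net: list of (x, y) tile coordinates connected by one net."""
    xs = [x for x, _ in net]
    ys = [y for _, y in net]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def total_cost(nets):
    return sum(hpwl(net) for net in nets)

# A spread-out placement costs more than a packed one:
spread = [[(0, 0), (15, 3)], [(15, 3), (2, 2)]]
packed = [[(0, 0), (1, 0)], [(1, 0), (1, 1)]]
```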
The picture below shows the results of Gaussian application PnR, following the order described in the AHA wiki (https://github.com/StanfordAHA/aha/wiki/CGRA-Design-Flow). The layout seems different from what I expected, so I would appreciate it if you could let me know whether I have misunderstood anything.
Also, the provided applications do not perform global placement, because "DISABLE_GP" is set to 1 everywhere in the aha/util/application_parameters.json file. Is there a reason for this? I tried performing PnR with the DISABLE_GP setting removed, but an error occurred during the PnR process and it stopped midway. Is "DISABLE_GP" set to 1 everywhere because of this phenomenon?
Thank you for reading the question.
Your dependency file specified a branch or reference for lake, but Dependabot couldn't find it at the project's source. Has it been removed?
Whenever scripts and CI definitions update on master, the changes need to be propagated to the other branches for the Buildkite CI to work. Additionally, we usually want the branches running the most up-to-date version of CI they can.
One way to fix this might be to create a scripts branch where updating the scripts is done, and then merge that into both master and the other branches.
Problem: AHA regressions show log information out of order. I.e., in regress.py we see the following code:
print(f"--- Running regression: {args.config}")
...
generate_sparse_bitstreams(sparse_tests, width, height)
However, in the log output, the sparse_bitstream info comes out before we see the "Running regression" tag:
% docker exec $container /bin/bash -c "aha regress pr" >& regress.log
% egrep '(BASE|Running)' regress.log
TEST BASE NAME: matmul_ijk
TEST BASE NAME: mat_mattransmul
...
--- Running regression: pr
An example of this behavior can be seen in the "Aha regressions" step of the Buildkite log at https://buildkite.com/stanford-aha/garnet/builds/4660
Why: the sparse_bitstream log info comes out on stderr, whereas the print("Running regression") goes to stdout. Until the stdout buffer fills to some minimum capacity (e.g. 1024 characters), it does not get written out to the terminal. Meanwhile, stderr continues to spew.
Solution: periodically flush stdout so that output appears in a more timely manner.
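Something along these lines (illustrative sketch only, not the actual regress.py patch):

```python
# Flush stdout explicitly so that our banner lines interleave correctly
# with the subprocess's unbuffered stderr, instead of sitting in the
# block buffer until it fills up.
def log(msg):
    print(msg, flush=True)  # flush=True forces the line out immediately

log("--- Running regression: pr")
# ... then launch the noisy subprocess; its stderr can no longer
# appear in the log before the banner above.
```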
I have a fix that I will submit once the tapeout panic is over.
The official Docker image garnet:latest currently takes up about 7.5G of disk space.
(kiwi)% docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
stanfordaha/garnet latest 7579cf0747ef 4 days ago 7.41GB
Is there some easy way to reduce the image size? I have an idea that might help.
The .git directories associated with some of the aha submodule repos are enormous, due mainly to poor git-repo hygiene on the part of the repo maintainers (yes, that includes me). Below we see that the .git metadata for four submodules takes up almost a gigabyte of disk space:
(aha) root@daa174656670:/aha# du -shx /aha/.git/modules/* | sort -hr
418M ../.git/modules/clockwork
340M ../.git/modules/Halide-to-Hardware
104M ../.git/modules/sam
63M ../.git/modules/garnet
The metadata is large generally because someone checked a big binary file into the repo in question; e.g., the clockwork repo metadata includes a 90MB file libntl.a. The file does not exist in the repo itself, at least not in the master branch, but because it was pushed and subsequently deleted, it lives on as master-branch metadata. In addition, clockwork has over a hundred "active" branches, many of them very old and unused, and many of which still contain the undeleted file. Extracting this object, and many similar ones, from the repos would be extremely time-consuming.
But what if there was an easier way?
I have written a script that can recover a missing .git directory, such that one might (1) delete the .git metadata before writing the final image, and (2) use the script from .bashrc to re-install .git whenever a user fires up a new container.
I am preparing a new aha branch to implement and test out this idea, initially targeting the largest offender (clockwork); watch for an upcoming related pull request on this topic.
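Concretely, the idea might look something like this (the restore-script name and paths here are hypothetical placeholders, not the actual implementation):

```shell
# Hypothetical sketch only: script name and paths are assumptions.
# In the Dockerfile, before committing the final image:
#   RUN rm -rf /aha/.git/modules/clockwork
# In the image's .bashrc, lazily re-install the metadata on first use;
# the outer guard makes this a no-op outside the aha container:
if [ -d /aha/.git/modules ] && [ ! -d /aha/.git/modules/clockwork ]; then
    /aha/bin/restore-git-metadata clockwork
fi
```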
Meanwhile, your comments and suggestions are more than welcome!
Dependabot encountered the following error when parsing your .dependabot/config.yml:
The property '#/' contains additional properties ["automerged_updates"] outside of the schema when none are allowed
Please update the config file to conform with Dependabot's specification using our docs and online validator.