microsoft / nnfusion
A flexible and efficient deep neural network (DNN) compiler that generates high-performance executables from a DNN model description.
License: MIT License
Users cannot download frozen_inception2_batch_1.pb from the provided link
🚀 Feature
Motivation
Pitch
Alternatives
Additional context
Fix the inefficient link.
test react to issue bot
The issue is to list out what we would like to mark as 0.1.
🚀 Feature
Motivation
Pitch
Alternatives
Additional context
When building with Docker, users are required to specify a username & password
The purpose of the code refactor is to improve our code quality and usability. The approach we take should be considered from two sides: the user's and our developers'.
From the user's side, the main goal is to let users use NNF as a real tool: compile a model and understand the procedure easily:
Building stages
Currently our building scripts support installing dependencies and running the build in a native env or inside a container, but we haven't thought much about what users actually meet in real scenarios:
Take #48 for example: the user didn't read our doc, so he didn't know we support Ubuntu 16/18, not 20. The system version should be checked by the building scripts.
Work items:
Testing stages
Our testing utilities are tricky and hard to use: the user must have NVIDIA/CUDA hardware and configure it through a specific config file. The unit tests have no hardware check, which results in failing tests.
We should make testing easier and make the test report easy for users to understand.
Besides, we should check coverage for each PR and report whether the code change is covered by tests; this will help us improve code quality.
Validation stages
NNF is currently more like a project in the tech-validation stage: we might configure the env by hand so that users can do validation. What we need to do is make the scripts or the NNFusion CLI easier for users to compile models.
One more problem: NNF needs a frozen model, but freezing a model is not a standard procedure for users, so NNF may take a faulty frozen model as input. Should we provide a standard script to freeze models?
Work item:
From our developers' aspect:
License Problem:
The Apache-2 license is kind of strict and makes modifying code hard, so we moved the code we didn't rewrite into the thirdparty folder. But we need to rewrite that code and bring it back into our source tree; if not, readers might get confused about where the code actually lives. That code is mainly related to the operator set, some core data types, and the importer frontends for TF and ONNX. We discuss those in later sections;
Operator Set:
The operator set we use originates from nGraph and is amended with some ops of the "OperatorV2" type. The main goal is to migrate the whole operator set to "OperatorV2" or a new class that is no longer hard-coded and can be added/removed/changed easily.
Operators should also support serialization.
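To make the direction concrete, here is a minimal sketch of what a data-driven, serializable operator definition could look like. All names here (the OperatorV2 constructor, serialize/deserialize) are illustrative assumptions, not NNFusion's actual API:

```python
import json

class OperatorV2:
    """Illustrative operator definition: declared from data, not hard-coded,
    so ops can be added/removed/changed without touching core code."""

    def __init__(self, name, attrs=None):
        self.name = name
        self.attrs = dict(attrs or {})

    def serialize(self):
        # Emit a plain JSON string so operator sets can be stored and reloaded.
        return json.dumps({"name": self.name, "attrs": self.attrs})

    @classmethod
    def deserialize(cls, data):
        obj = json.loads(data)
        return cls(obj["name"], obj["attrs"])

op = OperatorV2("Conv2D", {"strides": [1, 1], "padding": "SAME"})
restored = OperatorV2.deserialize(op.serialize())
assert restored.name == "Conv2D" and restored.attrs["padding"] == "SAME"
```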
Kernels:
Currently we have hard-coded kernels, Antares kernels (antares-ir), and kernelDB kernels. We have the features, but we don't provide a good mechanism for picking kernels from among them, and the kernels' interfaces are not the same.
So in this part, we might need to:
Firstly, design a general interface for all kernel providers, which lets us support more providers like TVM.
Secondly, design kernel-selection policies for kernel providers.
The new interface will give our optimization passes more flexibility to pick/change kernels.
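As a rough sketch of the two items above (a common provider interface plus a selection policy), under assumed names that are not NNFusion's real API:

```python
class KernelProvider:
    """Illustrative common interface for kernel providers
    (hard-coded, Antares, kernelDB, TVM, ...)."""

    def lookup(self, op_identifier):
        raise NotImplementedError

class DictProvider(KernelProvider):
    """Toy provider backed by a dict: op identifier -> (kernel, latency)."""

    def __init__(self, name, kernels):
        self.name = name
        self.kernels = kernels

    def lookup(self, op_identifier):
        return self.kernels.get(op_identifier)

def select_kernel(op_identifier, providers):
    # Simple selection policy: ask every provider, pick the lowest-latency hit.
    candidates = [p.lookup(op_identifier) for p in providers]
    candidates = [c for c in candidates if c is not None]
    return min(candidates, key=lambda c: c[1]) if candidates else None

providers = [
    DictProvider("hardcoded", {"Relu": ("relu_v1", 0.9)}),
    DictProvider("kernelDB", {"Relu": ("relu_tuned", 0.4)}),
]
assert select_kernel("Relu", providers) == ("relu_tuned", 0.4)
assert select_kernel("OneHot", providers) is None
```

A real policy could weigh profiling results or device constraints instead of a static latency, but the uniform `lookup` interface is the point.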
Code generator:
This might be the hardest part of the refactor plan: the code generator is complex and integrates many features, and those sub-features interact with each other.
The main goal is to make codegen much simpler so it can support a "new" device with much less code change.
Profiler
Our profiler has some flaws. For example, it does not guarantee the input data is valid, which may cause errors when profiling some kernels (e.g., OneHot). Also, the profiler and codegen are independent in the current design, but they share many functions. We may use codegen to do profiling.
Training
We have added basic training features like autodiff, backward ops, etc. But end users cannot easily use them or integrate them with their own projects. This problem is not only about training, but training is an important factor to consider. For a better training experience, we have two items: the first is a clear Python interface hiding NNFusion trivia and implementation details; then, based on that interface, we need to figure out the scope and add the missing training features.
Users cannot run test when build with docker.
In /src/nnfusion/core/operators/generic_op/generic_op_define/Scatter.cpp we defined a ScatterMim op. This seems to be a typo; ScatterMin might be the correct name.
🐛 Bug
Our code_style checker ignores the thirdparty/ folder, but the majority of our code is currently in thirdparty/ngraph; we should add it to the whitelist
To Reproduce
Steps to reproduce the behavior:
1.
2.
3.
Expected behavior
Additional context
release manager @wenxcs
cut branch date: 10/22
test period: 10/22-10/26
target release date: @wenxcs
Feature | Feature Owner | Test Owner(s) | Test case | Status
---|---|---|---|---
Documents
Installation(native & docker)
Workload 10+ tf models & 2 ONNX models
Framework
Hardware & Runtime
Artifact
🚀 Feature
NNFusion has kernel implementations for some complex operations (e.g., GELU, LayerNorm). However, some frontends implement these operations as a series of simple operators, so current NNFusion cannot recognize these patterns and call the relevant advanced kernel implementations:
Motivation
Pitch
These policies can be implemented in the pattern substitution pass.
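A minimal sketch of how such a substitution could work on a linearized op sequence; the GELU decomposition shown is illustrative only, and real NNFusion graphs are DAGs rather than lists:

```python
# Illustrative decomposition a frontend might emit for GELU (not exact).
GELU_PATTERN = ["Pow", "Multiply", "Add", "Tanh", "Add", "Multiply", "Multiply"]

def substitute(ops, pattern, fused_name):
    """Replace every occurrence of `pattern` in `ops` with one fused op."""
    out, i = [], 0
    while i < len(ops):
        if ops[i:i + len(pattern)] == pattern:
            out.append(fused_name)   # collapse the matched window
            i += len(pattern)
        else:
            out.append(ops[i])
            i += 1
    return out

ops = ["MatMul", "Add"] + GELU_PATTERN + ["MatMul"]
assert substitute(ops, GELU_PATTERN, "GELU") == ["MatMul", "Add", "GELU", "MatMul"]
```

The actual pass would match on the graph structure (producers/consumers) and rewire edges, but the match-then-replace idea is the same.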
Alternatives
Additional context
🐛 Bug
A compile error occurs in nnfusion/src/nnfusion/core/operators/generic_op/generic_op_define/Elementwise.cpp:6:27: error: non-local lambda expression cannot have a capture-default
To Reproduce
Steps to reproduce the behavior
[ 52%] Building CXX object src/nnfusion/core/operators/CMakeFiles/nnfusion_operators.dir/generic_op/generic_op_define/Elementwise.cpp.o
/home/liang/Documents/nnfusion/src/nnfusion/core/operators/generic_op/generic_op_define/Elementwise.cpp:6:27: error: non-local lambda expression cannot have a capture-default
6 | auto trans_elementwise = [&](std::shared_ptr<graph::GNode>& curr, const std::string& topi) {
| ^
make[2]: *** [src/nnfusion/core/operators/CMakeFiles/nnfusion_operators.dir/build.make:433: src/nnfusion/core/operators/CMakeFiles/nnfusion_operators.dir/generic_op/generic_op_define/Elementwise.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [CMakeFiles/Makefile2:2058: src/nnfusion/core/operators/CMakeFiles/nnfusion_operators.dir/all] Error 2
make: *** [Makefile:149: all] Error 2
Expected behavior
Additional context
Protoc version 3.6.1
Cmake version 3.18.3
g++ version 9.3.0
Ubuntu 20.04 x86_64
CMake has passed
(base) liang@:~/Documents/nnfusion/build$ cmake ..
-- MSRAsia NNFusion Team(@nnfusion)
-- https://github.com/microsoft/nnfusion
--
-- Installation directory: /usr/local
-- thirdparty enabled
-- tools enabled
-- nnfusion enabled
-- unit tests enabled
-- Configuring done
-- Generating done
-- Build files have been written to: /home/liang/Documents/nnfusion
🐛 Bug
To Reproduce
Steps to reproduce the behavior:
1.
2.
3.
Expected behavior
Additional context
🚀 Feature
Provide a Python runner to improve NNF usability.
Motivation
Currently, NNFusion is still not easy to use for third parties; the reasons include:
A possible solution is providing a Python interface hiding these details.
Goal
Provide a Python wrapper for NNF to improve NNF usability for PyTorch users. Use PyTorch models/tensors as the standard interface; then users only need to replace forward execution with NNF.
Non-Goal
Class Definition
Workflow
Usage
...
model = MLP()
model.load_state_dict(torch.load('/path/to/checkpoint'))
data_loader = get_data_loader(batch_size=batch_size)
for batch in data_loader:
out = model(batch)
...
torch.save(model.state_dict(), '/path/to/checkpoint')
## load model and data loader by PyTorch
model = MLP()
model.load_state_dict(torch.load('/path/to/checkpoint'))
data_loader = get_data_loader(batch_size=batch_size)
## init NNF Runner
nnf_flags = {
    "codegen_debug": True,
    "kernel_fusion_level": 2,
}
runner = NNFRunner(model, **nnf_flags)
## replace execution by NNF
for batch in data_loader:
out = runner(batch)
...
## save model by PyTorch
torch.save(model.state_dict(), '/path/to/checkpoint')
Work Item
We need more discussion to break down items, roughly includes:
NNFusion has some using namespace in headers for convenience, like nnfusion/src/nnfusion/common/common.hpp (line 212 in 77ba8ca). We should remove using namespace from headers.
🚀 Feature
CUDA-Graph was introduced in CUDA 10.1 to reduce kernel launch overhead. CUDA-Graph matches current NNFusion's design, so it could easily be integrated into cuda_codegen to improve performance.
Motivation
Pitch
Add a stream to kernel_entry and capture the kernel_entry function to initialize the CUDA graph.
Note that capture cannot use the default stream, and there must not be host-blocking API calls (e.g., cudaDeviceSynchronize) during stream capture.
Alternatives
Additional context
https://developer.nvidia.com/blog/cuda-graphs/
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs
Would you like to wrap any pointers with the class template 'std::unique_ptr'?
🚀 Feature
Motivation
Pitch
Alternatives
Additional context
🐛 Bug
Check failed: 'm_ref_count > 0' and bad free when setting -fblockfusion_level=2
lstm-tf-slope.const_folded.pb
[ERROR] 2020-10-13T11:25:27z src/nnfusion/util/errors.hpp 169 Check failed: 'm_ref_count > 0' at /home/lingm/projects/rammer_artifact/thirdparty/ngraph/src/nnfusion/common/descriptor/tensor.hpp:87:
(no explanation given)
terminate called after throwing an instance of 'nnfusion::errors::CheckError'
what(): Check failed: 'm_ref_count > 0' at /home/lingm/projects/rammer_artifact/thirdparty/ngraph/src/nnfusion/common/descriptor/tensor.hpp:87:
(no explanation given)
Aborted (core dumped)
frozen_lstm_l2_s2_h256.const_folded.pb
[ERROR] 2020-10-14T04:42:56z src/nnfusion/util/errors.hpp 169 Check failed: 'found' at /home/lingm/projects/rammer_artifact/src/nnfusion/engine/memory_allocator.cpp:241:
bad free
terminate called after throwing an instance of 'nnfusion::errors::CheckError'
what(): Check failed: 'found' at /home/lingm/projects/rammer_artifact/src/nnfusion/engine/memory_allocator.cpp:241:
bad free
Aborted (core dumped)
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Additional context
test nnfbot
This is the backlog of NNFusion, which tracks issues that are not yet planned but are considered future candidate items.
Our current or upcoming release is tracked in #194.
Our release procedures are listed in #195.
NNFusion users are highly encouraged to comment on and suggest the priority, preference, and needs for the work items. Please feel free to share your ideas with us or contribute to NNFusion.
Typo: Module | Module Owner
🐛 Bug
To Reproduce
Steps to reproduce the behavior:
1.
2.
3.
Expected behavior
Additional context
After running install_dependency.sh, numpy still needs to be installed manually.
🚀 Feature
Motivation
Pitch
Alternatives
Additional context
The current identifier of the kernel DB has no delimiter between parameters, which may introduce identifier conflicts for different kernel configurations.
For example: [1, 256, 16, 16] and [12, 56, 16, 16] could have the same identifier 12561616.
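A tiny sketch of the fix: joining dimensions with a delimiter makes such collisions impossible (the function name is illustrative):

```python
def make_identifier(shape, delimiter="_"):
    # Join dimensions with a delimiter so different shapes
    # can never produce the same identifier string.
    return delimiter.join(str(d) for d in shape)

# Without a delimiter, distinct shapes collide on "12561616":
assert "".join(map(str, [1, 256, 16, 16])) == "".join(map(str, [12, 56, 16, 16]))
# With a delimiter they stay distinct:
assert make_identifier([1, 256, 16, 16]) != make_identifier([12, 56, 16, 16])
```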
🐛 Bug
Although the cmake file in the nnfusion_rt folder can find the CUDA path, the linking procedure still uses the "/usr/local/cuda" path to link CUDA libraries. This results in a linking error when users set custom CUDA paths.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
cmake .
-- The C compiler identification is GNU 5.4.0
-- The CXX compiler identification is GNU 5.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found CUDA: /usr/local/cuda-10.2 (found version "10.2")
-- Configuring done
-- Generating done
make -j
Scanning dependencies of target nnfusion_naive_rt
[ 96%] Linking CXX static library libnnfusion_naive_rt.a
[ 96%] Built target nnfusion_naive_rt
Scanning dependencies of target main_test
[ 98%] Building CXX object CMakeFiles/main_test.dir/main_test.cpp.o
[100%] Linking CXX executable main_test
/usr/bin/ld: cannot find -lcudnn
collect2: error: ld returned 1 exit status
CMakeFiles/main_test.dir/build.make:110: recipe for target 'main_test' failed
make[2]: *** [main_test] Error 1
CMakeFiles/Makefile2:96: recipe for target 'CMakeFiles/main_test.dir/all' failed
make[1]: *** [CMakeFiles/main_test.dir/all] Error 2
Makefile:102: recipe for target 'all' failed
make: *** [all] Error 2
Additional context
After setting soft link "/usr/local/cuda -> /usr/local/cuda-10.2", it works well.
Feasibility:
The NNFusion project needs some flagship models to prove its usability; we chose BERT as one of them.
Target:
Work items:
Validation:
🐛 Bug
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Exit when codegen
Additional context
If we compile a model with the flag -fkernels_as_files=true and build the generated code in a non-root folder (e.g., nnfusion_rt/cuda_codegen/build), it reports the following error.
fatal error: shared.h: No such file or directory
However, the compilation succeeds in the root folder (i.e., nnfusion_rt/cuda_codegen)
🐛 Bug
When building nnfusion osdi2020 artifact, tvm-config.cmake
is required here:
https://github.com/microsoft/nnfusion/blob/osdi20_artifact/artifacts/scripts/build_and_install_deps.sh#L29
Please share this config file. Thanks!
🚀 Feature
Check GridDim in -fblockfusion_level=2 to satisfy the active block limitation in CUDA.
Motivation
BlockFusion with -fblockfusion_level=2 uses inter-block synchronization primitives. An improper number of BEs (vEUs) may lead to deadlock due to the active block limitation in CUDA.
Pitch
We can use nvcc to check the GridDim after blockfusion codegen and adaptively change the number of BEs (vEUs) to satisfy the active block limitation in CUDA.
Alternatives
Fall back to -fblockfusion_level=1 when the GridDim exceeds the active block limitation. The overhead of inter-block synchronization grows as the number of blocks increases.
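The adaptive check in the pitch can be sketched as follows; the block limit and return shape are illustrative assumptions, not values queried from CUDA:

```python
# Assumed device limit, e.g. SM count * max active blocks per SM.
MAX_ACTIVE_BLOCKS = 160

def plan_blockfusion(grid_dim, fusion_level=2):
    """Shrink the number of BEs (vEUs) so every block can be resident at
    once; inter-block synchronization would deadlock otherwise."""
    if fusion_level == 2 and grid_dim > MAX_ACTIVE_BLOCKS:
        # Adaptively reduce BEs to satisfy the active-block limitation.
        return {"level": 2, "grid_dim": MAX_ACTIVE_BLOCKS}
    return {"level": fusion_level, "grid_dim": grid_dim}

assert plan_blockfusion(512) == {"level": 2, "grid_dim": 160}
assert plan_blockfusion(96) == {"level": 2, "grid_dim": 96}
```

In a real implementation the limit would come from occupancy queries (e.g., via nvcc or the CUDA occupancy API) rather than a constant, and the fallback to level 1 would be one more branch here.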
Additional context
🚀 Feature
Publish our own Docker image to Docker Hub;
Motivation
Pitch
Alternatives
Additional context
🐛 Bug
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Additional context
This issue was reproduced by compiling a bert_large model.
test nnfbot
test nnfbot id
🐛 Bug
When building in debug mode, the link stage fails.
To Reproduce
Steps to reproduce the behavior:
Error log:
../../nnfusion/engine/pass/graph/blockfusion/libnnfusion_engine_pass_graph_blockfusion.a(blockfusion_optimizer.cpp.o): In function `BlockFusionWavefrontOptimizer::SplitGroup(std::shared_ptr<std::vector<std::shared_ptr<BlockFusionWavefrontOptimizer::FusionGroup>, std::allocator<std::shared_ptr<BlockFusionWavefrontOptimizer::FusionGroup> > > >)':
/home/jxue/repo/nnfusion-jlxue/src/nnfusion/engine/pass/graph/blockfusion/blockfusion_optimizer.cpp:296: undefined reference to `BlockFusionWavefrontOptimizer::MAX_GROUP'
../../nnfusion/engine/pass/graph/blockfusion/libnnfusion_engine_pass_graph_blockfusion.a(blockfusion_optimizer.cpp.o): In function `BlockFusionWavefrontOptimizer::GroupProfiler(std::shared_ptr<BlockFusionWavefrontOptimizer::FusionGroup>)':
/home/jxue/repo/nnfusion-jlxue/src/nnfusion/engine/pass/graph/blockfusion/blockfusion_optimizer.cpp:362: undefined reference to `BlockFusionWavefrontOptimizer::DEFAULT_BE'
../../nnfusion/engine/pass/graph/blockfusion/libnnfusion_engine_pass_graph_blockfusion.a(blockfusion_optimizer.cpp.o): In function `BlockFusionWavefrontOptimizer::FuseGroupOnGraph(std::shared_ptr<BlockFusionWavefrontOptimizer::FusionGroup>)':
/home/jxue/repo/nnfusion-jlxue/src/nnfusion/engine/pass/graph/blockfusion/blockfusion_optimizer.cpp:432: undefined reference to `BlockFusionWavefrontOptimizer::DEFAULT_BE'
collect2: error: ld returned 1 exit status
src/tools/nnfusion/CMakeFiles/nnfusion.dir/build.make:124: recipe for target 'src/tools/nnfusion/nnfusion' failed
make[2]: *** [src/tools/nnfusion/nnfusion] Error 1
CMakeFiles/Makefile2:2193: recipe for target 'src/tools/nnfusion/CMakeFiles/nnfusion.dir/all' failed
make[1]: *** [src/tools/nnfusion/CMakeFiles/nnfusion.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
../src/nnfusion/engine/pass/graph/blockfusion/libnnfusion_engine_pass_graph_blockfusion.a(blockfusion_optimizer.cpp.o): In function `BlockFusionWavefrontOptimizer::SplitGroup(std::shared_ptr<std::vector<std::shared_ptr<BlockFusionWavefrontOptimizer::FusionGroup>, std::allocator<std::shared_ptr<BlockFusionWavefrontOptimizer::FusionGroup> > > >)':
/home/jxue/repo/nnfusion-jlxue/src/nnfusion/engine/pass/graph/blockfusion/blockfusion_optimizer.cpp:296: undefined reference to `BlockFusionWavefrontOptimizer::MAX_GROUP'
../src/nnfusion/engine/pass/graph/blockfusion/libnnfusion_engine_pass_graph_blockfusion.a(blockfusion_optimizer.cpp.o): In function `BlockFusionWavefrontOptimizer::GroupProfiler(std::shared_ptr<BlockFusionWavefrontOptimizer::FusionGroup>)':
/home/jxue/repo/nnfusion-jlxue/src/nnfusion/engine/pass/graph/blockfusion/blockfusion_optimizer.cpp:362: undefined reference to `BlockFusionWavefrontOptimizer::DEFAULT_BE'
../src/nnfusion/engine/pass/graph/blockfusion/libnnfusion_engine_pass_graph_blockfusion.a(blockfusion_optimizer.cpp.o): In function `BlockFusionWavefrontOptimizer::FuseGroupOnGraph(std::shared_ptr<BlockFusionWavefrontOptimizer::FusionGroup>)':
/home/jxue/repo/nnfusion-jlxue/src/nnfusion/engine/pass/graph/blockfusion/blockfusion_optimizer.cpp:432: undefined reference to `BlockFusionWavefrontOptimizer::DEFAULT_BE'
collect2: error: ld returned 1 exit status
test/CMakeFiles/unit-test.dir/build.make:1014: recipe for target 'test/unit-test' failed
make[2]: *** [test/unit-test] Error 1
CMakeFiles/Makefile2:2240: recipe for target 'test/CMakeFiles/unit-test.dir/all' failed
make[1]: *** [test/CMakeFiles/unit-test.dir/all] Error 2
Makefile:148: recipe for target 'all' failed
make: *** [all] Error 2
Expected behavior
Build succeeds.
Additional context
no.
When building on a native system, cmake fails if an unexpected library version exists in PATH.
🚀 Feature
Currently we only use an all-ones tensor as input for main_test; we need to provide an easy interface for users.
Motivation
Pitch
Alternatives
Additional context
🚀 Feature
Add error handling to better discover kernel errors caused by kernel launching and kernel execution.
Motivation
I was checking the correctness of one model with the CPU backend and planned to compare the results with the CUDA backend, but eventually found that the CUDA results were wrong because one kernel with an invalid configuration failed to launch. The CUDA program just executed normally and didn't report any information.
Pitch
Alternatives
Additional context
🐛 Bug
Running the kernel fusion pass on some models may report a Check failed: '((cuda_kernel) != nullptr)' error at /src/nnfusion/core/kernels/cuda_gpu/kernels/elementwise_fused.cpp:166.
To Reproduce
Steps to reproduce the behavior:
nnfusion xx.pb --format tensorflow -fdefault_device CUDA -fblockfusion_level=0 -fkernel_fusion_level=3
Error logs:
[ERROR] 2020-10-12T03:16:32z src/nnfusion/util/errors.hpp 169 Check failed: '((cuda_kernel) != nullptr)' at /home/jxue/repo/nnfusion-jlxue/src/nnfusion/core/kernels/cuda_gpu/kernels/elementwise_fused.cpp:166: kernel type:
terminate called after throwing an instance of 'nnfusion::errors::NullPointer'
what(): Check failed: '((cuda_kernel) != nullptr)' at /home/jxue/repo/nnfusion-jlxue/src/nnfusion/core/kernels/cuda_gpu/kernels/elementwise_fused.cpp:166: kernel type:
Aborted (core dumped)
Expected behavior
Compilation succeeds.
Additional context
no.
🚀 Feature
The latest cuDNN and MIOpen provide basic operator fusion interfaces; could we move some operator fusion policies to use native MIOpen & cuDNN op fusion?
Motivation
Pitch
Alternatives
Additional context
Reference:
https://docs.nvidia.com/deeplearning/cudnn/developer-guide/index.html#op-fusion
https://rocmsoftwareplatform.github.io/MIOpen/doc/html/fusion.html#
The open_docker.sh script doesn't take CUDA support into account in the following two aspects: