microsoft / nnfusion
A flexible and efficient deep neural network (DNN) compiler that generates high-performance executables from a DNN model description.
License: MIT License
Users cannot download frozen_inception2_batch_1.pb from the provided link
🚀 Feature
Motivation
Pitch
Alternatives
Additional context
Fix the inefficient link.
test react to issue bot
The issue is to list out what we would like to mark as 0.1.
🚀 Feature
Motivation
Pitch
Alternatives
Additional context
When building with Docker, users are required to specify a username & password
The purpose of the code refactor is to improve our code quality and usability. The approach we take should be considered from two sides: the user's and our developers'.
From the user's side, the main goal is to let users use NNF as a real tool: compile a model and understand the procedure easily:
Building stages
Currently our building scripts support installing dependencies and running the build in a native env or inside a container, but we haven't thought much about what users actually meet in real scenarios:
Take #48 for example: the user didn't read our doc, so he didn't know we support Ubuntu 16/18, not 20. The system version should be checked by the building scripts.
Work items:
Testing stages
Our testing utilities are tricky and hard to use: the user must have NVIDIA/CUDA hardware and configure it through a specific config file. The unit tests have no hardware check, which results in failing tests.
We should make testing easier and make the test report easy for users to understand.
Besides, we should check coverage for each PR and report whether the code change is covered by tests; this will help us improve code quality.
Validation stages
NNF is currently more like a project in the tech-validation stage: we might configure the env by hand so that users can do validation. What we need to do is make the scripts or the NNFusion CLI easier for users to compile models.
One more problem: NNF needs a frozen model, but freezing a model is not a standard procedure for users, so NNF may take a faulty frozen model as input. Should we provide a standard script to freeze models?
Work item:
From our developers' aspect:
License Problem:
The Apache-2 license is kind of strict and makes modifying code hard, so we moved the code we didn't rewrite into the thirdparty folder. But we need to rewrite that code and bring it back into our source tree; if not, readers might get confused about where the code actually lives. That code is mainly related to the operator set, some core data types, and the importer frontends for TF and ONNX. We discuss those in later sections;
Operator Set:
The operator set we use originates from nGraph and is amended with some ops of the "OperatorV2" type. The main goal is to migrate the whole operator set to "OperatorV2" or a new class that is no longer hard-coded and can be added/removed/changed easily.
Operators should also support serialization.
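To make the direction concrete, here is a minimal sketch of what a data-driven, serializable operator definition could look like. All names here (the OperatorV2 constructor, serialize/deserialize) are illustrative assumptions, not NNFusion's actual API:

```python
import json

class OperatorV2:
    """Illustrative operator definition: declared from data, not hard-coded,
    so ops can be added/removed/changed without touching core code."""

    def __init__(self, name, attrs=None):
        self.name = name
        self.attrs = dict(attrs or {})

    def serialize(self):
        # Emit a plain JSON string so operator sets can be stored and reloaded.
        return json.dumps({"name": self.name, "attrs": self.attrs})

    @classmethod
    def deserialize(cls, data):
        obj = json.loads(data)
        return cls(obj["name"], obj["attrs"])

op = OperatorV2("Conv2D", {"strides": [1, 1], "padding": "SAME"})
restored = OperatorV2.deserialize(op.serialize())
assert restored.name == "Conv2D" and restored.attrs["padding"] == "SAME"
```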
Kernels:
Currently we have hard-coded kernels, Antares kernels (antares-ir), and kernelDB kernels. We have the features, but we don't provide a good mechanism for picking kernels from among them, and the kernels' interfaces are not the same.
So in this part, we might need to:
Firstly, design a general interface for all kernel providers, which lets us support more providers like TVM.
Secondly, design kernel-selection policies for kernel providers.
The new interface will give our optimization passes more flexibility to pick/change kernels.
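As a rough sketch of the two items above (a common provider interface plus a selection policy), under assumed names that are not NNFusion's real API:

```python
class KernelProvider:
    """Illustrative common interface for kernel providers
    (hard-coded, Antares, kernelDB, TVM, ...)."""

    def lookup(self, op_identifier):
        raise NotImplementedError

class DictProvider(KernelProvider):
    """Toy provider backed by a dict: op identifier -> (kernel, latency)."""

    def __init__(self, name, kernels):
        self.name = name
        self.kernels = kernels

    def lookup(self, op_identifier):
        return self.kernels.get(op_identifier)

def select_kernel(op_identifier, providers):
    # Simple selection policy: ask every provider, pick the lowest-latency hit.
    candidates = [p.lookup(op_identifier) for p in providers]
    candidates = [c for c in candidates if c is not None]
    return min(candidates, key=lambda c: c[1]) if candidates else None

providers = [
    DictProvider("hardcoded", {"Relu": ("relu_v1", 0.9)}),
    DictProvider("kernelDB", {"Relu": ("relu_tuned", 0.4)}),
]
assert select_kernel("Relu", providers) == ("relu_tuned", 0.4)
assert select_kernel("OneHot", providers) is None
```

A real policy could weigh profiling results or device constraints instead of a static latency, but the uniform `lookup` interface is the point.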
Code generator:
This might be the hardest part of the refactor plan: the code generator is complex and integrates many features, and those sub-features interact with each other.
The main goal is to make codegen much simpler so it can support a "new" device with much less code change.
Profiler
Our profiler has some flaws. For example, it does not guarantee the input data is valid, which may cause errors when profiling some kernels (e.g., OneHot). Also, the profiler and codegen are independent in the current design, but they share many functions. We may use codegen to do profiling.
Training
We have added basic training features like autodiff, backward ops, etc. But end users cannot easily use them or integrate them with their own projects. This problem is not only about training, but training is an important factor to consider. For a better training experience, we have two items: the first is a clear Python interface hiding NNFusion trivia and implementation details; then, based on that interface, we need to figure out the scope and add the missing training features.
Users cannot run test when build with docker.
In /src/nnfusion/core/operators/generic_op/generic_op_define/Scatter.cpp we defined a ScatterMim op. This seems to be a typo; ScatterMin might be the correct name.
🐛 Bug
Our code_style checker ignores the thirdparty/ folder, but the majority of our code is currently in thirdparty/ngraph; we should add it to the whitelist
To Reproduce
Steps to reproduce the behavior:
1.
2.
3.
Expected behavior
Additional context
release manager @wenxcs
cut branch date: 10/22
test period: 10/22-10/26
target release date: @wenxcs
Feature | Feature Owner | Test Owner(s) | Test case | Status
---|---|---|---|---
Documents
Installation(native & docker)
Workload 10+ tf models & 2 ONNX models
Framework
Hardware & Runtime
Artifact
🚀 Feature
NNFusion has kernel implementations for some complex operations (e.g., GELU, LayerNorm). However, some frontends implement these operations as a series of simple operators, so current NNFusion cannot recognize these patterns and call the relevant advanced kernel implementations:
Motivation
Pitch
These policies can be implemented in the pattern substitution pass.
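A minimal sketch of how such a substitution could work on a linearized op sequence; the GELU decomposition shown is illustrative only, and real NNFusion graphs are DAGs rather than lists:

```python
# Illustrative decomposition a frontend might emit for GELU (not exact).
GELU_PATTERN = ["Pow", "Multiply", "Add", "Tanh", "Add", "Multiply", "Multiply"]

def substitute(ops, pattern, fused_name):
    """Replace every occurrence of `pattern` in `ops` with one fused op."""
    out, i = [], 0
    while i < len(ops):
        if ops[i:i + len(pattern)] == pattern:
            out.append(fused_name)   # collapse the matched window
            i += len(pattern)
        else:
            out.append(ops[i])
            i += 1
    return out

ops = ["MatMul", "Add"] + GELU_PATTERN + ["MatMul"]
assert substitute(ops, GELU_PATTERN, "GELU") == ["MatMul", "Add", "GELU", "MatMul"]
```

The actual pass would match on the graph structure (producers/consumers) and rewire edges, but the match-then-replace idea is the same.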
Alternatives
Additional context
🐛 Bug
A compile error occurs in nnfusion/src/nnfusion/core/operators/generic_op/generic_op_define/Elementwise.cpp:6:27: error: non-local lambda expression cannot have a capture-default
To Reproduce
Steps to reproduce the behavior
[ 52%] Building CXX object src/nnfusion/core/operators/CMakeFiles/nnfusion_operators.dir/generic_op/generic_op_define/Elementwise.cpp.o
/home/liang/Documents/nnfusion/src/nnfusion/core/operators/generic_op/generic_op_define/Elementwise.cpp:6:27: error: non-local lambda expression cannot have a capture-default
6 | auto trans_elementwise = [&](std::shared_ptr<graph::GNode>& curr, const std::string& topi) {
| ^
make[2]: *** [src/nnfusion/core/operators/CMakeFiles/nnfusion_operators.dir/build.make:433: src/nnfusion/core/operators/CMakeFiles/nnfusion_operators.dir/generic_op/generic_op_define/Elementwise.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [CMakeFiles/Makefile2:2058: src/nnfusion/core/operators/CMakeFiles/nnfusion_operators.dir/all] Error 2
make: *** [Makefile:149: all] Error 2
Expected behavior
Additional context
Protoc version 3.6.1
Cmake version 3.18.3
g++ version 9.3.0
Ubuntu 20.04 x86_64
CMake has passed
(base) liang@:~/Documents/nnfusion/build$ cmake ..
-- MSRAsia NNFusion Team(@nnfusion)
-- https://github.com/microsoft/nnfusion
--
-- Installation directory: /usr/local
-- thirdparty enabled
-- tools enabled
-- nnfusion enabled
-- unit tests enabled
-- Configuring done
-- Generating done
-- Build files have been written to: /home/liang/Documents/nnfusion
🐛 Bug
To Reproduce
Steps to reproduce the behavior:
1.
2.
3.
Expected behavior
Additional context
🚀 Feature
Provide a Python runner to improve NNF usability.
Motivation
Currently, NNFusion is still not easy to use for third parties; the reasons include:
A possible solution is providing a Python interface hiding these details.
Goal
Provide a Python wrapper for NNF to improve NNF usability for PyTorch users. Use PyTorch models/tensors as the standard interface; then users only need to replace forward execution with NNF.
Non-Goal
Class Definition
Workflow
Usage
...
model = MLP()
model.load_state_dict(torch.load('/path/to/checkpoint'))
data_loader = get_data_loader(batch_size=batch_size)
for batch in data_loader:
out = model(batch)
...
torch.save(model.state_dict(), '/path/to/checkpoint')
## load model and data loader by PyTorch
model = MLP()
model.load_state_dict(torch.load('/path/to/checkpoint'))
data_loader = get_data_loader(batch_size=batch_size)
## init NNF Runner
nnf_flags = {
    "codegen_debug": True,
    "kernel_fusion_level": 2,
}
runner = NNFRunner(model, **nnf_flags)
## replace execution by NNF
for batch in data_loader:
out = runner(batch)
...
## save model by PyTorch
torch.save(model.state_dict(), '/path/to/checkpoint')
Work Item
We need more discussion to break down items, roughly includes:
NNFusion has some using namespace in headers for convenience, like nnfusion/src/nnfusion/common/common.hpp (line 212 in 77ba8ca). We should remove using namespace from headers.
🚀 Feature
CUDA-Graph was introduced in CUDA 10.1 to reduce kernel launch overhead. CUDA-Graph matches current NNFusion's design, so it could easily be integrated into cuda_codegen to improve performance.
Motivation
Pitch
Add a stream to kernel_entry and capture the kernel_entry function to initialize the CUDA graph.
Note that capture cannot use the default stream, and there must not be host-blocking API calls (e.g., cudaDeviceSynchronize) during stream capture.
Alternatives
Additional context
https://developer.nvidia.com/blog/cuda-graphs/
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs
Would you like to wrap any pointers with the class template 'std::unique_ptr'?
🚀 Feature
Motivation
Pitch
Alternatives
Additional context
🐛 Bug
Check failed: 'm_ref_count > 0' and bad free when setting -fblockfusion_level=2
lstm-tf-slope.const_folded.pb
[ERROR] 2020-10-13T11:25:27z src/nnfusion/util/errors.hpp 169 Check failed: 'm_ref_count > 0' at /home/lingm/projects/rammer_artifact/thirdparty/ngraph/src/nnfusion/common/descriptor/tensor.hpp:87:
(no explanation given)
terminate called after throwing an instance of 'nnfusion::errors::CheckError'
what(): Check failed: 'm_ref_count > 0' at /home/lingm/projects/rammer_artifact/thirdparty/ngraph/src/nnfusion/common/descriptor/tensor.hpp:87:
(no explanation given)
Aborted (core dumped)
frozen_lstm_l2_s2_h256.const_folded.pb
[ERROR] 2020-10-14T04:42:56z src/nnfusion/util/errors.hpp 169 Check failed: 'found' at /home/lingm/projects/rammer_artifact/src/nnfusion/engine/memory_allocator.cpp:241:
bad free
terminate called after throwing an instance of 'nnfusion::errors::CheckError'
what(): Check failed: 'found' at /home/lingm/projects/rammer_artifact/src/nnfusion/engine/memory_allocator.cpp:241:
bad free
Aborted (core dumped)
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Additional context
test nnfbot
This is the backlog of NNFusion, which tracks issues that are not yet planned but are considered future candidate items.
Our current or upcoming release is tracked in #194.
Our release procedures are listed in #195.
NNFusion users are highly encouraged to comment on and suggest the priority, preference, and needs for the work items. Please feel free to share your ideas with us or contribute to NNFusion.
Typo: Module | Module Owner
🐛 Bug
To Reproduce
Steps to reproduce the behavior:
1.
2.
3.
Expected behavior
Additional context
After running install_dependency.sh, numpy still needs to be installed manually.
🚀 Feature
Motivation
Pitch
Alternatives
Additional context
The current identifier of the kernel DB has no delimiter between parameters, which may introduce identifier conflicts for different kernel configurations.
For example: [1, 256, 16, 16] and [12, 56, 16, 16] could have the same identifier 12561616.
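A tiny sketch of the fix: joining dimensions with a delimiter makes such collisions impossible (the function name is illustrative):

```python
def make_identifier(shape, delimiter="_"):
    # Join dimensions with a delimiter so different shapes
    # can never produce the same identifier string.
    return delimiter.join(str(d) for d in shape)

# Without a delimiter, distinct shapes collide on "12561616":
assert "".join(map(str, [1, 256, 16, 16])) == "".join(map(str, [12, 56, 16, 16]))
# With a delimiter they stay distinct:
assert make_identifier([1, 256, 16, 16]) != make_identifier([12, 56, 16, 16])
```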
🐛 Bug
Although the cmake file in the nnfusion_rt folder can find the CUDA path, the linking procedure still uses the "/usr/local/cuda" path to link CUDA libraries. This results in a linking error when users set custom CUDA paths.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
cmake .
-- The C compiler identification is GNU 5.4.0
-- The CXX compiler identification is GNU 5.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found CUDA: /usr/local/cuda-10.2 (found version "10.2")
-- Configuring done
-- Generating done
make -j
Scanning dependencies of target nnfusion_naive_rt
[ 96%] Linking CXX static library libnnfusion_naive_rt.a
[ 96%] Built target nnfusion_naive_rt
Scanning dependencies of target main_test
[ 98%] Building CXX object CMakeFiles/main_test.dir/main_test.cpp.o
[100%] Linking CXX executable main_test
/usr/bin/ld: cannot find -lcudnn
collect2: error: ld returned 1 exit status
CMakeFiles/main_test.dir/build.make:110: recipe for target 'main_test' failed
make[2]: *** [main_test] Error 1
CMakeFiles/Makefile2:96: recipe for target 'CMakeFiles/main_test.dir/all' failed
make[1]: *** [CMakeFiles/main_test.dir/all] Error 2
Makefile:102: recipe for target 'all' failed
make: *** [all] Error 2
Additional context
After setting soft link "/usr/local/cuda -> /usr/local/cuda-10.2", it works well.
Feasibility:
The NNFusion project needs some flagship models to prove its usability; we chose BERT as one of them.
Target:
Work items:
Validation:
🐛 Bug
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Exit when codegen
Additional context
If we compile a model with the flag -fkernels_as_files=true and build the generated code in a non-root folder (e.g., nnfusion_rt/cuda_codegen/build), it reports the following error.
fatal error: shared.h: No such file or directory
However, the compilation succeeds in the root folder (i.e., nnfusion_rt/cuda_codegen)
🐛 Bug
When building nnfusion osdi2020 artifact, tvm-config.cmake
is required here:
https://github.com/microsoft/nnfusion/blob/osdi20_artifact/artifacts/scripts/build_and_install_deps.sh#L29
Please share this config file. Thanks!
🚀 Feature
Check GridDim in -fblockfusion_level=2 to satisfy the active block limitation in CUDA.
Motivation
BlockFusion with -fblockfusion_level=2 uses inter-block synchronization primitives. An improper number of BEs (vEUs) may lead to deadlock due to the active block limitation in CUDA.
Pitch
We can use nvcc to check the GridDim after blockfusion codegen and adaptively change the number of BEs (vEUs) to satisfy the active block limitation in CUDA.
Alternatives
Fall back to -fblockfusion_level=1 when the GridDim exceeds the active block limitation. The overhead of inter-block synchronization grows as the number of blocks increases.
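The adaptive check in the pitch can be sketched as follows; the block limit and return shape are illustrative assumptions, not values queried from CUDA:

```python
# Assumed device limit, e.g. SM count * max active blocks per SM.
MAX_ACTIVE_BLOCKS = 160

def plan_blockfusion(grid_dim, fusion_level=2):
    """Shrink the number of BEs (vEUs) so every block can be resident at
    once; inter-block synchronization would deadlock otherwise."""
    if fusion_level == 2 and grid_dim > MAX_ACTIVE_BLOCKS:
        # Adaptively reduce BEs to satisfy the active-block limitation.
        return {"level": 2, "grid_dim": MAX_ACTIVE_BLOCKS}
    return {"level": fusion_level, "grid_dim": grid_dim}

assert plan_blockfusion(512) == {"level": 2, "grid_dim": 160}
assert plan_blockfusion(96) == {"level": 2, "grid_dim": 96}
```

In a real implementation the limit would come from occupancy queries (e.g., via nvcc or the CUDA occupancy API) rather than a constant, and the fallback to level 1 would be one more branch here.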
Additional context
🚀 Feature
Publish our own Docker image to Docker Hub;
Motivation
Pitch
Alternatives
Additional context
🐛 Bug
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Additional context
This issue was reproduced by compiling a bert_large model.
test nnfbot
test nnfbot id
🐛 Bug
When building in debug mode, the link stage fails.
To Reproduce
Steps to reproduce the behavior:
Error log:
../../nnfusion/engine/pass/graph/blockfusion/libnnfusion_engine_pass_graph_blockfusion.a(blockfusion_optimizer.cpp.o): In function `BlockFusionWavefrontOptimizer::SplitGroup(std::shared_ptr<std::vector<std::shared_ptr<BlockFusionWavefrontOptimizer::FusionGroup>, std::allocator<std::shared_ptr<BlockFusionWavefrontOptimizer::FusionGroup> > > >)':
/home/jxue/repo/nnfusion-jlxue/src/nnfusion/engine/pass/graph/blockfusion/blockfusion_optimizer.cpp:296: undefined reference to `BlockFusionWavefrontOptimizer::MAX_GROUP'
../../nnfusion/engine/pass/graph/blockfusion/libnnfusion_engine_pass_graph_blockfusion.a(blockfusion_optimizer.cpp.o): In function `BlockFusionWavefrontOptimizer::GroupProfiler(std::shared_ptr<BlockFusionWavefrontOptimizer::FusionGroup>)':
/home/jxue/repo/nnfusion-jlxue/src/nnfusion/engine/pass/graph/blockfusion/blockfusion_optimizer.cpp:362: undefined reference to `BlockFusionWavefrontOptimizer::DEFAULT_BE'
../../nnfusion/engine/pass/graph/blockfusion/libnnfusion_engine_pass_graph_blockfusion.a(blockfusion_optimizer.cpp.o): In function `BlockFusionWavefrontOptimizer::FuseGroupOnGraph(std::shared_ptr<BlockFusionWavefrontOptimizer::FusionGroup>)':
/home/jxue/repo/nnfusion-jlxue/src/nnfusion/engine/pass/graph/blockfusion/blockfusion_optimizer.cpp:432: undefined reference to `BlockFusionWavefrontOptimizer::DEFAULT_BE'
collect2: error: ld returned 1 exit status
src/tools/nnfusion/CMakeFiles/nnfusion.dir/build.make:124: recipe for target 'src/tools/nnfusion/nnfusion' failed
make[2]: *** [src/tools/nnfusion/nnfusion] Error 1
CMakeFiles/Makefile2:2193: recipe for target 'src/tools/nnfusion/CMakeFiles/nnfusion.dir/all' failed
make[1]: *** [src/tools/nnfusion/CMakeFiles/nnfusion.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
../src/nnfusion/engine/pass/graph/blockfusion/libnnfusion_engine_pass_graph_blockfusion.a(blockfusion_optimizer.cpp.o): In function `BlockFusionWavefrontOptimizer::SplitGroup(std::shared_ptr<std::vector<std::shared_ptr<BlockFusionWavefrontOptimizer::FusionGroup>, std::allocator<std::shared_ptr<BlockFusionWavefrontOptimizer::FusionGroup> > > >)':
/home/jxue/repo/nnfusion-jlxue/src/nnfusion/engine/pass/graph/blockfusion/blockfusion_optimizer.cpp:296: undefined reference to `BlockFusionWavefrontOptimizer::MAX_GROUP'
../src/nnfusion/engine/pass/graph/blockfusion/libnnfusion_engine_pass_graph_blockfusion.a(blockfusion_optimizer.cpp.o): In function `BlockFusionWavefrontOptimizer::GroupProfiler(std::shared_ptr<BlockFusionWavefrontOptimizer::FusionGroup>)':
/home/jxue/repo/nnfusion-jlxue/src/nnfusion/engine/pass/graph/blockfusion/blockfusion_optimizer.cpp:362: undefined reference to `BlockFusionWavefrontOptimizer::DEFAULT_BE'
../src/nnfusion/engine/pass/graph/blockfusion/libnnfusion_engine_pass_graph_blockfusion.a(blockfusion_optimizer.cpp.o): In function `BlockFusionWavefrontOptimizer::FuseGroupOnGraph(std::shared_ptr<BlockFusionWavefrontOptimizer::FusionGroup>)':
/home/jxue/repo/nnfusion-jlxue/src/nnfusion/engine/pass/graph/blockfusion/blockfusion_optimizer.cpp:432: undefined reference to `BlockFusionWavefrontOptimizer::DEFAULT_BE'
collect2: error: ld returned 1 exit status
test/CMakeFiles/unit-test.dir/build.make:1014: recipe for target 'test/unit-test' failed
make[2]: *** [test/unit-test] Error 1
CMakeFiles/Makefile2:2240: recipe for target 'test/CMakeFiles/unit-test.dir/all' failed
make[1]: *** [test/CMakeFiles/unit-test.dir/all] Error 2
Makefile:148: recipe for target 'all' failed
make: *** [all] Error 2
Expected behavior
Build succeeds.
Additional context
no.
When building on a native system, cmake fails if an unexpected library version exists in PATH.
🚀 Feature
Currently we only use an all-ones tensor as input for main_test; we need to provide an easy interface for users.
Motivation
Pitch
Alternatives
Additional context
🚀 Feature
Add error handling to better discover kernel errors caused by kernel launching and kernel execution.
Motivation
I was checking the correctness of one model with the CPU backend and planned to compare the results with the CUDA backend, but eventually found that the CUDA results were wrong because one kernel with an invalid configuration failed to launch. The CUDA program just executed normally and didn't report any information.
Pitch
Alternatives
Additional context
🐛 Bug
Running the kernel fusion pass on some models may report a Check failed: '((cuda_kernel) != nullptr)' error at /src/nnfusion/core/kernels/cuda_gpu/kernels/elementwise_fused.cpp:166.
To Reproduce
Steps to reproduce the behavior:
nnfusion xx.pb --format tensorflow -fdefault_device CUDA -fblockfusion_level=0 -fkernel_fusion_level=3
Error logs:
[ERROR] 2020-10-12T03:16:32z src/nnfusion/util/errors.hpp 169 Check failed: '((cuda_kernel) != nullptr)' at /home/jxue/repo/nnfusion-jlxue/src/nnfusion/core/kernels/cuda_gpu/kernels/elementwise_fused.cpp:166: kernel type:
terminate called after throwing an instance of 'nnfusion::errors::NullPointer'
what(): Check failed: '((cuda_kernel) != nullptr)' at /home/jxue/repo/nnfusion-jlxue/src/nnfusion/core/kernels/cuda_gpu/kernels/elementwise_fused.cpp:166: kernel type:
Aborted (core dumped)
Expected behavior
Compilation succeeds.
Additional context
no.
🚀 Feature
The latest cuDNN and MIOpen provide basic operator fusion interfaces; could we move some operator fusion policies to use native MIOpen & cuDNN op fusion?
Motivation
Pitch
Alternatives
Additional context
Reference:
https://docs.nvidia.com/deeplearning/cudnn/developer-guide/index.html#op-fusion
https://rocmsoftwareplatform.github.io/MIOpen/doc/html/fusion.html#
The open_docker.sh script doesn't take CUDA support into account in the following two aspects: