
megengine / megengine

4.7K 136.0 529.0 31.15 MB

MegEngine is a fast, scalable, easy-to-use deep learning framework with automatic differentiation.

Home Page: https://megengine.org.cn/

License: Apache License 2.0

CMake 0.45% Dockerfile 0.01% Shell 0.36% C++ 80.18% C 1.75% Makefile 0.01% Python 7.48% Cuda 9.72% Starlark 0.02% MLIR 0.01% HTML 0.01% JavaScript 0.01% GDB 0.01%
deep-learning machine-learning megengine tensor autograd python gpu numpy

megengine's Introduction

MegEngine

MegEngine is a fast, scalable, and user-friendly deep learning framework with three key features.

  • Unified framework for both training and inference
    • Quantization, dynamic shape/image pre-processing, and even derivation with a single model.
    • After training, put everything into your model and run inference on any platform with speed and precision. Check here for a quick guide.
  • The lowest hardware requirements
    • GPU memory usage can be reduced to one-third of the original when the DTR algorithm is enabled.
    • Run inference with minimal memory usage by leveraging our Pushdown memory planner.
  • Inference efficiently on all platforms
    • Fast, high-precision inference on x86, Arm, CUDA, and ROCm.
    • Supports Linux, Windows, iOS, Android, TEE, etc.
    • Optimize performance and memory usage by leveraging our advanced features.

Installation

NOTE: MegEngine now supports Python installation on Linux 64-bit, Windows 64-bit, macOS 10.14+ (CPU only), and Android 7+ (CPU only) platforms, with Python 3.6 to 3.9. On Windows 10 you can either install the Linux distribution through the Windows Subsystem for Linux (WSL) or install the Windows distribution directly. Many other platforms are supported for inference.

Binaries

To install the pre-built binaries via pip wheels:

python3 -m pip install --upgrade pip
python3 -m pip install megengine -f https://megengine.org.cn/whl/mge.html
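
To confirm that the actual wheel (rather than the placeholder package) was installed, a quick sanity check, not part of the official instructions, is to print the version:

import megengine
print(megengine.__version__)  # should print the installed MegEngine version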

Building from Source

How to Contribute

We strive to build an open and friendly community. We aim to power humanity with AI.

How to Contact Us

Resources

License

MegEngine is licensed under the Apache License, Version 2.0

Citation

If you use MegEngine in your publication, please cite it using the following BibTeX entry.

@Misc{MegEngine,
  institution = {megvii},
  title = {MegEngine: A fast, scalable and easy-to-use deep learning framework},
  howpublished = {\url{https://github.com/MegEngine/MegEngine}},
  year = {2020}
}

Copyright (c) 2014-2021 Megvii Inc. All rights reserved.

megengine's People

Contributors

asthestarsfalll chaibyte chenjiahui0131 haolongzhangm huahua404 hxdmegvii is-shidian jia-kai jieli-matrix kagome1007 kxz18 liangmuxin llehtahw luzzyzhang megvii-mge qsingle seeker98 shrimplau stonemo thunderstudying tpoisonooo wangxiang9603 wanwan1996 weixiao-huang wkcn xindah xxr3376 yeasoon yzchen zjd1988


megengine's Issues

BUG Issue: GPU memory keeps growing when dropout is added during training

Environment

1. System environment: Ubuntu 18.04
2. MegEngine version: 0.3.1
3. Python version: 3.7.6

When training LeNet on the MNIST dataset on GPU, GPU memory usage keeps increasing until it overflows (4 GB of memory, batch size 128).

Please provide the key code snippet to help track down the problem

Sample training code: training LeNet on the MNIST dataset on GPU, with two dropout layers added.

Removing the dropout layers makes the problem go away.

class LeNet(M.Module):
    def __init__(self):
        super(LeNet, self).__init__()
        # single-channel input, two 5x5 conv + ReLU + pooling stages
        self.conv1 = M.Conv2d(1, 6, 5)
        self.relu1 = M.ReLU()
        self.pool1 = M.MaxPool2d(2, 2)
        self.conv2 = M.Conv2d(6, 16, 5)
        self.relu2 = M.ReLU()
        self.pool2 = M.MaxPool2d(2, 2)
        # two fully-connected layers + ReLU
        self.fc1 = M.Linear(16 * 5 * 5, 120)
        self.relu3 = M.ReLU()
        self.drop1 = M.Dropout(0.5)
        self.fc2 = M.Linear(120, 84)
        self.relu4 = M.ReLU()
        self.drop2 = M.Dropout(0.5)
        # classifier
        self.classifer = M.Linear(84, 10)

    def forward(self, x):
        x = self.pool1(self.relu1(self.conv1(x)))
        x = self.pool2(self.relu2(self.conv2(x)))
        # F.flatten flattens the (N, C, H, W) tensor x starting from the first dimension (i.e. C),
        # giving a tensor of shape (N, C*H*W). Equivalent to the reshape: x = x.reshape(x.shape[0], -1)
        x = F.flatten(x, 1)
        x = self.relu3(self.fc1(x))
        x = self.drop1(x)
        x = self.relu4(self.fc2(x))
        x = self.drop2(x)
        x = self.classifer(x)
        return x

Please provide the full logs and error messages

MegBrainError: MegBrain core throws exception: mgb::MegDNNError
curand call failed: status=102(CURAND_STATUS_ALLOCATION_FAILED) call=curandGenerateUniform(m_curand_handle.gen(), dst.ptr<dt_float32>(), dst.layout.total_nr_elems())
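
For context, a minimal hypothetical loop (not part of the original report) that exercises the LeNet module defined above; the report describes GPU memory growing steadily under such a loop when the Dropout layers are present, and staying flat when they are removed:

import numpy as np
import megengine as mge

net = LeNet()  # the module defined above
for _ in range(1000):
    x = mge.tensor(np.random.randn(128, 1, 28, 28).astype(np.float32))
    _ = net(x)  # forward only; memory reportedly keeps growing with dropout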

Installed with pip install; errors when running inside Docker

Environment

1. System environment:
Host OS: Linux WXRG0238 4.13.0-36-generic #40~16.04.1-Ubuntu SMP Fri Feb 16 23:25:58 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
NVIDIA driver: 418.67

I am running inside a Docker image
(tried both Ubuntu 16 and 18: nvidia/cuda:10.1-devel-ubuntu18.04)
cuda 10.1
cudnn 7.6.5.32
anaconda python 3.6

2. MegEngine version:
Installed with the official pip install megengine -f https://megengine.org.cn/whl/mge.html
3. Python version: 3.6

Steps to reproduce

1. Run with Python (it crashes regardless of how many GPUs are used): https://github.com/MegEngine/Models/blob/master/official/vision/classification/resnet/train.py

Please provide the key code snippet to help track down the problem

Run with Python (it crashes regardless of how many GPUs are used): https://github.com/MegEngine/Models/blob/master/official/vision/classification/resnet/train.py

Please provide the full logs and error messages

02 11:07:18 init distributed process group 1 / 2
02 11:07:18 init distributed process group 0 / 2
02 11:07:25 preparing dataset..
02 11:07:25 preparing dataset..
02 11:07:25 WRN devkit directory /mnt/CEPH_GALACTICA/dataset/imagenet_new/ILSVRC2012_devkit_t12 does not exists
02 11:07:25 WRN devkit directory /mnt/CEPH_GALACTICA/dataset/imagenet_new/ILSVRC2012_devkit_t12 does not exists
02 11:07:37 WRN devkit directory /mnt/CEPH_GALACTICA/dataset/imagenet_new/ILSVRC2012_devkit_t12 does not exists
02 11:07:37 WRN devkit directory /mnt/CEPH_GALACTICA/dataset/imagenet_new/ILSVRC2012_devkit_t12 does not exists
02 11:07:37 Epoch 0 LR 2.500e-02
02 11:07:37 Epoch 0 LR 2.500e-02
[1585825669.280300] [WXRG0238:3850 :0]          debug.c:1285 UCX  WARN  ucs_debug_disable_signal: signal 8 was not set in ucs
[1585825669.280334] [WXRG0238:3850 :1]          debug.c:1285 UCX  WARN  ucs_debug_disable_signal: signal 8 was not set in ucs
[WXRG0238:3850 :2:4026] Caught signal 7 (Bus error: nonexistent physical address)
[1585825669.280354] [WXRG0238:3850 :2]          debug.c:1271 UCX  WARN  ucs_spinlock_destroy() failed (-15)
[WXRG0238:3850 :1:4029] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace ====
[1585825669.280357] [WXRG0238:3850 :3]          debug.c:1285 UCX  WARN  ucs_debug_disable_signal: signal 8 was not set in ucs
[1585825669.280360] [WXRG0238:3850 :1]          debug.c:1271 UCX  WARN  ucs_spinlock_destroy() failed (-15)
==== backtrace ====
[WXRG0238:3850 :3:4054] Caught signal 7 (Bus error: nonexistent physical address)
[1585825669.280370] [WXRG0238:3850 :0]          debug.c:1285 UCX  WARN  ucs_debug_disable_signal: signal 11 was not set in ucs
[1585825669.280400] [WXRG0238:3850 :0]          debug.c:1271 UCX  WARN  ucs_spinlock_destroy() failed (-15)
[1585825669.280405] [WXRG0238:3850 :4]          debug.c:1285 UCX  WARN  ucs_debug_disable_signal: signal 8 was not set in ucs
[WXRG0238:3850 :4:4030] Caught signal 7 (Bus error: nonexistent physical address)
[1585825669.280424] [WXRG0238:3850 :5]          debug.c:1285 UCX  WARN  ucs_debug_disable_signal: signal 8 was not set in ucs
[WXRG0238:3850 :0:4028] Caught signal 7 (Bus error: nonexistent physical address)
[1585825669.280439] [WXRG0238:3850 :0]          debug.c:1271 UCX  WARN  ucs_spinlock_destroy() failed (-15)
==== backtrace ====
[1585825669.280391] [WXRG0238:3850 :3]          debug.c:1271 UCX  WARN  ucs_spinlock_destroy() failed (-15)
==== backtrace ====
[1585825669.280427] [WXRG0238:3850 :4]          debug.c:1271 UCX  WARN  ucs_spinlock_destroy() failed (-15)
==== backtrace ====
[WXRG0238:3850 :5:4027] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace ====
    0  /root/anaconda3/envs/original/lib/python3.6/site-packages/megengine/_internal/lib/libucs-e8168204.so.0.0.0(+0x2338c) [0x7fd1b119938c]
    1  /root/anaconda3/envs/original/lib/python3.6/site-packages/megengine/_internal/lib/libucs-e8168204.so.0.0.0(+0x23588) [0x7fd1b1199588]
    2  /lib/x86_64-linux-gnu/libpthread.so.0(+0x12890) [0x7fd2937d6890]
    3  /lib/x86_64-linux-gnu/libc.so.6(+0x18eb1f) [0x7fd293561b1f]
    4  /root/anaconda3/envs/original/lib/python3.6/site-packages/pyarrow/libarrow.so.16(+0x63ccfc) [0x7fd1aab8fcfc]
    5  /root/anaconda3/envs/original/lib/python3.6/site-packages/pyarrow/libarrow.so.16(+0x63ce6b) [0x7fd1aab8fe6b]
    6  /lib/x86_64-linux-gnu/libpthread.so.0(+0xf827) [0x7fd2937d3827]
    7  /root/anaconda3/envs/original/lib/python3.6/site-packages/pyarrow/libarrow.so.16(_ZNSt17_Function_handlerIFvvEN5arrow8internal6detail21packaged_task_wrapperIPvJEEEE9_M_invokeERKSt9_Any_data+0x123) [0x7fd1aab90513]
    8  /root/anaconda3/envs/original/lib/python3.6/site-packages/pyarrow/libarrow.so.16(+0x6432e4) [0x7fd1aab962e4]
    9  /root/anaconda3/envs/original/lib/python3.6/site-packages/pyarrow/libarrow.so.16(+0xba158f) [0x7fd1ab0f458f]
   10  /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7fd2937cb6db]
   11  /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fd2934f488f]
===================
Process Process-1:6:
Traceback (most recent call last):
  File "/root/anaconda3/envs/original/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/root/anaconda3/envs/original/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/root/anaconda3/envs/original/lib/python3.6/site-packages/megengine/data/dataloader.py", line 426, in _data_selecting_loop
    batch_queue.put(batch_data, timeout=1)
  File "/root/anaconda3/envs/original/lib/python3.6/site-packages/megengine/data/_queue.py", line 66, in put
    object_id = self.client.put(data)
  File "pyarrow/_plasma.pyx", line 536, in pyarrow._plasma.PlasmaClient.put
  File "pyarrow/_plasma.pyx", line 364, in pyarrow._plasma.PlasmaClient.create
  File "pyarrow/_plasma.pyx", line 291, in pyarrow._plasma.plasma_check_status
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Encountered unexpected EOF
Exception ignored in: <bound method _ParallelDataLoaderIter.__del__ of <megengine.data.dataloader._ParallelDataLoaderIter object at 0x7f2897b3c160>>
Traceback (most recent call last):
  File "/root/anaconda3/envs/original/lib/python3.6/site-packages/megengine/data/dataloader.py", line 544, in __del__
AttributeError: '_ParallelDataLoaderIter' object has no attribute '_ParallelDataLoaderIter__initialized'
Process Process-1:
Traceback (most recent call last):
  File "/root/anaconda3/envs/original/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/root/anaconda3/envs/original/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/root/test/Models/official/vision/classification/resnet/train.py", line 193, in worker
    train_func, train_queue, optimizer, args, epoch=epoch
  File "/root/test/Models/official/vision/classification/resnet/train.py", line 223, in train
    for step, (image, label) in enumerate(data_queue):
  File "/root/anaconda3/envs/original/lib/python3.6/site-packages/megengine/data/dataloader.py", line 152, in __next__
    minibatch = self._get_next_batch()
  File "/root/anaconda3/envs/original/lib/python3.6/site-packages/megengine/data/dataloader.py", line 512, in _get_next_batch
    batch_data = self._try_get_next_batch()
  File "/root/anaconda3/envs/original/lib/python3.6/site-packages/megengine/data/dataloader.py", line 501, in _try_get_next_batch
    self._check_workers()
  File "/root/anaconda3/envs/original/lib/python3.6/site-packages/megengine/data/dataloader.py", line 441, in _check_workers
    raise RuntimeError("data collecting worker died. {}".format(exitcode))
RuntimeError: data collecting worker died. None

I had to terminate the program with Ctrl-C. GPU memory already had data in it, but the program hung.

Thanks and looking forward to your reply.

BUG Issue

Environment

1. System environment: Linux ubuntu_vb 4.15.0-91-generic #92-Ubuntu SMP x86_64
2. MegEngine version: default version
3. Python version: Python 3.6

Steps to reproduce

Installation followed https://megengine.org.cn/doc/latest/index.html#installation
1. pip3 install megengine -f https://megengine.org.cn/whl/mge.html
2. python3
3. The interpreter then shows:

Python 3.6.9 (default, Nov  7 2019, 10:44:02) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import megengine as meg
This is an placeholder only, please install by 'pip3 install megengine -f https://megengine.org.cn/whl/mge.html'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/username/.local/lib/python3.6/site-packages/megengine/__init__.py", line 3, in <module>
    raise ValueError("This is an placeholder only")
ValueError: This is an placeholder only
>>> 

Error compiling nccl when building from source

The build fails at the "built target nccl" step.

Environment

1. System environment: Ubuntu 16.04
2. MegEngine version:
3. Python version: 3.7.7
4. CUDA: 10.1
5. TensorRT: 6.0.1.5
6. LLVM/Clang: 6.0

(screenshot of the build error omitted)

module 'megengine' has no attribute 'tensor'

Environment

1. System environment: Ubuntu 18.04
2. MegEngine version: megengine-0.3.1
3. Python version: 3.7

Steps to reproduce

  1. pip install megengine
  2. Run the official demo
  3. The package imports normally
  (screenshot omitted)

Please provide the key code snippet to help track down the problem

import numpy as np
import megengine as mge
a = mge.tensor(np.random.random((2,5)).astype('float32'))
print(a)
b = mge.tensor([1., 2., 3.])
print(b)

Please provide the full logs and error messages

AttributeError: module 'megengine' has no attribute 'tensor'
(screenshot omitted)

error when building from source

Environment

1. System environment: Ubuntu 16
2. MegEngine version: current version
3. Python version: 3.6

Steps to reproduce

Please provide the key code snippet to help track down the problem

Please provide the full logs and error messages

/home/yjyang/test-MegEngine/MegEngine/third_party/intel-mkl-dnn/src/cpu/nspc_batch_normalization.cpp: In lambda function:
/home/yjyang/test-MegEngine/MegEngine/third_party/intel-mkl-dnn/src/cpu/nspc_batch_normalization.cpp:386:10: error: ‘simdlen’ is not valid for ‘#pragma omp simd’
PRAGMA_OMP_SIMD(SIMD_LEN_16)
^
src/cpu/CMakeFiles/dnnl_cpu.dir/build.make:1574: recipe for target 'src/cpu/CMakeFiles/dnnl_cpu.dir/nspc_batch_normalization.cpp.o' failed
make[5]: *** [src/cpu/CMakeFiles/dnnl_cpu.dir/nspc_batch_normalization.cpp.o] Error 1
CMakeFiles/Makefile2:276: recipe for target 'src/cpu/CMakeFiles/dnnl_cpu.dir/all' failed
make[4]: *** [src/cpu/CMakeFiles/dnnl_cpu.dir/all] Error 2
Makefile:157: recipe for target 'all' failed
make[3]: *** [all] Error 2
CMakeFiles/mkl_dnn.dir/build.make:128: recipe for target 'third_party/intel-mkl-dnn/src/mkl_dnn-stamp/mkl_dnn-build' failed
make[2]: *** [third_party/intel-mkl-dnn/src/mkl_dnn-stamp/mkl_dnn-build] Error 2
CMakeFiles/Makefile2:356: recipe for target 'CMakeFiles/mkl_dnn.dir/all' failed
make[1]: *** [CMakeFiles/mkl_dnn.dir/all] Error 2
Makefile:168: recipe for target 'all' failed
make: *** [all] Error

Any clue what causes this problem?
Thank you!

Issues in the benchmark_basic_types test case code

Please briefly describe your request

I recently started reading the MegEngine code, beginning with the dnn module and working from the test cases. While reading benchmark_basic_types.cpp I found some issues.

(WeCom screenshot omitted)
The maximum ndim of TensorShape is 7, but the switch only covers 6 cases. Also, while single-stepping through the TEST(BENCHMARK_BASIC_TYPES, EQ_SHAPE) function, I found that for

static TensorShape s0, s1[NR_TEST];

s0.ndim is always 0, which makes the benchmark questionable.

Likewise, eq_layout1 has the same problem.

Static graph: pending optimization list

The following issues exist in static graph mode and will be improved over time:

  • Functions wrapped with trace cannot be further differentiated
  • jit.trace and jit.sideeffect do not support complex nesting

Request: support for mobile and edge computing devices

Background

Edge computing keeps data private and secure, and popular lightweight devices such as the Raspberry Pi and Jetson Nano are widely used for deep learning tasks.

Feature description

I tried to build MegEngine from source on a Jetson Nano and ran into several problems; in the end it appears this device is not yet supported, or more precisely, the Jetson Nano's GPU type is not supported.

  1. LLVM problem.
    The official tutorial says LLVM >= 6.0 works. After installing LLVM 9.0 via apt following https://apt.llvm.org/, cmake reports an error:
build ➤ cmake -DMGE_WITH_LLVM=OFF ..
FATALUnknown machine architecture for MegEngine.
-- Using GNU gold linker.
-- Setting build type to 'RelWithDebInfo' as none was specified.
CMake Error at cmake/Halide.cmake:2 (find_package):
  Could not find a configuration file for package "LLVM" that is compatible
  with requested version "6.0".

  The following configuration files were considered but not accepted:

    /usr/lib/llvm-9/cmake/LLVMConfig.cmake, version: 9.0.0

Call Stack (most recent call first):
  CMakeLists.txt:140 (include)


-- Configuring incomplete, errors occurred!

Fine, so I disabled JIT using the CMake option -DMGE_WITH_JIT=OFF.

  2. CUDA arch problem
    Building with JIT disabled:
cmake -DMGE_WITH_JIT=OFF ..

gives the following error:

CMake Error at CMakeLists.txt:231 (message):
  Unsupported CUDA host arch.

The error suggests the GPU architecture is not supported; PyTorch reports this GPU's compute capability as 5.3.

  3. flatbuffers problem
    After also disabling the GPU and running cmake again, I got:
CMake Error at src/CMakeLists.txt:118 (build_flatbuffers):
  Unknown CMake command "build_flatbuffers".

So I downloaded a flatbuffers release and built and installed it: https://github.com/google/flatbuffers/releases

After installing it, I found there is no build_flatbuffers executable, so the project still cannot be configured with cmake.

My abilities are limited, so the testing stops here.

Generate Sphinx C++ documentation

MegEngine provides C++ inference capability, but there is currently no C++ documentation (the code already contains doxygen comments).

There is a doxygen configuration under docs/doxygen, but it is hard to hook this configuration into the Sphinx documentation at https://megengine.org.cn/doc/latest/index.html.

We would therefore like to generate documentation that fits the build logic in https://github.com/megengine/docs.

Initial investigation suggests the breathe + sphinx + doxygen combination could work, but it has not been made to run yet and needs further study.
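
For reference, a minimal sketch of the breathe side of such a setup (hypothetical paths; per the note above, this combination has not been made to work here yet). After running doxygen with XML output enabled, the Sphinx conf.py would contain roughly:

# conf.py (sketch): feed the doxygen XML output into Sphinx via breathe
extensions = ["breathe"]

# hypothetical path to the XML produced by doxygen (GENERATE_XML = YES)
breathe_projects = {"megengine": "docs/doxygen/xml"}
breathe_default_project = "megengine"

C++ entities could then be documented in .rst files with breathe directives such as doxygenclass and doxygenfunction.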

Support for model parallelism

Please briefly describe your request

Browsing the API, it looks like data parallelism is the approach used. Is there a plan to support model parallelism later?

BUG Issue: int32 matrix multiplication is not supported

Environment

1. System environment: Linux without CUDA
2. MegEngine version: 0.3.1
3. Python version: 3.8

Steps to reproduce

Please provide the key code snippet to help track down the problem

import megengine as mge
a = mge.ones((3, 4))
b = mge.ones((4, 5))
print(a @ b)

Please provide the full logs and error messages

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/constroy/.local/lib/python3.8/site-packages/megengine/core/tensor.py", line 65, in wrapped
    return Tensor(f(self._attach(comp_graph), other))
  File "/home/constroy/.local/lib/python3.8/site-packages/megengine/_internal/mgb.py", line 1779, in __matmul__
    return matrix_mul(self, rhs)
  File "/home/constroy/.local/lib/python3.8/site-packages/megengine/_internal/opr.py", line 332, in matrix_mul
    outputs = _mgb._create_opr("MatrixMulV2", all_inputs, all_params, config)
megengine._internal.exc.MegBrainError: MegBrain core throws exception: mgb::MegDNNError
assertion `C.valid() && (C == C_candi || C == C_candi2)' failed at /home/code/dnn/src/common/matrix_mul.cpp:42: void megdnn::MatrixMulForward::deduce_dtype(megdnn::DType, megdnn::DType, megdnn::DType&)
extra message: unsupported MatMul(Int32, Int32) -> invalid
+ bt:/home/constroy/.local/lib/python3.8/site-packages/megengine/_internal/_mgb.cpython-38-x86_64-linux-gnu.so{1bf8371,1fa8fc4,1fa8ff2,1ff5a7c}
| Associated operator: id=25 name=matrix_mul(broadcast[4])[25] type=mgb::opr::MatrixMul
|   input variables: 
|     0: {id:5, layout:{3(4),4(1)}, Int32, owner:broadcast(1[0])[4]{Broadcast}, name:broadcast(1[0])[4], slot:0, cpu0:0, d, 2, 2}
|     1: {id:14, layout:{4(5),5(1)}, Int32, owner:broadcast(1[0])[13]{Broadcast}, name:broadcast(1[0])[13], slot:0, cpu0:0, d, 2, 2}
|   output variables: 
|     0: {id:26, shape:{}, invalid, owner:matrix_mul(broadcast[4])[25]{MatrixMul}, name:matrix_mul(broadcast[4])[25], slot:0, cpu0:0, d, 1, 1}
|     1: {id:27, shape:{}, Byte, owner:matrix_mul(broadcast[4])[25]{MatrixMul}, name:matrix_mul(broadcast[4])[25]:workspace, slot:1, cpu0:0, d, 1, 1}

MegHair project missing

Documentation link

sdk/load-and-run/README.md

Problem description

L17 mentions MegHair/utils/debug/load_network_and_run.py, which seems to belong to another project? This script does not appear to exist yet.

Thanks!

Clarification on layouts

It would be nice to clarify what the (device-independent) constraints are for tensor layouts:
E.g.:

  • Are negative strides allowed?
  • May the stride for any axis be zero?
  • Is it possible for the strides to be such that they create 'aliasing' of elements? E.g. num_rows == num_cols == 3, row_stride == col_stride == 1.

I also notice that you don't support tensors with no axes (i.e. you require ndim > 0)... I thought that was quite a nice feature of PyTorch, as there are some operations like summing over an axis which naturally reduce the num-axes by one. Is there a reason you don't support that?

ptrdiff_t stride[MAX_NDIM];

Dynamic mode: pending memory and performance optimization list

The following issues exist in dynamic graph mode and will be improved over time:

  • The current implementation prevents the GPU memory of dynamically created Tensors from being reclaimed automatically; for now, reuse Tensors manually via set_value() to keep memory from growing
  • GPU memory usage is relatively high in dynamic mode
  • GPU memory used by PyTorch sub-graphs, megengine.Function, and similar operators keeps growing
  • random operators, and operators that depend on them, cannot be deduplicated yet, so memory usage keeps growing, e.g. with dropout

Is a recording of the launch event available?

Please briefly describe your request

As the CTO of a small/medium company I have been following Megvii's deep learning platform, but I missed the live launch event. Will a recording be provided?

megengine._internal.exc.MegBrainError: MegBrain core throws exception: mgb::MegDNNError cuda error invalid device function(98) occurred;

Environment

1. System environment: Red Hat Enterprise Linux
2. MegEngine version: 0.5.1
3. Python version: 3.6.10

Steps to reproduce

Please provide the key code snippet to help track down the problem

Please provide the full logs and error messages

megengine._internal.exc.MegBrainError: MegBrain core throws exception: mgb::MegDNNError
cuda error invalid device function(98) occurred; expr: cudaOccupancyMaxPotentialBlockSizeVariableSMem( &ret.grid_size, &ret.block_size, kern, s)

  • bt:/mnt/xfs1/home/intern6/anaconda3/envs/CRDET/lib/python3.6/site-packages/megengine/_internal/_mgb.cpython-36m-x86_64-linux-gnu.so{1e55711,21124c4,21124f2,2b284d5}
    | Associated operator: id=7 name=asFloat32(1[5])[7] type=mgb::opr::TypeCvt
    | input variables:
    | 0: {id:6, layout:{1(1)}, Int32, owner:1[5]{ImmutableTensor}, name:1[5], slot:0, gpu0:0, s, 2, 2}
    | output variables:
    | 0: {id:8, layout:{1(1)}, Float32, owner:asFloat32(1[5])[7]{TypeCvt}, name:asFloat32(1[5])[7], slot:0, gpu0:0, d, 2, 2}

Performance: pending optimization list

The following performance issues exist and will be improved over time:

  • The step of the Adam optimizer currently performs poorly
  • Random initialization of model parameters performs poorly

BUG Issue: NameError: name 'r' is not defined

Environment

1. System environment: MegStudio CPU
2. MegEngine version: 0.3.1
3. Python version: 3.8

Steps to reproduce

import numpy as np
import megengine as mge
import megengine.functional as F

a = mge.tensor(np.ones((224, 224, 3)).astype('float32'))
print(a.shape)
b = F.transpose(a, (2, 0, 1))

Please provide the key code snippet to help track down the problem

import numpy as np
import megengine as mge
import megengine.functional as F

a = mge.tensor(np.ones((224, 224, 3)).astype('float32'))
print(a.shape)
b = F.transpose(a, (2, 0, 1))

Please provide the full logs and error messages

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-29-2cfe08bd4479> in <module>
      5 a = mge.tensor(np.ones((224, 224, 3)).astype('float32'))
      6 print(a.shape)
----> 7 b = F.transpose(a, (2, 0, 1))

~/miniconda3/envs/xuan/lib/python3.8/site-packages/megengine/functional/tensor.py in transpose(*args, **kwargs)
    448 @functools.wraps(dimshuffle)
    449 def transpose(*args, **kwargs):
--> 450     r
    451     """See :func:`dimshuffle`
    452     """

NameError: name 'r' is not defined

Both F.transpose and F.dimshuffle have this problem.
Looking at the source code:

@wrap_io_tensor
def dimshuffle(inp: Tensor, pattern: Iterable[int]) -> Tensor:
    r
    """
    Swap shapes and strides according to given pattern

    :param inp: Input tensor
    :param pattern: a list of integers including 0, 1, ... , ``ndim``-1, and any number of ``'x'`` char in dimensions where this tensor should be broadcasted. For examples:

        * (``'x'``) -> make a 0d (scalar) into a 1d vector
        * (0, 1) -> identity for 2d vectors
        * (1, 0) -> inverts the first and second dimensions
        * (``'x'``, 0) -> make a row out of a 1d vector (N to 1xN)
        * (0, ``'x'``) -> make a column out of a 1d vector (N to Nx1)
        * (2, 0, 1) -> AxBxC to CxAxB
        * (0, ``'x'``, 1) -> AxB to Ax1xB
        * (1, ``'x'``, 0) -> AxB to Bx1xA
        * (1,) -> This remove dimensions 0. It must be a broadcastable dimension (1xA to A)

    :return: The output tensor

    Examples:

    .. testcode::

        import numpy as np
        from megengine import tensor
        import megengine.functional as F
        x = tensor(np.array([[1, 1], [0, 0]], dtype=np.int32))
        out = F.dimshuffle(x, (1, 0))
        print(out.numpy())

    Outputs:

    .. testoutput::

        [[1 0]
         [1 0]]

    """

It is caused by the stray r before the docstring being in the wrong place... How can this be avoided? By patching the source files one by one?
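
For reference, the intended form is presumably the stray r merged into the string literal so it becomes a raw-string docstring rather than a bare name looked up at call time (sketch only; body abbreviated):

@wrap_io_tensor
def dimshuffle(inp: Tensor, pattern: Iterable[int]) -> Tensor:
    r"""
    Swap shapes and strides according to given pattern
    ...
    """
    # (rest of the implementation unchanged)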

CTCLoss Request

I plan to implement the CRNN text recognition algorithm with MegEngine, but I could not find a CTC loss in the API. Please implement this feature as soon as possible.

error: invalid use of ‘auto’ const std::string& pre) -> auto { ^ .... during the process of "make -j4"

Environment

1. System environment: Red Hat Enterprise Linux
2. MegEngine version:
3. Python version:

Steps to reproduce

Please provide the key code snippet to help track down the problem

Please provide the full logs and error messages

(base) [intern6@node011 build]$ make -j4
-- Building MegBrain 8.6.0
-- Using GNU gold linker.
-- Found PythonInterp: /mnt/xfs1/home/intern6/anaconda3/envs/CRDET/bin/python3 (found suitable version "3.6.10", minimum required is "3")
-- Setting build type to 'RelWithDebInfo' as none was specified.
-- Found PythonInterp: /mnt/xfs1/home/intern6/anaconda3/envs/CRDET/bin/python3 (found version "3.6.10")
-- Found CuDNN: /cm/shared/apps/cudnn/7.4.2 (found version: 7.4)
-- Found TensorRT: /mnt/xfs1/home/intern6/software/TensorRT-5.1.5.0 (found version: 5.1.5)
-- Build with MKL in /mnt/xfs1/home/intern6/software/MegEngine/third_party/mkl/x86_64
-- GPU support is disabled
-- Intel(R) MKL: include /mnt/xfs1/home/intern6/software/MegEngine/third_party/mkl/x86_64/include
-- Intel(R) MKL: lib libmkl
-- Could NOT find Doxygen (missing: DOXYGEN_EXECUTABLE)
git found: /mnt/xfs1/home/intern6/git/libexec/git-core/git
-- NumPy ver. 1.18.5 found (include: /mnt/xfs1/home/intern6/anaconda3/envs/CRDET/lib/python3.6/site-packages/numpy/core/include)
-- Configuring done
-- Generating done
-- Build files have been written to: /mnt/xfs1/home/intern6/software/MegEngine/build
[ 2%] Built target halide
[ 2%] Built target zmq
[ 2%] Built target protobuf
[ 3%] Built target flatc
[ 3%] Built target flatbuffers
[ 3%] Built target flathash
[ 3%] Built target cuda-stub
[ 3%] Built target lapack_static_weak_target
[ 3%] Built target _opr_param_defs
[ 4%] Built target gtest
[ 4%] Built target ucx
[ 4%] Built target dnnl_common
[ 4%] Built target nccl
[ 4%] Built target mgb_proto_target
[ 4%] Built target _mgb_opr_param_defs
[ 6%] Built target mgb_serialization_schema_fbs
[ 6%] Built target mgb_opr_py
[ 9%] Built target mgb_swig_compilation
[ 12%] Built target dnnl_cpu
[ 12%] Built target version_ld
[ 12%] Built target dnnl
[ 13%] Built target megray
[ 13%] Building CXX object dnn/src/CMakeFiles/megdnn.dir/common/add_update.cpp.o
[ 13%] Building CXX object dnn/src/CMakeFiles/megdnn.dir/common/argmxx/base_impl.cpp.o
[ 13%] Building CXX object dnn/src/CMakeFiles/megdnn.dir/common/basic_types.cpp.o
[ 13%] Building CXX object dnn/src/CMakeFiles/megdnn.dir/common/argsort.cpp.o
[ 13%] Building CXX object dnn/src/CMakeFiles/megdnn.dir/common/batch_conv_bias.cpp.o
[ 13%] Building CXX object dnn/src/CMakeFiles/megdnn.dir/common/batch_normalization.cpp.o
[ 13%] Building CXX object dnn/src/CMakeFiles/megdnn.dir/common/batched_matrix_mul.cpp.o
[ 13%] Building CXX object dnn/src/CMakeFiles/megdnn.dir/common/checksum.cpp.o
[ 13%] Building CXX object dnn/src/CMakeFiles/megdnn.dir/common/concat_split.cpp.o
[ 13%] Building CXX object dnn/src/CMakeFiles/megdnn.dir/common/cond_take/opr_impl.cpp.o
[ 13%] Building CXX object dnn/src/CMakeFiles/megdnn.dir/common/conv_bias.cpp.o
[ 14%] Building CXX object dnn/src/CMakeFiles/megdnn.dir/common/conv_pooling.cpp.o
/mnt/xfs1/home/intern6/software/MegEngine/dnn/src/common/conv_bias.cpp: In static member function ‘static megdnn::ConvBiasForward::WinogradParam megdnn::ConvBiasForward::parse_winograd_name(const string&)’:
/mnt/xfs1/home/intern6/software/MegEngine/dnn/src/common/conv_bias.cpp:337:49: error: invalid use of ‘auto’
const std::string& pre) -> auto {
^
make[2]: *** [dnn/src/CMakeFiles/megdnn.dir/common/conv_bias.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [dnn/src/CMakeFiles/megdnn.dir/all] Error 2
make: *** [all] Error 2

Feature Request

Tensor clamp

Background

I am doing a project regarding object detection. So I need to clamp the predicted boxes. Does the tensor in the MegEngine have clamp methods like tensors in the PyTorch framework? Thank you in advance.
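
In case it helps, a minimal sketch of a clamp built from element-wise operations (assuming megengine.functional provides minimum and maximum; the exact API may differ by version):

import numpy as np
import megengine as mge
import megengine.functional as F

def clamp(x, lower, upper):
    # clip x element-wise into [lower, upper]
    return F.minimum(F.maximum(x, lower), upper)

boxes = mge.tensor(np.array([[-3.0, 5.0], [120.0, 300.0]], dtype="float32"))
print(clamp(boxes, 0.0, 224.0))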

Feature description

pip installation does not work

Environment

1. System environment: Ubuntu 19.04 (GNU/Linux 5.0.0-38-generic x86_64)
2. MegEngine version: default version
3. Python version: 3.7
4. pip version: 20.0

Steps to reproduce

1. pip3 install megengine -f https://megengine.org.cn/whl/mge.html
2. import megengine as mge

Error message: ValueError

This is an placeholder only, please install by 'pip3 install megengine -f https://megengine.org.cn/whl/mge.html'. Please make sure your pip version is higher than 19.0.
Traceback (most recent call last):
File "", line 1, in
File "/home/frb/.local/lib/python3.7/site-packages/megengine/init.py", line 3, in
raise ValueError("This is an placeholder only")
ValueError: This is an placeholder only
## Update ##
After browsing the issues I found the same problem; as suggested I upgraded pip to version 20.0, but the problem still occurs.

FileNotFoundError 'plasma_store'; AttributeError in <function _ParallelDataLoaderIter>

Running the resnet/train.py example from the official Models repo produces the following errors:

Environment

1. System environment: Linux Mint 19.2
2. MegEngine version: 0.3.1
3. Python version: Python 3.7.5
4. Hardware: laptop, Intel i5 + NVIDIA 1050 Ti

Steps to reproduce

  1. Run https://github.com/MegEngine/Models/blob/master/official/vision/classification/resnet/train.py

Full logs and error messages

26 15:40:52 preparing dataset..
26 15:41:32 Epoch 0 LR 1.250e-02
Traceback (most recent call last):
File "train.py", line 328, in
main()
File "train.py", line 93, in main
worker(0, 1, args)
File "train.py", line 209, in worker
train_func, train_queue, optimizer, args, epoch=epoch
File "train.py", line 239, in train
for step, (image, label) in enumerate(data_queue):
File "/home/rlee/.local/lib/python3.7/site-packages/megengine/data/dataloader.py", line 122, in iter
return _ParallelDataLoaderIter(self)
File "/home/rlee/.local/lib/python3.7/site-packages/megengine/data/dataloader.py", line 192, in init
from ._queue import PlasmaShmQueue
File "/home/rlee/.local/lib/python3.7/site-packages/megengine/data/_queue.py", line 39, in
MGE_PLASMA_STORE_MANAGER = _PlasmaStoreManager()
File "/home/rlee/.local/lib/python3.7/site-packages/megengine/data/_queue.py", line 29, in init
stderr=None if debug_flag else subprocess.DEVNULL,
File "/usr/lib/python3.7/subprocess.py", line 800, in init
restore_signals, start_new_session)
File "/usr/lib/python3.7/subprocess.py", line 1551, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'plasma_store': 'plasma_store'
Exception ignored in: <function _PlasmaStoreManager.del at 0x7ff9d40363b0>
Traceback (most recent call last):
File "/home/rlee/.local/lib/python3.7/site-packages/megengine/data/_queue.py", line 33, in del
if self.plasma_store and self.plasma_store.returncode is None:
AttributeError: '_PlasmaStoreManager' object has no attribute 'plasma_store'
Exception ignored in: <function _ParallelDataLoaderIter.del at 0x7ffa5988b290>
Traceback (most recent call last):
File "/home/rlee/.local/lib/python3.7/site-packages/megengine/data/dataloader.py", line 544, in del
if self.__initialized:
AttributeError: '_ParallelDataLoaderIter' object has no attribute '_ParallelDataLoaderIter__initialized'

Preliminary analysis

Error 1: FileNotFoundError: [Errno 2] No such file or directory: 'plasma_store': 'plasma_store'

Error 2: AttributeError: '_ParallelDataLoaderIter' object has no attribute '_ParallelDataLoaderIter__initialized'

In dataloader.py, the _ParallelDataLoaderIter class never assigns the __initialized attribute early in __init__(); if

self.__initialized = True

is not executed, error 2 follows.
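
A minimal sketch of the kind of defensive fix this analysis suggests (hypothetical code, not the actual dataloader.py):

class _ParallelDataLoaderIter:
    def __init__(self, loader):
        # Assign the flag before anything that can raise, so __del__ never
        # touches a missing attribute even if setup fails part-way.
        self.__initialized = False
        # ... spawn worker processes, create queues, etc. (may raise) ...
        self.__initialized = True

    def __del__(self):
        if self.__initialized:
            pass  # shut down worker processes here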

Program cannot exit normally (CPU speed is 9x PyTorch)

First of all, I am amazed by MegEngine's speed. I copied resnet18 from torchvision and tested it on CPU: MegEngine is about 9x faster than PyTorch, and even faster than OpenCV (OpenCV is roughly 4x PyTorch).
However, after running my program I found it may fail to exit normally.
(screenshot omitted)

Environment:
Ubuntu
no GPU
megengine 0.3.1
My code:

import megengine as mge
import megengine.module as M
import cv2
import numpy as np
import megengine.functional as F
def conv3x3(in_planes, out_planes, stride=1, groups=1, dilation=1):
    """3x3 convolution with padding"""
    return M.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride,
                     padding=dilation, groups=groups, bias=False, dilation=dilation)


def conv1x1(in_planes, out_planes, stride=1):
    """1x1 convolution"""
    return M.Conv2d(in_planes, out_planes, kernel_size=1, stride=stride, bias=False)


class BasicBlock(M.Module):
    expansion = 1

    def __init__(self, inplanes, planes, stride=1, downsample=None, groups=1,
                 base_width=64, dilation=1, norm_layer=None):
        super(BasicBlock, self).__init__()
        if norm_layer is None:
            norm_layer = M.BatchNorm2d
        if groups != 1 or base_width != 64:
            raise ValueError('BasicBlock only supports groups=1 and base_width=64')
        if dilation > 1:
            raise NotImplementedError("Dilation > 1 not supported in BasicBlock")
        # Both self.conv1 and self.downsample layers downsample the input when stride != 1
        self.conv1 = conv3x3(inplanes, planes, stride)
        self.bn1 = norm_layer(planes)
        self.relu = M.ReLU()
        self.conv2 = conv3x3(planes, planes)
        self.bn2 = norm_layer(planes)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)

        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity
        out = self.relu(out)

        return out


class ResNet(M.Module):

    def __init__(self, block, layers, num_classes=1000, zero_init_residual=False,
                 groups=1, width_per_group=64, replace_stride_with_dilation=None,
                 norm_layer=None):
        super(ResNet, self).__init__()
        if norm_layer is None:
            norm_layer = M.BatchNorm2d
        self._norm_layer = norm_layer

        self.inplanes = 64
        self.dilation = 1
        if replace_stride_with_dilation is None:
            # each element in the tuple indicates if we should replace
            # the 2x2 stride with a dilated convolution instead
            replace_stride_with_dilation = [False, False, False]
        if len(replace_stride_with_dilation) != 3:
            raise ValueError("replace_stride_with_dilation should be None "
                             "or a 3-element tuple, got {}".format(replace_stride_with_dilation))
        self.groups = groups
        self.base_width = width_per_group
        self.conv1 = M.Conv2d(3, self.inplanes, kernel_size=7, stride=2, padding=3,
                               bias=False)
        self.bn1 = norm_layer(self.inplanes)
        self.relu = M.ReLU()
        self.maxpool = M.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.layer1 = self._make_layer(block, 64, layers[0])
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2,
                                       dilate=replace_stride_with_dilation[0])
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2,
                                       dilate=replace_stride_with_dilation[1])
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2,
                                       dilate=replace_stride_with_dilation[2])
                                              
        
        self.avgpool = M.AvgPool2d((7,7))#M.AdaptiveAvgPool2d((1, 1))
        self.fc = M.Linear(512 * block.expansion, num_classes)

        #for m in self.modules():
        #    if isinstance(m, M.Conv2d):
        #        M.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
        #    elif isinstance(m,M.BatchNorm2d):
        #        M.init.constant_(m.weight, 1)
        #        M.init.constant_(m.bias, 0)

        # Zero-initialize the last BN in each residual branch,
        # so that the residual branch starts with zeros, and each residual block behaves like an identity.
        # This improves the model by 0.2~0.3% according to https://arxiv.org/abs/1706.02677
        if zero_init_residual:
            for m in self.modules():
                if isinstance(m, Bottleneck):
                    M.init.constant_(m.bn3.weight, 0)
                elif isinstance(m, BasicBlock):
                    M.init.constant_(m.bn2.weight, 0)

    def _make_layer(self, block, planes, blocks, stride=1, dilate=False):
        norm_layer = self._norm_layer
        downsample = None
        previous_dilation = self.dilation
        if dilate:
            self.dilation *= stride
            stride = 1
        if stride != 1 or self.inplanes != planes * block.expansion:
            downsample = M.Sequential(
                conv1x1(self.inplanes, planes * block.expansion, stride),
                norm_layer(planes * block.expansion),
            )

        layers = []
        layers.append(block(self.inplanes, planes, stride, downsample, self.groups,
                            self.base_width, previous_dilation, norm_layer))
        self.inplanes = planes * block.expansion
        for _ in range(1, blocks):
            layers.append(block(self.inplanes, planes, groups=self.groups,
                                base_width=self.base_width, dilation=self.dilation,
                                norm_layer=norm_layer))

        return M.Sequential(*layers)

    def _forward_impl(self, x):
        # See note [TorchScript super()]
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)

        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)

        x = self.avgpool(x)
        x = F.flatten(x, 1)
        x = self.fc(x)

        return x

    def forward(self, x):
        return self._forward_impl(x)


def _resnet(arch, block, layers, pretrained, progress, **kwargs):
    model = ResNet(block, layers, **kwargs)
    if pretrained:
        state_dict = load_state_dict_from_url(model_urls[arch],
                                              progress=progress)
        model.load_state_dict(state_dict)
    return model


def resnet18(pretrained=False, progress=True, **kwargs):
    r"""ResNet-18 model from
    `"Deep Residual Learning for Image Recognition" <https://arxiv.org/pdf/1512.03385.pdf>`_
    Args:
        pretrained (bool): If True, returns a model pre-trained on ImageNet
        progress (bool): If True, displays a progress bar of the download to stderr
    """
    return _resnet('resnet18', BasicBlock, [2, 2, 2, 2], pretrained, progress,
                   **kwargs)

def test_fps():
    model = resnet18()
    model.eval()
    x = mge.tensor(np.random.randn(1, 3, 224, 224).astype(np.float32))
    count = 0
    t0 = cv2.getTickCount()
    while True:
        count+=1
        model(x)
        if count>100:break
    t = (cv2.getTickCount()-t0)/cv2.getTickFrequency()
    fps = (1.0 / t)*count;
    print(fps)
                   
if __name__ == "__main__":
    model = resnet18()
    model.eval()
    x = mge.tensor(np.random.randn(1, 3, 224, 224).astype(np.float32))
    out = model(x)
    print(out.shape)  
    test_fps()

xornet_deploy error

1.LD_LIBRARY_PATH=/data/self_project/MegEngine/build/src/:$LD_LIBRARY_PATH ./xor_deploy xor_net.mge 0.6 0.9
Usage: ./xornet_deploy model_name x_value y_value
[30 20:21:49 [email protected]:134][ERR] megbrain is about to die abruptly; you can set MGB_WAIT_TERMINATE and rerun to wait for gdb attach: std::terminate() called
[30 20:21:49 [email protected]:142][ERR] bt:/data/self_project/MegEngine/build/src/libmegengine.so{187664e}/usr/lib64/libstdc++.so.6.0.20{5dec6,5df11,5e129}

2./opt/rh/devtoolset-7/root/usr/libexec/gcc/x86_64-redhat-linux/7/ld: cannot find -lmegbrain

BUG Issue

Environment

1. System environment: CentOS 7.4
2. MegEngine version: 0.3.1
3. Python version: 3.6

Steps to reproduce

  1. cd official/vision/classification/resnet
  2. Run resnet18 or resnet50 from the Models repo:
    python3 -u train.py --data /mnt/lustre/share/images --arch resnet18 --batch-size 32 --learning-rate 0.0125 --ngpus 8 --save .
  3. Error: ERR cudaGetDeviceCount failed: CUDA driver version is insufficient for CUDA runtime version (err 35)

Please provide the key code snippet to help track down the problem

(screenshots omitted)

Please provide the full logs and error messages

CMake version problem

Please briefly describe your request

While building from source following https://github.com/MegEngine/MegEngine/blob/master/README_CN.md,
running the following command fails (with CMake 3.13.5):
cmake .. -DMGE_WITH_CUDA=OFF -DMGE_WITH_TEST=ON

-- Building MegBrain 8.4.1
-- The C compiler identification is GNU 5.4.0
-- The CXX compiler identification is GNU 5.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Performing Test CXX_SUPPORT_WCLASS_MEMACCESS
-- Performing Test CXX_SUPPORT_WCLASS_MEMACCESS - Failed
-- Performing Test CXX_SUPPORT_GOLD
-- Performing Test CXX_SUPPORT_GOLD - Success
-- Using GNU gold linker.
-- Disable JIT support, as CUDA is not enabled.
-- Disable TensorRT support, as CUDA is not enabled.
-- Found PythonInterp: /home/xj-zjd/anaconda3/bin/python3 (found suitable version "3.7.3", minimum required is "3")
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Check if compiler accepts -pthread
-- Check if compiler accepts -pthread - yes
-- Found Threads: TRUE
-- Setting build type to 'RelWithDebInfo' as none was specified.
-- Found PythonInterp: /home/xj-zjd/anaconda3/bin/python3 (found version "3.7.3")
-- Disable distributed support, as CUDA is not enabled.
-- Looking for strtof_l
-- Looking for strtof_l - found
-- Looking for strtoull_l
-- Looking for strtoull_l - found
-- Build with MKL in /home/xj-zjd/work_space/self_work/megengine_code/MegEngine/third_party/mkl/x86_64
-- Found OpenMP_C: -fopenmp (found version "4.0")
-- Found OpenMP_CXX: -fopenmp (found version "4.0")
-- Found OpenMP: TRUE (found version "4.0")
-- GPU support is disabled
-- Intel(R) MKL: include /home/xj-zjd/work_space/self_work/megengine_code/MegEngine/third_party/mkl/x86_64/include
-- Intel(R) MKL: lib libmkl
-- Could NOT find Doxygen (missing: DOXYGEN_EXECUTABLE)
-- Found Git: /usr/bin/git (found version "2.7.4")
CMake Error at python_module/CMakeLists.txt:1 (cmake_policy):
Policy "CMP0086" is not known to this version of CMake.

-- Found PythonLibs: /home/xj-zjd/anaconda3/lib/libpython3.7m.so (found suitable exact version "3.7.3")
git found: /usr/bin/git
-- Found NumPy: /home/xj-zjd/anaconda3/lib/python3.7/site-packages/numpy/core/include (found version "1.16.4")
-- NumPy ver. 1.16.4 found (include: /home/xj-zjd/anaconda3/lib/python3.7/site-packages/numpy/core/include)
-- Found SWIG: /usr/bin/swig3.0 (found version "3.0.8")
-- Configuring incomplete, errors occurred!
See also "/home/xj-zjd/work_space/self_work/megengine_code/MegEngine/build/CMakeFiles/CMakeOutput.log".
See also "/home/xj-zjd/work_space/self_work/megengine_code/MegEngine/build/CMakeFiles/CMakeError.log".

A quick search suggests this may be the cause:
(WeCom screenshot omitted)

This policy is only supported from CMake 3.14 onward.

Model visualization tool

We would like to visualize a model's network structure and present its structural information.

The requirement has two parts:

1. Module structure visualization

Given an instance of megengine.module.Module, retrieve all of its internal sub-Modules and draw them as a network structure graph.

The presentation could follow the way Netron renders .tph files.

Usage: ideally integrated into Netron (the exact approach can be discussed further); a minimal traversal sketch follows below.
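
As a starting point for part 1, a minimal traversal sketch (assuming the Module API exposes named_modules() yielding (name, module) pairs, as in PyTorch-style module systems; adapt to the actual MegEngine version):

import megengine.module as M

def dump_module_tree(root: M.Module) -> None:
    # Print every sub-module with its qualified name and class name.
    # A real visualization tool would emit graph nodes/edges (e.g. graphviz)
    # instead of plain text.
    for name, sub in root.named_modules():
        print(name or "<root>", type(sub).__name__)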

2. C++ computation graph structure visualization

Given a .mge file (i.e. a serialized C++ graph), generate the corresponding network connection structure of the computation graph.

Usage: (to be decided)

  • A command-line tool that takes an input file and produces an SVG (or a graphviz source file to be rendered)
  • Integration into Netron

Other requirements:

  • Expose as much information from the computation graph as possible, e.g. via SVG plus mouse-hover

Action:

Implement an Assistant library to reduce the chance of user mistakes

Background:

  • Definitions such as BN.momentum / optimizer.momentum are somewhat ambiguous across the industry, and we will run into many more like them. We should make some assumptions that help users avoid mistakes and reduce debugging cost.

Alternatives considered:

  • Asserting or raising an exception outright is clearly inappropriate: we cannot anticipate every usage, and the user may really want to do it
  • Logging warnings via the logger is too noisy and cannot be silenced
  • Python's built-in warnings module: functionally sufficient, but the user experience is very odd

We would like something like the assistive hints in games.
For example, when momentum < 0.5, hint: "The momentum of batch normalization layer rarely uses a value less than 0.5. Please check the document for momentum's definition, which is different from PyTorch."
Hint at most once; if the user understands the risk, they can state "I understand this risk" explicitly via an API.
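
A minimal sketch of such a hint helper (hypothetical names; this is not an existing MegEngine API):

# Hypothetical "assistant" helper: emit a hint at most once per key, and let the
# user explicitly acknowledge a hint so it is never shown again.
_seen_hints = set()
_acknowledged = set()

def hint(key: str, message: str) -> None:
    if key in _acknowledged or key in _seen_hints:
        return
    _seen_hints.add(key)
    print(f"[hint] {message}")

def acknowledge(key: str) -> None:
    # "I understand this risk": suppress the hint permanently.
    _acknowledged.add(key)

# Example: warn about an unusually small BatchNorm momentum.
def check_bn_momentum(momentum: float) -> None:
    if momentum < 0.5:
        hint(
            "bn_momentum",
            "The momentum of a batch normalization layer rarely uses a value "
            "less than 0.5; note that MegEngine's definition differs from PyTorch's.",
        )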

BUG Issue

Environment

1. System environment: Ubuntu 16.04
2. MegEngine version: 0.3.1
3. Python version: 3.6

Hi, I ran into a bug while writing code with MegEngine. When running the code below, GPU memory keeps growing and never seems to be released. The buggy code:

import megengine as mge
import megengine.module as M
import numpy as np

class Net(M.Module):

    def __init__(self):
        self.fc0 = M.Linear(2, 100)

    def forward(self, inp):
        return self.fc0(inp)

net = Net()

for i in range(100000):
    data = mge.tensor(
        np.random.rand(64, 2).astype(np.float32) * 2 - 1
    )
    net(data)

This bug is strange. After re-reading the official tutorials and several examples, I moved the following code outside the loop:

data = mge.tensor(
        np.random.rand(64, 2).astype(np.float32)
    )

Then, assigning to the tensor inside the loop with set_value finally solved the memory problem.

import megengine as mge
import megengine.module as M
import numpy as np

class Net(M.Module):

    def __init__(self):
        self.fc0 = M.Linear(2, 100)

    def forward(self, inp):
        return self.fc0(inp)

net = Net()

data = mge.tensor(
        np.random.rand(64, 2).astype(np.float32)
    )

for i in range(100000):
    data.set_value(np.random.rand(64, 2) * 2 -1)
    # data.set_value(data * 2 - 1)
    net(data)

But I still do not understand why mge.tensor cannot be used directly inside the loop, and why set_value is required.
Hoping the team can explain.

Request: provide a macOS pip package

We would like a macOS build to be provided.

Background

I want to study and debug megengine on macOS.

Installing megengine on macOS currently gives:

2020-03-27 11:30:39 zxytim@localhost Downloads 3001
$ pip3 install megengine -f https://megengine.org.cn/whl/mge.html
2020-03-27 11:30:51 zxytim@localhost Downloads 3001
$ pip3 --version
pip 20.0.2 from /usr/local/lib/python3.7/site-packages/pip (python 3.7)
2020-03-27 11:30:56 zxytim@localhost Downloads 3002
$ pip3 install megengine -f https://megengine.org.cn/whl/mge.html --user --no-cache-dir
Looking in links: https://megengine.org.cn/whl/mge.html
Collecting megengine
  Downloading MegEngine-0.0.1.dev2-py3-none-any.whl (2.8 kB)
Installing collected packages: megengine
Successfully installed megengine-0.0.1.dev2
2020-03-27 11:31:09 zxytim@localhost Downloads 3003
$ python3 -c 'import megengine'
This is an placeholder only, please install by 'pip3 install megengine -f https://megengine.org.cn/whl/mge.html'. Please make sure your pip version is higher than 19.0.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/zxytim/Library/Python/3.7/lib/python/site-packages/megengine/__init__.py", line 3, in <module>
    raise ValueError("This is an placeholder only")
ValueError: This is an placeholder only

Note that the wheel name is:

MegEngine-0.0.1.dev2-py3-none-any.whl

For comparison, installation works on a Linux system, where the wheel name is:

MegEngine-0.3.0-cp36-cp36m-manylinux2010_x86_64.whl 

So it looks like no macOS package has been built.

Feature description

Please provide a macOS pip package.

[Resolved] Building MegEngine (CPU only) on Arch Linux fails

Background

I tried to build the latest version of MegEngine (githash: aa147b7, CPU only) on Arch Linux, but the build failed.

Environment

G++(GCC) 10.1.0
ld 2.34.0
CPU: Intel i7-7500u
Linux MiraiT 5.7.8-arch1-1 #1 SMP PREEMPT Thu, 09 Jul 2020 16:34:01 +0000 x86_64 GNU/Linux

Build steps

cd megengine
sh third_party/prepare.sh
sh third_party/install-mkl.sh
mkdir build
cd build
cmake .. -DMGE_WITH_CUDA=OFF -DMGE_WITH_TEST=ON
make -j4

Error log

[ 67%] Built target megdnn
[ 67%] Built target gtest
[ 67%] Linking CXX executable megdnn_test
/usr/bin/ld.gold: error: cannot find -lMKL_CORE_LIBRARY-NOTFOUND
/usr/bin/ld.gold: error: cannot find -lMKL_SEQUENTIAL_LIBRARY-NOTFOUND
../../../third_party/mkl/x86_64/lib/libmkl_intel_ilp64.a(vml_mode_iface.o):mode_iface.c:function vmlSetMode: error: undefined reference to 'mkl_vml_kernel_SetMode'
../../../third_party/mkl/x86_64/lib/libmkl_intel_ilp64.a(vml_mode_iface.o):mode_iface.c:function vmlGetMode: error: undefined reference to 'mkl_vml_kernel_GetMode'
../../../third_party/mkl/x86_64/lib/libmkl_intel_ilp64.a(vml_mode_iface.o):mode_iface.c:function VMLSETMODE_: error: undefined reference to 'mkl_vml_kernel_SetMode'
../../../third_party/mkl/x86_64/lib/libmkl_intel_ilp64.a(vml_mode_iface.o):mode_iface.c:function VMLGETMODE_: error: undefined reference to 'mkl_vml_kernel_GetMode'
../../../third_party/mkl/x86_64/lib/libmkl_intel_ilp64.a(vml_mode_iface.o):mode_iface.c:function vmlsetmode_: error: undefined reference to 'mkl_vml_kernel_SetMode'
...
../../../third_party/mkl/x86_64/lib/libmkl_intel_ilp64.a(_xerbla_u.o):_xerbla_u.c:function XERBLA: error: undefined reference to 'mkl_serv_default_xerbla'
collect2: error: ld returned 1 exit status
make[2]: *** [dnn/test/CMakeFiles/megdnn_test.dir/build.make:2785: dnn/test/megdnn_test] Error 1
make[1]: *** [CMakeFiles/Makefile2:837: dnn/test/CMakeFiles/megdnn_test.dir/all] Error 2
make: *** [Makefile:172: all] Error 2

What I tried

I suspect the MKL library downloaded in the sh third_party/install-mkl.sh step is not suitable for Arch Linux.

Installation problem

Please briefly describe your request

Problem

I ran the install command in a virtual environment as described on the official website, but importing fails,
as shown in the screenshot. How can this be fixed? Thanks.
