Distributed deep learning system
apache/singa: a distributed deep learning platform
License: Apache License 2.0
There are no issues requiring board attention
Apache SINGA was founded on 2019-10-16 (a year ago).
There are currently 22 committers and 16 PMC members in this project.
The Committer-to-PMC ratio is roughly 3:2.
Community changes, past quarter:
The community is working on release v3.1, which includes the following
features/changes:
Overall, the community has slowed down a bit after releasing version 3.0. We
are still actively developing v3.1, with 52 PRs opened and 53 PRs
closed. The community is growing steadily, with 3 new committers joining since
the last report.
In #674, we propose to move parameter creation and initialization into the forward propagation stage, i.e., __call__.
However, we may sometimes want to access the parameters of a model right after it is created, e.g.,
m = ModelFoo()
m.get_params() # returns the params of each layer via get_params
This fails because the params have not been created yet.
To resolve this issue, we can add a new method to the Module class:
def init(self, x):
    # x represents the input tensor(s); its values can be randomly filled,
    # but the shape and device must be set correctly.
    self.forward(x)  # the forward propagation will initialize all params
The following code will pass without errors.
m = ModelFoo()
m.init(x)
m.get_params() # returns the params of each layer via get_params
Comments are welcome.
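The proposed pattern can be sketched with plain Python and numpy. The Module class below is a hypothetical stand-in for SINGA's Module, used only to illustrate lazy parameter creation and the proposed init helper:

```python
import numpy as np

class Module:
    """Hypothetical stand-in for a SINGA Module with lazy param creation."""
    def __init__(self):
        self.params = None  # created only on the first forward call

    def forward(self, x):
        if self.params is None:
            # params are created once the input shape is known
            self.params = {"w": np.zeros((x.shape[1], 4))}
        return x @ self.params["w"]

    def init(self, x):
        # proposed helper: run one dummy forward pass to create all params;
        # x only needs the correct shape/device, its values do not matter
        self.forward(x)

    def get_params(self):
        if self.params is None:
            raise RuntimeError("params not created yet; call init() first")
        return self.params

m = Module()
m.init(np.zeros((2, 3)))       # dummy input with the correct shape
print(sorted(m.get_params()))  # params are now accessible
```

Calling get_params() before init() raises the error described above; after init() it succeeds.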
It would be better to have an error code list for debugging.
Then we can raise errors with well-defined codes.
Here are some example errors:
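As a sketch of what such an error code list could look like (the codes, names, and helper below are illustrative, not SINGA's actual codes):

```python
from enum import Enum

class ErrorCode(Enum):
    # illustrative codes; the real list would be agreed on by the community
    SHAPE_MISMATCH = 1001
    DEVICE_MISMATCH = 1002
    DTYPE_UNSUPPORTED = 1003

class SingaError(Exception):
    """Carries a machine-checkable code plus a human-readable detail."""
    def __init__(self, code, detail):
        self.code = code
        super().__init__(f"[E{code.value}] {code.name}: {detail}")

def check_same_shape(a_shape, b_shape):
    # hypothetical check helper showing how codes would be raised
    if a_shape != b_shape:
        raise SingaError(ErrorCode.SHAPE_MISMATCH,
                         f"expected {a_shape}, got {b_shape}")

try:
    check_same_shape((2, 3), (3, 2))
except SingaError as e:
    print(e)  # [E1001] SHAPE_MISMATCH: expected (2, 3), got (3, 2)
```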
This issue is open to discuss different options for adding GPU build and test to Github workflows.
To enable this feature, SINGA must provide a real or virtual machine with a GPU as the host machine for running the workflow, and then use the self-hosted runner feature of GitHub Actions. See also this MLOps video tutorial.
The team needs to make some decisions:
What do you think?
To use DNNL in SINGA, it would be convenient to have a DNNL conda package.
However, Intel does not provide a conda package for DNNL.
Therefore, we may need to build DNNL with conda-build ourselves and upload it to Anaconda Cloud.
Some updates:
@joddiy is fixing some problems in onnx
root@d05828f767ee:~/dcsysh/singa/test/python# python3 test_onnx_backend.py
ss............................ssssssssssssssssssssssssssssssss................ssss..ss..ss......FFFF..ssssssssss..ssssssssssssssss............ssss....................ssssssss........................ssss....ssssssssssssssssssssssssss........ssssssssssssssssssssssssssss..........ssssssssssssss......FsFsssssssssssssssss..................ssss....ssssss..ssss..........ss.s............ssss....ssssssssssssssssssss........ssssssssssss..............ssssssssssssssssssssssss......ss......ssss..ss........FFFF..ssssssssss............ssssssssssssssssssssssssss......ssss....ssssssssssssssssss..................................ss........ssssssssssssssss....ssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss................ssssssssssssssssssssssssssssssss................ssssssssssssssssss....................ssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss..........ssss......ssss........ssss..............ss..........................................ssssssssssss....................ssssssssssssssssssss....ssssssssssssssssssssssssssssss................ssssss................
======================================================================
FAIL: test_averagepool_2d_same_lower_cpu (__main__.OnnxBackendNodeModelTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 248, in device_test_func
return test_func(*args, device=device, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 313, in run
atol=model_test.atol)
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 178, in assert_similar_outputs
atol=atol)
File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 1533, in assert_allclose
verbose=verbose, header=header, equal_nan=equal_nan)
File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 846, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=0.001, atol=1e-07
Mismatched elements: 189 / 3072 (6.15%)
Max absolute difference: 1.3230393
Max relative difference: 3.
x: array([[[[ 1.764052e+00, 1.082105e+00, 6.894476e-01, ...,
1.501069e+00, 8.121531e-01, 2.665550e-01],
[ 4.381333e-01, -1.760931e-01, -2.374533e-01, ...,...
y: array([[[[ 4.410131e-01, 5.410524e-01, 3.447238e-01, ...,
7.505345e-01, 4.060766e-01, 1.332775e-01],
[ 2.190667e-01, -1.760931e-01, -2.374533e-01, ...,...
======================================================================
FAIL: test_averagepool_2d_same_lower_cuda (__main__.OnnxBackendNodeModelTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 248, in device_test_func
return test_func(*args, device=device, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 313, in run
atol=model_test.atol)
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 178, in assert_similar_outputs
atol=atol)
File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 1533, in assert_allclose
verbose=verbose, header=header, equal_nan=equal_nan)
File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 846, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=0.001, atol=1e-07
Mismatched elements: 189 / 3072 (6.15%)
Max absolute difference: 1.3230393
Max relative difference: 3.
x: array([[[[ 1.764052e+00, 1.082105e+00, 6.894476e-01, ...,
1.501069e+00, 8.121531e-01, 2.665550e-01],
[ 4.381333e-01, -1.760931e-01, -2.374533e-01, ...,...
y: array([[[[ 4.410131e-01, 5.410524e-01, 3.447238e-01, ...,
7.505345e-01, 4.060766e-01, 1.332775e-01],
[ 2.190667e-01, -1.760931e-01, -2.374533e-01, ...,...
======================================================================
FAIL: test_averagepool_2d_same_upper_cpu (__main__.OnnxBackendNodeModelTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 248, in device_test_func
return test_func(*args, device=device, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 313, in run
atol=model_test.atol)
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 178, in assert_similar_outputs
atol=atol)
File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 1533, in assert_allclose
verbose=verbose, header=header, equal_nan=equal_nan)
File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 846, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=0.001, atol=1e-07
Mismatched elements: 189 / 3072 (6.15%)
Max absolute difference: 0.9547153
Max relative difference: 3.
x: array([[[[-0.176093, -0.237453, 0.757017, ..., 0.112902, -0.50158 ,
-0.67406 ],
[-0.773234, -1.090172, -0.339745, ..., 0.040076, -0.369122,...
y: array([[[[-0.176093, -0.237453, 0.757017, ..., 0.112902, -0.50158 ,
-0.33703 ],
[-0.773234, -1.090172, -0.339745, ..., 0.040076, -0.369122,...
======================================================================
FAIL: test_averagepool_2d_same_upper_cuda (__main__.OnnxBackendNodeModelTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 248, in device_test_func
return test_func(*args, device=device, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 313, in run
atol=model_test.atol)
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 178, in assert_similar_outputs
atol=atol)
File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 1533, in assert_allclose
verbose=verbose, header=header, equal_nan=equal_nan)
File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 846, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=0.001, atol=1e-07
Mismatched elements: 189 / 3072 (6.15%)
Max absolute difference: 0.9547153
Max relative difference: 3.
x: array([[[[-0.176093, -0.237453, 0.757017, ..., 0.112902, -0.50158 ,
-0.67406 ],
[-0.773234, -1.090172, -0.339745, ..., 0.040076, -0.369122,...
y: array([[[[-0.176093, -0.237453, 0.757017, ..., 0.112902, -0.50158 ,
-0.33703 ],
[-0.773234, -1.090172, -0.339745, ..., 0.040076, -0.369122,...
======================================================================
FAIL: test_equal_bcast_cpu (__main__.OnnxBackendNodeModelTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 248, in device_test_func
return test_func(*args, device=device, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 313, in run
atol=model_test.atol)
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 178, in assert_similar_outputs
atol=atol)
File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 1533, in assert_allclose
verbose=verbose, header=header, equal_nan=equal_nan)
File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 846, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=0.001, atol=1e-07
Mismatched elements: 1 / 60 (1.67%)
x: array([[[False, False, False, False, False],
[False, False, False, False, False],
[False, False, False, False, False],...
y: array([[[False, False, False, False, False],
[False, False, False, False, False],
[False, False, False, False, False],...
======================================================================
FAIL: test_equal_cpu (__main__.OnnxBackendNodeModelTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 248, in device_test_func
return test_func(*args, device=device, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 313, in run
atol=model_test.atol)
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 178, in assert_similar_outputs
atol=atol)
File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 1533, in assert_allclose
verbose=verbose, header=header, equal_nan=equal_nan)
File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 846, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=0.001, atol=1e-07
Mismatched elements: 1 / 60 (1.67%)
x: array([[[False, False, False, False, False],
[False, False, False, False, False],
[False, False, False, False, True],...
y: array([[[False, False, False, False, False],
[False, False, False, False, False],
[False, False, False, False, True],...
======================================================================
FAIL: test_maxpool_2d_same_lower_cpu (__main__.OnnxBackendNodeModelTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 248, in device_test_func
return test_func(*args, device=device, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 313, in run
atol=model_test.atol)
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 178, in assert_similar_outputs
atol=atol)
File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 1533, in assert_allclose
verbose=verbose, header=header, equal_nan=equal_nan)
File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 846, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=0.001, atol=1e-07
Mismatched elements: 35 / 3072 (1.14%)
Max absolute difference: 1.6961312
Max relative difference: 0.
x: array([[[[ 1.764052e+00, 1.764052e+00, 9.787380e-01, ...,
1.532779e+00, 1.469359e+00, 3.781625e-01],
[ 1.764052e+00, 1.764052e+00, 9.787380e-01, ...,...
y: array([[[[ 1.764052e+00, 1.764052e+00, 9.787380e-01, ...,
1.532779e+00, 1.469359e+00, 3.781625e-01],
[ 1.764052e+00, 1.764052e+00, 9.787380e-01, ...,...
======================================================================
FAIL: test_maxpool_2d_same_lower_cuda (__main__.OnnxBackendNodeModelTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 248, in device_test_func
return test_func(*args, device=device, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 313, in run
atol=model_test.atol)
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 178, in assert_similar_outputs
atol=atol)
File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 1533, in assert_allclose
verbose=verbose, header=header, equal_nan=equal_nan)
File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 846, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=0.001, atol=1e-07
Mismatched elements: 35 / 3072 (1.14%)
Max absolute difference: 1.6961312
Max relative difference: 0.
x: array([[[[ 1.764052e+00, 1.764052e+00, 9.787380e-01, ...,
1.532779e+00, 1.469359e+00, 3.781625e-01],
[ 1.764052e+00, 1.764052e+00, 9.787380e-01, ...,...
y: array([[[[ 1.764052e+00, 1.764052e+00, 9.787380e-01, ...,
1.532779e+00, 1.469359e+00, 3.781625e-01],
[ 1.764052e+00, 1.764052e+00, 9.787380e-01, ...,...
======================================================================
FAIL: test_maxpool_2d_same_upper_cpu (__main__.OnnxBackendNodeModelTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 248, in device_test_func
return test_func(*args, device=device, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 313, in run
atol=model_test.atol)
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 178, in assert_similar_outputs
atol=atol)
File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 1533, in assert_allclose
verbose=verbose, header=header, equal_nan=equal_nan)
File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 846, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=0.001, atol=1e-07
Mismatched elements: 37 / 3072 (1.2%)
Max absolute difference: 1.2028884
Max relative difference: 0.
x: array([[[[ 1.764052, 0.978738, 2.240893, ..., 1.469359, 0.378163,
0.378163],
[ 0.177426, -0.347912, 0.462782, ..., 0.976639, 0.706573,...
y: array([[[[ 1.764052, 0.978738, 2.240893, ..., 1.469359, 0.378163,
0.378163],
[ 0.177426, -0.347912, 0.462782, ..., 0.976639, 0.706573,...
======================================================================
FAIL: test_maxpool_2d_same_upper_cuda (__main__.OnnxBackendNodeModelTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 248, in device_test_func
return test_func(*args, device=device, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 313, in run
atol=model_test.atol)
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 178, in assert_similar_outputs
atol=atol)
File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 1533, in assert_allclose
verbose=verbose, header=header, equal_nan=equal_nan)
File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 846, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=0.001, atol=1e-07
Mismatched elements: 37 / 3072 (1.2%)
Max absolute difference: 1.2028884
Max relative difference: 0.
x: array([[[[ 1.764052, 0.978738, 2.240893, ..., 1.469359, 0.378163,
0.378163],
[ 0.177426, -0.347912, 0.462782, ..., 0.976639, 0.706573,...
y: array([[[[ 1.764052, 0.978738, 2.240893, ..., 1.469359, 0.378163,
0.378163],
[ 0.177426, -0.347912, 0.462782, ..., 0.976639, 0.706573,...
----------------------------------------------------------------------
Ran 1114 tests in 2.126s
Hi,
Creating a Tensor in singa 3.0.0 (cpu, py36) results in the error described below:
NotImplementedError: Wrong number or type of arguments for overloaded function 'new_Tensor'.
Possible C/C++ prototypes are:
singa::Tensor::Tensor()
singa::Tensor::Tensor(std::vector< size_t,std::allocator< size_t > > const &,singa::DataType)
singa::Tensor::Tensor(std::vector< size_t,std::allocator< size_t > > const &)
singa::Tensor::Tensor(std::vector< size_t,std::allocator< size_t > > const &,std::shared_ptr< singa::Device >,singa::DataType)
singa::Tensor::Tensor(std::vector< size_t,std::allocator< size_t > > const &,std::shared_ptr< singa::Device >)
singa::Tensor::Tensor(singa::Tensor const &)
The tensor was created from numpy with the code below:
import numpy as np
from singa import tensor
tensor.from_numpy( np.asarray([[1, 0, 0], [0, 1, 0]], dtype=np.float32) )
Following are the operating system specifications:
OS: MacOS version 10.15.3
python version: 3.6.10
singa: 3.0.0 cpu_py36
The same code seems to work on singa version 3.0.0.rc1 cpu_py36
Hi all, while refactoring SONNX, I found the following issues:
ONNX prefers to use tensors as input instead of attributes, which may incur some issues when we create SINGA operators (or layers). There are two cases:
for @dcslin
In some models, the developer prefers gemm instead of linear, so we need to add gemm to Layer.
I've checked the metaclass carefully, but it seems I cannot use a metaclass to modify the forward function in this case. The case is: I have a graph written in ONNX, and I need to write a forward function using SINGA's operators. I can call SINGA's operators by walking the graph, but I cannot generate a forward function automatically from the graph.
This is more like the exec function.
for example, I have a graph like this:
graph = {
    "op1": {"inputs": ["a1"], "outputs": ["a2"]},
    "op2": {"inputs": ["a2"], "outputs": ["a3"]},
}
# what I can do
def forward(x):
    tensors = {"a1": x}  # name -> tensor
    for op_name, op_info in graph.items():
        op = ops[op_name]  # ops: lookup table from op name to callable
        inputs = [tensors[inp] for inp in op_info["inputs"]]
        outputs = op(*inputs)
        for outp, val in zip(op_info["outputs"], outputs):
            tensors[outp] = val
# what I cannot do by metaclass but can with exec
program = parse_graph_to_str(graph)
# 'a2=op1(a1)\na3=op2(a2)'
exec(program)
So, the above forward is my current implementation.
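For concreteness, here is a runnable toy version of both approaches, using plain Python functions as stand-ins for SINGA operators (the ops table, op1, op2, and this parse_graph_to_str are all illustrative):

```python
graph = {
    "op1": {"inputs": ["a1"], "outputs": ["a2"]},
    "op2": {"inputs": ["a2"], "outputs": ["a3"]},
}

# toy operators standing in for SINGA ops; each returns a tuple of outputs
ops = {
    "op1": lambda x: (x + 1,),
    "op2": lambda x: (x * 2,),
}

# interpreter-style forward: walk the graph, keeping a name->value table
def forward(x):
    tensors = {"a1": x}
    for op_name, op_info in graph.items():
        inputs = [tensors[name] for name in op_info["inputs"]]
        outputs = ops[op_name](*inputs)
        for name, val in zip(op_info["outputs"], outputs):
            tensors[name] = val
    return tensors["a3"]

# exec-style forward: turn the graph into source code, then run it
def parse_graph_to_str(graph):
    lines = []
    for op_name, op_info in graph.items():
        outs = ", ".join(op_info["outputs"])
        ins = ", ".join(op_info["inputs"])
        lines.append(f"{outs}, = ops['{op_name}']({ins})")
    return "\n".join(lines)

env = {"ops": ops, "a1": 3}
exec(parse_graph_to_str(graph), env)
print(forward(3), env["a3"])  # both compute (3 + 1) * 2 = 8
```

Both versions produce the same result; the exec version generates real Python statements, which is what a graph-to-forward translator would emit.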
The checks and assertions in some functions currently have very brief descriptions, which makes it difficult to locate and identify the error.
The displayed information should include at least:
Release version, 2.1 vs 3.0 depending on the API change
Please tick the feature if it is done.
Here is the checklist and steps
Select a release manager. The release manager (RM) is the coordinator for the release process. It is the RM's signature (.asc) that is uploaded together with the release. The RM generates a KEY (RSA 4096-bit) and uploads it to a public key server. The RM needs to get this key endorsed (signed) by other Apache users, to be connected to the web of trust, and should first ask the mentor to help sign the key. How to generate the key?
Check license. FAQ; SINGA Issue
Bump the version. Check code and documentation
Prepare the RELEASE_NOTES file. Include the following items: Introduction, Features, Bugs (links to JIRA or GitHub PRs), Changes, Dependency list, Incompatibility issues. Follow this example.
Prepare the DISCLAIMER file. Modify it from the template.
Package the release candidate. The release should be packaged as apache-singa-VERSION.tar.gz and should not include any binary files, including git files. Upload the release candidate to the staging area. The tar file, signature, KEY and SHA256 checksum file should be included. MD5 is no longer used. The policy is here.
Call for vote by sending an email
To: [email protected]
Subject: [VOTE] Release apache-singa-X.Y.Z (release candidate N)
Hi all,
I have created a build for Apache SINGA X.Y.Z, release candidate N.
The artifacts to be voted on are located here: xxxx
The hashes of the artifacts are as follows: xxx
Release artifacts are signed with the following key: xxx
Please vote on releasing this package. The vote is open for at least 72 hours and passes if a majority of at least three +1 votes are cast.
[ ] +1 Release this package as Apache SINGA X.Y.Z
[ ] 0 I don't feel strongly about it, but I'm okay with the release
[ ] -1 Do not release this package because...
Here is my vote:
+1
Wait at least 48 hours for test responses. Any PMC member, committer, or contributor can test the release and give feedback. Everyone should run these checks before voting +1. If the vote passes, send the result email; otherwise, repeat from the beginning.
Subject: [RESULT] [VOTE] Release apache-singa-X.Y.Z (release candidate N)
To: [email protected]
Thanks to everyone who has voted and given their comments. The tally is as follows.
N binding +1s:
<names>
N non-binding +1s:
<names>
No 0s or -1s.
I am delighted to announce that the proposal to release Apache SINGA X.Y.Z has passed.
Upload the package for distribution to https://dist.apache.org/repos/dist/release/VERSION/.
Update the Download page of SINGA website. The tar.gz file MUST be downloaded from mirror, using closer.cgi script; other artifacts MUST be downloaded from main Apache site. More details here. Some feedback we got during the previous releases: "Download pages must only link to formal releases, so must not include links to GitHub.", "Links to KEYS, sigs and hashes must not use dist.apache.org; instead use https://www.apache.org/dist/singa/...;", "Also you only need one KEYS link, and there should be a description of how to use KEYS + sig or hash to verify the downloads."
Remove the RC tag and compile the conda packages.
Publish the release information.
To: [email protected], [email protected]
Subject: [ANNOUNCE] Apache SINGA X.Y.Z released
We are pleased to announce that SINGA X.Y.Z is released.
SINGA is a general distributed deep learning platform for training big deep learning models over large datasets.
The release is available at: http://singa.apache.org/downloads.html
The main features of this release include XXX
We look forward to hearing your feedback, suggestions, and contributions to the project.
On behalf of the SINGA team, {SINGA Team Member Name}
Add each initialization method as a function or class in https://github.com/apache/singa/blob/master/python/singa/initializer.py. e.g.,
class InitializationBase(object):
    def __init__(self):
        pass

    def call(self, x):
        pass

    def __call__(self, x):
        self.call(x)


class Uniform(InitializationBase):
    def __init__(self, low=-1, high=1):
        self.low = low
        self.high = high

    def call(self, x):
        x.uniform(self.low, self.high)


def uniform(x, low=-1, high=1):
    x.uniform(low, high)
@dcslin Shicong, did you encounter such an error using conda build before? I would like your help, because it seems you have been working on CI/CD recently.
I am trying to build the conda package from the dev branch using singa/tool/conda/singa, and it gives errors concerning the onnx version:
conda_build.exceptions.DependencyNeedsBuildingError: Unsatisfiable dependencies for platform linux-64: {"onnx[version='>=1.3.0']"}
dcsysh@panda:~/singa/tool/conda/singa$ export CUDA=10.0
dcsysh@panda:~/singa/tool/conda/singa$ conda-build . --python 3.6
No numpy version specified in conda_build_config.yaml. Falling back to default numpy value of 1.11
WARNING:conda_build.metadata:No numpy version specified in conda_build_config.yaml. Falling back to default numpy value of 1.11
Copying /home/dcsysh/singa to /home/dcsysh/anaconda3/conda-bld/singa_1583505296868/work/
Adding in variants from internal_defaults
INFO:conda_build.variants:Adding in variants from internal_defaults
Adding in variants from /home/dcsysh/singa/tool/conda/singa/conda_build_config.yaml
INFO:conda_build.variants:Adding in variants from /home/dcsysh/singa/tool/conda/singa/conda_build_config.yaml
Adding in variants from config.variant
INFO:conda_build.variants:Adding in variants from config.variant
/home/dcsysh/anaconda3/lib/python3.6/site-packages/conda_build/environ.py:427: UserWarning: The environment variable 'CUDA' is being passed through with value '10.0'. If you are splitting build and test phases with --no-test, please ensure that this value is also set similarly at test time.
UserWarning
Attempting to finalize metadata for singa
INFO:conda_build.metadata:Attempting to finalize metadata for singa
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done
Adding .* to spec 'libprotobuf 3.6.1' to ensure satisfiability. Please consider putting {{ var_name }}.* or some relational operator (>/</>=/<=) on this spec in meta.yaml, or if req is also a build req, using {{ pin_compatible() }} jinja2 function instead. See https://conda.io/docs/user-guide/tasks/build-packages/variants.html#pinning-at-the-variant-level
WARNING:conda_build.utils:Adding .* to spec 'libprotobuf 3.6.1' to ensure satisfiability. Please consider putting {{ var_name }}.* or some relational operator (>/</>=/<=) on this spec in meta.yaml, or if req is also a build req, using {{ pin_compatible() }} jinja2 function instead. See https://conda.io/docs/user-guide/tasks/build-packages/variants.html#pinning-at-the-variant-level
WARNING conda_build.utils:ensure_valid_spec(1749): Adding .* to spec 'libprotobuf 3.6.1' to ensure satisfiability. Please consider putting {{ var_name }}.* or some relational operator (>/</>=/<=) on this spec in meta.yaml, or if req is also a build req, using {{ pin_compatible() }} jinja2 function instead. See https://conda.io/docs/user-guide/tasks/build-packages/variants.html#pinning-at-the-variant-level
Adding .* to spec 'libopenblas 0.3.3' to ensure satisfiability. Please consider putting {{ var_name }}.* or some relational operator (>/</>=/<=) on this spec in meta.yaml, or if req is also a build req, using {{ pin_compatible() }} jinja2 function instead. See https://conda.io/docs/user-guide/tasks/build-packages/variants.html#pinning-at-the-variant-level
WARNING:conda_build.utils:Adding .* to spec 'libopenblas 0.3.3' to ensure satisfiability. Please consider putting {{ var_name }}.* or some relational operator (>/</>=/<=) on this spec in meta.yaml, or if req is also a build req, using {{ pin_compatible() }} jinja2 function instead. See https://conda.io/docs/user-guide/tasks/build-packages/variants.html#pinning-at-the-variant-level
WARNING conda_build.utils:ensure_valid_spec(1749): Adding .* to spec 'libopenblas 0.3.3' to ensure satisfiability. Please consider putting {{ var_name }}.* or some relational operator (>/</>=/<=) on this spec in meta.yaml, or if req is also a build req, using {{ pin_compatible() }} jinja2 function instead. See https://conda.io/docs/user-guide/tasks/build-packages/variants.html#pinning-at-the-variant-level
BUILD START: ['singa-2.1.0.dev-cudnn7.3.1_cuda10.0_py36.tar.bz2']
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done
## Package Plan ##
environment location: /home/dcsysh/anaconda3/conda-bld/singa_1583505296868/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_place
The following NEW packages will be INSTALLED:
_libgcc_mutex: 0.1-main
blas: 1.0-openblas
ca-certificates: 2020.1.1-0
certifi: 2019.11.28-py36_0
cudatoolkit: 10.0.130-0
cudnn: 7.3.1-cuda10.0_0
gflags: 2.2.2-he6710b0_0
glog: 0.3.5-hf484d3e_1
intel-openmp: 2018.0.3-0
ld_impl_linux-64: 2.33.1-h53a641e_7
libedit: 3.1.20181209-hc058e9b_0
libffi: 3.2.1-hd88cf55_4
libgcc-ng: 9.1.0-hdf63c60_0
libgfortran-ng: 7.3.0-hdf63c60_0
libmklml: 2018.0.3-0
libopenblas: 0.3.3-h5a2b251_3
libprotobuf: 3.6.1-hd408876_0
libstdcxx-ng: 9.1.0-hdf63c60_0
mkl-dnn: 0.14-h6bb024c_0
ncurses: 6.2-he6710b0_0
nomkl: 3.0-0
numpy: 1.16.0-py36h99e49ec_1
numpy-base: 1.16.0-py36h2f8d375_1
openblas: 0.3.3-3
openblas-devel: 0.3.3-3
openssl: 1.1.1d-h7b6447c_4
pcre: 8.43-he6710b0_0
pip: 20.0.2-py36_1
protobuf: 3.6.1-py36he6710b0_0
python: 3.6.10-h0371630_0
readline: 7.0-h7b6447c_5
setuptools: 45.2.0-py36_0
six: 1.14.0-py36_0
sqlite: 3.31.1-h7b6447c_0
swig: 3.0.12-h38cdd7d_3
tk: 8.6.8-hbc83047_0
wheel: 0.34.2-py36_0
xz: 5.2.4-h14c3975_4
zlib: 1.2.11-h7b6447c_3
Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... failed
Leaving build/test directories:
Work:
/home/dcsysh/anaconda3/conda-bld/work
Test:
/home/dcsysh/anaconda3/conda-bld/test_tmp
Leaving build/test environments:
Test:
source activate /home/dcsysh/anaconda3/conda-bld/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_
Build:
source activate /home/dcsysh/anaconda3/conda-bld/_build_env
Traceback (most recent call last):
File "/home/dcsysh/anaconda3/lib/python3.6/site-packages/conda_build/environ.py", line 757, in get_install_actions
actions = install_actions(prefix, index, specs, force=True)
File "/home/dcsysh/anaconda3/lib/python3.6/site-packages/conda/common/io.py", line 88, in decorated
return f(*args, **kwds)
File "/home/dcsysh/anaconda3/lib/python3.6/site-packages/conda/plan.py", line 474, in install_actions
txn = solver.solve_for_transaction(prune=prune, ignore_pinned=not pinned)
File "/home/dcsysh/anaconda3/lib/python3.6/site-packages/conda/core/solve.py", line 117, in solve_for_transaction
should_retry_solve)
File "/home/dcsysh/anaconda3/lib/python3.6/site-packages/conda/core/solve.py", line 158, in solve_for_diff
force_remove, should_retry_solve)
File "/home/dcsysh/anaconda3/lib/python3.6/site-packages/conda/core/solve.py", line 275, in solve_final_state
ssc = self._add_specs(ssc)
File "/home/dcsysh/anaconda3/lib/python3.6/site-packages/conda/core/solve.py", line 555, in _add_specs
explicit_pool = ssc.r._get_package_pool(self.specs_to_add)
File "/home/dcsysh/anaconda3/lib/python3.6/site-packages/conda/resolve.py", line 553, in _get_package_pool
pool = self.get_reduced_index(specs)
File "/home/dcsysh/anaconda3/lib/python3.6/site-packages/conda/common/io.py", line 88, in decorated
return f(*args, **kwds)
File "/home/dcsysh/anaconda3/lib/python3.6/site-packages/conda/resolve.py", line 574, in get_reduced_index
explicit_specs, features = self.verify_specs(explicit_specs)
File "/home/dcsysh/anaconda3/lib/python3.6/site-packages/conda/resolve.py", line 288, in verify_specs
raise ResolvePackageNotFound(bad_deps)
conda.exceptions.ResolvePackageNotFound:
- onnx[version='>=1.3.0']
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/dcsysh/anaconda3/bin/conda-build", line 11, in <module>
sys.exit(main())
File "/home/dcsysh/anaconda3/lib/python3.6/site-packages/conda_build/cli/main_build.py", line 469, in main
execute(sys.argv[1:])
File "/home/dcsysh/anaconda3/lib/python3.6/site-packages/conda_build/cli/main_build.py", line 460, in execute
verify=args.verify, variants=args.variants)
File "/home/dcsysh/anaconda3/lib/python3.6/site-packages/conda_build/api.py", line 209, in build
notest=notest, need_source_download=need_source_download, variants=variants)
File "/home/dcsysh/anaconda3/lib/python3.6/site-packages/conda_build/build.py", line 2344, in build_tree
notest=notest,
File "/home/dcsysh/anaconda3/lib/python3.6/site-packages/conda_build/build.py", line 1408, in build
create_build_envs(top_level_pkg, notest)
File "/home/dcsysh/anaconda3/lib/python3.6/site-packages/conda_build/build.py", line 1292, in create_build_envs
raise e
File "/home/dcsysh/anaconda3/lib/python3.6/site-packages/conda_build/build.py", line 1282, in create_build_envs
channel_urls=tuple(m.config.channel_urls))
File "/home/dcsysh/anaconda3/lib/python3.6/site-packages/conda_build/environ.py", line 759, in get_install_actions
raise DependencyNeedsBuildingError(exc, subdir=subdir)
conda_build.exceptions.DependencyNeedsBuildingError: Unsatisfiable dependencies for platform linux-64: {"onnx[version='>=1.3.0']"}
Hi, @dcslin, as you know, I need to develop and test the new API for onnx, so I have to make sure the new implementation of autograd and layers is OK. However, when I test PR #697, I hit three types of issues in test_operation.py:
For example, in the sum test case, the GPU tensor values are always zero. But when I remove the conv2d test case, the sum case passes. It seems the conv2d layer causes the zero-GPU-tensor issue.
The following cases have the same problem:
There is an issue when I run Conv2d with odd_padding. Sometimes I need to pad zeros on only one side, so I wrote this function:
def handle_odd_pad_fwd(x, odd_padding):
    """Handle the odd padding mode in the forward pass.
    Args:
        x: the input raw tensor
        odd_padding: the per-side pad sizes, one per entry in `flags`
    Returns:
        the padded raw tensor
    """
    x_tensor = tensor.from_raw_tensor(x)
    # (axis, left padding if True else right padding)
    flags = [(2, True), (2, False), (3, True), (3, False)]
    for (axis, left), pad in zip(flags, odd_padding):
        if pad == 0:
            continue
        zeros_shape = list(x_tensor.data.shape())
        zeros_shape[axis] = pad
        zero_padding = np.zeros(zeros_shape).astype(np.float32)
        zero_padding = tensor.Tensor(device=x.device(), data=zero_padding)
        if left:
            x_tensor = tensor.concatenate((zero_padding, x_tensor), axis)
        else:
            x_tensor = tensor.concatenate((x_tensor, zero_padding), axis)
    return x_tensor.data
But it seems fine when I call this function only once or twice; when I call it more times, it reports an error:
F0526 12:53:40.017063 15641 tensor_math_cuda.h:791] Check failed: status == CURAND_STATUS_SUCCESS (201 vs. 0) CURAND_STATUS_LAUNCH_FAILURE
I guess the reason may be that the GPU memory is not released in time?
The following cases have the same problem:
The third error msg is:
F0526 18:52:07.318809 21112 tensor_math_cuda.h:193] Error on line 193: CUDNN_STATUS_EXECUTION_FAILED
And the following cases have the same problem:
The Layer class in autograd maintains the model parameters.
It passes the parameters into the operation and thus operations are stateless.
Typically the parameter size depends on the input and layer configuration.
Currently, we require the users to provide the input size in the layer constructor.
Then we can create the parameter tensor and initialize it in the constructor, e.g., in the Linear layer. One potential problem is that the initialization operation may not be buffered. @XJDKC Is this an issue?
For some layers like RNN implemented using cudnn, although we can get the input size, the parameter size is unknown until the cudnn handle is created, which does not happen until data is forwarded through the layer.
Another way is to delay the parameter tensor creation until the layer is called for forward propagation. At that time, we have the input tensor (and its device), so the layer constructor does not need the user to provide the input size. The drawback is that after the layer is created, the get_params() function would still fail to return the parameter tensors, as they are not created yet. @dcslin To switch to this approach, we need to change the constructors of existing layer classes and examples. We also need to pass an initializer function/class to the constructor for initializing the parameter tensors after they are created.
Please add your comments.
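The delayed-creation option above can be sketched in plain numpy (this is an illustrative sketch, not the SINGA API; the class and method names are assumptions). The constructor takes no input size; the parameter is created and initialized on the first forward call, with an initializer supplied for that moment:

```python
import numpy as np

class LazyLinear:
    """Hypothetical layer with delayed parameter creation."""

    def __init__(self, out_features, w_init=None):
        self.out_features = out_features
        # initializer is passed in and applied once the shape is known
        self.w_init = w_init or (lambda shape: np.zeros(shape, dtype=np.float32))
        self.W = None  # not created until the first forward call

    def forward(self, x):
        if self.W is None:
            # the input size is read off the first input tensor, not the constructor
            self.W = self.w_init((x.shape[1], self.out_features))
        return x @ self.W

    def get_params(self):
        if self.W is None:
            raise RuntimeError("params not created yet; run a forward pass first")
        return {"W": self.W}

layer = LazyLinear(4)
y = layer.forward(np.ones((2, 3), dtype=np.float32))  # W becomes (3, 4) here
```

This is exactly the trade-off described above: get_params() raises until a forward (or init) pass has run.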
The autograd.softmax may have a problem, which I found when taking part in the review of PR #572.
In examples/autograd/mlp.py (multilayer perceptron), the result is:
ubuntu@ip-172-31-26-47:~/singa/examples/autograd$ python3 mlp.py
train_data_shape: (400, 2)
train_label_shape: (400, 2)
training loss = 0.6908062
training loss = 0.5960194
training loss = 0.57797414
training loss = 0.55334115
training loss = 0.48568404
training loss = 0.38458923
training loss = 0.30776194
training loss = 0.24188559
training loss = 0.18657134
training loss = 0.15864176
training loss = 0.13929243
However, if I use softmax + cross_entropy instead of softmax_cross_entropy, I get this error:
ubuntu@ip-172-31-26-47:~/singa/examples/autograd$ python3 mlp.py
train_data_shape: (400, 2)
train_label_shape: (400, 2)
training loss = 6.682101
WARNING: Logging before InitGoogleLogging() is written to STDERR
F0113 09:20:05.180658 12032 tensor_math_cpp.h:357] Check failed: a > 0.f (-nan vs. 0)
*** Check failure stack trace: ***
Aborted (core dumped)
In the review of PR #572, I did not suspect SoftMax because I compared the 1-D result with PyTorch. However, now when I run it with 2-D input, the backpropagation does not work with parameter axis = 1.
My test codes and results for softmax are in:
https://gist.github.com/chrishkchris/1bce55260b5e771ce974940a855292e2
I still need to figure out how to debug this further.
I used the clang formatter with VS Code after altering the tensor.h file, and it produces a different format from the dev branch.
The tensor.cc file should have been re-formatted earlier in PR #581. So, did I use an incorrect setting for the clang formatter?
Hi,
The issue might be known; however, creating a neural network layer stack with mismatched layers can cause the current Python session to end abruptly, without generating any stack trace, while calculating the model loss (e.g., autograd.mse_loss(y, t)).
for example of a simple feed forward neural network:
class MLP():
    def __init__(self):
        self.linear1 = autograd.Linear(3, 4)
        self.linear2 = autograd.Linear(4, 3)

    def forward(self, x):
        y = self.linear1(x)
        return self.linear2(y)
If the output does not have a dimension of 3, the current session terminates without raising a Python error; only the following warning and stack trace are printed:
WARNING: Logging before InitGoogleLogging() is written to STDERR
F0520 17:37:19.265754 288538048 tensor.cc:431] Check failed: shape_.at(m - i) == 1 (3 vs. 1) i= 0
*** Check failure stack trace: ***
This forces the user to rerun the entire program/notebook. The same issue is not seen in autograd.backward, which raises an assertion error instead.
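One possible mitigation, sketched below in plain numpy (an assumption for illustration, not the SINGA implementation), is to validate shapes on the Python side before calling into the C++ kernel, so a mismatch raises a catchable Python exception instead of aborting the process via a glog CHECK:

```python
import numpy as np

def checked_mse_loss(y, t):
    """Hypothetical Python-side guard around an mse loss."""
    if y.shape != t.shape:
        # raise a Python error the user can catch, instead of crashing in C++
        raise ValueError(
            f"mse_loss shape mismatch: prediction {y.shape} vs target {t.shape}")
    return float(np.mean((y - t) ** 2))

y = np.zeros((4, 3), dtype=np.float32)
t = np.zeros((4, 2), dtype=np.float32)
try:
    checked_mse_loss(y, t)
except ValueError as e:
    print("caught:", e)  # session survives, no core dump
```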
Thanks and Regards,
Shashank
The docstrings will be used to build the API documentation pages at https://apache-singa.readthedocs.io/en/latest/
editorconfig is a configuration file format adopted by many editors to keep coding style consistent, which avoids many spurious diffs when we do git merge.
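For reference, a minimal sketch of what such a file could look like at the repository root (the values below are illustrative assumptions, not settings the project has agreed on):

```ini
# Hypothetical .editorconfig sketch
root = true

[*]
charset = utf-8
end_of_line = lf
insert_final_newline = true
trim_trailing_whitespace = true

[*.py]
indent_style = space
indent_size = 4

[*.{h,cc}]
indent_style = space
indent_size = 2
```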
Hi,
To convert an ONNX model to SINGA, sonnx.py is used; different modules are converted through it. The current implementation does not support MaxPool2d with ceil_mode set to True, nor the count_include_pad attribute.
For MaxPool2d implemented in PyTorch, ceil_mode is a boolean that defaults to False. Sometimes, when converting the PyTorch model to ONNX, the attribute is transferred to the ONNX representation as:
onnx_node.attrs["ceil_mode"] = 0
which is a valid case for conversion to the SINGA format. The current code in sonnx, however, only checks for the presence of the ceil_mode attribute in onnx_node.attrs before raising an exception, as illustrated below:
def _create_max_avg_pool(cls, onnx_node, inputs, opset_version):
    """Get the max or avg pool operator from an onnx node.
    Args:
        onnx_node: a given onnx node
        inputs: the input tensor
        opset_version: the opset version
    Returns:
        handle, the handle of the singa operator
        forward, the autograd of the singa operator
    """
    kernel = tuple(onnx_node.attrs["kernel_shape"])
    padding = tuple(
        onnx_node.attrs["pads"]) if "pads" in onnx_node.attrs else (0, 0)
    stride = tuple(onnx_node.getattr('strides', (1, 1)))
    # the default odd_padding is 0; if there is a SAME pad mode, we modify it
    # (for odd_padding, please refer to autograd.py)
    odd_padding = (0, 0, 0, 0)
    if "auto_pad" in onnx_node.attrs:
        auto_pad = utils.force_unicode(onnx_node.attrs['auto_pad'])
        if auto_pad in ('SAME_UPPER', 'SAME_LOWER'):
            padding, odd_padding = utils.get_padding_shape(
                auto_pad, inputs[0].shape[2:], kernel, stride)
    # does not support count_include_pad and ceil_mode
    if "count_include_pad" in onnx_node.attrs or "ceil_mode" in onnx_node.attrs:
        raise ValueError(
            "Not implemented yet for count_include_pad or ceil_mode")
    # only support 2d
    if len(kernel) != 2:
        raise ValueError("Not implemented yet")
    is_max = onnx_node.op_type == 'MaxPool'
    x = inputs[0]
    if x.device.id() == -1:
        handle = singa.PoolingHandle(x.data, kernel, stride, padding,
                                     is_max)
    else:
        handle = singa.CudnnPoolingHandle(x.data, kernel, stride, padding,
                                          is_max)
    _, forward = cls._common_onnx_node_to_singa_op(onnx_node, inputs,
                                                   opset_version)
    return _, forward(handle, odd_padding)
The code does not consider whether the value of ceil_mode is actually set to False/0.
The following change handles this edge case:
def _create_max_avg_pool(cls, onnx_node, inputs, opset_version):
    """Get the max or avg pool operator from an onnx node.
    Args:
        onnx_node: a given onnx node
        inputs: the input tensor
        opset_version: the opset version
    Returns:
        handle, the handle of the singa operator
        forward, the autograd of the singa operator
    """
    kernel = tuple(onnx_node.attrs["kernel_shape"])
    padding = tuple(
        onnx_node.attrs["pads"]) if "pads" in onnx_node.attrs else (0, 0)
    stride = tuple(onnx_node.getattr('strides', (1, 1)))
    # the default odd_padding is 0; if there is a SAME pad mode, we modify it
    # (for odd_padding, please refer to autograd.py)
    odd_padding = (0, 0, 0, 0)
    if "auto_pad" in onnx_node.attrs:
        auto_pad = utils.force_unicode(onnx_node.attrs['auto_pad'])
        if auto_pad in ('SAME_UPPER', 'SAME_LOWER'):
            padding, odd_padding = utils.get_padding_shape(
                auto_pad, inputs[0].shape[2:], kernel, stride)
    # raise only when ceil_mode is present AND truthy
    if "ceil_mode" in onnx_node.attrs and onnx_node.attrs["ceil_mode"]:
        raise ValueError(
            "Not implemented yet for ceil_mode")
    if "count_include_pad" in onnx_node.attrs:
        raise ValueError(
            "Not implemented yet for count_include_pad")
    # only support 2d
    if len(kernel) != 2:
        raise ValueError("Not implemented yet")
    is_max = onnx_node.op_type == 'MaxPool'
    x = inputs[0]
    if x.device.id() == -1:
        handle = singa.PoolingHandle(x.data, kernel, stride, padding,
                                     is_max)
    else:
        handle = singa.CudnnPoolingHandle(x.data, kernel, stride, padding,
                                          is_max)
    _, forward = cls._common_onnx_node_to_singa_op(onnx_node, inputs,
                                                   opset_version)
    return _, forward(handle, odd_padding)
The issue was encountered while converting shufflenetv2 from onnx to singa.
Please let us know if this change is possible.
Thanks and Regards,
Shashank Nigam
Hi, @dcslin , the asType operator does not work when it comes after the reshape operator.
Please check by using the following code:
dev = device.create_cuda_gpu()
X = np.array([[1, 0], [1, 1]]).astype(np.int32)
x = tensor.from_numpy(X)
x.to_device(dev)
x = autograd.cast(x, tensor.int32)
x = autograd.reshape(x, [1, 2, 2])
x = autograd.cast(x, tensor.float32)
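The same pipeline in plain numpy can serve as a reference oracle for what the SINGA graph above should produce (numpy here is an assumption for illustration, not the SINGA autograd path): the final cast should yield a float32 tensor of shape (1, 2, 2) rather than fail.

```python
import numpy as np

# cast -> reshape -> cast on the same 2x2 matrix, as a reference result
X = np.array([[1, 0], [1, 1]]).astype(np.int32)
y = X.astype(np.int32).reshape(1, 2, 2).astype(np.float32)
assert y.shape == (1, 2, 2)
assert y.dtype == np.float32
```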
The dev branch, the master branch, and the released versions are updated at high, medium, and low frequency, respectively.
However, the current CI builds the conda package whenever master has new commits.
I suggest to
It would be better to implement each metric as a function in metric.py, since metrics typically have no state; there is no need to make them classes. E.g.,
def accuracy(y_pred, y_true):
    """Compute the accuracy.
    Args:
        y_pred (numpy array or tensor): each value is a label index
        y_true (numpy array or tensor): each value is a label index
    """
    # check that the shapes match
    assert y_pred.shape == y_true.shape, 'shape mismatch'
    # convert y_pred and y_true to np arrays first if they are tensors
    return np.sum(y_pred == y_true) / y_true.shape[0]
Refer to https://keras.io/api/metrics/
Hi, when running the Python test scripts, even when the tests pass, the CUDA stream still fails:
python3 test/python/test_operation.py -v TestPythonOperation.test_sum_cpu
test_sum_cpu (__main__.TestPythonOperation) ... ok
----------------------------------------------------------------------
Ran 1 test in 0.002s
OK
WARNING: Logging before InitGoogleLogging() is written to STDERR
F0618 07:06:09.329576 10741 cuda_gpu.cc:48] Check failed: error == cudaSuccess (29 vs. 0) driver shutting down
*** Check failure stack trace: ***
Aborted (core dumped)
In the log of the Travis CI CPU build, test_onnx fails because libprotobuf.so.20 cannot be imported:
https://travis-ci.org/github/apache/singa/jobs/664251025#L3998
======================================================================
ERROR: test_onnx (unittest.loader._FailedTest)
----------------------------------------------------------------------
ImportError: Failed to import test module: test_onnx
Traceback (most recent call last):
File "/home/travis/conda-bld-1971.5/singa_1584596418932/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pla/lib/python3.7/unittest/loader.py", line 436, in _find_test_path
module = self._get_module_from_name(name)
File "/home/travis/conda-bld-1971.5/singa_1584596418932/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pla/lib/python3.7/unittest/loader.py", line 377, in _get_module_from_name
__import__(name)
File "/home/travis/conda-bld-1971.5/singa_1584596418932/test_tmp/test/python/test_onnx.py", line 24, in <module>
from singa import sonnx
File "/home/travis/conda-bld-1971.5/singa_1584596418932/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pla/lib/python3.7/site-packages/singa/sonnx.py", line 23, in <module>
import onnx.utils
File "/home/travis/conda-bld-1971.5/singa_1584596418932/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pla/lib/python3.7/site-packages/onnx/__init__.py", line 8, in <module>
from .onnx_cpp2py_export import ONNX_ML
ImportError: libprotobuf.so.20: cannot open shared object file: No such file or directory
I followed the installation instructions at http://singa.apache.org/ but encountered an error similar to this issue: https://issues.apache.org/jira/browse/SINGA-422
I tried deleting all installed packages and redoing the installation, and encountered this error:
NoBaseEnvironmentError: This conda installation has no default base environment. Use
'conda create' to create new environments and 'conda activate' to
activate environments.
I created a new virtual environment, and I encountered this error when I entered any of the three commands:
PackagesNotFoundError: The following packages are not available from current channels:
- singa-gpu
Have there been any similar issues to this? Thank you for the help!
Currently, we implement the rnn operations from scratch, which may not be as fast as the cudnn versions.
To use the cudnn rnn operations, we need to implement the cpp operation and call it from the Python side.
https://docs.nvidia.com/deeplearning/sdk/cudnn-api/index.html#cudnnRNNMode_t
Hi, @dcslin , some onnx models need scalar (0-d) tensors, but we cannot support them yet.
The error is:
Traceback (most recent call last):
File "../../test/python/test_operation.py", line 4015, in test_tmp
x = tensor.from_numpy(x)
File "/usr/local/lib/python3.5/dist-packages/singa/tensor.py", line 766, in from_numpy
ret.copy_from_numpy(np_array)
File "/usr/local/lib/python3.5/dist-packages/singa/tensor.py", line 307, in copy_from_numpy
assert np_array.size == self.size(), 'tensor shape should be the same'
AssertionError: tensor shape should be the same
Please use this test case:
x = np.array(256, ndmin=0).astype(np.int32)
x = tensor.from_numpy(x)
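The root of the mismatch is that a 0-d numpy array has size 1 but an empty shape tuple. A possible workaround on the user side (an assumption for illustration, not the SINGA fix) is to promote the scalar to shape (1,) before copying it into a tensor:

```python
import numpy as np

x = np.array(256, ndmin=0).astype(np.int32)
assert x.shape == ()      # 0-d: empty shape
assert x.size == 1        # but still one element

# promote to a 1-element 1-d array before handing it to the framework
x1 = np.atleast_1d(x)
assert x1.shape == (1,)
assert int(x1[0]) == 256
```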
I'm trying to test some examples of SINGA. However, when I ran the examples, the singa-gpu
package threw an error that it could not find the Module class.
I reproduced this error with the following commands:
(base) user@xgpe3:~$ conda activate sg37
(sg37) user@xgpe3:~$ conda install -c nusdbsystem -c conda-forge singa-gpu=3.0.0
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: done
==> WARNING: A newer version of conda exists. <==
current version: 4.8.3
latest version: 4.8.4
Please update conda by running
$ conda update -n base -c defaults conda
## Package Plan ##
environment location: /path/to/miniconda3/envs/sg37
added / updated specs:
- singa-gpu=3.0.0
The following packages will be downloaded:
package | build
---------------------------|-----------------
importlib_metadata-1.7.0 | 0 3 KB conda-forge
zipp-3.1.0 | py_0 10 KB conda-forge
------------------------------------------------------------
Total: 13 KB
The following NEW packages will be INSTALLED:
_openmp_mutex conda-forge/linux-64::_openmp_mutex-4.5-1_llvm
attrs conda-forge/noarch::attrs-20.2.0-pyh9f0ad1d_0
cudatoolkit pkgs/main/linux-64::cudatoolkit-10.0.130-0
cudnn pkgs/main/linux-64::cudnn-7.6.5-cuda10.0_0
deprecated conda-forge/noarch::deprecated-1.2.7-py_0
dnnl nusdbsystem/linux-64::dnnl-1.1-build
freetype conda-forge/linux-64::freetype-2.10.2-he06d7ca_0
future conda-forge/linux-64::future-0.18.2-py37hc8dfbb8_1
glog conda-forge/linux-64::glog-0.3.5-hf484d3e_1001
importlib-metadata conda-forge/linux-64::importlib-metadata-1.7.0-py37hc8dfbb8_0
importlib_metadata conda-forge/noarch::importlib_metadata-1.7.0-0
iniconfig conda-forge/noarch::iniconfig-1.0.1-pyh9f0ad1d_0
jpeg conda-forge/linux-64::jpeg-9d-h516909a_0
lcms2 conda-forge/linux-64::lcms2-2.11-hbd6801e_0
libblas conda-forge/linux-64::libblas-3.8.0-16_openblas
libcblas conda-forge/linux-64::libcblas-3.8.0-16_openblas
libgfortran-ng conda-forge/linux-64::libgfortran-ng-7.5.0-hdf63c60_16
liblapack conda-forge/linux-64::liblapack-3.8.0-16_openblas
libopenblas conda-forge/linux-64::libopenblas-0.3.9-h5ec1e0e_0
libpng conda-forge/linux-64::libpng-1.6.37-hed695b0_2
libprotobuf conda-forge/linux-64::libprotobuf-3.9.2-h8b12597_0
libtiff conda-forge/linux-64::libtiff-4.1.0-hc7e4089_6
libwebp-base conda-forge/linux-64::libwebp-base-1.1.0-h516909a_3
llvm-openmp conda-forge/linux-64::llvm-openmp-10.0.1-hc9558a2_0
lz4-c conda-forge/linux-64::lz4-c-1.9.2-he1b5a44_3
more-itertools conda-forge/noarch::more-itertools-8.5.0-py_0
numpy conda-forge/linux-64::numpy-1.16.5-py37h95a1406_0
olefile conda-forge/noarch::olefile-0.46-py_0
onnx conda-forge/linux-64::onnx-1.6.0-py37he1b5a44_0
packaging conda-forge/noarch::packaging-20.4-pyh9f0ad1d_0
pillow conda-forge/linux-64::pillow-7.2.0-py37h718be6c_1
pluggy conda-forge/linux-64::pluggy-0.13.1-py37hc8dfbb8_2
protobuf conda-forge/linux-64::protobuf-3.9.2-py37he1b5a44_1
py conda-forge/noarch::py-1.9.0-pyh9f0ad1d_0
pyparsing conda-forge/noarch::pyparsing-2.4.7-pyh9f0ad1d_0
pytest conda-forge/linux-64::pytest-6.0.1-py37hc8dfbb8_0
python_abi conda-forge/linux-64::python_abi-3.7-1_cp37m
singa nusdbsystem/linux-64::singa-3.0.0-cudnn7.6.5_cuda10.0_py37
singa-gpu nusdbsystem/linux-64::singa-gpu-3.0.0-py37
six conda-forge/noarch::six-1.15.0-pyh9f0ad1d_0
toml conda-forge/noarch::toml-0.10.1-pyh9f0ad1d_0
tqdm conda-forge/noarch::tqdm-4.48.2-pyh9f0ad1d_0
wrapt conda-forge/linux-64::wrapt-1.12.1-py37h8f50634_1
zipp conda-forge/noarch::zipp-3.1.0-py_0
zstd conda-forge/linux-64::zstd-1.4.5-h6597ccf_2
The following packages will be UPDATED:
libgcc-ng pkgs/main::libgcc-ng-9.1.0-hdf63c60_0 --> conda-forge::libgcc-ng-9.3.0-h24d8f2e_16
openssl pkgs/main::openssl-1.1.1g-h7b6447c_0 --> conda-forge::openssl-1.1.1g-h516909a_1
The following packages will be SUPERSEDED by a higher-priority channel:
_libgcc_mutex pkgs/main::_libgcc_mutex-0.1-main --> conda-forge::_libgcc_mutex-0.1-conda_forge
ca-certificates pkgs/main::ca-certificates-2020.7.22-0 --> conda-forge::ca-certificates-2020.6.20-hecda079_0
certifi pkgs/main::certifi-2020.6.20-py37_0 --> conda-forge::certifi-2020.6.20-py37hc8dfbb8_0
Proceed ([y]/n)? y
Downloading and Extracting Packages
importlib_metadata-1 | 3 KB | #################################################################################### | 100%
zipp-3.1.0 | 10 KB | #################################################################################### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
(sg37) user@xgpe3:~$ python -c "from singa.module import Module"
Traceback (most recent call last):
File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'singa.module'
(sg37) user@xgpe3:~$
By contrast, with the singa-cpu package, everything seems fine.
(base) user@xgpe3:~$ conda activate sc37
(sc37) user@xgpe3:~$ conda install -c nusdbsystem -c conda-forge singa-cpu=3.0.0
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: done
==> WARNING: A newer version of conda exists. <==
current version: 4.8.3
latest version: 4.8.4
Please update conda by running
$ conda update -n base -c defaults conda
## Package Plan ##
environment location: /path/to/miniconda3/envs/sc37
added / updated specs:
- singa-cpu=3.0.0
The following packages will be downloaded:
package | build
---------------------------|-----------------
singa-3.0.0 | cpu_py37 22.2 MB nusdbsystem
singa-cpu-3.0.0 | py37 4 KB nusdbsystem
------------------------------------------------------------
Total: 22.2 MB
The following NEW packages will be INSTALLED:
_openmp_mutex conda-forge/linux-64::_openmp_mutex-4.5-1_llvm
deprecated conda-forge/noarch::deprecated-1.2.7-py_0
dnnl nusdbsystem/linux-64::dnnl-1.1-build
freetype conda-forge/linux-64::freetype-2.10.2-he06d7ca_0
future conda-forge/linux-64::future-0.18.2-py37hc8dfbb8_1
glog conda-forge/linux-64::glog-0.3.5-hf484d3e_1001
jpeg conda-forge/linux-64::jpeg-9d-h516909a_0
lcms2 conda-forge/linux-64::lcms2-2.11-hbd6801e_0
libblas conda-forge/linux-64::libblas-3.8.0-16_openblas
libcblas conda-forge/linux-64::libcblas-3.8.0-16_openblas
libgfortran-ng conda-forge/linux-64::libgfortran-ng-7.5.0-hdf63c60_16
liblapack conda-forge/linux-64::liblapack-3.8.0-16_openblas
libopenblas conda-forge/linux-64::libopenblas-0.3.9-h5ec1e0e_0
libpng conda-forge/linux-64::libpng-1.6.37-hed695b0_2
libprotobuf conda-forge/linux-64::libprotobuf-3.9.2-h8b12597_0
libtiff conda-forge/linux-64::libtiff-4.1.0-hc7e4089_6
libwebp-base conda-forge/linux-64::libwebp-base-1.1.0-h516909a_3
llvm-openmp conda-forge/linux-64::llvm-openmp-10.0.1-hc9558a2_0
lz4-c conda-forge/linux-64::lz4-c-1.9.2-he1b5a44_3
numpy conda-forge/linux-64::numpy-1.16.5-py37h95a1406_0
olefile conda-forge/noarch::olefile-0.46-py_0
onnx conda-forge/linux-64::onnx-1.6.0-py37he1b5a44_0
pillow conda-forge/linux-64::pillow-7.2.0-py37h718be6c_1
protobuf conda-forge/linux-64::protobuf-3.9.2-py37he1b5a44_1
python_abi conda-forge/linux-64::python_abi-3.7-1_cp37m
singa nusdbsystem/linux-64::singa-3.0.0-cpu_py37
singa-cpu nusdbsystem/linux-64::singa-cpu-3.0.0-py37
six conda-forge/noarch::six-1.15.0-pyh9f0ad1d_0
tqdm conda-forge/noarch::tqdm-4.48.2-pyh9f0ad1d_0
wrapt conda-forge/linux-64::wrapt-1.12.1-py37h8f50634_1
zstd conda-forge/linux-64::zstd-1.4.5-h6597ccf_2
The following packages will be UPDATED:
libgcc-ng pkgs/main::libgcc-ng-9.1.0-hdf63c60_0 --> conda-forge::libgcc-ng-9.3.0-h24d8f2e_16
openssl pkgs/main::openssl-1.1.1g-h7b6447c_0 --> conda-forge::openssl-1.1.1g-h516909a_1
The following packages will be SUPERSEDED by a higher-priority channel:
_libgcc_mutex pkgs/main::_libgcc_mutex-0.1-main --> conda-forge::_libgcc_mutex-0.1-conda_forge
ca-certificates pkgs/main::ca-certificates-2020.7.22-0 --> conda-forge::ca-certificates-2020.6.20-hecda079_0
certifi pkgs/main::certifi-2020.6.20-py37_0 --> conda-forge::certifi-2020.6.20-py37hc8dfbb8_0
Proceed ([y]/n)? y
Downloading and Extracting Packages
singa-cpu-3.0.0 | 4 KB | #################################################################################### | 100%
singa-3.0.0 | 22.2 MB | #################################################################################### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
(sc37) user@xgpe3:~$ python -c "from singa.module import Module"
(sc37) user@xgpe3:~$
Is this a bug or just the way it goes?
Updated on May 15
class Layer:
    def get_params(self):
        """Return the params of this layer and its sublayers as a dict; a param's name is layername.param.
        E.g., for self.W = Tensor() and self.b = Tensor(),
        the names of W and b are like conv1.W and conv1.b.
        """

    def get_states(self):
        """Return the states of this layer and its sublayers that are necessary for model evaluation/inference.
        The states include the params and others, e.g., the running mean and var of batchnorm.
        """


class Module(Layer):
    def compile(self, ...):
        """Set the name of each layer and its sublayers, which will be used to create the dicts
        for get_params and get_states. Then there is no need to manually configure the layer name
        in the __init__ method of a layer.
        For instance,
            class Blk(Layer):
                def __init__(self):
                    self.conv1 = Conv2d()
                    self.conv2 = Conv2d()

            class MyModel(Module):
                def __init__(self):
                    self.blk1 = Blk()  # --> blk1.conv1, blk1.conv2
                    self.blk2 = Blk()  # --> blk2.conv1, blk2.conv2
        """

    # high priority
    def save(self, fpath, ckp_states={}):
        """Save the model and optionally some states.
        Args:
            fpath: output file path (without the extension)
            ckp_states(dict): states for the checkpoint that are not attributes of Module, e.g., epoch ID.
        """
        # cust_states = {}
        # if ckp_states is not None:
        #     cust_states = ckp_states + model (including sublayers) attributes - get_states()
        # save the model states via onnx, with a customized field for the cust_states

    def load(self, fpath, dev, use_graph, graph_alg):
        """Load the model onto dev.
        Args:
            fpath: input file path (without the extension)
        Returns:
            dict for the ckp_states.
        """
        # load the model states + cust_states
        # model attributes = model states + attributes from cust_states
        # self.compile()
        # restore the model attributes
        # return the rest of the states as a dict


# lower priority
def save(fpath, model, ckp_states):
    # attributes <-- model
    # replace all tensors in attributes + ckp_states with dict entries name --> (shape, dtype)
    # dump the tensors via numpy.savez_compressed
    # dump the model via pickle
    ...

def load(fpath, dev, use_graph, graph_alg):
    # load the model via pickle
    # load the tensors via numpy.load
    # restore the tensors
    # return the ckp_states
    ...
Clarification:
Layer.get_params()
Layer.get_states()
class.__dict__: superset of the states.
SINGA has many common operators and layers.
There are also many operators to be implemented.
Here is a list of popular operators defined by ONNX: https://github.com/onnx/onnx/blob/master/docs/Operators.md
Here is the list of operators implemented in SINGA http://singa.apache.org/docs/onnx/#supported-operators
The task is to add the operators that are in ONNX but not yet in SINGA.
Refer to this link for how to add new operators and corresponding layers into SINGA.
I propose to release a minor version to reflect the changes since v3.0.
Please test the following items and check the documentation if they are done.
Here is the checklist and steps
Select a release manager. The release manager (RM) is the coordinator for the release process. It is the RM's signature (.asc) that is uploaded together with the release. The RM generates a KEY (RSA 4096-bit) and uploads it to a public key server. The RM needs to get the key endorsed (signed) by another Apache user, to be connected to the web of trust, and should first ask a mentor to help sign the key. How to generate the key?
Check license. FAQ; SINGA Issue
Bump the version. Check code and documentation
Prepare the RELEASE_NOTES file. Include the following items, Introduction, Features, Bugs (link to JIRA or Github PR), Changes, Dependency list, Incompatibility issues. Follow this example.
Prepare DISCLAIMER file. Modify from the template
Package the release candidate. The release should be packaged as apache-singa-VERSION.tar.gz. The release should not include any binary files, including git files. Upload the release to the staging area. The tar file, signature, KEY and SHA256 checksum file should be included. MD5 is no longer used. The policy is here
Call for vote by sending an email
To: [email protected]
Subject: [VOTE] Release apache-singa-X.Y.Z (release candidate N)
Hi all,
I have created a build for Apache SINGA X.Y.Z, release candidate N.
The artifacts to be voted on are located here: xxxx
The hashes of the artifacts are as follows: xxx
Release artifacts are signed with the following key: xxx
Please vote on releasing this package. The vote is open for at least 72 hours and passes if a majority of at least three +1 votes are cast.
[ ] +1 Release this package as Apache SINGA X.Y.Z
[ ] 0 I don't feel strongly about it, but I'm okay with the release
[ ] -1 Do not release this package because...
Here is my vote:
+1
Wait at least 48 hours for test responses. Any PMC member, committer or contributor can test the release features and give feedback. Everyone should check these before voting +1. If the vote passes, send the result email. Otherwise, repeat from the beginning.
Subject: [RESULT] [VOTE] Release apache-singa-X.Y.Z (release candidate N)
To: [email protected]
Thanks to everyone who has voted and given their comments. The tally is as follows.
N binding +1s:
<names>
N non-binding +1s:
<names>
No 0s or -1s.
I am delighted to announce that the proposal to release Apache SINGA X.Y.Z has passed.
Upload the package for distribution to https://dist.apache.org/repos/dist/release/VERSION/.
Update the Download page of SINGA website. The tar.gz file MUST be downloaded from mirror, using closer.cgi script; other artifacts MUST be downloaded from main Apache site. More details here. Some feedback we got during the previous releases: "Download pages must only link to formal releases, so must not include links to GitHub.", "Links to KEYS, sigs and hashes must not use dist.apache.org; instead use https://www.apache.org/dist/singa/...;", "Also you only need one KEYS link, and there should be a description of how to use KEYS + sig or hash to verify the downloads."
Remove the RC tag and compile the conda packages.
Publish the release information.
To: [email protected], [email protected]
Subject: [ANNOUNCE] Apache SINGA X.Y.Z released
We are pleased to announce that SINGA X.Y.Z is released.
SINGA is a general distributed deep learning platform for training big deep learning models over large datasets.
The release is available at: http://singa.apache.org/downloads.html
The main features of this release include XXX
We look forward to hearing your feedback, suggestions, and contributions to the project.
On behalf of the SINGA team, {SINGA Team Member Name}
Today when I run singa/test/python/test_operation.py, I get these errors:
ubuntu@ip-172-31-24-48:~/singa/test/python$ python3 test_operation.py
..................................................................E.FF..............................FF.....FF................
======================================================================
ERROR: test_conv2d_cpu (__main__.TestPythonOperation)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test_operation.py", line 216, in test_conv2d_cpu
y = conv_1(cpu_input_tensor) # PyTensor
File "/home/ubuntu/singa/build/python/singa/autograd.py", line 1380, in __call__
y = conv2d(self.handle, x, self.W, self.b)
File "/home/ubuntu/singa/build/python/singa/autograd.py", line 1241, in conv2d
return _Conv2d(handle)(x, W, b)[0]
File "/home/ubuntu/singa/build/python/singa/autograd.py", line 247, in __call__
return self._do_forward(*xs)
File "/home/ubuntu/singa/build/python/singa/autograd.py", line 298, in _do_forward
ys = self.forward(*xs)
File "/home/ubuntu/singa/build/python/singa/autograd.py", line 1203, in forward
return singa.GpuConvForward(x, W, b, self.handle)
TypeError: in method 'GpuConvForward', argument 4 of type 'singa::CudnnConvHandle const &'
======================================================================
FAIL: test_div_broadcast_cpu (__main__.TestPythonOperation)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test_operation.py", line 2616, in test_div_broadcast_cpu
np.testing.assert_array_almost_equal(tensor.to_numpy(tensor.from_raw_tensor(dx1)), grad1, decimal=5)
File "/usr/local/lib/python3.5/dist-packages/numpy/testing/_private/utils.py", line 1007, in assert_array_almost_equal
precision=decimal)
File "/usr/local/lib/python3.5/dist-packages/numpy/testing/_private/utils.py", line 819, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Arrays are not almost equal to 5 decimals
Mismatch: 3.33%
Max absolute difference: 3.0517578e-05
Max relative difference: 9.684139e-07
x: array([[-1.30722e+01, 2.65515e+00, -6.92423e-02, -2.97908e-01,
6.12429e+00, 3.71461e-01],
[ 1.33601e+01, -4.65283e+00, -4.74600e-01, -9.15998e-01,...
y: array([[-1.30722e+01, 2.65515e+00, -6.92423e-02, -2.97908e-01,
6.12429e+00, 3.71461e-01],
[ 1.33601e+01, -4.65283e+00, -4.74600e-01, -9.15998e-01,...
======================================================================
FAIL: test_div_broadcast_gpu (__main__.TestPythonOperation)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test_operation.py", line 2584, in test_div_broadcast_gpu
np.testing.assert_array_almost_equal(tensor.to_numpy(tensor.from_raw_tensor(dx1)), grad1, decimal=5)
File "/usr/local/lib/python3.5/dist-packages/numpy/testing/_private/utils.py", line 1007, in assert_array_almost_equal
precision=decimal)
File "/usr/local/lib/python3.5/dist-packages/numpy/testing/_private/utils.py", line 819, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Arrays are not almost equal to 5 decimals
Mismatch: 40%
Max absolute difference: 6.1035156e-05
Max relative difference: 3.51512e-07
x: array([-173.63599, -30.95938, 139.375 , -4.83802, -2.26971],
dtype=float32)
y: array([-173.63605, -30.95938, 139.37502, -4.83802, -2.26971],
dtype=float32)
======================================================================
FAIL: test_pow_broadcast_cpu (__main__.TestPythonOperation)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test_operation.py", line 2678, in test_pow_broadcast_cpu
np.testing.assert_array_almost_equal(tensor.to_numpy(tensor.from_raw_tensor(dx1)), grad1, decimal=5)
File "/usr/local/lib/python3.5/dist-packages/numpy/testing/_private/utils.py", line 1007, in assert_array_almost_equal
precision=decimal)
File "/usr/local/lib/python3.5/dist-packages/numpy/testing/_private/utils.py", line 819, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Arrays are not almost equal to 5 decimals
Mismatch: 40%
Max absolute difference: 6.1035156e-05
Max relative difference: 1.3951524e-07
x: array([ 169.04495, -238.43016, 1852.8772 , 437.48016, -20.75186],
dtype=float32)
y: array([ 169.04497, -238.43016, 1852.8772 , 437.48022, -20.75186],
dtype=float32)
======================================================================
FAIL: test_pow_broadcast_gpu (__main__.TestPythonOperation)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test_operation.py", line 2645, in test_pow_broadcast_gpu
np.testing.assert_array_almost_equal(tensor.to_numpy(result), y, decimal=5)
File "/usr/local/lib/python3.5/dist-packages/numpy/testing/_private/utils.py", line 1007, in assert_array_almost_equal
precision=decimal)
File "/usr/local/lib/python3.5/dist-packages/numpy/testing/_private/utils.py", line 819, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Arrays are not almost equal to 5 decimals
Mismatch: 6.67%
Max absolute difference: 6.1035156e-05
Max relative difference: 8.3724494e-08
x: array([[[ 1. , 216. , 64. , 36. , 343. ],
[ 27. , 125. , 512. , 36. , 343. ],
[ 1. , 343. , 1. , 81. , 343. ],...
y: array([[[ 1., 216., 64., 36., 343.],
[ 27., 125., 512., 36., 343.],
[ 1., 343., 1., 81., 343.],...
======================================================================
FAIL: test_reshape_cpu (__main__.TestPythonOperation)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test_operation.py", line 1455, in test_reshape_cpu
np.testing.assert_array_almost_equal(tensor.to_numpy(tensor.from_raw_tensor(dx)), grad, decimal=5)
File "/usr/local/lib/python3.5/dist-packages/numpy/testing/_private/utils.py", line 1007, in assert_array_almost_equal
precision=decimal)
File "/usr/local/lib/python3.5/dist-packages/numpy/testing/_private/utils.py", line 752, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Arrays are not almost equal to 5 decimals
(shapes (2, 3), (3, 2) mismatch)
x: array([[1., 1., 1.],
[1., 1., 1.]], dtype=float32)
y: array([[1., 1.],
[1., 1.],
[1., 1.]], dtype=float32)
======================================================================
FAIL: test_reshape_gpu (__main__.TestPythonOperation)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test_operation.py", line 1475, in test_reshape_gpu
np.testing.assert_array_almost_equal(tensor.to_numpy(tensor.from_raw_tensor(dx)), grad, decimal=5)
File "/usr/local/lib/python3.5/dist-packages/numpy/testing/_private/utils.py", line 1007, in assert_array_almost_equal
precision=decimal)
File "/usr/local/lib/python3.5/dist-packages/numpy/testing/_private/utils.py", line 752, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Arrays are not almost equal to 5 decimals
(shapes (2, 3), (3, 2) mismatch)
x: array([[1., 1., 1.],
[1., 1., 1.]], dtype=float32)
y: array([[1., 1.],
[1., 1.],
[1., 1.]], dtype=float32)
----------------------------------------------------------------------
Ran 125 tests in 0.586s
FAILED (failures=6, errors=1)
#688 is refactoring the autograd module.
Here are some comments about the current APIs in autograd.
Relationship between the classes and functions in autograd.
Operator implements the forward and backward methods for autograd. For each Operator class, there is a function that creates an Operator instance and calls the forward method.
Layer stores the states (handles and parameters) and calls the Operator function for the real computation. Note that a layer class can have sub-layers (as states) for creating complex and deep models.
Issue:
When we create a network using the Module API, there are both stateless (e.g., flatten) and stateful (e.g., conv2d) operations. Currently, we create layers in __init__ of Module and call the layers and operator functions in the forward method. Therefore, Layer and Operator are mixed, which may confuse users. A better way is to use Layer instances only: for every operator, we create a corresponding layer class to replace the operator function.
Layer API.
Issue: when and how to initialize the parameters (and handle) of a layer?
When: in the __init__ method, OR when the data is forwarded for the first time (#674).
How: pass an initializer function to the __init__ method of each layer and use it to initialize the parameters, OR pass an initializer function to the __init__ method of the Module class and use it to initialize the parameters (through get_params) of the layers after forwarding the layers once. The second approach requires the Module class's __init__ to do a forward pass of all layers and then call get_params of each layer for initialization; to do that, it needs at least the shapes of the input tensors and the device. The drawback of the first approach is that the initializer must be included in every Layer constructor.
Comments are welcomed.
Hi,
I have implemented AlexNet in SINGA but I get an error during the backward_and_update call. I am using SINGA 3.0.0.rc1 on CPU.
This is my AlexNet implementation:
from singa import autograd
from singa import module
from singa import opt

__all__ = ['AlexNet', 'alexnet']

class AlexNet(module.Module):
    def __init__(self, num_classes=1000):
        super(AlexNet, self).__init__()
        # 12 on GPU, so split 6 & 6
        self.features1 = [
            autograd.Conv2d(3, 64, kernel_size=11, stride=4, padding=2),
            autograd.ReLU(),
            autograd.MaxPool2d(kernel_size=3, stride=2),
            autograd.Conv2d(64, 192, kernel_size=5, padding=2),
            autograd.ReLU(),
            autograd.MaxPool2d(kernel_size=3, stride=2),
            autograd.Conv2d(192, 384, kernel_size=3, padding=1),
            autograd.ReLU(),
            autograd.Conv2d(384, 256, kernel_size=3, padding=1),
            autograd.ReLU()
        ]
        self.features2 = [
            autograd.Conv2d(256, 256, kernel_size=3, padding=1),
            autograd.ReLU(),
            autograd.MaxPool2d(kernel_size=3, stride=2)
        ]
        self.avgpool = autograd.AvgPool2d(6, stride=1)
        self.flatten = autograd.Flatten()
        self.classifier = [
            autograd.Dropout(),
            autograd.Linear(256 * 6 * 6, 4096),
            autograd.ReLU(),
            autograd.Dropout(),
            autograd.Linear(4096, 4096),
            autograd.ReLU(),
            autograd.Linear(4096, num_classes)
        ]
        self.optimizer = opt.SGD(lr=0.001, momentum=0.9)

    def loss(self, out, ty):
        return autograd.softmax_cross_entropy(out, ty)

    def optim(self, loss, dist_option, spars):
        if dist_option == 'fp32':
            self.optimizer.backward_and_update(loss)
        elif dist_option == 'fp16':
            self.optimizer.backward_and_update_half(loss)
        elif dist_option == 'partialUpdate':
            self.optimizer.backward_and_partial_update(loss)
        elif dist_option == 'sparseTopK':
            self.optimizer.backward_and_sparse_update(loss, topK=True, spars=spars)
        elif dist_option == 'sparseThreshold':
            self.optimizer.backward_and_sparse_update(loss, topK=False, spars=spars)

    def forward(self, x):
        for (i, layers) in enumerate([self.features1, self.features2,
                                      [self.avgpool, self.flatten], self.classifier]):
            for (j, fn) in enumerate(layers):
                x = fn(x)
                if type(x) is tuple:  # FIXME: ReLU returns a 1-tuple; is this a SINGA bug?
                    x = x[0]
        return x

def alexnet(**kwargs):
    return AlexNet(**kwargs)
And I get : AssertionError: ('shape mismatch', (9216, 4096), (256, 4096))
which corresponds to my first linear layer: 256 * 6 * 6 -> 4096.
When I use my VGG16 implementation, I get a similar error:
AssertionError: ('shape mismatch', (25088, 4096), (512, 4096))
It seems that the backward operation does not map the correct shape to the corresponding layer.
Moreover, the ReLU class returns a 1-tuple containing a Tensor. Is this intended or is it a bug?
Data loading is an important part of DL training; it can be slow and become a bottleneck if not implemented well.
The tasks include
Currently, we abort the program when any check fails via glog's CHECK functions.
We do not catch any exceptions, such as memory or cuDNN exceptions.
As a result, the program aborts or crashes whenever there is an error or exception, which sometimes shuts down the Jupyter or Colab notebook when we run the code in a notebook environment.
This ticket is to raise and handle exceptions in CPP code.
ref: http://www.swig.org/Doc3.0/SWIGDocumentation.html#Customization_exception
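On the Python side, exceptions raised from the C++ core could be surfaced as a small exception hierarchy instead of a process abort. The class names below are hypothetical, not existing SINGA API; they only illustrate the intended behavior:

```python
class SingaError(Exception):
    """Hypothetical base class for errors raised from the C++ core."""

class DeviceMemoryError(SingaError):
    """E.g. a failed device allocation, instead of aborting the process."""

class CudnnError(SingaError):
    """E.g. a cuDNN status code other than CUDNN_STATUS_SUCCESS."""

def check(ok, error_cls, message):
    # Python-side analogue of glog's CHECK: raise instead of abort,
    # so a notebook kernel survives the failure.
    if not ok:
        raise error_cls(message)

try:
    check(False, CudnnError, "CUDNN_STATUS_BAD_PARAM in GpuConvForward")
except SingaError as e:
    print("recovered:", e)
```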
In both the code and documentation, we need to make the project name consistent.
The official name of the project is SINGA, but sometimes we have to use singa, e.g., for the URL or the GitHub repo name.
So I suggest using SINGA (e.g., in documentation) and singa (e.g., in code), and avoiding Singa.
Hi, @dcslin , we cannot do matmul for high-dim tensors:
The error is:
F0324 05:47:19.174587 15611 tensor.cc:1413] Check failed: A.shape().size() == 2u (4 vs. 2)
please use this test case:
x1 = np.random.randn(1, 12, 256, 64).astype(np.float32)
x2 = np.random.randn(1, 12, 64, 256).astype(np.float32)
x1 = tensor.from_numpy(x1)
x1.to_device(gpu_dev)
x2 = tensor.from_numpy(x2)
x2.to_device(gpu_dev)
y = autograd.Matmul()(x1, x2)
print(tensor.to_numpy(y[0]))
Currently, SINGA's padding takes a single value and pads that amount on both sides (head and tail) of each spatial dimension.
However, according to the ONNX doc, there are four padding modes: NOTSET, SAME_UPPER, SAME_LOWER or VALID. SINGA cannot support SAME_UPPER and SAME_LOWER yet.
SAME_UPPER and SAME_LOWER mean padding the input so that the output spatial size matches the input; when the total padding is odd, the extra padding goes at the end for SAME_UPPER and at the beginning for SAME_LOWER.
For example, if the input is 32*32, the stride is 1, the kernel is 4*4, and the output must be the same size, we need to add 3 zeros in total per dimension; each side would get 3/2 = 1.5, which cannot be expressed with SINGA's current symmetric setting.
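The asymmetric split described above can be computed per the ONNX definition as follows; this is a sketch, and the function name is made up for illustration:

```python
import math

def same_padding(in_size, kernel, stride, mode="SAME_UPPER"):
    """Return (pad_begin, pad_end) for ONNX SAME_UPPER / SAME_LOWER.

    The output spatial size is ceil(in_size / stride); when the total
    padding is odd, the extra unit goes at the end for SAME_UPPER and
    at the beginning for SAME_LOWER.
    """
    out_size = math.ceil(in_size / stride)
    total = max((out_size - 1) * stride + kernel - in_size, 0)
    small, big = total // 2, total - total // 2
    return (small, big) if mode == "SAME_UPPER" else (big, small)

# The 32*32 / stride 1 / 4*4-kernel example from above: 3 zeros in total.
print(same_padding(32, 4, 1, "SAME_UPPER"))  # (1, 2)
print(same_padding(32, 4, 1, "SAME_LOWER"))  # (2, 1)
```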
Got AttributeError when running some Autograd models:
python examples/cnn/autograd/cifar10_multiprocess.py
AttributeError: module 'singa.singa_wrap' has no attribute 'NcclIdHolder'
python examples/cnn/autograd/mnist_dist.py
AttributeError: module 'singa.singa_wrap' has no attribute 'Communicator'
Does it mean that the singa_wrap file was not regenerated?
SINGA has multiple example models at http://singa.apache.org/docs/examples/
Some are implemented from scratch and some are converted from ONNX, which has a bigger model zoo https://github.com/onnx/models.
The task is to convert more onnx models and implement some popular (and interesting) models that are not in onnx model zoo.
Here are some reference model zoos https://modelzoo.co/, https://gluon-nlp.mxnet.io/model_zoo/index.html
Hi, @dcslin , we cannot do the mul operator for int tensors:
The error is:
F0324 05:04:22.542809 14739 tensor.cc:932] Unknown combination of data type kInt and language kCuda
please use this test case:
x1 = np.array([1], dtype=np.int32)
x2 = np.array([256], dtype=np.int32)
x1 = tensor.from_numpy(x1)
x1.to_device(gpu_dev)
x2 = tensor.from_numpy(x2)
x2.to_device(gpu_dev)
y = autograd.Mul()(x1, x2)
print(tensor.to_numpy(y[0]))
In singa/examples/rnn/train.py,
When we run it with the graph using sequential execution, i.e., line 200: model.graph(True, True), the training completes without error.
However, when we run it with the graph using node input dependencies, i.e., line 200: model.graph(True, False), it displays the error below:
Summarized from some discussion with @XJDKC, the error may be due to the following:
Therefore, to support computation of the graph using node input dependencies, we may need to update the buffering of some RNN operations.
AssertionError with the onnx testcase: https://github.com/apache/singa/blob/master/examples/onnx/training/train.py
$ cd examples/onnx
$ python3 training/train.py --model vgg16
Then I get the following error msg:
File "training/train.py", line 437, in <module>
args.onnx_model_path, args.data, sgd, args.graph, args.verbosity)
File "training/train.py", line 295, in run
model.compile([tx], is_train=True, use_graph=graph, sequential=sequential)
File "/home/extend/lijiansong/work-space/anaconda2/envs/intel-caffe/lib/python3.6/site-packages/singa/model.py", line 177, in compile
self.forward(*inputs)
File "/home/extend/lijiansong/work-space/anaconda2/envs/intel-caffe/lib/python3.6/site-packages/singa/layer.py", line 63, in wrapper
return func(self, *args, **kwargs)
File "training/train.py", line 191, in forward
y = self.linear(y)
File "/home/extend/lijiansong/work-space/anaconda2/envs/intel-caffe/lib/python3.6/site-packages/singa/layer.py", line 110, in __call__
return self.forward(*args, **kwargs)
File "/home/extend/lijiansong/work-space/anaconda2/envs/intel-caffe/lib/python3.6/site-packages/singa/layer.py", line 61, in wrapper
self.initialize(*args, **kwargs)
File "/home/extend/lijiansong/work-space/anaconda2/envs/intel-caffe/lib/python3.6/site-packages/singa/layer.py", line 45, in wrapper
'initialize function expects PlaceHolders or Tensors')
AssertionError: initialize function expects PlaceHolders or Tensors
Is something wrong with the layer initialization?
SINGA version: 3100 (the latest build from the source code of the master branch)
Python version: 3.5.2
ONNX version: 1.5.0