Distributed deep learning system
apache/singa: a distributed deep learning platform
License: Apache License 2.0
There are no issues requiring board attention
Apache SINGA was founded on 2019-10-16 (a year ago).
There are currently 22 committers and 16 PMC members in this project.
The Committer-to-PMC ratio is roughly 3:2.
Community changes, past quarter:
The community is working on release v3.1, which includes the following
features/changes:
Overall, the community has slowed down a bit after releasing version 3.0. We
are still actively developing v3.1, with 52 PRs opened and 53 PRs
closed. The community is growing steadily, with 3 new committers joining since
the last report.
In #674, we propose to move parameter creation and initialization into the forward propagation stage, i.e., __call__.
However, we may sometimes want to access the parameters of a model right after it is created, e.g.,
m = ModelFoo()
m.get_params() # returns the params of each layer via get_params
This fails because the params have not been created yet.
To resolve this issue, we can add a new method to the Module class:
def init(self, x):
    # x represents the input tensor(s); its values can be randomly filled,
    # but the shape and device must be set correctly.
    self.forward(x)  # the forward propagation will initialize all params
The following code will pass without errors.
m = ModelFoo()
m.init(x)
m.get_params() # returns the params of each layer via get_params
Comments are welcome.
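The proposed pattern can be sketched with plain Python and numpy. The Module class below is a hypothetical stand-in for SINGA's Module, used only to illustrate lazy parameter creation and the proposed init helper:

```python
import numpy as np

class Module:
    """Hypothetical stand-in for a SINGA Module with lazy param creation."""
    def __init__(self):
        self.params = None  # created only on the first forward call

    def forward(self, x):
        if self.params is None:
            # params are created once the input shape is known
            self.params = {"w": np.zeros((x.shape[1], 4))}
        return x @ self.params["w"]

    def init(self, x):
        # proposed helper: run one dummy forward pass to create all params;
        # x only needs the correct shape/device, its values do not matter
        self.forward(x)

    def get_params(self):
        if self.params is None:
            raise RuntimeError("params not created yet; call init() first")
        return self.params

m = Module()
m.init(np.zeros((2, 3)))       # dummy input with the correct shape
print(sorted(m.get_params()))  # params are now accessible
```

Calling get_params() before init() raises the error described above; after init() it succeeds.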
It would be better to have an error code list for debugging.
Then we can raise errors with well-defined codes.
Here are some example errors:
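As a sketch of what such an error code list could look like (the codes, names, and helper below are illustrative, not SINGA's actual codes):

```python
from enum import Enum

class ErrorCode(Enum):
    # illustrative codes; the real list would be agreed on by the community
    SHAPE_MISMATCH = 1001
    DEVICE_MISMATCH = 1002
    DTYPE_UNSUPPORTED = 1003

class SingaError(Exception):
    """Carries a machine-checkable code plus a human-readable detail."""
    def __init__(self, code, detail):
        self.code = code
        super().__init__(f"[E{code.value}] {code.name}: {detail}")

def check_same_shape(a_shape, b_shape):
    # hypothetical check helper showing how codes would be raised
    if a_shape != b_shape:
        raise SingaError(ErrorCode.SHAPE_MISMATCH,
                         f"expected {a_shape}, got {b_shape}")

try:
    check_same_shape((2, 3), (3, 2))
except SingaError as e:
    print(e)  # [E1001] SHAPE_MISMATCH: expected (2, 3), got (3, 2)
```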
This issue is open to discuss different options for adding GPU build and test to Github workflows.
To enable this feature, SINGA must provide a real or virtual machine with a GPU as the host machine for running the workflow, and then use the self-hosted runner feature of GitHub Actions. See also this MLOps video tutorial.
The team needs to make some decisions:
What do you think?
To use DNNL in SINGA, it would be convenient to have a DNNL conda package.
However, Intel does not provide a conda package for DNNL.
Therefore, we may need to build DNNL with conda-build ourselves and upload it to Anaconda Cloud.
Some updates:
@joddiy is fixing some problems in onnx
root@d05828f767ee:~/dcsysh/singa/test/python# python3 test_onnx_backend.py
ss............................ssssssssssssssssssssssssssssssss................ssss..ss..ss......FFFF..ssssssssss..ssssssssssssssss............ssss....................ssssssss........................ssss....ssssssssssssssssssssssssss........ssssssssssssssssssssssssssss..........ssssssssssssss......FsFsssssssssssssssss..................ssss....ssssss..ssss..........ss.s............ssss....ssssssssssssssssssss........ssssssssssss..............ssssssssssssssssssssssss......ss......ssss..ss........FFFF..ssssssssss............ssssssssssssssssssssssssss......ssss....ssssssssssssssssss..................................ss........ssssssssssssssss....ssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss................ssssssssssssssssssssssssssssssss................ssssssssssssssssss....................ssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss..........ssss......ssss........ssss..............ss..........................................ssssssssssss....................ssssssssssssssssssss....ssssssssssssssssssssssssssssss................ssssss................
======================================================================
FAIL: test_averagepool_2d_same_lower_cpu (__main__.OnnxBackendNodeModelTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 248, in device_test_func
return test_func(*args, device=device, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 313, in run
atol=model_test.atol)
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 178, in assert_similar_outputs
atol=atol)
File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 1533, in assert_allclose
verbose=verbose, header=header, equal_nan=equal_nan)
File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 846, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=0.001, atol=1e-07
Mismatched elements: 189 / 3072 (6.15%)
Max absolute difference: 1.3230393
Max relative difference: 3.
x: array([[[[ 1.764052e+00, 1.082105e+00, 6.894476e-01, ...,
1.501069e+00, 8.121531e-01, 2.665550e-01],
[ 4.381333e-01, -1.760931e-01, -2.374533e-01, ...,...
y: array([[[[ 4.410131e-01, 5.410524e-01, 3.447238e-01, ...,
7.505345e-01, 4.060766e-01, 1.332775e-01],
[ 2.190667e-01, -1.760931e-01, -2.374533e-01, ...,...
======================================================================
FAIL: test_averagepool_2d_same_lower_cuda (__main__.OnnxBackendNodeModelTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 248, in device_test_func
return test_func(*args, device=device, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 313, in run
atol=model_test.atol)
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 178, in assert_similar_outputs
atol=atol)
File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 1533, in assert_allclose
verbose=verbose, header=header, equal_nan=equal_nan)
File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 846, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=0.001, atol=1e-07
Mismatched elements: 189 / 3072 (6.15%)
Max absolute difference: 1.3230393
Max relative difference: 3.
x: array([[[[ 1.764052e+00, 1.082105e+00, 6.894476e-01, ...,
1.501069e+00, 8.121531e-01, 2.665550e-01],
[ 4.381333e-01, -1.760931e-01, -2.374533e-01, ...,...
y: array([[[[ 4.410131e-01, 5.410524e-01, 3.447238e-01, ...,
7.505345e-01, 4.060766e-01, 1.332775e-01],
[ 2.190667e-01, -1.760931e-01, -2.374533e-01, ...,...
======================================================================
FAIL: test_averagepool_2d_same_upper_cpu (__main__.OnnxBackendNodeModelTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 248, in device_test_func
return test_func(*args, device=device, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 313, in run
atol=model_test.atol)
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 178, in assert_similar_outputs
atol=atol)
File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 1533, in assert_allclose
verbose=verbose, header=header, equal_nan=equal_nan)
File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 846, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=0.001, atol=1e-07
Mismatched elements: 189 / 3072 (6.15%)
Max absolute difference: 0.9547153
Max relative difference: 3.
x: array([[[[-0.176093, -0.237453, 0.757017, ..., 0.112902, -0.50158 ,
-0.67406 ],
[-0.773234, -1.090172, -0.339745, ..., 0.040076, -0.369122,...
y: array([[[[-0.176093, -0.237453, 0.757017, ..., 0.112902, -0.50158 ,
-0.33703 ],
[-0.773234, -1.090172, -0.339745, ..., 0.040076, -0.369122,...
======================================================================
FAIL: test_averagepool_2d_same_upper_cuda (__main__.OnnxBackendNodeModelTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 248, in device_test_func
return test_func(*args, device=device, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 313, in run
atol=model_test.atol)
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 178, in assert_similar_outputs
atol=atol)
File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 1533, in assert_allclose
verbose=verbose, header=header, equal_nan=equal_nan)
File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 846, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=0.001, atol=1e-07
Mismatched elements: 189 / 3072 (6.15%)
Max absolute difference: 0.9547153
Max relative difference: 3.
x: array([[[[-0.176093, -0.237453, 0.757017, ..., 0.112902, -0.50158 ,
-0.67406 ],
[-0.773234, -1.090172, -0.339745, ..., 0.040076, -0.369122,...
y: array([[[[-0.176093, -0.237453, 0.757017, ..., 0.112902, -0.50158 ,
-0.33703 ],
[-0.773234, -1.090172, -0.339745, ..., 0.040076, -0.369122,...
======================================================================
FAIL: test_equal_bcast_cpu (__main__.OnnxBackendNodeModelTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 248, in device_test_func
return test_func(*args, device=device, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 313, in run
atol=model_test.atol)
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 178, in assert_similar_outputs
atol=atol)
File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 1533, in assert_allclose
verbose=verbose, header=header, equal_nan=equal_nan)
File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 846, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=0.001, atol=1e-07
Mismatched elements: 1 / 60 (1.67%)
x: array([[[False, False, False, False, False],
[False, False, False, False, False],
[False, False, False, False, False],...
y: array([[[False, False, False, False, False],
[False, False, False, False, False],
[False, False, False, False, False],...
======================================================================
FAIL: test_equal_cpu (__main__.OnnxBackendNodeModelTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 248, in device_test_func
return test_func(*args, device=device, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 313, in run
atol=model_test.atol)
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 178, in assert_similar_outputs
atol=atol)
File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 1533, in assert_allclose
verbose=verbose, header=header, equal_nan=equal_nan)
File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 846, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=0.001, atol=1e-07
Mismatched elements: 1 / 60 (1.67%)
x: array([[[False, False, False, False, False],
[False, False, False, False, False],
[False, False, False, False, True],...
y: array([[[False, False, False, False, False],
[False, False, False, False, False],
[False, False, False, False, True],...
======================================================================
FAIL: test_maxpool_2d_same_lower_cpu (__main__.OnnxBackendNodeModelTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 248, in device_test_func
return test_func(*args, device=device, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 313, in run
atol=model_test.atol)
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 178, in assert_similar_outputs
atol=atol)
File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 1533, in assert_allclose
verbose=verbose, header=header, equal_nan=equal_nan)
File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 846, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=0.001, atol=1e-07
Mismatched elements: 35 / 3072 (1.14%)
Max absolute difference: 1.6961312
Max relative difference: 0.
x: array([[[[ 1.764052e+00, 1.764052e+00, 9.787380e-01, ...,
1.532779e+00, 1.469359e+00, 3.781625e-01],
[ 1.764052e+00, 1.764052e+00, 9.787380e-01, ...,...
y: array([[[[ 1.764052e+00, 1.764052e+00, 9.787380e-01, ...,
1.532779e+00, 1.469359e+00, 3.781625e-01],
[ 1.764052e+00, 1.764052e+00, 9.787380e-01, ...,...
======================================================================
FAIL: test_maxpool_2d_same_lower_cuda (__main__.OnnxBackendNodeModelTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 248, in device_test_func
return test_func(*args, device=device, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 313, in run
atol=model_test.atol)
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 178, in assert_similar_outputs
atol=atol)
File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 1533, in assert_allclose
verbose=verbose, header=header, equal_nan=equal_nan)
File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 846, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=0.001, atol=1e-07
Mismatched elements: 35 / 3072 (1.14%)
Max absolute difference: 1.6961312
Max relative difference: 0.
x: array([[[[ 1.764052e+00, 1.764052e+00, 9.787380e-01, ...,
1.532779e+00, 1.469359e+00, 3.781625e-01],
[ 1.764052e+00, 1.764052e+00, 9.787380e-01, ...,...
y: array([[[[ 1.764052e+00, 1.764052e+00, 9.787380e-01, ...,
1.532779e+00, 1.469359e+00, 3.781625e-01],
[ 1.764052e+00, 1.764052e+00, 9.787380e-01, ...,...
======================================================================
FAIL: test_maxpool_2d_same_upper_cpu (__main__.OnnxBackendNodeModelTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 248, in device_test_func
return test_func(*args, device=device, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 313, in run
atol=model_test.atol)
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 178, in assert_similar_outputs
atol=atol)
File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 1533, in assert_allclose
verbose=verbose, header=header, equal_nan=equal_nan)
File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 846, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=0.001, atol=1e-07
Mismatched elements: 37 / 3072 (1.2%)
Max absolute difference: 1.2028884
Max relative difference: 0.
x: array([[[[ 1.764052, 0.978738, 2.240893, ..., 1.469359, 0.378163,
0.378163],
[ 0.177426, -0.347912, 0.462782, ..., 0.976639, 0.706573,...
y: array([[[[ 1.764052, 0.978738, 2.240893, ..., 1.469359, 0.378163,
0.378163],
[ 0.177426, -0.347912, 0.462782, ..., 0.976639, 0.706573,...
======================================================================
FAIL: test_maxpool_2d_same_upper_cuda (__main__.OnnxBackendNodeModelTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 248, in device_test_func
return test_func(*args, device=device, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 313, in run
atol=model_test.atol)
File "/usr/local/lib/python3.6/dist-packages/onnx/backend/test/runner/__init__.py", line 178, in assert_similar_outputs
atol=atol)
File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 1533, in assert_allclose
verbose=verbose, header=header, equal_nan=equal_nan)
File "/usr/local/lib/python3.6/dist-packages/numpy/testing/_private/utils.py", line 846, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=0.001, atol=1e-07
Mismatched elements: 37 / 3072 (1.2%)
Max absolute difference: 1.2028884
Max relative difference: 0.
x: array([[[[ 1.764052, 0.978738, 2.240893, ..., 1.469359, 0.378163,
0.378163],
[ 0.177426, -0.347912, 0.462782, ..., 0.976639, 0.706573,...
y: array([[[[ 1.764052, 0.978738, 2.240893, ..., 1.469359, 0.378163,
0.378163],
[ 0.177426, -0.347912, 0.462782, ..., 0.976639, 0.706573,...
----------------------------------------------------------------------
Ran 1114 tests in 2.126s
Hi,
Creating a Tensor in singa 3.0.0 (cpu, py36) results in the error described below:
NotImplementedError: Wrong number or type of arguments for overloaded function 'new_Tensor'.
Possible C/C++ prototypes are:
singa::Tensor::Tensor()
singa::Tensor::Tensor(std::vector< size_t,std::allocator< size_t > > const &,singa::DataType)
singa::Tensor::Tensor(std::vector< size_t,std::allocator< size_t > > const &)
singa::Tensor::Tensor(std::vector< size_t,std::allocator< size_t > > const &,std::shared_ptr< singa::Device >,singa::DataType)
singa::Tensor::Tensor(std::vector< size_t,std::allocator< size_t > > const &,std::shared_ptr< singa::Device >)
singa::Tensor::Tensor(singa::Tensor const &)
The tensor was created from numpy with the code below:
import numpy as np
from singa import tensor
tensor.from_numpy( np.asarray([[1, 0, 0], [0, 1, 0]], dtype=np.float32) )
Following are the operating system specifications:
OS: MacOS version 10.15.3
python version: 3.6.10
singa: 3.0.0 cpu_py36
The same code seems to work on singa version 3.0.0.rc1 cpu_py36
Hi all, while refactoring SONNX, I found the following issues:
ONNX prefers to use tensors as input instead of attributes, which may incur some issues when we create SINGA operators (or layers). There are two cases:
for @dcslin
In some models, the developer prefers gemm instead of linear, so we need to add gemm to Layer.
I've checked the metaclass carefully, but it seems I cannot use a metaclass to modify the forward function in this case. The case is: I have a graph written in ONNX, and I need to write a forward function using SINGA's operators. I can call SINGA's operators by walking the graph, but I cannot generate a forward function automatically from the graph.
This is more like the exec function.
for example, I have a graph like this:
graph = {
    "op1": {"inputs": ["a1"], "outputs": ["a2"]},
    "op2": {"inputs": ["a2"], "outputs": ["a3"]},
}
# what I can do
def forward(x):
    tensors = {"a1": x}  # name -> tensor
    for op_name, op_info in graph.items():
        op = ops[op_name]  # ops: lookup table from op name to callable
        inputs = [tensors[inp] for inp in op_info["inputs"]]
        outputs = op(*inputs)
        for outp, val in zip(op_info["outputs"], outputs):
            tensors[outp] = val
# what I cannot do by metaclass but can with exec
program = parse_graph_to_str(graph)
# 'a2=op1(a1)\na3=op2(a2)'
exec(program)
So, the above forward is my current implementation.
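For concreteness, here is a runnable toy version of both approaches, using plain Python functions as stand-ins for SINGA operators (the ops table, op1, op2, and this parse_graph_to_str are all illustrative):

```python
graph = {
    "op1": {"inputs": ["a1"], "outputs": ["a2"]},
    "op2": {"inputs": ["a2"], "outputs": ["a3"]},
}

# toy operators standing in for SINGA ops; each returns a tuple of outputs
ops = {
    "op1": lambda x: (x + 1,),
    "op2": lambda x: (x * 2,),
}

# interpreter-style forward: walk the graph, keeping a name->value table
def forward(x):
    tensors = {"a1": x}
    for op_name, op_info in graph.items():
        inputs = [tensors[name] for name in op_info["inputs"]]
        outputs = ops[op_name](*inputs)
        for name, val in zip(op_info["outputs"], outputs):
            tensors[name] = val
    return tensors["a3"]

# exec-style forward: turn the graph into source code, then run it
def parse_graph_to_str(graph):
    lines = []
    for op_name, op_info in graph.items():
        outs = ", ".join(op_info["outputs"])
        ins = ", ".join(op_info["inputs"])
        lines.append(f"{outs}, = ops['{op_name}']({ins})")
    return "\n".join(lines)

env = {"ops": ops, "a1": 3}
exec(parse_graph_to_str(graph), env)
print(forward(3), env["a3"])  # both compute (3 + 1) * 2 = 8
```

Both versions produce the same result; the exec version generates real Python statements, which is what a graph-to-forward translator would emit.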
The checks and assertions in some functions currently have very brief descriptions, which makes it difficult to locate and identify the error.
The displayed information should include at least:
Release version, 2.1 vs 3.0 depending on the API change
Please tick the feature if it is done.
Here is the checklist and steps
Select a release manager. The release manager (RM) is the coordinator for the release process. It is the RM's signature (.asc) that is uploaded together with the release. The RM generates a KEY (RSA 4096-bit) and uploads it to a public key server. The RM needs to get this key endorsed (signed) by other Apache users, to be connected to the web of trust, and should first ask the mentor to help sign the key. How to generate the key?
Check license. FAQ; SINGA Issue
Bump the version. Check code and documentation
Prepare the RELEASE_NOTES file. Include the following items: Introduction, Features, Bugs (links to JIRA or GitHub PRs), Changes, Dependency list, Incompatibility issues. Follow this example.
Prepare the DISCLAIMER file. Modify it from the template.
Package the release candidate. The release should be packaged as apache-singa-VERSION.tar.gz and should not include any binary files, including git files. Upload the release candidate to the staging area. The tar file, signature, KEY and SHA256 checksum file should be included. MD5 is no longer used. The policy is here.
Call for vote by sending an email
To: [email protected]
Subject: [VOTE] Release apache-singa-X.Y.Z (release candidate N)
Hi all,
I have created a build for Apache SINGA X.Y.Z, release candidate N.
The artifacts to be voted on are located here: xxxx
The hashes of the artifacts are as follows: xxx
Release artifacts are signed with the following key: xxx
Please vote on releasing this package. The vote is open for at least 72 hours and passes if a majority of at least three +1 votes are cast.
[ ] +1 Release this package as Apache SINGA X.Y.Z
[ ] 0 I don't feel strongly about it, but I'm okay with the release
[ ] -1 Do not release this package because...
Here is my vote:
+1
Wait at least 48 hours for test responses. Any PMC member, committer, or contributor can test the release and give feedback. Everyone should run these checks before voting +1. If the vote passes, send the result email; otherwise, repeat from the beginning.
Subject: [RESULT] [VOTE] Release apache-singa-X.Y.Z (release candidate N)
To: [email protected]
Thanks to everyone who has voted and given their comments. The tally is as follows.
N binding +1s:
<names>
N non-binding +1s:
<names>
No 0s or -1s.
I am delighted to announce that the proposal to release Apache SINGA X.Y.Z has passed.
Upload the package for distribution to https://dist.apache.org/repos/dist/release/VERSION/.
Update the Download page of SINGA website. The tar.gz file MUST be downloaded from mirror, using closer.cgi script; other artifacts MUST be downloaded from main Apache site. More details here. Some feedback we got during the previous releases: "Download pages must only link to formal releases, so must not include links to GitHub.", "Links to KEYS, sigs and hashes must not use dist.apache.org; instead use https://www.apache.org/dist/singa/...;", "Also you only need one KEYS link, and there should be a description of how to use KEYS + sig or hash to verify the downloads."
Remove the RC tag and compile the conda packages.
Publish the release information.
To: [email protected], [email protected]
Subject: [ANNOUNCE] Apache SINGA X.Y.Z released
We are pleased to announce that SINGA X.Y.Z is released.
SINGA is a general distributed deep learning platform for training big deep learning models over large datasets.
The release is available at: http://singa.apache.org/downloads.html
The main features of this release include XXX
We look forward to hearing your feedback, suggestions, and contributions to the project.
On behalf of the SINGA team, {SINGA Team Member Name}
Add each initialization method as a function or class in https://github.com/apache/singa/blob/master/python/singa/initializer.py. e.g.,
class InitializationBase(object):
    def __init__(self):
        pass

    def call(self, x):
        pass

    def __call__(self, x):
        self.call(x)


class Uniform(InitializationBase):
    def __init__(self, low=-1, high=1):
        self.low = low
        self.high = high

    def call(self, x):
        x.uniform(self.low, self.high)


def uniform(x, low=-1, high=1):
    x.uniform(low, high)
@dcslin Shicong, did you encounter such an error using conda build before? I would like your help, because it seems you have been working on CI/CD recently.
I am trying to build the conda package from the dev branch using singa/tool/conda/singa, and it gives errors concerning the onnx version:
conda_build.exceptions.DependencyNeedsBuildingError: Unsatisfiable dependencies for platform linux-64: {"onnx[version='>=1.3.0']"}
dcsysh@panda:~/singa/tool/conda/singa$ export CUDA=10.0
dcsysh@panda:~/singa/tool/conda/singa$ conda-build . --python 3.6
No numpy version specified in conda_build_config.yaml. Falling back to default numpy value of 1.11
WARNING:conda_build.metadata:No numpy version specified in conda_build_config.yaml. Falling back to default numpy value of 1.11
Copying /home/dcsysh/singa to /home/dcsysh/anaconda3/conda-bld/singa_1583505296868/work/
Adding in variants from internal_defaults
INFO:conda_build.variants:Adding in variants from internal_defaults
Adding in variants from /home/dcsysh/singa/tool/conda/singa/conda_build_config.yaml
INFO:conda_build.variants:Adding in variants from /home/dcsysh/singa/tool/conda/singa/conda_build_config.yaml
Adding in variants from config.variant
INFO:conda_build.variants:Adding in variants from config.variant
/home/dcsysh/anaconda3/lib/python3.6/site-packages/conda_build/environ.py:427: UserWarning: The environment variable 'CUDA' is being passed through with value '10.0'. If you are splitting build and test phases with --no-test, please ensure that this value is also set similarly at test time.
UserWarning
Attempting to finalize metadata for singa
INFO:conda_build.metadata:Attempting to finalize metadata for singa
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done
Adding .* to spec 'libprotobuf 3.6.1' to ensure satisfiability. Please consider putting {{ var_name }}.* or some relational operator (>/</>=/<=) on this spec in meta.yaml, or if req is also a build req, using {{ pin_compatible() }} jinja2 function instead. See https://conda.io/docs/user-guide/tasks/build-packages/variants.html#pinning-at-the-variant-level
WARNING:conda_build.utils:Adding .* to spec 'libprotobuf 3.6.1' to ensure satisfiability. Please consider putting {{ var_name }}.* or some relational operator (>/</>=/<=) on this spec in meta.yaml, or if req is also a build req, using {{ pin_compatible() }} jinja2 function instead. See https://conda.io/docs/user-guide/tasks/build-packages/variants.html#pinning-at-the-variant-level
WARNING conda_build.utils:ensure_valid_spec(1749): Adding .* to spec 'libprotobuf 3.6.1' to ensure satisfiability. Please consider putting {{ var_name }}.* or some relational operator (>/</>=/<=) on this spec in meta.yaml, or if req is also a build req, using {{ pin_compatible() }} jinja2 function instead. See https://conda.io/docs/user-guide/tasks/build-packages/variants.html#pinning-at-the-variant-level
Adding .* to spec 'libopenblas 0.3.3' to ensure satisfiability. Please consider putting {{ var_name }}.* or some relational operator (>/</>=/<=) on this spec in meta.yaml, or if req is also a build req, using {{ pin_compatible() }} jinja2 function instead. See https://conda.io/docs/user-guide/tasks/build-packages/variants.html#pinning-at-the-variant-level
WARNING:conda_build.utils:Adding .* to spec 'libopenblas 0.3.3' to ensure satisfiability. Please consider putting {{ var_name }}.* or some relational operator (>/</>=/<=) on this spec in meta.yaml, or if req is also a build req, using {{ pin_compatible() }} jinja2 function instead. See https://conda.io/docs/user-guide/tasks/build-packages/variants.html#pinning-at-the-variant-level
WARNING conda_build.utils:ensure_valid_spec(1749): Adding .* to spec 'libopenblas 0.3.3' to ensure satisfiability. Please consider putting {{ var_name }}.* or some relational operator (>/</>=/<=) on this spec in meta.yaml, or if req is also a build req, using {{ pin_compatible() }} jinja2 function instead. See https://conda.io/docs/user-guide/tasks/build-packages/variants.html#pinning-at-the-variant-level
BUILD START: ['singa-2.1.0.dev-cudnn7.3.1_cuda10.0_py36.tar.bz2']
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done
## Package Plan ##
environment location: /home/dcsysh/anaconda3/conda-bld/singa_1583505296868/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_place
The following NEW packages will be INSTALLED:
_libgcc_mutex: 0.1-main
blas: 1.0-openblas
ca-certificates: 2020.1.1-0
certifi: 2019.11.28-py36_0
cudatoolkit: 10.0.130-0
cudnn: 7.3.1-cuda10.0_0
gflags: 2.2.2-he6710b0_0
glog: 0.3.5-hf484d3e_1
intel-openmp: 2018.0.3-0
ld_impl_linux-64: 2.33.1-h53a641e_7
libedit: 3.1.20181209-hc058e9b_0
libffi: 3.2.1-hd88cf55_4
libgcc-ng: 9.1.0-hdf63c60_0
libgfortran-ng: 7.3.0-hdf63c60_0
libmklml: 2018.0.3-0
libopenblas: 0.3.3-h5a2b251_3
libprotobuf: 3.6.1-hd408876_0
libstdcxx-ng: 9.1.0-hdf63c60_0
mkl-dnn: 0.14-h6bb024c_0
ncurses: 6.2-he6710b0_0
nomkl: 3.0-0
numpy: 1.16.0-py36h99e49ec_1
numpy-base: 1.16.0-py36h2f8d375_1
openblas: 0.3.3-3
openblas-devel: 0.3.3-3
openssl: 1.1.1d-h7b6447c_4
pcre: 8.43-he6710b0_0
pip: 20.0.2-py36_1
protobuf: 3.6.1-py36he6710b0_0
python: 3.6.10-h0371630_0
readline: 7.0-h7b6447c_5
setuptools: 45.2.0-py36_0
six: 1.14.0-py36_0
sqlite: 3.31.1-h7b6447c_0
swig: 3.0.12-h38cdd7d_3
tk: 8.6.8-hbc83047_0
wheel: 0.34.2-py36_0
xz: 5.2.4-h14c3975_4
zlib: 1.2.11-h7b6447c_3
Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... failed
Leaving build/test directories:
Work:
/home/dcsysh/anaconda3/conda-bld/work
Test:
/home/dcsysh/anaconda3/conda-bld/test_tmp
Leaving build/test environments:
Test:
source activate /home/dcsysh/anaconda3/conda-bld/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_
Build:
source activate /home/dcsysh/anaconda3/conda-bld/_build_env
Traceback (most recent call last):
File "/home/dcsysh/anaconda3/lib/python3.6/site-packages/conda_build/environ.py", line 757, in get_install_actions
actions = install_actions(prefix, index, specs, force=True)
File "/home/dcsysh/anaconda3/lib/python3.6/site-packages/conda/common/io.py", line 88, in decorated
return f(*args, **kwds)
File "/home/dcsysh/anaconda3/lib/python3.6/site-packages/conda/plan.py", line 474, in install_actions
txn = solver.solve_for_transaction(prune=prune, ignore_pinned=not pinned)
File "/home/dcsysh/anaconda3/lib/python3.6/site-packages/conda/core/solve.py", line 117, in solve_for_transaction
should_retry_solve)
File "/home/dcsysh/anaconda3/lib/python3.6/site-packages/conda/core/solve.py", line 158, in solve_for_diff
force_remove, should_retry_solve)
File "/home/dcsysh/anaconda3/lib/python3.6/site-packages/conda/core/solve.py", line 275, in solve_final_state
ssc = self._add_specs(ssc)
File "/home/dcsysh/anaconda3/lib/python3.6/site-packages/conda/core/solve.py", line 555, in _add_specs
explicit_pool = ssc.r._get_package_pool(self.specs_to_add)
File "/home/dcsysh/anaconda3/lib/python3.6/site-packages/conda/resolve.py", line 553, in _get_package_pool
pool = self.get_reduced_index(specs)
File "/home/dcsysh/anaconda3/lib/python3.6/site-packages/conda/common/io.py", line 88, in decorated
return f(*args, **kwds)
File "/home/dcsysh/anaconda3/lib/python3.6/site-packages/conda/resolve.py", line 574, in get_reduced_index
explicit_specs, features = self.verify_specs(explicit_specs)
File "/home/dcsysh/anaconda3/lib/python3.6/site-packages/conda/resolve.py", line 288, in verify_specs
raise ResolvePackageNotFound(bad_deps)
conda.exceptions.ResolvePackageNotFound:
- onnx[version='>=1.3.0']
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/dcsysh/anaconda3/bin/conda-build", line 11, in <module>
sys.exit(main())
File "/home/dcsysh/anaconda3/lib/python3.6/site-packages/conda_build/cli/main_build.py", line 469, in main
execute(sys.argv[1:])
File "/home/dcsysh/anaconda3/lib/python3.6/site-packages/conda_build/cli/main_build.py", line 460, in execute
verify=args.verify, variants=args.variants)
File "/home/dcsysh/anaconda3/lib/python3.6/site-packages/conda_build/api.py", line 209, in build
notest=notest, need_source_download=need_source_download, variants=variants)
File "/home/dcsysh/anaconda3/lib/python3.6/site-packages/conda_build/build.py", line 2344, in build_tree
notest=notest,
File "/home/dcsysh/anaconda3/lib/python3.6/site-packages/conda_build/build.py", line 1408, in build
create_build_envs(top_level_pkg, notest)
File "/home/dcsysh/anaconda3/lib/python3.6/site-packages/conda_build/build.py", line 1292, in create_build_envs
raise e
File "/home/dcsysh/anaconda3/lib/python3.6/site-packages/conda_build/build.py", line 1282, in create_build_envs
channel_urls=tuple(m.config.channel_urls))
File "/home/dcsysh/anaconda3/lib/python3.6/site-packages/conda_build/environ.py", line 759, in get_install_actions
raise DependencyNeedsBuildingError(exc, subdir=subdir)
conda_build.exceptions.DependencyNeedsBuildingError: Unsatisfiable dependencies for platform linux-64: {"onnx[version='>=1.3.0']"}
Hi, @dcslin, as you know, I need to develop and test the new API for onnx, so I have to make sure the new implementation of autograd and layers is OK. However, when I test PR #697, I hit three types of issues in test_operation.py:
For example, in the sum test case, the GPU tensor values are always zero. But when I remove the conv2d test case, the sum case passes. It seems the conv2d layer causes the zero-GPU-tensor issue.
The following cases have the same problem:
There is an issue when I run Conv2d with odd_padding. Sometimes I need to pad zeros on only one side, so I wrote this function:
def handle_odd_pad_fwd(x, odd_padding):
    """Handle the odd padding mode in the forward pass.
    Args:
        x: the input raw tensor
        odd_padding: the per-side pad sizes, one per entry in `flags`
    Returns:
        the padded raw tensor
    """
    x_tensor = tensor.from_raw_tensor(x)
    # (axis, left padding if True else right padding)
    flags = [(2, True), (2, False), (3, True), (3, False)]
    for (axis, left), pad in zip(flags, odd_padding):
        if pad == 0:
            continue
        zeros_shape = list(x_tensor.data.shape())
        zeros_shape[axis] = pad
        zero_padding = np.zeros(zeros_shape).astype(np.float32)
        zero_padding = tensor.Tensor(device=x.device(), data=zero_padding)
        if left:
            x_tensor = tensor.concatenate((zero_padding, x_tensor), axis)
        else:
            x_tensor = tensor.concatenate((x_tensor, zero_padding), axis)
    return x_tensor.data
But it seems fine when I call this function only once or twice; when I call it more times, it reports an error:
F0526 12:53:40.017063 15641 tensor_math_cuda.h:791] Check failed: status == CURAND_STATUS_SUCCESS (201 vs. 0) CURAND_STATUS_LAUNCH_FAILURE
I guess the reason may be that the GPU memory is not released in time?
The following cases have the same problem:
The third error msg is:
F0526 18:52:07.318809 21112 tensor_math_cuda.h:193] Error on line 193: CUDNN_STATUS_EXECUTION_FAILED
And the following cases have the same problem:
The Layer class in autograd maintains the model parameters.
It passes the parameters into the operation and thus operations are stateless.
Typically the parameter size depends on the input and layer configuration.
Currently, we require the users to provide the input size in the layer constructor.
Then we can create the parameter tensor and initialize it in the constructor, e.g., in the Linear layer. One potential problem is that the initialization operation may not be buffered. @XJDKC Is this an issue?
For some layers like RNN implemented using cudnn, although we can get the input size, the parameter size is unknown until the cudnn handle is created, which does not happen until data is forwarded through the layer.
Another way is to delay the parameter tensor creation until the layer is called for forward propagation. At that time, we have the input tensor (and its device), so the layer constructor does not need the user to provide the input size. The drawback is that after the layer is created, the get_params() function would still fail to return the parameter tensors, as they are not created yet. @dcslin To switch to this approach, we need to change the constructors of existing layer classes and examples. We also need to pass an initializer function/class to the constructor for initializing the parameter tensors after they are created.
Please add your comments.
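The delayed-creation option above can be sketched in plain numpy (this is an illustrative sketch, not the SINGA API; the class and method names are assumptions). The constructor takes no input size; the parameter is created and initialized on the first forward call, with an initializer supplied for that moment:

```python
import numpy as np

class LazyLinear:
    """Hypothetical layer with delayed parameter creation."""

    def __init__(self, out_features, w_init=None):
        self.out_features = out_features
        # initializer is passed in and applied once the shape is known
        self.w_init = w_init or (lambda shape: np.zeros(shape, dtype=np.float32))
        self.W = None  # not created until the first forward call

    def forward(self, x):
        if self.W is None:
            # the input size is read off the first input tensor, not the constructor
            self.W = self.w_init((x.shape[1], self.out_features))
        return x @ self.W

    def get_params(self):
        if self.W is None:
            raise RuntimeError("params not created yet; run a forward pass first")
        return {"W": self.W}

layer = LazyLinear(4)
y = layer.forward(np.ones((2, 3), dtype=np.float32))  # W becomes (3, 4) here
```

This is exactly the trade-off described above: get_params() raises until a forward (or init) pass has run.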
The autograd.softmax may have a problem, which I found when taking part in the review of PR #572.
In examples/autograd/mlp.py (multilayer perceptron), the result is:
ubuntu@ip-172-31-26-47:~/singa/examples/autograd$ python3 mlp.py
train_data_shape: (400, 2)
train_label_shape: (400, 2)
training loss = 0.6908062
training loss = 0.5960194
training loss = 0.57797414
training loss = 0.55334115
training loss = 0.48568404
training loss = 0.38458923
training loss = 0.30776194
training loss = 0.24188559
training loss = 0.18657134
training loss = 0.15864176
training loss = 0.13929243
However, if I use softmax + cross_entropy instead of softmax_cross_entropy, I get this error:
ubuntu@ip-172-31-26-47:~/singa/examples/autograd$ python3 mlp.py
train_data_shape: (400, 2)
train_label_shape: (400, 2)
training loss = 6.682101
WARNING: Logging before InitGoogleLogging() is written to STDERR
F0113 09:20:05.180658 12032 tensor_math_cpp.h:357] Check failed: a > 0.f (-nan vs. 0)
*** Check failure stack trace: ***
Aborted (core dumped)
In the review of PR #572, I did not suspect SoftMax because I compared the 1-D result with PyTorch. However, now when I run it with 2-D input, the backpropagation does not work with parameter axis = 1.
My test codes and results for softmax are in:
https://gist.github.com/chrishkchris/1bce55260b5e771ce974940a855292e2
I still need to figure out how to debug this further.
I used the clang formatter with VS Code after altering the tensor.h file, and it produces a different format from the dev branch.
The tensor.cc file should have been re-formatted earlier in PR #581. So, did I use an incorrect setting for the clang formatter?
Hi,
The issue might be known; however, creating a neural network layer stack with mismatched layers can cause the current Python session to end abruptly, without generating any stack trace, while calculating the model loss (e.g., autograd.mse_loss(y, t)).
for example of a simple feed forward neural network:
class MLP():
    def __init__(self):
        self.linear1 = autograd.Linear(3, 4)
        self.linear2 = autograd.Linear(4, 3)

    def forward(self, x):
        y = self.linear1(x)
        return self.linear2(y)
If the output does not have a dimension of 3, the current session terminates without raising a Python error; only the following warning and stack trace are printed:
WARNING: Logging before InitGoogleLogging() is written to STDERR
F0520 17:37:19.265754 288538048 tensor.cc:431] Check failed: shape_.at(m - i) == 1 (3 vs. 1) i= 0
*** Check failure stack trace: ***
This forces the user to rerun the entire program/notebook. The same issue is not seen in autograd.backward, which raises an assertion error instead.
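One possible mitigation, sketched below in plain numpy (an assumption for illustration, not the SINGA implementation), is to validate shapes on the Python side before calling into the C++ kernel, so a mismatch raises a catchable Python exception instead of aborting the process via a glog CHECK:

```python
import numpy as np

def checked_mse_loss(y, t):
    """Hypothetical Python-side guard around an mse loss."""
    if y.shape != t.shape:
        # raise a Python error the user can catch, instead of crashing in C++
        raise ValueError(
            f"mse_loss shape mismatch: prediction {y.shape} vs target {t.shape}")
    return float(np.mean((y - t) ** 2))

y = np.zeros((4, 3), dtype=np.float32)
t = np.zeros((4, 2), dtype=np.float32)
try:
    checked_mse_loss(y, t)
except ValueError as e:
    print("caught:", e)  # session survives, no core dump
```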
Thanks and Regards,
Shashank
The docstrings will be used to build the API documentation pages at https://apache-singa.readthedocs.io/en/latest/
editorconfig is a configuration file format adopted by many editors to keep coding style consistent, which avoids many spurious diffs when we do git merge.
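For reference, a minimal sketch of what such a file could look like at the repository root (the values below are illustrative assumptions, not settings the project has agreed on):

```ini
# Hypothetical .editorconfig sketch
root = true

[*]
charset = utf-8
end_of_line = lf
insert_final_newline = true
trim_trailing_whitespace = true

[*.py]
indent_style = space
indent_size = 4

[*.{h,cc}]
indent_style = space
indent_size = 2
```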
Hi,
To convert an ONNX model to SINGA, sonnx.py is used; different modules are converted through it. The current implementation does not support MaxPool2d with ceil_mode set to True, nor the count_include_pad attribute.
For MaxPool2d implemented in PyTorch, ceil_mode is a boolean that defaults to False. Sometimes, when converting the PyTorch model to ONNX, the attribute is transferred to the ONNX representation as:
onnx_node.attrs["ceil_mode"] = 0
which is a valid case for conversion to the SINGA format. The current code in sonnx, however, only checks for the presence of the ceil_mode attribute in onnx_node.attrs before raising an exception, as illustrated below:
def _create_max_avg_pool(cls, onnx_node, inputs, opset_version):
    """Get the max or avg pool operator from an onnx node.
    Args:
        onnx_node: a given onnx node
        inputs: the input tensor
        opset_version: the opset version
    Returns:
        handle, the handle of the singa operator
        forward, the autograd of the singa operator
    """
    kernel = tuple(onnx_node.attrs["kernel_shape"])
    padding = tuple(
        onnx_node.attrs["pads"]) if "pads" in onnx_node.attrs else (0, 0)
    stride = tuple(onnx_node.getattr('strides', (1, 1)))
    # the default odd_padding is 0; if there is a SAME pad mode, we modify it
    # (for odd_padding, please refer to autograd.py)
    odd_padding = (0, 0, 0, 0)
    if "auto_pad" in onnx_node.attrs:
        auto_pad = utils.force_unicode(onnx_node.attrs['auto_pad'])
        if auto_pad in ('SAME_UPPER', 'SAME_LOWER'):
            padding, odd_padding = utils.get_padding_shape(
                auto_pad, inputs[0].shape[2:], kernel, stride)
    # does not support count_include_pad and ceil_mode
    if "count_include_pad" in onnx_node.attrs or "ceil_mode" in onnx_node.attrs:
        raise ValueError(
            "Not implemented yet for count_include_pad or ceil_mode")
    # only support 2d
    if len(kernel) != 2:
        raise ValueError("Not implemented yet")
    is_max = onnx_node.op_type == 'MaxPool'
    x = inputs[0]
    if x.device.id() == -1:
        handle = singa.PoolingHandle(x.data, kernel, stride, padding,
                                     is_max)
    else:
        handle = singa.CudnnPoolingHandle(x.data, kernel, stride, padding,
                                          is_max)
    _, forward = cls._common_onnx_node_to_singa_op(onnx_node, inputs,
                                                   opset_version)
    return _, forward(handle, odd_padding)
The code does not consider whether the value of ceil_mode is actually set to False/0.
The following change handles this edge case:
def _create_max_avg_pool(cls, onnx_node, inputs, opset_version):
    """Get the max or avg pool operator from an onnx node.
    Args:
        onnx_node: a given onnx node
        inputs: the input tensor
        opset_version: the opset version
    Returns:
        handle, the handle of the singa operator
        forward, the autograd of the singa operator
    """
    kernel = tuple(onnx_node.attrs["kernel_shape"])
    padding = tuple(
        onnx_node.attrs["pads"]) if "pads" in onnx_node.attrs else (0, 0)
    stride = tuple(onnx_node.getattr('strides', (1, 1)))
    # the default odd_padding is 0; if there is a SAME pad mode, we modify it
    # (for odd_padding, please refer to autograd.py)
    odd_padding = (0, 0, 0, 0)
    if "auto_pad" in onnx_node.attrs:
        auto_pad = utils.force_unicode(onnx_node.attrs['auto_pad'])
        if auto_pad in ('SAME_UPPER', 'SAME_LOWER'):
            padding, odd_padding = utils.get_padding_shape(
                auto_pad, inputs[0].shape[2:], kernel, stride)
    # raise only when ceil_mode is present AND truthy
    if "ceil_mode" in onnx_node.attrs and onnx_node.attrs["ceil_mode"]:
        raise ValueError(
            "Not implemented yet for ceil_mode")
    if "count_include_pad" in onnx_node.attrs:
        raise ValueError(
            "Not implemented yet for count_include_pad")
    # only support 2d
    if len(kernel) != 2:
        raise ValueError("Not implemented yet")
    is_max = onnx_node.op_type == 'MaxPool'
    x = inputs[0]
    if x.device.id() == -1:
        handle = singa.PoolingHandle(x.data, kernel, stride, padding,
                                     is_max)
    else:
        handle = singa.CudnnPoolingHandle(x.data, kernel, stride, padding,
                                          is_max)
    _, forward = cls._common_onnx_node_to_singa_op(onnx_node, inputs,
                                                   opset_version)
    return _, forward(handle, odd_padding)
The issue was encountered while converting shufflenetv2 from onnx to singa.
Please let us know if this change is possible.
Thanks and Regards,
Shashank Nigam
Hi, @dcslin , the asType operator does not work when it comes after the reshape operator.
Please check by using the following code:
dev = device.create_cuda_gpu()
X = np.array([[1, 0], [1, 1]]).astype(np.int32)
x = tensor.from_numpy(X)
x.to_device(dev)
x = autograd.cast(x, tensor.int32)
x = autograd.reshape(x, [1, 2, 2])
x = autograd.cast(x, tensor.float32)
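The same pipeline in plain numpy can serve as a reference oracle for what the SINGA graph above should produce (numpy here is an assumption for illustration, not the SINGA autograd path): the final cast should yield a float32 tensor of shape (1, 2, 2) rather than fail.

```python
import numpy as np

# cast -> reshape -> cast on the same 2x2 matrix, as a reference result
X = np.array([[1, 0], [1, 1]]).astype(np.int32)
y = X.astype(np.int32).reshape(1, 2, 2).astype(np.float32)
assert y.shape == (1, 2, 2)
assert y.dtype == np.float32
```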
The dev branch, the master branch, and the released versions are updated at high, medium, and low frequency, respectively.
However, the current CI builds the conda package whenever master has new commits.
I suggest to
It would be better to implement each metric as a function in metric.py, since metrics typically have no state; there is no need to make them classes. E.g.,
def accuracy(y_pred, y_true):
    """Compute the accuracy.
    Args:
        y_pred (numpy array or tensor): each value is a label index
        y_true (numpy array or tensor): each value is a label index
    """
    # check that the shapes match
    assert y_pred.shape == y_true.shape, 'shape mismatch'
    # convert y_pred and y_true to np arrays first if they are tensors
    return np.sum(y_pred == y_true) / y_true.shape[0]
Refer to https://keras.io/api/metrics/
Hi, when running the Python test scripts, even when the tests pass, the CUDA stream still fails:
python3 test/python/test_operation.py -v TestPythonOperation.test_sum_cpu
test_sum_cpu (__main__.TestPythonOperation) ... ok
----------------------------------------------------------------------
Ran 1 test in 0.002s
OK
WARNING: Logging before InitGoogleLogging() is written to STDERR
F0618 07:06:09.329576 10741 cuda_gpu.cc:48] Check failed: error == cudaSuccess (29 vs. 0) driver shutting down
*** Check failure stack trace: ***
Aborted (core dumped)
In the log of the Travis CI CPU build, test_onnx fails because libprotobuf.so.20 cannot be imported:
https://travis-ci.org/github/apache/singa/jobs/664251025#L3998
======================================================================
ERROR: test_onnx (unittest.loader._FailedTest)
----------------------------------------------------------------------
ImportError: Failed to import test module: test_onnx
Traceback (most recent call last):
File "/home/travis/conda-bld-1971.5/singa_1584596418932/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pla/lib/python3.7/unittest/loader.py", line 436, in _find_test_path
module = self._get_module_from_name(name)
File "/home/travis/conda-bld-1971.5/singa_1584596418932/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pla/lib/python3.7/unittest/loader.py", line 377, in _get_module_from_name
__import__(name)
File "/home/travis/conda-bld-1971.5/singa_1584596418932/test_tmp/test/python/test_onnx.py", line 24, in <module>
from singa import sonnx
File "/home/travis/conda-bld-1971.5/singa_1584596418932/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pla/lib/python3.7/site-packages/singa/sonnx.py", line 23, in <module>
import onnx.utils
File "/home/travis/conda-bld-1971.5/singa_1584596418932/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pla/lib/python3.7/site-packages/onnx/__init__.py", line 8, in <module>
from .onnx_cpp2py_export import ONNX_ML
ImportError: libprotobuf.so.20: cannot open shared object file: No such file or directory
I followed the installation instructions at http://singa.apache.org/ but encountered an error similar to this issue: https://issues.apache.org/jira/browse/SINGA-422
I tried deleting all installed packages and redoing the installation, and encountered this error:
NoBaseEnvironmentError: This conda installation has no default base environment. Use
'conda create' to create new environments and 'conda activate' to
activate environments.
I created a new virtual environment, and I encountered this error when I entered any of the three commands:
PackagesNotFoundError: The following packages are not available from current channels:
- singa-gpu
Have there been any similar issues to this? Thank you for the help!
Currently, we implement the rnn operations from scratch, which may not be as fast as the cudnn versions.
To use the cudnn rnn operations, we need to implement the cpp operation and call it from the Python side.
https://docs.nvidia.com/deeplearning/sdk/cudnn-api/index.html#cudnnRNNMode_t
Hi, @dcslin , some onnx models need scalar (0-d) tensors, but we cannot support them yet.
The error is:
Traceback (most recent call last):
File "../../test/python/test_operation.py", line 4015, in test_tmp
x = tensor.from_numpy(x)
File "/usr/local/lib/python3.5/dist-packages/singa/tensor.py", line 766, in from_numpy
ret.copy_from_numpy(np_array)
File "/usr/local/lib/python3.5/dist-packages/singa/tensor.py", line 307, in copy_from_numpy
assert np_array.size == self.size(), 'tensor shape should be the same'
AssertionError: tensor shape should be the same
Please use this test case:
x = np.array(256, ndmin=0).astype(np.int32)
x = tensor.from_numpy(x)
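The root of the mismatch is that a 0-d numpy array has size 1 but an empty shape tuple. A possible workaround on the user side (an assumption for illustration, not the SINGA fix) is to promote the scalar to shape (1,) before copying it into a tensor:

```python
import numpy as np

x = np.array(256, ndmin=0).astype(np.int32)
assert x.shape == ()      # 0-d: empty shape
assert x.size == 1        # but still one element

# promote to a 1-element 1-d array before handing it to the framework
x1 = np.atleast_1d(x)
assert x1.shape == (1,)
assert int(x1[0]) == 256
```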
I'm trying to test some examples of SINGA. However, when I ran the examples, the singa-gpu
package threw an error that it could not find the Module class.
I reproduced this error with the following commands:
(base) user@xgpe3:~$ conda activate sg37
(sg37) user@xgpe3:~$ conda install -c nusdbsystem -c conda-forge singa-gpu=3.0.0
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: done
==> WARNING: A newer version of conda exists. <==
current version: 4.8.3
latest version: 4.8.4
Please update conda by running
$ conda update -n base -c defaults conda
## Package Plan ##
environment location: /path/to/miniconda3/envs/sg37
added / updated specs:
- singa-gpu=3.0.0
The following packages will be downloaded:
package | build
---------------------------|-----------------
importlib_metadata-1.7.0 | 0 3 KB conda-forge
zipp-3.1.0 | py_0 10 KB conda-forge
------------------------------------------------------------
Total: 13 KB
The following NEW packages will be INSTALLED:
_openmp_mutex conda-forge/linux-64::_openmp_mutex-4.5-1_llvm
attrs conda-forge/noarch::attrs-20.2.0-pyh9f0ad1d_0
cudatoolkit pkgs/main/linux-64::cudatoolkit-10.0.130-0
cudnn pkgs/main/linux-64::cudnn-7.6.5-cuda10.0_0
deprecated conda-forge/noarch::deprecated-1.2.7-py_0
dnnl nusdbsystem/linux-64::dnnl-1.1-build
freetype conda-forge/linux-64::freetype-2.10.2-he06d7ca_0
future conda-forge/linux-64::future-0.18.2-py37hc8dfbb8_1
glog conda-forge/linux-64::glog-0.3.5-hf484d3e_1001
importlib-metadata conda-forge/linux-64::importlib-metadata-1.7.0-py37hc8dfbb8_0
importlib_metadata conda-forge/noarch::importlib_metadata-1.7.0-0
iniconfig conda-forge/noarch::iniconfig-1.0.1-pyh9f0ad1d_0
jpeg conda-forge/linux-64::jpeg-9d-h516909a_0
lcms2 conda-forge/linux-64::lcms2-2.11-hbd6801e_0
libblas conda-forge/linux-64::libblas-3.8.0-16_openblas
libcblas conda-forge/linux-64::libcblas-3.8.0-16_openblas
libgfortran-ng conda-forge/linux-64::libgfortran-ng-7.5.0-hdf63c60_16
liblapack conda-forge/linux-64::liblapack-3.8.0-16_openblas
libopenblas conda-forge/linux-64::libopenblas-0.3.9-h5ec1e0e_0
libpng conda-forge/linux-64::libpng-1.6.37-hed695b0_2
libprotobuf conda-forge/linux-64::libprotobuf-3.9.2-h8b12597_0
libtiff conda-forge/linux-64::libtiff-4.1.0-hc7e4089_6
libwebp-base conda-forge/linux-64::libwebp-base-1.1.0-h516909a_3
llvm-openmp conda-forge/linux-64::llvm-openmp-10.0.1-hc9558a2_0
lz4-c conda-forge/linux-64::lz4-c-1.9.2-he1b5a44_3
more-itertools conda-forge/noarch::more-itertools-8.5.0-py_0
numpy conda-forge/linux-64::numpy-1.16.5-py37h95a1406_0
olefile conda-forge/noarch::olefile-0.46-py_0
onnx conda-forge/linux-64::onnx-1.6.0-py37he1b5a44_0
packaging conda-forge/noarch::packaging-20.4-pyh9f0ad1d_0
pillow conda-forge/linux-64::pillow-7.2.0-py37h718be6c_1
pluggy conda-forge/linux-64::pluggy-0.13.1-py37hc8dfbb8_2
protobuf conda-forge/linux-64::protobuf-3.9.2-py37he1b5a44_1
py conda-forge/noarch::py-1.9.0-pyh9f0ad1d_0
pyparsing conda-forge/noarch::pyparsing-2.4.7-pyh9f0ad1d_0
pytest conda-forge/linux-64::pytest-6.0.1-py37hc8dfbb8_0
python_abi conda-forge/linux-64::python_abi-3.7-1_cp37m
singa nusdbsystem/linux-64::singa-3.0.0-cudnn7.6.5_cuda10.0_py37
singa-gpu nusdbsystem/linux-64::singa-gpu-3.0.0-py37
six conda-forge/noarch::six-1.15.0-pyh9f0ad1d_0
toml conda-forge/noarch::toml-0.10.1-pyh9f0ad1d_0
tqdm conda-forge/noarch::tqdm-4.48.2-pyh9f0ad1d_0
wrapt conda-forge/linux-64::wrapt-1.12.1-py37h8f50634_1
zipp conda-forge/noarch::zipp-3.1.0-py_0
zstd conda-forge/linux-64::zstd-1.4.5-h6597ccf_2
The following packages will be UPDATED:
libgcc-ng pkgs/main::libgcc-ng-9.1.0-hdf63c60_0 --> conda-forge::libgcc-ng-9.3.0-h24d8f2e_16
openssl pkgs/main::openssl-1.1.1g-h7b6447c_0 --> conda-forge::openssl-1.1.1g-h516909a_1
The following packages will be SUPERSEDED by a higher-priority channel:
_libgcc_mutex pkgs/main::_libgcc_mutex-0.1-main --> conda-forge::_libgcc_mutex-0.1-conda_forge
ca-certificates pkgs/main::ca-certificates-2020.7.22-0 --> conda-forge::ca-certificates-2020.6.20-hecda079_0
certifi pkgs/main::certifi-2020.6.20-py37_0 --> conda-forge::certifi-2020.6.20-py37hc8dfbb8_0
Proceed ([y]/n)? y
Downloading and Extracting Packages
importlib_metadata-1 | 3 KB | #################################################################################### | 100%
zipp-3.1.0 | 10 KB | #################################################################################### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
(sg37) user@xgpe3:~$ python -c "from singa.module import Module"
Traceback (most recent call last):
File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'singa.module'
(sg37) user@xgpe3:~$
By contrast, with the singa-cpu package, everything seems fine.
(base) user@xgpe3:~$ conda activate sc37
(sc37) user@xgpe3:~$ conda install -c nusdbsystem -c conda-forge singa-cpu=3.0.0
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: done
==> WARNING: A newer version of conda exists. <==
current version: 4.8.3
latest version: 4.8.4
Please update conda by running
$ conda update -n base -c defaults conda
## Package Plan ##
environment location: /path/to/miniconda3/envs/sc37
added / updated specs:
- singa-cpu=3.0.0
The following packages will be downloaded:
package | build
---------------------------|-----------------
singa-3.0.0 | cpu_py37 22.2 MB nusdbsystem
singa-cpu-3.0.0 | py37 4 KB nusdbsystem
------------------------------------------------------------
Total: 22.2 MB
The following NEW packages will be INSTALLED:
_openmp_mutex conda-forge/linux-64::_openmp_mutex-4.5-1_llvm
deprecated conda-forge/noarch::deprecated-1.2.7-py_0
dnnl nusdbsystem/linux-64::dnnl-1.1-build
freetype conda-forge/linux-64::freetype-2.10.2-he06d7ca_0
future conda-forge/linux-64::future-0.18.2-py37hc8dfbb8_1
glog conda-forge/linux-64::glog-0.3.5-hf484d3e_1001
jpeg conda-forge/linux-64::jpeg-9d-h516909a_0
lcms2 conda-forge/linux-64::lcms2-2.11-hbd6801e_0
libblas conda-forge/linux-64::libblas-3.8.0-16_openblas
libcblas conda-forge/linux-64::libcblas-3.8.0-16_openblas
libgfortran-ng conda-forge/linux-64::libgfortran-ng-7.5.0-hdf63c60_16
liblapack conda-forge/linux-64::liblapack-3.8.0-16_openblas
libopenblas conda-forge/linux-64::libopenblas-0.3.9-h5ec1e0e_0
libpng conda-forge/linux-64::libpng-1.6.37-hed695b0_2
libprotobuf conda-forge/linux-64::libprotobuf-3.9.2-h8b12597_0
libtiff conda-forge/linux-64::libtiff-4.1.0-hc7e4089_6
libwebp-base conda-forge/linux-64::libwebp-base-1.1.0-h516909a_3
llvm-openmp conda-forge/linux-64::llvm-openmp-10.0.1-hc9558a2_0
lz4-c conda-forge/linux-64::lz4-c-1.9.2-he1b5a44_3
numpy conda-forge/linux-64::numpy-1.16.5-py37h95a1406_0
olefile conda-forge/noarch::olefile-0.46-py_0
onnx conda-forge/linux-64::onnx-1.6.0-py37he1b5a44_0
pillow conda-forge/linux-64::pillow-7.2.0-py37h718be6c_1
protobuf conda-forge/linux-64::protobuf-3.9.2-py37he1b5a44_1
python_abi conda-forge/linux-64::python_abi-3.7-1_cp37m
singa nusdbsystem/linux-64::singa-3.0.0-cpu_py37
singa-cpu nusdbsystem/linux-64::singa-cpu-3.0.0-py37
six conda-forge/noarch::six-1.15.0-pyh9f0ad1d_0
tqdm conda-forge/noarch::tqdm-4.48.2-pyh9f0ad1d_0
wrapt conda-forge/linux-64::wrapt-1.12.1-py37h8f50634_1
zstd conda-forge/linux-64::zstd-1.4.5-h6597ccf_2
The following packages will be UPDATED:
libgcc-ng pkgs/main::libgcc-ng-9.1.0-hdf63c60_0 --> conda-forge::libgcc-ng-9.3.0-h24d8f2e_16
openssl pkgs/main::openssl-1.1.1g-h7b6447c_0 --> conda-forge::openssl-1.1.1g-h516909a_1
The following packages will be SUPERSEDED by a higher-priority channel:
_libgcc_mutex pkgs/main::_libgcc_mutex-0.1-main --> conda-forge::_libgcc_mutex-0.1-conda_forge
ca-certificates pkgs/main::ca-certificates-2020.7.22-0 --> conda-forge::ca-certificates-2020.6.20-hecda079_0
certifi pkgs/main::certifi-2020.6.20-py37_0 --> conda-forge::certifi-2020.6.20-py37hc8dfbb8_0
Proceed ([y]/n)? y
Downloading and Extracting Packages
singa-cpu-3.0.0 | 4 KB | #################################################################################### | 100%
singa-3.0.0 | 22.2 MB | #################################################################################### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
(sc37) user@xgpe3:~$ python -c "from singa.module import Module"
(sc37) user@xgpe3:~$
Is this a bug or just the way it goes?
Updated on May 15
class Layer:
    def get_params(self):
        """Return the params of this layer and its sublayers as a dict; a param's name is layername.param.
        E.g., for self.W = Tensor() and self.b = Tensor(),
        the names of W and b are like conv1.W and conv1.b.
        """

    def get_states(self):
        """Return the states of this layer and its sublayers that are necessary for model evaluation/inference.
        The states include the params and others, e.g., the running mean and var of batchnorm.
        """


class Module(Layer):
    def compile(self, ...):
        """Set the name of each layer and its sublayers, which will be used to create the dicts
        for get_params and get_states. Then there is no need to manually configure the layer name
        in the __init__ method of a layer.
        For instance,
            class Blk(Layer):
                def __init__(self):
                    self.conv1 = Conv2d()
                    self.conv2 = Conv2d()

            class MyModel(Module):
                def __init__(self):
                    self.blk1 = Blk()  # --> blk1.conv1, blk1.conv2
                    self.blk2 = Blk()  # --> blk2.conv1, blk2.conv2
        """

    # high priority
    def save(self, fpath, ckp_states={}):
        """Save the model and optionally some states.
        Args:
            fpath: output file path (without the extension)
            ckp_states(dict): states for the checkpoint that are not attributes of Module, e.g., epoch ID.
        """
        # cust_states = {}
        # if ckp_states is not None:
        #     cust_states = ckp_states + model (including sublayers) attributes - get_states()
        # save the model states via onnx, with a customized field for the cust_states

    def load(self, fpath, dev, use_graph, graph_alg):
        """Load the model onto dev.
        Args:
            fpath: input file path (without the extension)
        Returns:
            dict for the ckp_states.
        """
        # load the model states + cust_states
        # model attributes = model states + attributes from cust_states
        # self.compile()
        # restore the model attributes
        # return the rest of the states as a dict


# lower priority
def save(fpath, model, ckp_states):
    # attributes <-- model
    # replace all tensors in attributes + ckp_states with dict entries name --> (shape, dtype)
    # dump the tensors via numpy.savez_compressed
    # dump the model via pickle
    ...

def load(fpath, dev, use_graph, graph_alg):
    # load the model via pickle
    # load the tensors via numpy.load
    # restore the tensors
    # return the ckp_states
    ...
Clarification:
Layer.get_params()
Layer.get_states()
class.__dict__: superset of the states.
SINGA has many common operators and layers.
There are also many operators to be implemented.
Here is a list of popular operators defined by ONNX: https://github.com/onnx/onnx/blob/master/docs/Operators.md
Here is the list of operators implemented in SINGA http://singa.apache.org/docs/onnx/#supported-operators
The task is to add the operators that are in ONNX but not yet in SINGA.
Refer to this link for how to add new operators and corresponding layers into SINGA.
I propose to release a minor version to reflect the changes since v3.0.
Please test the following items and check the documentation if they are done.
Here is the checklist and steps
Select a release manager. The release manager (RM) is the coordinator for the release process. It is the RM's signature (.asc) that is uploaded together with the release. The RM generates a KEY (RSA 4096-bit) and uploads it to a public key server. The RM needs to get the key endorsed (signed) by another Apache user, to be connected to the web of trust, and should first ask a mentor to help sign the key. How to generate the key?
Check license. FAQ; SINGA Issue
Bump the version. Check code and documentation
Prepare the RELEASE_NOTES file. Include the following items, Introduction, Features, Bugs (link to JIRA or Github PR), Changes, Dependency list, Incompatibility issues. Follow this example.
Prepare DISCLAIMER file. Modify from the template
Package the release candidate. The release should be packaged as apache-singa-VERSION.tar.gz. The release should not include any binary files, including git files. Upload the release to the staging area. The tar file, signature, KEY and SHA256 checksum file should be included. MD5 is no longer used. The policy is here
Call for vote by sending an email
To: [email protected]
Subject: [VOTE] Release apache-singa-X.Y.Z (release candidate N)
Hi all,
I have created a build for Apache SINGA X.Y.Z, release candidate N.
The artifacts to be voted on are located here: xxxx
The hashes of the artifacts are as follows: xxx
Release artifacts are signed with the following key: xxx
Please vote on releasing this package. The vote is open for at least 72 hours and passes if a majority of at least three +1 votes are cast.
[ ] +1 Release this package as Apache SINGA X.Y.Z
[ ] 0 I don't feel strongly about it, but I'm okay with the release
[ ] -1 Do not release this package because...
Here is my vote:
+1
Wait at least 48 hours for test responses. Any PMC member, committer or contributor can test the release features and give feedback. Everyone should check these before voting +1. If the vote passes, send the result email. Otherwise, repeat from the beginning.
Subject: [RESULT] [VOTE] Release apache-singa-X.Y.Z (release candidate N)
To: [email protected]
Thanks to everyone who has voted and given their comments. The tally is as follows.
N binding +1s:
<names>
N non-binding +1s:
<names>
No 0s or -1s.
I am delighted to announce that the proposal to release Apache SINGA X.Y.Z has passed.
Upload the package for distribution to https://dist.apache.org/repos/dist/release/VERSION/.
Update the Download page of SINGA website. The tar.gz file MUST be downloaded from mirror, using closer.cgi script; other artifacts MUST be downloaded from main Apache site. More details here. Some feedback we got during the previous releases: "Download pages must only link to formal releases, so must not include links to GitHub.", "Links to KEYS, sigs and hashes must not use dist.apache.org; instead use https://www.apache.org/dist/singa/...;", "Also you only need one KEYS link, and there should be a description of how to use KEYS + sig or hash to verify the downloads."
Remove the RC tag and compile the conda packages.
Publish the release information.
To: [email protected], [email protected]
Subject: [ANNOUNCE] Apache SINGA X.Y.Z released
We are pleased to announce that SINGA X.Y.Z is released.
SINGA is a general distributed deep learning platform for training big deep learning models over large datasets.
The release is available at: http://singa.apache.org/downloads.html
The main features of this release include XXX
We look forward to hearing your feedback, suggestions, and contributions to the project.
On behalf of the SINGA team, {SINGA Team Member Name}
Today when I run singa/test/python/test_operation.py, I get these errors:
ubuntu@ip-172-31-24-48:~/singa/test/python$ python3 test_operation.py
..................................................................E.FF..............................FF.....FF................
======================================================================
ERROR: test_conv2d_cpu (__main__.TestPythonOperation)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test_operation.py", line 216, in test_conv2d_cpu
y = conv_1(cpu_input_tensor) # PyTensor
File "/home/ubuntu/singa/build/python/singa/autograd.py", line 1380, in __call__
y = conv2d(self.handle, x, self.W, self.b)
File "/home/ubuntu/singa/build/python/singa/autograd.py", line 1241, in conv2d
return _Conv2d(handle)(x, W, b)[0]
File "/home/ubuntu/singa/build/python/singa/autograd.py", line 247, in __call__
return self._do_forward(*xs)
File "/home/ubuntu/singa/build/python/singa/autograd.py", line 298, in _do_forward
ys = self.forward(*xs)
File "/home/ubuntu/singa/build/python/singa/autograd.py", line 1203, in forward
return singa.GpuConvForward(x, W, b, self.handle)
TypeError: in method 'GpuConvForward', argument 4 of type 'singa::CudnnConvHandle const &'
======================================================================
FAIL: test_div_broadcast_cpu (__main__.TestPythonOperation)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test_operation.py", line 2616, in test_div_broadcast_cpu
np.testing.assert_array_almost_equal(tensor.to_numpy(tensor.from_raw_tensor(dx1)), grad1, decimal=5)
File "/usr/local/lib/python3.5/dist-packages/numpy/testing/_private/utils.py", line 1007, in assert_array_almost_equal
precision=decimal)
File "/usr/local/lib/python3.5/dist-packages/numpy/testing/_private/utils.py", line 819, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Arrays are not almost equal to 5 decimals
Mismatch: 3.33%
Max absolute difference: 3.0517578e-05
Max relative difference: 9.684139e-07
x: array([[-1.30722e+01, 2.65515e+00, -6.92423e-02, -2.97908e-01,
6.12429e+00, 3.71461e-01],
[ 1.33601e+01, -4.65283e+00, -4.74600e-01, -9.15998e-01,...
y: array([[-1.30722e+01, 2.65515e+00, -6.92423e-02, -2.97908e-01,
6.12429e+00, 3.71461e-01],
[ 1.33601e+01, -4.65283e+00, -4.74600e-01, -9.15998e-01,...
======================================================================
FAIL: test_div_broadcast_gpu (__main__.TestPythonOperation)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test_operation.py", line 2584, in test_div_broadcast_gpu
np.testing.assert_array_almost_equal(tensor.to_numpy(tensor.from_raw_tensor(dx1)), grad1, decimal=5)
File "/usr/local/lib/python3.5/dist-packages/numpy/testing/_private/utils.py", line 1007, in assert_array_almost_equal
precision=decimal)
File "/usr/local/lib/python3.5/dist-packages/numpy/testing/_private/utils.py", line 819, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Arrays are not almost equal to 5 decimals
Mismatch: 40%
Max absolute difference: 6.1035156e-05
Max relative difference: 3.51512e-07
x: array([-173.63599, -30.95938, 139.375 , -4.83802, -2.26971],
dtype=float32)
y: array([-173.63605, -30.95938, 139.37502, -4.83802, -2.26971],
dtype=float32)
======================================================================
FAIL: test_pow_broadcast_cpu (__main__.TestPythonOperation)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test_operation.py", line 2678, in test_pow_broadcast_cpu
np.testing.assert_array_almost_equal(tensor.to_numpy(tensor.from_raw_tensor(dx1)), grad1, decimal=5)
File "/usr/local/lib/python3.5/dist-packages/numpy/testing/_private/utils.py", line 1007, in assert_array_almost_equal
precision=decimal)
File "/usr/local/lib/python3.5/dist-packages/numpy/testing/_private/utils.py", line 819, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Arrays are not almost equal to 5 decimals
Mismatch: 40%
Max absolute difference: 6.1035156e-05
Max relative difference: 1.3951524e-07
x: array([ 169.04495, -238.43016, 1852.8772 , 437.48016, -20.75186],
dtype=float32)
y: array([ 169.04497, -238.43016, 1852.8772 , 437.48022, -20.75186],
dtype=float32)
======================================================================
FAIL: test_pow_broadcast_gpu (__main__.TestPythonOperation)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test_operation.py", line 2645, in test_pow_broadcast_gpu
np.testing.assert_array_almost_equal(tensor.to_numpy(result), y, decimal=5)
File "/usr/local/lib/python3.5/dist-packages/numpy/testing/_private/utils.py", line 1007, in assert_array_almost_equal
precision=decimal)
File "/usr/local/lib/python3.5/dist-packages/numpy/testing/_private/utils.py", line 819, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Arrays are not almost equal to 5 decimals
Mismatch: 6.67%
Max absolute difference: 6.1035156e-05
Max relative difference: 8.3724494e-08
x: array([[[ 1. , 216. , 64. , 36. , 343. ],
[ 27. , 125. , 512. , 36. , 343. ],
[ 1. , 343. , 1. , 81. , 343. ],...
y: array([[[ 1., 216., 64., 36., 343.],
[ 27., 125., 512., 36., 343.],
[ 1., 343., 1., 81., 343.],...
======================================================================
FAIL: test_reshape_cpu (__main__.TestPythonOperation)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test_operation.py", line 1455, in test_reshape_cpu
np.testing.assert_array_almost_equal(tensor.to_numpy(tensor.from_raw_tensor(dx)), grad, decimal=5)
File "/usr/local/lib/python3.5/dist-packages/numpy/testing/_private/utils.py", line 1007, in assert_array_almost_equal
precision=decimal)
File "/usr/local/lib/python3.5/dist-packages/numpy/testing/_private/utils.py", line 752, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Arrays are not almost equal to 5 decimals
(shapes (2, 3), (3, 2) mismatch)
x: array([[1., 1., 1.],
[1., 1., 1.]], dtype=float32)
y: array([[1., 1.],
[1., 1.],
[1., 1.]], dtype=float32)
======================================================================
FAIL: test_reshape_gpu (__main__.TestPythonOperation)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test_operation.py", line 1475, in test_reshape_gpu
np.testing.assert_array_almost_equal(tensor.to_numpy(tensor.from_raw_tensor(dx)), grad, decimal=5)
File "/usr/local/lib/python3.5/dist-packages/numpy/testing/_private/utils.py", line 1007, in assert_array_almost_equal
precision=decimal)
File "/usr/local/lib/python3.5/dist-packages/numpy/testing/_private/utils.py", line 752, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Arrays are not almost equal to 5 decimals
(shapes (2, 3), (3, 2) mismatch)
x: array([[1., 1., 1.],
[1., 1., 1.]], dtype=float32)
y: array([[1., 1.],
[1., 1.],
[1., 1.]], dtype=float32)
----------------------------------------------------------------------
Ran 125 tests in 0.586s
FAILED (failures=6, errors=1)
#688 is refactoring the autograd module.
Here are some comments about the current APIs in autograd.
Relationship between the classes and functions in autograd.
Operator implements the forward and backward methods for autograd. For each Operator class, there is a function that creates an Operator instance and calls the forward method.
Layer stores the states (handles and parameters) and calls the Operator function for the real computation. Note that a layer class can have sub-layers (as states) for creating complex and deep models.
Issue:
When we create a network using the Module API, there are both stateless (e.g., flatten) and stateful (e.g., conv2d) operations. Currently, we create layers in __init__ of Module and call the layers and operator functions in the forward method. Therefore, Layer and Operator are mixed, which may confuse users. A better way is to use Layer instances only: for every operator, we create a corresponding layer class to replace the operator function.
Layer API.
Issue: when and how to initialize the parameters (and handle) of a layer?
When: in the __init__ method, OR when the data is forwarded for the first time (#674).
How: pass an initializer function to the __init__ method of each layer and use it to initialize the parameters, OR pass an initializer function to the __init__ method of the Module class and use it to initialize the parameters (through get_params) of the layers after forwarding the layers once. The second approach requires the Module class's __init__ to do a forward pass of all layers and then call get_params of each layer for initialization; to do that, it needs at least the shapes of the input tensors and the device. The drawback of the first approach is that the initializer must be included in every Layer constructor.
Comments are welcomed.
Hi,
I have implemented AlexNet in SINGA but I get an error during the backward_and_update call. I am using SINGA 3.0.0.rc1 on CPU.
This is my AlexNet implementation:
from singa import autograd
from singa import module
from singa import opt

__all__ = ['AlexNet', 'alexnet']

class AlexNet(module.Module):
    def __init__(self, num_classes=1000):
        super(AlexNet, self).__init__()
        # 12 on GPU, so split 6 & 6
        self.features1 = [
            autograd.Conv2d(3, 64, kernel_size=11, stride=4, padding=2),
            autograd.ReLU(),
            autograd.MaxPool2d(kernel_size=3, stride=2),
            autograd.Conv2d(64, 192, kernel_size=5, padding=2),
            autograd.ReLU(),
            autograd.MaxPool2d(kernel_size=3, stride=2),
            autograd.Conv2d(192, 384, kernel_size=3, padding=1),
            autograd.ReLU(),
            autograd.Conv2d(384, 256, kernel_size=3, padding=1),
            autograd.ReLU()
        ]
        self.features2 = [
            autograd.Conv2d(256, 256, kernel_size=3, padding=1),
            autograd.ReLU(),
            autograd.MaxPool2d(kernel_size=3, stride=2)
        ]
        self.avgpool = autograd.AvgPool2d(6, stride=1)
        self.flatten = autograd.Flatten()
        self.classifier = [
            autograd.Dropout(),
            autograd.Linear(256 * 6 * 6, 4096),
            autograd.ReLU(),
            autograd.Dropout(),
            autograd.Linear(4096, 4096),
            autograd.ReLU(),
            autograd.Linear(4096, num_classes)
        ]
        self.optimizer = opt.SGD(lr=0.001, momentum=0.9)

    def loss(self, out, ty):
        return autograd.softmax_cross_entropy(out, ty)

    def optim(self, loss, dist_option, spars):
        if dist_option == 'fp32':
            self.optimizer.backward_and_update(loss)
        elif dist_option == 'fp16':
            self.optimizer.backward_and_update_half(loss)
        elif dist_option == 'partialUpdate':
            self.optimizer.backward_and_partial_update(loss)
        elif dist_option == 'sparseTopK':
            self.optimizer.backward_and_sparse_update(loss, topK=True, spars=spars)
        elif dist_option == 'sparseThreshold':
            self.optimizer.backward_and_sparse_update(loss, topK=False, spars=spars)

    def forward(self, x):
        for (i, layers) in enumerate([self.features1, self.features2,
                                      [self.avgpool, self.flatten], self.classifier]):
            for (j, fn) in enumerate(layers):
                x = fn(x)
                if type(x) is tuple:  # FIXME: ReLU returns a 1-tuple; is this a SINGA bug?
                    x = x[0]
        return x

def alexnet(**kwargs):
    return AlexNet(**kwargs)
And I get : AssertionError: ('shape mismatch', (9216, 4096), (256, 4096))
which corresponds to my first linear layer: 256 * 6 * 6 -> 4096.
When I use my VGG16 implementation, I get a similar error:
AssertionError: ('shape mismatch', (25088, 4096), (512, 4096))
It seems that the backward operation does not map the correct shape to the corresponding layer.
Moreover, the ReLU class returns a 1-tuple containing a Tensor. Is this intended or is it a bug?
Data loading is an important part of DL training; it can be slow and become a bottleneck if not implemented well.
The tasks include
Currently, we abort the program when any check fails via glog's CHECK functions.
We do not catch any exceptions, such as memory or cuDNN exceptions.
As a result, the program aborts or crashes whenever there is an error or exception, which sometimes shuts down the Jupyter or Colab notebook when we run the code in a notebook environment.
This ticket is to raise and handle exceptions in CPP code.
ref: http://www.swig.org/Doc3.0/SWIGDocumentation.html#Customization_exception
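On the Python side, exceptions raised from the C++ core could be surfaced as a small exception hierarchy instead of a process abort. The class names below are hypothetical, not existing SINGA API; they only illustrate the intended behavior:

```python
class SingaError(Exception):
    """Hypothetical base class for errors raised from the C++ core."""

class DeviceMemoryError(SingaError):
    """E.g. a failed device allocation, instead of aborting the process."""

class CudnnError(SingaError):
    """E.g. a cuDNN status code other than CUDNN_STATUS_SUCCESS."""

def check(ok, error_cls, message):
    # Python-side analogue of glog's CHECK: raise instead of abort,
    # so a notebook kernel survives the failure.
    if not ok:
        raise error_cls(message)

try:
    check(False, CudnnError, "CUDNN_STATUS_BAD_PARAM in GpuConvForward")
except SingaError as e:
    print("recovered:", e)
```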
In both the code and documentation, we need to make the project name consistent.
The official name of the project is SINGA, but sometimes we have to use singa, e.g., for the URL or the GitHub repo name.
So I suggest using SINGA (e.g., in documentation) and singa (e.g., in code), and avoiding Singa.
Hi, @dcslin , we cannot do matmul for high-dim tensors:
The error is:
F0324 05:47:19.174587 15611 tensor.cc:1413] Check failed: A.shape().size() == 2u (4 vs. 2)
please use this test case:
x1 = np.random.randn(1, 12, 256, 64).astype(np.float32)
x2 = np.random.randn(1, 12, 64, 256).astype(np.float32)
x1 = tensor.from_numpy(x1)
x1.to_device(gpu_dev)
x2 = tensor.from_numpy(x2)
x2.to_device(gpu_dev)
y = autograd.Matmul()(x1, x2)
print(tensor.to_numpy(y[0]))
Currently, SINGA's padding takes a single value and pads that amount on both sides (head and tail) of each spatial dimension.
However, according to the ONNX doc, there are four padding modes: NOTSET, SAME_UPPER, SAME_LOWER or VALID. SINGA cannot support SAME_UPPER and SAME_LOWER yet.
SAME_UPPER and SAME_LOWER mean padding the input so that the output spatial size matches the input; when the total padding is odd, the extra padding goes at the end for SAME_UPPER and at the beginning for SAME_LOWER.
For example, if the input is 32*32, the stride is 1, the kernel is 4*4, and the output must be the same size, we need to add 3 zeros in total per dimension; each side would get 3/2 = 1.5, which cannot be expressed with SINGA's current symmetric setting.
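The asymmetric split described above can be computed per the ONNX definition as follows; this is a sketch, and the function name is made up for illustration:

```python
import math

def same_padding(in_size, kernel, stride, mode="SAME_UPPER"):
    """Return (pad_begin, pad_end) for ONNX SAME_UPPER / SAME_LOWER.

    The output spatial size is ceil(in_size / stride); when the total
    padding is odd, the extra unit goes at the end for SAME_UPPER and
    at the beginning for SAME_LOWER.
    """
    out_size = math.ceil(in_size / stride)
    total = max((out_size - 1) * stride + kernel - in_size, 0)
    small, big = total // 2, total - total // 2
    return (small, big) if mode == "SAME_UPPER" else (big, small)

# The 32*32 / stride 1 / 4*4-kernel example from above: 3 zeros in total.
print(same_padding(32, 4, 1, "SAME_UPPER"))  # (1, 2)
print(same_padding(32, 4, 1, "SAME_LOWER"))  # (2, 1)
```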
Got AttributeError when running some Autograd models:
python examples/cnn/autograd/cifar10_multiprocess.py
AttributeError: module 'singa.singa_wrap' has no attribute 'NcclIdHolder'
python examples/cnn/autograd/mnist_dist.py
AttributeError: module 'singa.singa_wrap' has no attribute 'Communicator'
Does it mean that the singa_wrap file was not regenerated?
SINGA has multiple example models at http://singa.apache.org/docs/examples/
Some are implemented from scratch and some are converted from ONNX, which has a bigger model zoo https://github.com/onnx/models.
The task is to convert more onnx models and implement some popular (and interesting) models that are not in onnx model zoo.
Here are some reference model zoos https://modelzoo.co/, https://gluon-nlp.mxnet.io/model_zoo/index.html
Hi, @dcslin , we cannot do the mul operator for int tensors:
The error is:
F0324 05:04:22.542809 14739 tensor.cc:932] Unknown combination of data type kInt and language kCuda
please use this test case:
x1 = np.array([1], dtype=np.int32)
x2 = np.array([256], dtype=np.int32)
x1 = tensor.from_numpy(x1)
x1.to_device(gpu_dev)
x2 = tensor.from_numpy(x2)
x2.to_device(gpu_dev)
y = autograd.Mul()(x1, x2)
print(tensor.to_numpy(y[0]))
In singa/examples/rnn/train.py,
When we run it with the graph using sequential execution, i.e., line 200: model.graph(True, True), the training completes without error.
However, when we run it with the graph using node input dependencies, i.e., line 200: model.graph(True, False), it displays the error below:
Summarized from some discussion with @XJDKC, the error may be due to the following:
Therefore, to support computation of the graph using node input dependencies, we may need to update the buffering of some RNN operations.
AssertionError with the onnx testcase: https://github.com/apache/singa/blob/master/examples/onnx/training/train.py
$ cd examples/onnx
$ python3 training/train.py --model vgg16
Then I get the following error msg:
File "training/train.py", line 437, in <module>
args.onnx_model_path, args.data, sgd, args.graph, args.verbosity)
File "training/train.py", line 295, in run
model.compile([tx], is_train=True, use_graph=graph, sequential=sequential)
File "/home/extend/lijiansong/work-space/anaconda2/envs/intel-caffe/lib/python3.6/site-packages/singa/model.py", line 177, in compile
self.forward(*inputs)
File "/home/extend/lijiansong/work-space/anaconda2/envs/intel-caffe/lib/python3.6/site-packages/singa/layer.py", line 63, in wrapper
return func(self, *args, **kwargs)
File "training/train.py", line 191, in forward
y = self.linear(y)
File "/home/extend/lijiansong/work-space/anaconda2/envs/intel-caffe/lib/python3.6/site-packages/singa/layer.py", line 110, in __call__
return self.forward(*args, **kwargs)
File "/home/extend/lijiansong/work-space/anaconda2/envs/intel-caffe/lib/python3.6/site-packages/singa/layer.py", line 61, in wrapper
self.initialize(*args, **kwargs)
File "/home/extend/lijiansong/work-space/anaconda2/envs/intel-caffe/lib/python3.6/site-packages/singa/layer.py", line 45, in wrapper
'initialize function expects PlaceHolders or Tensors')
AssertionError: initialize function expects PlaceHolders or Tensors
Is something wrong with the layer initialization?
SINGA version: 3100 (the latest build from the source code of the master branch)
Python version: 3.5.2
ONNX version: 1.5.0