
alibaba / bladedisc


BladeDISC is an end-to-end DynamIc Shape Compiler project for machine learning workloads.

License: Apache License 2.0

Dockerfile 0.01% HCL 0.01% Shell 0.60% CMake 0.46% C++ 68.27% Python 13.37% C 0.14% Makefile 0.01% Starlark 2.63% MLIR 14.45% Smarty 0.02% Ruby 0.01% Roff 0.02% Cuda 0.01%
compiler deep-learning machine-learning pytorch tensorflow inference-optimization mlir neural-network


BladeDISC Introduction

We're hiring!🔥🔥🔥

We're always looking for candidates to join our dev team. You're the one we've been searching for:

  • 🥷 if you are a compiler or AI enthusiast;
  • ⭐️ or if you are experienced in optimization on CPUs and GPUs;
  • ⚙️ or if you want to build a unified, automated compiler that optimizes both inference and training workloads;
  • 🤿 or if you are using BladeDISC in production or research projects and want to take a deeper dive into it;
  • ✄ or if you want to build cutting-edge infrastructure in the AIGC era.

Please contact us via email or DingTalk at the bottom of this page. ⬇️⬇️⬇️

What's New

Overview

BladeDISC is an end-to-end DynamIc Shape Compiler project for machine learning workloads and one of the key components of Alibaba's PAI-Blade. BladeDISC provides general, transparent, and easy-to-use performance optimization for TensorFlow/PyTorch workloads on GPGPU and CPU backends. The architecture natively supports dynamic shape workloads, with careful attention to performance in both static and dynamic shape scenarios. It also supports multiple flexible deployment solutions, including Plugin Mode inside the TensorFlow/PyTorch runtime and Standalone Mode for AOT standalone execution. The project is based on MLIR and is closely related to the mlir-hlo project.

Refer to our website for more information, including the setup tutorial, developer guide, demo examples, and other documents for developers.

Features and Roadmap

Frontend Framework Support Matrix

            TensorFlow [1]   PyTorch [2]
Inference   Yes              Yes
Training    Yes [3]          Ongoing

[1] TensorFlow 1.12, 1.15, 2.4 & 2.5 are supported and fully verified. For other versions, some slight work on adaptation might be needed.

[2] PyTorch version >= 1.6.0 has been fully verified.

[3] Although supported, there's much room for improvement on Op coverage for training workloads.

Backend Support Matrix

Status
Nvidia GPU Yes [1]
AMD GPU Yes
Hygon DCU Yes
X86 Yes
AArch64 Yes

[1] Support for CUDA below 11.0 has been deprecated officially since Aug 2022.

Deployment Solutions

  • Plugin Mode - BladeDISC works as a plugin of TensorFlow or PyTorch. Only the supported ops are clustered and compiled, and the unsupported ones are executed by the original TensorFlow or PyTorch runtime. We recommend this mode to most users for its transparency and ease of use.

  • Standalone Mode - In Standalone Mode, the input workload is compiled into a binary that can be executed by itself, i.e., it does not rely on a TensorFlow or PyTorch runtime. In this mode, all ops must be supported.

Numbers of Typical Workloads

Evaluated on a set of typical machine learning workloads used in production, BladeDISC shows up to a 6.95x speedup compared with PyTorch. Moreover, compared to static optimizing compilers (i.e., XLA and TensorRT), BladeDISC shows comparable or even better performance.

Fig. 1: End-to-end performance of BladeDISC and baselines. Note that some baselines fail to optimize the ViT model.

Advantage in Dynamic Shape Workloads

Specifically, for the BERT-large inference on a T4 GPU that we provide in the examples, the static compiler optimization (XLA) shows severe performance degradation due to its compilation overhead, while BladeDISC shows a 1.75x speedup over TensorFlow.

           TensorFlow   XLA       BladeDISC
Latency    1.78 s       41.69 s   1.02 s
Speedup    1x           -         1.75x

API QuickView

For TensorFlow Users

Only two lines of code are needed on top of a native TensorFlow program, as follows:

import numpy as np
import tensorflow as tf

## enable BladeDISC on TensorFlow program
import blade_disc_tf as disc
disc.enable()

## construct TensorFlow Graph and run it
g = tf.Graph()
with g.as_default():
    ...
    with tf.Session() as sess:
        sess.run(...)

For more information, please refer to QuickStart for TensorFlow Users

For PyTorch Users

PyTorch users only need the following few lines of code to enable BladeDISC:

import torch
import torch.nn as nn
import torch_blade

# construct the PyTorch Module
class MyModule(nn.Module):
    ...

module = MyModule().eval()

# x and y below are example inputs used to optimize the module
with torch.no_grad():
    # blade_module is the module optimized by BladeDISC
    blade_module = torch_blade.optimize(module, allow_tracing=True, model_inputs=(x, y))

# run the optimized module
blade_module(x, y)

torch_blade.optimize accepts an nn.Module object and outputs the optimized module. For more information, please refer to Quickstart for PyTorch Users.
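
The returned module is a regular TorchScript module (the deployment issue further below loads one back as a RecursiveScriptModule), so it can be persisted and reloaded with the standard TorchScript APIs. A minimal sketch, assuming blade_module and the example inputs (x, y) from the snippet above; the file name opt.disc.pt is just an example:

# save the optimized module to disk
torch.jit.save(blade_module, "opt.disc.pt")

# reload it later for inference
reloaded = torch.jit.load("opt.disc.pt").eval()
with torch.no_grad():
    out = reloaded(x, y)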

Setup and Examples

Publications

Tutorials and Documents for Developers

Presentations and Talks

How to Contribute

Building Status

Framework         Device   Status
PyTorch Pre       GPU      pytorch_pre_gpu
PyTorch Pre       CPU      pytorch_pre_cpu
PyTorch 2.0.0     GPU      pytorch200_gpu
PyTorch 2.0.0     CPU      pytorch200_cpu
PyTorch 2.0.0     Yitian   pytorch200_yitian
PyTorch 1.13.0    GPU      pytorch113_gpu
PyTorch 1.13.0    CPU      pytorch113_cpu
PyTorch 1.13.0    Yitian   pytorch113_yitian
TensorFlow 2.5    GPU      tf250_gpu
TensorFlow 2.5    CPU      tf250_cpu
TensorFlow 2.8    Yitian   tf280_yitian

FAQ

Roadmap with mlir-hlo Project

BladeDISC is closely related to the mlir-hlo project. Part of the building blocks, including the MHLO op definitions, TF-to-MHLO conversions, and some general-purpose passes, have been upstreamed to the mlir-hlo repository. We'll continue to work closely with the mlir-hlo project in the longer term.

Roadmap with Torch-MLIR Project

BladeDISC compiles PyTorch workloads based on Torch-MLIR. The BladeDISC dev team is cooperating with the community to add Torch-to-MHLO conversion to Torch-MLIR, especially for fully dynamic shape features. See the RFC: llvm/torch-mlir#999. We welcome community developers interested in joining this effort.

Contact Us

DingTalk

bladedisc's People

Contributors

alibaba-oss, bddppq, bikekiller, bladedisc, changqi1, chenbohua3, deeply, eedalong, fwd4, github-actions[bot], guo-peilin, ipe-zhangyz, jamesthez, lfengad, linbinskn, linearhit, minminsun, orion34-lanbo, qiuxiafei, wangdalin, wyzero, xiaowan0322, yancey1989, yuchaoli, yunzhongovo, zhangxiao-stack, zhiqwang, zzpmiracle


bladedisc's Issues

Support fusing "isSplat" constants

The problem is observed in Swin-Transformer when PyTorch is running with AMP (automatic mixed precision).

"lmhlo.constant"(%275) {value = dense<0.000000e+00> : tensor<64x784x768xf32>} : (memref<64x784x768xf32, "gpu">) -> ()

…

"lmhlo.fusion"() ( {
      …
      "lmhlo.multiply"(%1741, %275, %1742) {disc.device = "gpu"} : (memref<64x784x768xf32, "gpu">, memref<64x784x768xf32, "gpu">, memref<64x784x768xf32, "gpu">) -> ()
      …
      "lmhlo.terminator"() : () -> ()
 }) {disc.device = "gpu", disc.fusion.name = "main_kLoop_reshape__37_1_2"} : () -> ()

"lmhlo.fusion"() ( {
      …
      "lmhlo.multiply"(%1888, %275, %1889) {disc.device = "gpu"} : (memref<64x784x768xf32, "gpu">, memref<64x784x768xf32, "gpu">, memref<64x784x768xf32, "gpu">) -> ()
      …
      "lmhlo.terminator"() : () -> ()
 }) {disc.device = "gpu", disc.fusion.name = "main_kLoop_reshape__37_1_3"} : () -> ()

In general, "splat" constants left outside of a fusion kernel can cause severe performance issues; in Swin-Transformer the degradation is significant. Note that there might be multiple kernels consuming the same splat constant.

Solution 1: mark the "splat" constant as fusible in the fusion pass, and add an additional fusion stage that duplicates the producer according to some rules, similar to the FusionMerger in XLA.

Solution 2: add an additional FuseSplatConstPass after the regular fusion pass that specifically duplicates and fuses the splat const into fusion kernels.

Both solutions also require fusion codegen support for "splat" constants. Solution 2 can be regarded as a reduced version of solution 1, and it cannot handle cases like the following:

"lmhlo.constant"(%272) {value = dense<0.000000e+00> : tensor<64x784x768xf32>} : (memref<64x784x768xf32, "gpu">) -> ()
"lmhlo.add"(%272, %273, %275)  {disc.device = "gpu"} : (memref<64x784x768xf32, "gpu">, memref<64x784x768xf32, "gpu">, memref<64x784x768xf32, "gpu">) -> ()

…

"lmhlo.fusion"() ( {
      …
      "lmhlo.multiply"(%1741, %275, %1742) {disc.device = "gpu"} : (memref<64x784x768xf32, "gpu">, memref<64x784x768xf32, "gpu">, memref<64x784x768xf32, "gpu">) -> ()
      …
      "lmhlo.terminator"() : () -> ()
 }) {disc.device = "gpu", disc.fusion.name = "main_kLoop_reshape__37_1_2"} : () -> ()

"lmhlo.fusion"() ( {
      …
      "lmhlo.multiply"(%1888, %275, %1889) {disc.device = "gpu"} : (memref<64x784x768xf32, "gpu">, memref<64x784x768xf32, "gpu">, memref<64x784x768xf32, "gpu">) -> ()
      …
      "lmhlo.terminator"() : () -> ()
 }) {disc.device = "gpu", disc.fusion.name = "main_kLoop_reshape__37_1_3"} : () -> ()

mlir_disc_builder.so needed by libtorch_blade.so is missing

Hi, I followed the doc for building BladeDISC for PyTorch users (build_from_source), but an error occurred.

ninja: error: '/disc/tf_community/bazel-bin/tensorflow/compiler/mlir/disc/mlir_disc_builder.so', needed by '../../torch_blade/libtorch_blade.so', missing and no known rule to make it
Traceback (most recent call last):
  File "setup.py", line 135, in <module>
    zip_safe=False,
  File "/usr/lib/python3/dist-packages/setuptools/__init__.py", line 129, in setup
    return distutils.core.setup(**attrs)
  File "/usr/lib/python3.6/distutils/core.py", line 148, in setup
    dist.run_commands()
  File "/usr/lib/python3.6/distutils/dist.py", line 955, in run_commands
    self.run_command(cmd)
  File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command
    cmd_obj.run()
  File "/usr/lib/python3/dist-packages/setuptools/command/develop.py", line 36, in run
    self.install_for_development()
  File "/usr/lib/python3/dist-packages/setuptools/command/develop.py", line 136, in install_for_development
    self.run_command('build_ext')
  File "/usr/lib/python3.6/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command
    cmd_obj.run()
  File "setup.py", line 73, in run
    build_temp="build/temp")
  File "/disc/pytorch_blade/cmake_build.py", line 127, in run
    self.build_extension(extdir, srcdir, build_temp)
  File "/disc/pytorch_blade/cmake_build.py", line 197, in build_extension
    ["cmake", "--build", "."] + build_args, cwd=build_temp, env=env
  File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cmake', '--build', '.', '--config', 'Release', '--target', 'package', '--', '-j72']' returned non-zero exit status 1.

CI task failed on pre-commit

The pre-commit step always fails on the tf-gpu-1 runner:

Run pre-commit run -a --show-diff-on-failure
/home/github/actions-runner/_work/_temp/d83aab53-811f-4e5c-bb09-802e5f50be22.sh: line 1: pre-commit: command not found
Error: Process completed with exit code 127.

To support custom pattern based rewrite graph configurations in TorchBlade

One may run into a PyTorch operation that is not currently supported by TorchBlade, such as aten::hardswish.
However, aten::hardswish can be rewritten into primitive operations that are supported by TorchBlade.
The following code demonstrates how to rewrite the input graph with JIT passes:

import torch
import torch.nn.functional as F

def _jit_pass_hardswish(graph):
    from_graph_str = """
    graph(%x):
      %r: Tensor = aten::hardswish(%x)
      return (%r)
    """

    def hard_sigmoid(x: torch.Tensor, inplace: bool = False) -> torch.Tensor:
        return F.relu6(x + 3, inplace) / 6

    @torch.jit.script    
    def hard_swish(x: torch.Tensor) -> torch.Tensor:
        return x * hard_sigmoid(x, False) 

    torch._C._jit_pass_inline(hard_swish.graph)
    torch._C._jit_pass_dce(hard_swish.graph)
    torch._C._jit_pass_constant_propagation(hard_swish.graph)

    to_graph_str = str(hard_swish.graph)
    torch._C._jit_pass_custom_pattern_based_rewrite_graph(
        from_graph_str, to_graph_str, graph)
    torch._C._jit_pass_dce(graph)
    torch._C._jit_pass_constant_propagation(graph)
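
A minimal usage sketch of the pass above (the module name Net is hypothetical, and it assumes _jit_pass_hardswish is defined as shown): script a module that contains aten::hardswish and run the pass on its graph in place.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def forward(self, x):
        # scripting this produces an aten::hardswish node in the graph
        return F.hardswish(x)

scripted = torch.jit.script(Net().eval())
_jit_pass_hardswish(scripted.graph)  # rewrites aten::hardswish into relu6-based primitives
print(scripted.graph)                # the graph should no longer contain aten::hardswish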

Support Scalar type as the Disc module input/output type

To be compilable by BladeDISC, the inputs and outputs should be of Tensor type. This works well in the TensorFlow world, but it is insufficient in the PyTorch world, because a considerable number of inputs/outputs are of Scalar type, as in the following illustration:

- func: add.Scalar(Tensor self, Scalar other, Scalar alpha=1) -> Tensor

A workaround is to cast the Scalar to a Tensor outside the Disc cluster and cast it back to a Scalar inside the cluster, as illustrated below:

Given:
with prim::FusionGroup(
      %1: Scalar,
      %2: Scalar):
     %3 Tensor = aten::add(%1, %2)
  return %3

 Execute: CastScalarInputs(sub_graph)

 After:
  with prim::FusionGroup(
      %1.1: Tensor,
      %2.1: Tensor):
    %4 int = aten::item(%1.1, 1)
    %5 int = aten::item(%2.1, 1)
    %3 Tensor = aten::add(%4, %5)
    return %3

Supporting Scalar directly as a Disc module input type may be a better way.
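
For intuition, the workaround corresponds roughly to the following eager-mode Python. This is an illustration only (the real transformation happens on the TorchScript graph); the tensor wrapping and .item() calls mirror the Tensor cast and aten::item steps above.

import torch

# Outside the cluster: wrap the Python scalars as 0-dim tensors so the cluster
# boundary only sees Tensor-typed inputs.
other_t = torch.tensor(2.0)
alpha_t = torch.tensor(1.0)

# Inside the cluster: cast back to scalars (the aten::item step) before running the
# scalar-typed op, e.g. add.Scalar(Tensor self, Scalar other, Scalar alpha=1).
self_tensor = torch.ones(3)
result = torch.add(self_tensor, other_t.item(), alpha=alpha_t.item())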

Is there a version of TF15 that can be used out of box?

Hi there! After reading the paper and running the demo you provided, I am excited by the speedup DISC has achieved compared to other frameworks on various dynamic deep learning workloads. Great work, I must say! The easy-to-use API and impressive performance gains make us want to give it a try. As far as I know, the demo provided is based on TF 2.4, but our production environment is TF 1.15. I am currently trying to adapt TF 1.15 to the DISC compiler and have encountered many compilation issues due to incompatible build tools (TF 2.4 uses Bazel 4.0 while TF 1.15 uses Bazel 0.24.1). According to the docs, TF 1.15 is also supported, so my question is: is there an out-of-the-box TF 1.15 version that we can use directly? If not, is there any documentation to guide the adaptation?
Any advice would be much appreciated!

To build all with Bazel.

Background

With pytorch_blade and tensorflow_blade being open-sourced, BladeDISC's project structure is getting more complex. We currently have the following essential directories:

  • tao: TF bridge of BladeDISC
  • tao_compiler: BladeDISC compiler main executable, which will be symbolically linked to a directory under tf_community and built with tensorflow.
  • tf_community: mirroring tensorflow/tensorflow.
  • mhlo_builder: PyTorch bridge of BladeDISC, converting TorchScript to MHLO IR.
  • pytorch_blade: Python API to optimize PyTorch model.
  • tensorflow_blade: Python API to optimize TF model.

Our Goal and Current Status

As we've discussed many times, we're moving to a monorepo for both the open-source and internal code repositories, and we want to use Bazel to build everything. Making everything in this repo buildable with Bazel would make the dependency structure explicit, standard, and clean, and would help new developers ramp up smoothly.
Ideally, one could run a universal preparation script once with the necessary arguments, and then bazel build or bazel test any target from any component.

But there are some obstacles in the way:

  1. Components are built independently, even with different build tools. For example:
    • tao is built with CMake, while tao_compiler is built with Bazel. It is worth noting that the RAL code is built by both sides.
    • pytorch_blade is built with Bazel wrapped by Python setuptools, while tensorflow_blade has Bazel calling setuptools.
  2. Components usually have their own build script, shell or Python, which cuts off the Bazel dependency chain.
  3. Free-style preparations in these scripts make Bazelization even harder.

Approaches

1. Build tao with Bazel.

The tao directory is currently built with CMake. Converting CMake to Bazel is non-trivial but still possible. However, the RAL code, which is built on both the bridge side and the compiler side, makes things complex: it is built with CMake on the bridge side, and with Bazel on the compiler side under the tf_community directory. The BUILD file of the RAL code loads tf_community's rules, which are not available on the bridge side, because the bridge only has the include files and shared libraries of the given host TensorFlow.

load(
    "//tensorflow:tensorflow.bzl",
    "tf_cc_shared_object",
)

There may be several solutions:

  1. Make that BUILD file neutral and load only standard Bazel rules, so that it can be used on both the bridge and compiler sides. The same source files can be compiled into a different target for each side. It's also possible to set up an option specifying which side is being built, and use select() to switch between dependencies from tf_community and the host TensorFlow.
  2. Only expose a filegroup target from the RAL directory; each side writes its own cc_library target in a BUILD file under its own directory.

2. Extend common_setup.py

If tao is built with Bazel, all DISC components could expose Bazel targets (whether or not they live in a single workspace)! Upper-level components like pytorch_blade and tf_blade could reference those targets and move on with their own builds.

common_setup.py is used to do preparations before the build, such as symbolic linking and OneDNN installation, before building DISC. So when building any component that depends on DISC, common_setup.py should be called in advance:

python3 ../scripts/python/common_setup.py

If we extend common_setup.py a little bit to also set the environment variables currently set in build_pytorch_blade.py, pytorch_blade will be free from its extra build script (as for the relationship between Python setuptools and Bazel, see the open questions). If so, why not just make common_setup.py a global setup step for the whole project, like the configure script in TensorFlow?

3. Make DISC a Bazel Workspace out of tf_community

We have quite a few Bazel workspaces now. From an architecture point of view, it's natural to have a single Bazel workspace for all of tao_compiler/tao/mhlo, which make up DISC. Pulling tao_compiler out of tf_community's workspace is the key to achieving this goal. I have to admit that it is not a very urgent task, and we may face challenges if many tf_community internal targets are referenced by tao_compiler. IREE has done similar work, which may help.

These are just my own immature thoughts; your comments are welcome ~

Open Questions

  1. What should the relationship between Python setuptools and Bazel be?
    pytorch_blade is built with Bazel wrapped by Python setuptools, while tensorflow_blade has Bazel calling setuptools.

Invalid tensorflow version in CPU runtime docker

The CPU runtime Docker image should install the CPU-only version of TensorFlow. The error logs:

docker run --rm -it bladedisc/bladedisc:latest-runtime-tensorflow1.15-cpu bash
$ python
>>> import tensorflow as tf
>>> tf.test.is_built_with_cuda()
True

To support dynamic SelectOp: if or else operand is dynamic shape.

Met the error ":0: error: loc("clip_by_norm_11/Select"): currently unsupported operand types: 'tensor<1x1xf32>' and 'tensor<?x?xf32>'" when converting tf.Select to mhlo.select (in mlir::mhlo::createLegalizeTFPass()). Currently, TF doesn't support this.
The current workaround is to blacklist SelectOp via "export TAO_OP_TYPE_CLUSTERING_BLACK_LIST='Select'".
The MLIR unit test is as follows:
#loc0 = loc(unknown)
module attributes {tf.versions = {bad_consumers = [], min_consumer = 0 : i32, producer = 891 : i32}} {
func @main(%arg0: tensor<?xi32> loc(unknown), %arg1: tensor<?xi32> loc(unknown), %arg2: tensor<?xi32> loc(unknown), %arg3: tensor<?xi32> loc(unknown), %arg4: tensor loc(unknown), %arg5: tensor loc(unknown), %arg6: tensor<1xi32> loc(unknown), %arg7: tensor<1xi32> loc(unknown), %arg8: tensor<1xi32> loc(unknown), %arg9: tensor<2xi32> loc(unknown), %arg10: tensor<2xi32> loc(unknown), %arg11: tensor<2xi32> loc(unknown), %arg12: tensor<2xi32> loc(unknown), %arg13: tensor<2xi32> loc(unknown), %arg14: tensor<2xi32> loc(unknown), %arg15: tensor<2xi32> loc(unknown), %arg16: tensor<2xi32> loc(unknown), %arg17: tensor<2xi32> loc(unknown), %arg18: tensor<2xi32> loc(unknown), %arg19: tensor<2xi32> loc(unknown), %arg20: tensor<2xi32> loc(unknown), %arg21: tensor<?xi32> loc(unknown), %arg22: tensor<?xi32> loc(unknown), %arg23: tensor<?xi32> loc(unknown), %arg24: tensor<?xi32> loc(unknown), %arg25: tensor<?x?xf32> loc(unknown), %arg26: tensor<?x?xf32> loc(unknown), %arg27: tensor<?x?xf32> loc(unknown), %arg28: tensor<?x?xf32> loc(unknown), %arg29: tensor<?x?xf32> loc(unknown), %arg30: tensor<?x?xf32> loc(unknown), %arg31: tensor<?x?xf32> loc(unknown), %arg32: tensor<?x?xf32> loc(unknown), %arg33: tensor<?x?xf32> loc(unknown), %arg34: tensor<?x?xf32> loc(unknown), %arg35: tensor loc(unknown), %arg36: tensor<?x?xf32> loc(unknown)) -> (tensor<?xi32>, tensor<?xi32>, tensor<?x?xf32>, tensor<1x1xf32>, tensor<1x1xf32>, tensor<1x1xf32>, tensor<1x1xi1>, tensor<1x1xf32>, tensor<1x1xi1>, tensor<1x1xf32>, tensor<1x1xi1>, tensor<1x1xf32>, tensor<?x?xf32>, tensor<?x?xf32>, tensor<?x?xf32>, tensor, tensor<?x1xi1>) attributes {tf.entry_function = {control_outputs = "", disc.input_shape_10 = dense<0> : tensor<2xi32>, disc.input_shape_11 = dense<0> : tensor<2xi32>, disc.input_shape_12 = dense<0> : tensor<2xi32>, disc.input_shape_13 = dense<0> : tensor<2xi32>, disc.input_shape_14 = dense<0> : tensor<2xi32>, disc.input_shape_15 = dense<0> : tensor<2xi32>, disc.input_shape_16 = dense<0> : tensor<2xi32>, disc.input_shape_17 = dense<0> : tensor<2xi32>, disc.input_shape_18 = dense<0> : tensor<2xi32>, disc.input_shape_19 = dense<0> : tensor<2xi32>, disc.input_shape_20 = dense<0> : tensor<2xi32>, disc.input_shape_6 = dense<0> : tensor<1xi32>, disc.input_shape_7 = dense<0> : tensor<1xi32>, disc.input_shape_8 = dense<0> : tensor<1xi32>, disc.input_shape_9 = dense<0> : tensor<2xi32>, disc.input_value_0 = dense<> : tensor<0xi32>, disc.input_value_1 = dense<> : tensor<0xi32>, disc.input_value_2 = dense<> : tensor<0xi32>, disc.input_value_3 = dense<[0, 1]> : tensor<2xi32>, disc.input_value_4 = dense<0> : tensor, disc.input_value_5 = dense<-1> : tensor, input_placements = "gpu,gpu,gpu,gpu,gpu,gpu,cpu,cpu,cpu,cpu,cpu,cpu,cpu,cpu,cpu,cpu,cpu,cpu,cpu,cpu,cpu,cpu,cpu,cpu,cpu,gpu,gpu,gpu,gpu,gpu,gpu,gpu,gpu,gpu,gpu,gpu,gpu", inputs = 
"gradients_graph_model_gnn_layer_3_mul_grad_broadcastgradientargs_1_arg,gradients_graph_model_gnn_layer_3_mul_1_grad_broadcastgradientargs_1_arg,gradients_graph_model_gnn_layer_3_mul_2_grad_broadcastgradientargs_1_arg,graph_model_gnn_layer_0_strided_slice_1_stack_tao_declustered_0_arg,graph_model_gnn_layer_0_concat_axis_tao_declustered_0_arg,graph_model_gnn_layer_0_expanddims_dim_tao_declustered_0_arg,gradients_graph_model_gnn_layer_0_embedding_lookup_grad_expanddims_0_arg,gradients_graph_model_gnn_layer_0_embedding_lookup_2_grad_expanddims_0_arg,gradients_graph_model_gnn_layer_0_embedding_lookup_4_grad_expanddims_0_arg,gradients_graph_model_gnn_layer_3_concat_1_grad_concatoffset_0_arg,gradients_graph_model_gnn_layer_3_concat_1_grad_shapen_0_arg,gradients_graph_model_gnn_layer_3_concat_1_grad_concatoffset_1_arg,gradients_graph_model_gnn_layer_3_concat_1_grad_shapen_1_arg,gradients_graph_model_gnn_layer_3_concat_1_grad_concatoffset_2_arg,gradients_graph_model_gnn_layer_3_concat_1_grad_shapen_2_arg,gradients_graph_model_gnn_layer_3_mul_grad_shape_1_0_arg,gradients_graph_model_gnn_layer_3_mul_1_grad_shape_1_0_arg,gradients_graph_model_gnn_layer_3_mul_2_grad_shape_1_0_arg,gradients_graph_model_gnn_layer_3_embedding_lookup_grad_concat_0_arg,gradients_graph_model_gnn_layer_3_embedding_lookup_2_grad_concat_0_arg,gradients_graph_model_gnn_layer_3_embedding_lookup_4_grad_concat_0_arg,graph_model_gnn_layer_0_strided_slice_7_tao_declustered_0_arg,graph_model_gnn_layer_0_strided_slice_4_tao_declustered_0_arg,graph_model_gnn_layer_0_strided_slice_10_tao_declustered_0_arg,graph_model_gnn_layer_0_concat_tao_declustered_0_arg,gradients_graph_model_gnn_layer_3_relu_grad_relugrad_0_arg,graph_model_gnn_layer_3_expanddims_0_arg,graph_model_gnn_layer_3_expanddims_1_0_arg,graph_model_gnn_layer_3_expanddims_2_0_arg,graph_model_gnn_layer_3_edge_0_weight_matmul_readvariableop_0_arg,graph_model_gnn_layer_3_embedding_lookup_0_arg,graph_model_gnn_layer_3_edge_1_weight_matmul_readvariableop_0_arg,graph_model_gnn_layer_3_embedding_lookup_2_0_arg,graph_model_gnn_layer_3_edge_2_weight_matmul_readvariableop_0_arg,graph_model_gnn_layer_3_embedding_lookup_4_0_arg,clip_by_norm_10_greater_y_0_arg,clip_by_norm_10_ones_like_0_arg", output_placements = "cpu,cpu,gpu,gpu,gpu,gpu,gpu,gpu,gpu,gpu,gpu,gpu,gpu,gpu,gpu,gpu,gpu", outputs = "gradients/concat_7:0,gradients/graph_model/gnn_layer_0/UnsortedSegmentSum_grad/Maximum:0,gradients/concat:0,clip_by_norm_11/Select:0,clip_by_norm_12/Select:0,clip_by_norm_13/Select:0,clip_by_norm_11/Greater:0,clip_by_norm_11/Sum:0,clip_by_norm_12/Greater:0,clip_by_norm_12/Sum:0,clip_by_norm_13/Greater:0,clip_by_norm_13/Sum:0,gradients/graph_model/gnn_layer_3/Edge_0_Weight/MatMul_grad/MatMul_1:0,gradients/graph_model/gnn_layer_3/Edge_1_Weight/MatMul_grad/MatMul_1:0,gradients/graph_model/gnn_layer_3/Edge_2_Weight/MatMul_grad/MatMul_1:0,gradients/graph_model/gnn_layer_3/UnsortedSegmentSum_grad/ones_like/Const:0,gradients/graph_model/gnn_layer_0/UnsortedSegmentSum_grad/ExpandDims:0"}} {
%cst = "tf.Const"() {value = dense<-1> : tensor} : () -> tensor loc(#loc0)
%cst_0 = "tf.Const"() {value = dense<0> : tensor} : () -> tensor loc(#loc0)
%cst_1 = "tf.Const"() {value = dense<[0, 1]> : tensor<2xi32>} : () -> tensor<2xi32> loc(#loc0)
%cst_2 = "tf.Const"() {value = dense : tensor} : () -> tensor loc(#loc1)
%0 = "tf.GreaterEqual"(%arg24, %cst_0) {_XlaAlreadyClustered = true, device = "/job:localhost/replica:0/task:0/device:GPU:0"} : (tensor<?xi32>, tensor) -> tensor<?xi1> loc(#loc2)
%1 = "tf.ZerosLike"(%arg24) {_XlaAlreadyClustered = true, device = "/job:localhost/replica:0/task:0/device:GPU:0"} : (tensor<?xi32>) -> tensor<?xi32> loc(#loc3)
%2 = "tf.Maximum"(%arg24, %1) {_XlaAlreadyClustered = true, device = "/job:localhost/replica:0/task:0/device:GPU:0"} : (tensor<?xi32>, tensor<?xi32>) -> tensor<?xi32> loc(#loc4)
%3 = "tf.GatherV2"(%arg25, %2, %cst_0) {_XlaAlreadyClustered = true, batch_dims = 0 : i64, device = "/job:localhost/replica:0/task:0/device:GPU:0"} : (tensor<?x?xf32>, tensor<?xi32>, tensor) -> tensor<?x?xf32> loc(#loc5)
%4 = "tf.Shape"(%3) {_XlaAlreadyClustered = true, device = "/job:localhost/replica:0/task:0/device:GPU:0"} : (tensor<?x?xf32>) -> tensor<2xi32> loc(#loc6)
%5 = "tf.Fill"(%4, %cst_2) {_XlaAlreadyClustered = true, device = "/job:localhost/replica:0/task:0/device:GPU:0"} : (tensor<2xi32>, tensor) -> tensor<?x?xi1> loc(#loc7)
%6 = "tf.ZerosLike"(%3) {_XlaAlreadyClustered = true, device = "/job:localhost/replica:0/task:0/device:GPU:0"} : (tensor<?x?xf32>) -> tensor<?x?xf32> loc(#loc8)
%7 = "tf.ExpandDims"(%0, %cst) {_XlaAlreadyClustered = true, device = "/job:localhost/replica:0/task:0/device:GPU:0"} : (tensor<?xi1>, tensor) -> tensor<?x1xi1> loc(#loc9)
%8 = "tf.LogicalAnd"(%7, %5) {_XlaAlreadyClustered = true, device = "/job:localhost/replica:0/task:0/device:GPU:0"} : (tensor<?x1xi1>, tensor<?x?xi1>) -> tensor<?x?xi1> loc(#loc10)
%9 = "tf.Select"(%8, %3, %6) {_XlaAlreadyClustered = true, device = "/job:localhost/replica:0/task:0/device:GPU:0"} : (tensor<?x?xi1>, tensor<?x?xf32>, tensor<?x?xf32>) -> tensor<?x?xf32> loc(#loc11)
%10 = "tf.Slice"(%9, %arg9, %arg10) {_XlaAlreadyClustered = true, device = "/job:localhost/replica:0/task:0/device:GPU:0"} : (tensor<?x?xf32>, tensor<2xi32>, tensor<2xi32>) -> tensor<?x?xf32> loc(#loc12)
%11 = "tf.Slice"(%9, %arg11, %arg12) {_XlaAlreadyClustered = true, device = "/job:localhost/replica:0/task:0/device:GPU:0"} : (tensor<?x?xf32>, tensor<2xi32>, tensor<2xi32>) -> tensor<?x?xf32> loc(#loc13)
%12 = "tf.Slice"(%9, %arg13, %arg14) {_XlaAlreadyClustered = true, device = "/job:localhost/replica:0/task:0/device:GPU:0"} : (tensor<?x?xf32>, tensor<2xi32>, tensor<2xi32>) -> tensor<?x?xf32> loc(#loc14)
%13 = "tf.Reshape"(%arg23, %arg8) {_XlaAlreadyClustered = true, device = "/job:localhost/replica:0/task:0/device:GPU:0"} : (tensor<?xi32>, tensor<1xi32>) -> tensor<?xi32> loc(#loc15)
%14 = "tf.Reshape"(%arg22, %arg6) {_XlaAlreadyClustered = true, device = "/job:localhost/replica:0/task:0/device:GPU:0"} : (tensor<?xi32>, tensor<1xi32>) -> tensor<?xi32> loc(#loc16)
%15 = "tf.Reshape"(%arg21, %arg7) {_XlaAlreadyClustered = true, device = "/job:localhost/replica:0/task:0/device:GPU:0"} : (tensor<?xi32>, tensor<1xi32>) -> tensor<?xi32> loc(#loc17)
%16 = "tf.ConcatV2"(%14, %15, %13, %cst_0) {_XlaAlreadyClustered = true, device = "/job:localhost/replica:0/task:0/device:GPU:0"} : (tensor<?xi32>, tensor<?xi32>, tensor<?xi32>, tensor) -> tensor<?xi32> loc(#loc18)
%17 = "tf.Mul"(%10, %arg26) {_XlaAlreadyClustered = true, device = "/job:localhost/replica:0/task:0/device:GPU:0"} : (tensor<?x?xf32>, tensor<?x?xf32>) -> tensor<?x?xf32> loc(#loc19)
%18 = "tf.Reshape"(%17, %arg15) {_XlaAlreadyClustered = true, device = "/job:localhost/replica:0/task:0/device:GPU:0"} : (tensor<?x?xf32>, tensor<2xi32>) -> tensor<?x?xf32> loc(#loc20)
%19 = "tf.MatMul"(%18, %arg29) {_XlaAlreadyClustered = true, device = "/job:localhost/replica:0/task:0/device:GPU:0", transpose_a = false, transpose_b = true} : (tensor<?x?xf32>, tensor<?x?xf32>) -> tensor<?x?xf32> loc(#loc21)
%20 = "tf.MatMul"(%arg30, %18) {_XlaAlreadyClustered = true, device = "/job:localhost/replica:0/task:0/device:GPU:0", transpose_a = true, transpose_b = false} : (tensor<?x?xf32>, tensor<?x?xf32>) -> tensor<?x?xf32> loc(#loc22)
%21 = "tf.Square"(%20) {_XlaAlreadyClustered = true, device = "/job:localhost/replica:0/task:0/device:GPU:0"} : (tensor<?x?xf32>) -> tensor<?x?xf32> loc(#loc23)
%22 = "tf.Sum"(%21, %cst_1) {_XlaAlreadyClustered = true, device = "/job:localhost/replica:0/task:0/device:GPU:0", keep_dims = true} : (tensor<?x?xf32>, tensor<2xi32>) -> tensor<1x1xf32> loc(#loc24)
%23 = "tf.Greater"(%22, %arg35) {_XlaAlreadyClustered = true, device = "/job:localhost/replica:0/task:0/device:GPU:0"} : (tensor<1x1xf32>, tensor) -> tensor<1x1xi1> loc(#loc25)
%24 = "tf.Select"(%23, %22, %arg36) {_XlaAlreadyClustered = true, device = "/job:localhost/replica:0/task:0/device:GPU:0"} : (tensor<1x1xi1>, tensor<1x1xf32>, tensor<?x?xf32>) -> tensor<1x1xf32> loc(#loc26)
%25 = "tf.Reshape"(%19, %arg18) {_XlaAlreadyClustered = true, device = "/job:localhost/replica:0/task:0/device:GPU:0"} : (tensor<?x?xf32>, tensor<2xi32>) -> tensor<?x?xf32> loc(#loc27)
%26 = "tf.Mul"(%11, %arg27) {_XlaAlreadyClustered = true, device = "/job:localhost/replica:0/task:0/device:GPU:0"} : (tensor<?x?xf32>, tensor<?x?xf32>) -> tensor<?x?xf32> loc(#loc28)
%27 = "tf.Reshape"(%26, %arg16) {_XlaAlreadyClustered = true, device = "/job:localhost/replica:0/task:0/device:GPU:0"} : (tensor<?x?xf32>, tensor<2xi32>) -> tensor<?x?xf32> loc(#loc29)
%28 = "tf.MatMul"(%27, %arg31) {_XlaAlreadyClustered = true, device = "/job:localhost/replica:0/task:0/device:GPU:0", transpose_a = false, transpose_b = true} : (tensor<?x?xf32>, tensor<?x?xf32>) -> tensor<?x?xf32> loc(#loc30)
%29 = "tf.MatMul"(%arg32, %27) {_XlaAlreadyClustered = true, device = "/job:localhost/replica:0/task:0/device:GPU:0", transpose_a = true, transpose_b = false} : (tensor<?x?xf32>, tensor<?x?xf32>) -> tensor<?x?xf32> loc(#loc31)
%30 = "tf.Square"(%29) {_XlaAlreadyClustered = true, device = "/job:localhost/replica:0/task:0/device:GPU:0"} : (tensor<?x?xf32>) -> tensor<?x?xf32> loc(#loc32)
%31 = "tf.Sum"(%30, %cst_1) {_XlaAlreadyClustered = true, device = "/job:localhost/replica:0/task:0/device:GPU:0", keep_dims = true} : (tensor<?x?xf32>, tensor<2xi32>) -> tensor<1x1xf32> loc(#loc33)
%32 = "tf.Greater"(%31, %arg35) {_XlaAlreadyClustered = true, device = "/job:localhost/replica:0/task:0/device:GPU:0"} : (tensor<1x1xf32>, tensor) -> tensor<1x1xi1> loc(#loc34)
%33 = "tf.Select"(%32, %31, %arg36) {_XlaAlreadyClustered = true, device = "/job:localhost/replica:0/task:0/device:GPU:0"} : (tensor<1x1xi1>, tensor<1x1xf32>, tensor<?x?xf32>) -> tensor<1x1xf32> loc(#loc35)
%34 = "tf.Reshape"(%28, %arg19) {_XlaAlreadyClustered = true, device = "/job:localhost/replica:0/task:0/device:GPU:0"} : (tensor<?x?xf32>, tensor<2xi32>) -> tensor<?x?xf32> loc(#loc36)
%35 = "tf.Mul"(%12, %arg28) {_XlaAlreadyClustered = true, device = "/job:localhost/replica:0/task:0/device:GPU:0"} : (tensor<?x?xf32>, tensor<?x?xf32>) -> tensor<?x?xf32> loc(#loc37)
%36 = "tf.Reshape"(%35, %arg17) {_XlaAlreadyClustered = true, device = "/job:localhost/replica:0/task:0/device:GPU:0"} : (tensor<?x?xf32>, tensor<2xi32>) -> tensor<?x?xf32> loc(#loc38)
%37 = "tf.MatMul"(%36, %arg33) {_XlaAlreadyClustered = true, device = "/job:localhost/replica:0/task:0/device:GPU:0", transpose_a = false, transpose_b = true} : (tensor<?x?xf32>, tensor<?x?xf32>) -> tensor<?x?xf32> loc(#loc39)
%38 = "tf.MatMul"(%arg34, %36) {_XlaAlreadyClustered = true, device = "/job:localhost/replica:0/task:0/device:GPU:0", transpose_a = true, transpose_b = false} : (tensor<?x?xf32>, tensor<?x?xf32>) -> tensor<?x?xf32> loc(#loc40)
%39 = "tf.Square"(%38) {_XlaAlreadyClustered = true, device = "/job:localhost/replica:0/task:0/device:GPU:0"} : (tensor<?x?xf32>) -> tensor<?x?xf32> loc(#loc41)
%40 = "tf.Sum"(%39, %cst_1) {_XlaAlreadyClustered = true, device = "/job:localhost/replica:0/task:0/device:GPU:0", keep_dims = true} : (tensor<?x?xf32>, tensor<2xi32>) -> tensor<1x1xf32> loc(#loc42)
%41 = "tf.Greater"(%40, %arg35) {_XlaAlreadyClustered = true, device = "/job:localhost/replica:0/task:0/device:GPU:0"} : (tensor<1x1xf32>, tensor) -> tensor<1x1xi1> loc(#loc43)
%42 = "tf.Select"(%41, %40, %arg36) {_XlaAlreadyClustered = true, device = "/job:localhost/replica:0/task:0/device:GPU:0"} : (tensor<1x1xi1>, tensor<1x1xf32>, tensor<?x?xf32>) -> tensor<1x1xf32> loc(#loc44)
%43 = "tf.Reshape"(%37, %arg20) {_XlaAlreadyClustered = true, device = "/job:localhost/replica:0/task:0/device:GPU:0"} : (tensor<?x?xf32>, tensor<2xi32>) -> tensor<?x?xf32> loc(#loc45)
%44 = "tf.ConcatV2"(%25, %34, %43, %cst_0) {_XlaAlreadyClustered = true, device = "/job:localhost/replica:0/task:0/device:GPU:0"} : (tensor<?x?xf32>, tensor<?x?xf32>, tensor<?x?xf32>, tensor) -> tensor<?x?xf32> loc(#loc46)
return %16, %2, %44, %24, %33, %42, %23, %22, %32, %31, %41, %40, %20, %29, %38, %cst_2, %7 : tensor<?xi32>, tensor<?xi32>, tensor<?x?xf32>, tensor<1x1xf32>, tensor<1x1xf32>, tensor<1x1xf32>, tensor<1x1xi1>, tensor<1x1xf32>, tensor<1x1xi1>, tensor<1x1xf32>, tensor<1x1xi1>, tensor<1x1xf32>, tensor<?x?xf32>, tensor<?x?xf32>, tensor<?x?xf32>, tensor, tensor<?x1xi1> loc(#loc0)
} loc(#loc0)
} loc(#loc0)
#loc1 = loc("gradients/graph_model/gnn_layer_3/UnsortedSegmentSum_grad/ones_like/Const")
#loc2 = loc("gradients/graph_model/gnn_layer_0/UnsortedSegmentSum_grad/GreaterEqual")
#loc3 = loc("gradients/graph_model/gnn_layer_0/UnsortedSegmentSum_grad/zeros_like")
#loc4 = loc("gradients/graph_model/gnn_layer_0/UnsortedSegmentSum_grad/Maximum")
#loc5 = loc("gradients/graph_model/gnn_layer_3/UnsortedSegmentSum_grad/GatherV2")
#loc6 = loc("gradients/graph_model/gnn_layer_3/UnsortedSegmentSum_grad/ones_like/Shape")
#loc7 = loc("gradients/graph_model/gnn_layer_3/UnsortedSegmentSum_grad/ones_like")
#loc8 = loc("gradients/graph_model/gnn_layer_3/UnsortedSegmentSum_grad/zeros_like_1")
#loc9 = loc("gradients/graph_model/gnn_layer_0/UnsortedSegmentSum_grad/ExpandDims")
#loc10 = loc("gradients/graph_model/gnn_layer_3/UnsortedSegmentSum_grad/and")
#loc11 = loc("gradients/graph_model/gnn_layer_3/UnsortedSegmentSum_grad/Select")
#loc12 = loc("gradients/graph_model/gnn_layer_3/concat_1_grad/Slice")
#loc13 = loc("gradients/graph_model/gnn_layer_3/concat_1_grad/Slice_1")
#loc14 = loc("gradients/graph_model/gnn_layer_3/concat_1_grad/Slice_2")
#loc15 = loc("gradients/graph_model/gnn_layer_0/embedding_lookup_4_grad/Reshape_1")
#loc16 = loc("gradients/graph_model/gnn_layer_0/embedding_lookup_grad/Reshape_1")
#loc17 = loc("gradients/graph_model/gnn_layer_0/embedding_lookup_2_grad/Reshape_1")
#loc18 = loc("gradients/concat_7")
#loc19 = loc("gradients/graph_model/gnn_layer_3/mul_grad/Mul_1")
#loc20 = loc("gradients/graph_model/gnn_layer_3/mul_grad/Reshape_1")
#loc21 = loc("gradients/graph_model/gnn_layer_3/Edge_0_Weight/MatMul_grad/MatMul")
#loc22 = loc("gradients/graph_model/gnn_layer_3/Edge_0_Weight/MatMul_grad/MatMul_1")
#loc23 = loc("clip_by_norm_11/ArithmeticOptimizer/ReplaceMulWithSquare_mul")
#loc24 = loc("clip_by_norm_11/Sum")
#loc25 = loc("clip_by_norm_11/Greater")
#loc26 = loc("clip_by_norm_11/Select")
#loc27 = loc("gradients/graph_model/gnn_layer_3/embedding_lookup_grad/Reshape")
#loc28 = loc("gradients/graph_model/gnn_layer_3/mul_1_grad/Mul_1")
#loc29 = loc("gradients/graph_model/gnn_layer_3/mul_1_grad/Reshape_1")
#loc30 = loc("gradients/graph_model/gnn_layer_3/Edge_1_Weight/MatMul_grad/MatMul")
#loc31 = loc("gradients/graph_model/gnn_layer_3/Edge_1_Weight/MatMul_grad/MatMul_1")
#loc32 = loc("clip_by_norm_12/ArithmeticOptimizer/ReplaceMulWithSquare_mul")
#loc33 = loc("clip_by_norm_12/Sum")
#loc34 = loc("clip_by_norm_12/Greater")
#loc35 = loc("clip_by_norm_12/Select")
#loc36 = loc("gradients/graph_model/gnn_layer_3/embedding_lookup_2_grad/Reshape")
#loc37 = loc("gradients/graph_model/gnn_layer_3/mul_2_grad/Mul_1")
#loc38 = loc("gradients/graph_model/gnn_layer_3/mul_2_grad/Reshape_1")
#loc39 = loc("gradients/graph_model/gnn_layer_3/Edge_2_Weight/MatMul_grad/MatMul")
#loc40 = loc("gradients/graph_model/gnn_layer_3/Edge_2_Weight/MatMul_grad/MatMul_1")
#loc41 = loc("clip_by_norm_13/ArithmeticOptimizer/ReplaceMulWithSquare_mul")
#loc42 = loc("clip_by_norm_13/Sum")
#loc43 = loc("clip_by_norm_13/Greater")
#loc44 = loc("clip_by_norm_13/Select")
#loc45 = loc("gradients/graph_model/gnn_layer_3/embedding_lookup_4_grad/Reshape")
#loc46 = loc("gradients/concat")

How to reproduce:
save the above string to select.mlir
tf-opt -xla-legalize-tf select.mlir
I cannot attach the input proto; I can upload it to the DingTalk group if needed.

What is the solution (please correct me if I'm wrong)?

  1. In ConvertSelectOp, return failure when the then or else operand has a dynamic shape?
  2. Infer the then or else operand's shape before createLegalizeTFPass, or in ConvertSelectOp?

Unify TorchBlade and TorchDISC

The project TorchDISC for training is under heavy development. It will reuse many converter passes that already exist in TorchBlade.
Unifying the TorchBlade and TorchDISC code and design will benefit the maintenance of the project in the long term.
This issue is created to track the related items.

git am error when build with cxx11 abi

In scripts/python/tao_build.py, a git patch is applied when building with the cxx11 ABI. However, this patch is only needed for GCC 7.3.0. Since we currently build the cxx11 ABI version with GCC 7.5.0 on Ubuntu 18.04, we no longer need the patch; thus, remove it.

Cannot access pai_bazel

Met an exception when building tao_compiler_main.
Log:
2021-12-24 04:16:17,935 ERROR bazel_build failed on exception
Traceback (most recent call last):
File "scripts/python/tao_build.py", line 71, in wrapper
File "scripts/python/tao_build.py", line 341, in bazel_build
File "/disc2/Code/aicompiler/scripts/python/tao_common.py", line 102, in execute
subprocess.check_call(shell_setting + cmd, shell=True, executable='/bin/bash')
File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'set -e; set -o pipefail; BAZELISK_BASE_URL=http://hlomodule.oss-cn-zhangjiakou.aliyuncs.com/tao_compiler/release/pai_bazel BAZEL_CUSTOM_SNAPSHOT_CLIENT=/disc2/Code/aicompiler/scripts/python/bazel_snapshot_client.py BAZEL_HACK_LOG_GIT_CMD_TO=/tmp/git_wrapper.log BAZEL_HACK_REDIRECT_URL=https://github.com,http://gitlab.alibaba-inc.com bazel build --experimental_multi_threaded_digest --define framework_shared_object=false --config=cuda //tensorflow/compiler/decoupling:tao_compiler_main' returned non-zero exit status 36.

It looks like http://hlomodule.oss-cn-zhangjiakou.aliyuncs.com/tao_compiler/release/pai_bazel is private. It should be public.

Install from source without docker

Hi, we are trying to use Spack to build and install BladeDISC without Docker; however, we are facing some problems.

  1. Spack installs tensorflow from source, and protobuf is installed separately, so the detection logic in FindTensorflow.cmake is broken.
  2. The Bazel version in the bundled tensorflow is >4.2.2, while the supported tf2.4/2.5 requires Bazel 3.7.2.

Can you give us some instructions on how to build BladeDISC outside Docker?

[BUG] torch_blade core dumped in CUDA 11.3

Core dump with the following command:

python3 -c 'import torch_blade'

The stack trace:

#0  0x00007fc490db4387 in raise () from /lib64/libc.so.6
#1  0x00007fc490db5a78 in abort () from /lib64/libc.so.6
#2  0x00007fc490df6f67 in __libc_message () from /lib64/libc.so.6
#3  0x00007fc490dff329 in _int_free () from /lib64/libc.so.6
#4  0x00007fc48ac9ee23 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string() () from /lib64/libstdc++.so.6
#5  0x00007fc40aa770a7 in llvm::cl::opt<std::string, false, llvm::cl::parser<std::string> >::~opt() ()
   from /usr/local/lib64/python3.6/site-packages/torch/lib/libtorch_cpu.so
#6  0x00007fc490db7ce9 in __run_exit_handlers () from /lib64/libc.so.6
#7  0x00007fc490db7d37 in exit () from /lib64/libc.so.6
#8  0x00007fc490da055c in __libc_start_main () from /lib64/libc.so.6
#9  0x0000000000400c40 in _start ()

It looks like a symbol conflict, since mhlo_disc_builder.so has llvm::cl::opt<std::string, false, llvm::cl::parser<std::string> >::~opt() exposed.

[PoC] TorchDisc: accelerating PyTorch training via LTC + BladeDISC

Background

BladeDISC is an end-to-end compiler that supports dynamic shape features, and dynamic shapes are widely used in training scenarios. This issue describes how to improve PyTorch training performance with DISC based on the LazyTensorCore (LTC) mechanism.

feature branch: https://github.com/alibaba/BladeDISC/tree/features/torch_disc_devel

Design Overview


  1. According to LTC, a MARK API should be called manually at the end of each iteration to sync and execute a graph on a physical device.
  2. Lowering to TorchScript: LTC uses TorchScript as the backend engine (ref TSBackendImpl); we can use it to lower Lazy IR to TorchScript IR.
  3. Cluster DISC SubGraph.
  4. DISC Compilation Stage
    a. MHLO conversion: DISC uses MLIR::mhlo as the front end, so we convert TorchScript IR to MHLO before compilation.
    b. Compiling to an executable program: call the DISC entry function to compile the MHLO IR into an executable file (a dynamic library).
    c. DISC execution: call DISC RAL to execute the executable program with the input Tensors.
  5. TorchScript Execution: finally call torch::jit::GraphExecutor to execute the TorchScript IR and return the result Tensors.

Implement and TODO Actions

To implement the above features, we should build a pybind library _torch_disc.so that exposes the step_mark API along with some important C++ functions. The TODO actions are as follows:

  • set up a build environment to build _torch_disc.so with Torch LTC, Mhlo Builder, and DISC. #158
  • cluster DISC-compilable nodes into a sub-graph (maybe implement the cluster algorithm with a fake function at first). #173
  • compile the DISC sub-graph and register the DISC engine. #188
  • demonstrate TorchDISC with MNIST training. #207 #230

Reference

  1. PyTorch LazyTensor branch: https://github.com/pytorch/pytorch/tree/lazy_tensor_staging/lazy_tensor_core
  2. PyTorch/XLA backend example: https://github.com/pytorch/xla/tree/asuhan/xla_ltc_plugin

blade_classifier in example takes longer time than baseline?

Hi, I followed the tutorial and got a nearly 4x speedup at the benchmark part:

Seqlen: [40, 26]
Baseline: 12.270253419876099 ms
BladeDISC: 3.037029981613159 ms
BladeDISC speedup: 4.0402147802830015

But I got slow prediction performance:

The model 'RecursiveScriptModule' is not supported for sentiment-analysis. Supported models are ['YosoForSequenceClassification', 'NystromformerForSequenceClassification', 'PLBartForSequenceClassification', 'PerceiverForSequenceClassification', 'QDQBertForSequenceClassification', 'FNetForSequenceClassification', 'GPTJForSequenceClassification', 'LayoutLMv2ForSequenceClassification', 'RemBertForSequenceClassification', 'CanineForSequenceClassification', 'RoFormerForSequenceClassification', 'BigBirdPegasusForSequenceClassification', 'BigBirdForSequenceClassification', 'ConvBertForSequenceClassification', 'LEDForSequenceClassification', 'DistilBertForSequenceClassification', 'AlbertForSequenceClassification', 'CamembertForSequenceClassification', 'XLMRobertaXLForSequenceClassification', 'XLMRobertaForSequenceClassification', 'MBartForSequenceClassification', 'BartForSequenceClassification', 'LongformerForSequenceClassification', 'RobertaForSequenceClassification', 'Data2VecTextForSequenceClassification', 'SqueezeBertForSequenceClassification', 'LayoutLMForSequenceClassification', 'BertForSequenceClassification', 'XLNetForSequenceClassification', 'MegatronBertForSequenceClassification', 'MobileBertForSequenceClassification', 'FlaubertForSequenceClassification', 'XLMForSequenceClassification', 'ElectraForSequenceClassification', 'FunnelForSequenceClassification', 'DebertaForSequenceClassification', 'DebertaV2ForSequenceClassification', 'GPT2ForSequenceClassification', 'GPTNeoForSequenceClassification', 'OpenAIGPTForSequenceClassification', 'ReformerForSequenceClassification', 'CTRLForSequenceClassification', 'TransfoXLForSequenceClassification', 'MPNetForSequenceClassification', 'TapasForSequenceClassification', 'IBertForSequenceClassification'].
Predict with Baseline PyTorch model:
I really like the new design of your website!
 label: 5 stars, with a score: 0.6289
I'm not sure if I like the new design.
 label: 3 stars, with a score: 0.5859
The new design is awful!
 label: 1 star, with a score: 0.8435
It will be awesome if you give us feedback!
 label: 5 stars, with a score: 0.6736
cost: 0.05647158622741699
Predict with BladeDISC optimized model:
I really like the new design of your website!
 label: 5 stars, with a score: 0.6289
I'm not sure if I like the new design.
 label: 3 stars, with a score: 0.5859
The new design is awful!
 label: 1 star, with a score: 0.8435
It will be awesome if you give us feedback!
 label: 5 stars, with a score: 0.6736
cost: 25.09628677368164

My code is:

import torch
import torch_blade

from transformers import (
 pipeline,
 AutoTokenizer,
 AutoModelForSequenceClassification,
 TextClassificationPipeline,
)

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
# get tokenizer from HuggingFace
tokenizer = AutoTokenizer.from_pretrained(model_name)

# place model to cuda and set it to evaluate mode
model = AutoModelForSequenceClassification.from_pretrained(model_name).cuda().eval()

def plain_tokenizer(inputs_str, return_tensors):
  inputs = tokenizer(inputs_str, return_tensors=return_tensors, padding=True)
  inputs = dict(map(lambda x: (x[0], x[1].cuda()), inputs.items()))

  return (inputs['input_ids'].cuda(), inputs['attention_mask'].cuda(), inputs['token_type_ids'].cuda(),)

class PlainTextClassificationPipeline(TextClassificationPipeline):
  def _forward(self, model_inputs):
    return self.model(*model_inputs)

# build a sentiment analysis classifier pipeline
classifier = pipeline('sentiment-analysis',
           model=model,
           tokenizer=plain_tokenizer,
           pipeline_class=PlainTextClassificationPipeline,
           device=0)

import time
optimized_ts = torch.jit.load("opt.disc.pt")

optimized_ts.config = model.config
# build a sentiment analysis classifier pipeline
blade_classifier = pipeline('sentiment-analysis',
              model=optimized_ts,
              tokenizer=plain_tokenizer,
              pipeline_class=PlainTextClassificationPipeline,
              device=0)

input_strs = ["I really like the new design of your website!",
       "I'm not sure if I like the new design.",
       "The new design is awful!",
       "It will be awesome if you give us feedback!",]

print("Predict with Baseline PyTorch model:")
start = time.time()
results = classifier(input_strs)
for inp_str, result in zip(input_strs, results):
  print(inp_str)
  print(f" label: {result['label']}, with a score: {round(result['score'], 4)}")
end = time.time()
print(f"cost: {end-start}")

print("Predict with BladeDISC optimized model:")
start = time.time()
results = blade_classifier(input_strs)
for inp_str, result in zip(input_strs, results):
  print(inp_str)
  print(f" label: {result['label']}, with a score: {round(result['score'], 4)}")
end = time.time()
print(f"cost: {end-start}")

What caused the problem? Did I miss anything?

env is:
Docker Image: bladedisc/bladedisc:latest-runtime-torch1.7.1
Nvidia Driver 495.29.05
CUDA 11.0
CuDNN 8.2.1

rename `torch.optimize` and `torch.export`

  • Name torch.optimize and torch.export with nouns since they are modules
  • Alias function torch.optimize to torch.optimization.optimize after renaming module torch.optimize to torch.optimization

Protoc Download Failed


The protoc package download failed when setting up BladeDISC in a Docker container. It should be a firewall issue. Please add an Aliyun cache, thank you.

`addmm` op caused an unchanged loss

In #207, I added the addmm op to the blacklist for lowering to Disc, because it caused the training loss to remain unchanged. It seems to be a bug; we should debug and fix it.

add TorchDisc cluster algorithm to cluster disc compilable nodes

In the TorchDisc PoC implementation, we implemented a FakeCluster that lowers a single TorchScript node to Disc. In a production environment, however, we should cluster all Disc-compilable nodes into one cluster instead of single-node clusters, so that we can utilize the fusion and high-performance code generation in the Disc compiler to maximize performance.

Some exciting cluster algorithms in BladeDisc:

Survey DISC compilation on TorchScript graphs that are dumped from LazyTensor.

We are now able to dump the TorchScript computation graph traced via LazyTensor, including the backward and gradient-updating subgraphs. With the full graph dumped, we can reuse the DISC compilation flow used for inference.

This issue is the placeholder for all the related tasks. Actions:

  • Add utilities to help dump TorchScript computation graph traced by LazyTensor #286
  • Dump TSBackend graphs and inputs for debugging and survey

Unify the usage of `#pragma once` and `#ifndef` guards

Please unify the usage of #pragma once and #ifndef guards.

It's preferred to use #ifndef guards in general cases. However, #pragma once is more frequently used in PyTorch C++ header files. So, it seems reasonable that TorchBlade should follow the convention of PyTorch.

document structure [draft]

  • BladeDISC introduction
  • Get Started
    • quick start for tensorflow wrapper @Yancey1989
    • quick start for pytorch wrapper
  • Setup BladeDISC
    • Installation with Docker
    • Build from source
      • CPU backend
      • Nvidia CUDA backend
      • AMD ROCm backend
  • Tutorials @JamesTheZ
    • tensorflow train demo
    • tensorflow prediction demo
    • pytorch prediction
  • Contribution @Yancey1989
    • how to contribute to BladeDISC
    • how to add a new custom-call op
    • how to add the support of a new torch Op @fortianyou
  • Designs
  • FAQ
    • relations with MHLO community
    • relations with torch-mlir community
  • Publications
    • link of paper...

Enable CPU backend on CI pipeline

The CI run fails with the CPU device on the current codebase. This issue records the steps to reproduce these errors; we should fix them in follow-up PRs.

  1. Launch development Docker container

    docker pull bladedisc/bladedisc:latest-devel-cuda10.0
    docker run --rm -it -v $PWD:/workspace -w /workspace bladedisc/bladedisc:latest-devel-cuda10.0 bash
  2. Execute the command to run unit tests

    python scripts/python/tao_build.py /opt/venv_disc/ -s configure --bridge-gcc default --compiler-gcc default --cpu_only
    python scripts/python/tao_build.py /opt/venv_disc/ -s build_tao_compiler --cpu_only
    python scripts/python/tao_build.py /opt/venv_disc/ -s test_tao_compiler --cpu_only 

In total, 7 unit tests failed. A snippet of the error message from sub.cc.test is as follows; I have also uploaded the error log file here:

object file to shared library command: gcc --shared -o /home/github/.cache/bazel/_bazel_github/ae143f62e65da17ca5d5f97d2dc3de17/execroot/org_tensorflow/_tmp/f0a918c4823924815dc3c918baef281fImplicitBroadcast2DF32.so /home/github/.cache/bazel/_bazel_github/ae143f62e65da17ca5d5f97d2dc3de17/execroot/org_tensorflow/_tmp/f0a918c4823924815dc3c918baef281fImplicitBroadcast2DF32.so.o
/usr/bin/ld: /home/github/.cache/bazel/_bazel_github/ae143f62e65da17ca5d5f97d2dc3de17/execroot/org_tensorflow/_tmp/f0a918c4823924815dc3c918baef281fImplicitBroadcast2DF32.so: version node not found for symbol packed_@K&y^KV
/usr/bin/ld: failed to set dynamic section sizes: Bad value
collect2: error: ld returned 1 exit status
save shared lib file to : /home/github/.cache/bazel/_bazel_github/ae143f62e65da17ca5d5f97d2dc3de17/execroot/org_tensorflow/_tmp/f0a918c4823924815dc3c918baef281fImplicitBroadcast2DF32.so

The whole build error message is as follows:

INFO: Elapsed time: 53.235s, Critical Path: 44.90s
INFO: 116 processes: 30 remote cache hit, 139 local.
INFO: Build completed, 7 tests FAILED, 116 total actions
//tensorflow/compiler/mlir/disc/tests/regression:empty_tensor.cc.test    PASSED in 3.5s
//tensorflow/compiler/mlir/disc/tests/regression:io_forwarding.cc.test   PASSED in 2.3s
//tensorflow/compiler/mlir/disc/tests/regression:multi_cc.cc.test        PASSED in 1.9s
//tensorflow/compiler/mlir/disc/tests/regression:uint_cast.cc.test       PASSED in 1.6s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:abs.cc.test         PASSED in 4.4s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:add.cc.test         PASSED in 4.4s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:add_n.cc.test       PASSED in 4.2s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:add_v2.cc.test      PASSED in 4.4s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:all.cc.test         PASSED in 3.9s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:any.cc.test         PASSED in 4.2s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:batch_matmul.cc.test PASSED in 1.6s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:batch_matmul_v2.cc.test PASSED in 2.0s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:bias_add.cc.test    PASSED in 4.2s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:broadcast_to.cc.test PASSED in 3.6s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:cast.cc.test        PASSED in 2.6s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:ceil.cc.test        PASSED in 3.8s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:concat_v2.cc.test   PASSED in 3.5s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:const.cc.test       PASSED in 3.4s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:conv2d.cc.test      PASSED in 2.4s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:conv2d_backprop_filter.cc.test PASSED in 1.9s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:conv2d_backprop_input.cc.test PASSED in 1.9s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:dynamic_stitch.cc.test PASSED in 4.5s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:equal.cc.test       PASSED in 5.3s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:erf.cc.test         PASSED in 12.3s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:exp.cc.test         PASSED in 3.8s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:expand_dims.cc.test PASSED in 4.6s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:fill.cc.test        PASSED in 3.2s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:floor.cc.test       PASSED in 4.2s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:floor_mod.cc.test   PASSED in 8.6s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:gather_nd.cc.test   PASSED in 3.4s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:gather_v2.cc.test   PASSED in 5.4s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:greater_equal.cc.test PASSED in 4.9s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:identity.cc.test    PASSED in 4.2s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:isfinite.cc.test    PASSED in 3.7s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:less_equal.cc.test  PASSED in 5.6s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:log.cc.test         PASSED in 4.1s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:logical_and.cc.test PASSED in 4.8s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:logical_not.cc.test PASSED in 3.6s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:logical_or.cc.test  PASSED in 4.9s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:matmul.cc.test      PASSED in 2.1s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:minimum.cc.test     PASSED in 5.8s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:mul.cc.test         PASSED in 4.4s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:neg.cc.test         PASSED in 4.1s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:pack.cc.test        PASSED in 4.2s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:pow.cc.test         PASSED in 5.4s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:prod.cc.test        PASSED in 8.2s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:random_uniform.cc.test PASSED in 1.7s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:range.cc.test       PASSED in 3.5s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:relu.cc.test        PASSED in 3.8s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:relu_grad.cc.test   PASSED in 3.9s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:reshape.cc.test     PASSED in 4.0s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:reverse.cc.test     PASSED in 3.0s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:round.cc.test       PASSED in 6.7s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:rsqrt.cc.test       PASSED in 4.2s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:select.cc.test      PASSED in 4.2s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:shape.cc.test       PASSED in 3.9s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:sigmoid.cc.test     PASSED in 4.0s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:sign.cc.test        PASSED in 4.3s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:slice.cc.test       PASSED in 4.6s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:snapshot.cc.test    PASSED in 4.0s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:split.cc.test       PASSED in 3.8s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:squared_difference.cc.test PASSED in 6.0s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:squeeze.cc.test     PASSED in 3.9s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:strided_slice.cc.test PASSED in 7.5s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:tanh.cc.test        PASSED in 4.3s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:tile.cc.test        PASSED in 4.6s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:topk.cc.test        PASSED in 1.7s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:transpose.cc.test   PASSED in 3.4s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:unpack.cc.test      PASSED in 4.3s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:zeros_like.cc.test  PASSED in 3.6s
//tensorflow/compiler/mlir/disc/transforms/tests:conv-rewrite.mlir.test  PASSED in 0.4s
//tensorflow/compiler/mlir/disc/transforms/tests:demo.mlir.test          PASSED in 0.4s
//tensorflow/compiler/mlir/disc/transforms/tests:disc-assign-kernel-name.mlir.test PASSED in 0.4s
//tensorflow/compiler/mlir/disc/transforms/tests:disc-assign-memory-space.mlir.test PASSED in 0.4s
//tensorflow/compiler/mlir/disc/transforms/tests:disc-const-to-ral.mlir.test PASSED in 0.4s
//tensorflow/compiler/mlir/disc/transforms/tests:disc-convert-tensor-to-std.mlir.test PASSED in 0.3s
//tensorflow/compiler/mlir/disc/transforms/tests:disc-flatten-memref-access.mlir.test PASSED in 0.4s
//tensorflow/compiler/mlir/disc/transforms/tests:disc-hlo-legalize-to-lhlo.mlir.test PASSED in 0.4s
//tensorflow/compiler/mlir/disc/transforms/tests:disc-input-inline-fusion.mlir.test PASSED in 0.4s
//tensorflow/compiler/mlir/disc/transforms/tests:disc-legalize-printf-to-llvm.mlir.test PASSED in 0.4s
//tensorflow/compiler/mlir/disc/transforms/tests:disc-lower-to-library-call.mlir.test PASSED in 0.4s
//tensorflow/compiler/mlir/disc/transforms/tests:disc-math-approximation.mlir.test PASSED in 0.4s
//tensorflow/compiler/mlir/disc/transforms/tests:disc-memref-canonicalizer.mlir.test PASSED in 0.4s
//tensorflow/compiler/mlir/disc/transforms/tests:disc-ral-legalize-alloc-dealloc-to-llvm.mlir.test PASSED in 0.4s
//tensorflow/compiler/mlir/disc/transforms/tests:disc-remove-dead-buffer.mlir.test PASSED in 0.4s
//tensorflow/compiler/mlir/disc/transforms/tests:disc-shape-simplifier-tie-shape.mlir.test PASSED in 0.4s
//tensorflow/compiler/mlir/disc/transforms/tests:disc-shape-simplifier.mlir.test PASSED in 0.4s
//tensorflow/compiler/mlir/disc/transforms/tests:dot_rewriter.mlir.test  PASSED in 0.4s
//tensorflow/compiler/mlir/disc/transforms/tests:dynamic-reshape-canonicalizer.mlir.test PASSED in 0.4s
//tensorflow/compiler/mlir/disc/transforms/tests:element_type_converter.mlir.test PASSED in 0.4s
//tensorflow/compiler/mlir/disc/transforms/tests:gpu_conv_padding.mlir.test PASSED in 0.4s
//tensorflow/compiler/mlir/disc/transforms/tests:lhlo-input-inline-fusion.mlir.test PASSED in 0.4s
//tensorflow/compiler/mlir/disc/transforms/tests:map-parallel-loops.mlir.test PASSED in 0.4s
//tensorflow/compiler/mlir/disc/transforms/tests:memref-load-cse.mlir.test PASSED in 0.4s
//tensorflow/compiler/mlir/disc/transforms/tests:mhlo_mark_shape_calc_op.mlir.test PASSED in 0.4s
//tensorflow/compiler/mlir/disc/transforms/tests:mhlo_place_ops.mlir.test PASSED in 0.4s
//tensorflow/compiler/mlir/disc/transforms/tests:parallel-loop-tiling-inbound-check.mlir.test PASSED in 0.4s
//tensorflow/compiler/mlir/disc/transforms/tests:revise-kernel-outlining.mlir.test PASSED in 0.4s
//tensorflow/compiler/mlir/disc/transforms/tests:revise_args.mlir.test   PASSED in 0.4s
//tensorflow/compiler/mlir/disc/transforms/tests:specialize_fusion_with_speculation.mlir.test PASSED in 0.3s
//tensorflow/compiler/mlir/disc/transforms/tests:split_large_ops.mlir.test PASSED in 0.4s
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:floor_div.cc.test    FLAKY, failed in 1 out of 2 in 5.5s
  Stats over 2 runs: max = 5.5s, min = 5.1s, avg = 5.3s, dev = 0.2s
  /home/github/.cache/bazel/_bazel_github/ae143f62e65da17ca5d5f97d2dc3de17/execroot/org_tensorflow/bazel-out/k8-opt/testlogs/tensorflow/compiler/mlir/disc/tests/tensorflow_ops/floor_div.cc.test/test_attempts/attempt_1.log
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:greater.cc.test      FLAKY, failed in 1 out of 2 in 5.1s
  Stats over 2 runs: max = 5.1s, min = 4.9s, avg = 5.0s, dev = 0.1s
  /home/github/.cache/bazel/_bazel_github/ae143f62e65da17ca5d5f97d2dc3de17/execroot/org_tensorflow/bazel-out/k8-opt/testlogs/tensorflow/compiler/mlir/disc/tests/tensorflow_ops/greater.cc.test/test_attempts/attempt_1.log
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:max.cc.test          FLAKY, failed in 1 out of 2 in 8.1s
  Stats over 2 runs: max = 8.1s, min = 8.1s, avg = 8.1s, dev = 0.0s
  /home/github/.cache/bazel/_bazel_github/ae143f62e65da17ca5d5f97d2dc3de17/execroot/org_tensorflow/bazel-out/k8-opt/testlogs/tensorflow/compiler/mlir/disc/tests/tensorflow_ops/max.cc.test/test_attempts/attempt_1.log
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:min.cc.test          FLAKY, failed in 1 out of 2 in 7.7s
  Stats over 2 runs: max = 7.7s, min = 7.4s, avg = 7.6s, dev = 0.1s
  /home/github/.cache/bazel/_bazel_github/ae143f62e65da17ca5d5f97d2dc3de17/execroot/org_tensorflow/bazel-out/k8-opt/testlogs/tensorflow/compiler/mlir/disc/tests/tensorflow_ops/min.cc.test/test_attempts/attempt_1.log
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:softplus.cc.test     FLAKY, failed in 1 out of 2 in 5.5s
  Stats over 2 runs: max = 5.5s, min = 4.6s, avg = 5.1s, dev = 0.4s
  /home/github/.cache/bazel/_bazel_github/ae143f62e65da17ca5d5f97d2dc3de17/execroot/org_tensorflow/bazel-out/k8-opt/testlogs/tensorflow/compiler/mlir/disc/tests/tensorflow_ops/softplus.cc.test/test_attempts/attempt_1.log
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:less.cc.test         FLAKY, failed in 2 out of 3 in 5.0s
  Stats over 3 runs: max = 5.0s, min = 5.0s, avg = 5.0s, dev = 0.0s
  /home/github/.cache/bazel/_bazel_github/ae143f62e65da17ca5d5f97d2dc3de17/execroot/org_tensorflow/bazel-out/k8-opt/testlogs/tensorflow/compiler/mlir/disc/tests/tensorflow_ops/less.cc.test/test_attempts/attempt_1.log
  /home/github/.cache/bazel/_bazel_github/ae143f62e65da17ca5d5f97d2dc3de17/execroot/org_tensorflow/bazel-out/k8-opt/testlogs/tensorflow/compiler/mlir/disc/tests/tensorflow_ops/less.cc.test/test_attempts/attempt_2.log
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:sum.cc.test          FLAKY, failed in 2 out of 3 in 16.2s
  Stats over 3 runs: max = 16.2s, min = 14.1s, avg = 14.9s, dev = 0.9s
  /home/github/.cache/bazel/_bazel_github/ae143f62e65da17ca5d5f97d2dc3de17/execroot/org_tensorflow/bazel-out/k8-opt/testlogs/tensorflow/compiler/mlir/disc/tests/tensorflow_ops/sum.cc.test/test_attempts/attempt_1.log
  /home/github/.cache/bazel/_bazel_github/ae143f62e65da17ca5d5f97d2dc3de17/execroot/org_tensorflow/bazel-out/k8-opt/testlogs/tensorflow/compiler/mlir/disc/tests/tensorflow_ops/sum.cc.test/test_attempts/attempt_2.log
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:maximum.cc.test     FAILED in 3 out of 3 in 5.4s
  Stats over 3 runs: max = 5.4s, min = 5.1s, avg = 5.2s, dev = 0.1s
  /home/github/.cache/bazel/_bazel_github/ae143f62e65da17ca5d5f97d2dc3de17/execroot/org_tensorflow/bazel-out/k8-opt/testlogs/tensorflow/compiler/mlir/disc/tests/tensorflow_ops/maximum.cc.test/test.log
  /home/github/.cache/bazel/_bazel_github/ae143f62e65da17ca5d5f97d2dc3de17/execroot/org_tensorflow/bazel-out/k8-opt/testlogs/tensorflow/compiler/mlir/disc/tests/tensorflow_ops/maximum.cc.test/test_attempts/attempt_1.log
  /home/github/.cache/bazel/_bazel_github/ae143f62e65da17ca5d5f97d2dc3de17/execroot/org_tensorflow/bazel-out/k8-opt/testlogs/tensorflow/compiler/mlir/disc/tests/tensorflow_ops/maximum.cc.test/test_attempts/attempt_2.log
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:mean.cc.test        FAILED in 3 out of 3 in 8.7s
  Stats over 3 runs: max = 8.7s, min = 6.8s, avg = 7.7s, dev = 0.8s
  /home/github/.cache/bazel/_bazel_github/ae143f62e65da17ca5d5f97d2dc3de17/execroot/org_tensorflow/bazel-out/k8-opt/testlogs/tensorflow/compiler/mlir/disc/tests/tensorflow_ops/mean.cc.test/test.log
  /home/github/.cache/bazel/_bazel_github/ae143f62e65da17ca5d5f97d2dc3de17/execroot/org_tensorflow/bazel-out/k8-opt/testlogs/tensorflow/compiler/mlir/disc/tests/tensorflow_ops/mean.cc.test/test_attempts/attempt_1.log
  /home/github/.cache/bazel/_bazel_github/ae143f62e65da17ca5d5f97d2dc3de17/execroot/org_tensorflow/bazel-out/k8-opt/testlogs/tensorflow/compiler/mlir/disc/tests/tensorflow_ops/mean.cc.test/test_attempts/attempt_2.log
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:not_equal.cc.test   FAILED in 3 out of 3 in 5.3s
  Stats over 3 runs: max = 5.3s, min = 4.6s, avg = 4.9s, dev = 0.3s
  /home/github/.cache/bazel/_bazel_github/ae143f62e65da17ca5d5f97d2dc3de17/execroot/org_tensorflow/bazel-out/k8-opt/testlogs/tensorflow/compiler/mlir/disc/tests/tensorflow_ops/not_equal.cc.test/test.log
  /home/github/.cache/bazel/_bazel_github/ae143f62e65da17ca5d5f97d2dc3de17/execroot/org_tensorflow/bazel-out/k8-opt/testlogs/tensorflow/compiler/mlir/disc/tests/tensorflow_ops/not_equal.cc.test/test_attempts/attempt_1.log
  /home/github/.cache/bazel/_bazel_github/ae143f62e65da17ca5d5f97d2dc3de17/execroot/org_tensorflow/bazel-out/k8-opt/testlogs/tensorflow/compiler/mlir/disc/tests/tensorflow_ops/not_equal.cc.test/test_attempts/attempt_2.log
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:real_div.cc.test    FAILED in 3 out of 3 in 4.6s
  Stats over 3 runs: max = 4.6s, min = 3.9s, avg = 4.4s, dev = 0.3s
  /home/github/.cache/bazel/_bazel_github/ae143f62e65da17ca5d5f97d2dc3de17/execroot/org_tensorflow/bazel-out/k8-opt/testlogs/tensorflow/compiler/mlir/disc/tests/tensorflow_ops/real_div.cc.test/test.log
  /home/github/.cache/bazel/_bazel_github/ae143f62e65da17ca5d5f97d2dc3de17/execroot/org_tensorflow/bazel-out/k8-opt/testlogs/tensorflow/compiler/mlir/disc/tests/tensorflow_ops/real_div.cc.test/test_attempts/attempt_1.log
  /home/github/.cache/bazel/_bazel_github/ae143f62e65da17ca5d5f97d2dc3de17/execroot/org_tensorflow/bazel-out/k8-opt/testlogs/tensorflow/compiler/mlir/disc/tests/tensorflow_ops/real_div.cc.test/test_attempts/attempt_2.log
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:softmax.cc.test     FAILED in 3 out of 3 in 4.8s
  Stats over 3 runs: max = 4.8s, min = 4.2s, avg = 4.5s, dev = 0.2s
  /home/github/.cache/bazel/_bazel_github/ae143f62e65da17ca5d5f97d2dc3de17/execroot/org_tensorflow/bazel-out/k8-opt/testlogs/tensorflow/compiler/mlir/disc/tests/tensorflow_ops/softmax.cc.test/test.log
  /home/github/.cache/bazel/_bazel_github/ae143f62e65da17ca5d5f97d2dc3de17/execroot/org_tensorflow/bazel-out/k8-opt/testlogs/tensorflow/compiler/mlir/disc/tests/tensorflow_ops/softmax.cc.test/test_attempts/attempt_1.log
  /home/github/.cache/bazel/_bazel_github/ae143f62e65da17ca5d5f97d2dc3de17/execroot/org_tensorflow/bazel-out/k8-opt/testlogs/tensorflow/compiler/mlir/disc/tests/tensorflow_ops/softmax.cc.test/test_attempts/attempt_2.log
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:softmax_cross_entropy_with_logits.cc.test FAILED in 3 out of 3 in 8.0s
  Stats over 3 runs: max = 8.0s, min = 7.4s, avg = 7.7s, dev = 0.3s
  /home/github/.cache/bazel/_bazel_github/ae143f62e65da17ca5d5f97d2dc3de17/execroot/org_tensorflow/bazel-out/k8-opt/testlogs/tensorflow/compiler/mlir/disc/tests/tensorflow_ops/softmax_cross_entropy_with_logits.cc.test/test.log
  /home/github/.cache/bazel/_bazel_github/ae143f62e65da17ca5d5f97d2dc3de17/execroot/org_tensorflow/bazel-out/k8-opt/testlogs/tensorflow/compiler/mlir/disc/tests/tensorflow_ops/softmax_cross_entropy_with_logits.cc.test/test_attempts/attempt_1.log
  /home/github/.cache/bazel/_bazel_github/ae143f62e65da17ca5d5f97d2dc3de17/execroot/org_tensorflow/bazel-out/k8-opt/testlogs/tensorflow/compiler/mlir/disc/tests/tensorflow_ops/softmax_cross_entropy_with_logits.cc.test/test_attempts/attempt_2.log
//tensorflow/compiler/mlir/disc/tests/tensorflow_ops:sub.cc.test         FAILED in 3 out of 3 in 4.8s
  Stats over 3 runs: max = 4.8s, min = 4.3s, avg = 4.5s, dev = 0.2s
  /home/github/.cache/bazel/_bazel_github/ae143f62e65da17ca5d5f97d2dc3de17/execroot/org_tensorflow/bazel-out/k8-opt/testlogs/tensorflow/compiler/mlir/disc/tests/tensorflow_ops/sub.cc.test/test.log
  /home/github/.cache/bazel/_bazel_github/ae143f62e65da17ca5d5f97d2dc3de17/execroot/org_tensorflow/bazel-out/k8-opt/testlogs/tensorflow/compiler/mlir/disc/tests/tensorflow_ops/sub.cc.test/test_attempts/attempt_1.log
  /home/github/.cache/bazel/_bazel_github/ae143f62e65da17ca5d5f97d2dc3de17/execroot/org_tensorflow/bazel-out/k8-opt/testlogs/tensorflow/compiler/mlir/disc/tests/tensorflow_ops/sub.cc.test/test_attempts/attempt_2.log

Add Tensorflow-Blade

In the past weeks, we have open-sourced some of the C++ and Python code for tensorflow-blade, which optimizes TensorFlow models for inference. The first stage of the tensorflow-blade open-source effort will be complete when the following actions are finished:

  • cpp code and related tests merged: pr
  • python code and related tests merged: pr
  • complete CI pipeline with the develop docker: pr
  • tutorial for users on how to use tensorflow-blade
  • docs for the design of tensorflow-blade
  • docs for making contributions to tensorflow-blade

invalid output number of `torch.eq(t1, t2).sum()`

When running the MNIST example of TorchDisc, the program outputs an invalid test accuracy number:

Test set: Average loss: 0.3399, Accuracy: 281447840135434/10000 (2814478401354%)

A tiny program to reproduce this bug:

import torch
import _torch_disc as disc

# Register the DISC backend for PyTorch lazy tensors (LTC).
disc._ltc_init_disc_backend()

def func():
    t1 = torch.tensor([[1, 2], [3, 4]])
    t2 = torch.tensor([[1, 2], [5, 6]])
    # Move both tensors to the lazy device so the computation runs through DISC.
    t1, t2 = t1.to('lazy'), t2.to('lazy')
    t3 = torch.eq(t1, t2)
    # Only the first row matches element-wise, so the expected result is tensor(2).
    return t3.sum()

print(func())

outputs:

tensor(3328288142057042689, device='lazy:0')
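
For comparison, running the same computation in plain eager mode (without the lazy/DISC backend) yields the expected count of matching elements:

import torch

t1 = torch.tensor([[1, 2], [3, 4]])
t2 = torch.tensor([[1, 2], [5, 6]])
# Only the first row matches element-wise.
print(torch.eq(t1, t2).sum())  # tensor(2)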

Could not export Python function call '_ReduceFromModelParallelRegion' during torch.jit.save

Hi, below is my function from Megatron-LM

class _ReduceFromModelParallelRegion(torch.autograd.Function):
    """All-redcue the input from the model parallel region."""

    @staticmethod
    def forward(ctx, input_):
        return _reduce(input_)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output

def reduce_from_model_parallel_region(input_):
    return _ReduceFromModelParallelRegion.apply(input_)

And the exception occurred during torch.jit.save:

  File "test.py", line 605, in generate_sentence
    torch.jit.save(optimized_ts, "opt.disc.pt")
  File "/usr/local/lib/python3.6/dist-packages/torch/jit/_serialization.py", line 81, in save
    m.save(f, _extra_files=_extra_files)
  File "/usr/local/lib/python3.6/dist-packages/torch/jit/_script.py", line 487, in save
    return self._c.save(*args, **kwargs)
RuntimeError: 
Could not export Python function call '_ReduceFromModelParallelRegion'. Remove calls to Python functions before export. Did you forget to add @script or @script_method annotation? If this is a nn.ModuleList, add it to __constants__:
/work/generate/gpt2/mpu/mappings.py(135): reduce_from_model_parallel_region
/work/generate/gpt2/mpu/layers.py(134): forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(709): _slow_forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(725): _call_impl
/work/generate/gpt2/model/gpt2_modeling.py(83): forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(709): _slow_forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(725): _call_impl
/work/generate/gpt2/fp16/fp16.py(65): forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(709): _slow_forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(725): _call_impl
/usr/local/lib/python3.6/dist-packages/torch/jit/_trace.py(940): trace_module
/usr/local/lib/python3.6/dist-packages/torch/jit/_trace.py(742): trace
/usr/local/lib/python3.6/dist-packages/torch_blade/exporter.py(235): export
/usr/local/lib/python3.6/dist-packages/torch/autograd/grad_mode.py(26): decorate_context
/usr/local/lib/python3.6/dist-packages/torch_blade/optimization.py(37): _optimize
/usr/local/lib/python3.6/dist-packages/torch/autograd/grad_mode.py(26): decorate_context
/usr/local/lib/python3.6/dist-packages/torch_blade/optimization.py(111): optimize
/usr/local/lib/python3.6/dist-packages/torch/autograd/grad_mode.py(26): decorate_context
test.py(602): generate_sentence

How can I save a model that uses torch.autograd.Function? Thank you very much!
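
One possible workaround (a hedged sketch, not an official TorchBlade recommendation): torch.jit.trace records a custom torch.autograd.Function call as a PythonOp, which torch.jit.save cannot serialize. For an inference-only export the custom backward is not needed, so the Function can be bypassed on the trace path. The `_reduce` helper below stands in for Megatron-LM's model-parallel all-reduce; any name not shown in the snippet above is illustrative.

import torch
import torch.distributed as dist

def _reduce(input_):
    # Stand-in for Megatron-LM's all-reduce over the model-parallel group.
    dist.all_reduce(input_)
    return input_

def reduce_from_model_parallel_region(input_):
    if torch.jit.is_tracing() or not torch.is_grad_enabled():
        # Trace/inference path: call the collective directly so the traced
        # graph contains no un-exportable PythonOp.
        return _reduce(input_)
    # Training path: keep the autograd.Function for its custom backward.
    return _ReduceFromModelParallelRegion.apply(input_)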

Open-source our solutions for ONNX accelerators

Although BladeDISC chooses MHLO as its tensor-level IR for machine learning workloads,
we found that ONNX is widely adopted by the community and by vendors.
We believe the community benefits from ONNX's open governance structure, transparency, and interoperability.
As a result, ONNX makes it easier to access hardware optimizations.

However, deploying models via ONNX has some deficiencies:

  • Some control flow, dynamic shape, and mutation operations in PyTorch/TensorFlow cannot be correctly exported via tracing (see the sketch after this list)
  • Vendors usually provide limited support for ONNX operators. For example, ONNX-TensorRT supports only about 129 of the 160+ ONNX operators
  • It is not practical to provide full conversion coverage of all operations from PyTorch/TensorFlow
  • The compatibility among the tools used in the optimization and conversion pipeline is usually poor
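
As a small illustration of the first point (a generic PyTorch example, not BladeDISC-specific), tracing-based ONNX export silently bakes the branch taken for the example input into the exported graph:

import torch

class Gate(torch.nn.Module):
    def forward(self, x):
        # Data-dependent control flow: tracing records only the branch
        # taken for the example input.
        if x.sum() > 0:
            return x * 2
        return x - 1

# The exported graph always computes `x * 2`, even for inputs where
# the eager model would take the `x - 1` branch.
torch.onnx.export(Gate(), torch.ones(2, 2), "gate.onnx")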

To let users take advantage of the various ONNX tools, BladeDISC/PAI-Blade also provides a compilation path targeting ONNX in addition to MHLO.

TorchBlade Actions:

  • Refactor Engine and Backends #159
  • Add ONNX-TensorRT backend #184 #177
  • Setup CI using NGC(with TorchTensorRT installed) #183
  • Add ONNX-TensorRT d2/torch-tensorrt benchmark examples #202
  • Add document and tutorials

TfBlade Actions:
#195

Add a replay toolkit

To make profiling a single cluster easier, we should implement a toolkit that replays a cluster. My preliminary idea is that this toolkit works in two phases:

  1. dump the cluster args and the compiler input IR in protobuf format at disc_launch_op; users can specify the iteration via an environment variable and then find the dump message in the logs, as in the following example (a sketch of this check appears after the list):

    Launch the training jobs with some environment variables:

    export BLADEDISC_REPLAY_ITERATION=1000
    export BLADEDISC_REPLAY_CLUSTER=cluster_24
    python train.py > train.log
    ...

    Then users can find the replay logs with the grep command after a period of time:

    grep "BladeDISC replay toolkit" train.log
    BladeDISC replay toolkit  dumps the disc compiler input file : `/tmp/tempfile-xxxx.input`, record args file: 
    `/tmp/record_args.xxx.pb`
  2. execute the dumped cluster with an executable program, disc_replay_main, under the nvprof profiler:

    nvprof disc_replay_main /tmp/tempfile-xxxx.input /tmp/record_args.xxx.pb
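
A minimal Python sketch of the dump decision described in phase 1 (hypothetical helper name; the real check would live in the C++ bridge around disc_launch_op):

import os

def should_dump_for_replay(cluster_name: str, iteration: int) -> bool:
    """Return True if this disc_launch_op invocation should dump its input IR
    and recorded args, based on the BLADEDISC_REPLAY_* environment variables above."""
    target_iter = os.environ.get("BLADEDISC_REPLAY_ITERATION")
    target_cluster = os.environ.get("BLADEDISC_REPLAY_CLUSTER")
    if target_iter is None or target_cluster is None:
        return False
    return iteration == int(target_iter) and cluster_name == target_cluster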

TODOs:

  • implement the disc_replay_main executable program.
  • dump the recorded args on the TensorFlow bridge side.
