DifferentiableUniverseInitiative / mesh
This project is forked from tensorflow/mesh.
Mesh TensorFlow: Model Parallelism Made Easier
License: Apache License 2.0
Working on #2, I implemented a script that performs a forward and inverse 3D FFT (script and job).
When the computation is distributed over several GPUs, a reshape is required; otherwise, only a slice of the mesh is returned. For instance, given a cube of size 128x128x128 and a 4x4 mesh on which I distribute the x and y axes, lowering.export_to_tf_tensor(output_field) returns a cube of size 32x32x128.
However, once I added the reshape, as in:
mesh/examples/fft_test_horovod.py
Line 90 in 64f3154
so that each process gets the full cube, the script sometimes seems to get stuck in the tf session...
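For context, here is a minimal sketch of the kind of reshape involved, assuming mesh-TF's standard API (the dimension names and the placement of the FFT are illustrative, not the exact code at line 90):

import tensorflow.compat.v1 as tf
import mesh_tensorflow as mtf

graph = mtf.Graph()
mesh = mtf.Mesh(graph, "fft_mesh")

# 128^3 cube; "nx" and "ny" are the dimensions the layout rules split over
# the 4x4 mesh, so each process only holds a 32x32x128 slice.
nx = mtf.Dimension("nx", 128)
ny = mtf.Dimension("ny", 128)
nz = mtf.Dimension("nz", 128)
field = mtf.import_tf_tensor(
    mesh, tf.random_normal([128, 128, 128]), shape=mtf.Shape([nx, ny, nz]))

# ... forward and inverse FFT on `field` would go here ...

# Reshape onto fresh dimension names that appear in no layout rule, so the
# result is replicated rather than split; export_to_tf_tensor then returns
# the full cube on every process instead of a 32x32x128 slice.
nx_u = mtf.Dimension("nx_unsplit", 128)
ny_u = mtf.Dimension("ny_unsplit", 128)
output_field = mtf.reshape(field, mtf.Shape([nx_u, ny_u, nz]))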
With @EiffL, we tried to debug it and managed to enable the full computation, but only for a mesh distributed on a single node. @EiffL suspected deadlocks, so we added hvd.join() at the end of the collective functions of mesh_nbody_benchmark.py, such as all2all, allconcat and shift_by_n_processors.
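As a sketch of that workaround (a debugging aid only, not a fix; the wrapper name is hypothetical, and note that in graph mode hvd.join() returns an op that must be fetched, whereas in eager mode it blocks directly):

import horovod.tensorflow as hvd

def barriered(collective_fn):
    """Wrap a collective so it is followed by an hvd.join() barrier."""
    def wrapper(*args, **kwargs):
        result = collective_fn(*args, **kwargs)
        # Barrier: rules out one rank racing ahead into the next collective.
        hvd.join()
        return result
    return wrapper

# e.g. mesh_impl.all2all = barriered(mesh_impl.all2all)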
We still don't know how to solve the problem... To determine whether the problem comes from the mesh layer or the horovod layer, I'll try to reproduce the error with a script containing only reshapes and see what happens.
In our current prototype of the horovod mesh implementation, we are using code copy/pasted from the TPU SIMD implementation; it is not expected to work...
mesh/mesh_tensorflow/hvd_simd_mesh_impl.py
Line 112 in 70a13d3
This issue is to document the reimplementation of these variables using the horovod backend. As points of reference, we can look at how these variables are implemented in both the Device Placement impl and the SIMD impl.
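As a hedged sketch of the kind of mapping these variables encode, assuming ranks are laid out row-major over the mesh shape (the coordinate convention is an assumption, not the repo's confirmed layout):

def pcoord(rank, mesh_shape, axis):
    """Coordinate of `rank` along `axis` for a row-major rank layout.

    mesh_shape: list of ints, e.g. [4, 4] for a 4x4 mesh.
    """
    stride = 1
    for size in mesh_shape[axis + 1:]:
        stride *= size
    return (rank // stride) % mesh_shape[axis]

# On a [4, 4] mesh, rank 6 sits at coordinates (1, 2):
# pcoord(6, [4, 4], 0) == 1 and pcoord(6, [4, 4], 1) == 2.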
We currently have prototypes for most of the collectives needed for actual computations in the horovod mesh implementation here:
https://github.com/DifferentiableUniverseInitiative/mesh/blob/70a13d38c5b4b16200dcb8f3d68f866633875181/mesh_tensorflow/hvd_simd_mesh_impl.py
In particular:
mesh/mesh_tensorflow/hvd_simd_mesh_impl.py
Line 270 in 70a13d3
mesh/mesh_tensorflow/hvd_simd_mesh_impl.py
Line 308 in 70a13d3
mesh/mesh_tensorflow/hvd_simd_mesh_impl.py
Line 378 in 70a13d3
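For reference, a hedged sketch of how an allconcat over one mesh axis can map onto Horovod collectives (an illustrative helper, not the repo's actual code; for simplicity it gathers over all ranks, while the real implementation must restrict the gather to the process group of a single mesh axis):

import tensorflow.compat.v1 as tf
import horovod.tensorflow as hvd

def allconcat_slice(slice_t, concat_axis):
    """Gather every rank's slice and concatenate along concat_axis."""
    # hvd.allgather concatenates along axis 0, so put each rank's slice
    # behind a fresh leading axis, gather, then re-concatenate where wanted.
    expanded = tf.expand_dims(slice_t, 0)         # [1, ...]
    gathered = hvd.allgather(expanded)            # [hvd.size(), ...]
    parts = tf.unstack(gathered, num=hvd.size())  # per-rank slices
    return tf.concat(parts, axis=concat_axis)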
We have detected that in some situations, depending on the mesh, the result of the computation is not quite right. This is probably due to wrong assumptions about how to split tensors and shuffle dimensions around, most likely in the all2all step, possibly in the allconcat.
In particular, we found that meshes of size 2x2 seemed OK, but 4x4 failed.
To check these things, a good way is to start from a very simple mesh script: https://github.com/DifferentiableUniverseInitiative/mesh/blob/hvd/examples/test_horovod.py
This script can be modified, for instance, to do a forward and backward FFT (taking inspiration from https://github.com/DifferentiableUniverseInitiative/IDRIS-hackathon/blob/main/scripts/fft_benchmark.py).
The steps would be to log when the mesh_impl functions are called, and try to narrow down which one is the culprit (see the tracing sketch below).
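One possible way to do that narrowing (a hypothetical helper, not code from the repo; the method names to trace depend on the MeshImpl class, and the prints fire at lowering time, which still shows which collective enters the graph last before things go wrong):

import functools
import horovod.tensorflow as hvd

def trace_collectives(mesh_impl, names=("alltoall", "allconcat", "allreduce")):
    """Wrap each collective method on mesh_impl with a logging shim."""
    for name in names:
        if not hasattr(mesh_impl, name):
            continue
        fn = getattr(mesh_impl, name)
        @functools.wraps(fn)
        def wrapped(*args, _fn=fn, _name=name, **kwargs):
            print("[rank %d] entering %s" % (hvd.rank(), _name))
            out = _fn(*args, **kwargs)
            print("[rank %d] leaving %s" % (hvd.rank(), _name))
            return out
        setattr(mesh_impl, name, wrapped)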
We got some weird deadlock when trying to run a simple 3d conv with blocks, most likely from the halo exchange:
all_inspector.cc:107] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
Missing ranks:
1: [shift/HorovodAllgather_shift_ExpandDims_0]
3: [shift_1/HorovodAllgather_shift_1_ExpandDims_0]
@b-remy can you document here exactly how this happened? We'll need to sort it out....
In hvd_simd_mesh_impl.py, the laid-out tensor function requires a property .pnum_tensor. In the standard simd_mesh_impl.py this is defined on line 87, but there is no equivalent in hvd_simd_mesh_impl.py. This raises problems when random is called.
Below is an example of this happening:
lowering = mtf.Lowering(graph, {mesh: mesh_impl})
File "<...>/mesh/mesh_tensorflow/ops.py", line 728, in __init__
op.lower(self)
File "<...>/mesh/mesh_tensorflow/ops.py", line 5799, in lower
mesh_impl.random(output_shape, self._tf_fn, self._kwargs)))
File "<...>/mesh/mesh_tensorflow/hvd_simd_mesh_impl.py", line 607, in random
tf.equal(self.laid_out_pcoord(axis).one_slice, 0), x.dtype)
File "<...>/mesh/mesh_tensorflow/ops.py", line 1209, in laid_out_pcoord
return self.slicewise(my_fn, self.laid_out_pnum())
File "<...>/mesh/mesh_tensorflow/hvd_simd_mesh_impl.py", line 268, in laid_out_pnum
return self.LaidOutTensor([self.pnum_tensor])
AttributeError: 'HvdSimdMeshImpl' object has no attribute 'pnum_tensor'
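A minimal sketch of a Horovod analogue, assuming one Horovod process per mesh slot (the TPU implementation builds pnum_tensor as a per-replica constant; here the rank id plays that role). Caching this as self._pnum_tensor, or exposing it as a pnum_tensor property on HvdSimdMeshImpl, would resolve the AttributeError:

import tensorflow.compat.v1 as tf
import horovod.tensorflow as hvd

def make_pnum_tensor():
    """Scalar tf.int32 giving this process's processor number in the mesh.

    Assumption: one Horovod process per mesh slot, so hvd.rank() can stand
    in for the per-replica constant used by the TPU implementation.
    """
    return tf.constant(hvd.rank(), dtype=tf.int32, name="pnum")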