DifferentiableUniverseInitiative / mesh
This project is forked from tensorflow/mesh.
Mesh TensorFlow: Model Parallelism Made Easier
License: Apache License 2.0
Working on #2, I implemented a script that performs a forward and inverse 3D FFT (script and job).
When the computation is distributed over several GPUs, a reshape is required; otherwise, only a slice of the mesh is returned. For instance, given a cube of size 128x128x128 and a 4x4 mesh on which I distribute the x and y axes, lowering.export_to_tf_tensor(output_field) returns a cube of size 32x32x128.
However, once I added the reshape, as in:
mesh/examples/fft_test_horovod.py
Line 90 in 64f3154
so that each process gets the full cube, the script sometimes seems to get stuck in the tf session...
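For context, here is a minimal sketch of the kind of reshape involved, assuming mesh-TF's standard API (the dimension names and the placement of the FFT are illustrative, not the exact code at line 90):

import tensorflow.compat.v1 as tf
import mesh_tensorflow as mtf

graph = mtf.Graph()
mesh = mtf.Mesh(graph, "fft_mesh")

# 128^3 cube; "nx" and "ny" are the dimensions the layout rules split over
# the 4x4 mesh, so each process only holds a 32x32x128 slice.
nx = mtf.Dimension("nx", 128)
ny = mtf.Dimension("ny", 128)
nz = mtf.Dimension("nz", 128)
field = mtf.import_tf_tensor(
    mesh, tf.random_normal([128, 128, 128]), shape=mtf.Shape([nx, ny, nz]))

# ... forward and inverse FFT on `field` would go here ...

# Reshape onto fresh dimension names that appear in no layout rule, so the
# result is replicated rather than split; export_to_tf_tensor then returns
# the full cube on every process instead of a 32x32x128 slice.
nx_u = mtf.Dimension("nx_unsplit", 128)
ny_u = mtf.Dimension("ny_unsplit", 128)
output_field = mtf.reshape(field, mtf.Shape([nx_u, ny_u, nz]))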
With @EiffL, we tried to debug it and managed to enable the full computation, but only for a mesh distributed on a single node. @EiffL suspected deadlocks, so we added hvd.join() at the end of the collective functions of mesh_nbody_benchmark.py, such as all2all, allconcat and shift_by_n_processors.
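As a sketch of that workaround (a debugging aid only, not a fix; the wrapper name is hypothetical, and note that in graph mode hvd.join() returns an op that must be fetched, whereas in eager mode it blocks directly):

import horovod.tensorflow as hvd

def barriered(collective_fn):
    """Wrap a collective so it is followed by an hvd.join() barrier."""
    def wrapper(*args, **kwargs):
        result = collective_fn(*args, **kwargs)
        # Barrier: rules out one rank racing ahead into the next collective.
        hvd.join()
        return result
    return wrapper

# e.g. mesh_impl.all2all = barriered(mesh_impl.all2all)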
We still don't know how to solve the problem... To determine whether the problem comes from the mesh layer or the horovod layer, I'll try to reproduce the error with a script containing only reshapes and see what happens.
In our current prototype of the horovod mesh implementation, we are using code copy/pasted from the TPU SIMD implementation; it is not expected to work...
mesh/mesh_tensorflow/hvd_simd_mesh_impl.py
Line 112 in 70a13d3
This issue is to document the reimplementation of these variables using the horovod backend. As points of reference, we can look at how these variables are implemented in both the Device Placement impl and the SIMD impl.
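As a hedged sketch of the kind of mapping these variables encode, assuming ranks are laid out row-major over the mesh shape (the coordinate convention is an assumption, not the repo's confirmed layout):

def pcoord(rank, mesh_shape, axis):
    """Coordinate of `rank` along `axis` for a row-major rank layout.

    mesh_shape: list of ints, e.g. [4, 4] for a 4x4 mesh.
    """
    stride = 1
    for size in mesh_shape[axis + 1:]:
        stride *= size
    return (rank // stride) % mesh_shape[axis]

# On a [4, 4] mesh, rank 6 sits at coordinates (1, 2):
# pcoord(6, [4, 4], 0) == 1 and pcoord(6, [4, 4], 1) == 2.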
We currently have prototypes for most of the collectives needed for actual computations in the horovod mesh implementation here:
https://github.com/DifferentiableUniverseInitiative/mesh/blob/70a13d38c5b4b16200dcb8f3d68f866633875181/mesh_tensorflow/hvd_simd_mesh_impl.py
In particular:
mesh/mesh_tensorflow/hvd_simd_mesh_impl.py
Line 270 in 70a13d3
mesh/mesh_tensorflow/hvd_simd_mesh_impl.py
Line 308 in 70a13d3
mesh/mesh_tensorflow/hvd_simd_mesh_impl.py
Line 378 in 70a13d3
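For reference, a hedged sketch of how an allconcat over one mesh axis can map onto Horovod collectives (an illustrative helper, not the repo's actual code; for simplicity it gathers over all ranks, while the real implementation must restrict the gather to the process group of a single mesh axis):

import tensorflow.compat.v1 as tf
import horovod.tensorflow as hvd

def allconcat_slice(slice_t, concat_axis):
    """Gather every rank's slice and concatenate along concat_axis."""
    # hvd.allgather concatenates along axis 0, so put each rank's slice
    # behind a fresh leading axis, gather, then re-concatenate where wanted.
    expanded = tf.expand_dims(slice_t, 0)         # [1, ...]
    gathered = hvd.allgather(expanded)            # [hvd.size(), ...]
    parts = tf.unstack(gathered, num=hvd.size())  # per-rank slices
    return tf.concat(parts, axis=concat_axis)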
We have detected that in some situations, depending on the mesh, the result of the computation is not quite right. This is probably due to wrong assumptions about how to split tensors and shuffle dimensions around, most likely in the all2all step, possibly in the allconcat.
In particular, we found that meshes of size 2x2 seemed OK, but 4x4 failed.
To check these things, a good way is to start from a very simple mesh script: https://github.com/DifferentiableUniverseInitiative/mesh/blob/hvd/examples/test_horovod.py
This script can be modified, for instance, to do a forward and backward FFT (taking inspiration from https://github.com/DifferentiableUniverseInitiative/IDRIS-hackathon/blob/main/scripts/fft_benchmark.py).
The steps would be to log when the mesh_impl functions are called, and try to narrow down which one is the culprit (see the tracing sketch below).
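One possible way to do that narrowing (a hypothetical helper, not code from the repo; the method names to trace depend on the MeshImpl class, and the prints fire at lowering time, which still shows which collective enters the graph last before things go wrong):

import functools
import horovod.tensorflow as hvd

def trace_collectives(mesh_impl, names=("alltoall", "allconcat", "allreduce")):
    """Wrap each collective method on mesh_impl with a logging shim."""
    for name in names:
        if not hasattr(mesh_impl, name):
            continue
        fn = getattr(mesh_impl, name)
        @functools.wraps(fn)
        def wrapped(*args, _fn=fn, _name=name, **kwargs):
            print("[rank %d] entering %s" % (hvd.rank(), _name))
            out = _fn(*args, **kwargs)
            print("[rank %d] leaving %s" % (hvd.rank(), _name))
            return out
        setattr(mesh_impl, name, wrapped)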
We got some weird deadlock when trying to run a simple 3d conv with blocks, most likely from the halo exchange:
all_inspector.cc:107] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
Missing ranks:
1: [shift/HorovodAllgather_shift_ExpandDims_0]
3: [shift_1/HorovodAllgather_shift_1_ExpandDims_0]
@b-remy can you document here exactly how this happened? We'll need to sort it out....
In hvd_simd_mesh_impl.py, the laid-out tensor function requires a property .pnum_tensor. In the standard simd_mesh_impl.py this is defined on line 87, but there is no equivalent in hvd_simd_mesh_impl.py. This raises problems when random is called.
Below is an example of this happening:
lowering = mtf.Lowering(graph, {mesh: mesh_impl})
File "<...>/mesh/mesh_tensorflow/ops.py", line 728, in __init__
op.lower(self)
File "<...>/mesh/mesh_tensorflow/ops.py", line 5799, in lower
mesh_impl.random(output_shape, self._tf_fn, self._kwargs)))
File "<...>/mesh/mesh_tensorflow/hvd_simd_mesh_impl.py", line 607, in random
tf.equal(self.laid_out_pcoord(axis).one_slice, 0), x.dtype)
File "<...>/mesh/mesh_tensorflow/ops.py", line 1209, in laid_out_pcoord
return self.slicewise(my_fn, self.laid_out_pnum())
File "<...>/mesh/mesh_tensorflow/hvd_simd_mesh_impl.py", line 268, in laid_out_pnum
return self.LaidOutTensor([self.pnum_tensor])
AttributeError: 'HvdSimdMeshImpl' object has no attribute 'pnum_tensor'
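A minimal sketch of a Horovod analogue, assuming one Horovod process per mesh slot (the TPU implementation builds pnum_tensor as a per-replica constant; here the rank id plays that role). Caching this as self._pnum_tensor, or exposing it as a pnum_tensor property on HvdSimdMeshImpl, would resolve the AttributeError:

import tensorflow.compat.v1 as tf
import horovod.tensorflow as hvd

def make_pnum_tensor():
    """Scalar tf.int32 giving this process's processor number in the mesh.

    Assumption: one Horovod process per mesh slot, so hvd.rank() can stand
    in for the per-replica constant used by the TPU implementation.
    """
    return tf.constant(hvd.rank(), dtype=tf.int32, name="pnum")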