
mesh's People

Contributors

adarob, afrozenator, brettkoonce, brianwa84, cghawthorne, conchylicultor, copybara-service[bot], craffel, crccw, daphnei, dustinvtran, hwchung27, irolnick, katelee168, lucidrains, majnemer, marcvanzee, mmatena, mrry, nconstant-google, nfiedel, nshazeer, penpornk, pierrot0, saberkun, sharannarang, sudroy, sun51, toponado-zz, wuthefwasthat


mesh's Issues

Bug with mtf.reshape

Working on #2, I implemented a script that performs a forward and inverse 3D FFT (script and job).

When the computation is distributed on several GPUs, a reshape is required. Otherwise, only a slice of the mesh is returned. For instance, given a cube of size 128x128x128 and a 4x4 mesh on which I would distribute the x and y axes, lowering.export_to_tf_tensor(output_field) would return a cube of size 32x32x128.
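The slice size follows from dividing each distributed axis by the number of processes along it. A minimal sketch of that arithmetic, assuming an even split (the helper name local_slice_shape is hypothetical and not part of Mesh TensorFlow):

```python
def local_slice_shape(global_shape, splits):
    """Return the per-process block shape when each axis is split evenly.

    global_shape: sizes of the full tensor, e.g. (128, 128, 128).
    splits: number of processes each axis is divided across (1 = not split).
    """
    if len(global_shape) != len(splits):
        raise ValueError("global_shape and splits must have the same length")
    local = []
    for size, parts in zip(global_shape, splits):
        if size % parts != 0:
            raise ValueError(f"axis of size {size} is not divisible by {parts}")
        local.append(size // parts)
    return tuple(local)


# x and y are split over a 4x4 mesh; z stays whole on every process.
print(local_slice_shape((128, 128, 128), (4, 4, 1)))  # (32, 32, 128)
```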

However, once I added the reshape as:

ret_initc = mtf.reshape(input_field, [batch_dim, ffx_dim, ffy_dim, ffz_dim])

so that each process gets the full cube, the script sometimes appears to hang inside the TensorFlow session.

With @EiffL, we tried to debug it and managed to get the full computation running, but only for a mesh distributed on a single node. @EiffL suspected deadlocks, so we added hvd.join() at the end of the collective functions of mesh_nbody_benchmark.py, such as all2all, allconcat and shift_by_n_processors.

We still don't know how to solve the problem. To find out whether it comes from the mesh or the Horovod layer, I'll try to reproduce the error with a script containing only reshapes and see what happens.

Checking/Optimizing Horovod implementation collectives

We currently have prototypes for most of the collectives needed for actual computations in the horovod mesh implementation here:
https://github.com/DifferentiableUniverseInitiative/mesh/blob/70a13d38c5b4b16200dcb8f3d68f866633875181/mesh_tensorflow/hvd_simd_mesh_impl.py

We have found that in some situations, depending on the mesh, the result of the computation is not quite right. This is probably due to wrong assumptions about how to split tensors and shuffle dimensions around, probably in the all2all step, maybe in the allconcat.

In particular, results seemed fine for 2x2 meshes but were wrong on 4x4.

To check these things, a good way is to start from a very simple mesh script: https://github.com/DifferentiableUniverseInitiative/mesh/blob/hvd/examples/test_horovod.py

This script can be modified, for instance, to do a forward and backward FFT (taking inspiration from https://github.com/DifferentiableUniverseInitiative/IDRIS-hackathon/blob/main/scripts/fft_benchmark.py).

The steps would be:

  • Add a 3D FFT and output an image that shows the residuals between inputs and outputs
  • Test that script under different 2D mesh sizes, from [1x1, 2x2, 4x4] and see if a problem appears
  • Given a failing configuration, try to see which mesh_impl functions are called, and try to narrow down which one is the culprit
  • fix
  • profit!
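One way to narrow down a culprit is to compare each collective against a small single-process reference model. The sketch below is such a model for an all-to-all over one mesh axis, written in plain Python on lists. It is not the code in hvd_simd_mesh_impl.py: the function names are hypothetical, and it assumes 1-D data split into equal chunks, so the real split and concat axes may differ.

```python
def reference_alltoall(per_rank):
    """Plain-Python model of an all-to-all over one mesh axis.

    per_rank[s] is the 1-D list held by rank s. Each rank splits its list
    into n equal chunks and sends chunk r to rank r; every rank then
    concatenates what it received in sender order.
    """
    n = len(per_rank)
    length = len(per_rank[0])
    if any(len(data) != length for data in per_rank):
        raise ValueError("all ranks must hold lists of the same length")
    if length % n != 0:
        raise ValueError("list length must be divisible by the number of ranks")
    chunk = length // n
    out = []
    for receiver in range(n):
        received = []
        for sender in range(n):
            start = receiver * chunk
            received.extend(per_rank[sender][start:start + chunk])
        out.append(received)
    return out


def check_roundtrip(n, chunk=2):
    """With equal chunks, applying the all-to-all twice must restore the input."""
    per_rank = [[(rank, i) for i in range(n * chunk)] for rank in range(n)]
    return reference_alltoall(reference_alltoall(per_rank)) == per_rank


# Group sizes along one axis of the 1x1, 2x2 and 4x4 meshes.
for n in (1, 2, 4):
    print(n, check_roundtrip(n))
```

Feeding the same inputs to the distributed collective and to a model like this one, for each mesh size, would show whether the mismatch already appears at that step.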

Bug with 3D convolutions

We got a weird deadlock when trying to run a simple 3D convolution with blocks, most likely from the halo exchange:

all_inspector.cc:107] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock. 
Missing ranks:
1: [shift/HorovodAllgather_shift_ExpandDims_0]
3: [shift_1/HorovodAllgather_shift_1_ExpandDims_0]

@b-remy can you document here exactly how this happened? We'll need to sort it out....
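The Horovod message above means that some ranks submitted a tensor name that other ranks never submitted. As a debugging aid, the same comparison can be sketched offline in plain Python. The helper name missing_submissions and the rank-to-name assignment below are invented for illustration. Only the pattern, ranks 1 and 3 each missing one shift op, mirrors the log.

```python
def missing_submissions(submitted):
    """Map each rank to the tensor names that other ranks submitted but it did not.

    submitted: dict of rank -> set of tensor names submitted by that rank.
    Ranks that submitted every name are left out of the result.
    """
    all_names = set().union(*submitted.values())
    missing = {}
    for rank, names in submitted.items():
        absent = all_names - names
        if absent:
            missing[rank] = sorted(absent)
    return missing


# Hypothetical input: the op names are shortened and the assignment is made up.
submitted = {
    0: {"shift", "shift_1"},
    1: {"shift_1"},
    2: {"shift", "shift_1"},
    3: {"shift"},
}
print(missing_submissions(submitted))
```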

pnum_tensor property missing

In hvd_simd_mesh_impl.py, the laid_out_pnum function requires a property .pnum_tensor. In the standard simd_mesh_impl.py this is defined on line 87, but there is no equivalent in hvd_simd_mesh_impl.py. This raises an error when random is called.
Below is an example of this happening:

    lowering = mtf.Lowering(graph, {mesh: mesh_impl})
  File "<...>/mesh/mesh_tensorflow/ops.py", line 728, in __init__
    op.lower(self)
  File "<...>/mesh/mesh_tensorflow/ops.py", line 5799, in lower
    mesh_impl.random(output_shape, self._tf_fn, self._kwargs)))
  File "<...>/mesh/mesh_tensorflow/hvd_simd_mesh_impl.py", line 607, in random
    tf.equal(self.laid_out_pcoord(axis).one_slice, 0), x.dtype)
  File "<...>/mesh/mesh_tensorflow/ops.py", line 1209, in laid_out_pcoord
    return self.slicewise(my_fn, self.laid_out_pnum())
  File "<...>/mesh/mesh_tensorflow/hvd_simd_mesh_impl.py", line 268, in laid_out_pnum
    return self.LaidOutTensor([self.pnum_tensor])
AttributeError: 'HvdSimdMeshImpl' object has no attribute 'pnum_tensor'
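The failure mode, and the general shape of a fix, can be reproduced with stand-in classes. This is only a sketch: MockMeshImpl and MockMeshImplWithPnum are hypothetical and are not the Mesh TensorFlow classes. The rank is passed in by hand here, and where a real implementation would get that value is an assumption this sketch does not settle.

```python
class MockMeshImpl:
    """Stand-in for a mesh implementation; not the real Mesh TensorFlow class."""

    def laid_out_pnum(self):
        # Like the failing line in the traceback, this reads self.pnum_tensor.
        return [self.pnum_tensor]


class MockMeshImplWithPnum(MockMeshImpl):
    """Same stand-in, with the missing property supplied."""

    def __init__(self, rank):
        self._rank = rank

    @property
    def pnum_tensor(self):
        # Hypothetical: return this process's number within the mesh.
        return self._rank


try:
    MockMeshImpl().laid_out_pnum()
except AttributeError as err:
    print("without the property:", err)

print("with the property:", MockMeshImplWithPnum(rank=3).laid_out_pnum())
```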
