kyegomez / rt-2

Democratization of RT-2 "RT-2: New model translates vision and language into action"

Home Page: https://discord.gg/qUtxnK2NMf

License: MIT License

Language: Python 100.00%
Topics: artificial-intelligence attention-mechanism embodied-agent gpt4 multi-modal robotics transformer

rt-2's Introduction


Robotic Transformer 2 (RT-2): The Vision-Language-Action Model



This is my implementation of the model behind RT-2. RT-2 leverages PaLM-E as the backbone, with a vision encoder and a language backbone where image embeddings are concatenated in the same space as the language embeddings. This architecture is straightforward to assemble, but it suffers from a shallow understanding of both the unified multi-modal representation and the individual modality representations.

CLICK HERE FOR THE PAPER

Installation

RT-2 can be easily installed using pip:

pip install rt2
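
Some of the package's dependencies (for example deepspeed, per the install issues reported below) expect torch to be present before they build. A minimal sketch of a safer install order, under that assumption:

pip install torch
pip install rt2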

Usage

The RT2 class is a PyTorch module that integrates the PaLM-E backbone into the RT-2 architecture. Here are some examples of how to use it:

Initialization

First, you need to initialize the RT2 class. You can do this by providing the necessary parameters to the constructor:

import torch
from rt2.model import RT2

# img: (batch_size, 3, 256, 256)
# caption: (batch_size, 1024)
img = torch.randn(1, 3, 256, 256)
caption = torch.randint(0, 20000, (1, 1024))

# model: RT2
model = RT2()

# Run model on img and caption
output = model(img, caption)
print(output.shape)  # torch.Size([1, 1024, 20000])
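
The caption above is just random token ids. To feed real text, you would tokenize a caption yourself; the repo does not bundle a tokenizer. Below is a minimal sketch assuming the Hugging Face transformers library and its GPT-2 tokenizer (both assumptions, installed separately); the (1, 1024) shape and the 20,000-id bound simply mirror the example above, not a documented contract:

import torch
from transformers import AutoTokenizer  # assumption: installed separately

from rt2.model import RT2

# Hypothetical tokenizer choice; any tokenizer producing integer ids works
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "bring me the apple on the table"
ids = tokenizer(text, return_tensors="pt").input_ids  # (1, seq_len)

# Pad to the fixed length 1024 and clamp ids below 20000, matching the
# example above (GPT-2's vocabulary is larger, so clamping is lossy).
# Assumes the caption is shorter than 1024 tokens.
caption = torch.zeros(1, 1024, dtype=torch.long)
caption[:, : ids.shape[1]] = ids.clamp(max=19999)

img = torch.randn(1, 3, 256, 256)
model = RT2()
output = model(img, caption)  # (1, 1024, 20000) logits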

Benefits

RT-2 stands at the intersection of vision, language, and action, delivering unmatched capabilities and significant benefits for the world of robotics.

  • Leveraging web-scale datasets and firsthand robotic data, RT-2 provides exceptional performance in understanding and translating visual and semantic cues into robotic control actions.
  • RT-2's architecture is based on well-established models, offering a high chance of success in diverse applications.
  • With clear installation instructions and well-documented examples, you can integrate RT-2 into your systems quickly.
  • RT-2 simplifies the complexities of multi-modal understanding, reducing the burden on your data processing and action prediction pipeline.

Model Architecture

RT-2 integrates a high-capacity vision-language model (VLM), initially pre-trained on web-scale data, with robotics data from RT-1. The VLM takes images as input and generates a sequence of tokens representing natural language text. To adapt this for robotic control, RT-2 represents actions as tokens in the model's output.

RT-2 is fine-tuned using both web and robotics data. The resultant model interprets robot camera images and predicts direct actions for the robot to execute. In essence, it converts visual and language patterns into action-oriented instructions, a remarkable feat in the field of robotic control.
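
The paper describes each action as a short sequence of discrete tokens: a flag for terminating the episode, positional deltas (x, y, z), rotational deltas (roll, pitch, yaw), and the gripper extension, each quantized into 256 uniform bins. This repo does not ship a de-tokenizer, so the sketch below only illustrates the idea; the per-dimension value ranges and the mapping from vocabulary ids to bins are hypothetical:

import torch

NUM_BINS = 256  # the paper quantizes each action dimension into 256 bins

# Hypothetical per-dimension (low, high) ranges; real ranges are robot-specific
ACTION_RANGES = [
    (0.0, 1.0),   # terminate episode
    (-0.1, 0.1),  # delta x (m)
    (-0.1, 0.1),  # delta y (m)
    (-0.1, 0.1),  # delta z (m)
    (-0.5, 0.5),  # delta roll (rad)
    (-0.5, 0.5),  # delta pitch (rad)
    (-0.5, 0.5),  # delta yaw (rad)
    (0.0, 1.0),   # gripper extension
]

def detokenize_action(bins: torch.Tensor) -> list:
    """Map 8 bin indices in [0, 256) back to continuous action values."""
    values = []
    for b, (lo, hi) in zip(bins.tolist(), ACTION_RANGES):
        values.append(lo + (hi - lo) * b / (NUM_BINS - 1))
    return values

logits = torch.randn(1, 1024, 20000)          # stand-in for model output
action_tokens = logits[0, :8].argmax(dim=-1)  # hypothetical: first 8 positions
print(detokenize_action(action_tokens % NUM_BINS))  # fold vocab ids into bins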

Datasets

Datasets used in the paper

| Dataset | Description | Source | % of Training Mixture (RT-2-PaLI-X) | % of Training Mixture (RT-2-PaLM-E) |
| --- | --- | --- | --- | --- |
| WebLI | Around 10B image-text pairs across 109 languages, filtered to the top 10% of examples by cross-modal similarity score, yielding 1B training examples. | Chen et al. (2023b), Driess et al. (2023) | N/A | N/A |
| Episodic WebLI | Not used in co-fine-tuning RT-2-PaLI-X. | Chen et al. (2023a) | N/A | N/A |
| Robotics Dataset | Demonstration episodes collected with a mobile manipulation robot. Each demonstration is annotated with a natural language instruction from one of seven skills. | Brohan et al. (2022) | 50% | 66% |
| Language-Table | Used for training on several prediction tasks. | Lynch et al. (2022) | N/A | N/A |

Commercial Use Cases

The unique capabilities of RT-2 open up numerous commercial applications:

  • Automated Factories: RT-2 can significantly enhance automation in factories by understanding and responding to complex visual and language cues.
  • Healthcare: In robotic surgeries or patient care, RT-2 can assist in understanding and performing tasks based on both visual and verbal instructions.
  • Smart Homes: Integration of RT-2 in smart home systems can lead to improved automation, understanding homeowner instructions in a much more nuanced manner.

Contributing

Contributions to RT-2 are always welcome! Feel free to open an issue or pull request on the GitHub repository.

Contact

For any queries or issues, kindly open a GitHub issue or get in touch with kyegomez.

Citation

@inproceedings{brohan2023rt2,
  title={RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control},
  author={Anthony Brohan and Noah Brown and Justice Carbajal and Yevgen Chebotar and Xi Chen and Krzysztof Choromanski and Tianli Ding and Danny Driess and Avinava Dubey and Chelsea Finn and Pete Florence and Chuyuan Fu and Montse Gonzalez Arenas and Keerthana Gopalakrishnan and Kehang Han and Karol Hausman and Alexander Herzog and Jasmine Hsu and Brian Ichter and Alex Irpan and Nikhil Joshi and Ryan Julian and Dmitry Kalashnikov and Yuheng Kuang and Isabel Leal and Lisa Lee and Tsang-Wei Edward Lee and Sergey Levine and Yao Lu and Henryk Michalewski and Igor Mordatch and Karl Pertsch and Kanishka Rao and Krista Reymann and Michael Ryoo and Grecia Salazar and Pannag Sanketi and Pierre Sermanet and Jaspiar Singh and Anikait Singh and Radu Soricut and Huong Tran and Vincent Vanhoucke and Quan Vuong and Ayzaan Wahid and Stefan Welker and Paul Wohlhart and Jialin Wu and Fei Xia and Ted Xiao and Peng Xu and Sichun Xu and Tianhe Yu and Brianna Zitkovich},
  booktitle={Conference on Robot Learning (CoRL)},
  year={2023}
}

License

RT-2 is provided under the MIT License. See the LICENSE file for details.

rt-2's People

Contributors

kyegomez


rt-2's Issues

Cannot run test.py

When I run python test.py, the following error arises:

Traceback (most recent call last):
  File "tests/test.py", line 3, in <module>
    from rt2.experimental.rt2_palme import PALME, RT2
ModuleNotFoundError: No module named 'rt2.experimental'

how to fine-tune this model?

Thanks for your great work.
In the paper, the de-tokenizer converts tokens into robot actions. How is the robot action tokenizer implemented? I can't find it in the code.
How should I prepare a dataset like the one in the paper, and fine-tune the model?

run test.py error

When I run test.py, it fails with:
TypeError: 'RT2' object is not callable

Can you help me solve this bug?


Initialization sample error - ImportError: cannot import name 'AutoregressiveWrapper' from 'zeta.structs'

Created a venv using Python 3.10 on Ubuntu 22.04 LTS.

Then I followed the Installation and Initialization steps from the project.

Initialization step error:

(rt2) student@nuc21:~/rt2$ python test.py
Traceback (most recent call last):
  File "/mnt/LexLinux/student/rt2/test.py", line 2, in <module>
    from rt2.model import RT2
  File "/mnt/LexLinux/student/rt2/lib/python3.10/site-packages/rt2/__init__.py", line 1, in <module>
    from rt2.model import RT2
  File "/mnt/LexLinux/student/rt2/lib/python3.10/site-packages/rt2/model.py", line 3, in <module>
    from zeta.structs import (
ImportError: cannot import name 'AutoregressiveWrapper' from 'zeta.structs' (/mnt/LexLinux/student/rt2/lib/python3.10/site-packages/zeta/structs/__init__.py)
(rt2) student@nuc21:~/rt2$

Test app:

import torch
from rt2.model import RT2

# img: (batch_size, 3, 256, 256)
# caption: (batch_size, 1024)
img = torch.randn(1, 3, 256, 256)
caption = torch.randint(0, 20000, (1, 1024))

# model: RT2
model = RT2()

# Run model on img and caption
output = model(img, caption)
print(output)  # (1, 1024, 20000)


test.py problem

When I run test.py I get this error:

from rt2.experimental.rt2_palme import PALME, RT2
ImportError: cannot import name 'PALME' from 'rt2.experimental.rt2_palme'

I saw you said you would update this; have you not done so yet?


PALME in the model does not work

The RT-2 model uses PALME.

In the palme project, PALME's forward parameters are as follows:
[screenshot]

But in the RT-2 model it is called like this, and the parameters do not match:
[screenshot]

How to implement?

This project is exciting, but I can't get it to work (even though I followed your instructions). Could you recheck your instructions and update the README? Thank you very much again.

Error running example.py

Hello,

I got the error 'RuntimeError: mat1 and mat2 shapes cannot be multiplied (512x4 and 512x512)' when running the example.py script. I'm at commit e17ceb4.

Could you help me understand what this error means?

Thank you very much!

KC


Problem with example.py

Hi there, I'm getting this error when running python3 example.py:
RuntimeError: The size of tensor a (12) must match the size of tensor b (6) at non-singleton dimension 0

No Examples folder

Is there some kind of branch or tag with the examples?
Any example of how to fine-tune it?
Any schematics of the different values, what they correspond to, and how flexible this is for different robot morphologies?
Is any simulation of the robot used available?

For example, my assumptions here seem wrong:

import torch 
from rt2.model import RT2
from icecream import ic
# Load model
model = RT2()

# Example inputs
video = torch.randn(2, 3, 6, 224, 224)
instructions = [
    'bring me that apple sitting on the table',
    'please pass the butter'
]

# Get evaluation logits
train_logits = model.train(video, instructions)
model.model.eval()
eval_logits = model.eval(video, instructions, cond_scale=3.)

# Assuming softmax gives the probabilities of each action
probabilities = torch.nn.functional.softmax(eval_logits, dim=-1)

# Get the most probable action index
predicted_action_indices = torch.argmax(probabilities, dim=-1)

ic(predicted_action_indices)

# Assuming you have a list of action names corresponding to indices
actions = ["action1", "action2", "..."]  # Replace with actual action names
predicted_actions = [actions[index] for index in predicted_action_indices]

print(predicted_actions)

Also, what kind of video input are we setting here with these random values?

Many questions; I hope you can answer them :)


Some questions about the README and examples.

Hi there, thanks for your inspiring work on RT-2! Though your README is clearly written with effort, I have some questions about how to use this repo.

  1. What is the rt2 in the "Forward Pass" part? It doesn't appear in the context, and I believe it doesn't mean the library (i.e. import rt2), since rt2 is not callable.
  2. The shape of the tensor video in the "Forward Pass" part is not the same as the shape of the tensor with the same name in the "Initialization" part.
  3. I'm really wondering what the "video" tensor is about, since it's just a tensor with random values and the paper doesn't contain any related information about it.
  4. It seems that the example in the tests module doesn't work since rt2.experimental is deprecated (or rather, deleted). Is it possible to fix it?

btw: is this repo the official implementation of RT-2? If not, would it be possible to mention that in your README?
Thank you for your help!

What tokenizer and embedding do I need?

How can I transform a textual caption into the input of this model?


How do I install deepspeed

When I ran pip install rt2, an error occurred while building deepspeed. The error is as follows, and I can't find a solution. Please help me.

Collecting deepspeed (from palme->-r requirements.txt (line 6))
Using cached deepspeed-0.10.2.tar.gz (858 kB)
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [9 lines of output]
Traceback (most recent call last):
File "", line 36, in
File "", line 34, in
File "C:\Users\shtseng\AppData\Local\Temp\pip-install-he0jlh01\deepspeed_3f29e5e48d7248c2937cac6abb05e625\setup.py", line 147, in
assert torch_available, "Unable to pre-compile ops without torch installed. Please install torch before attempting to pre-compile ops."
AssertionError: Unable to pre-compile ops without torch installed. Please install torch before attempting to pre-compile ops.
[WARNING] Unable to import torch, pre-compiling ops will be disabled. Please visit https://pytorch.org/ to see how to properly install torch on your system.
[WARNING] unable to import torch, please install it if you want to pre-compile any deepspeed ops.
DS_BUILD_OPS=1
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

How to use the repo?

Hello,

I hope this message finds you well. I want to express my gratitude for providing the repository; it has been immensely helpful in enabling me to successfully execute the example.py script.

Furthermore, I have thoroughly reviewed the associated paper, which has given me a solid understanding of the project's context. However, I now have a few queries regarding the practical usage of the repository.

I have successfully managed to work with video and text inputs, but I am a bit unsure about how to incorporate the "RT-2" component, which is designed for Video-Language-Action interaction. I might be overlooking something, and I'd appreciate any guidance or clarification you could provide in this regard.

Additionally, while I've been able to obtain results using video and text inputs, I would greatly appreciate some clarification on the interpretation of these results. If you could shed some light on the meaning or implications of these outcomes, it would be immensely helpful.

Thank you very much for your assistance, and I look forward to your response.

Best regards,


can't find Rust compiler

Hi.

pip install rt2

is producing this output.

Collecting six
  Using cached six-1.16.0-py2.py3-none-any.whl (11 kB)
Building wheels for collected packages: tokenizers
  Building wheel for tokenizers (pyproject.toml): started
  Building wheel for tokenizers (pyproject.toml): finished with status 'error'
Failed to build tokenizers
ERROR: Command errored out with exit status 1:
  command: 'C:\Users\...' 'C:\Users\...\envs\rt2\lib\site-packages\pip\_vendor\pep517\in_process\_in_process.py' build_wheel 'C:\Users...'
  cwd: C:\Users\...
  Complete output (51 lines):
  running bdist_wheel
  running build
  running build_py
  [many "creating build\lib.win-amd64-3.6\tokenizers\..." and "copying py_src\tokenizers\..." lines omitted]
  running build_ext
  running build_rust
  error: can't find Rust compiler

Are the requirements kept up to date? There is nothing mentioned about Rust.

I am using Windows 10.

How to use model

Hi.

First of all thank you so much for providing this repo. This model opens up so many possibilities, but only if it's open source.

From the paper I understand that the output of the model is either a string of words, a string of words with an action, or just an action, depending on the output mode. The action consists of a token indicating whether to finish the previous action before starting a new one, changes in x, y, and z, rotations around x, y, and z, and a desired extension of the end effector (gripper).

The example provided runs smoothly, but I'm unsure how to convert the eval_logits to the above output format.

Could you provide an example of this final step?

Thanks a lot in advance!


'Callable' has no attribute '_abc_registry'

After running pip install -r requirements.txt to install this project's dependencies, pip itself stopped working. How can I fix this?

Traceback (most recent call last):
  File "/usr/local/bin/pip", line 5, in <module>
    from pip._internal.cli.main import main
  File "/usr/local/lib/python3.10/site-packages/pip/__init__.py", line 1, in <module>
    from typing import List, Optional
  File "/usr/local/lib/python3.10/site-packages/typing.py", line 1359, in <module>
    class Callable(extra=collections_abc.Callable, metaclass=CallableMeta):
  File "/usr/local/lib/python3.10/site-packages/typing.py", line 1007, in __new__
    self._abc_registry = extra._abc_registry
AttributeError: type object 'Callable' has no attribute '_abc_registry'


ImportError: cannot import name 'PALME' from 'rt2.experimental.rt2_palme'

Thanks for the great RT2. The setup is getting better with time, and I have yet to learn how to use it.

When I run the following command I get errors as follows:

$ python tests/test.py 
Traceback (most recent call last):
 File "tests/test.py", line 3, in <module>
   from rt2.experimental.rt2_palme import PALME, RT2
ImportError: cannot import name 'PALME' from 'rt2.experimental.rt2_palme' (<path_to_env>/lib/python3.8/site-packages/rt2/experimental/rt2_palme.py)

How do I find the correct PALME library?
pip install palme installs successfully; however, the error remains.
