Google announced the release of RT-2, a vision-language-action (VLA) model. RT-2 is a Transformer-based model trained on text and images from the web that can directly output robotic actions. Just as language models learn general ideas and concepts from web text, RT-2 transfers knowledge from web data to inform robot behavior.
Challenges in Real-World Robot Learning
Creating helpful robots capable of performing general tasks in the real world has always been challenging. Unlike chatbots, robots require “grounding” in the physical world and must possess a deep understanding of their own abilities. Training a robot is not just about learning facts about an object like an apple; it involves recognizing an apple in context, distinguishing it from similar objects like a red ball, understanding what it looks like, and, most importantly, knowing how to pick it up.
Historically, achieving this level of competence has demanded training robots on billions of data points, covering every conceivable object, environment, task, and situation in the physical world. This process has been time-consuming and costly, making it impractical for most innovators.
The RT-2 Approach to Learning
Recent advancements in AI have improved robots’ reasoning abilities, enabling them to work through multi-step problems with chain-of-thought prompting. Vision-language models like PaLM-E have helped robots better understand their surroundings. RT-1 demonstrated that Transformers, known for their ability to generalize information across systems, could help different types of robots learn from each other.
Until now, however, robots have relied on complex stacks of systems, in which high-level reasoning and low-level manipulation systems communicated with each other imperfectly. The process resembled a game of telephone: a high-level instruction had to be relayed down to lower-level systems that executed the actions. RT-2 reportedly transforms this approach by consolidating all of these functions into a single model. Not only can it perform complex reasoning like other foundation models, it can also directly output robot actions. RT-2 shows that, with a small amount of robot training data, a single model can transfer concepts embedded in its language and vision training data directly into robot actions, even for tasks it has never been trained to perform.
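To make the single-model picture more concrete, here is a minimal sketch of how a VLA model’s output might be turned into a robot command: the model is assumed to emit a short sequence of discrete action tokens, and a thin adapter maps those tokens onto continuous motor values. This is not Google’s released code; the token layout, bin count, and physical ranges used below (delta_xyz, delta_rpy, gripper) are illustrative assumptions.

```python
# Illustrative sketch (not Google's released code): a VLA model is assumed to
# emit each action step as a short sequence of discrete tokens, and a thin
# adapter converts those tokens into a low-level robot command.

from dataclasses import dataclass
from typing import List

NUM_BINS = 256     # assumed resolution of the discretized action space
ACTION_DIMS = 8    # assumed layout: terminate flag, xyz translation, xyz rotation, gripper

@dataclass
class RobotAction:
    terminate: bool
    delta_xyz: List[float]   # metres, relative to the current end-effector pose (assumed range)
    delta_rpy: List[float]   # radians (assumed range)
    gripper: float           # 0.0 = open, 1.0 = closed

def detokenize(action_tokens: List[int]) -> RobotAction:
    """Map one step's worth of discrete action tokens back to continuous values."""
    if len(action_tokens) != ACTION_DIMS:
        raise ValueError(f"expected {ACTION_DIMS} tokens, got {len(action_tokens)}")

    def scale(tok: int, lo: float, hi: float) -> float:
        # Rescale a token from [0, NUM_BINS) to an assumed physical range.
        return lo + (tok / (NUM_BINS - 1)) * (hi - lo)

    return RobotAction(
        terminate=action_tokens[0] > NUM_BINS // 2,
        delta_xyz=[scale(t, -0.05, 0.05) for t in action_tokens[1:4]],
        delta_rpy=[scale(t, -0.25, 0.25) for t in action_tokens[4:7]],
        gripper=scale(action_tokens[7], 0.0, 1.0),
    )

if __name__ == "__main__":
    # Pretend the model decoded these tokens for an instruction such as
    # "pick up the bag of chips and put it in the bin".
    step_tokens = [3, 200, 128, 90, 128, 128, 128, 255]
    print(detokenize(step_tokens))
```

The design point this illustrates is that actions live in the same output space as text, so the same model that reasons over web knowledge can emit motor commands directly instead of handing an instruction down a stack of separate systems.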
For instance, previous systems had to be explicitly trained to identify trash, pick it up, and throw it away. RT-2, having absorbed knowledge from a large corpus of web data, can already recognize trash without explicit training and even knows how to dispose of it, despite never being specifically taught that action. RT-2 can also reportedly grasp the abstract nature of trash, recognizing that what was once a bag of chips or a banana peel becomes trash once it has been consumed. This level of understanding comes from its vision-language training data and allows it to carry out such tasks without dedicated robot training for them.
RT-2 Outperforms RT-1 by 30 Percentage Points on “Unseen” Tasks
RT-2’s ability to transfer knowledge into actions offers the potential for robots to rapidly adapt to new situations and environments. In tests involving over 6,000 robotic trials, RT-2 performed as well as its predecessor, RT-1, on tasks from its training data, known as “seen” tasks. However, it significantly outperformed RT-1 on novel, unseen scenarios, achieving a success rate of 62% compared to RT-1’s 32%.