Google's new AI model RT-2 translates vision and language into robotic actions

In a groundbreaking development, Google has introduced Robotics Transformer 2 (RT-2) - a first-of-its-kind vision-language-action (VLA) model that helps robots more easily understand and perform actions in both familiar and new situations.
Trained on both web and robotics data, RT-2 translates that knowledge into generalised instructions for robotic control while retaining its web-scale capabilities. Simply put, the new AI model translates vision and language into robotic actions.
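Concretely, RT-2 represents robot actions as strings of text tokens, so a vision-language model can emit actions the same way it emits words. The Python sketch below is a minimal illustration of that idea rather than Google's implementation: the 8-dimensional action layout (terminate flag, 6-DoF end-effector displacement, gripper), the 256-bin discretisation, the normalized value range, and all helper names are assumptions made for the example.

```python
# Minimal sketch of VLA-style action de-tokenization (illustrative only).
# Assumptions (not from the article): the model emits an 8-number string
# [terminate, dx, dy, dz, droll, dpitch, dyaw, gripper], each value
# discretized into 256 bins over a normalized [-1.0, 1.0] range.

NUM_BINS = 256
LOW, HIGH = -1.0, 1.0

ACTION_DIMS = [
    "terminate", "dx", "dy", "dz", "droll", "dpitch", "dyaw", "gripper",
]


def bin_to_value(bin_index: int) -> float:
    """Map a discrete bin index back to a continuous value."""
    step = (HIGH - LOW) / (NUM_BINS - 1)
    return LOW + bin_index * step


def decode_action(token_string: str) -> dict:
    """Turn a model-emitted token string such as '0 128 91 241 5 101 127 212'
    into a named, continuous action command for the robot controller."""
    bins = [int(tok) for tok in token_string.split()]
    if len(bins) != len(ACTION_DIMS):
        raise ValueError(f"expected {len(ACTION_DIMS)} action tokens, got {len(bins)}")
    action = {name: bin_to_value(b) for name, b in zip(ACTION_DIMS, bins)}
    # Treat the terminate flag as binary: a bin in the upper half means "stop".
    action["terminate"] = float(bins[0] >= NUM_BINS // 2)
    return action


if __name__ == "__main__":
    print(decode_action("0 128 91 241 5 101 127 212"))
```

Because the actions share the model's text vocabulary in this scheme, co-training on web data lets language and visual understanding flow directly into motor commands.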
Notably, RT-2 shows that with a small amount of robot training data, the system is able to transfer concepts embedded in its language and vision training data to direct robot actions - even for tasks it has never been trained to do. For example, previous systems required explicit training to recognise and dispose of trash. In contrast, RT-2, having been trained on a vast corpus of web data, already possesses the concept of what trash is, enabling it to identify and dispose of it without task-specific training.
RT-2's capabilities, and its semantic and visual understanding, were evaluated in over 6,000 robotic trials. On tasks within its training data, known as "seen" tasks, RT-2 performed as well as its predecessor, RT-1. The most striking finding, however, was that it nearly doubled performance on novel, previously unseen scenarios, achieving a 62% success rate compared with RT-1's 32%.
"Not only does RT-2 show how advances in AI are cascading rapidly into robotics, it shows enormous promise for more general-purpose robots. While there is still a tremendous amount of work to be done to enable helpful robots in human-centred environments, RT-2 shows us an exciting future for robotics just within grasp," Google said.
RT-2 marks a significant step toward a general-purpose robot that can operate effectively in real-world scenarios. By combining vision, language, and action comprehension in a single model, RT-2 opens up exciting possibilities for robots to reason, problem-solve, and interpret information, paving the way for their application across a diverse range of tasks and settings.