Today’s episode is all about an LLM trained for robotics applications called RFM-1 that completely blows my mind because of the implications for what can now suddenly be accomplished so easily with robotics.
Before we dig into RFM-1, I’d like to mention two other major announcements in A.I. robotics in the past month or so:
1. Nvidia announced GR00T (spelled with two zeros instead of two “O”s, presumably to avoid trademark issues with the Marvel character named Groot). This is Nvidia’s own general-purpose foundation model for humanoid robots, announced during Nvidia’s big GTC conference.
2. A startup called Figure that is developing a humanoid robot raised $675m in a Series B, valuing the company at a wild $2.6B. This is particularly crazy given that the robot is still in development; it could be years before there’s a product to buy. Evidently the Figure investors (which include Microsoft, Nvidia, Jeff Bezos and the OpenAI Startup Fund) see a lot of potential in Figure, whose robots are intended to be general-purpose, meaning they can do a lot of tasks humans do. And, thanks to a collaboration with OpenAI, the robots are now expected to have enhanced natural-language processing and reasoning capabilities… and get to market more quickly.
All right, those announcements on humanoid robots aside, let’s dig into RFM-1, which powers robot arms like those commonly used in factories rather than a humanoid robot trying to replicate the full range of human actions the way GR00T and the Figure robots are.
The company behind RFM-1, which stands for Robotic Foundation Model 1, is Covariant, the rapidly growing A.I. factory-robotics company that’s led by Pieter Abbeel, a Berkeley professor, the world’s best-known A.I. roboticist, and our guest in Episode #503 of this podcast.
The driving concept behind RFM-1 is Covariant’s belief that the next major technological breakthrough lies in extending AI advancements into the physical realm. Robotics stands at the forefront of this shift, poised to unlock efficiencies in the physical world that mirror those we’ve unlocked digitally with the likes of GPT-4, Gemini Ultra and Claude 3. Those software-only foundation LLMs have delivered impressive results across a wide range of modalities, such as text, images, videos, music, and code. However, existing LLMs still make errors about the physical laws of reality that small children wouldn’t make, and they don’t achieve the accuracy, precision, and reliability required for robots to interact with the real world effectively and autonomously.
This is the gap that RFM-1, the Robotic Foundation Model, impressively fills. RFM-1 is trained on both general internet data and data rich in physical real-world interactions. This allows for a big leap forward that brings us closer to building generalized AI models that can accurately simulate and operate in the demanding conditions of the physical world. You don’t have to take my word for it — you can check out videos on RFM-1’s remarkable language and physics capabilities.
Covariant’s edge in developing RFM-1 comes from their pioneering use of embodied AI (meaning A.I. embodied in the physical world, as with robotics) since 2017. Since then, they have deployed a fleet of high-performing robotic systems to real customer sites across the world, creating a vast and multimodal real-world dataset in the process. This dataset mirrors the complexity of deploying systems into the real world and is enriched with data in various forms, including images, videos, sensor data, and quantitative metrics.
As a consequence of having rich data across all of these modalities, RFM-1 is set up as a multimodal any-to-any sequence model: an 8-billion-parameter transformer trained on text, images, videos, robot actions, and a range of numerical sensor readings.
To put this into plain English, what this means is that RFM-1 flexibly accepts text, images, videos, robot actions and various sensor readings as inputs as well as outputs. So, for example, you could provide RFM-1 with a video of a robotic action you’d like it to take, but you could also provide text saying that you’d like the robot to do something slightly different from the video, allowing on-the-fly customization. RFM-1 could then output an image or a video of what you’ve described, or it could simply take the robotic action, whatever you prefer! The implications of this are broad and game-changing. By tokenizing all modalities into a common space and performing autoregressive next-token prediction (the same way conversational agents like ChatGPT do), RFM-1 enables diverse applications, such as scene analysis, grasp action generation, and outcome prediction. This kind of approach enables robots across any industry or scenario to take human guidance and converse with humans to get feedback when they’re unsure of what to do or have other questions. On Covariant’s RFM-1 blog post there are several GIFs demonstrating the model’s ability to ask for human feedback in order to better understand a task or to obtain guidance on how to complete an action at all.
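If you’re curious what that looks like under the hood, here’s a minimal, hypothetical sketch in Python with PyTorch. The vocabulary size, model dimensions and layer counts below are illustrative assumptions, not RFM-1’s actual design (RFM-1 itself is an 8-billion-parameter model), but the core idea is the same: map every modality into one shared token vocabulary and train a single transformer with next-token prediction over the combined sequence.

```python
# A minimal, hypothetical sketch of the any-to-any idea (not Covariant's actual code):
# every modality is mapped into one shared token vocabulary, and a single
# transformer is trained with next-token prediction over the concatenated sequence.
import torch
import torch.nn as nn

VOCAB_SIZE = 65_536   # assumed shared vocabulary covering text, image, action, and sensor tokens
D_MODEL = 512         # illustrative size only; RFM-1 itself is an 8B-parameter model

class AnyToAnySequenceModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=6)
        self.head = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, tokens):  # tokens: (batch, seq_len) of ids from any modality
        seq_len = tokens.shape[1]
        # Causal mask so each position only attends to earlier tokens (autoregressive).
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        hidden = self.backbone(self.embed(tokens), mask=mask)
        return self.head(hidden)  # logits over the shared vocabulary

# Training objective: predict the next token, whatever modality it belongs to.
model = AnyToAnySequenceModel()
tokens = torch.randint(0, VOCAB_SIZE, (2, 128))   # stand-in for tokenized text + image + action data
logits = model(tokens[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB_SIZE), tokens[:, 1:].reshape(-1)
)
```

Because everything lives in the same token space, “generate a video of the predicted outcome” and “generate the next robot action” are just different stretches of the same autoregressively generated sequence.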
You’ve probably clued into this now, but RFM-1's ability to process natural-language tokens as input and predict natural-language tokens as output opens up the door to intuitive natural language interfaces, enabling anyone to quickly program new robot behavior in minutes rather than weeks or months. This language-guided robot programming lowers the barriers to customizing AI behavior to address each customer's dynamic business needs and the long tail of corner case scenarios. As Covariant continues to expand the granularity of robot control and diversity of tasks, they envision a future where people can use language to compose entire robot programs, further reducing the barrier to deploying new robot stations.
In addition, one of the key strengths of RFM-1 is its understanding of physics through learned world models. These models allow robots to develop physics intuitions that are critical for operating in the real world, where accuracy requirements are tight and the line between success and failure is thin.
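As a toy illustration of why a learned world model matters (purely a conceptual sketch, not RFM-1’s architecture): if a model can predict the observation that would result from a candidate action, the robot can score many candidate actions against its goal before it ever moves, which is exactly where that physics intuition pays off.

```python
# Toy sketch of the learned-world-model idea (illustrative only, not RFM-1's design):
# given the current observation and a candidate action, predict the resulting
# observation, then pick the action whose predicted outcome lands closest to the goal.
import torch
import torch.nn as nn

class ToyWorldModel(nn.Module):
    def __init__(self, obs_dim=64, action_dim=7):
        super().__init__()
        self.dynamics = nn.Sequential(
            nn.Linear(obs_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, obs_dim),              # predicted next observation
        )

    def forward(self, obs, action):
        return self.dynamics(torch.cat([obs, action], dim=-1))

def choose_action(world_model, obs, candidate_actions, goal_obs):
    """Score candidate actions by how close their predicted outcomes are to the goal."""
    with torch.no_grad():
        predicted = world_model(obs.expand(len(candidate_actions), -1), candidate_actions)
        errors = ((predicted - goal_obs) ** 2).sum(dim=-1)
    return candidate_actions[errors.argmin()]

world_model = ToyWorldModel()
obs = torch.randn(1, 64)         # stand-in for an encoded camera observation
goal = torch.randn(1, 64)        # stand-in for the desired end state
candidates = torch.randn(16, 7)  # 16 candidate grasp/motion commands
best = choose_action(world_model, obs, candidates, goal)
```

RFM-1’s world models operate over rich video and sensor tokens rather than the toy vectors above, but the principle is the same: predict the consequences first, then act.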
Very exciting news indeed, although there are limitations to brief you on that need to be addressed by further R&D. First, despite promising offline results, RFM-1 has not yet been deployed to Covariant customers, so its real-world performance remains to be seen. Additionally, due to the model’s context-length limitations (which cap video at a relatively low 512 x 512 pixel resolution and a slow five frames per second), RFM-1 has a limited ability to model small objects and rapid motions accurately. Lastly, while RFM-1 can understand basic language commands, the overall orchestration logic still relies heavily on traditional programming languages like Python and C++, so further work is needed to enable people to compose entire robot programs using natural language.
Foundation models like RFM-1 and GR00T, as well as those being developed for Figure humanoid robots, represent the start of a new era of Robotics Foundation Models that give robots the human-like ability to reason on the fly. They take a huge step forward toward delivering the autonomy needed to automate repetitive and dangerous tasks across industries like agriculture, manufacturing, logistics, construction and waste management, and potentially even to help out in healthcare by assisting in surgical procedures, handling medical supplies, or supporting patient-care tasks that require precise manipulation skills. I cannot stress enough how huge I think these robotics LLM developments are and how big a role they will play in lifting productivity and economic growth for decades to come.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.