While companies like OpenAI and Midjourney have focused on building AI tools that operate purely in the digital realm, such as text chatbots and image generators, a startup named Covariant, founded by three former OpenAI researchers, has taken a different path: bringing artificial intelligence into the physical world.
Located in Emeryville, California, the company focuses on developing technologies that enable robots to pick up, move, and sort objects within warehouses and distribution centers.
The primary goal is to give robots the ability to understand their surroundings and make appropriate decisions in real time. The company also aims to equip robots with a broad understanding of the English language, so that people can interact with them as if with a physical version of ChatGPT.
Although the technology is still under development and not yet perfect, it clearly indicates that systems managing text conversations and image generation will soon become the foundation for operating machines in factories, warehouses, roads, and even homes.
From digital data to practical intelligence
This approach relies on the same fundamentals that chatbots are built on: learning from vast amounts of data. Just as ChatGPT gained its writing and analytical abilities by studying texts from across the internet, robots improve as they are fed more visual and sensory data.
Covariant, which has raised $222 million in funding, does not manufacture robots itself but focuses on developing the software that powers them. The company plans to deploy this technology first in warehouses before expanding to production factories and possibly even self-driving cars.
The systems behind this technology are known as neural networks, inspired by the work of neurons in the brain. By identifying patterns in massive data sets, these networks can recognize and generate words, sounds, and images—the same principle that enabled OpenAI to build ChatGPT.
The latest step in this path is integrating multiple types of data. By studying images alongside their accompanying texts, the system learns the relationship between shape and description, understanding, for example, that the word “banana” refers to a curved yellow fruit. This approach underpins OpenAI’s Sora tool, capable of generating videos based on short textual descriptions.
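The matching of images to descriptions works, at its core, by placing both in a shared space of numerical features and picking the closest pair. The toy sketch below illustrates the idea only: the "embeddings" are hypothetical hand-made vectors, not the output of any real trained model such as the ones Covariant or OpenAI use.

```python
import math

# Hypothetical caption embeddings over made-up features:
# [curved, yellow, round, red]. A real system learns these
# vectors from millions of image-caption pairs.
captions = {
    "banana": [1.0, 1.0, 0.0, 0.0],
    "apple":  [0.0, 0.0, 1.0, 1.0],
    "lemon":  [0.0, 1.0, 1.0, 0.0],
}

def cosine(a, b):
    # Cosine similarity: how closely two feature vectors point
    # in the same direction, regardless of their length.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def best_caption(image_embedding):
    # Choose the caption whose embedding is most similar
    # to the image's embedding.
    return max(captions, key=lambda w: cosine(captions[w], image_embedding))

# An "image" of a curved yellow object should match "banana".
curved_yellow = [0.9, 1.0, 0.1, 0.0]
print(best_caption(curved_yellow))  # banana
```

In a trained system the vectors come from neural networks rather than being written by hand, but the matching step, finding the description nearest to the image in a shared feature space, is the same in spirit.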
Covariant’s founders, Professor Pieter Abbeel of the University of California, Berkeley, and his former students Peter Chen, Rocky Duan, and Tianhao Zhang, used the same method to train systems capable of operating sorting robots in warehouses worldwide.
The company collected massive data over years from cameras and sensors documenting how robots work. By combining this data with the vast texts used to train ChatGPT, they built a system that provides robots with a broader understanding of their environment.
As a result, the robot can handle unexpected situations. It knows how to pick up a banana even if it has never encountered one before. It can also understand linguistic instructions: if told “pick up a banana,” it will comply, and if told “pick up a yellow fruit,” it will understand that as well.
The system can even generate imagined video clips predicting what might happen while attempting a task, such as picking up a banana. Although these videos have no direct practical value in the warehouse, they reflect the robot’s level of understanding of its surroundings.
Challenges and usage limits
The new technology, called the robotics foundation model (RFM), is not without errors, much like chatbots. It may sometimes drop objects or misunderstand instructions.
Professor Gary Marcus, an AI researcher and professor emeritus of psychology and neuroscience at New York University, sees the technology as promising in environments where a certain margin of error is acceptable, such as warehouses. However, he warned about the difficulty of deploying it in production factories or situations that could be dangerous: “It depends on the cost of error. If the robot weighs 150 pounds and makes a wrong move, the consequences could be costly.”
Nevertheless, researchers expect these systems to improve rapidly as more diverse data is introduced, distinguishing them from traditional robots programmed for specific repetitive tasks, such as tightening a bolt or lifting boxes of a single size. Those robots were incapable of handling random or unexpected situations.
Today, by learning from digital examples simulating the real world, robots can deal with the unknown and respond to textual or voice commands just like chat programs. This means robots, like image and text generation systems, will become more flexible and quicker to adapt.
As Dr. Chen summarized: “What exists in digital data can be transferred to the real world.”