In 2025, the 3 things the robotics field has taught me @DrJimFan
1⃣️ Hardware leads software, but hardware reliability severely limits the speed of software iteration. We have already seen extremely sophisticated engineering masterpieces: Optimus, e-Atlas, Figure, Neo, G1, and more. The problem is that our best AI still hasn't fully exploited the potential of this cutting-edge hardware. The robot body is clearly capable of more than the brain can currently command. However, keeping these robots running often requires an entire operations team. Robots don't self-repair the way humans do: overheating, motor failures, and bizarre firmware issues are near-daily nightmares. Once something goes wrong, it's irreversible and unforgiving. The only thing truly being scaled is my patience.
2⃣️ Benchmarking in robotics remains an epic disaster. In the world of large models, everyone knows what MMLU and SWE-Bench are. But in robotics there is no consensus: which hardware platform to use, how to define tasks, what the scoring standards are, which simulator to use, or whether to go straight to the real world. Everyone is SOTA by definition, because every press release defines a new benchmark on the fly, and everyone cherry-picks the most impressive demo out of 100 failed attempts. By 2026, our field must do better; we can no longer treat reproducibility and scientific standards as second-class citizens.
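As an illustration of the reproducibility point (my own sketch, not from the original post): instead of showing the single best rollout, a minimal evaluation would report the success rate over all repeated trials of a task, with an uncertainty estimate. The function name and the numbers below are hypothetical.

```python
import math

def success_rate_with_ci(outcomes, z=1.96):
    """Mean success rate plus a normal-approximation 95% confidence
    interval over repeated trials, instead of cherry-picking one demo."""
    n = len(outcomes)
    p = sum(outcomes) / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

# Example: 100 rollouts of the same task, 1 = success, 0 = failure.
trials = [1] * 37 + [0] * 63
rate, lo, hi = success_rate_with_ci(trials)
print(f"success rate: {rate:.2f} (95% CI [{lo:.2f}, {hi:.2f}])")
```

Reporting the full distribution of outcomes, rather than the best single clip, is the minimum bar the post argues the field should clear.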
3⃣️ The VLM-based VLA approach always feels a bit off. VLA refers to Vision-Language-Action models, currently the mainstream paradigm for robot brains. The formula is simple: take a pre-trained VLM checkpoint and "attach" an action module on top. But on closer inspection, problems emerge. VLMs are fundamentally optimized for benchmarks like visual question answering, which has two consequences: most of the VLM's parameters serve language and knowledge rather than the physical world, and the visual encoders are effectively trained to discard low-level detail, because question answering only requires high-level understanding. For robots, however, fine-grained details are crucial for dexterous manipulation. Therefore, VLA performance is unlikely to scale linearly with VLM parameter count; the pre-training objective itself is misaligned. #AI #Robotics
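To make the "VLM checkpoint plus bolted-on action module" pattern concrete, here is a toy sketch of my own (not from the post, and not any specific VLA system): stand-in encoders play the role of a pre-trained VLM, and a thin action head maps the fused features to a low-dimensional action, e.g. a 7-DoF end-effector command. All module names and dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ToyVLA(nn.Module):
    """Toy Vision-Language-Action sketch: stand-in 'VLM' encoders
    with a small action head attached on top."""
    def __init__(self, vis_dim=768, txt_dim=768, hidden=512, action_dim=7):
        super().__init__()
        # Stand-ins for a pre-trained vision encoder and language model.
        self.vision_encoder = nn.Linear(vis_dim, hidden)
        self.language_encoder = nn.Linear(txt_dim, hidden)
        # The action head is a thin adapter bolted on top; in real VLAs,
        # most parameters sit in the VLM part, which is the concern above.
        self.action_head = nn.Sequential(
            nn.Linear(2 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dim),  # e.g. 7-DoF end-effector command
        )

    def forward(self, image_feats, text_feats):
        v = self.vision_encoder(image_feats)
        t = self.language_encoder(text_feats)
        return self.action_head(torch.cat([v, t], dim=-1))

# One forward pass on dummy features.
model = ToyVLA()
actions = model(torch.randn(1, 768), torch.randn(1, 768))
print(actions.shape)  # torch.Size([1, 7])
```

The sketch only shows the wiring; the post's criticism is that the pre-trained part was never optimized to preserve the low-level visual detail the action head actually needs.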