Leveling the Playing Field: Carefully Comparing Classical and Learned Controllers for Quadrotor Trajectory Tracking.
Kunapuli, P., Welde, J., Jayaraman, D., & Kumar, V.
RSS. 2025.

@article{kunapuli2025leveling,
  title={Leveling the Playing Field: Carefully Comparing Classical and Learned Controllers for Quadrotor Trajectory Tracking},
  author={Pratik Kunapuli and Jake Welde and Dinesh Jayaraman and Vijay Kumar},
  abstract={Learning-based control approaches like reinforcement learning (RL) have recently produced a slew of impressive results for tasks like quadrotor trajectory tracking and drone racing. Naturally, it is common to demonstrate the advantages of these new controllers against established methods like analytical controllers. We observe, however, that reliably comparing the performance of these very different classes of controllers is more complicated than might appear at first sight. As a case study, we take up the problem of agile tracking of an end-effector for a quadrotor with a fixed-arm. We develop a set of best practices for synthesizing the best RL and Geometric controllers for benchmarking. In the process, we fix widely prevalent RL-favoring biases in prior studies that provide asymmetric access to: (1) the task definition in the form of objective functions, (2) datasets for parameter optimization, and (3) “feed-forward” controller inputs revealing the desired future trajectory. The resulting contributions are threefold: first, our improved robust experimental protocol reveals that the gaps between the two controller classes are much smaller than expected from previously published findings. Geometric control performs on par or better than RL in most practical settings, while RL fares better in transient performance at the expense of steady-state error. Second, our improvements to the experimental protocol for comparing learned and classical controller synthesis approaches are critical: each of the above asymmetries can yield misleading conclusions, and we show evidence that suggests that they indeed have in prior quadrotor studies. Finally, we open-source implementations of Geometric and RL controllers for these aerial vehicles implementing best practices for future development.},
  journal={RSS},
  year={2025}
}

Learning-based control approaches like reinforcement learning (RL) have recently produced a slew of impressive results for tasks like quadrotor trajectory tracking and drone racing. Naturally, it is common to demonstrate the advantages of these new controllers against established methods like analytical controllers. We observe, however, that reliably comparing the performance of these very different classes of controllers is more complicated than might appear at first sight. As a case study, we take up the problem of agile tracking of an end-effector for a quadrotor with a fixed-arm. We develop a set of best practices for synthesizing the best RL and Geometric controllers for benchmarking. In the process, we fix widely prevalent RL-favoring biases in prior studies that provide asymmetric access to: (1) the task definition in the form of objective functions, (2) datasets for parameter optimization, and (3) “feed-forward” controller inputs revealing the desired future trajectory. The resulting contributions are threefold: first, our improved robust experimental protocol reveals that the gaps between the two controller classes are much smaller than expected from previously published findings. Geometric control performs on par or better than RL in most practical settings, while RL fares better in transient performance at the expense of steady-state error. Second, our improvements to the experimental protocol for comparing learned and classical controller synthesis approaches are critical: each of the above asymmetries can yield misleading conclusions, and we show evidence that suggests that they indeed have in prior quadrotor studies. Finally, we open-source implementations of Geometric and RL controllers for these aerial vehicles implementing best practices for future development.
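
To make the "symmetric access" idea concrete, below is a minimal Python sketch (not the paper's released benchmark) of an evaluation harness in which both controller classes are tuned against and scored with one shared tracking cost on the same reference trajectories; the `controller` and `env` interfaces and the cost weights are illustrative assumptions.

import numpy as np

def tracking_cost(positions, reference, w_pos=1.0, w_smooth=0.1):
    # Shared objective used for tuning *and* reporting, so neither controller
    # class gets privileged access to the task definition (illustrative weights).
    pos_err = np.mean(np.linalg.norm(positions - reference, axis=-1))
    smoothness = np.mean(np.linalg.norm(np.diff(positions, axis=0), axis=-1))
    return w_pos * pos_err + w_smooth * smoothness

def evaluate(controller, env, references):
    # Roll out any controller (geometric or RL) behind the same interface,
    # on the same reference set, and report the same averaged cost.
    costs = []
    for ref in references:
        obs = env.reset(reference=ref)            # hypothetical env API
        positions = []
        for _ in range(len(ref)):
            obs = env.step(controller(obs))       # controller: obs -> action
            positions.append(obs["position"])
        costs.append(tracking_cost(np.array(positions), np.asarray(ref)))
    return float(np.mean(costs))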

Illustrated Landmark Graphs for Long-Horizons Policy Learning.
Watson, C., Krishna, A., Alur, R., & Jayaraman, D.
TMLR. 2025.

@article{watson2025ilg,
  title={Illustrated Landmark Graphs for Long-Horizons Policy Learning},
  author={Christopher Watson and Arjun Krishna and Rajeev Alur and Dinesh Jayaraman},
  abstract={Applying learning-based approaches to long-horizon sequential decision-making tasks requires a human teacher to carefully craft reward functions or curate demonstrations to elicit desired behaviors. To simplify this, we first introduce an alternative form of task-specification, Illustrated Landmark Graph (ILG), that represents the task as a directed-acyclic graph where each vertex corresponds to a region of the state space (a landmark), and each edge represents an easier to achieve sub-task. A landmark in the ILG is conveyed to the agent through a few illustrative examples grounded in the agent’s observation space. Second, we propose ILG-Learn, a human in the loop algorithm that interleaves planning over the ILG and sub-task policy learning. ILG-Learn adaptively plans through the ILG by relying on the human teacher’s feedback to estimate the success rates of learned policies. We conduct experiments on long-horizon block stacking and point maze navigation tasks, and find that our approach achieves considerably higher success rates (~ 50% improvement) compared to hierarchical reinforcement learning and imitation learning baselines. Additionally, we highlight how the flexibility of the ILG specification allows the agent to learn a sequence of sub-tasks that is better suited to its limited capabilities.},
  journal={TMLR},
  year={2025}
}

Applying learning-based approaches to long-horizon sequential decision-making tasks requires a human teacher to carefully craft reward functions or curate demonstrations to elicit desired behaviors. To simplify this, we first introduce an alternative form of task-specification, Illustrated Landmark Graph (ILG), that represents the task as a directed-acyclic graph where each vertex corresponds to a region of the state space (a landmark), and each edge represents an easier to achieve sub-task. A landmark in the ILG is conveyed to the agent through a few illustrative examples grounded in the agent’s observation space. Second, we propose ILG-Learn, a human in the loop algorithm that interleaves planning over the ILG and sub-task policy learning. ILG-Learn adaptively plans through the ILG by relying on the human teacher’s feedback to estimate the success rates of learned policies. We conduct experiments on long-horizon block stacking and point maze navigation tasks, and find that our approach achieves considerably higher success rates (~ 50% improvement) compared to hierarchical reinforcement learning and imitation learning baselines. Additionally, we highlight how the flexibility of the ILG specification allows the agent to learn a sequence of sub-tasks that is better suited to its limited capabilities.
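
As an illustration of the ILG idea (a DAG of landmarks with teacher-estimated success rates on the sub-task edges), here is a small Python sketch; the landmark names, rates, and dictionary representation are invented for illustration and are not the authors' implementation.

# Each landmark is conveyed by a few illustrative example observations;
# each edge is a sub-task with a success rate estimated from teacher feedback.
landmarks = {
    "start":   {"examples": []},
    "grasped": {"examples": []},
    "stacked": {"examples": []},
}
edges = {
    ("start", "grasped"): 0.9,
    ("grasped", "stacked"): 0.7,
    ("start", "stacked"): 0.4,   # a harder direct sub-task
}

def best_path(src, dst, edges):
    # Depth-first search for the landmark path whose estimated sub-task
    # success rates multiply to the largest value.
    best = (0.0, None)
    def dfs(node, prob, path):
        nonlocal best
        if node == dst:
            if prob > best[0]:
                best = (prob, path)
            return
        for (u, v), p in edges.items():
            if u == node and v not in path:
                dfs(v, prob * p, path + [v])
    dfs(src, 1.0, [src])
    return best

print(best_path("start", "stacked", edges))  # -> (0.63, ['start', 'grasped', 'stacked'])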

Vision Language Models are In-Context Value Learners.
Ma, Y. J., Hejna, J., Fu, C., Shah, D., Liang, J., Xu, Z., Kirmani, S., Xu, P., Driess, D., Xiao, T., Bastani, O., Jayaraman, D., Yu, W., Zhang, T., Sadigh, D., & Xia, F.
ICLR. 2025.

@article{ma2025gvl,
  title={Vision Language Models are In-Context Value Learners},
  author={Yecheng Jason Ma and Joey Hejna and Chuyuan Fu and Dhruv Shah and Jacky Liang and Zhuo Xu and Sean Kirmani and Peng Xu and Danny Driess and Ted Xiao and Osbert Bastani and Dinesh Jayaraman and Wenhao Yu and Tingnan Zhang and Dorsa Sadigh and Fei Xia},
  abstract={Predicting temporal progress from visual trajectories is important for intelligent robots that can learn, adapt, and improve. However, learning such progress estimator, or temporal value function, across different tasks and domains requires both a large amount of diverse data and methods which can scale and generalize. To address these challenges, we present Generative Value Learning (GVL), a universal value function estimator that leverages the world knowledge embedded in vision-language models (VLMs) to predict task progress. Naively asking a VLM to predict values for a video sequence performs poorly due to the strong temporal correlation between successive frames. Instead, GVL poses value estimation as a temporal ordering problem over shuffled video frames; this seemingly more challenging task encourages VLMs to more fully exploit their underlying semantic and temporal grounding capabilities to differentiate frames based on their perceived task progress, consequently producing significantly better value predictions. Without any robot or task specific training, GVL can in-context zero-shot and few-shot predict effective values for more than 300 distinct real-world tasks across diverse robot platforms, including challenging bimanual manipulation tasks. Furthermore, we demonstrate that GVL permits flexible multi-modal in-context learning via examples from heterogeneous tasks and embodiments, such as human videos. The generality of GVL enables various downstream applications pertinent to visuomotor policy learning, including dataset filtering, success detection, and value-weighted regression -- all without any model training or finetuning.},
  journal={ICLR},
  year={2025}
}

Predicting temporal progress from visual trajectories is important for intelligent robots that can learn, adapt, and improve. However, learning such progress estimator, or temporal value function, across different tasks and domains requires both a large amount of diverse data and methods which can scale and generalize. To address these challenges, we present Generative Value Learning (GVL), a universal value function estimator that leverages the world knowledge embedded in vision-language models (VLMs) to predict task progress. Naively asking a VLM to predict values for a video sequence performs poorly due to the strong temporal correlation between successive frames. Instead, GVL poses value estimation as a temporal ordering problem over shuffled video frames; this seemingly more challenging task encourages VLMs to more fully exploit their underlying semantic and temporal grounding capabilities to differentiate frames based on their perceived task progress, consequently producing significantly better value predictions. Without any robot or task specific training, GVL can in-context zero-shot and few-shot predict effective values for more than 300 distinct real-world tasks across diverse robot platforms, including challenging bimanual manipulation tasks. Furthermore, we demonstrate that GVL permits flexible multi-modal in-context learning via examples from heterogeneous tasks and embodiments, such as human videos. The generality of GVL enables various downstream applications pertinent to visuomotor policy learning, including dataset filtering, success detection, and value-weighted regression – all without any model training or finetuning.
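
A minimal Python sketch of the shuffled-frame value estimation idea described above; `query_vlm` is a hypothetical callable (one progress estimate in [0, 1] per input frame), not the paper's released interface.

import random

def generative_value_learning(frames, query_vlm):
    # Shuffling breaks the temporal correlation that otherwise lets the model
    # emit a trivially monotone progress sequence.
    order = list(range(len(frames)))
    random.shuffle(order)
    shuffled = [frames[i] for i in order]
    progress_shuffled = query_vlm(shuffled)      # one value per shuffled frame
    # Undo the shuffle so values align with the original timeline.
    values = [0.0] * len(frames)
    for value, i in zip(progress_shuffled, order):
        values[i] = value
    return values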

Articulate-Anything: Automatic Modeling of Articulated Objects via a Vision-Language Foundation Model.
Le, L., Xie, J., Liang, W., Wang, H., Yang, Y., Ma, Y. J., Vedder, K., Krishna, A., Jayaraman, D., & Eaton, E.
ICLR. 2025.

@article{le2025articulate,
  title={Articulate-Anything: Automatic Modeling of Articulated Objects via a Vision-Language Foundation Model},
  author={Long Le and Jason Xie and William Liang and Hung-Ju Wang and Yue Yang and Yecheng Jason Ma and Kyle Vedder and Arjun Krishna and Dinesh Jayaraman and Eric Eaton},
  abstract={Interactive 3D simulated objects are crucial in AR/VR, animations, and robotics, driving immersive experiences and advanced automation. However, creating these articulated objects requires extensive human effort and expertise, limiting their broader applications. To overcome this challenge, we present Articulate-Anything, a system that automates the articulation of diverse, complex objects from many input modalities, including text, images, and videos. Articulate-Anything leverages vision-language models (VLMs) to generate code that can be compiled into an interactable digital twin for use in standard 3D simulators. Our system exploits existing 3D asset datasets via a mesh retrieval mechanism, along with an actor-critic system that iteratively proposes, evaluates, and refines solutions for articulating the objects, self-correcting errors to achieve a robust outcome. Qualitative evaluations demonstrate Articulate-Anything's capability to articulate complex and even ambiguous object affordances by leveraging rich grounded inputs. In extensive quantitative experiments on the standard PartNet-Mobility dataset, Articulate-Anything substantially outperforms prior work, increasing the success rate from 8.7-11.6% to 75% and setting a new bar for state-of-art performance. We further showcase the utility of our generated assets by using them to train robotic policies for fine-grained manipulation tasks that go beyond basic pick and place.},
  journal={ICLR},
  year={2025}
}

Interactive 3D simulated objects are crucial in AR/VR, animations, and robotics, driving immersive experiences and advanced automation. However, creating these articulated objects requires extensive human effort and expertise, limiting their broader applications. To overcome this challenge, we present Articulate-Anything, a system that automates the articulation of diverse, complex objects from many input modalities, including text, images, and videos. Articulate-Anything leverages vision-language models (VLMs) to generate code that can be compiled into an interactable digital twin for use in standard 3D simulators. Our system exploits existing 3D asset datasets via a mesh retrieval mechanism, along with an actor-critic system that iteratively proposes, evaluates, and refines solutions for articulating the objects, self-correcting errors to achieve a robust outcome. Qualitative evaluations demonstrate Articulate-Anything's capability to articulate complex and even ambiguous object affordances by leveraging rich grounded inputs. In extensive quantitative experiments on the standard PartNet-Mobility dataset, Articulate-Anything substantially outperforms prior work, increasing the success rate from 8.7-11.6% to 75% and setting a new bar for state-of-art performance. We further showcase the utility of our generated assets by using them to train robotic policies for fine-grained manipulation tasks that go beyond basic pick and place.
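
The propose-evaluate-refine loop described above can be pictured roughly as in the Python sketch below; `propose`, `simulate`, and `critique` are hypothetical stand-ins for the VLM actor, the compile/simulation step, and the VLM critic, not the released API.

def articulate(observation, propose, simulate, critique, max_iters=5):
    # Iteratively propose articulation code, compile it into a digital twin,
    # and let a critic flag errors until the result is accepted.
    asset, feedback = None, None
    for _ in range(max_iters):
        code = propose(observation, feedback)        # VLM actor (text/image/video input)
        asset, errors = simulate(code)               # compile into an interactable asset
        accepted, feedback = critique(observation, asset, errors)  # VLM critic
        if accepted:
            break
    return asset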

REGENT: A Retrieval-Augmented Generalist Agent That Can Act In-Context in New Environments.
Sridhar, K., Dutta, S., Jayaraman, D., & Lee, I.
ICLR. 2025.

@article{sridhar2025regent,
  title={REGENT: A Retrieval-Augmented Generalist Agent That Can Act In-Context in New Environments},
  author={Kaustubh Sridhar and Souradeep Dutta and Dinesh Jayaraman and Insup Lee},
  abstract={Do generalist agents require large models pre-trained on massive amounts of data to rapidly adapt to new environments? We propose a novel approach to pre-train relatively small models and adapt them to unseen environments via in-context learning, without any finetuning. Our key idea is that retrieval offers a powerful bias for fast adaptation. Indeed, we demonstrate that even a simple retrieval-based 1-nearest neighbor agent offers a surprisingly strong baseline for today's state-of-the-art generalist agents. From this starting point, we construct a semi-parametric agent, REGENT, that trains a transformer-based policy on sequences of queries and retrieved neighbors. REGENT can generalize to unseen robotics and game-playing environments via retrieval augmentation and in-context learning, achieving this with up to 3x fewer parameters and up to an order-of-magnitude fewer pre-training datapoints, significantly outperforming today's state-of-the-art generalist agents.},
  journal={ICLR},
  year={2025}
}

Do generalist agents require large models pre-trained on massive amounts of data to rapidly adapt to new environments? We propose a novel approach to pre-train relatively small models and adapt them to unseen environments via in-context learning, without any finetuning. Our key idea is that retrieval offers a powerful bias for fast adaptation. Indeed, we demonstrate that even a simple retrieval-based 1-nearest neighbor agent offers a surprisingly strong baseline for today's state-of-the-art generalist agents. From this starting point, we construct a semi-parametric agent, REGENT, that trains a transformer-based policy on sequences of queries and retrieved neighbors. REGENT can generalize to unseen robotics and game-playing environments via retrieval augmentation and in-context learning, achieving this with up to 3x fewer parameters and up to an order-of-magnitude fewer pre-training datapoints, significantly outperforming today's state-of-the-art generalist agents.
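
The "simple retrieval-based 1-nearest neighbor agent" mentioned above can be sketched in a few lines of Python; this assumes observations are already embedded as fixed-length vectors, which is a simplification of this sketch rather than a claim about the paper's pipeline.

import numpy as np

class OneNearestNeighborAgent:
    # Act with the action whose recorded state is closest to the current observation.
    def __init__(self, demo_states, demo_actions):
        self.states = np.asarray(demo_states)   # (N, d) retrieval buffer from the new environment
        self.actions = list(demo_actions)       # N matching actions

    def act(self, obs):
        dists = np.linalg.norm(self.states - np.asarray(obs), axis=1)
        return self.actions[int(np.argmin(dists))]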

ZeroMimic: Distilling Robotic Manipulation Skills from Web Videos.
Shi*, J., Zhao*, Z., Wang, T., Pedroza, I., Luo, A., Wang, J., Ma, J., & Jayaraman, D.
ICRA. 2025.
Paper: https://zeromimic.github.io/

@article{shi2025zeromimic,
  title={ZeroMimic: Distilling Robotic Manipulation Skills from Web Videos},
  author={Junyao Shi* and Zhuolun Zhao* and Tianyou Wang and Ian Pedroza and Amy Luo and Jie Wang and Jason Ma and Dinesh Jayaraman},
  abstract={Many recent advances in robotic manipulation have come through imitation learning, yet these rely largely on mimicking a particularly hard-to-acquire form of demonstrations: those collected on the same robot in the same room with the same objects as the trained policy must handle at test time. In contrast, large pre-recorded human video datasets demonstrating manipulation skills in-the-wild already exist, which contain valuable information for robots. Is it possible to distill a repository of useful robotic skill policies out of such data without any additional requirements on robot-specific demonstrations or exploration? We present the first such system, ZeroMimic, that generates immediately deployable image goal-conditioned skill policies for several common categories of manipulation tasks (opening, closing, pouring, pick&place, cutting, and stirring) each capable of acting upon diverse objects and across diverse unseen task setups. ZeroMimic is carefully designed to exploit recent advances in semantic and geometric visual understanding of human videos, together with modern grasp affordance detectors and imitation policy classes. After training ZeroMimic on the popular EpicKitchens dataset of egocentric human videos, we evaluate its out-of-the-box performance in varied kitchen settings, demonstrating its impressive abilities to handle these varied tasks. To enable plug-and-play reuse of ZeroMimic policies on other task setups and robots, we will release software and policy checkpoints for all skills.},
  journal={ICRA},
  year={2025},
  url={https://zeromimic.github.io/}
}

Many recent advances in robotic manipulation have come through imitation learning, yet these rely largely on mimicking a particularly hard-to-acquire form of demonstrations: those collected on the same robot in the same room with the same objects as the trained policy must handle at test time. In contrast, large pre-recorded human video datasets demonstrating manipulation skills in-the-wild already exist, which contain valuable information for robots. Is it possible to distill a repository of useful robotic skill policies out of such data without any additional requirements on robot-specific demonstrations or exploration? We present the first such system, ZeroMimic, that generates immediately deployable image goal-conditioned skill policies for several common categories of manipulation tasks (opening, closing, pouring, pick&place, cutting, and stirring) each capable of acting upon diverse objects and across diverse unseen task setups. ZeroMimic is carefully designed to exploit recent advances in semantic and geometric visual understanding of human videos, together with modern grasp affordance detectors and imitation policy classes. After training ZeroMimic on the popular EpicKitchens dataset of egocentric human videos, we evaluate its out-of-the-box performance in varied kitchen settings, demonstrating its impressive abilities to handle these varied tasks. To enable plug-and-play reuse of ZeroMimic policies on other task setups and robots, we will release software and policy checkpoints for all skills.
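
A small Python sketch of what "plug-and-play reuse" of per-category, image goal-conditioned skill policies might look like; the skill names follow the categories listed in the abstract, but the policy objects and their `predict` interface are assumptions of this sketch, not the released ZeroMimic API.

class SkillRepository:
    # Dispatch a per-category skill policy on a current image and a goal image.
    def __init__(self, policies):
        # e.g. {"opening": policy, "pouring": policy, ...}
        self.policies = dict(policies)

    def act(self, skill, wrist_image, goal_image):
        if skill not in self.policies:
            raise KeyError(f"no policy for skill '{skill}'")
        return self.policies[skill].predict(wrist_image, goal_image)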

Leveraging Symmetry to Accelerate Learning of Trajectory Tracking Controllers for Free-Flying Robotic Systems.
Welde*, J., Rao*, N., Kunapuli*, P., Jayaraman, D., & Kumar, V.
ICRA. 2025.

@article{welde2025symmetry,
  title={Leveraging Symmetry to Accelerate Learning of Trajectory Tracking Controllers for Free-Flying Robotic Systems},
  author={Jake Welde* and Nishanth Rao* and Pratik Kunapuli* and Dinesh Jayaraman and Vijay Kumar},
  abstract={Tracking controllers enable robotic systems to accurately follow planned reference trajectories. In particular, reinforcement learning (RL) has shown promise in the synthesis of controllers for systems with complex dynamics and modest online compute budgets. However, the poor sample efficiency of RL and the challenges of reward design make training slow and sometimes unstable, especially for high-dimensional systems. In this work, we leverage the inherent Lie group symmetries of robotic systems with a floating base to mitigate these challenges when learning tracking controllers. We model a general tracking problem as a Markov decision process (MDP) that captures the evolution of both the physical and reference states. Next, we prove that symmetry in the underlying dynamics and running costs leads to an MDP homomorphism, a mapping that allows a policy trained on a lower-dimensional “quotient” MDP to be lifted to an optimal tracking controller for the original system. We compare this symmetry-informed approach to an unstructured baseline, using Proximal Policy Optimization (PPO) to learn tracking controllers for three systems: the Particle (a forced point mass), the Astrobee (a fully-actuated space robot), and the Quadrotor (an underactuated system). Results show that a symmetry-aware approach both accelerates training and reduces tracking error after the same number of training steps.},
  journal={ICRA},
  year={2025}
}

Tracking controllers enable robotic systems to accurately follow planned reference trajectories. In particular, reinforcement learning (RL) has shown promise in the synthesis of controllers for systems with complex dynamics and modest online compute budgets. However, the poor sample efficiency of RL and the challenges of reward design make training slow and sometimes unstable, especially for high-dimensional systems. In this work, we leverage the inherent Lie group symmetries of robotic systems with a floating base to mitigate these challenges when learning tracking controllers. We model a general tracking problem as a Markov decision process (MDP) that captures the evolution of both the physical and reference states. Next, we prove that symmetry in the underlying dynamics and running costs leads to an MDP homomorphism, a mapping that allows a policy trained on a lower-dimensional “quotient” MDP to be lifted to an optimal tracking controller for the original system. We compare this symmetry-informed approach to an unstructured baseline, using Proximal Policy Optimization (PPO) to learn tracking controllers for three systems: the Particle (a forced point mass), the Astrobee (a fully-actuated space robot), and the Quadrotor (an underactuated system). Results show that a symmetry-aware approach both accelerates training and reduces tracking error after the same number of training steps.
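
For the simplest of the three systems (the Particle, a forced point mass with translational symmetry), the quotient construction can be illustrated in Python: dynamics and running cost depend only on the error between physical and reference states, so a policy trained on this reduced observation lifts to a tracking controller for any absolute reference. The rotational symmetries of the Astrobee and Quadrotor require the full Lie-group treatment in the paper; this sketch is an assumption-laden illustration, not the released code.

import numpy as np

def quotient_observation(position, velocity, ref_position, ref_velocity):
    # Translation-invariant "quotient" observation: the policy only sees the
    # relative (error) state rather than absolute physical and reference states.
    return np.concatenate([position - ref_position, velocity - ref_velocity])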