We study imitation learning to teach robots from expert demonstrations. During execution, compounding errors from hardware noise and external disturbances, coupled with incomplete data coverage, can drive the agent into unfamiliar states and cause unpredictable behavior. To address this challenge, we propose CCIL: Continuity-based data augmentation for Corrective Imitation Learning, a framework that leverages the local continuity inherent in dynamical systems to synthesize corrective labels. CCIL learns a dynamics model from the expert data and uses it to generate labels that guide the agent back to expert states. Our approach makes minimal assumptions, requiring neither expert re-labeling nor a ground-truth dynamics model. By exploiting local continuity, we derive provable bounds on the errors of the synthesized labels. Through evaluations across diverse robotic domains in simulation and the real world, we demonstrate CCIL's effectiveness in improving imitation learning performance.
Without corrective labels, the agent can knock over the cube while trying to grasp it.
Without corrective labels, the agent is not precise enough to reliably insert the gear.
Without corrective labels, the agent is not able to precisely grasp the coin in the right place.
Our label generation algorithm consists of three steps: learning a dynamics model, generating corrective labels, and filtering out high-error labels.
We learn a dynamics model by minimizing the following loss: $$\mathbb{E}_{(s_t^*,a_t^*,s_{t+1}^*)\sim\mathcal{D}^*}\left[\left\|\hat{f}(s_t^*,a_t^*)+s_t^*-s_{t+1}^*\right\|\right]$$ Notably, a learned dynamics model can only yield reliable predictions near its data support, not on arbitrary states and actions. CCIL decides where to query the learned dynamics model by leveraging the local Lipschitz continuity present in the system dynamics. CCIL encourages the learned dynamics function to exhibit local Lipschitz continuity by modifying the training objective, specifically by regularizing the continuity of the learned model with spectral normalization. Concretely, to train a dynamics model $\hat{f}$ using a neural network of $n$ layers with weight matrices $W_1,\ldots,W_n$, one can iteratively minimize the above training objective while regularizing the model by setting $$W_i\leftarrow \frac{W_i}{\max\left(\|W_i\|_2,K^{1/n}\right)}\cdot K^{1/n}$$ for every $W_i$, where $K$ is the Lipschitz constraint hyperparameter.
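Below is a minimal PyTorch sketch of this training procedure, assuming a small fully connected network that predicts the state delta $s_{t+1}-s_t$; the architecture, data loader format, and hyperparameters are illustrative assumptions rather than the paper's exact implementation. After each gradient step, every weight matrix is rescaled so its spectral norm stays below $K^{1/n}$; since ReLU is 1-Lipschitz, this keeps the whole network's Lipschitz constant bounded by roughly $K$.

```python
import torch
import torch.nn as nn

# Minimal sketch (not the authors' exact code): a residual dynamics model
# f_hat(s, a) ~ s_{t+1} - s_t, trained on expert transitions while clipping
# each layer's spectral norm to K^(1/n).

class DynamicsModel(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Linear(state_dim + action_dim, hidden),
            nn.Linear(hidden, hidden),
            nn.Linear(hidden, state_dim),
        ])

    def forward(self, s, a):
        x = torch.cat([s, a], dim=-1)
        for i, layer in enumerate(self.layers):
            x = layer(x)
            if i < len(self.layers) - 1:
                x = torch.relu(x)
        return x  # predicted state delta: s_{t+1} - s_t

def clip_lipschitz_(model, K):
    """Rescale each weight matrix so its spectral norm is at most K^(1/n)."""
    n = len(model.layers)
    per_layer = K ** (1.0 / n)
    with torch.no_grad():
        for layer in model.layers:
            sigma = torch.linalg.matrix_norm(layer.weight, ord=2)
            if sigma > per_layer:
                layer.weight.mul_(per_layer / sigma)

def train_dynamics(model, loader, K=1.0, epochs=100, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for s, a, s_next in loader:  # assumed to yield expert (s, a, s') batches
            loss = (model(s, a) + s - s_next).norm(dim=-1).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
            clip_lipschitz_(model, K)  # spectral-norm regularization step
    return model
```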
With a learned dynamics model $\hat{f}$, we can generate a corrective label $(s_t^\mathcal{G}, a_t^\mathcal{G})$ for every expert data point $(s_t^*, a_t^*)$ such that $s_t^\mathcal{G}+\hat{f}(s_t^\mathcal{G},a_t^\mathcal{G})\approx s_t^*$. One of our label generation methods is BackTrack, inspired by the backwards Euler method used in modern simulators: \begin{align*} s_t^\mathcal{G} &\leftarrow s_t^* - \hat{f}(s_t^*, a_t^*) \\ a_t^\mathcal{G} &\leftarrow a_t^* \end{align*}
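As a rough illustration of this step, the sketch below computes BackTrack labels in batch from a trained model; the batched tensor interface is an assumption for illustration, not the authors' code.

```python
import torch

def backtrack_labels(model, states, actions):
    """Sketch of BackTrack label generation (one label per expert transition).

    For each expert pair (s_t*, a_t*), the generated state is
    s^G = s_t* - f_hat(s_t*, a_t*), and the generated action is a_t* itself,
    so that taking a_t* from s^G approximately lands back on the expert state.
    """
    with torch.no_grad():
        gen_states = states - model(states, actions)
    gen_actions = actions
    return gen_states, gen_actions
```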
By leveraging the local continuity in the environment dynamics, we can derive provable bounds on the correctness of the generated labels. Armed with this error bound, we can filter out high-error labels and only use the ones that are likely to be correct. Concretely, we set a maximum allowable error, which naturally creates a maximum allowable distance between the generated state and the expert state. This can be viewed as a trust region around each expert data point, within which we can trust the generated labels to be accurate.
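The filtering step can be sketched as below, assuming the practical form described above: the error bound translates into a maximum distance `epsilon` between each generated state and its expert state, with `epsilon` treated as a per-environment hyperparameter.

```python
import torch

def filter_labels(gen_states, gen_actions, expert_states, epsilon):
    """Keep only labels whose generated state lies within the trust region.

    Discards generated (state, action) pairs whose distance to the
    corresponding expert state exceeds epsilon, i.e. labels whose error
    bound is too large to trust.
    """
    dist = (gen_states - expert_states).norm(dim=-1)
    keep = dist <= epsilon
    return gen_states[keep], gen_actions[keep]
```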
Compared to standard behavior cloning, CCIL yields a substantial performance boost in low-data regimes, showcasing its data efficiency and robustness.
CCIL critically assumes that the system dynamics exhibit local continuity. In practice, however, its performance is relatively insensitive to the choice of the Lipschitz constraint hyperparameter used when learning the dynamics model: as long as the generated labels are filtered with an appropriate error threshold, CCIL yields a significant performance boost.
CCIL's corrective labels expand the support of the demonstration data, allowing the policy to recover from significant disturbances that push it outside of the original expert state distribution.
Method | Success Rate | Avg. Score |
---|---|---|
Expert | 100.0% | 1.00 |
BC | 31.9% | 0.58 ± 0.25 |
MOReL | 0.0% | 0.001 ± 0.001 |
MILO | 0.0% | 0.21 ± 0.003 |
NoiseBC | 39.3% | 0.62 ± 0.28 |
CCIL | 56.4% | 0.75 ± 0.25 |
Method | Hover | Circle | FlyThrough |
---|---|---|---|
Expert | -1104 | -10 | -4351 |
BC | -1.08 × 10⁸ | -9.56 × 10⁷ | -1.06 × 10⁸
MOReL | -1.25 × 10⁸ | -1.24 × 10⁸ | -1.25 × 10⁸
MILO | -1.26 × 10⁸ | -1.25 × 10⁸ | -1.25 × 10⁸
NoiseBC | -1.13 × 10⁸ | -9.88 × 10⁷ | -1.07 × 10⁸
CCIL | -0.96 × 10⁸ | -8.03 × 10⁷ | -0.78 × 10⁸
Method | Hopper (Mujoco) | Walker (Mujoco) | Ant (Mujoco) | Halfcheetah (Mujoco) | CoffeePull (Metaworld) | ButtonPress (Metaworld) | CoffeePush (Metaworld) | DrawerClose (Metaworld) |
---|---|---|---|---|---|---|---|---|
Expert | 3234.30 | 4592.30 | 3879.70 | 12135.00 | 4409.95 | 3895.82 | 4488.29 | 4329.34 |
BC | 1983.98 ± 672.66 | 1922.55 ± 1410.09 | 2965.20 ± 202.71 | 8309.31 ± 795.30 | 3552.59 ± 233.41 | 3693.02 ± 104.99 | 1288.19 ± 746.37 | 3247.06 ± 468.73 |
MOReL | 152.19 ± 34.12 | 70.27 ± 3.59 | 1000.77 ± 15.21 | -2.24 ± 0.02 | 18.78 ± 0.09 | 14.85 ± 17.08 | 18.66 ± 0.02 | 1222.23 ± 1241.47 |
MILO | 566.98 ± 100.32 | 526.72 ± 127.99 | 1006.53 ± 160.43 | 151.08 ± 117.06 | 232.49 ± 110.44 | 986.46 ± 105.79 | 230.62 ± 19.37 | 4621.11 ± 39.68 |
NoiseBC | 1563.56 ± 1012.02 | 2893.21 ± 1076.89 | 3776.65 ± 442.13 | 8468.98 ± 738.83 | 3072.86 ± 785.91 | 3663.44 ± 63.10 | 2551.11 ± 857.79 | 4226.71 ± 18.90 |
CCIL | 2631.25 ± 303.86 | 3538.48 ± 573.23 | 3338.35 ± 474.17 | 8757.38 ± 379.12 | 4168.46 ± 192.98 | 3775.22 ± 91.24 | 2484.19 ± 976.03 | 4145.45 ± 76.23
@inproceedings{
ke2024ccil,
title={CCIL: Continuity-Based Data Augmentation for Corrective Imitation Learning},
author={Liyiming Ke and Yunchu Zhang and Abhay Deshpande and Siddhartha Srinivasa and Abhishek Gupta},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=LQ6LQ8f4y8}
}