Vision-and-language navigation (VLN) asks an agent to follow a language instruction to navigate through a real 3D environment. Despite significant advances, conventional VLN agents are typically trained in disturbance-free environments and may easily fail in real-world navigation, because they have not learned to handle disturbances such as sudden obstacles or human interruptions, which are common and often cause unexpected route deviations. In this paper, we present a model-agnostic training paradigm, called Progressive Perturbation-aware Contrastive Learning (PROPER), to enhance the generalization ability of existing VLN agents to the real world by requiring them to learn deviation-robust navigation. Specifically, a simple yet effective path perturbation scheme is introduced to simulate route deviation, under which the agent is still required to navigate successfully by following the original instruction. Since directly forcing the agent to learn perturbed trajectories may lead to insufficient and inefficient training, we design a progressively perturbed trajectory augmentation strategy, in which the agent self-adaptively learns to navigate under perturbation as its navigation performance on each specific trajectory improves. To encourage the agent to capture the difference brought by perturbation and to adapt to both perturbation-free and perturbation-based environments, we further develop a perturbation-aware contrastive learning mechanism that contrasts perturbation-free trajectory encodings with their perturbation-based counterparts. Extensive experiments on the standard Room-to-Room (R2R) benchmark show that PROPER benefits multiple state-of-the-art VLN baselines in perturbation-free scenarios. We further collect perturbed path data to construct an introspection subset based on R2R, called Path-Perturbed R2R (PP-R2R). The results on PP-R2R show the unsatisfactory robustness of popular VLN agents and the capability of PROPER to improve navigation robustness under deviation.
Index Terms—Vision-and-language navigation, navigation robustness, progressive training, contrastive learning
This paper proposes Progressive Perturbation-aware Contrastive Learning (PROPER) for training deviation-robust VLN agents, which introduces a simple yet effective path perturbation scheme into the navigation process. To better utilize the perturbed trajectory data and capture the difference brought by perturbation, a progressively perturbed trajectory augmentation strategy and a perturbation-aware contrastive learning paradigm are developed to improve the agent's robustness. Experimental results on both the public R2R dataset and our constructed introspection subset PP-R2R show the superiority of PROPER over multiple state-of-the-art VLN baselines and its effectiveness in promoting navigation robustness under deviation.
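To make the contrastive component concrete, the following is a minimal sketch of a contrastive objective over trajectory encodings. It is not the PROPER loss itself: it substitutes a generic InfoNCE-style loss and assumes that the perturbation-free encoding serves as the anchor, the encoding of its perturbed counterpart as the positive, and encodings of other trajectories as negatives. The function names `cosine` and `info_nce`, the temperature value, and the toy vectors are all hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero encoding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss (an assumed stand-in, not the paper's exact loss).

    anchor:    encoding of a perturbation-free trajectory (assumption)
    positive:  encoding of its perturbed counterpart (assumption)
    negatives: encodings of other trajectories (assumption)
    """
    pos_logit = cosine(anchor, positive) / temperature
    logits = [pos_logit] + [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    # Negative log-probability of picking the positive among all candidates.
    return log_denominator - pos_logit


# Toy 2-D encodings: the loss is small when the positive aligns with the
# anchor and large when a negative aligns with it instead.
loss_aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
loss_misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

Under these assumptions, minimizing the loss pulls each perturbed encoding toward its perturbation-free counterpart while pushing it away from unrelated trajectories. The opposite pairing, in which perturbed encodings act as negatives, would be implemented by swapping the roles of the arguments.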
In future work, we plan to extend the proposed method to more VLN benchmark datasets, such as Touchdown and REVERIE. Improving the navigation robustness of VLN agents under further real-world challenges, such as sensor errors or visual ambiguity when matching the instruction, is also worth exploring for deploying them in real-world applications.