Open-Vocabulary Object Navigation (OVON) requires an embodied agent to locate a language-specified target in unknown environments. Existing zero-shot methods often reason over dense frontier points under incomplete observations, causing unstable route selection, repeated revisits, and unnecessary action overhead. We present DRIVE-Nav, a structured framework that organizes exploration around persistent directions rather than raw frontiers. By inspecting encountered directions more completely and restricting subsequent decisions to still-relevant directions within a forward 240° view range, DRIVE-Nav reduces redundant revisits and improves path efficiency. The framework extracts and tracks directional candidates from weighted Fast Marching Method (FMM) paths, maintains representative views for semantic inspection, and combines vision–language-guided prompt enrichment with cross-frame verification to improve grounding reliability. Experiments on HM3D-OVON, HM3Dv2, and MP3D demonstrate strong overall performance and consistent efficiency gains. On HM3D-OVON, DRIVE-Nav achieves 50.2% SR and 32.56% SPL, surpassing the previous best method by 1.9% SR and 5.56% SPL. It also delivers the best SPL on HM3Dv2 and MP3D, and real-world deployment on a physical humanoid robot further demonstrates its effectiveness.
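One way to read the forward 240° view range above is as an angular filter: a candidate direction stays relevant only if its bearing lies within ±120° of the agent's current heading. Below is a minimal sketch of such a filter, assuming bearings in radians; the function and parameter names are ours for illustration, not the paper's.

```python
import math

def filter_forward_directions(candidates, heading, fov_deg=240.0):
    """Keep candidate directions inside a forward field of view.

    candidates: iterable of absolute bearings in radians.
    heading:    agent's current heading in radians.
    fov_deg:    total forward view range (240 deg -> +/- 120 deg).
    """
    half_fov = math.radians(fov_deg) / 2.0
    kept = []
    for bearing in candidates:
        # Signed angular difference wrapped to [-pi, pi).
        diff = (bearing - heading + math.pi) % (2.0 * math.pi) - math.pi
        if abs(diff) <= half_fov:
            kept.append(bearing)
    return kept
```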
The DRIVE-Nav framework integrates three core modules: (1) Directional Reasoning: at decision corners, the agent extracts directionally consistent exits from path–circle intersections computed with FMM. (2) Adaptive Inspection: candidate exits are inspected through snapshot maintenance that keeps visual evidence aligned with the evolving frontier geometry. (3) Verification: a Qwen3-VL-guided prompt-enhancement module and an iterative segmentation–verification loop suppress open-vocabulary false positives through multi-round reasoning.
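The sketch below illustrates one plausible reading of the path–circle intersection step, assuming a 2D occupancy grid and the `scikit-fmm` library: trace geodesic paths from the agent toward frontier cells by steepest descent on the FMM distance field, take each path's first crossing of a circle around the agent, and greedily merge nearby crossing bearings into exit directions. The circle radius, merge threshold, and helper names are illustrative assumptions, not the paper's implementation.

```python
import math

import numpy as np
import skfmm  # pip install scikit-fmm

def fmm_distance(free, agent):
    """Geodesic distance from the agent cell over free space (obstacles masked)."""
    phi = np.ones(free.shape)
    phi[agent] = -1.0                          # zero level set around the agent
    masked = np.ma.MaskedArray(phi, mask=~free)
    return skfmm.distance(masked).filled(np.inf)

def trace_path(dist, start, agent, max_steps=10_000):
    """Steepest-descent path on the distance field from a frontier cell to the agent."""
    path, cur = [start], start
    for _ in range(max_steps):
        r, c = cur
        nbrs = [(r + dr, c + dc)
                for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
                and 0 <= r + dr < dist.shape[0]
                and 0 <= c + dc < dist.shape[1]]
        nxt = min(nbrs, key=lambda p: dist[p])
        if dist[nxt] >= dist[cur]:             # reached the minimum (agent cell)
            break
        path.append(nxt)
        cur = nxt
        if cur == agent:
            break
    return path

def exit_directions(free, agent, frontiers, radius=10.0, merge_deg=30.0):
    """Bearings where frontier paths first cross a circle of `radius` cells."""
    dist = fmm_distance(free, agent)
    bearings = []
    for f in frontiers:
        if not np.isfinite(dist[f]):
            continue                           # frontier unreachable
        for p in reversed(trace_path(dist, f, agent)):   # walk agent -> frontier
            if math.hypot(p[0] - agent[0], p[1] - agent[1]) >= radius:
                # Bearing of the crossing point in the grid (row, col) frame.
                bearings.append(math.atan2(p[0] - agent[0], p[1] - agent[1]))
                break
    # Greedily merge bearings closer than merge_deg into one exit direction
    # (circular wrap-around at +/- pi is ignored for brevity).
    merged = []
    for b in sorted(bearings):
        if merged and b - merged[-1][-1] < math.radians(merge_deg):
            merged[-1].append(b)
        else:
            merged.append([b])
    return [float(np.mean(g)) for g in merged]
```

Grouping frontier paths by where they cross the circle, rather than scoring each frontier point, is what yields a small, stable set of directional candidates even as the frontier itself shifts during exploration.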
Comparison with state-of-the-art methods on HM3D-OVON, HM3Dv2, and MP3D. Bold indicates the best result. DRIVE-Nav achieves the highest SR and SPL on HM3D-OVON and MP3D, and the best SPL on HM3Dv2.
| Method | Venue | Training-free | HM3D-OVON SR↑ | HM3D-OVON SPL↑ | HM3Dv2 SR↑ | HM3Dv2 SPL↑ | MP3D SR↑ | MP3D SPL↑ |
|---|---|---|---|---|---|---|---|---|
| SemExp | CVPR'20 | × | – | – | – | – | 36.0 | 14.4 |
| PONI | CVPR'22 | × | – | – | – | – | 31.8 | 12.1 |
| ZSON | NeurIPS'22 | × | – | – | – | – | 15.3 | 4.8 |
| CoW | CVPR'23 | ✓ | – | – | – | – | 7.4 | 3.7 |
| ESC | ICML'23 | ✓ | – | – | – | – | 28.7 | 14.2 |
| ActPept | RAL'24 | ✓ | – | – | – | – | 39.8 | 17.4 |
| UniGoal | CVPR'25 | ✓ | – | – | – | – | 41.0 | 16.4 |
| OpenFMNav | NAACL-F'24 | ✓ | – | – | – | – | 37.2 | 15.7 |
| L3MVN | IROS'23 | ✓ | – | – | 36.3 | 15.7 | – | – |
| InstructNav | CoRL'24 | ✓ | – | – | 58.0 | 20.9 | – | – |
| SG-Nav | NeurIPS'24 | ✓ | – | – | 49.6 | 25.5 | 40.2 | 16.0 |
| VLFM | ICRA'24 | ✓ | – | – | 63.6 | 32.5 | 36.4 | 17.5 |
| ApexNav | RAL'25 | ✓ | – | – | **76.2** | 38.0 | 39.2 | 17.8 |
| DAgger | CVPR'23 | × | 10.2 | 4.7 | – | – | – | – |
| DAgRL | CVPR'23 | × | 18.3 | 7.9 | – | – | – | – |
| RL | ICLR'20 | × | 18.6 | 7.5 | – | – | – | – |
| DAgRL+OD | IROS'24 | × | 37.1 | 19.9 | – | – | – | – |
| Uni-NaVid | RSS'25 | × | 39.5 | 19.8 | – | – | – | – |
| MTU3D | ICCV'25 | × | 40.8 | 12.1 | – | – | – | – |
| VLFM | ICRA'24 | ✓ | 35.2 | 19.6 | – | – | – | – |
| TANGO | CVPR'25 | ✓ | 35.5 | 19.5 | – | – | – | – |
| MSGNav | CVPR'26 | ✓ | 48.3 | 27.0 | – | – | – | – |
| DRIVE-Nav (Ours) | – | ✓ | **50.2** | **32.6** | 72.4 | **41.3** | **41.8** | **22.6** |
The agent efficiently explores unseen indoor environments with directional reasoning and iterative verification, and successfully locates target objects specified by open-vocabulary language descriptions.
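The iterative verification mentioned above can be read as an accept/reject loop: a candidate detection is treated as the target only after the VLM re-confirms it across several distinct frames. The sketch below is schematic; `segment_candidates` and `vlm_confirms` are hypothetical stand-ins for the paper's open-vocabulary segmenter and Qwen3-VL prompting, which we do not reproduce here.

```python
from collections import defaultdict

def verify_target(frames, goal, segment_candidates, vlm_confirms,
                  max_rounds=3, min_confirmations=2):
    """Accept a detection only after repeated cross-frame confirmation.

    frames:             sequence of (frame_id, image) observations.
    segment_candidates: image -> list of (track_id, crop) proposals.
    vlm_confirms:       (crop, goal) -> bool, one VLM verification round.
    """
    votes = defaultdict(int)
    for _, image in frames:
        for track_id, crop in segment_candidates(image):
            # Multi-round reasoning: the same crop must pass every round.
            if all(vlm_confirms(crop, goal) for _ in range(max_rounds)):
                votes[track_id] += 1
            if votes[track_id] >= min_confirmations:
                return track_id            # verified target instance
    return None                            # no candidate survived verification
```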
We deploy DRIVE-Nav on a physical humanoid robot and demonstrate its capability to efficiently navigate to target objects in real-world environments, without any prior knowledge of the surroundings.
@article{drivenav2025,
  author = {Maoguo Gao and Zejun Zhu and Zhiming Sun and Zhengwei Ma and Longze Yuan and Zhongjing Ma and Zhigang Gao and Jinhui Zhang and Suli Zou},
  title  = {DRIVE-Nav: Directional Reasoning, Inspection, and Verification for Efficient Open-Vocabulary Navigation},
  year   = {2025}
}
This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.