Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching

1National University of Singapore, 2University of Macau, 3Hangzhou Dianzi University
*Correspondence

Abstract

Navigating drones through natural language commands remains challenging due to the dearth of accessible multi-modal datasets and the stringent precision requirements for aligning visual and textual data. To address this pressing need, we introduce GeoText-1652, a new natural language-guided geolocalization benchmark. The dataset is systematically constructed through an interactive human-computer process that leverages Large Language Model (LLM) driven annotation techniques in conjunction with pre-trained vision models. GeoText-1652 extends the established University-1652 image dataset with spatial-aware text annotations, thereby establishing one-to-one correspondences between image, text, and bounding box elements. We further introduce a new optimization objective, called blending spatial matching, to leverage fine-grained spatial associations for region-level spatial relation matching. Extensive experiments show that our approach maintains a competitive recall rate compared with other prevailing cross-modality methods, underscoring its promising potential for elevating drone control and navigation through the seamless integration of natural language commands in real-world scenarios.
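To make the one-to-one image-text-bounding-box correspondence concrete, the sketch below shows how a single GeoText-1652-style annotation record might be represented and validated in Python. The field names (image_path, description, bbox), the (x, y, w, h) box convention, and the example path are illustrative assumptions, not the dataset's actual schema; consult the released data for the real format.

    # A minimal sketch of one GeoText-1652-style annotation record.
    # Field names, box convention, and paths are assumptions for illustration.
    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class RegionAnnotation:
        """One spatial-aware sentence paired with the box it describes."""
        description: str                          # e.g. "a gray rooftop near the image center"
        bbox: Tuple[float, float, float, float]   # (x, y, w, h) in pixels (assumed convention)

    @dataclass
    class GeoTextSample:
        """One drone-view image with its region-level text annotations."""
        image_path: str
        regions: List[RegionAnnotation]

        def validate(self, img_w: int, img_h: int) -> None:
            # Enforce the one-to-one text <-> box correspondence: every
            # description carries exactly one box lying inside the image.
            for r in self.regions:
                x, y, w, h = r.bbox
                assert 0 <= x and 0 <= y and x + w <= img_w and y + h <= img_h, \
                    f"box {r.bbox} falls outside the {img_w}x{img_h} image"

    sample = GeoTextSample(
        image_path="university-1652/drone/0001/image-01.jpeg",  # hypothetical path
        regions=[RegionAnnotation("a gray rooftop near the image center",
                                  (210.0, 180.0, 96.0, 64.0))],
    )
    sample.validate(img_w=512, img_h=512)

The exact formulation of blending spatial matching is not reproduced here. As a rough illustration of the general idea of blending a region-level matching term with a global image-text objective, the sketch below combines two symmetric InfoNCE losses with a blending weight alpha. The function names, the InfoNCE choice, and the parameters alpha and tau are all assumptions, not the paper's actual objective.

    # A generic sketch of blending global and region-level matching terms.
    # This is NOT the paper's blending spatial matching objective; it is a
    # plain symmetric InfoNCE over both granularities, blended by alpha.
    import torch
    import torch.nn.functional as F

    def info_nce(a: torch.Tensor, b: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
        """Symmetric InfoNCE over row-aligned embedding batches a, b: (N, D)."""
        a = F.normalize(a, dim=-1)
        b = F.normalize(b, dim=-1)
        logits = a @ b.t() / tau                       # (N, N) similarity matrix
        targets = torch.arange(a.size(0), device=a.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    def blended_spatial_loss(img_emb, txt_emb, region_emb, region_txt_emb, alpha=0.5):
        """Blend a global image-text term with a region-level text-box term."""
        global_term = info_nce(img_emb, txt_emb)            # whole image vs. full description
        region_term = info_nce(region_emb, region_txt_emb)  # box feature vs. region sentence
        return (1.0 - alpha) * global_term + alpha * region_term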

Dataset Property

Annotation Framework

Model Structure

Real-world Generalization
BibTeX

@inproceedings{chu2024towards,
  title={Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching},
  author={Chu, Meng and Zheng, Zhedong and Ji, Wei and Wang, Tingyu and Chua, Tat-Seng},
  booktitle={ECCV},
  year={2024}
}