This is the official repo for the paper "From Object to Context: Scene Knowledge Enhanced Visual Grounding for Geospatial Understanding". The dataset and code are coming soon.
Remote Sensing Visual Grounding (RSVG) is a critical task aimed at precisely localizing objects in remote sensing images from language expressions. Existing methods align visual and textual features through cross-modal fusion but often fail to capture object dependencies, hindering complex visual reasoning about relationships and contexts. To address this, we introduce the Luojia-VG dataset, a benchmark that enhances visual reasoning with scene knowledge, including object-level annotations and contextual descriptions of relationships, functions, and activities. Unlike previous datasets, which focus on basic object descriptions, Luojia-VG bridges the semantic gap between referring expressions and detailed visual content. Furthermore, we propose Knowledge-Enhanced Visual Grounding (KEVG), a novel model that combines scene knowledge with visual features and textual queries. KEVG contains two key components: the Deep Knowledge Fusion (DKF) module and the Query-Region Alignment (QRA) module. The DKF module progressively embeds scene knowledge into multi-scale visual features via cross-attention, enhancing the model's fine-grained understanding of scene contexts. The QRA module aligns image regions with the query by concentrating on the most contextually relevant areas for precise localization. Experiments demonstrate that KEVG achieves state-of-the-art performance, with Pr@0.5 scores of 82.31% on DIOR-RSVG and 83.29% on Luojia-VG.
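Since the code is not yet released, here is a minimal, hypothetical sketch of the cross-attention fusion idea behind the DKF module: visual feature tokens attend to scene-knowledge embeddings, and the attended knowledge is added back residually. All names, shapes, and the single-head NumPy formulation are illustrative assumptions, not the actual KEVG implementation.

```python
import numpy as np

def knowledge_cross_attention(visual, knowledge):
    """Single-head cross-attention sketch (hypothetical, not the paper's code).

    visual:    (N_v, d) visual feature tokens at one scale
    knowledge: (N_k, d) scene-knowledge embeddings
    Returns fused visual tokens of shape (N_v, d).
    """
    d = visual.shape[-1]
    # Scaled dot-product scores: each visual token queries every knowledge token
    scores = visual @ knowledge.T / np.sqrt(d)          # (N_v, N_k)
    scores -= scores.max(axis=-1, keepdims=True)        # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)            # softmax over knowledge tokens
    # Residual fusion: inject attended knowledge into the visual stream
    return visual + attn @ knowledge

# Toy multi-scale usage: apply the same fusion at each feature scale,
# mirroring the "progressively embeds scene knowledge" description.
rng = np.random.default_rng(0)
knowledge = rng.standard_normal((5, 16))                # 5 knowledge tokens
fused_scales = [
    knowledge_cross_attention(rng.standard_normal((n, 16)), knowledge)
    for n in (64, 16, 4)                                # three spatial scales
]
```

In a real trainable module the queries, keys, and values would pass through learned projections and multiple heads; the sketch above keeps only the attention-plus-residual structure that the DKF description implies.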


