CVPR 2026 - GeoAI Papers

#1 Oral Data Leakage Detection and De-duplication in Large Scale Geospatial Image Datasets

Detection, UAV, Geo-Localization

Authors: Yeshwanth Kumar Adimoolam, Charalambos Poullis, Melinos Averkiou

Institutions: CYENS Center of Excellence, Concordia University, University of Cyprus

In our study, we conducted a comprehensive analysis of three widely used datasets in the domain of building footprint extraction using deep neural networks: the INRIA Aerial Image Labelling dataset, SpaceNet 2: Building Detection v2, and the AICrowd Mapping Challenge datasets. Our experiments revealed several issues in the AICrowd Mapping Challenge dataset, where nearly 90% (about 250k) of the training split images had identical copies, indicating a high level of duplicate data. Additionally, we...
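As a rough illustration of what "identical copies" means in practice, exact duplicates can be found by hashing raw file bytes. This is a minimal sketch and not the authors' pipeline, which the truncated abstract does not describe; the function name `find_exact_duplicates` and the in-memory `images` mapping are hypothetical.

```python
import hashlib
from collections import defaultdict


def find_exact_duplicates(images):
    """Group image names whose raw bytes are identical.

    `images` maps a name to its raw bytes (hypothetical input format).
    Returns a list of groups, each holding two or more names.
    """
    groups = defaultdict(list)
    for name, data in images.items():
        # Identical byte content always yields the same SHA-256 digest.
        digest = hashlib.sha256(data).hexdigest()
        groups[digest].append(name)
    # Keep only digests shared by more than one image.
    return [sorted(names) for names in groups.values() if len(names) > 1]


if __name__ == "__main__":
    demo = {"a.jpg": b"\x00\x01", "b.jpg": b"\x00\x01", "c.jpg": b"\x02"}
    print(find_exact_duplicates(demo))
```

Byte-level hashing only catches exact copies; re-encoded, resized, or augmented copies would need a perceptual or feature-based comparison instead.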

#2 Oral Does YOLO Really Need to See Every Training Image in Every Epoch?

Detection

Authors: Xingxing Xie, Jiahua Dong, Junwei Han, Gong Cheng

Institutions: Northwestern Polytechnical University

YOLO detectors are known for their fast inference speed, yet training them remains unexpectedly time-consuming due to their exhaustive pipeline that processes every training image in every epoch, even when many images have already been sufficiently learned. This stands in clear contrast to the efficiency suggested by the "You Only Look Once" philosophy. This naturally raises an important question: Does YOLO really need to see every training image in every epoch? To explore this, we propose an A...

#3 Oral Dual-level Adapter Boosting Prompt-free Curvilinear Structure Segmentation

VFM, Segmentation, Super-Resolution

Authors: Kai Zhu, Li Chen, Jun Cheng

Institutions: Wuhan University of Science and Technology, Institute for Infocomm Research, A*STAR

Curvilinear structure segmentation is essential in domains such as medical imaging, remote sensing, and materials science. Existing methods often require extensive domain-specific training and lack generalization to novel domains. To overcome these limitations, we propose the Segment Anything Curve Model (SACM) — a universal, curvilinear segmentation framework built upon the pretrained Segment Anything Model (SAM). SACM introduces a dual-level adapter architecture that enables both fine-grained ...

#4 Oral GeoViS: Geospatially Rewarded Visual Search for Remote Sensing Visual Grounding

MLLM, Geo-Localization, Image Fusion

Authors: Peirong Zhang, Yidan Zhang, Luxiao Xu, Jinliang Lin, Zonghao Guo, Fengxiang Wang, Xue Yang, Kaiwen Wei, Lei Wang

Institutions: University of the Chinese Academy of Sciences, Tsinghua University, National University of Defense Technology, Shanghai AI Laboratory, Chongqing University, Chinese Academy of Sciences

Recent advances in multimodal large language models (MLLMs) have led to remarkable progress in visual grounding, enabling fine-grained cross-modal alignment between textual queries and image regions. However, transferring such capabilities to remote sensing imagery remains challenging, as targets are often extremely small within kilometer-scale scenes, and queries typically involve intricate geospatial relations such as relative positions, spatial hierarchies, or contextual dependencies across d...

#5 Oral Memory-Augmented Scene Understanding and Exploration for Open-World Aerial Object-Goal Navigation

UAV

Authors: Jiacong Zhou, Jiaxu Miao, Yourun Lin, Xianyun Wang, Jun Xiao, Jun Yu

Institutions: Hangzhou Dianzi University, Zhejiang University, Harbin Institute of Technology

Aerial object-goal navigation (Aerial ObjectNav) requires an Unmanned Aerial Vehicle (UAV) to navigate to target objects in large-scale outdoor environments using only visual observations and high-level object descriptions, without detailed step-by-step instructions. Existing approaches rely on local observations or short-term history, lacking comprehensive scene understanding and efficient spatial exploration strategies, which constrains their navigation capability in complex aerial scenarios....

#6 Oral MetaSpectra+: A Compact Broadband Metasurface Camera for Snapshot Hyperspectral+ Imaging

Hyperspectral

Authors: Yuxuan Liu, Wei Xu, Qi Guo

Institutions: Purdue University

We present MetaSpectra+, a compact multifunctional camera that supports two operating modes: (1) snapshot HDR + hyperspectral or (2) snapshot polarization + hyperspectral imaging. It utilizes a novel metasurface-refractive assembly that splits the incident beam into multiple channels and independently controls each channel’s dispersion, exposure, and polarization. Unlike prior multifunctional metasurface imagers restricted to narrow (10-100 nm) bands, MetaSpectra+ operates over nearly the entir...

#7 Oral OccuFly: A 3D Vision Benchmark for Semantic Scene Completion from the Aerial Perspective

UAV, Generation, 3D Reconstruction

Authors: Markus Gross, Sai Bharadhwaj Matha, Aya Fahmy, Rui Song, Daniel Cremers, Henri Meeß

Institutions: TU Munich, Fraunhofer, University of California, Los Angeles; University of Cambridge, Technical University Munich, SWARM Biotactics GmbH

Semantic Scene Completion (SSC) is crucial for 3D perception in mobile robotics, as it enables holistic scene understanding by jointly estimating dense volumetric occupancy and per-voxel semantics. Although SSC has been widely studied in terrestrial domains such as autonomous driving, aerial scenarios like autonomous flying remain largely unexplored, thereby limiting progress on downstream applications. Furthermore, LiDAR sensors represent the primary modality for SSC data generation, which pose...

#8 Oral Plant Taxonomy Meets Plant Counting: A Fine-Grained, Taxonomic Dataset for Counting Hundreds of Plant Species

MLLM, Generation

Authors: Jinyu Xu, Tianqi Hu, Xiaonan Hu, Letian Zhou, Songliang Cao, Meng Zhang, Hao Lu

Institutions: Huazhong University of Science and Technology

Visually cataloging and quantifying the natural world requires pushing the boundaries of both detailed visual classification and counting at scale. Despite significant progress, particularly in crowd and traffic analysis, fine-grained, taxonomy-aware plant counting remains underexplored in vision. In contrast to crowds, plants are complicated by nonrigid morphologies and physical appearance variations across growth stages and environments. To fill this gap, we present TPC-268, the first plant...

#9 Oral Spectrum from Defocus: Fast Spectral Imaging with Chromatic Focal Stack

Hyperspectral

Authors: M. Kerem Aydin, Yi-Chun Hung, Jaclyn Pytlarz, Qi Guo, Emma Alexander

Institutions: Northwestern University, Dolby Laboratories Inc, Purdue University

Hyperspectral cameras rely on spectral filters, dispersive optics, or coded apertures, which reduce light throughput and increase hardware complexity. These systems face harsh trade-offs between spatial, spectral, and temporal resolution in inherently low-photon conditions. Computational imaging systems break through these trade-offs with compressive sensing, but have typically required complex optics and/or extensive computation. We present Spectrum from Defocus (SfD), a chromatic focal sweep m...

#10 Poster ⊘ Source Models Leak What They Shouldn’t ↛: Unlearning Zero-Shot Transfer in Domain Adaptation Through Adversarial Optimization

VFM, Geo-Localization

Authors: Arnav Devalapally, Poornima Jain, Kartik Srinivas, Vineeth Balasubramanian

Institutions: University of Michigan, Indian Institute of Technology, Hyderabad, Microsoft Research and IIT-Hyderabad

The increasing adaptation of vision models across domains, such as satellite imagery and medical scans, has raised an emerging privacy risk: models may inadvertently retain and leak sensitive source-domain-specific information in the target domain. This creates a compelling use case for machine unlearning to protect the privacy of sensitive source-domain data. Among adaptation techniques, source-free domain adaptation (SFDA) creates an urgent need for machine unlearning (MU), where the source ...

#11 Poster ACPV-Net: All-Class Polygonal Vectorization for Seamless Vector Map Generation from Aerial Imagery

UAV, Generation

Authors: Weiqin Jiao, Hao Cheng, George Vosselman, Claudio Persello

Institutions: University of Twente

We tackle the problem of generating a complete vector map representation from aerial imagery in a single run: producing polygons for all land-cover classes with shared boundaries and no gaps or overlaps. Existing polygonization methods are typically class-specific; extending them to multiple classes via per-class runs commonly leads to topological inconsistencies, such as duplicated edges, gaps, and overlaps. We formalize this new task as All-Class Polygonal Vectorization (ACPV) and release the ...

#12 Poster AeroDGS: Physically Consistent Dynamic Gaussian Splatting for Single-Sequence Aerial 4D Reconstruction

UAV, 3D Reconstruction

Authors: Hanyang Liu, Rongjun Qin

Institutions: The Ohio State University

Recent advances in 4D scene reconstruction have greatly improved dynamic modeling across various domains. However, existing approaches remain limited under aerial conditions with single-view capture, wide spatial range, and dynamic objects of limited spatial footprint and large motion disparity. These challenges cause severe depth ambiguity and unstable motion estimation, making monocular aerial reconstruction inherently ill-posed. To this end, we present AeroDGS, a physics-guided 4D Gaussian spl...

#13 Highlight AeroGS: Scale-Aware Gaussian Splatting for Pose-Free Dynamic UAV Scene Reconstruction

UAV, 3D Reconstruction

Authors: Tingyun Li, Xinyi Liu, Yongjun Zhang, Yi Wan, Xiaoan Liu, Fan Weiwei, Jiahao Liu

Institutions: Wuhan University

Monocular UAV videos pose a fundamental challenge for 3D reconstruction: dynamic scene modeling requires accurate camera poses, yet recovering poses from long UAV trajectories often fails under texture-sparse regions and moving objects. Existing approaches typically handle either pose-free static reconstruction or dynamic reconstruction with known poses, but jointly solving both from casual aerial footage remains difficult due to motion coupling and severe scale variation. We introduce AeroGS,...

#14 Poster AirSim360: A Panoramic Simulation Platform within Drone View

UAV, Generation

Authors: Xian Ge, Yuling Pan, Yuhang Zhang, Xiang Li, Weijun Zhang, Dizhe Zhang, Zhaoliang Wan, Xin Lin, Xiangkai Zhang, Juntao Liang, Xiangtai Li, Jerett Jiang, Bo Du, Ming-Hsuan Yang, Lu Qi

Institutions: Insta360, Shenzhen University, Northwestern University, Insta360 Research, University of California, San Diego, Institute of Automation, Chinese Academy of Sciences, ByteDance Inc., Wuhan University, University of California at Merced

The field of 360-degree omnidirectional understanding has been receiving increasing attention for advancing spatial intelligence. However, the lack of large-scale and diverse data remains a major limitation. In this work, we propose AirSim360, a simulation platform for omnidirectional data from aerial viewpoints, enabling wide-ranging scene sampling with drones. Specifically, AirSim360 focuses on three key aspects: a render-aligned data and labeling paradigm for pixel-level geometric, semantic, ...

#15 Poster APEX: A Decoupled Memory-based Explorer for Asynchronous Aerial Object Goal Navigation

VFM, MLLM, UAV

Authors: Daoxuan Zhang, Ping Chen, Xiaobo Xia, Xiu Su, Ruichen Zhen, Jianqiang Xiao, Shuo Yang

Institutions: Harbin Institute of Technology, ShenZhen, Harbin Institute of Technology, Shenzhen, The University of Sydney, Central South University, Harbin Institute of Technology (Shenzhen)

The Aerial Object Goal Navigation, a challenging frontier in Embodied AI, requires an Unmanned Aerial Vehicle (UAV) agent to autonomously explore, reason, and identify a specific target using only visual perception and language description. However, existing methods struggle with the memorization of complex spatial representations in aerial environments, reliable and interpretable action decision-making, and inefficient exploration and information gathering. To address these challenges, we intro...

#16 Poster Asking like Socrates: Socrates helps VLMs understand remote sensing images

VFM, MLLM, Generation

Authors: Run Shao, Ziyu Li, Zhaoyang Zhang, Linrui Xu, Xinran He, Hongyuan Yuan, Bolei He, Yongxing Dai, Yan Yiming, Chen Yijun, Wang Guo, Haifeng Li

Institutions: Central South University, Baidu Inc., University of Science and Technology of China, Baidu, Zhejiang University, Central South University, China

Recent multimodal reasoning models, inspired by DeepSeek-R1, have significantly advanced vision–language systems. However, in remote sensing (RS) tasks, we observe widespread pseudo reasoning: models narrate the process of reasoning rather than genuinely reason toward the correct answer based on visual evidence. We attribute this to the Glance Effect, where a single, coarse perception of large-scale RS imagery results in incomplete understanding and reasoning based on linguistic self-consistency...

#17 Poster AVION: Aerial Vision–Language Instruction from Offline Teacher to Prompt-Tuned Network

VFM, MLLM, UAV

Authors: Yu Hu, Jianyang Gu, Hao Liu, Yue Cao, Jozsef Hamari, Zheng Liu, Mohsen Zardadi

Institutions: University of British Columbia, Zhejiang University, TerraSense Analytics

Adapting vision-language models to remote sensing imagery remains challenging due to two key factors: limited semantic coverage in textual representations and insufficient adaptability of visual features. These issues are particularly significant in aerial scenes, which involve various visual appearances and fine-grained object distinctions. We propose AVION, a knowledge distillation framework tailored for remote sensing adaptation of vision-language models. The teacher module constructs semanti...

#18 Poster Balanced Hierarchical Contrastive Learning with Decoupled Queries for Fine-grained Object Detection in Remote Sensing Images

Detection, Self-Supervised

Authors: Jingzhou Chen, Dexin Chen, Fengchao Xiong, Yuntao Qian, Liang Xiao

Institutions: Nanjing University of Science and Technology, Zhejiang University

Fine-grained remote sensing datasets often use hierarchical label structures to differentiate objects in a coarse-to-fine manner, with each object annotated across multiple levels. However, embedding this semantic hierarchy into the representation learning space to improve fine-grained detection performance remains challenging. Previous studies have applied supervised contrastive learning at different hierarchical levels to group objects under the same parent class while distinguishing sibling s...

#19 Poster BDNet: Bio-Inspired Dual-Backbone Small Object Detection Network

Detection, UAV, Super-Resolution

Authors: Wenchao Guan, Chuan Lin, Sihan Huang, Xiongzhen Wang, Xintao Pang

Institutions: Guangxi University of Science and Technology, Macao Polytechnic University

In remote sensing images, small objects often suffer from low color contrast and blurred edges, resulting in suboptimal feature extraction performance. Physiological studies indicate that the LGN/V1–V2–V4 pathway offers color opponency sensitivity and hierarchical enhancement advantages for the extraction of color information, while the V1–V4 pathway shows strong orientation selectivity in edge information extraction. The integration of these two types of information in the V4 region significant...

#20 Poster BEV-CAR: Enhancing Monocular Bird’s Eye View Segmentation with Context-Aware Rasterization

Segmentation

Authors: Yixin Xiong, Ke Wang, Tongtong Cheng, Chunhui Liu, Kai Liu

Institutions: Chongqing University, Hong Kong Polytechnic University

Bird’s Eye View (BEV) semantic segmentation is essential for autonomous driving and mobile robotics, yet it still faces significant challenges in accurately segmenting foreground objects and efficiently estimating layout categories obscured by objects. To address these issues, we propose BEV-CAR, a Context-Aware Rasterization method that rasterizes the BEV representation without any coordinate transformations. By optimising each ray and incorporating depth features, BEV-CAR effectively addres...

#21 Poster Beyond Endpoints: Path-Centric Reasoning for Vectorized Off-Road Network Extraction

MLLM

Authors: Wenfei Guan, Jilin Mei, Tong Shen, Xumin Wu, Shuo Wang, Chen Min, Yu Hu

Institutions: Institute of Computing Technology, Chinese Academy of Sciences, University of the Chinese Academy of Sciences, Institute of Computing Technology, CAS, Chinese Academy of Sciences

Deep learning has advanced vectorized road extraction in urban settings, yet off-road environments remain underexplored and challenging. A significant domain gap causes advanced models to fail in wild terrains due to two key issues: lack of large-scale vectorized datasets and structural weakness in prevailing methods. Models such as SAM-Road employ a node-centric paradigm that reasons at sparse endpoints, making them fragile to occlusions and ambiguous junctions in off-road scenes, leading to t...

#22 Poster Beyond Matching to Tiles: Bridging Unaligned Aerial and Satellite Views for Vision-Only UAV Navigation

UAV, Geo-Localization

Authors: Liu Kejia, Haoyang Zhou, Ruoyu Xu, Peicheng Wang, Mingli Song, Haofei Zhang

Institutions: Zhejiang University

Recent advances in cross-view geo-localization (CVGL) methods have shown strong potential for supporting unmanned aerial vehicle (UAV) navigation in GNSS-denied environments. However, existing work predominantly focuses on matching UAV views to onboard map tiles, which introduces an inherent trade-off between accuracy and storage overhead, and overlooks the importance of the UAV’s heading during navigation. Moreover, the substantial discrepancies and varying overlaps in cross-view scenarios have...

#23 Poster Beyond Tie Points: Satellite Image Block Adjustment based on Dense Feature Consistency

VFM, Geo-Localization

Authors: Yi Liu, Yi Wan, Lei Yu, Panwang Xia, Qiong Wu, Yingying Pei, Xuejun Huang, Junjian Zhang, Xiangyuan Cai, Hongwei Hu, Yongjun Zhang

Institutions: Wuhan University, Ant Group, Alibaba Group

Owing to the weak stereo geometry of satellite images, Planar Block Adjustment (PBA) is a predominant technique for correcting geometric distortions in satellite images, which treats elevation as a known constraint and primarily optimizes planar coordinates. Existing PBA methods mainly rely on explicit tie points, suffering from parallax caused by inaccurate elevation (e.g., near high buildings) and irreversible error accumulation, which severely degrades adjustment accuracy. In this paper, a "B...

#24 Poster Beyond What's Shared: Recovering Lost Unique Information from Intermediate Layers to Boost Multimodal Geo-Foundation Models

VFM, Geo-Localization, Image Fusion

Authors: JangHyeon Lee, Philipe Ambrozio Dias, Yao-Yi Chiang, Dalton Lunga

Institutions: University of Minnesota, Oak Ridge National Laboratory, University of Minnesota, Minneapolis

Learning general-purpose representations of geographic locations has become essential to geospatial tasks such as population estimation and environmental monitoring. To obtain such representations, multimodal geo-foundation models often use contrastive learning (CL) to align satellite imagery with geo-coordinates, implicitly assuming that cross-modal (shared) information suffices for downstream tasks. However, not all task-relevant information is shared between modalities, and retaining modality...
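To make the contrastive alignment mentioned above concrete, here is a minimal sketch of a generic one-directional InfoNCE loss over paired image and coordinate embeddings. It is the standard textbook objective, not the paper's method (the abstract is truncated before the method is described); the function name `info_nce`, the argument names, and the temperature default are hypothetical, and embeddings are assumed to be nonzero vectors.

```python
import math


def info_nce(image_embs, coord_embs, temperature=0.1):
    """Mean image-to-coordinate InfoNCE loss for a batch of paired embeddings.

    Row i of `image_embs` and row i of `coord_embs` are assumed to describe
    the same location; all other rows in the batch act as negatives.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    imgs = [normalize(v) for v in image_embs]
    crds = [normalize(v) for v in coord_embs]

    total = 0.0
    for i, img in enumerate(imgs):
        # Cosine similarity to every coordinate embedding, scaled by temperature.
        logits = [sum(a * b for a, b in zip(img, c)) / temperature for c in crds]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the matching pair as the positive class.
        total += log_denom - logits[i]
    return total / len(imgs)


if __name__ == "__main__":
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    print(info_nce(aligned, aligned))
```

A loss of this form only rewards information that both modalities share, which is the assumption the abstract calls into question.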

#25 Poster Breaking Smooth-Motion Assumptions: A UAV Benchmark for Multi-Object Tracking in Complex and Adverse Conditions

UAV, Tracking

Authors: Jingtao Ye, Zhang Kexin, Xunchi Ma, Johann Li, Guangming Zhu, Peiyi Shen, Linhua Jiang, Xiangdong Zhang, Liang Zhang

Institutions: Xi'an University of Electronic Science and Technology, Xidian University

The rapid movements and agile maneuvers of unmanned aerial vehicles (UAVs) induce significant observational challenges for multi-object tracking (MOT). However, existing UAV-perspective MOT benchmarks often lack these complexities, featuring predominantly predictable camera dynamics and linear motion patterns. To address this gap, we introduce DynUAV, a new benchmark for dynamic UAV-perspective MOT, characterized by intense ego-motion and the resulting complex apparent trajectories. The benchmar...

#26 Highlight Brewing Stronger Features: Dual-Teacher Distillation for Multispectral Earth Observation

VFM, Segmentation, Change Detection

Authors: Filip Wolf, Blaz Rolih, Luka Cehovin Zajc

Institutions: University of Ljubljana, University of Ljubljana, Slovenia

Foundation models are transforming Earth Observation (EO), yet the diversity of EO sensors and modalities makes a single universal model unrealistic. Multiple specialized EO foundation models (EOFMs) will likely coexist, making efficient knowledge transfer across modalities essential. Most existing EO pretraining relies on masked image modeling, which emphasizes local reconstruction but provides limited control over global semantic structure. To address this, we propose a dual-teacher contrastiv...

#27 Poster Bridge: Basis-Driven Causal Inference Marries VFMs for Domain Generalization

VFM, Detection, UAV

Authors: Mingbo Hong, Feng Liu, Caroline Gevaert, George Vosselman, Hao Cheng

Institutions: University of Twente, Drexel University, University of Twente; The World Bank

Detectors often suffer from degraded performance, primarily due to the distributional gap between the source and target domains. This issue is especially evident in single-source domains with limited data, as models tend to rely on confounders (e.g., illumination, co-occurrence, and style) from the source domain, leading to spurious correlations that hinder generalization. To this end, this paper proposes a novel Basis-driven framework for domain generalization, namely Bridge, that incorpora...

#28 Poster CF-IPT: Cross-Modal Fusion Interactive Prompt Tuning of Vision-Language Pre-Trained Model for Multisource Remote Sensing Data Classification

VFM, MLLM, Hyperspectral

Authors: Jinheng Ji, Jiahui Qu, Wenqian Dong, Yunsong Li

Institutions: Xidian University

Fine-tuning Vision-Language Models (VLMs) trained on large-scale datasets of natural image-text pairs has demonstrated impressive performance for various downstream tasks. However, their fine-tuning for remote sensing (RS) tasks faces dual barriers: (1) Data-level barrier caused by the fundamental modality gap between natural imagery and RS data, and (2) Task-level barrier stemming from the requirement for multi-source interaction modeling capabilities. This paper proposes a Cross-modal Fusion I...

#29 Poster ChangeBridge: Spatiotemporal Image Generation with Multimodal Controls for Remote Sensing

Change Detection, Super-Resolution, Generation

Authors: Zhenghui Zhao, Chen Wu, Xiangyong Cao, Di Wang, Hongruixuan Chen, Datao Tang, Liangpei Zhang, Zhuo Zheng

Institutions: Wuhan University, Xi'an Jiaotong University, The University of Tokyo, Stanford University

Spatiotemporal image generation is a highly meaningful task, which can generate future scenes conditioned on given observations. However, existing change generation methods can only handle event-driven changes (e.g., new buildings) and fail to model cross-temporal variations (e.g., seasonal shifts). In this work, we propose ChangeBridge, a conditional spatiotemporal image generation model for remote sensing. Given pre-event images and multimodal event controls, ChangeBridge generates post-event ...

#30 Poster ContourVertex: Bridging Semantics and Geometry for Referring Remote Sensing Interpretation

Segmentation

Authors: Jinming Chai, Lingling Li, Licheng Jiao, Xiaoqiang Lu, Long Sun, Xu Liu, Wenping Ma, Weibin Li

Institutions: Xidian University, Xi'an University

The referring expression comprehension and segmentation (RECS) task plays a vital role in remote sensing due to its high efficiency in multi-tasking. However, RECS has reached a performance bottleneck rooted in representational insufficiency, primarily due to cross-task representational fragmentation in multi-task interpretation. In this paper, we propose RECS4R, a unified multi-task framework to upgrade RECS performance. At the representation level, we introduce language-guided unified contour decoding...

#31 Poster CRFT: Consistent–Recurrent Feature Flow Transformer for Cross-Modal Image Registration

MLLM, Image Fusion

Authors: Xuecong Liu, Mengzhu Ding, Zixuan Sun, Zhang Li, Xichao Teng

Institutions: Northeastern University, Northeastern University at Qinhuangdao, National University of Defense Technology

We present Consistent–Recurrent Feature Flow Transformer (CRFT), a unified coarse-to-fine framework that learns feature flow for robust cross-modal registration. CRFT learns a modality-consistent feature flow representation within a transformer-based architecture that jointly performs feature alignment and flow estimation. The coarse stage establishes global correspondences through multi-scale feature correlation, while the fine stage refines local details via hierarchical feature fusion and ada...

#32 Poster Cross-modal Fuzzy Alignment Network for Text-Aerial Person Retrieval and A Large-scale Benchmark

MLLM, UAV, Image Fusion

Authors: Yifei Deng, Chenglong Li, Yuyang Zhang, Guyue Hu, Jin Tang

Institutions: Anhui University, University of Hong Kong

Text-aerial person retrieval aims to identify targets in UAV-captured images from eyewitness descriptions, supporting intelligent transportation and public security applications. Compared to ground-view text–image person retrieval, UAV-captured images often suffer from degraded visual information due to drastic variations in viewing angles and flight altitudes, making semantic alignment with textual descriptions very challenging. To address this issue, we propose a novel Cross-modal Fuzzy Ali...

#33 Poster Cross-View Splatter: Feed-Forward View Synthesis with Georeferenced Images

Geo-Localization, Generation

Authors: Matias Turkulainen, Akshay Krishnan, Filippo Aleotti, Mohamed Sayed, Guillermo Garcia-Hernando, Juho Kannala, Arno Solin, Gabriel Brostow, Daniyar Turmukhambetov

Institutions: Aalto University, Georgia Institute of Technology, Niantic, Inc., University College London, University of London, Niantic Spatial, University of Oulu, Department of Computer Science, University College London, Niantic

We present Cross-View Splatter, a feed-forward method that predicts pixel-aligned Gaussian splats for outdoor scenes captured at ground level and by satellite. Faithful reconstructions require good camera coverage, but ground imagery is time-consuming and hard to capture at scale for large outdoor scenes. Fortunately, satellite imagery can provide a global geometric prior that is easy to access via public APIs. Cross-View Splatter fuses orthorectified satellite views with GPS-tagged ground photo...

#34 Poster CrossEarth-Gate: Fisher-Guided Adaptive Tuning Engine for Efficient Adaptation of Cross-Domain Remote Sensing Semantic Segmentation

VFM, Segmentation

Authors: Shilei Cao, Ziyang Gong, Hehai Lin, Yang Liu, Jiashun Cheng, Xiaoxing Hu, Haoyuan Liang, Guowen Li, Chengwei Qin, Hong Cheng, Xue Yang, Juepeng Zheng, Haohuan Fu

Institutions: Sun Yat-sen University, Shanghai Artificial Intelligence Laboratory; SUN YAT-SEN UNIVERSITY, The Hong Kong University of Science and Technology, The Chinese University of Hong Kong, Hong Kong University of Science and Technology, Beijing Institute of Technology, Sun Yat-Sen University, The Hong Kong University of Science and Technology (Guangzhou), Shanghai AI Laboratory, Tsinghua University, Tsinghua University

In Remote Sensing (RS), Parameter-Efficient Fine-Tuning (PEFT) has emerged as a key approach to activate the generalizable representation ability of foundation models for downstream tasks. However, existing specialized PEFT methods often fail when applied to large-scale Earth observation tasks, as they are unable to fully handle the multifaceted and unpredictable domain gaps (e.g., spatial, semantic, and frequency shifts) inherent in RS data. To overcome this, we propose CrossEarth-Gate, which i...

#35 Poster CrossVL: Complexity-Aware Feature Routing and Paired Curriculum for Cross-View Vision-Language Detection

VFM, Detection, MLLM

Authors: Zhipeng Liu, Chunbo Luo

Institutions: University of Exeter

Vision–language models (VLMs) enable text-guided object detection but degrade severely under cross-view scenarios where ground and aerial viewpoints differ in altitude, scale, and spatial layout. These geometric changes introduce systematic complexity variations between viewpoints, e.g., ground view images contain dense and highly occluded structures, while aerial images are sparse and globally organized. Fixed VLM fusion mechanisms cannot handle this discrepancy. We propose CrossVL, a ...

#36 Poster Dedelayed: Deleting remote inference delay via on-device correction

VFM, Segmentation

Authors: Dan Jacobellis, Mateen Ulhaq, Fabien Racapé, Hyomin Choi, Neeraja Yadwadkar

Institutions: The University of Texas at Austin, InterDigital, University of Texas at Austin

Video comprises the vast majority of bits that are generated daily, and is the primary signal driving current innovations in robotics, remote sensing, and wearable technology. Yet, the most powerful video understanding models are too expensive for the resource-constrained platforms used in these applications. One approach is to offload inference to the cloud; this gives access to GPUs capable of processing high-resolution videos in real time. But even with reliable, high-bandwidth communication cha...

#37 Highlight Dual-branch Distilled Transformer for Efficient Asymmetric UAV Tracking

UAV, Tracking

Authors: Hongtao Yang, Bineng Zhong, Qihua Liang, Yaozong Zheng, Xiantao Hu, Yuanliang Xue, Shuxiang Song

Institutions: Guangxi Normal University, Nanjing University of Science and Technology, Xi’an Research Institute of High Technology

Given the real-time demands of UAV tracking, many methods simplify the backbone to reduce computation, but this often weakens feature representation and degrades performance in complex scenarios. To alleviate this issue, we propose EATrack, an efficient and asymmetric UAV tracking framework centered around a teacher-guided dual-branch distillation strategy that enhances the feature expressiveness of the lightweight student model. Specifically, EATrack investigates two complementary perspectives ...

#38 Poster EMR-Diff: Edge-aware Multimodal Residual Diffusion Model for Hyperspectral Image Super-resolution

Hyperspectral, Super-Resolution, Generation

Authors: Tao Zhang, Shengtao Yao, Rong Zeng, Zunjie Zhu, Bolun Zheng, Yaoqi Sun, Ying Fu, Chenggang Yan

Institutions: Hangzhou Dianzi University, Lishui University, Beijing Institute of Technology, Hangzhou Dianzi University, Tsinghua University

Hardware constraints make it challenging to simultaneously acquire hyperspectral images (HSIs) with both high spatial and high spectral resolutions. A promising solution is to fuse low-resolution HSI (LR-HSI) with high-resolution multispectral images (HR-MSI) to generate high-resolution HSI (HR-HSI). Recently, diffusion models have introduced possibilities for HSI super-resolution, but suffer from low-efficiency sampling, detail-limited generation, and insufficient denoising. To address these is...

#39 Poster Enhancing Unregistered Hyperspectral Image Super-Resolution via Unmixing-based Abundance Fusion Learning

Hyperspectral, Super-Resolution

Authors: Yingkai Zhang, Tao Zhang, Jing Nie, Ying Fu

Institutions: Beijing Institute of Technology, Hangzhou Dianzi University

Unregistered hyperspectral image (HSI) super-resolution (SR) typically aims to enhance a low-resolution HSI using an unregistered high-resolution reference image. In this paper, we propose an unmixing-based fusion framework that decouples spatial-spectral information to simultaneously mitigate the impact of unregistered fusion and enhance the learnability of SR models. Specifically, we first utilize singular value decomposition for initial spectral unmixing, preserving the original endmembers whil...
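The abstract mentions singular value decomposition as a first unmixing step. A minimal sketch of a generic truncated-SVD factorization of a hyperspectral cube into spectral basis vectors and per-pixel coefficients follows. It assumes NumPy and is not the authors' procedure, which the truncated abstract does not spell out; the function name `svd_factorize` and the `rank` parameter are hypothetical.

```python
import numpy as np


def svd_factorize(hsi, rank):
    """Factor an (H, W, B) cube into `rank` spectral vectors and coefficients.

    Returns `basis` with shape (rank, B) and `coeffs` with shape (H, W, rank),
    such that coeffs @ basis approximates the cube in the least-squares sense.
    """
    h, w, b = hsi.shape
    pixels = hsi.reshape(-1, b)  # one spectrum per row: (H*W, B)
    u, s, vt = np.linalg.svd(pixels, full_matrices=False)
    basis = vt[:rank]                 # leading right singular vectors
    coeffs = u[:, :rank] * s[:rank]   # per-pixel weights on each basis vector
    return basis, coeffs.reshape(h, w, rank)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cube = rng.random((4, 5, 6))
    basis, coeffs = svd_factorize(cube, rank=3)
    print(basis.shape, coeffs.shape)
```

Raw singular vectors can take negative values, so they are not physical endmembers or abundances on their own; treat the factorization only as a low-rank starting point.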

#40 Poster Exploring Spatiotemporal Feature Propagation for Video-Level Compressive Spectral Reconstruction: Dataset, Model and Benchmark

Hyperspectral

Authors: Lijing Cai, Zhan Shi, Chenglong Huang, Jinyao Wu, Qiping Li, Zikang Huo, Linsen Chen, Chongde Zi, Xun Cao

Institutions: Nanjing University

Recently, Spectral Compressive Imaging (SCI) has achieved remarkable success, unlocking significant potential for dynamic spectral vision. However, existing reconstruction methods, primarily image-based, suffer from two limitations: (i) the encoding process masks spatial-spectral features, leading to uncertainty in reconstructing missing information from single compressed measurements, and (ii) the frame-by-frame reconstruction paradigm fails to ensure temporal consistency, which is crucial in the v...

#41 Poster F2Net: A Frequency-Fused Network for Ultra-High Resolution Remote Sensing Segmentation

Segmentation

Authors: Hengzhi Chen, Liqian Feng, Wenhua Wu, Xiaogang Zhu, Qiuxia Wu, Lianlei Shan, Kun Hu

Institutions: University of Sydney, Adelaide University, South China University of Technology, Tsinghua University, Edith Cowan University

Semantic segmentation of ultra-high-resolution (UHR) remote sensing imagery is critical for applications like environmental monitoring and urban planning but faces computational and optimization challenges. Conventional methods either lose fine details through downsampling or fragment global context via patch processing. While multi-branch networks address this trade-off, they suffer from computational inefficiency and conflicting gradient dynamics during training. We propose F2Net, a freque...

#42 Poster Fourier Angle Alignment for Oriented Object Detection in Remote Sensing

Detection

Authors: Changyu Gu, Linwei Chen, Lin Gu, Ying Fu

Institutions: Beijing Institute of Technology, Tohoku University

In remote sensing rotated object detection, mainstream methods suffer from two bottlenecks: directional incoherence at the detector neck and task conflict at the detection head. Utilising Fourier rotation equivariance, we introduce **Fourier Angle Alignment**, which analyses angle information through the frequency spectrum and aligns the main direction to a certain orientation. We then propose two plug-and-play modules: **FAAFusion** and **FAA Head**. FAAFusion works at the detector neck, aligning the mai...

#43 Poster FUSAR-GPT: A Spatiotemporal Feature-Embedded and Two-Stage Decoupled Visual Language Model for SAR Imagery

VFM, MLLM, SAR

Authors: Xiaokun Zhang, Yi Yang, Ziqi Ye, Baiyun Baiyun, Xiaorong Guo, Qingchen Fang, Ry Zhang, Xinpeng Zhou, Haipeng Wang

Institutions: Fudan University

Research on the intelligent interpretation of all-weather, all-time Synthetic Aperture Radar (SAR) is crucial for advancing remote sensing applications. In recent years, although Visual Language Models (VLMs) have demonstrated strong open-world understanding capabilities on RGB images, their performance is severely limited when directly applied to the SAR field due to the complexity of the imaging mechanism, sensitivity to scattering features, and the scarcity of high-quality text corpora. To sy...

#44 Highlight Gallant: Voxel Grid-based Humanoid Locomotion and Local-navigation across 3-D Constrained Terrains

UAV

Authors: Qingwei Ben, Botian Xu, Kailin Li, Feiyu Jia, Wentao Zhang, Jingping Wang, Jingbo Wang, Dahua Lin, Jiangmiao Pang

Institutions: The Chinese University of Hong Kong, Tsinghua University, Shanghai Jiaotong University, University of Science and Technology of China, The University of Tokyo, Shanghai AI Laboratory

Robust humanoid locomotion requires accurate and globally consistent perception of the surrounding 3D environment. However, existing perception modules, mainly based on depth images or elevation maps, offer only partial and locally flattened views of the environment, failing to capture the full 3D structure. This paper presents **Gallant**, a voxel-grid–based framework for humanoid locomotion and local navigation in 3D constrained terrains. It leverages voxelized LiDAR data as a lightweight...

#45 Highlight GardenDesigner: Encoding Aesthetic Principles into Jiangnan Garden Construction via a Chain of Agents

Generation

Authors: Mengtian Li, Fan Yang, Ruixue Xiong, Yiyan Fan, Zhifeng Xie, Zeyu Wang

Institutions: Shanghai University, The Hong Kong University of Science and Technology (Guangzhou)

Jiangnan gardens, a prominent style of Chinese classical gardens, hold great potential as digital assets for film and game production and digital tourism. However, manual modeling of Jiangnan gardens heavily relies on expert experience for layout design and asset creation, making the process time-consuming. To address this gap, we propose GardenDesigner, a novel framework that encodes aesthetic principles for Jiangnan garden construction and integrates a chain of agents based on procedural model...

#46 Poster Geo^2: Geometry-Guided Cross-view Geo-Localization and Image Synthesis

VFM, UAV, Geo-Localization

Authors: Yancheng Zhang, Xiaohan Zhang, Guangyu Sun, Zonglin Lyu, Safwan Wshah, Chen Chen

Institutions: University of Central Florida, University of Vermont, New York University

Cross-view geo-spatial learning consists of two important tasks: Cross-View Geo-Localization (CVGL) and Cross-View Image Synthesis (CVIS), both of which rely on establishing geometric correspondences between ground and aerial views. Recent Geometric Foundation Models (GFMs) have demonstrated strong capabilities in extracting generalizable 3D geometric features from images, but their potential in cross-view geo-spatial tasks remains underexplored. In this work, we present Geo^2, a unified framewo...

#47 Poster GeoBridge: A Semantic-Anchored Multi-View Foundation Model Bridging Images and Text for Geo-Localization

VFM, UAV, Geo-Localization

Authors: Zixuan Song, Jing Zhang, Di Wang, Zidie Zhou, Wenbin Liu, Haonan Guo, En Wang, Bo Du

Institutions: Zhongguancun Academy, The University of Sydney, Wuhan University, Jilin University

Cross-view geo-localization infers a location by retrieving geo-tagged reference images that visually correspond to a query image. However, the traditional satellite-centric paradigm limits robustness when high-resolution or up-to-date satellite imagery is unavailable. It further underexploits complementary cues across views (e.g., drone, satellite, and street) and modalities (e.g., language and image). To address these challenges, we propose GeoBridge, a foundation model that performs bidirecti...

#48 Highlight GeoCoT: Towards Reliable Remote Sensing Reasoning with Manifold Perspective

MLLM, Image Fusion

Authors: Daixun Li, Zirui Li, Sibo He, Jiayun Tian, Mingxiang Cao, Weiying Xie, Yunke Wang, Xin Zhang, Yusi Zhang, Yunsong Li, Chang Xu, Leyuan Fang

Institutions: State Key Laboratory of Integrated Services Networks, Xi'an University of Electronic Science and Technology, Xidian University, University of Sydney, Xidian University of Electronic Science and Technology, Hunan University

Multimodal Large Language Models (MLLMs) have shown strong potential in remote sensing (RS) through multi-task reasoning and cross-modal generalization. However, existing RS-MLLMs mainly rely on a single shared expert for all tasks, making it hard to produce reliable results. Meanwhile, the intrinsic redundancy and homogeneity of RS images bring substantial difficulties for both training and inference. These challenges directly conflict with the demands of remote sensing, which values task precis...

#49 Poster GeoDiT: A Diffusion-based Vision-Language Model for Geospatial Understanding

Detection, MLLM, Geo-Localization

Authors: Jiaqi Liu, Ronghao Fu, Haoran Liu, Lang Sun, Qipeng Wang, Bo Yang

Institutions: Jilin University

Autoregressive models are structurally misaligned with the inherently parallel nature of geospatial understanding, forcing a rigid sequential narrative onto scenes and fundamentally hindering the generation of structured and coherent outputs. We challenge this paradigm by reframing geospatial generation as a parallel refinement process, enabling a holistic, coarse-to-fine synthesis that resolves all semantic elements simultaneously. To operationalize this, we introduce GeoDiT, the first diffusio...

#50 Poster GeoFlow: Real-Time Fine-Grained Cross-View Geolocalization via Iterative Flow Prediction

Geo-Localization

Authors: Ayesh Abu Lehyeh, Xiaohan Zhang, Ahmad Arrabi, Waqas Sultani, Chen Chen, Safwan Wshah

Institutions: University of Vermont, University of Central Florida

Accurate and fast localization is vital for safe autonomous navigation in GPS-denied areas. Fine-Grained Cross-View Geolocalization (FG-CVG) aims to estimate the precise 2-Degree-of-Freedom (2-DoF) location of a ground image relative to a satellite image. However, current methods force a difficult trade-off, with high-accuracy models being slow for real-time use. In this paper, we introduce GeoFlow, a new approach that offers a lightweight and highly efficient framework that breaks this accuracy...

#51 Highlight GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing

MLLM, Geo-Localization

Authors: Aoran Xiao, Shihao Cheng, Yonghao Xu, Yexian Ren, Hongruixuan Chen, Naoto Yokoya

Institutions: Harbin Institute of Technology, Wuhan University, Linköping University, Nanjing University of Information Science and Technology, The University of Tokyo

Recent advances in multimodal large language models (MLLMs) have accelerated progress in domain-oriented AI, yet their development in geoscience and remote sensing (RS) remains constrained by distinctive challenges: wide-ranging disciplinary knowledge, heterogeneous sensor modalities, and a fragmented spectrum of tasks. To bridge these gaps, we introduce GeoMMBench, a comprehensive multimodal question-answering benchmark covering diverse RS disciplines, sensors, and tasks, enabling broader and m...

#52 Poster GeoSANE: Learning Geospatial Representations From Models, Not Data

VFM, Segmentation, Geo-Localization

Authors: Joëlle Hanna, Damian Falk, Stella X. Yu, Damian Borth

Institutions: University of St. Gallen, University of Michigan - Ann Arbor

Recent advances in remote sensing have led to an increase in the number of available foundation models, each trained on different modalities, datasets, and objectives, yet capturing only part of the vast geospatial knowledge landscape. While these models show strong results within their respective domains, their capabilities remain complementary rather than unified. Therefore, instead of choosing one model over another, we aim to combine their strengths into a single shared representation. We in...

#53 Poster Global Underwater Geolocation from Time-Lapse Polarization Imagery

Geo-Localization, Generation

Authors: Sara Aghajanzadeh, Xiaoyang Bai, Zhongmin Zhu, David Forsyth, Viktor Gruev

Institutions: University of Illinois Urbana-Champaign, The University of Hong Kong

It is extremely hard for an underwater agent to know where it is. Satellite signals disappear within centimeters of the surface; acoustic baselines require heavy infrastructure to instrument small regions. The polarization of the sky, visible underwater, reveals the elevation of the sun. The pattern of elevation over the day reveals location to an agent with a clock. However, recovering elevation from polarization images is very difficult. SOTA geolocalization methods can localize well for lo...

#54 Poster Helios: Stable Latent Image Modeling for Multimodal Earth Observation

VFM, Self-Supervised

Authors: Henry Herzog, Favyen Bastani, Yawen Zhang, Gabriel Tseng, Joseph Redmon, Hadrien Sablon, Ryan Park, Jacob Morrison, Alexandra Buraczynski, Karen Farley, Josh Hansen, Andrew Howe, Patrick Johnson, Mark Otterlee, Ted Schmitt, Hunter Pitelka, Stephen Daspit, Rachel Ratner, Christopher Wilhelm, Sebastian Wood, Mike Jacobi, Hannah Kerner, Evan Shelhamer, Ali Farhadi, Ranjay Krishna, Patrick Beukema

Institutions: Allen Institute for Artificial Intelligence, McGill University, Stanford University, University of Washington, Arizona State University, UBC / Vector

Earth observation data presents a unique challenge: it is spatial like images, sequential like video or text, and highly multimodal. We present Helios: a multimodal, spatio-temporal foundation model that employs a novel self-supervised learning formulation, masking strategy, and loss all designed for the Earth observation domain. Helios achieves state-of-the-art performance compared to 12 other foundation models across a variety of research benchmarks and real-world tasks from external partners....

#55 Poster HierUQ: Hierarchical Uncertainty Quantification with Adaptive Granularity Reconciliation for Degraded Image Classification

Self-Supervised

Authors: YANG CHU, Xiaomeng Yang, Keli Deng, Yuntao Qian

Institutions: Zhejiang University

Hierarchical classification (HC) on degraded images presents challenges due to feature corruption, unreliable confidence estimation, and fine-grained misclassification. Existing methods often struggle to balance semantic consistency and adaptive decision paths under low-quality visual conditions. To address this, we propose HierUQ, a unified framework that integrates uncertainty quantification with adaptive granularity reconciliation. A Vision Transformer backbone extracts global features, which...

#56 Poster Hilbert Curve-Based Attention Enabling Topology-Preserving Image Tensor Representation for Semantic Segmentation Network

Segmentation, UAV

Authors: Linkang Xu, Gang Li, Yue Song, Xiangxin Ji

Institutions: Tongji University

Drone-based building defect segmentation remains challenging due to complex surface textures and illumination variations. We propose TPSegformer, a topology-preserving segmentation framework that mitigates mis-segmentation in such scenarios. Its decoder incorporates a Hilbert curve–based topology-preserving mechanism to maintain spatial continuity and boundary precision during category layer computation. A lightweight multi-scale fusion module enhances semantic representation, while global conte...
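
The abstract does not say how the Hilbert curve is wired into the TPSegformer decoder, so the following is only a generic sketch of the locality property that makes Hilbert ordering attractive when a 2D grid must be flattened into a 1D sequence. The function name `hilbert_d2xy` and the grid size are illustrative assumptions, not taken from the paper.

```python
def hilbert_d2xy(order, d):
    """Map index d along a Hilbert curve to (x, y) on a 2**order x 2**order grid.

    Standard iterative index-to-coordinate conversion; illustrative only,
    not the paper's implementation.
    """
    n = 1 << order
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            # Rotate/flip the current sub-square so consecutive pieces connect.
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def max_step(points):
    """Largest Manhattan distance between consecutive points of a traversal."""
    return max(abs(ax - bx) + abs(ay - by)
               for (ax, ay), (bx, by) in zip(points, points[1:]))


order = 3                      # 8 x 8 grid (hypothetical size)
side = 1 << order
hilbert_path = [hilbert_d2xy(order, d) for d in range(side * side)]
raster_path = [(x, y) for y in range(side) for x in range(side)]

# Hilbert order never jumps: every consecutive pair of cells is adjacent.
# Raster order jumps across the whole row at each line break.
hilbert_jump = max_step(hilbert_path)
raster_jump = max_step(raster_path)
```

In this sketch the Hilbert traversal visits every cell exactly once with unit steps, whereas row-major flattening places some spatial neighbours far apart in the sequence.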

#57 Poster HySeg: Learning Generative Priors for Structure-Aware Remote Sensing Segmentation

Segmentation, MLLM, Generation

Authors: Jie Qiu, XIN LI, Fan Yang, Yan Wang, Dong Yu, Changying Wang, Linwei Dai, Yongxiang Chen, Youqin Chen, Jianzhang Chen

Institutions: Fujian Agriculture and Forestry University, G42, AIQ, Beijing Jiaotong University, IFLYTEK CO.LTD., Fujian University of Technology

High-resolution remote sensing imagery exhibits complex spatial regularities where topology, continuity, and region adjacency govern semantic organization. However, existing remote sensing image semantic segmentation (RSISS) networks, being predominantly discriminative, estimate strong posteriors from data while lacking generative priors that encode such structural dependencies. This imbalance leads to fragmented boundaries, texture overfitting, and poor cross-domain generalization. We address ...

#58 Poster IMAIA: Interactive Maps AI Assistant for Travel Planning and Geo-Spatial Intelligence

MLLM, Geo-Localization

Authors: Jieren Deng, Zhizhang Hu, Ziyan He, Aleksandar Cvetkovic, Pak Chung, Dragomir Yankov, Chiqun Zhang

Institutions: Capital One, Amazon, Microsoft

Most mapping tools remain point-and-click, making it hard to ask spatial questions or relate what a camera sees to its surrounding geography in a view-aware way. We present **IMAIA** (the *Interactive Maps AI Assistant*), which enables natural-language interaction with both vector (street) maps and satellite imagery, while enriching camera inputs with geospatial intelligence to help users interpret the world around them. IMAIA consists of two complementary modules: **Maps Plus**, which treats t...

#59 Poster Instance-level Visual Active Tracking with Occlusion-Aware Planning

UAV, Super-Resolution, Tracking

Authors: Haowei Sun, Kai Zhou, Hao Gao, Shiteng Zhang, Jinwu Hu, Xutao Wen, Qixiang Ye, Mingkui Tan

Institutions: South China University of Technology, University of Chinese Academy of Sciences

Visual Active Tracking (VAT) aims to control cameras to follow a target in 3D space, which is critical for applications like drone navigation and security surveillance. However, it faces two key bottlenecks in real-world deployment: confusion from visually similar distractors caused by insufficient instance-level discrimination and severe failure under occlusions due to the absence of active planning. To address these, we propose OA-VAT, a unified pipeline with three complementary modules. First...

#60 Poster Is your VLM Sky-Ready? A Comprehensive Spatial Intelligence Benchmark for UAV Navigation

VFM, MLLM, UAV

Authors: Lingfeng Zhang, Yuchen Zhang, Hongsheng Li, Haoxiang Fu, Yingbo Tang, Hangjun Ye, Long Chen, Xiaojun Liang, Xiaoshuai Hao, Wenbo Ding

Institutions: Tsinghua University; Xiaomi Corporation, Georgia Institute of Technology, Shenzhen International Graduate School, Tsinghua University, Institute of Automation, Chinese Academy of Sciences, Xiaomi Corporation, Wayve, Pengcheng Laboratory, Beijing Academy of Artificial Intelligence (BAAI), Tsinghua University

Vision-Language Models (VLMs), leveraging their powerful visual perception and reasoning capabilities, have been widely applied in Unmanned Aerial Vehicle (UAV) tasks. However, the spatial intelligence capabilities of existing VLMs in UAV scenarios remain largely unexplored, raising concerns about their effectiveness in navigating and interpreting dynamic environments. To bridge this gap, we introduce SpatialSky-Bench, a comprehensive benchmark specifically designed to evaluate the spatial intelli...

#61 Poster Joint Spectral Image Reconstruction and Semantic Segmentation with Cooperative Unfolding

Segmentation, Hyperspectral, Self-Supervised

Authors: Zijun He, Ping Wang, Xiaodong Wang, ChangChen ChangChen, Xin Yuan

Institutions: Westlake University, Zhejiang University

Coded Aperture Snapshot Spectral Imaging (CASSI) is an emerging hyperspectral image (HSI) acquisition technique for downstream semantic segmentation. Due to the ill-posed nature of CASSI systems, typical solutions are compelled to conduct a two-stage reconstruction-then-segmentation pipeline, namely viewing them as two separate tasks. However, we observe that these two tasks are interrelated and mutually reinforcing for representation learning, and thus separating them limits the overall accu...

#62 Poster LNEM: Lunar Neural Elevation Model

Geo-Localization, 3D Reconstruction

Authors: SUWAN LEE, Jo Ryeong Yim, Kibaek Park, Dong Kim, Eunhyeuk Kim, Minsup Jeong, Chae Sim, Seokju Lee

Institutions: Korea Institute of Energy Technology (KENTECH), Korea Aerospace Research Institute, Korea Astronomy and Space Science Institute

High-resolution and high-precision digital elevation models (DEMs) of the lunar surface are essential for landing site selection and lunar geological research. However, traditional stereo matching techniques provide a limited representation of the 3D scene, struggling with non-textured regions and extreme lighting variations. Furthermore, recent lunar neural rendering methods are ill-suited for 3D reconstruction due to their reliance on simple pinhole approximations for pushbroom sensors. These challenge...

#63 Poster Local Precise Refinement: A Dual-Gated Mixture-of-Experts for Enhancing Foundation Model Generalization against Spectral Shifts

VFM, Segmentation, Hyperspectral

Authors: Xi Chen, Maojun Zhang, Yu Liu, Shen Yan

Institutions: National University of Defense Technology

Domain Generalization Semantic Segmentation (DGSS) in spectral remote sensing is severely challenged by spectral shifts across diverse acquisition conditions, which cause significant performance degradation for models deployed in unseen domains. While Parameter-Efficient Fine-Tuning (PEFT) on foundation models is a promising direction, existing methods employ global, homogeneous adjustments. This "one-size-fits-all" tuning struggles with the spatial heterogeneity of land cover, causing semantic ...

#64 Poster LoD-Loc v3: Generalized Aerial Localization in Dense Cities using Instance Silhouette Alignment

VFM, Segmentation, UAV

Authors: Shuaibang Peng, Juelin Zhu, Xia Li, Kun Yang, Yu Liu, Maojun Zhang, Shen Yan

Institutions: National University of Defense Technology, Northwest Polytechnical University Xi'an

We present LoD-Loc v3, a novel method for generalized aerial visual localization in dense urban environments. While prior work LoD-Loc v2 [89] achieves localization through semantic building silhouette alignment with low-detail city models, it suffers from two key limitations: poor cross-scene generalization and frequent failure in dense building scenes. Our method addresses these challenges through two key innovations. First, we develop a new synthetic data generation pipeline that produces $\t...

#65 Poster LookasideVLN: Direction-Aware Aerial Vision-and-Language Navigation

MLLM, UAV

Authors: Yuwei Ning, Ganlong Zhao, Yipeng Qin, Si Liu, Yang Liu, Liang Lin, Guanbin Li

Institutions: Sun Yat-sen University, University of Hong Kong, Cardiff University, Beihang University

Aerial Vision-and-Language Navigation (Aerial VLN) enables unmanned aerial vehicles (UAVs) to follow natural language instructions and navigate complex urban environments. While recent advances have achieved progress through large-scale memory graphs and lookahead path planning, they remain limited by shallow instruction understanding and high computational cost. In particular, existing methods rely primarily on landmark descriptions, overlooking directional cues, a key source of spatial context i...

#66 Poster Lumosaic: Hyperspectral Video via Active Illumination and Coded-Exposure Pixels

Hyperspectral

Authors: Dhruv Verma, Andrew Qiu, Roberto Rangel, Ayandev Barman, Hao Yang, Chenjia Hu, Fengqi Zhang, Roman Genov, David B. Lindell, Kiriakos Kutulakos, Alex Mariakakis

Institutions: University of Toronto, Pinterest, Inc., Universidade de São Paulo, Universidade do Estado do Rio de Janeiro

We present Lumosaic, a compact active hyperspectral video system designed for real-time capture of dynamic scenes. Our approach combines a narrowband LED array with a coded-exposure-pixel (CEP) camera capable of high-speed, per-pixel exposure control, enabling joint encoding of scene information across space, time, and wavelength within each video frame. Unlike passive snapshot systems that divide light across multiple spectral channels simultaneously and assume no motion during a frame’s exposu...

#67 Poster MM-OVSeg: Multimodal Optical–SAR Fusion for Open-Vocabulary Segmentation in Remote Sensing

VFM, Segmentation, SAR

Authors: YIMIN WEI, Aoran Xiao, Hongruixuan Chen, Junshi Xia, Naoto Yokoya

Institutions: The University of Tokyo, Harbin Institute of Technology, RIKEN

Open-vocabulary segmentation enables pixel-level recognition from an open set of textual categories, allowing generalization beyond fixed classes. Despite great potential in remote sensing, progress in this area remains largely limited to clear-sky optical data and struggles under cloudy or haze-contaminated conditions. We present MM-OVSeg, a multimodal Optical–SAR fusion framework for resilient open-vocabulary segmentation under adverse weather conditions. MM-OVSeg leverages the complementary s...

#68 Poster MMLandmarks: a Cross-View Instance-Level Benchmark for Geo-Spatial Understanding

UAV, Geo-Localization, Generation

Authors: Oskar Kristoffersen, Alba Reinders, Morten Hannemose, Anders Dahl, Dim Papadopoulos

Institutions: Technical University of Denmark (DTU), DTU Compute

Geo-spatial analysis of our world benefits from a multimodal approach, as every single geographic location can be described in numerous ways (images from various viewpoints, textual descriptions, and geographic coordinates). Current geo-spatial benchmarks have limited coverage across modalities, considerably restricting progress in the field, as current approaches cannot integrate all relevant modalities within a unified framework. We introduce the Multi-Modal Landmark dataset (MMLandmarks), a be...

#69 Poster MOGeo: Beyond One-to-One Cross-View Object Geo-localization

Geo-Localization

Authors: Lv Bo, Qingwang Zhang, Le Wu, Yuanyuan Li, YINGYING ZHU

Institutions: Shenzhen University, Fudan University

Cross-View Object Geo-Localization (CVOGL) aims to locate an object of interest in a query image within a corresponding satellite image. Existing methods typically assume that the query image contains only a single object, which does not align with the complex, multi-object geo-localization requirements in real-world applications, making them unsuitable for practical scenarios. To bridge the gap between the realistic setting and the existing task, we propose a new task, called Cross-View Multi-Object...

#70 Poster MOMO: Mars Orbital MOdel Foundation Model for Mars Orbital Applications

VFM, Segmentation

Authors: Mirali Purohit, Bimal Gajera, Irish Mehta, Bhanu Tokas, Jacob Adler, Steven Lu, Scott Dickenshied, Serina Diniega, Brian Bue, Umaa Rebbapragada, Hannah Kerner

Institutions: Arizona State University (ASU), Jet Propulsion Laboratory

We introduce MOMO, the first multi-sensor foundation model for Mars remote sensing. MOMO uses model merging to integrate representations learned independently from three key Martian sensors (HiRISE, CTX, and THEMIS), spanning resolutions from 0.25 m/pixel to 100 m/pixel. Central to our method is our novel Equal Validation Loss (EVL) strategy, which aligns checkpoints across sensors based on validation loss similarity before fusion via task arithmetic. This ensures models are merged at compatible c...
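
The abstract names task arithmetic as the fusion step but is cut off before giving details, so the following is only a minimal sketch of generic task arithmetic: each fine-tuned model contributes a "task vector" (its weights minus shared base weights), and the scaled task vectors are added back onto the base. The shared base initialization, the `scale` coefficient, the toy parameter dictionaries, and all function names are assumptions for illustration, not the MOMO implementation; the EVL checkpoint selection is not modelled here.

```python
def task_vector(base, finetuned):
    """Per-parameter difference between a fine-tuned model and the shared base."""
    return {name: [f - b for f, b in zip(finetuned[name], base[name])]
            for name in base}


def merge_task_arithmetic(base, finetuned_models, scale=1.0):
    """Add the scaled task vectors of all fine-tuned models onto the base weights."""
    merged = {name: list(values) for name, values in base.items()}
    for model in finetuned_models:
        delta = task_vector(base, model)
        for name in merged:
            merged[name] = [m + scale * d
                            for m, d in zip(merged[name], delta[name])]
    return merged


# Toy weights standing in for three sensor-specific encoders (hypothetical values).
base = {"w": [1.0, 2.0]}
sensor_a = {"w": [1.5, 2.0]}   # changed only the first weight
sensor_b = {"w": [1.0, 1.0]}   # changed only the second weight
sensor_c = {"w": [1.0, 2.0]}   # identical to base, contributes nothing

merged = merge_task_arithmetic(base, [sensor_a, sensor_b, sensor_c], scale=1.0)
```

With these toy values the merged weights pick up each model's change independently, which is the behaviour task arithmetic relies on when the task vectors do not conflict.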

#71 Poster MOS: Mitigating Optical-SAR Modality Gap for Cross-Modal Ship Re-Identification

SAR, Image Fusion, Super-Resolution

Authors: Yujian Zhao, Hankun Liu, Guanglin Niu

Institutions: Beihang University

Cross-modal ship re-identification (ReID) between optical and synthetic aperture radar (SAR) imagery has recently emerged as a critical yet underexplored task in maritime intelligence and surveillance. However, the substantial modality gap between optical and SAR images poses a major challenge for robust identification. To address this issue, we propose MOS, a novel framework designed to Mitigate the Optical–SAR modality gap and achieve modality-consistent feature learning for optical-SAR cross-...

#72 Highlight MSAG: A Multispectral Aerial–Ground Benchmark for Any-Scenario Person Re-Identification

UAV, Hyperspectral

Authors: Yuxuan Zhao, Zhongao Zhou, Bin Yang, He Li, Jian Liang, Jun Chen, Bo Du, Mang Ye

Institutions: Wuhan University

Recent person re-identification (ReID) leverages heterogeneous sensing with multiple modalities and viewpoints to improve robustness across diverse conditions. However, most approaches target predefined scenario pairs (e.g., visible-infrared or aerial-ground) and train separate task-specific models. In contrast, real-world applications require retrieving identities from galleries that cover all scenarios, making such designs inefficient and complex to deploy. To bridge this gap, we introduce Any...

#73 Poster NeighborMAE: Exploiting Spatial Dependencies between Neighboring Earth Observation Images in Masked Autoencoders Pretraining

VFM, Change Detection, Self-Supervised

Authors: Liang Zeng, Valerio Marsocci, Wufan Zhao, Andrea Nascetti, Maarten Vergauwen

Institutions: KU Leuven, European Space Agency Φ-lab, The Hong Kong University of Science and Technology (Guangzhou), KTH Royal Institute of Technology

Masked Image Modeling has been one of the most popular self-supervised learning paradigms to learn representations from large-scale, unlabeled Earth Observation images. While incorporating multi-modal and multi-temporal Earth Observation data into Masked Image Modeling has been widely explored, the spatial dependencies between images captured from neighboring areas remain largely overlooked. Since the Earth's surface is continuous, neighboring images are highly related and offer rich contextual...

#74 Poster Olbedo: An Albedo and Shading Aerial Dataset for Large-Scale Outdoor Environments

VFM, UAV, Generation

Authors: Shuang Song, Debao Huang, Deyan Deng, Haolin Xiong, Yang Tang, Yajie Zhao, Rongjun Qin

Institutions: The Ohio State University, University of Southern California

Intrinsic image decomposition (IID) of outdoor scenes is crucial for relighting, editing, and understanding large-scale environments, but progress has been limited by the lack of real-world datasets with reliable albedo and shading supervision. We introduce *Olbedo*, a large-scale aerial dataset for outdoor albedo-shading decomposition in the wild. *Olbedo* contains 5,664 UAV images captured across four landscape types, multiple years, and diverse illumination conditions. Each vie...

#75 Poster OpenDPR: Open-Vocabulary Change Detection via Vision-Centric Diffusion-Guided Prototype Retrieval for Remote Sensing Imagery

VFM, MLLM, Change Detection

Authors: Qi Guo, Jue Wang, Yinhe Liu, Yanfei Zhong

Institutions: Wuhan University

Open-vocabulary change detection (OVCD) seeks to recognize arbitrary changes of interest by enabling generalization beyond a fixed set of predefined classes. We reformulate OVCD as a two-stage pipeline: first generate class-agnostic change proposals using visual foundation models (VFMs) such as SAM and DINOv2, and then perform category identification with vision-language models (VLMs) such as CLIP. We reveal that category identification errors are the primary bottleneck of OVCD, mainly due to th...

#76 Poster ORSATR-X: A Foundation Model based on Differential-and-Excitation Networks for Optical Remote Sensing Object Recognition

VFM, Segmentation, Detection

Authors: Canyu Mo, Yongxiang Liu, Jiehua Zhang, Zilong Yu, Zhen Liu, Tianpeng Liu, Li Liu

Institutions: National University of Defense Technology

Recent advances in Remote Sensing Foundation Models (RSFMs) have demonstrated considerable potential for Earth Observation (EO) tasks. While adopting natural image foundation models (e.g., DINO) provides a data-efficient strategy for building RSFMs, their strong generalization capability does not fully transfer to complex remote sensing (RS) scenarios due to severe background interference, notably in perceiving challenging targets like low-contrast objects. To this end, we propose ORSATR-X, a no...

#77 Poster Orthogonal Spatial-Aware Multi-View Anchor Graph Clustering for Incomplete Remote Sensing Data

Self-Supervised

Authors: Yongshan Zhang, Xiaohuan Lin, Lefei Zhang, Zhihua Cai

Institutions: China University of Geosciences Wuhan, Wuhan University

Multi-view clustering for remote sensing data has received increasing attention by leveraging diverse data representations to enhance Earth observation. Existing methods are primarily developed under the assumption that each pixel is fully observed across all views. No prior work has investigated the more practical yet challenging scenario where some views suffer from partially missing data. To bridge this gap, this paper presents the first study on clustering incomplete remote sensing data, ter...

#78 Poster Partial Weakly-Supervised Oriented Object Detection

Detection

Authors: Mingxin Liu, Peiyuan Zhang, Yuan Liu, Wei Zhang, Yue Zhou, Ning Liao, Ziyang Gong, Junwei Luo, Zhirui Wang, Yi Yu, Xue Yang

Institutions: Shanghai Jiao Tong University, Wuhan University, Nanyang Technological University, Shanghai Artificial Intelligence Laboratory, Sun Yat-sen University, Aerospace Information Research Institute, Chinese Academy of Sciences, The Ohio State University

The growing demand for oriented object detection (OOD) across various domains has driven significant research in this area. However, the high cost of dataset annotation remains a major concern. Current mainstream OOD algorithms can be mainly categorized into three types: (1) fully supervised methods using complete oriented bounding box (OBB) annotations, (2) semi-supervised methods using partial OBB annotations, and (3) weakly supervised methods using weak annotations such as horizontal boxes or...

#79 Poster PAUL: Uncertainty-Guided Partition and Augmentation for Robust Cross-View Geo-Localization under Noisy Correspondence

UAVGeo-Localization

Authors: Zheng Li, Xueyi Zhang, Yanming Guo, Yuxiang Xie, Ding Zhaoyun, Siqi Cai, Haizhou Li, Mingrui Lao

Institutions: National University of Defense Technology, Harbin Institute of Technology, The Chinese University of Hong Kong (Shenzhen), National University of Singapore

Cross-view geo-localization, which establishes correspondence between drone-captured and satellite imagery, is a critical task for UAV navigation, event detection, and aerial surveying. Most existing approaches embed cross-view data into a joint feature space to maximize similarity between paired images. However, these methods typically assume perfect alignment of image pairs in training data, an assumption that rarely holds in practical scenarios. In real-world conditions, factors such as urban ca...

#80 Highlight PiLoT: Neural Pixel-to-3D Registration for UAV-based Ego and Target Geo-localization

VFMUAVGeo-Localization

Authors: Xiaoya Cheng, Long Wang, Yan Liu, Xinyi Liu, Hanlin Tan, Yu Liu, Maojun Zhang, Shen Yan

Institutions: National University of Defense Technology, Sensetime, Hangzhou Dianzi University

We present PiLoT, a unified framework that tackles UAV-based ego and target geo-localization. Conventional approaches rely on decoupled pipelines that fuse GNSS and Visual-Inertial Odometry (VIO) for ego-pose estimation, and active sensors like laser rangefinders for target localization. However, these methods are susceptible to failure in GNSS-denied environments and incur substantial hardware costs and complexity. PiLoT breaks this paradigm by directly registering a live video stream against a ge...

#81 Highlight PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation

SegmentationMLLMUAV

Authors: Shuyan Ke, Yifan Mei, Changli Wu, Yonghan Zheng, Jiayi Ji, Liujuan Cao, Rongrong Ji

Institutions: Xiamen University

Reasoning segmentation has recently expanded from ground-level scenes to remote-sensing imagery, yet UAV data introduces fundamentally different challenges, including oblique viewpoints, ultra-high resolutions, and extreme scale variations. To address these UAV-specific conditions, we formally define the UAV Reasoning Segmentation task and organize its semantic demands into three dimensions: Spatial, Attribute, and Scene-level reasoning. Based on this formulation, we construct DRSeg, the first l...

#82 Poster ProgTrack: A Multi-Object Tracking Algorithm with Progressive Matching Strategy

UAVSuper-ResolutionTracking

Authors: Chenhui Zhang, Guoqing Dong, Weijie Peng

Institutions: Xidian University, Northwestern Polytechnical University, Xi'an University of Electronic Science and Technology

Multi-object tracking (MOT) based on unmanned aerial vehicles (UAVs) aims to identify and continuously track the positions of multiple ground targets during UAV flight. Current mainstream methods utilize appearance matching and motion matching to match targets in consecutive frames. However, these methods often fail in the following scenarios: first, scenarios with multi-scale targets, where small targets have weak appearance features and small bounding boxes; second, scenarios with complex backgr...

#83 Poster Prompt-Free Universal Region Proposal Network

VFMDetection

Authors: Qihong Tang, Changhan Liu, Shaofeng Zhang, Wenbin Li, Qi Fan, Yang Gao

Institutions: Nanjing University, Nankai University, University of Science and Technology of China, The Hong Kong University of Science and Technology

Identifying potential objects is critical for object recognition and analysis across various computer vision applications. Existing methods typically localize potential objects by relying on exemplar images, predefined categories, or textual descriptions. However, their reliance on image and text prompts often limits flexibility, restricting adaptability in real-world scenarios. In this paper, we introduce a novel Prompt-Free Universal Region Proposal Network, which identifies potential...

#84 Poster Prompt-Free Unknown Label Generation for Open World Detection in Remote Sensing

VFMDetectionGeneration

Authors: Abdullah Azeem, Ruisheng Wang, Qingquan Li, Abubakar Siddique

Institutions: Shenzhen University, University of Calgary

Autonomous object detection in remote sensing requires systems that can discover new categories and assign them usable labels during deployment. Existing Open-World Object Detectors identify unknown objects but leave them unnamed until manual annotation. In contrast, Open-Vocabulary Detectors recognize unseen categories only with provided prompts at test time, lacking autonomous discovery or naming. This work presents HSGDet, a detector that achieves both discovery and semantic assignment at tes...

#85 Poster PRUE: A Practical Recipe for Field Boundary Segmentation at Scale

VFMSegmentationGeo-Localization

Authors: Gedeon Muhawenayo, Caleb Robinson, Subash Khanal, Zhanpei Fang, Isaac Corley, Alexander Wollam, Tianyi Gao, Leonard Strnad, Ryan Avery, Lyndon Estes, Ana Tárano, Nathan Jacobs, Hannah Kerner

Institutions: Arizona State University, Microsoft, Washington University in St. Louis, Oregon State University, Taylor Geospatial, Wherobots, Clark University

Large-scale maps of field boundaries are essential for agricultural monitoring tasks. Existing deep learning approaches for satellite-based field mapping have undesirable properties for large-scale inference, including sensitivity to illumination, spatial scale, and geographic location changes. We conduct the first systematic evaluation of segmentation and geospatial foundation models (GFMs) for global field boundary delineation using the Fields of The World (FTW) benchmark. We evaluate 18 models...

#86 Poster QuCNet: Quantum Deep Learning Driven Multi-Circuit Network for Remote Sensing Image Classification

Geo-Localization

Authors: Komal Komal, Mukul Gupta, Saumya Singh, Santosh Vipparthi, Chakradhar Reddy Chandupatla, Subrahmanyam Murala

Institutions: Indian Institute of Technology Ropar (IIT Ropar), Trinity College Dublin, Ireland

We present QuCNet, a hybrid quantum-classical network for efficient remote sensing image classification. QuCNet integrates a lightweight convolutional encoder with sixteen parallel four-qubit trainable quantum circuits (TQCs) trained under a Hybrid Cyclic Weight-Sharing (HCWS) strategy. This design enhances expressibility while keeping the parameter count extremely low (~87K, 85× smaller than prior hybrid models). Guided by expressibility analysis, the proposed quantum configuration maintains st...

#87 Poster RAMEN: Resolution-Adjustable Multimodal Encoder for Earth Observation

VFMHyperspectral

Authors: Nicolas Houdré, Diego Marcos, Hugo Turckheim, Dino Ienco, Laurent Wendling, Camille Kurtz, Sylvain Lobry

Institutions: Université Paris Cité (LIPADE), INRIA, INRAE (National Research Institute for Agriculture, Food and Environment)

Earth observation (EO) data spans a wide range of spatial, spectral, and temporal resolutions, from high-resolution optical imagery to low-resolution multispectral products or radar time series. While recent foundation models have improved multimodal integration for learning meaningful representations, they often expect fixed input resolutions or are based on sensor-specific encoders, limiting generalization across heterogeneous EO modalities. To overcome these limitations, we introduce RAMEN, a r...

#88 Poster RDF-MIG: A Robust Diffusion Framework for Masked Image Generation to Augment Semantic Segmentation and Change Detection

SegmentationChange DetectionHyperspectral

Authors: Zian Cao, Wei Wei, Qingshan Gao, Yuanyuan Fu

Institutions: Huazhong University of Science and Technology, Pingan Technology

Change detection and semantic segmentation are key techniques for satellite image analysis in remote sensing. However, acquiring high-quality labeled data is costly and time-consuming. Although recent studies have explored generative models to ease data scarcity, a unified framework supporting both tasks is still lacking, and most methods overlook noise accumulation and cannot generate multispectral images. To address this, we propose the robust diffusion framework for masked image generation (R...

#89 Poster Real-Time Long Horizon Air Quality Forecasting via Group-Relative Policy Optimization

VFM

Authors: Inha Kang, Eunki Kim, Wonjeong Ryu, Jaeyo Shin, Seungjun Yu, Yoon-Hee Kang, Seongeun Jeong, Eunhye Kim, Soontae Kim, Hyunjung Shim

Institutions: Korea Advanced Institute of Science & Technology (KAIST), Ajou University, Kunsan National University

Accurate long-horizon forecasting of particulate matter (PM) concentration fields is essential for operational public health decisions. However, achieving reliable forecasts remains challenging in regions with complex terrain and strong atmospheric dynamics such as East Asia. While foundation models such as Aurora offer global generality, they often miss region-specific dynamics and rely on non–real-time inputs, limiting their practical utility for localized warning systems. To address this gap,...

#90 Highlight ReAttnCLIP: Training-Free Open-Vocabulary Remote Sensing Image Segmentation via Re-defined Attention in CLIP

VFMSegmentationMLLM

Authors: Xin Niu, Manqi Zhao, Dongsheng Jiang, Yingying Wu, Bing Su

Institutions: Renmin University of China, Huawei Technologies Ltd., Beijing China-Power Information Technology Co., Ltd.

Remote sensing image segmentation is critical for a range of applications, including natural disaster monitoring and precision agriculture. Open-vocabulary segmentation enhances flexibility by removing fixed category constraints, enabling more fine-grained and adaptive scene understanding. Unlike CLIP’s original pretraining objective, which emphasizes global image-text alignment, segmentation tasks require accurate and discriminative patch-level representations to support precise pixel-wise pred...

#91 Poster Regulating Rather than Constraining: Adaptive Guidance for Complex Spectral Reconstruction in Pansharpening

Hyperspectral

Authors: Zhuwei Wen, Zimin Xia, He Chen, Linwei Yue, Xianwei Zheng

Institutions: State Key Laboratory LIESMARS, Wuhan University, EPFL, China University of Geosciences Wuhan

In remote sensing pansharpening, spectrally mixed regions, where the spectral interactions among adjacent land covers lead to highly inconsistent reconstruction patterns, remain the most challenging areas. Due to the complex spatial distribution and heterogeneous spectral characteristics of ground objects, existing methods relying on rigid architectures and physical constraints struggle to learn generalized reconstruction patterns from limited spectral mixing samples, resulting in unstable gener...

#92 Poster Remote Sensing Image Super-Resolution for Imbalanced Textures: A Texture-Aware Diffusion Framework

VFMSuper-ResolutionGeneration

Authors: Enzhuo Zhang, Sijie Zhao, Dilxat Muhtar, Zhenshi Li, Xueliang Zhang, Pengfeng Xiao

Institutions: Nanjing University

Generative diffusion priors have recently achieved state-of-the-art performance in Natural Image Super-Resolution, demonstrating a powerful capability to synthesize photorealistic details. However, their direct application to Remote Sensing Image Super-Resolution (RSISR) reveals significant shortcomings. Remote sensing images present a unique challenge: ground objects often exhibit globally stochastic yet locally clustered patterns. This characteristic leads to highly imbalanced texture distribu...

#93 Poster ReSAM: Refine, Requery, and Reinforce: Self-Prompting Point-Supervised Segmentation for Remote Sensing Images

VFMSegmentation

Authors: Muhammad Naseer Subhani

Institutions: Independent Researcher

Interactive segmentation models such as the Segment Anything Model (SAM) have demonstrated remarkable generalization on natural images, but perform suboptimally on remote sensing imagery (RSI) due to severe domain shift and the scarcity of dense annotations. To address this, we propose a self-prompting point-supervised framework that adapts SAM to RSIs using only sparse point annotation. Our method employs a Refine–Requery–Reinforce loop, where coarse pseudo-masks are generated from initial poin...

#94 Poster Rethinking Occlusion Modeling for UAV Tracking

UAVTracking

Authors: Jian Zhang, Xincheng Yu, Yi Lin

Institutions: Sichuan University

Occlusion remains one of the major challenges in UAV tracking, where dynamic viewpoints and complex environments often cause partial or complete visibility loss. Existing transformer-based trackers typically regard occlusion as random information dropout, overlooking its structured and spatially correlated nature in real-world scenes. We rethink occlusion modeling in UAV tracking as a structured process governed by spatial dependencies. Based on this insight, we introduce Clustered Occlusion Modeli...

#95 Highlight Revisiting the Necessity of Full Accuracy: Weakly Supervised Object-Level Offset Correction for Misaligned Building Labels

Segmentation

Authors: Junda Xu, Yanmeng Liu, Xiangqiang Zeng, Jinrong Wu, Ying Qu, Libao Zhang

Institutions: Beijing Normal University

Google Earth imagery, combined with building footprint databases, offers an efficient way to construct localized building datasets. However, the lack of orthorectification in these images leads to spatial misalignments between annotations and their corresponding roof locations. Adopting such misaligned data directly for model training can severely degrade segmentation performance. To address the challenge, we propose an Object-based Multi-stage Alignment Framework (OMAF) that generates high-qual...

#96 Poster RHO: Robust Holistic OSM-Based Metric Cross-View Geo-Localization

Geo-Localization

Authors: Junwei Zheng, Ruize Dai, Ruiping Liu, Zichao Zeng, Yufan Chen, Fangjinhua Wang, Kunyu Peng, Kailun Yang, Jiaming Zhang, Rainer Stiefelhagen

Institutions: Karlsruhe Institute of Technology (KIT), University College London, ETH Zurich, Hunan University

Metric Cross-View Geo-Localization (MCVGL) aims to estimate the 3-DoF camera pose (position and heading) by matching ground and satellite images. In this work, instead of pinhole and satellite images, we study robust MCVGL using holistic panoramas and OpenStreetMap (OSM). To this end, we establish a large-scale MCVGL benchmark dataset, CV-RHO, with over 2.7M images under different weather and lighting conditions, as well as sensor noise. Furthermore, we propose a model termed RHO with a two-bran...

#97 Poster RoadGIE: Towards A Global-Scale Aerial Benchmark for Generalizable Interactive Road Extraction

VFMSegmentationUAV

Authors: Chenxu Peng, Chenxu Wang, Yimian Dai, Yongxiang Liu, Ming-Ming Cheng, Xiang Li

Institutions: Nankai University, Zhejiang University of Technology, National University of Defense Technology, Tsinghua University

Accurate road segmentation from aerial imagery is fundamental to many geospatial applications. However, existing datasets often suffer from limited scene diversity, low semantic granularity, and poor structural continuity, restricting their generalization across environments. To address these challenges, we introduce WorldRoadSeg-360K, the largest and most diverse road segmentation dataset to date, comprising 366,947 high-resolution images collected from 38 countries and 223 cities across vari...

#98 Poster Robust Remote Sensing Image–Text Retrieval with Noisy Correspondence

MLLM

Authors: Qiya Song, Yiqiang Xie, Yuan Sun, Renwei Dian, Xudong Kang

Institutions: Hunan Normal University, Sichuan University, Hunan University

As a pivotal task that bridges remote visual and linguistic understanding, Remote Sensing Image-Text Retrieval (RSITR) has attracted considerable research interest in recent years. However, almost all RSITR methods implicitly assume that image-text pairs are matched perfectly. In practice, acquiring a large set of well-aligned data pairs is often prohibitively expensive or even infeasible. Although several studies have acknowledged the presence of noisy pairs, little work has explored how to end...

#99 Poster Rotation Invariant and Symmetry Aware Pixel Difference Network for Remote Sensing Object Detection

DetectionUAVSuper-Resolution

Authors: Jialei Zhan, Li Liu, Jiehua Zhang, Yuhang Xie, Yongxiang Liu, Jiangming Chen, Ming-Ming Cheng

Institutions: National University of Defense Technology, Xi'an University of Electronic Science and Technology, Nankai University, Tsinghua University

Recent advancements in remote sensing object detection have predominantly focused on oriented bounding box design and small object feature enhancement, while often overlooking the intrinsic geometric properties of remote sensing images, such as rotation invariance and structural symmetry. Many aerial objects appear in multiple orientations and exhibit clear symmetrical patterns, which, if not explicitly modeled, can lead to detection failures and inaccurate localization under geometric variation...

#100 Poster SARMAE: Masked Autoencoder for SAR Representation Learning

VFMSegmentationSAR

Authors: Danxu Liu, Di Wang, Hebaixu Wang, Haoyang Chen, Wentao Jiang, Yilin Cheng, Haonan Guo, Wei Cui, Jing Zhang

Institutions: Beijing Institute of Technology, Wuhan University, Zhongguancun Academy, Beijing, China, Fudan University, The University of Sydney

Synthetic Aperture Radar (SAR) imagery plays a critical role in all-weather, day-and-night remote sensing applications. However, existing SAR-oriented deep learning is constrained by data scarcity, while the physically grounded speckle noise in SAR imagery further hampers fine-grained semantic representation learning. To address these challenges, we propose SARMAE, a Noise-Aware Masked Autoencoder for self-supervised SAR representation learning. Specifically, we construct SAR-1M, the first milli...

#101 Poster SCE-Depth: A Spherical Compound Eye Framework for Wide FOV Depth Estimation

UAV3D Reconstruction

Authors: Yi Zhu, Hao Xiong, Lin Xiao, Ranfeng Shi, Qinying Gu, Leilei Gu

Institutions: Shanghai Jiao Tong University, Shanghai Artificial Intelligence Laboratory

Accurate wide-field depth estimation is highly desirable in applications such as autonomous driving, robot vision, and drone control. Biological compound eyes inspire wide Field of View (FOV) depth estimation, yet their artificial implementations face the challenge of modality misalignment. Specifically, the spherical imaging data does not align with the planar neural network, diminishing the learning efficiency. Herein, we propose SCE-Depth, a bio-inspired framework for spherical compound eye depth ...

#102 Poster Scene Grounding in the Wild

Geo-Localization3D Reconstruction

Authors: Tamir Cohen, Leo Segre, Shay Shomer-Chai, Shai Avidan, Hadar Averbuch-Elor

Institutions: Tel Aviv University, Cornell University

Reconstructing accurate 3D models of large-scale real-world scenes from unstructured, in-the-wild imagery remains a core challenge in computer vision, especially when the input views have little or no overlap. In such cases, existing reconstruction pipelines often produce multiple disconnected partial reconstructions or erroneously merge non-overlapping regions into overlapping geometry. In this work, we propose a framework that grounds each partial reconstruction to a complete reference model of...

#103 Highlight See What We Cannot See: A Geo-guided Reasoning Benchmark for Object Counting under Adverse Earth Observation Conditions

MLLMGeo-Localization

Authors: Jiayi Wang, Zhihong Tan, Hongchen Wei, Daiqin Yang, Zhenzhong Chen

Institutions: Wuhan University

Object counting in remote sensing imagery becomes challenging when visual cues are obscured by clouds, fog, shadows, or low-light conditions. Yet earth observation inherently provides complementary geo-modalities, including land use and map, which offer stable structural and contextual priors that remain available when appearance cues fail. In this paper, we introduce GROC, the first large-scale dataset for Geo-guided Reasoning in Object Counting under ...

#104 Poster Seeing Through Blur: Tackling Defocus in Spike-Based Imaging

Change DetectionSuper-Resolution

Authors: Xiantao Ma, Siwei Dong, Lin Zhu, Lizhi Wang, Hua Huang

Institutions: Beijing Institute of Technology, Peking University, Beijing Normal University

Spike cameras are a novel class of neuromorphic vision sensors that capture scene dynamics with ultra-high temporal resolution via spike planes. While recent methods have addressed motion blur and noise in spike-based reconstruction, defocus blur caused by shallow depth of field or lens adjustment delays remains a critical yet underexplored issue in real-world applications such as autonomous driving. In this work, we present DeSpike, the first end-to-end defocus removal framework specifically de...

#105 Highlight SegEarth-R2: Towards Comprehensive Language-guided Segmentation for Remote Sensing Images

SegmentationMLLMGeo-Localization

Authors: Zepeng Xin, Kaiyu Li, Luodi Chen, Wanchen Li, Xiao Yuchen, Hui Qiao, Weizhan Zhang, Deyu Meng, Xiangyong Cao

Institutions: Xi'an Jiaotong University, China Telecom

Effectively grounding complex language to pixels in remote sensing (RS) images is a critical challenge for applications like disaster response and environmental monitoring. Current models can parse simple, single-target commands but fail when presented with complex geospatial scenarios, e.g., segmenting objects at various granularities, executing multi-target instructions, and interpreting implicit user intent. To drive progress against these failures, we present LaSeRS, the first large-scale da...

#106 Poster Semantic-Adaptive Diffusion for Dynamic Spatiotemporal Fusion

Geo-LocalizationImage Fusion

Authors: Jinsong Zhang, Ying Qu, Yuan Liao, Hairong Qi, Zhenzhou Shao

Institutions: Capital Normal University, Beijing Normal University, University of Tennessee, Knoxville

Frequent and precise land surface monitoring is critical for numerous applications, yet existing satellites struggle to achieve both simultaneously. Spatiotemporal fusion (STF) tackles this challenge by integrating multiple satellite images to generate data with improved temporal and spatial resolution, enabling more frequent and precise land surface observations. However, current methods often fail to recover dynamic landscape changes due to significant scale discrepancies among multi-source im...

#107 Poster SGDE: Self-supervised Geometry Degradation Estimation Framework for Coded Aperture Compressive Spectral Imaging

HyperspectralSelf-Supervised

Authors: Yuqiao He, Xiaoyan Liu, Jianxu Mao, Yaonan Wang, Hui Zhang, Lizhu Liu, Yurong Chen, Wenbin He

Institutions: Hunan University

Coded Aperture Snapshot Spectral Imaging (CASSI) has emerged as a prominent technique for efficient hyperspectral imaging. However, the strong coupling between physical encoding and computational decoding makes CASSI highly sensitive to minor hardware misalignments, which can significantly degrade reconstruction quality. Existing methods either assume ideal imaging conditions, or rely on offline calibration, making them vulnerable to dynamic perturbations, such as thermal expansion and mechanica...

#108 Highlight SinGeo: Unlock Single Model's Potential for Robust Cross-View Geo-Localization

Geo-Localization

Authors: Chen Yang, Xieyuanli Chen, Junxiang Li, Jie Tang, Tao Wu

Institutions: National University of Defense Technology, Changsha, China

Robust cross-view geo-localization (CVGL) remains challenging despite the surge in recent progress. Existing methods still rely on field-of-view (FoV)-specific training paradigms, where models are optimized under a fixed FoV but collapse when tested on unseen FoVs and unknown orientations. This limitation necessitates deploying multiple models to cover diverse variations. Although studies have explored dynamic FoV training by simply randomizing FoVs, they failed to achieve robustness across dive...

#109 Poster Sky2Ground: A Benchmark for Site Modeling under Varying Altitude

VFMUAVGeo-Localization

Authors: Zengyan Wang, Sirshapan Mitra, Rajat Modi, Hui Xian Grace Lim, Yogesh Rawat

Institutions: University of Central Florida, Self

In this work, we propose the problem of localizing cameras and producing renders of a scene, given multiple images captured from ground/aerial/satellite viewpoints. We introduce a dataset called Sky2Ground, which contains synthetic/real images across all three viewpoints, along with camera parameters, and dense depth-maps/surface-normals. Recent works have shown that transformer-based nets like VGGT are capable of inferring scene-parameters in a single forward pass. However, we formally reveal that ...

#110 Poster SkySense-VITA: Towards Universal In-context Segmentation of Multi-modal Remote Sensing Imagery

VFMSegmentationSAR

Authors: Kang Wu, Lei Yu, Junwei Luo, Bo Dang, Junjian Zhang, Xiangyuan Cai, Hongwei Hu, Jingdong Chen, Yansheng Li

Institutions: Wuhan University, Ant Group, Alibaba Group

While recent foundation models for remote sensing (RS) segmentation have shown notable progress, they still face significant challenges, struggling to process diverse multi-modal inputs, synergize complementary prompt types, and leverage semantic hierarchies. To address these limitations, we introduce SkySense-VITA, a unified in-context segmentation model, which synergistically processes both optical and SAR imagery using visual, textual, or fused prompts. Based on a novel prompt-and-prediction ...

#111 Poster Sparsely Timing the Change: A Spiking Temporal Framework for Remote Sensing Interpretation

VFMChange DetectionSuper-Resolution

Authors: Shilong Li, Xiurui Xie, Qiugang Zhan, Luochao Wang, Yong Deng, Guisong Liu

Institutions: University of Electronic Science and Technology of China, Southwestern University of Finance and Economics

The temporal evolution patterns of surface spatial structures constitute a central concern within the field of intelligent remote sensing interpretation. However, constrained by the availability of only two temporal phases, modeling sparse spatio-temporal change processes to effectively interpret surface alterations remains a core challenge in intelligent remote sensing analysis. To address this, this paper proposes SpikeAdapter, a lightweight enhancement framework. This framework comprises Geo-S...

#112 Poster Spatial-Spectral Residuals Informed Diffusion Neural Operator for Pan-sharpening

HyperspectralSuper-ResolutionGeneration

Authors: Jiahan Huang, Ran Ran, Junming Hou, Zihao Chen, Xiaofeng Cong, Junling Li, Liang-Jian Deng

Institutions: Southeast University, University of Electronic Science and Technology of China

Pan-sharpening, a fundamental image preprocessing technique in remote sensing, aims to generate spatially and spectrally enriched multispectral imagery by integrating complementary information from texture-rich panchromatic (PAN) images and paired low-resolution multispectral (LRMS) counterparts. Although recent generative diffusion models have achieved impressive fusion quality, these performance gains often come with substantial computational costs, rendering them impractical for resource-cons...

#113 Poster Spectral Super-Resolution via Adversarial Unfolding and Data-Driven Spectrum Regularization: From Multispectral Satellite Data to NASA Hyperspectral Image

HyperspectralGeo-LocalizationSuper-Resolution

Authors: Si-Sheng Yang, Chia-Hsiang Lin

Institutions: Institute of Computer and Communication Engineering, National Cheng Kung University

The European Space Agency's Sentinel-2 satellite provides global multispectral coverage for remote sensing (RS) applications. However, its limited spectral resolution (12 bands) and non-unified spatial resolution (60/20/10 m) restrict its practicality. In contrast, high spectral-spatial resolution sensors (e.g., NASA's AVIRIS-NG) cover only the American region due to practical considerations. This raises a fundamental question: "Can a global hyperspectral coverage be achieved by reconstructin...

#114 Poster Spectrally Distilled Representations Aligned with Instruction-Augmented LLMs for Satellite Imagery

VFMMLLMGeo-Localization

Authors: Minh Kha Do, Wei Xiang, Kang Han, Di Wu, Khoa T. Phan, Yi-Ping Phoebe Chen, Gaowen Liu, Ramana Kompella

Institutions: La Trobe University, Cisco Systems

Vision-language foundation models (VLFMs) promise zero-shot and retrieval understanding for Earth observation. While operational satellite systems often lack full multi-spectral coverage, making RGB-only inference highly desirable for scalable deployment, the adoption of VLFMs for satellite imagery remains hindered by two factors: (1) multi-spectral inputs are informative but difficult to exploit consistently due to band redundancy and misalignment; and (2) CLIP-style text encoders limit semanti...

#115 Poster Task-Oriented Data Synthesis and Control-Rectify Sampling for Remote Sensing Semantic Segmentation

VFMSegmentationGeneration

Authors: Yunkai Yang, Yudong Zhang, Kunquan Zhang, Jinxiao Zhang, Xinying Chen, Haohuan Fu, Runmin Dong

Institutions: Sun Yat-sen University, Tencent, Tsinghua University, Beijing Institute of Technology

With the rapid progress of controllable generation, training data synthesis has become a promising way to expand labeled datasets and alleviate manual annotation in remote sensing (RS). However, the complexity of semantic mask control and the uncertainty of sampling quality often limit the utility of synthetic data in downstream semantic segmentation tasks. To address these challenges, we propose a task-oriented data synthesis framework (TODSynth), including a Multimodal Diffusion Transformer (M...

#116 Poster TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation

VFM, MLLM, Change Detection

Authors: Yan Shu, Bin Ren, Zhitong Xiong, Xiao Xiang Zhu, Begüm Demir, Nicu Sebe, Paolo Rota

Institutions: Beijing Academy of Artificial Intelligence, Mohamed bin Zayed University of Artificial Intelligence, Technical University of Munich, Technische Universität Berlin, University of Trento

Vision-language models (VLMs) have shown promise in earth observation (EO), yet they struggle with tasks that require grounding complex spatial reasoning in precise pixel-level visual representations. To address this problem, we introduce TerraScope, a unified VLM that delivers pixel-grounded geospatial reasoning with two key capabilities: (1) modality-flexible reasoning: it handles single-modality inputs (optical or SAR) and adaptively fuses different modalities into the reasoning process when ...

#117 Poster TESSERA: Temporal Embeddings of Surface Spectra for Earth Representation and Analysis

VFM, Segmentation, Geo-Localization

Authors: Zhengpeng Feng, Clement Atzberger, Sadiq Jaffer, Jovana Knezevic, Silja Sormunen, Robin Young, Madeline Lisaius, Markus Immitzer, Toby Jackson, James Ball, David Coomes, Anil Madhavapeddy, Andrew Blake, Srinivasan Keshav

Institutions: University of Cambridge, dClimate, Aalto University, Cyclops MRV Inc; Universität für Bodenkultur Wien, University of Bristol, Clare Hall (University of Cambridge)

Satellite Earth-observation (EO) time series in the optical and microwave ranges are often irregular due to orbital patterns and cloud obstruction, and while compositing addresses these issues, it loses critical phenological information. To overcome this, we present TESSERA, a pixel-wise foundation model for multi-modal (Sentinel-1/2) EO time series that learns robust, label-efficient embeddings. During training, TESSERA uses Barlow Twins and sparse random temporal sampling to enforce invariance...

#118 Highlight Test-Time Multi-Prompt Adaptation for Open-Vocabulary Remote Sensing Image Segmentation

VFM, Segmentation, MLLM

Authors: Ting Yang, Qilong Wang, Qibin Hou, Qinghua Hu

Institutions: Tianjin University, Nankai University

The rise of vision-language models (VLMs) has driven the initial exploration of open-vocabulary remote sensing image semantic segmentation (OVRSIS), enabling recognition of unseen categories in complex Earth observation scenes. However, existing methods primarily focus on enhancing visual representations of domain-specific remote sensing images, while overlooking the effect of textual information. In this paper, we argue that there exists a crucial issue of textual ambiguity in the OVRSIS task, limi...

#119 Poster Toward Low-Cost yet Effective Temporal Learning for UAV Tracking

UAV, Tracking

Authors: Chaocan Xue, Qihua Liang, Bineng Zhong, Yanting Zu, Yuanliang Xue, Haiying Xia, Shuxiang Song

Institutions: Guangxi Normal University, Xi’an Research Institute of High Technology

The utilization of temporal information has always been an open topic in the tracking community. However, existing trackers tend to employ more and more inputs or parameters for temporal learning, hindering their deployment in resource-constrained unmanned aerial vehicles (UAVs). More importantly, this makes it unclear whether the performance gains come from the temporal learning itself or from the increased inputs and parameters. In this study, we advocate designing temporal learning comp...

#120 Poster Towards Persistence: Learning Topological Constraints for Event-based Small Object Detection

Segmentation, Detection, UAV

Authors: Shiman He, Nuo Chen, Xinyi Ying, Yihang Luo, Yangsi Shi, Zaiping Lin, Miao Li

Institutions: College of Electronic Science and Technology, National University of Defense Technology

Small object detection (SOD) plays a vital role in applications such as anti-UAV tasks, yet conventional image-based methods struggle in high-speed scenarios due to the limited frame rate. Event cameras offer a promising alternative by capturing spatiotemporal event streams with microsecond-level temporal resolution. To address the inherent sparsity of small objects in event data, existing methods typically formulate the detection task as semantic segmentation on spatiotemporal point clouds to ...

#121 Poster Tri-Modal Fusion Transformers for UAV-based Object Detection

Detection, UAV

Authors: Craig Iaboni, Pramod Abichandani

Institutions: New Jersey Institute of Technology

Reliable UAV object detection requires robustness to illumination changes, motion blur, and scene dynamics that suppress RGB cues. Thermal long-wave infrared (LWIR) sensing preserves contrast in low light, and event cameras retain microsecond-level temporal edges, but integrating all three modalities in a unified detector has not been systematically studied. We present a tri-modal framework that processes RGB, thermal, and event data with a dual-stream hierarchical vision transformer. At selecte...

#122 Poster TriSim: Tri-Dimensional Similarity Modeling with Extreme Value Theory for False-Negative Mitigation in Remote Sensing Image-Text Retrieval

MLLM, Image Fusion, Self-Supervised

Authors: Chengyu Zheng, Hanzhang Lu, Jie Nie, Shan Du

Institutions: University of British Columbia, Ocean University of China

In remote sensing (RS) cross-modal retrieval, most existing methods employ contrastive learning as their primary optimization objective, aligning anchors with positive counterparts and distinguishing them from negative samples. To improve negative sampling, these approaches typically set thresholds on cross-modal similarity scores, designating negatives that exceed the threshold as false negative samples (FNS). However, dependence on a single cross-modal similarity threshold is fragile because i...

#123 Poster UAST: Unified Active Search and Tracking for Arbitrary Targets with UAVs

UAV, Tracking

Authors: Liang Qin, Min Wang, Xingyu Lu, Aowen Qiu, Wengang Zhou, Houqiang Li

Institutions: University of Science and Technology of China, Institute of Artificial Intelligence, Hefei Comprehensive National Science Center

Active search and tracking of arbitrary targets by Unmanned Aerial Vehicles (UAVs) in cluttered environments remains a highly challenging problem. Existing methods either construct complex modular pipelines, leading to substantial computational costs, or adopt end-to-end controllers that often fail to generalize across different targets and scenes. Moreover, search and tracking are typically treated separately despite their strong interdependence. In this paper, we present UAST, a simple yet effe...

#124 Poster UAV-CB: A Complex-Background RGB–T Dataset and Local Frequency Bridge Network for UAV Detection

UAV, Image Fusion

Authors: Shenghui Huang, Menghao Hu, Longkun Zou, Hongyu Chi, Zekai Li, Feng Gao, Fan Yang, Qingyao Wu, Ke Chen

Institutions: Peng Cheng Laboratory, Xinjiang University, Harbin Institute of Technology, South China University of Technology, Peking University

Detecting Unmanned Aerial Vehicles (UAVs) in low-altitude environments is essential for perception and defense systems but remains highly challenging due to complex backgrounds, camouflage, and multimodal interference. In real-world scenarios, UAVs are frequently visually blended with surrounding structures such as buildings, vegetation, and power lines, resulting in low contrast, weak boundaries, and strong confusion with cluttered background textures. Existing UAV detection datasets, though di...

#125 Poster UAVLight: A Benchmark for Illumination-Robust 3D Reconstruction in Unmanned Aerial Vehicle (UAV) Scenes

UAV, 3D Reconstruction

Authors: Kang Du, Xue Liao, Junpeng Xia, Chaozheng Guo, Yi Gu, Yirui Guan, Duotun Wang, ShengHuang ShengHuang, Zeyu Wang

Institutions: The Hong Kong University of Science and Technology (Guangzhou), The Hong Kong University of Science and Technology, Meituan, Tencent

Illumination inconsistency is a fundamental challenge in multi-view 3D reconstruction. Variations in sunlight direction, cloud cover, and shadows break the constant-lighting assumption underlying both classical multi-view stereo (MVS) and structure from motion (SfM) pipelines and recent neural rendering methods, leading to geometry drift, color inconsistency, and shadow imprinting. This issue is especially critical in UAV-based reconstruction, where long flight durations and outdoor environments...

#126 Poster UniChange: Unifying Change Detection with Multimodal Large Language Model

VFM, MLLM, Change Detection

Authors: Xu Zhang, Danyang Li, Xiaohang Dong, Tianhao Wu, Hualong Yu, Jianye Wang, Qicheng Li, Xiang Li

Institutions: Nankai University, Sichuan Agricultural University

Change detection (CD) is a fundamental task for monitoring and analysing land cover dynamics. While recent high performance models and high quality datasets have significantly advanced the field, a critical limitation persists. Current models typically acquire limited knowledge from single-type annotated data and cannot concurrently leverage diverse binary change detection (BCD) and semantic change detection (SCD) datasets. This constraint leads to poor generalisation and limited versatility. Th...

#127 Poster UniGeoRS: A Unified Benchmark for Tri-view Geo-Localization

UAV, Geo-Localization, Super-Resolution

Authors: Xiao Liang, Huaizhi Tang, Feiyang Zhang, Shiji Yuan, Chun Hu, Dezhi Zheng, Kang Ma

Institutions: Beijing Institute of Technology, Beihang University

Cross-view geo-localization (CVGL) aims to estimate an image’s geographic location by matching it with geo-referenced images from different viewpoints, supporting applications such as autonomous driving, UAV navigation, and visual surveillance. However, due to the high cost of image collection, current CVGL datasets often suffer from limited diversity in both drone and ground imagery, which constrains model generalization. Furthermore, existing methods primarily focus on either ground-to-satelli...

#128 Poster UniGeoSeg: Towards Unified Open-World Segmentation for Geospatial Scenes

VFM, Segmentation, MLLM

Authors: Shuo Ni, Di Wang, He Chen, Haonan Guo, Ning Zhang, Jing Zhang

Institutions: Beijing Institute of Technology, Zhongguancun Academy, Wuhan University, Hong Kong Polytechnic University, The University of Sydney

Instruction-driven segmentation in remote sensing generates masks from guidance, offering great potential for accessible and generalizable applications. However, existing methods suffer from fragmented task formulations and limited instruction data, hindering effective understanding and generalization. To address these issues, we introduce GeoSeg-1M, the first million-scale dataset for remote sensing instruction-driven segmentation, constructed via an automatic mask filtering and instruction gen...

#129 Highlight Urban-GS: A Unified 3D Gaussian Splatting Framework for Compact and High-Fidelity Aerial-to-Street Reconstruction

Generation, 3D Reconstruction

Authors: Meng Wang, Changqun Xia, Yuze Wang, Junyi Wang, Wantong Duan, Xinxiong Xie, Yue Qi

Institutions: Beihang University, Peng Cheng Laboratory, Shandong University

Recently, 3D Gaussian Splatting (3DGS) has revolutionized radiance field reconstruction, enabling efficient and high-fidelity novel view synthesis. However, seamless integration of both aerial and street view images to model urban scenes remains a significant challenge for 3DGS. This joint setting suffers from extreme view coverage disparity, complex multi-scale details, and imbalanced viewpoint distributions. In this work, we present Urban-GS, a novel framework built upon Gaussian Splatting for...

#130 Poster V2U4Real: A Real-world Large-scale Dataset for Vehicle-to-UAV Cooperative Perception

Detection, UAV, Geo-Localization

Authors: Weijia Li, Haoen Xiang, Tianxu Wang, Shuaibing Wu, Qiming Xia, Cheng Wang, Chenglu Wen

Institutions: Xiamen University, Lanzhou University of Technology, Nanchang University

Modern autonomous vehicle perception systems are often constrained by occlusions, blind spots, and limited sensing range, hindering progress toward Level 5 autonomy. While existing cooperative perception paradigms such as Vehicle-to-Vehicle (V2V) and Vehicle-to-Infrastructure (V2I) have demonstrated their effectiveness in mitigating these challenges, they remain limited to ground-level collaboration and cannot fully address large-scale occlusions or long-range perception in complex 3D environmen...

#131 Poster VGA: Empowering Aerial-Ground Localization by Visual Geometry Alignment

UAV

Authors: Tao Jun Lin, Yujiao Shi, Hongdong Li

Institutions: Australian National University, ShanghaiTech University

Aerial-ground visual localization is a challenging task due to the significant differences in scene scale and viewpoint between the two views. In this work, we explore the practical benefit of jointly learning camera calibration and bird’s-eye-view (BEV) projection for estimating the full 6-degrees-of-freedom relative camera pose between uncalibrated aerial and ground views. We present Visual Geometry Alignment (VGA), a unified framework that jointly learns a global gravity-alignment prior inf...

#132 Poster VideoFusion: A Spatio-Temporal Collaborative Network for Multi-modal Video Fusion

Change Detection, Image Fusion, Super-Resolution

Authors: Linfeng Tang, Yeda Wang, Meiqi Gong, Zizhuo Li, Yuxin Deng, Xunpeng Yi, Chunyu Li, Han Xu, Hao Zhang, Jiayi Ma

Institutions: Wuhan University, Southeast University

Compared to images, videos better align with real-world acquisition scenarios and possess valuable temporal cues. However, existing multi-sensor fusion research predominantly integrates complementary context from multiple images rather than videos. This primarily stems from two factors: 1) the scarcity of large-scale multi-sensor video datasets, limiting research in video fusion, and 2) the inherent difficulty of jointly modeling spatial and temporal dependencies in a unified framework. This pap...

#133 Poster View-Aware Semantic Alignment for Aerial-Ground Person Re-Identification

MLLM, UAV, Geo-Localization

Authors: Quan Zhang, Zeqiang Cai, Peiming Zhao, Jingze Wu, Cailun Wu, Hongbo Chen, Jianhuang Lai

Institutions: Sun Yat-sen University

Aerial-Ground Person Re-Identification (AGPReID) remains highly challenging due to drastic viewpoint variations between drones and fixed cameras. Existing methods typically follow a view-invariant paradigm, aligning shared features across views to achieve robustness. However, the view-invariant paradigm inherently enforces part-level alignment, which ignores view-specific cues and discriminative identity information. To this end, this work proposes ViSA (View-aware Semantic Alignment), a view-aware framework...

#134 Highlight VIRD: View-Invariant Representation through Dual-Axis Transformation for Cross-View Pose Estimation

Geo-Localization

Authors: Juhye Park, Wooju Lee, Dasol Hong, Changki Sung, Youngwoo Seo, DongWan Kang, Hyun Myung

Institutions: Korea Advanced Institute of Science & Technology (KAIST), Electronics and Telecommunications Research Institute, Hanwha Aerospace

Accurate global localization is crucial for autonomous driving and robotics, especially in dense urban environments where GNSS is often unreliable due to occlusion and multipath effects. As an emerging alternative, cross-view pose estimation predicts the 3-DoF camera pose corresponding to a ground-view image with respect to a geo-referenced satellite image. However, existing methods struggle to bridge the significant viewpoint gap between the ground and satellite views mainly due to limited spat...

#135 Poster Visual Prototype Conditioned Focal Region Generation for UAV-Based Object Detection

Detection, UAV, Generation

Authors: Wenhao Li, Zimeng Wu, Yu Wu, Zehua Fu, Jiaxin Chen

Institutions: Beihang University, Hangzhou Innovation Institute of Beihang University

Unmanned aerial vehicle (UAV) based object detection is a critical but challenging task when applied in dynamically changing scenarios with limited annotated training data. Layout-to-image generation approaches have proved effective in improving detection accuracy by synthesizing labeled images based on diffusion models. However, they frequently produce artifacts, especially near layout boundaries of tiny objects, thus substantially limiting their performance. To address these iss...

#136 Highlight VLM4RSDet: Collaborative Optimization with Vision-Language Model for Enhancing Remote Sensing Object Detection

VFM, Detection, MLLM

Authors: Shuohao Shi, Qiang Fang, Xin Xu

Institutions: National University of Defense Technology

Closed-set object detection in remote sensing imagery has made significant progress, but achieving high detection accuracy remains challenging. Vision-Language Models (VLMs), which possess rich prior knowledge, offer a promising solution to this challenge. However, most existing VLMs are designed for open-vocabulary tasks and exhibit inherent limitations when directly applied to closed-set scenarios, such as notable accuracy degradation and high deployment costs. To address these issues, we prop...

#137 Poster WRIVINDER: Towards Spatial Intelligence for Geo-locating Ground Images onto Satellite Imagery

VFM, Geo-Localization, 3D Reconstruction

Authors: Chandrakanth Gudavalli, Tajuddin Manhar Mohammed, Abhay Yadav, Ananth Bhaskar, Hardik Prajapati, Cheng Peng, Rama Chellappa, Shivkumar Chandrasekaran, B.S. Manjunath

Institutions: University of California, Santa Barbara; Mayachitra, Inc.; Johns Hopkins University; School of Data Science, University of Virginia; Mathematical Institute for Data Science (MINDS) at JHU

Aligning ground-level imagery with geo-registered satellite maps is crucial for mapping, navigation, and situational awareness, yet remains challenging under large viewpoint gaps or when GPS is unreliable. We introduce Wrivinder, a zero-shot, geometry-driven framework that aggregates multiple ground photographs to reconstruct a consistent 3D scene and align it with overhead satellite imagery. Wrivinder combines SfM reconstruction, 3D Gaussian Splatting, semantic grounding, and monocular depth–ba...

#138 Poster YieldSAT: A Multimodal Benchmark Dataset for High-Resolution Crop Yield Prediction

Hyperspectral, Geo-Localization, Image Fusion

Authors: Miro Miranda, Deepak Pathak, Patrick Helber, Benjamin Bischke, Hiba Najjar, Francisco Mena, Cristhian Sanchez, Akshay Pai, Diego Arenas, Matias Valdenegro, Marcela Charfuelan, Marlon Nuske, Andreas Dengel

Institutions: German Research Center for Artificial Intelligence (DFKI), Vision Impulse GmbH, Universität Kaiserslautern, GFZ Helmholtz Centre for Geosciences, Rheinland-Pfälzische Technische Universität (RPTU), University of Groningen

Crop yield prediction requires substantial data to train data-driven models. However, creating yield prediction datasets is constrained by high acquisition costs, heterogeneous data quality, and data privacy regulations. Consequently, existing datasets are scarce, low in quality, or limited to regional levels or single crop types. In this work, we release YieldSAT, a large, high-quality, and multimodal dataset for high-resolution crop yield prediction. YieldSAT spans various climate zones across...

#139 Highlight ZoomEarth: Active Perception for Ultra-High-Resolution Geospatial Vision-Language Tasks

VFM, Segmentation, MLLM

Authors: Ruixun Liu, Bowen Fu, Jiayi Song, Kaiyu Li, Wanchen Li, Lanxuan Xue, Hui Qiao, Weizhan Zhang, Deyu Meng, Xiangyong Cao

Institutions: Xi'an Jiaotong University, China Telecom

Ultra-high-resolution (UHR) remote sensing (RS) images offer rich fine-grained information but also present challenges in effective processing. Existing dynamic resolution and token pruning methods are constrained by a passive perception paradigm, suffering from increased redundancy when obtaining finer visual inputs. In this work, we explore a new active perception paradigm that enables models to revisit information-rich regions. First, we present LRS-GRO, a large-scale benchmark dataset tailor...