CVPR 2026 - GeoAI Papers

139 papers (Oral: 9, Highlight: 20, Poster: 110)
Detection, UAV, Geo-Localization
Authors: Yeshwanth Kumar Adimoolam, Charalambos Poullis, Melinos Averkiou
Institutions: CYENS Center of Excellence, Concordia University, University of Cyprus
In our study, we conducted a comprehensive analysis of three widely used datasets in the domain of building footprint extraction using deep neural networks: the INRIA Aerial Image Labelling dataset, SpaceNet 2: Building Detection v2, and the AICrowd Mapping Challenge datasets. Our experiments revealed several issues in the AICrowd Mapping Challenge dataset, where nearly 90% (about 250k) of the training split images had identical copies, indicating a high level of duplicate data. Additionally, we...
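The abstract above reports that a large share of training images had identical copies, but the visible text does not say how the duplicates were found. As a minimal sketch, assuming "identical" means byte-for-byte equal files, exact duplicates can be grouped by a content hash. The function names and the toy file names below are hypothetical and are not taken from the paper.

```python
import hashlib
from collections import defaultdict


def find_exact_duplicates(images):
    """Group image names whose raw bytes are identical.

    `images` maps a (hypothetical) file name to the file's raw bytes.
    Returns a list of groups, each group a sorted list of names.
    """
    groups = defaultdict(list)
    for name, data in images.items():
        # Identical bytes always produce the same SHA-256 digest.
        groups[hashlib.sha256(data).hexdigest()].append(name)
    # Keep only digests shared by more than one file.
    return [sorted(names) for names in groups.values() if len(names) > 1]


def duplicate_fraction(images):
    """Fraction of images that have at least one identical copy."""
    if not images:
        return 0.0
    duplicated = sum(len(group) for group in find_exact_duplicates(images))
    return duplicated / len(images)


# Toy data standing in for image files (assumption: real use would read bytes from disk).
toy = {"a.jpg": b"\x00\x01", "b.jpg": b"\x00\x01", "c.jpg": b"\x02"}
print(find_exact_duplicates(toy))
print(duplicate_fraction(toy))
```

Byte-level hashing only catches exact copies. Re-encoded or resized duplicates would need a perceptual hash or feature comparison, which this sketch does not attempt.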
Detection
Authors: Xingxing Xie, Jiahua Dong, Junwei Han, Gong Cheng
Institutions: Northwestern Polytechnical University, Northwest Polytechnical University Xi'an
YOLO detectors are known for their fast inference speed, yet training them remains unexpectedly time-consuming due to their exhaustive pipeline that processes every training image in every epoch, even when many images have already been sufficiently learned. This stands in clear contrast to the efficiency suggested by the "You Only Look Once" philosophy. This naturally raises an important question: Does YOLO really need to see every training image in every epoch? To explore this, we propose an A...
VFM, Segmentation, Super-Resolution
Authors: Kai Zhu, Li Chen, Jun Cheng
Institutions: Wuhan University of Science and Technology, Institute for Infocomm Research, A*STAR
Curvilinear structure segmentation is essential in domains such as medical imaging, remote sensing, and materials science. Existing methods often require extensive domain-specific training and lack generalization to novel domains. To overcome these limitations, we propose the Segment Anything Curve Model (SACM) — a universal, curvilinear segmentation framework built upon the pretrained Segment Anything Model (SAM). SACM introduces a dual-level adapter architecture that enables both fine-grained ...
MLLM, Geo-Localization, Image Fusion
Authors: Peirong Zhang, Yidan Zhang, Luxiao Xu, Jinliang Lin, Zonghao Guo, Fengxiang Wang, Xue Yang, Kaiwen Wei, Lei Wang
Institutions: University of the Chinese Academy of Sciences, Tsinghua University, National University of Defense Technology, Shanghai AI Laboratory, Chongqing University, Chinese Academy of Sciences
Recent advances in multimodal large language models (MLLMs) have led to remarkable progress in visual grounding, enabling fine-grained cross-modal alignment between textual queries and image regions. However, transferring such capabilities to remote sensing imagery remains challenging, as targets are often extremely small within kilometer-scale scenes, and queries typically involve intricate geospatial relations such as relative positions, spatial hierarchies, or contextual dependencies across d...
UAV
Authors: Jiacong Zhou, Jiaxu Miao, Yourun Lin, Xianyun Wang, Jun Xiao, Jun Yu
Institutions: Hangzhou Dianzi University, Zhejiang University, Harbin Institute of Technology
Aerial object-goal navigation (Aerial ObjectNav) requires an Unmanned Aerial Vehicle (UAV) to navigate to target objects in large-scale outdoor environments using only visual observations and high-level object descriptions, without detailed step-by-step instructions. Existing approaches rely on local observations or short-term history, lacking comprehensive scene understanding and efficient spatial exploration strategies, which constrains their navigation capability in complex aerial scenarios....
Hyperspectral
Authors: Yuxuan Liu, Wei Xu, Qi Guo
Institutions: Purdue University
We present MetaSpectra+, a compact multifunctional camera that supports two operating modes: (1) snapshot HDR + hyperspectral or (2) snapshot polarization + hyperspectral imaging. It utilizes a novel metasurface-refractive assembly that splits the incident beam into multiple channels and independently controls each channel’s dispersion, exposure, and polarization. Unlike prior multifunctional metasurface imagers restricted to narrow (10–100 nm) bands, MetaSpectra+ operates over nearly the entir...
UAV, Generation, 3D Reconstruction
Authors: Markus Gross, Sai Bharadhwaj Matha, Aya Fahmy, Rui Song, Daniel Cremers, Henri Meeß
Institutions: TU Munich, Fraunhofer, University of California, Los Angeles; University of Cambridge, Technical University Munich, SWARM Biotactics GmbH
Semantic Scene Completion (SSC) is crucial for 3D perception in mobile robotics, as it enables holistic scene understanding by jointly estimating dense volumetric occupancy and per-voxel semantics. Although SSC has been widely studied in terrestrial domains such as autonomous driving, aerial scenarios like autonomous flying remain largely unexplored, thereby limiting progress on downstream applications. Furthermore, LiDAR sensors represent the primary modality for SSC data generation, which pose...
MLLM, Generation
Authors: Jinyu Xu, Tianqi Hu, Xiaonan Hu, Letian Zhou, Songliang Cao, Meng Zhang, Hao Lu
Institutions: Huazhong University of Science and Technology
Visually cataloging and quantifying the natural world requires pushing the boundaries of both detailed visual classification and counting at scale. Despite significant progress, particularly in crowd and traffic analysis, fine-grained, taxonomy-aware plant counting remains underexplored in vision. In contrast to crowds, plants are complicated by nonrigid morphologies and physical appearance variations across growth stages and environments. To fill this gap, we present TPC-268, the first plant...
Hyperspectral
Authors: M. Kerem Aydin, Yi-Chun Hung, Jaclyn Pytlarz, Qi Guo, Emma Alexander
Institutions: Northwestern University, Dolby Laboratories Inc, Purdue University
Hyperspectral cameras rely on spectral filters, dispersive optics, or coded apertures, which reduce light throughput and increase hardware complexity. These systems face harsh trade-offs between spatial, spectral, and temporal resolution in inherently low-photon conditions. Computational imaging systems break through these trade-offs with compressive sensing, but have typically required complex optics and/or extensive computation. We present Spectrum from Defocus (SfD), a chromatic focal sweep m...
VFM, Geo-Localization
Authors: Arnav Devalapally, Poornima Jain, Kartik Srinivas, Vineeth Balasubramanian
Institutions: University of Michigan, Indian Institute of Technology, Hyderabad, Microsoft Research and IIT-Hyderabad
The increasing adaptation of vision models across domains, such as satellite imagery and medical scans, has raised an emerging privacy risk: models may inadvertently retain and leak sensitive source-domain specific information in the target domain. This creates a compelling use case for machine unlearning to protect the privacy of sensitive source-domain data. Among adaptation techniques, source-free domain adaptation (SFDA) calls for an urgent need for machine unlearning (MU), where the source ...
UAV, Generation
Authors: Weiqin Jiao, Hao Cheng, George Vosselman, Claudio Persello
Institutions: University of Twente
We tackle the problem of generating a complete vector map representation from aerial imagery in a single run: producing polygons for all land-cover classes with shared boundaries and no gaps or overlaps. Existing polygonization methods are typically class-specific; extending them to multiple classes via per-class runs commonly leads to topological inconsistencies, such as duplicated edges, gaps, and overlaps. We formalize this new task as All-Class Polygonal Vectorization (ACPV) and release the ...
UAV, 3D Reconstruction
Authors: Hanyang Liu, Rongjun Qin
Institutions: Ohio State University, Columbus, The Ohio State University
Recent advances in 4D scene reconstruction have greatly improved dynamic modeling across various domains. However, existing approaches remain limited under aerial conditions with single-view capture, wide spatial range, and dynamic objects of limited spatial footprint and large motion disparity. These challenges cause severe depth ambiguity and unstable motion estimation, making monocular aerial reconstruction inherently ill-posed. To this end, we present AeroDGS, a physics-guided 4D Gaussian spl...
UAV, 3D Reconstruction
Authors: Tingyun Li, Xinyi Liu, Yongjun Zhang, Yi Wan, Xiaoan Liu, Fan Weiwei, Jiahao Liu
Institutions: Wuhan University
Monocular UAV videos pose a fundamental challenge for 3D reconstruction: dynamic scene modeling requires accurate camera poses, yet recovering poses from long UAV trajectories often fails under texture-sparse regions and moving objects. Existing approaches typically handle either pose-free static reconstruction or dynamic reconstruction with known poses, but jointly solving both from casual aerial footage remains difficult due to motion coupling and severe scale variation. We introduce \modelname,...
UAV, Generation
Authors: Xian Ge, Yuling Pan, Yuhang Zhang, Xiang Li, Weijun Zhang, Dizhe Zhang, Zhaoliang Wan, Xin Lin, Xiangkai Zhang, Juntao Liang, Xiangtai Li, jerett Jiang, Bo Du, Ming-Hsuan Yang, Lu Qi
Institutions: Insta360, Shenzhen University, Northwestern University, Insta360 Research, insta360, University of California, San Diego, Institute of Automation, Chinese Academy of Sciences, ByteDance Inc., Wuhan University, University of California at Merced, Insta360; Wuhan University
The field of 360-degree omnidirectional understanding has been receiving increasing attention for advancing spatial intelligence. However, the lack of large-scale and diverse data remains a major limitation. In this work, we propose AirSim360, a simulation platform for omnidirectional data from aerial viewpoints, enabling wide-ranging scene sampling with drones. Specifically, AirSim360 focuses on three key aspects: a render-aligned data and labeling paradigm for pixel-level geometric, semantic, ...
VFM, MLLM, UAV
Authors: Daoxuan Zhang, Ping Chen, Xiaobo Xia, Xiu Su, Ruichen Zhen, Jianqiang Xiao, Shuo Yang
Institutions: Harbin Institute of Technology, ShenZhen, Harbin Institute of Technology, Shenzhen, The University of Sydney, Central South University, Harbin Institute of Technology (Shenzhen)
Aerial Object Goal Navigation, a challenging frontier in Embodied AI, requires an Unmanned Aerial Vehicle (UAV) agent to autonomously explore, reason, and identify a specific target using only visual perception and language description. However, existing methods struggle to memorize complex spatial representations in aerial environments, to make reliable and interpretable action decisions, and to explore and gather information efficiently. To address these challenges, we intro...
VFM, MLLM, Generation
Authors: Run Shao, Ziyu Li, Zhaoyang Zhang, Linrui Xu, Xinran He, Hongyuan Yuan, Bolei He, Yongxing Dai, Yan Yiming, Chen Yijun, Wang Guo, Haifeng Li
Institutions: Central South University, Baidu Inc., University of Science and Technology of China, Baidu, Zhejiang University, Central South University, China
Recent multimodal reasoning models, inspired by DeepSeek-R1, have significantly advanced vision–language systems. However, in remote sensing (RS) tasks, we observe widespread pseudo reasoning: models narrate the process of reasoning rather than genuinely reason toward the correct answer based on visual evidence. We attribute this to the Glance Effect, where a single, coarse perception of large-scale RS imagery results in incomplete understanding and reasoning based on linguistic self-consistency...
VFM, MLLM, UAV
Authors: Yu Hu, Jianyang Gu, Hao Liu, Yue Cao, Jozsef Hamari, Zheng Liu, Mohsen Zardadi
Institutions: University of British Columbia, Zhejiang University, TerraSense Analytics
Adapting vision-language models to remote sensing imagery remains challenging due to two key factors: limited semantic coverage in textual representations and insufficient adaptability of visual features. These issues are particularly significant in aerial scenes, which involve various visual appearances and fine-grained object distinctions. We propose AVION, a knowledge distillation framework tailored for remote sensing adaptation of vision-language models. The teacher module constructs semanti...
Detection, Self-Supervised
Authors: Jingzhou Chen, Dexin Chen, Fengchao Xiong, Yuntao Qian, Liang Xiao
Institutions: Nanjing University of Science and Technology, Zhejiang University
Fine-grained remote sensing datasets often use hierarchical label structures to differentiate objects in a coarse-to-fine manner, with each object annotated across multiple levels. However, embedding this semantic hierarchy into the representation learning space to improve fine-grained detection performance remains challenging. Previous studies have applied supervised contrastive learning at different hierarchical levels to group objects under the same parent class while distinguishing sibling s...
Detection, UAV, Super-Resolution
Authors: Wenchao Guan, Chuan Lin, Sihan Huang, Xiongzhen Wang, Xintao Pang
Institutions: Guangxi University of Science and Technology, Macao Polytechnic University
In remote sensing images, small objects often suffer from low color contrast and blurred edges, resulting in suboptimal feature extraction performance. Physiological studies indicate that the LGN/V1–V2–V4 pathway offers color opponency sensitivity and hierarchical enhancement advantages for the extraction of color information, while the V1–V4 pathway shows strong orientation selectivity in edge information extraction. The integration of these two types of information in the V4 region significant...
Segmentation
Authors: Yixin Xiong, Ke Wang, Tongtong Cheng, Chunhui Liu, Kai Liu
Institutions: Chongqing University, Hong Kong Polytechnic University
Bird’s Eye View (BEV) semantic segmentation is essential for autonomous driving and mobile robotics, yet it still faces significant challenges in accurately segmenting foreground objects and efficiently estimating layout categories obscured by objects. To address these issues, we propose BEV-CAR, a Context-Aware Rasterization method that rasterizes the BEV representation without any coordinate transformations. By optimising each ray and incorporating depth features, BEV-CAR effectively addres...
MLLM
Authors: Wenfei Guan, Jilin Mei, Tong Shen, Xumin Wu, Shuo Wang, Chen Min, Yu Hu
Institutions: Institute of Computing Technology, Chinese Academy of Sciences, University of the Chinese Academy of Sciences, Institute of Computing Technology, CAS, Chinese Academy of Sciences
Deep learning has advanced vectorized road extraction in urban settings, yet off-road environments remain underexplored and challenging. A significant domain gap causes advanced models to fail in wild terrains due to two key issues: lack of large-scale vectorized datasets and structural weakness in prevailing methods. Models such as SAM-Road employ a node-centric paradigm that reasons at sparse endpoints, making them fragile to occlusions and ambiguous junctions in off-road scenes, leading to t...
UAV, Geo-Localization
Authors: Liu Kejia, Haoyang Zhou, Ruoyu Xu, Peicheng Wang, Mingli Song, Haofei Zhang
Institutions: Zhejiang University
Recent advances in cross-view geo-localization (CVGL) methods have shown strong potential for supporting unmanned aerial vehicle (UAV) navigation in GNSS-denied environments. However, existing work predominantly focuses on matching UAV views to onboard map tiles, which introduces an inherent trade-off between accuracy and storage overhead, and overlooks the importance of the UAV’s heading during navigation. Moreover, the substantial discrepancies and varying overlaps in cross-view scenarios have...
VFM, Geo-Localization
Authors: Yi Liu, Yi Wan, Lei Yu, Panwang Xia, Qiong Wu, Yingying Pei, Xuejun Huang, Junjian Zhang, Xiangyuan Cai, Hongwei Hu, Yongjun Zhang
Institutions: Wuhan University, Ant Group, Alibaba Group
Owing to the weak stereo geometry of satellite images, Planar Block Adjustment (PBA) is a predominant technique for correcting geometric distortions in satellite images, which treats elevation as a known constraint and primarily optimizes planar coordinates. Existing PBA methods mainly rely on explicit tie points, suffering from parallax caused by inaccurate elevation (e.g., near high buildings) and irreversible error accumulation, which severely degrades adjustment accuracy. In this paper, a "B...
VFM, Geo-Localization, Image Fusion
Authors: JangHyeon Lee, Philipe Ambrozio Dias, Yao-Yi Chiang, Dalton Lunga
Institutions: University of Minnesota, Oak Ridge National Laboratory, University of Minnesota, Minneapolis
Learning general-purpose representations of geographic locations has become essential to geospatial tasks such as population estimation and environmental monitoring. To obtain such representations, multimodal geo-foundation models often use contrastive learning (CL) to align satellite imagery with geo-coordinates, implicitly assuming that cross-modal (shared) information suffices for downstream tasks. However, not all task-relevant information is shared between modalities, and retaining modality...
UAV, Tracking
Authors: Jingtao Ye, Zhang Kexin, Xunchi Ma, Johann Li, Guangming Zhu, Peiyi Shen, Linhua Jiang, Xiangdong Zhang, Liang Zhang
Institutions: Xi'an University of Electronic Science and Technology, Xidian University
The rapid movements and agile maneuvers of unmanned aerial vehicles (UAVs) induce significant observational challenges for multi-object tracking (MOT). However, existing UAV-perspective MOT benchmarks often lack these complexities, featuring predominantly predictable camera dynamics and linear motion patterns. To address this gap, we introduce DynUAV, a new benchmark for dynamic UAV-perspective MOT, characterized by intense ego-motion and the resulting complex apparent trajectories. The benchmar...
VFM, Segmentation, Change Detection
Authors: Filip Wolf, Blaz Rolih, Luka Cehovin Zajc
Institutions: University of Ljubljana, University of Ljubljana, Slovenia
Foundation models are transforming Earth Observation (EO), yet the diversity of EO sensors and modalities makes a single universal model unrealistic. Multiple specialized EO foundation models (EOFMs) will likely coexist, making efficient knowledge transfer across modalities essential. Most existing EO pretraining relies on masked image modeling, which emphasizes local reconstruction but provides limited control over global semantic structure. To address this, we propose a dual-teacher contrastiv...
VFM, Detection, UAV
Authors: Mingbo Hong, Feng Liu, Caroline Gevaert, George Vosselman, Hao Cheng
Institutions: University of Twente, Drexel University, University of Twente; The World Bank
Detectors often suffer from degraded performance, primarily due to the distributional gap between the source and target domains. This issue is especially evident in single-source domains with limited data, as models tend to rely on confounders (e.g., illumination, co-occurrence, and style) from the source domain, leading to spurious correlations that hinder generalization. To this end, this paper proposes a novel Basis-driven framework for domain generalization, namely Bridge, that incorpora...
VFM, MLLM, Hyperspectral
Authors: Jinheng Ji, Jiahui Qu, Wenqian Dong, Yunsong Li
Institutions: Xidian University
Fine-tuning Vision-Language Models (VLMs) trained on large-scale datasets of natural image-text pairs has demonstrated impressive performance for various downstream tasks. However, their fine-tuning for remote sensing (RS) tasks faces dual barriers: (1) Data-level barrier caused by the fundamental modality gap between natural imagery and RS data, and (2) Task-level barrier stemming from the requirement for multi-source interaction modeling capabilities. This paper proposes a Cross-modal Fusion I...
Change Detection, Super-Resolution, Generation
Authors: Zhenghui Zhao, Chen Wu, Xiangyong Cao, Di Wang, Hongruixuan Chen, Datao Tang, Liangpei Zhang, Zhuo Zheng
Institutions: Wuhan University, Xi'an Jiaotong University, The University of Tokyo, Stanford University
Spatiotemporal image generation is a highly meaningful task, which can generate future scenes conditioned on given observations. However, existing change generation methods can only handle event-driven changes (e.g., new buildings) and fail to model cross-temporal variations (e.g., seasonal shifts). In this work, we propose ChangeBridge, a conditional spatiotemporal image generation model for remote sensing. Given pre-event images and multimodal event controls, ChangeBridge generates post-event ...
Segmentation
Authors: Jinming Chai, Lingling Li, Licheng Jiao, Xiaoqiang Lu, Long Sun, Xu Liu, Wenping Ma, Weibin Li
Institutions: Xidian University, Xi'an University
The referring expression comprehension and segmentation (RECS) task plays a vital role in remote sensing due to its high efficiency in multi-tasking. However, RECS has reached a performance bottleneck rooted in representational insufficiency, primarily due to cross-task representational fragmentation in multi-task interpretation. In this paper, we propose RECS4R, a unified multi-task framework to upgrade RECS performance. At the representation level, we introduce language-guided unified contour decoding...
MLLM, Image Fusion
Authors: Xuecong Liu, Mengzhu Ding, Zixuan Sun, Zhang Li, Xichao Teng
Institutions: Northeastern University, Northeastern University at Qinhuangdao, National University of Defense Technology
We present Consistent–Recurrent Feature Flow Transformer (CRFT), a unified coarse-to-fine framework that learns feature flow for robust cross-modal registration. CRFT learns a modality-consistent feature flow representation within a transformer-based architecture that jointly performs feature alignment and flow estimation. The coarse stage establishes global correspondences through multi-scale feature correlation, while the fine stage refines local details via hierarchical feature fusion and ada...
MLLM, UAV, Image Fusion
Authors: Yifei Deng, Chenglong Li, Yuyang Zhang, Guyue Hu, Jin Tang
Institutions: Anhui University, University of Hong Kong
Text-aerial person retrieval aims to identify targets in UAV-captured images from eyewitness descriptions, supporting intelligent transportation and public security applications. Compared to ground-view text–image person retrieval, UAV-captured images often suffer from degraded visual information due to drastic variations in viewing angles and flight altitudes, making semantic alignment with textual descriptions very challenging. To address this issue, we propose a novel Cross-modal Fuzzy Ali...
Geo-Localization, Generation
Authors: Matias Turkulainen, Akshay Krishnan, Filippo Aleotti, Mohamed Sayed, Guillermo Garcia-Hernando, Juho Kannala, Arno Solin, Gabriel Brostow, Daniyar Turmukhambetov
Institutions: Aalto University, Georgia Institute of Technology, Niantic, Inc., University College London, University of London, Niantic Spatial, University of Oulu, Department of Computer Science, University College London, Niantic
We present Cross-View Splatter, a feed-forward method that predicts pixel-aligned Gaussian splats for outdoor scenes captured at ground level and by satellite. Faithful reconstructions require good camera coverage, but ground imagery is time-consuming and hard to capture at scale for large outdoor scenes. Fortunately, satellite imagery can provide a global geometric prior that is easy to access via public APIs. Cross-View Splatter fuses orthorectified satellite views with GPS-tagged ground photo...
VFM, Segmentation
Authors: Shilei Cao, Ziyang Gong, Hehai Lin, Yang Liu, Jiashun Cheng, Xiaoxing Hu, Haoyuan Liang, Guowen Li, Chengwei Qin, Hong Cheng, Xue Yang, Juepeng Zheng, Haohuan Fu
Institutions: Sun Yat-sen University, Shanghai Artificial Intelligence Laboratory; SUN YAT-SEN UNIVERSITY, The Hong Kong University of Science and Technology, The Chinese University of Hong Kong, Hong Kong University of Science and Technology, Beijing Institute of Technology, Sun Yat-Sen University, The Hong Kong University of Science and Technology (Guangzhou), Shanghai AI Laboratory, Tsinghua University, Tsinghua University
In Remote Sensing (RS), Parameter-Efficient Fine-Tuning (PEFT) has emerged as a key approach to activate the generalizable representation ability of foundation models for downstream tasks. However, existing specialized PEFT methods often fail when applied to large-scale Earth observation tasks, as they are unable to fully handle the multifaceted and unpredictable domain gaps (e.g., spatial, semantic, and frequency shifts) inherent in RS data. To overcome this, we propose CrossEarth-Gate, which i...
VFM, Detection, MLLM
Authors: Zhipeng Liu, Chunbo Luo
Institutions: University of Exeter
Vision–language models (VLMs) enable text-guided object detection but degrade severely under cross-view scenarios where ground and aerial viewpoints differ in altitude, scale, and spatial layout. These geometric changes introduce systematic complexity variations between viewpoints, e.g., ground view images contain dense and highly occluded structures, while aerial images are sparse and globally organized. Fixed VLM fusion mechanisms cannot handle this discrepancy. We propose CrossVL, a ...
VFM, Segmentation
Authors: Dan Jacobellis, Mateen Ulhaq, Fabien Racapé, Hyomin Choi, Neeraja Yadwadkar
Institutions: The University of Texas at Austin, InterDigital, University of Texas at Austin
Video comprises the vast majority of bits that are generated daily, and is the primary signal driving current innovations in robotics, remote sensing, and wearable technology. Yet, the most powerful video understanding models are too expensive for the resource-constrained platforms used in these applications. One approach is to offload inference to the cloud; this gives access to GPUs capable of processing high-resolution videos in real time. But even with reliable, high-bandwidth communication cha...
UAV, Tracking
Authors: Hongtao Yang, Bineng Zhong, Qihua Liang, Yaozong Zheng, Xiantao Hu, Yuanliang Xue, Shuxiang Song
Institutions: Guangxi Normal University, Nanjing University of Science and Technology, Xi’an Research Institute of High Technology
Given the real-time demands of UAV tracking, many methods simplify the backbone to reduce computation, but this often weakens feature representation and degrades performance in complex scenarios. To alleviate this issue, we propose EATrack, an efficient and asymmetric UAV tracking framework centered around a teacher-guided dual-branch distillation strategy that enhances the feature expressiveness of the lightweight student model. Specifically, EATrack investigates two complementary perspectives ...
Hyperspectral, Super-Resolution, Generation
Authors: Tao Zhang, Shengtao Yao, Rong Zeng, Zunjie Zhu, Bolun Zheng, Yaoqi Sun, Ying Fu, Chenggang Yan
Institutions: Hangzhou Dianzi University, Lishui University, Beijing Institute of Technology, Hangzhou Dianzi University, Tsinghua University
Hardware constraints make it challenging to simultaneously acquire hyperspectral images (HSIs) with both high spatial and high spectral resolutions. A promising solution is to fuse low-resolution HSI (LR-HSI) with high-resolution multispectral images (HR-MSI) to generate high-resolution HSI (HR-HSI). Recently, diffusion models have introduced possibilities for HSI super-resolution, but suffer from low-efficiency sampling, detail-limited generation, and insufficient denoising. To address these is...
Hyperspectral, Super-Resolution
Authors: Yingkai Zhang, Tao Zhang, Jing Nie, Ying Fu
Institutions: Beijing Institute of Technology, Hangzhou Dianzi University
Unregistered hyperspectral image (HSI) super-resolution (SR) typically aims to enhance a low-resolution HSI using an unregistered high-resolution reference image. In this paper, we propose an unmixing-based fusion framework that decouples spatial-spectral information to simultaneously mitigate the impact of unregistered fusion and enhance the learnability of SR models. Specifically, we first utilize singular value decomposition for initial spectral unmixing, preserving the original endmembers whil...
Hyperspectral
Authors: Lijing Cai, Zhan Shi, Chenglong Huang, Jinyao Wu, Qiping Li, Zikang Huo, Linsen Chen, Chongde Zi, Xun Cao
Institutions: Nanjing University
Recently, Spectral Compressive Imaging (SCI) has achieved remarkable success, unlocking significant potential for dynamic spectral vision. However, existing reconstruction methods, primarily image-based, suffer from two limitations: (i) Encoding process masks spatial-spectral features, leading to uncertainty in reconstructing missing information from single compressed measurements, and (ii) The frame-by-frame reconstruction paradigm fails to ensure temporal consistency, which is crucial in the v...
Segmentation
Authors: Hengzhi Chen, Liqian Feng, Wenhua Wu, Xiaogang Zhu, Qiuxia Wu, Lianlei Shan, Kun Hu
Institutions: University of Sydney, Adelaide University, South China University of Technology, Tsinghua University, Edith Cowan University
Semantic segmentation of ultra-high-resolution (UHR) remote sensing imagery is critical for applications like environmental monitoring and urban planning but faces computational and optimization challenges. Conventional methods either lose fine details through downsampling or fragment global context via patch processing. While multi-branch networks address this trade-off, they suffer from computational inefficiency and conflicting gradient dynamics during training. We propose F2Net, a freque...
Detection
Authors: Changyu Gu, Linwei Chen, Lin Gu, Ying Fu
Institutions: Beijing Institute of Technology, Tohoku University
In remote sensing rotated object detection, mainstream methods suffer from two bottlenecks: directional incoherence at the detector neck and task conflict at the detection head. Utilising Fourier rotation equivariance, we introduce Fourier Angle Alignment, which analyses angle information through the frequency spectrum and aligns the main direction to a certain orientation. Then we propose two plug-and-play modules: FAAFusion and FAA Head. FAAFusion works at the detector neck, aligning the mai...
VFM, MLLM, SAR
Authors: Xiaokun Zhang, Yi Yang, Ziqi Ye, Baiyun Baiyun, Xiaorong Guo, Qingchen Fang, Ry Zhang, Xinpeng Zhou, Haipeng Wang
Institutions: Fudan University
Research on the intelligent interpretation of all-weather, all-time Synthetic Aperture Radar (SAR) is crucial for advancing remote sensing applications. In recent years, although Visual Language Models (VLMs) have demonstrated strong open-world understanding capabilities on RGB images, their performance is severely limited when directly applied to the SAR field due to the complexity of the imaging mechanism, sensitivity to scattering features, and the scarcity of high-quality text corpora. To sy...
UAV
Authors: Qingwei Ben, Botian Xu, Kailin Li, Feiyu Jia, Wentao Zhang, Jingping Wang, Jingbo Wang, Dahua Lin, Jiangmiao Pang
Institutions: The Chinese University of Hong Kong, Tsinghua University, Shanghai Jiaotong University, University of Science and Technology of China, The University of Tokyo, Shanghai AI LAB, Shanghai AI Laboratory
Robust humanoid locomotion requires accurate and globally consistent perception of the surrounding 3D environment. However, existing perception modules, mainly based on depth images or elevation maps, offer only partial and locally flattened views of the environment, failing to capture the full 3D structure. This paper presents Gallant, a voxel-grid–based framework for humanoid locomotion and local navigation in 3D constrained terrains. It leverages voxelized LiDAR data as a lightweight...
Generation
Authors: Mengtian Li, Fan Yang, Ruixue Xiong, Yiyan Fan, Zhifeng Xie, Zeyu Wang
Institutions: Shanghai University, The Hong Kong University of Science and Technology (Guangzhou)
Jiangnan gardens, a prominent style of Chinese classical gardens, hold great potential as digital assets for film and game production and digital tourism. However, manual modeling of Jiangnan gardens heavily relies on expert experience for layout design and asset creation, making the process time-consuming. To address this gap, we propose GardenDesigner, a novel framework that encodes aesthetic principles for Jiangnan garden construction and integrates a chain of agents based on procedural model...
VFMUAVGeo-Localization
Authors: Yancheng Zhang, Xiaohan Zhang, Guangyu Sun, Zonglin Lyu, Safwan Wshah, Chen Chen
Institutions: University of Central Florida, University of Vermont, New York University
Cross-view geo-spatial learning consists of two important tasks: Cross-View Geo-Localization (CVGL) and Cross-View Image Synthesis (CVIS), both of which rely on establishing geometric correspondences between ground and aerial views. Recent Geometric Foundation Models (GFMs) have demonstrated strong capabilities in extracting generalizable 3D geometric features from images, but their potential in cross-view geo-spatial tasks remains underexplored. In this work, we present Geo^2, a unified framewo...
VFMUAVGeo-Localization
Authors: Zixuan Song, Jing Zhang, Di Wang, Zidie Zhou, Wenbin Liu, Haonan Guo, En Wang, Bo Du
Institutions: Zhongguancun Academy, The University of Sydney, Wuhan University, Jilin University
Cross-view geo-localization infers a location by retrieving geo-tagged reference images that visually correspond to a query image. However, the traditional satellite-centric paradigm limits robustness when high-resolution or up-to-date satellite imagery is unavailable. It further underexploits complementary cues across views (e.g., drone, satellite, and street) and modalities (e.g., language and image). To address these challenges, we propose GeoBridge, a foundation model that performs bidirecti...
MLLMImage Fusion
Authors: Daixun Li, Zirui Li, Sibo He, Jiayun Tian, Mingxiang Cao, Weiying Xie, Yunke Wang, Xin Zhang, Yusi Zhang, Yunsong Li, Chang Xu, Leyuan Fang
Institutions: State Key Laboratory of Integrated Services Networks, Xidian University, University of Sydney, Hunan University
Multimodal Large Language Models (MLLMs) have shown strong potential in remote sensing (RS) through multi-task reasoning and cross-modal generalization. However, existing RS-MLLMs mainly rely on a single shared expert for all tasks, making it hard to produce reliable results. Meanwhile, the intrinsic redundancy and homogeneity of RS images bring substantial difficulties for both training and inference. These challenges directly conflict with the demands of remote sensing, which values task precis...
DetectionMLLMGeo-Localization
Authors: Jiaqi Liu, Ronghao Fu, Haoran Liu, Lang Sun, Qipeng Wang, Bo Yang
Institutions: Jilin University
Autoregressive models are structurally misaligned with the inherently parallel nature of geospatial understanding, forcing a rigid sequential narrative onto scenes and fundamentally hindering the generation of structured and coherent outputs. We challenge this paradigm by reframing geospatial generation as a parallel refinement process, enabling a holistic, coarse-to-fine synthesis that resolves all semantic elements simultaneously. To operationalize this, we introduce GeoDiT, the first diffusio...
Geo-Localization
Authors: Ayesh Abu Lehyeh, Xiaohan Zhang, Ahmad Arrabi, Waqas Sultani, Chen Chen, Safwan Wshah
Institutions: University of Vermont, University of Central Florida
Accurate and fast localization is vital for safe autonomous navigation in GPS-denied areas. Fine-Grained Cross-View Geolocalization (FG-CVG) aims to estimate the precise 2-Degree-of-Freedom (2-DoF) location of a ground image relative to a satellite image. However, current methods force a difficult trade-off, with high-accuracy models being slow for real-time use. In this paper, we introduce GeoFlow, a new approach that offers a lightweight and highly efficient framework that breaks this accuracy...
MLLMGeo-Localization
Authors: Aoran Xiao, Shihao Cheng, Yonghao Xu, Yexian Ren, Hongruixuan Chen, Naoto Yokoya
Institutions: Harbin Institute of Technology, Wuhan University, Linköping University, Nanjing University of Information Science and Technology, The University of Tokyo
Recent advances in multimodal large language models (MLLMs) have accelerated progress in domain-oriented AI, yet their development in geoscience and remote sensing (RS) remains constrained by distinctive challenges: wide-ranging disciplinary knowledge, heterogeneous sensor modalities, and a fragmented spectrum of tasks. To bridge these gaps, we introduce GeoMMBench, a comprehensive multimodal question-answering benchmark covering diverse RS disciplines, sensors, and tasks, enabling broader and m...
VFMSegmentationGeo-Localization
Authors: Joëlle Hanna, Damian Falk, Stella X. Yu, Damian Borth
Institutions: University of St. Gallen, University of Michigan - Ann Arbor
Recent advances in remote sensing have led to an increase in the number of available foundation models, each trained on different modalities, datasets, and objectives, yet each capturing only part of the vast geospatial knowledge landscape. While these models show strong results within their respective domains, their capabilities remain complementary rather than unified. Therefore, instead of choosing one model over another, we aim to combine their strengths into a single shared representation. We in...
Geo-LocalizationGeneration
Authors: Sara Aghajanzadeh, Xiaoyang Bai, Zhongmin Zhu, David Forsyth, Viktor Gruev
Institutions: University of Illinois Urbana-Champaign, The University of Hong Kong
It is extremely hard for an underwater agent to know where it is. Satellite signals disappear within centimeters of the surface; acoustic baselines require heavy infrastructure to instrument small regions. The polarization of the sky, visible underwater, reveals the elevation of the sun. The pattern of elevation over the day reveals location to an agent with a clock. However, recovering elevation from polarization images is very difficult. SOTA geolocalization methods can localize well for lo...
VFMSelf-Supervised
Authors: Henry Herzog, Favyen Bastani, Yawen Zhang, Gabriel Tseng, Joseph Redmon, Hadrien Sablon, Ryan Park, Jacob Morrison, Alexandra Buraczynski, Karen Farley, Josh Hansen, Andrew Howe, Patrick Johnson, Mark Otterlee, Ted Schmitt, Hunter Pitelka, Stephen Daspit, Rachel Ratner, Christopher Wilhelm, Sebastian Wood, Mike Jacobi, Hannah Kerner, Evan Shelhamer, Ali Farhadi, Ranjay Krishna, Patrick Beukema
Institutions: Allen Institute for Artificial Intelligence, McGill University, Stanford University, University of Washington, Arizona State University, UBC / Vector
Earth observation data presents a unique challenge: it is spatial like images, sequential like video or text, and highly multimodal. We present Helios: a multimodal, spatio-temporal foundation model that employs a novel self-supervised learning formulation, masking strategy, and loss all designed for the Earth observation domain. Helios achieves state-of-the-art performance compared to 12 other foundation models across a variety of research benchmarks and real-world tasks from external partners....
Self-Supervised
Authors: Yang Chu, Xiaomeng Yang, Keli Deng, Yuntao Qian
Institutions: Zhejiang University
Hierarchical classification (HC) on degraded images presents challenges due to feature corruption, unreliable confidence estimation, and fine-grained misclassification. Existing methods often struggle to balance semantic consistency and adaptive decision paths under low-quality visual conditions. To address this, we propose HierUQ, a unified framework that integrates uncertainty quantification with adaptive granularity reconciliation. A Vision Transformer backbone extracts global features, which...
SegmentationUAV
Authors: Linkang Xu, Gang Li, Yue Song, Xiangxin Ji
Institutions: Tongji University
Drone-based building defect segmentation remains challenging due to complex surface textures and illumination variations. We propose TPSegformer, a topology-preserving segmentation framework that mitigates mis-segmentation in such scenarios. Its decoder incorporates a Hilbert curve–based topology-preserving mechanism to maintain spatial continuity and boundary precision during category layer computation. A lightweight multi-scale fusion module enhances semantic representation, while global conte...
SegmentationMLLMGeneration
Authors: Jie Qiu, Xin Li, Fan Yang, Yan Wang, Dong Yu, Changying Wang, Linwei Dai, Yongxiang Chen, Youqin Chen, Jianzhang Chen
Institutions: Fujian Agriculture and Forestry University, G42, AIQ, Beijing Jiaotong University, IFLYTEK CO.LTD., Fujian University of Technology
High-resolution remote sensing imagery exhibits complex spatial regularities where topology, continuity, and region adjacency govern semantic organization. However, existing remote sensing image semantic segmentation (RSISS) networks, being predominantly discriminative, estimate strong posteriors from data while lacking generative priors that encode such structural dependencies. This imbalance leads to fragmented boundaries, texture overfitting, and poor cross-domain generalization. We address ...
MLLMGeo-Localization
Authors: Jieren Deng, Zhizhang Hu, Ziyan He, Aleksandar Cvetkovic, Pak Chung, Dragomir Yankov, Chiqun Zhang
Institutions: Capital One, Amazon, Microsoft
Most mapping tools remain point-and-click, making it hard to ask spatial questions or relate what a camera sees to its surrounding geography in a view-aware way. We present **IMAIA**, the *Interactive Maps AI Assistant*, which enables natural-language interaction with both vector (street) maps and satellite imagery, while enriching camera inputs with geospatial intelligence to help users interpret the world around them. IMAIA consists of two complementary modules: **Maps Plus**, which treats t...
UAVSuper-ResolutionTracking
Authors: Haowei Sun, Kai Zhou, Hao Gao, Shiteng Zhang, Jinwu Hu, Xutao Wen, Qixiang Ye, Mingkui Tan
Institutions: South China University of Technology, University of Chinese Academy of Sciences
Visual Active Tracking (VAT) aims to control cameras to follow a target in 3D space, which is critical for applications like drone navigation and security surveillance. However, it faces two key bottlenecks in real-world deployment: confusion from visually similar distractors caused by insufficient instance-level discrimination and severe failure under occlusions due to the absence of active planning. To address these, we propose OA-VAT, a unified pipeline with three complementary modules. First...
VFMMLLMUAV
Authors: Lingfeng Zhang, Yuchen Zhang, Hongsheng Li, Haoxiang Fu, Yingbo Tang, Hangjun Ye, Long Chen, Xiaojun Liang, Xiaoshuai Hao, Wenbo Ding
Institutions: Tsinghua University, Xiaomi Corporation, Georgia Institute of Technology, Shenzhen International Graduate School, Tsinghua University, Institute of Automation, Chinese Academy of Sciences, Wayve, Pengcheng Laboratory, Beijing Academy of Artificial Intelligence (BAAI)
Vision-Language Models (VLMs), leveraging their powerful visual perception and reasoning capabilities, have been widely applied in Unmanned Aerial Vehicle (UAV) tasks. However, the spatial intelligence capabilities of existing VLMs in UAV scenarios remain largely unexplored, raising concerns about their effectiveness in navigating and interpreting dynamic environments. To bridge this gap, we introduce SpatialSky-Bench, a comprehensive benchmark specifically designed to evaluate the spatial intelli...
SegmentationHyperspectralSelf-Supervised
Authors: Zijun He, Ping Wang, Xiaodong Wang, ChangChen ChangChen, Xin Yuan
Institutions: Westlake University, Zhejiang University
Coded Aperture Snapshot Spectral Imaging (CASSI) is an emerging hyperspectral image (HSI) acquisition technique for downstream semantic segmentation. Due to the ill-posed nature of CASSI systems, typical solutions are compelled to adopt a two-stage reconstruction-then-segmentation pipeline, treating reconstruction and segmentation as two separate tasks. However, we observe that these two tasks are interrelated and mutually reinforcing for representation learning, and thus separating them limits the overall accu...
Geo-Localization3D Reconstruction
Authors: Suwan Lee, Jo Ryeong Yim, Kibaek Park, Dong Kim, Eunhyeuk Kim, Minsup Jeong, Chae Sim, Seokju Lee
Institutions: Korea Institute of Energy Technology (KENTECH), Korea Aerospace Research Institute, Korea Astronomy and Space Science Institute
High-resolution and high-precision digital elevation models (DEMs) of the lunar surface are essential for landing site selection and lunar geological research. However, traditional stereo matching techniques provide a limited representation of the 3D scene, struggling with non-textured regions and extreme lighting variations. Furthermore, recent lunar neural rendering methods are ill-suited for 3D reconstruction due to their reliance on simple pinhole approximations for pushbroom sensors. These challenge...
VFMSegmentationHyperspectral
Authors: Xi Chen, Maojun Zhang, Yu Liu, Shen Yan
Institutions: National University of Defense Technology
Domain Generalization Semantic Segmentation (DGSS) in spectral remote sensing is severely challenged by spectral shifts across diverse acquisition conditions, which cause significant performance degradation for models deployed in unseen domains. While Parameter-Efficient Fine-Tuning (PEFT) on foundation models is a promising direction, existing methods employ global, homogeneous adjustments. This "one-size-fits-all" tuning struggles with the spatial heterogeneity of land cover, causing semantic ...
VFMSegmentationUAV
Authors: Shuaibang Peng, Juelin Zhu, Xia Li, Kun Yang, Yu Liu, Maojun Zhang, Shen Yan
Institutions: National University of Defense Technology, Northwestern Polytechnical University
We present LoD-Loc v3, a novel method for generalized aerial visual localization in dense urban environments. While prior work LoD-Loc v2 [89] achieves localization through semantic building silhouette alignment with low-detail city models, it suffers from two key limitations: poor cross-scene generalization and frequent failure in dense building scenes. Our method addresses these challenges through two key innovations. First, we develop a new synthetic data generation pipeline that produces $\t...
MLLMUAV
Authors: Yuwei Ning, Ganlong Zhao, Yipeng Qin, Si Liu, Yang Liu, Liang Lin, Guanbin Li
Institutions: Sun Yat-sen University, University of Hong Kong, Cardiff University, Beihang University
Aerial Vision-and-Language Navigation (Aerial VLN) enables unmanned aerial vehicles (UAVs) to follow natural language instructions and navigate complex urban environments. While recent advances have achieved progress through large-scale memory graphs and lookahead path planning, they remain limited by shallow instruction understanding and high computational cost. In particular, existing methods rely primarily on landmark descriptions, overlooking directional cues, a key source of spatial context i...
Hyperspectral
Authors: Dhruv Verma, Andrew Qiu, Roberto Rangel, Ayandev Barman, Hao Yang, Chenjia Hu, Fengqi Zhang, Roman Genov, David B. Lindell, Kiriakos Kutulakos, Alex Mariakakis
Institutions: University of Toronto, Pinterest, Inc., Universidade de São Paulo, Universidade do Estado do Rio de Janeiro
We present Lumosaic, a compact active hyperspectral video system designed for real-time capture of dynamic scenes. Our approach combines a narrowband LED array with a coded-exposure-pixel (CEP) camera capable of high-speed, per-pixel exposure control, enabling joint encoding of scene information across space, time, and wavelength within each video frame. Unlike passive snapshot systems that divide light across multiple spectral channels simultaneously and assume no motion during a frame’s exposu...
VFMSegmentationSAR
Authors: Yimin Wei, Aoran Xiao, Hongruixuan Chen, Junshi Xia, Naoto Yokoya
Institutions: The University of Tokyo, Harbin Institute of Technology, RIKEN
Open-vocabulary segmentation enables pixel-level recognition from an open set of textual categories, allowing generalization beyond fixed classes. Despite great potential in remote sensing, progress in this area remains largely limited to clear-sky optical data and struggles under cloudy or haze-contaminated conditions. We present MM-OVSeg, a multimodal Optical–SAR fusion framework for resilient open-vocabulary segmentation under adverse weather conditions. MM-OVSeg leverages the complementary s...
UAVGeo-LocalizationGeneration
Authors: Oskar Kristoffersen, Alba Reinders, Morten Hannemose, Anders Dahl, Dim Papadopoulos
Institutions: Technical University of Denmark (DTU), DTU Compute
Geo-spatial analysis of our world benefits from a multimodal approach, as every single geographic location can be described in numerous ways (images from various viewpoints, textual descriptions, and geographic coordinates). Current geo-spatial benchmarks have limited coverage across modalities, considerably restricting progress in the field, as current approaches cannot integrate all relevant modalities within a unified framework. We introduce the Multi-Modal Landmark dataset (MMLandmarks), a be...
Geo-Localization
Authors: Lv Bo, Qingwang Zhang, Le Wu, Yuanyuan Li, Yingying Zhu
Institutions: Shenzhen University, Fudan University
Cross-View Object Geo-Localization (CVOGL) aims to locate an object of interest in a query image within a corresponding satellite image. Existing methods typically assume that the query image contains only a single object, which does not align with the complex, multi-object geo-localization requirements of real-world applications, making them unsuitable for practical scenarios. To bridge the gap between the realistic setting and the existing task, we propose a new task, called Cross-View Multi-Object...
VFMSegmentation
Authors: Mirali Purohit, Bimal Gajera, Irish Mehta, Bhanu Tokas, Jacob Adler, Steven Lu, Scott Dickenshied, Serina Diniega, Brian Bue, Umaa Rebbapragada, Hannah Kerner
Institutions: Arizona State University (ASU), Jet Propulsion Laboratory
We introduce MOMO, the first multi-sensor foundation model for Mars remote sensing. MOMO uses model merging to integrate representations learned independently from three key Martian sensors (HiRISE, CTX, and THEMIS), spanning resolutions from 0.25 m/pixel to 100 m/pixel. Central to our method is our novel Equal Validation Loss (EVL) strategy, which aligns checkpoints across sensors based on validation loss similarity before fusion via task arithmetic. This ensures models are merged at compatible c...
SARImage FusionSuper-Resolution
Authors: Yujian Zhao, Hankun Liu, Guanglin Niu
Institutions: Beihang University
Cross-modal ship re-identification (ReID) between optical and synthetic aperture radar (SAR) imagery has recently emerged as a critical yet underexplored task in maritime intelligence and surveillance. However, the substantial modality gap between optical and SAR images poses a major challenge for robust identification. To address this issue, we propose MOS, a novel framework designed to Mitigate the Optical–SAR modality gap and achieve modality-consistent feature learning for optical-SAR cross-...
UAVHyperspectral
Authors: Yuxuan Zhao, Zhongao Zhou, Bin Yang, He Li, Jian Liang, Jun Chen, Bo Du, Mang Ye
Institutions: Wuhan University
Recent person re-identification (ReID) leverages heterogeneous sensing with multiple modalities and viewpoints to improve robustness across diverse conditions. However, most approaches target predefined scenario pairs (e.g., visible-infrared or aerial-ground) and train separate task-specific models. In contrast, real-world applications require retrieving identities from galleries that cover all scenarios, making such designs inefficient and complex to deploy. To bridge this gap, we introduce Any...
VFMChange DetectionSelf-Supervised
Authors: Liang Zeng, Valerio Marsocci, Wufan Zhao, Andrea Nascetti, Maarten Vergauwen
Institutions: KU Leuven, European Space Agency Φ-lab, The Hong Kong University of Science and Technology (Guangzhou), KTH Royal Institute of Technology
Masked Image Modeling has been one of the most popular self-supervised learning paradigms to learn representations from large-scale, unlabeled Earth Observation images. While incorporating multi-modal and multi-temporal Earth Observation data into Masked Image Modeling has been widely explored, the spatial dependencies between images captured from neighboring areas remain largely overlooked. Since the Earth's surface is continuous, neighboring images are highly related and offer rich contextual...
VFMUAVGeneration
Authors: Shuang Song, Debao Huang, Deyan Deng, Haolin Xiong, Yang Tang, Yajie Zhao, Rongjun Qin
Institutions: The Ohio State University, University of Southern California
Intrinsic image decomposition (IID) of outdoor scenes is crucial for relighting, editing, and understanding large-scale environments, but progress has been limited by the lack of real-world datasets with reliable albedo and shading supervision. We introduce *Olbedo*, a large-scale aerial dataset for outdoor albedo-shading decomposition in the wild. *Olbedo* contains 5,664 UAV images captured across four landscape types, multiple years, and diverse illumination conditions. Each vie...
VFMMLLMChange Detection
Authors: Qi Guo, Jue Wang, Yinhe Liu, Yanfei Zhong
Institutions: Wuhan University
Open-vocabulary change detection (OVCD) seeks to recognize arbitrary changes of interest by enabling generalization beyond a fixed set of predefined classes. We reformulate OVCD as a two-stage pipeline: first generate class-agnostic change proposals using visual foundation models (VFMs) such as SAM and DINOv2, and then perform category identification with vision-language models (VLMs) such as CLIP. We reveal that category identification errors are the primary bottleneck of OVCD, mainly due to th...
VFMSegmentationDetection
Authors: Canyu Mo, Yongxiang Liu, Jiehua Zhang, Zilong Yu, Zhen Liu, Tianpeng Liu, Li Liu
Institutions: National University of Defense Technology
Recent advances in Remote Sensing Foundation Models (RSFMs) have demonstrated considerable potential for Earth Observation (EO) tasks. While adopting natural image foundation models (e.g., DINO) provides a data-efficient strategy for building RSFMs, their strong generalization capability does not fully transfer to complex remote sensing (RS) scenarios due to severe background interference, notably in perceiving challenging targets like low-contrast objects. To this end, we propose ORSATR-X, a no...
Self-Supervised
Authors: Yongshan Zhang, Xiaohuan Lin, Lefei Zhang, Zhihua Cai
Institutions: China University of Geosciences Wuhan, Wuhan University
Multi-view clustering for remote sensing data has received increasing attention by leveraging diverse data representations to enhance Earth observation. Existing methods are primarily developed under the assumption that each pixel is fully observed across all views. No prior work has investigated the more practical yet challenging scenario where some views suffer from partially missing data. To bridge this gap, this paper presents the first study on clustering incomplete remote sensing data, ter...
Detection
Authors: Mingxin Liu, Peiyuan Zhang, Yuan Liu, Wei Zhang, Yue Zhou, Ning Liao, Ziyang Gong, Junwei Luo, Zhirui Wang, Yi Yu, Xue Yang
Institutions: Shanghai Jiao Tong University, Wuhan University, Nanyang Technological University, Shanghai AI Laboratory, Sun Yat-sen University, Aerospace Information Research Institute, Chinese Academy of Sciences, The Ohio State University
The growing demand for oriented object detection (OOD) across various domains has driven significant research in this area. However, the high cost of dataset annotation remains a major concern. Current mainstream OOD algorithms can be mainly categorized into three types: (1) fully supervised methods using complete oriented bounding box (OBB) annotations, (2) semi-supervised methods using partial OBB annotations, and (3) weakly supervised methods using weak annotations such as horizontal boxes or...
UAVGeo-Localization
Authors: Zheng Li, Xueyi Zhang, Yanming Guo, Yuxiang Xie, Ding Zhaoyun, Siqi Cai, Haizhou Li, Mingrui Lao
Institutions: National University of Defense Technology, Harbin Institute of Technology, The Chinese University of Hong Kong (Shenzhen), National University of Singapore
Cross-view geo-localization, which establishes correspondences between drone-captured and satellite imagery, is a critical task for UAV navigation, event detection, and aerial surveying. Most existing approaches embed cross-view data into a joint feature space to maximize similarity between paired images. However, these methods typically assume perfect alignment of image pairs in training data, an assumption that rarely holds in practical scenarios. In real-world conditions, factors such as urban ca...
VFMUAVGeo-Localization
Authors: Xiaoya Cheng, Long Wang, Yan Liu, Xinyi Liu, Hanlin Tan, Yu Liu, Maojun Zhang, Shen Yan
Institutions: National University of Defense Technology, SenseTime, Hangzhou Dianzi University
We present PiLoT, a unified framework that tackles UAV-based ego and target geo-localization. Conventional approaches rely on decoupled pipelines that fuse GNSS and Visual-Inertial Odometry (VIO) for ego-pose estimation, and active sensors like laser rangefinders for target localization. However, these methods are susceptible to failure in GNSS-denied environments and incur substantial hardware costs and complexity. PiLoT breaks this paradigm by directly registering a live video stream against a ge...
SegmentationMLLMUAV
Authors: Shuyan Ke, Yifan Mei, Changli Wu, Yonghan Zheng, Jiayi Ji, Liujuan Cao, Rongrong Ji
Institutions: Xiamen University
Reasoning segmentation has recently expanded from ground-level scenes to remote-sensing imagery, yet UAV data introduces fundamentally different challenges, including oblique viewpoints, ultra-high resolutions, and extreme scale variations. To address these UAV-specific conditions, we formally define the UAV Reasoning Segmentation task and organize its semantic demands into three dimensions: Spatial, Attribute, and Scene-level reasoning. Based on this formulation, we construct DRSeg, the first l...
UAVSuper-ResolutionTracking
Authors: Chenhui Zhang, Guoqing Dong, WeijiePeng WeijiePeng
Institutions: Xidian University, Northwestern Polytechnical University
Multi-object tracking (MOT) from unmanned aerial vehicles (UAVs) aims to identify and continuously track the positions of multiple ground targets during UAV flight. Current mainstream methods use appearance matching and motion matching to associate targets across consecutive frames. However, these methods often fail in the following scenarios: First, scenarios with multi-scale targets, where small targets have weak appearance features and small bounding boxes; second, scenarios with complex backgr...
VFMDetection
Authors: Qihong Tang, Changhan Liu, Shaofeng Zhang, Wenbin Li, Qi Fan, Yang Gao
Institutions: Nanjing University, Nankai University, University of Science and Technology of China, The Hong Kong University of Science and Technology
Identifying potential objects is critical for object recognition and analysis across various computer vision applications. Existing methods typically localize potential objects by relying on exemplar images, predefined categories, or textual descriptions. However, their reliance on image and text prompts often limits flexibility, restricting adaptability in real-world scenarios. In this paper, we introduce a novel Prompt-Free Universal Region Proposal Network (\ourmodel), which identifies potential...
VFMDetectionGeneration
Authors: Abdullah Azeem, Ruisheng Wang, Qingquan Li, Abubakar Siddique
Institutions: Shenzhen University, University of Calgary
Autonomous object detection in remote sensing requires systems that can discover new categories and assign them usable labels during deployment. Existing Open-World Object Detectors identify unknown objects but leave them unnamed until manual annotation. In contrast, Open-Vocabulary Detectors recognize unseen categories only with provided prompts at test time, lacking autonomous discovery or naming. This work presents HSGDet, a detector that achieves both discovery and semantic assignment at tes...
VFMSegmentationGeo-Localization
Authors: Gedeon Muhawenayo, Caleb Robinson, Subash Khanal, Zhanpei Fang, Isaac Corley, Alexander Wollam, Tianyi Gao, Leonard Strnad, Ryan Avery, Lyndon Estes, Ana Tárano, Nathan Jacobs, Hannah Kerner
Institutions: Arizona State University (ASU), Microsoft, Washington University in St. Louis, Oregon State University, Taylor Geospatial, Wherobots, Clark University
Large-scale maps of field boundaries are essential for agricultural monitoring tasks. Existing deep learning approaches for satellite-based field mapping have undesirable properties for large-scale inference, including sensitivity to changes in illumination, spatial scale, and geographic location. We conduct the first systematic evaluation of segmentation and geospatial foundation models (GFMs) for global field boundary delineation using the Fields of The World (FTW) benchmark. We evaluate 18 models...
Geo-Localization
Authors: Komal Komal, Mukul Gupta, Saumya Singh, Santosh Vipparthi, Chakradhar Reddy Chandupatla, Subrahmanyam Murala
Institutions: Indian Institute of Technology Ropar (IIT Ropar), Trinity College Dublin
We present QuCNet, a hybrid quantum-classical network for efficient remote sensing image classification. QuCNet integrates a lightweight convolutional encoder with sixteen parallel four-qubit trainable quantum circuits (TQCs) trained under a Hybrid Cyclic Weight-Sharing (HCWS) strategy. This design enhances expressibility while keeping the parameter count extremely low (~87K, 85× smaller than prior hybrid models). Guided by expressibility analysis, the proposed quantum configuration maintains st...
VFMHyperspectral
Authors: Nicolas Houdré, Diego Marcos, Hugo Turckheim, Dino Ienco, Laurent Wendling, Camille Kurtz, Sylvain Lobry
Institutions: Université Paris Cité (LIPADE), INRIA, INRAE (National Research Institute for Agriculture, Food and Environment)
Earth observation (EO) data spans a wide range of spatial, spectral, and temporal resolutions, from high-resolution optical imagery to low-resolution multispectral products or radar time series. While recent foundation models have improved multimodal integration for learning meaningful representations, they often expect fixed input resolutions or are based on sensor-specific encoders, limiting generalization across heterogeneous EO modalities. To overcome these limitations, we introduce RAMEN, a r...
SegmentationChange DetectionHyperspectral
Authors: Zian Cao, Wei Wei, Qingshan Gao, Yuanyuanfu Yuanyuanfu
Institutions: Huazhong University of Science and Technology, Pingan Technology
Change detection and semantic segmentation are key techniques for satellite image analysis in remote sensing. However, acquiring high-quality labeled data is costly and time-consuming. Although recent studies have explored generative models to ease data scarcity, a unified framework supporting both tasks is still lacking, and most methods overlook noise accumulation and cannot generate multispectral images. To address this, we propose the robust diffusion framework for masked image generation (R...
VFM
Authors: Inha Kang, Eunki Kim, Wonjeong Ryu, Jaeyo Shin, Seungjun Yu, Yoon-Hee Kang, Seongeun Jeong, Eunhye Kim, Soontae Kim, Hyunjung Shim
Institutions: Korea Advanced Institute of Science & Technology (KAIST), Ajou University, Kunsan National University
Accurate long-horizon forecasting of particulate matter (PM) concentration fields is essential for operational public health decisions. However, achieving reliable forecasts remains challenging in regions with complex terrain and strong atmospheric dynamics, such as East Asia. While foundation models such as Aurora offer global generality, they often miss region-specific dynamics and rely on non-real-time inputs, limiting their practical utility for localized warning systems. To address this gap,...
VFMSegmentationMLLM
Authors: Xin Niu, Manqi Zhao, Dongsheng Jiang, Yingying Wu, Bing Su
Institutions: Renmin University of China, Huawei Technologies Ltd., Beijing China-Power Information Technology Co., Ltd.
Remote sensing image segmentation is critical for a range of applications, including natural disaster monitoring and precision agriculture. Open-vocabulary segmentation enhances flexibility by removing fixed category constraints, enabling more fine-grained and adaptive scene understanding. Unlike CLIP’s original pretraining objective, which emphasizes global image-text alignment, segmentation tasks require accurate and discriminative patch-level representations to support precise pixel-wise pred...
Hyperspectral
Authors: Zhuwei Wen, Zimin Xia, He Chen, Linwei Yue, Xianwei Zheng
Institutions: Wuhan University (State Key Lab. LIESMARS), EPFL - EPF Lausanne, China University of Geosciences Wuhan
In remote sensing pansharpening, spectrally mixed regions, where the spectral interactions among adjacent land covers lead to highly inconsistent reconstruction patterns, remain the most challenging areas. Due to the complex spatial distribution and heterogeneous spectral characteristics of ground objects, existing methods relying on rigid architectures and physical constraints struggle to learn generalized reconstruction patterns from limited spectral mixing samples, resulting in unstable gener...
VFMSuper-ResolutionGeneration
Authors: Enzhuo Zhang, Sijie Zhao, Dilxat Muhtar, Zhenshi Li, Xueliang Zhang, Pengfeng Xiao
Institutions: Nanjing University
Generative diffusion priors have recently achieved state-of-the-art performance in Natural Image Super-Resolution, demonstrating a powerful capability to synthesize photorealistic details. However, their direct application to Remote Sensing Image Super-Resolution (RSISR) reveals significant shortcomings. Remote sensing images present a unique challenge: ground objects often exhibit globally stochastic yet locally clustered patterns. This characteristic leads to highly imbalanced texture distribu...
VFMSegmentation
Authors: Muhammad Naseer Subhani
Institutions: Independent Researcher
Interactive segmentation models such as the Segment Anything Model (SAM) have demonstrated remarkable generalization on natural images, but perform suboptimally on remote sensing imagery (RSI) due to severe domain shift and the scarcity of dense annotations. To address this, we propose a self-prompting point-supervised framework that adapts SAM to RSI using only sparse point annotations. Our method employs a Refine–Requery–Reinforce loop, where coarse pseudo-masks are generated from initial poin...
UAVTracking
Authors: Jian Zhang, Xincheng Yu, Yi Lin
Institutions: Sichuan University
Occlusion remains one of the major challenges in UAV tracking, where dynamic viewpoints and complex environments often cause partial or complete visibility loss. Existing transformer-based trackers typically regard occlusion as random information dropout, overlooking its structured and spatially correlated nature in real-world scenes. We rethink occlusion modeling in UAV tracking as a structured process governed by spatial dependencies. Based on this insight, we introduce Clustered Occlusion Modeli...
Segmentation
Authors: Junda Xu, Yanmeng Liu, Xiangqiang Zeng, Jinrong Wu, Ying Qu, Libao Zhang
Institutions: Beijing Normal University
Google Earth imagery, combined with building footprint databases, offers an efficient way to construct localized building datasets. However, the lack of orthorectification in these images leads to spatial misalignments between annotations and their corresponding roof locations. Adopting such misaligned data directly for model training can severely degrade segmentation performance. To address the challenge, we propose an Object-based Multi-stage Alignment Framework (OMAF) that generates high-qual...
Geo-Localization
Authors: Junwei Zheng, Ruize Dai, Ruiping Liu, Zichao Zeng, Yufan Chen, Fangjinhua Wang, Kunyu Peng, Kailun Yang, Jiaming Zhang, Rainer Stiefelhagen
Institutions: Karlsruhe Institute of Technology (KIT), University College London, ETHZ - ETH Zurich, Hunan University
Metric Cross-View Geo-Localization (MCVGL) aims to estimate the 3-DoF camera pose (position and heading) by matching ground and satellite images. In this work, instead of pinhole and satellite images, we study robust MCVGL using holistic panoramas and OpenStreetMap (OSM). To this end, we establish a large-scale MCVGL benchmark dataset, CV-RHO, with over 2.7M images under different weather and lighting conditions, as well as sensor noise. Furthermore, we propose a model termed RHO with a two-bran...
VFMSegmentationUAV
Authors: Chenxu Peng, Chenxu Wang, Yimian Dai, Yongxiang Liu, Ming-Ming Cheng, Xiang Li
Institutions: Nankai University, Zhejiang University of Technology, National University of Defense Technology, Tsinghua University
Accurate road segmentation from aerial imagery is fundamental to many geospatial applications. However, existing datasets often suffer from limited scene diversity, low semantic granularity, and poor structural continuity, restricting their generalization across environments. To address these challenges, we introduce WorldRoadSeg-360K, the largest and most diverse road segmentation dataset to date, comprising 366,947 high-resolution images collected from 38 countries and 223 cities across vari...
MLLM
Authors: Qiya Song, Yiqiang Xie, Yuan Sun, Renwei Dian, Xudong Kang
Institutions: Hunan Normal University, Sichuan University, Hunan University
As a pivotal task that bridges remote visual and linguistic understanding, Remote Sensing Image-Text Retrieval (RSITR) has attracted considerable research interest in recent years. However, almost all RSITR methods implicitly assume that image-text pairs are matched perfectly. In practice, acquiring a large set of well-aligned data pairs is often prohibitively expensive or even infeasible. Although several studies have acknowledged the presence of noisy pairs, little work has explored how to end...
DetectionUAVSuper-Resolution
Authors: Jialei Zhan, Li Liu, Jiehua Zhang, Yuhang Xie, Yongxiang Liu, Jiangming Chen, Ming-Ming Cheng
Institutions: National University of Defense Technology, Xi'an University of Electronic Science and Technology, Nankai University, Tsinghua University
Recent advancements in remote sensing object detection have predominantly focused on oriented bounding box design and small object feature enhancement, while often overlooking the intrinsic geometric properties of remote sensing images, such as rotation invariance and structural symmetry. Many aerial objects appear in multiple orientations and exhibit clear symmetrical patterns, which, if not explicitly modeled, can lead to detection failures and inaccurate localization under geometric variation...
VFMSegmentationSAR
Authors: Danxu Liu, Di Wang, Hebaixu Wang, Haoyang Chen, Wentao Jiang, Yilin Cheng, Haonan Guo, Wei Cui, Jing Zhang
Institutions: Beijing Institute of Technology, Wuhan University, Zhongguancun Academy, Beijing, China, Fudan University, The University of Sydney
Synthetic Aperture Radar (SAR) imagery plays a critical role in all-weather, day-and-night remote sensing applications. However, existing SAR-oriented deep learning is constrained by data scarcity, while the physically grounded speckle noise in SAR imagery further hampers fine-grained semantic representation learning. To address these challenges, we propose SARMAE, a Noise-Aware Masked Autoencoder for self-supervised SAR representation learning. Specifically, we construct SAR-1M, the first milli...
UAV3D Reconstruction
Authors: Yi Zhu, Hao Xiong, Lin Xiao, Ranfeng Shi, Qinying Gu, Leilei Gu
Institutions: Shanghai Jiao Tong University, Shanghai Artificial Intelligence Laboratory
Accurate wide-field depth estimation is highly desirable in applications such as autonomous driving, robot vision, and drone control. Biological compound eyes inspire wide field-of-view (FOV) depth estimation, yet their artificial implementations face the challenge of modality misalignment. Specifically, spherical imaging data does not align with planar neural networks, which diminishes learning efficiency. Herein, we propose SCE-Depth, a bio-inspired framework for spherical compound eye depth ...
Geo-Localization3D Reconstruction
Authors: Tamir Cohen, Leo Segre, Shay Shomer-Chai, Shai Avidan, Hadar Averbuch-Elor
Institutions: Tel Aviv University, Department of Computer Science, Cornell University
Reconstructing accurate 3D models of large-scale real-world scenes from unstructured, in-the-wild imagery remains a core challenge in computer vision, especially when the input views have little or no overlap. In such cases, existing reconstruction pipelines often produce multiple disconnected partial reconstructions or erroneously merge non-overlapping regions into overlapping geometry. In this work, we propose a framework that grounds each partial reconstruction to a complete reference model of...
MLLMGeo-Localization
Authors: Jiayi Wang, Zhihong Tan, Hongchen Wei, Daiqin Yang, Zhenzhong Chen
Institutions: Wuhan University
Object counting in remote sensing imagery becomes challenging when visual cues are obscured by clouds, fog, shadows, or low-light conditions. Yet earth observation inherently provides complementary geo-modalities, including land use and map, which offer stable structural and contextual priors that remain available when appearance cues fail. In this paper, we introduce GROC, the first large-scale dataset for Geo-guided Reasoning in Object Counting under ...
Change DetectionSuper-Resolution
Authors: Xiantao Ma, Siwei Dong, Lin Zhu, Lizhi Wang, Hua Huang
Institutions: Beijing Institute of Technology, Peking University, Beijing Normal University
Spike cameras are a novel class of neuromorphic vision sensors that capture scene dynamics with ultra-high temporal resolution via spike planes. While recent methods have addressed motion blur and noise in spike-based reconstruction, defocus blur caused by shallow depth of field or lens adjustment delays remains a critical yet underexplored issue in real-world applications such as autonomous driving. In this work, we present DeSpike, the first end-to-end defocus removal framework specifically de...
SegmentationMLLMGeo-Localization
Authors: Zepeng Xin, Kaiyu Li, Luodi Chen, Wanchen Li, Xiao Yuchen, Hui Qiao, Weizhan Zhang, Deyu Meng, Xiangyong Cao
Institutions: Xi'an Jiaotong University, China Telecom
Effectively grounding complex language to pixels in remote sensing (RS) images is a critical challenge for applications like disaster response and environmental monitoring. Current models can parse simple, single-target commands but fail when presented with complex geospatial scenarios, e.g., segmenting objects at various granularities, executing multi-target instructions, and interpreting implicit user intent. To drive progress against these failures, we present LaSeRS, the first large-scale da...
Geo-LocalizationImage Fusion
Authors: Jinsong Zhang, Ying Qu, Yuan Liao, Hairong Qi, Zhenzhou Shao
Institutions: Capital Normal University, Beijing Normal University, University of Tennessee, Knoxville
Frequent and precise land surface monitoring is critical for numerous applications, yet existing satellites struggle to achieve both simultaneously. Spatiotemporal fusion (STF) tackles this challenge by integrating multiple satellite images to generate data with improved temporal and spatial resolution, enabling more frequent and precise land surface observations. However, current methods often fail to recover dynamic landscape changes due to significant scale discrepancies among multi-source im...
HyperspectralSelf-Supervised
Authors: Yuqiao He, Xiaoyan Liu, Jianxu Mao, Yaonan Wang, Hui Zhang, Lizhu Liu, Yurong Chen, Wenbin He
Institutions: Hunan University
Coded Aperture Snapshot Spectral Imaging (CASSI) has emerged as a prominent technique for efficient hyperspectral imaging. However, the strong coupling between physical encoding and computational decoding makes CASSI highly sensitive to minor hardware misalignments, which can significantly degrade reconstruction quality. Existing methods either assume ideal imaging conditions, or rely on offline calibration, making them vulnerable to dynamic perturbations, such as thermal expansion and mechanica...
Geo-Localization
Authors: CHEN Yang, Xieyuanli Chen, Junxiang Li, Jie Tang, Tao Wu
Institutions: National University of Defense Technology, Changsha, China
Robust cross-view geo-localization (CVGL) remains challenging despite the surge in recent progress. Existing methods still rely on field-of-view (FoV)-specific training paradigms, where models are optimized under a fixed FoV but collapse when tested on unseen FoVs and unknown orientations. This limitation necessitates deploying multiple models to cover diverse variations. Although studies have explored dynamic FoV training by simply randomizing FoVs, they failed to achieve robustness across dive...
VFMUAVGeo-Localization
Authors: Zengyan Wang, Sirshapan Mitra, Rajat Modi, Hui Xian Grace Lim, Yogesh Rawat
Institutions: University of Central Florida, Self
In this work, we propose the problem of localizing cameras and producing renders of a scene, given multiple images captured from ground, aerial, and satellite viewpoints. We introduce a dataset called Sky2Ground, which contains synthetic and real images across all three viewpoints, along with camera parameters and dense depth maps and surface normals. Recent works have shown that transformer-based networks like VGGT are capable of inferring scene parameters in a single forward pass. However, we formally reveal that ...
VFMSegmentationSAR
Authors: Kang Wu, Lei Yu, Junwei Luo, Bo Dang, Junjian Zhang, Xiangyuan Cai, Hongwei Hu, Jingdong Chen, Yansheng Li
Institutions: Wuhan University, Ant Group, Alibaba Group
While recent foundation models for remote sensing (RS) segmentation have shown notable progress, they still face significant challenges, struggling to process diverse multi-modal inputs, synergize complementary prompt types, and leverage semantic hierarchies. To address these limitations, we introduce SkySense-VITA, a unified in-context segmentation model, which synergistically processes both optical and SAR imagery using visual, textual, or fused prompts. Based on a novel prompt-and-prediction ...
VFMChange DetectionSuper-Resolution
Authors: Shilong Li, Xiurui Xie, Qiugang Zhan, Luochao Wang, Yong Deng, Guisong Liu
Institutions: University of Electronic Science and Technology of China, Southwestern University of Finance and Economics
The temporal evolution patterns of surface spatial structures constitute a central concern within the field of intelligent remote sensing interpretation. However, constrained by the availability of only two temporal phases, modeling sparse spatio-temporal change processes to effectively interpret surface alterations remains a core challenge in intelligent remote sensing analysis. To address this, this paper proposes SpikeAdapter, a lightweight enhancement framework. This framework comprises Geo-S...
HyperspectralSuper-ResolutionGeneration
Authors: Jiahan Huang, Ran Ran, Junming Hou, Zihao Chen, Xiaofeng Cong, Junling Li, Liang-Jian Deng
Institutions: Southeast University, University of Electronic Science and Technology of China
Pan-sharpening, a fundamental image preprocessing technique in remote sensing, aims to generate spatially and spectrally enriched multispectral imagery by integrating complementary information from texture-rich panchromatic (PAN) images and paired low-resolution multispectral (LRMS) counterparts. Although recent generative diffusion models have achieved impressive fusion quality, these performance gains often come with substantial computational costs, rendering them impractical for resource-cons...
HyperspectralGeo-LocalizationSuper-Resolution
Authors: Si-Sheng Yang, Chia-Hsiang Lin
Institutions: Institute of Computer and Communication Engineering, National Cheng Kung University
The European Space Agency's Sentinel-2 satellite provides global multispectral coverage for remote sensing (RS) applications. However, its limited spectral resolution (12 bands) and non-unified spatial resolution (60/20/10 m) restrict its practicality. In contrast, high spectral-spatial resolution sensors (e.g., NASA's AVIRIS-NG) cover only the American region due to practical considerations. This raises a fundamental question: "Can a global hyperspectral coverage be achieved by reconstructin...
VFMMLLMGeo-Localization
Authors: Minh Kha Do, Wei Xiang, Kang Han, Di Wu, Khoa T. Phan, Yi-Ping Phoebe Chen, Gaowen Liu, Ramana Kompella
Institutions: La Trobe University, Cisco Systems
Vision-language foundation models (VLFMs) promise zero-shot and retrieval understanding for Earth observation. While operational satellite systems often lack full multi-spectral coverage, making RGB-only inference highly desirable for scalable deployment, the adoption of VLFMs for satellite imagery remains hindered by two factors: (1) multi-spectral inputs are informative but difficult to exploit consistently due to band redundancy and misalignment; and (2) CLIP-style text encoders limit semanti...
VFMSegmentationGeneration
Authors: Yunkai Yang, Yudong Zhang, Kunquan Zhang, Jinxiao Zhang, Xinying Chen, Haohuan Fu, Runmin Dong
Institutions: Sun Yat-sen University, Tencent, Tsinghua University, Beijing Institute of Technology
With the rapid progress of controllable generation, training data synthesis has become a promising way to expand labeled datasets and alleviate manual annotation in remote sensing (RS). However, the complexity of semantic mask control and the uncertainty of sampling quality often limit the utility of synthetic data in downstream semantic segmentation tasks. To address these challenges, we propose a task-oriented data synthesis framework (TODSynth), including a Multimodal Diffusion Transformer (M...
VFMMLLMChange Detection
Authors: Yan Shu, Bin Ren, Zhitong Xiong, Xiao Xiang Zhu, Begüm Demir, Nicu Sebe, Paolo Rota
Institutions: Beijing Academy of Artificial Intelligence, Mohamed bin Zayed University of Artificial Intelligence, Technical University of Munich, Technische Universität Berlin, University of Trento
Vision-language models (VLMs) have shown promise in earth observation (EO), yet they struggle with tasks that require grounding complex spatial reasoning in precise pixel-level visual representations. To address this problem, we introduce TerraScope, a unified VLM that delivers pixel-grounded geospatial reasoning with two key capabilities: (1) modality-flexible reasoning: it handles single-modality inputs (optical or SAR) and adaptively fuses different modalities into the reasoning process when ...
VFMSegmentationGeo-Localization
Authors: Zhengpeng Feng, Clement Atzberger, Sadiq Jaffer, Jovana Knezevic, Silja Sormunen, Robin Young, Madeline Lisaius, Markus Immitzer, Toby Jackson, James Ball, David Coomes, Anil Madhavapeddy, Andrew Blake, Srinivasan Keshav
Institutions: University of Cambridge, dClimate, Aalto University, Cyclops MRV Inc; Universität für Bodenkultur Wien, University of Bristol, Clare Hall (University of Cambridge)
Satellite Earth-observation (EO) time series in the optical and microwave ranges are often irregular due to orbital patterns and cloud obstruction, and while compositing addresses these issues, it loses critical phenological information. To overcome this, we present TESSERA, a pixel-wise foundation model for multi-modal (Sentinel-1/2) EO time series that learns robust, label-efficient embeddings. During training, TESSERA uses Barlow Twins and sparse random temporal sampling to enforce invariance...
VFMSegmentationMLLM
Authors: Ting Yang, Qilong Wang, Qibin Hou, Qinghua Hu
Institutions: Tianjin University, Nankai University
The rise of vision-language models (VLMs) has driven the initial exploration of open-vocabulary remote sensing image semantic segmentation (OVRSIS), enabling recognition of unseen categories in complex Earth observation scenes. However, existing methods primarily focus on enhancing visual representations of domain-specific remote sensing images, while overlooking the effect of textual information. In this paper, we argue that there exists a crucial issue of textual ambiguity in the OVRSIS task, limi...
UAVTracking
Authors: Chaocan Xue, Qihua Liang, Bineng Zhong, Yanting Zu, Yuanliang Xue, Haiying Xia, Shuxiang Song
Institutions: Guangxi Normal University, Xi’an Research Institute of High Technology
The utilization of temporal information has always been an open topic in the tracking community. However, existing trackers tend to employ more and more inputs or parameters for temporal learning, hindering their deployment on resource-constrained unmanned aerial vehicles (UAVs). More importantly, this makes it ambiguous whether the performance gains come from the temporal learning itself or from the increased inputs and parameters. In this study, we advocate designing temporal learning comp...
SegmentationDetectionUAV
Authors: Shiman He, Nuo Chen, Xinyi Ying, Yihang Luo, Yangsi Shi, Zaiping Lin, Miao Li
Institutions: College of Electronic Science and Technology, National University of Defense Technology
Small object detection (SOD) plays a vital role in applications such as anti-UAV tasks, yet conventional image-based methods struggle in high-speed scenarios due to the limited frame rate. Event cameras offer a promising alternative by capturing spatiotemporal event streams with microsecond-level temporal resolution. To address the inherent sparsity of small objects in event data, existing methods typically formulate the detection task as semantic segmentation on spatiotemporal point clouds to ...
DetectionUAV
Authors: Craig Iaboni, Pramod Abichandani
Institutions: New Jersey Institute of Technology
Reliable UAV object detection requires robustness to illumination changes, motion blur, and scene dynamics that suppress RGB cues. Thermal long-wave infrared (LWIR) sensing preserves contrast in low light, and event cameras retain microsecond-level temporal edges, but integrating all three modalities in a unified detector has not been systematically studied. We present a tri-modal framework that processes RGB, thermal, and event data with a dual-stream hierarchical vision transformer. At selecte...
MLLMImage FusionSelf-Supervised
Authors: Chengyu Zheng, Hanzhang Lu, Jie Nie, Shan Du
Institutions: University of British Columbia, Ocean University of China
In remote sensing (RS) cross-modal retrieval, most existing methods employ contrastive learning as their primary optimization objective, aligning anchors with positive counterparts and distinguishing them from negative samples. To improve negative sampling, these approaches typically set thresholds on cross-modal similarity scores, designating negatives that exceed the threshold as false negative samples (FNS). However, dependence on a single cross-modal similarity threshold is fragile because i...
UAVTracking
Authors: Liang Qin, Min Wang, Xingyu Lu, Aowen Qiu, Wengang Zhou, Houqiang Li
Institutions: University of Science and Technology of China, Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
Active search and tracking of arbitrary targets by Unmanned Aerial Vehicles (UAVs) in cluttered environments remains a highly challenging problem. Existing methods either construct complex modular pipelines, leading to substantial computational costs, or adopt end-to-end controllers that often fail to generalize across different targets and scenes. Moreover, search and tracking are typically treated separately despite their strong interdependence. In this paper, we present UAST, a simple yet effe...
UAVImage Fusion
Authors: Shenghui Huang, Menghao Hu, Longkun Zou, Hongyu Chi, Zekai Li, Feng Gao, Fan Yang, Qingyao Wu, Ke Chen
Institutions: Peng Cheng Laboratory, Xinjiang University, Harbin Institute of Technology, South China University of Technology, Peking University
Detecting Unmanned Aerial Vehicles (UAVs) in low-altitude environments is essential for perception and defense systems but remains highly challenging due to complex backgrounds, camouflage, and multimodal interference. In real-world scenarios, UAVs are frequently visually blended with surrounding structures such as buildings, vegetation, and power lines, resulting in low contrast, weak boundaries, and strong confusion with cluttered background textures. Existing UAV detection datasets, though di...
UAV3D Reconstruction
Authors: Kang Du, 雪 廖, Junpeng Xia, Chaozheng Guo, Yi Gu, Yirui Guan, Duotun Wang, ShengHuang ShengHuang, Zeyu Wang
Institutions: The Hong Kong University of Science and Technology (Guangzhou), The Hong Kong University of Science and Technology, Meituan, Tencent
Illumination inconsistency is a fundamental challenge in multi-view 3D reconstruction. Variations in sunlight direction, cloud cover, and shadows break the constant-lighting assumption underlying both classical multi-view stereo (MVS) and structure from motion (SfM) pipelines and recent neural rendering methods, leading to geometry drift, color inconsistency, and shadow imprinting. This issue is especially critical in UAV-based reconstruction, where long flight durations and outdoor environments...
VFMMLLMChange Detection
Authors: Xu Zhang, Danyang Li, Xiaohang Dong, Tianhao Wu, Hualong Yu, Jianye Wang, Qicheng Li, Xiang Li
Institutions: Computer Science, Nankai University, Sichuan Agricultural University
Change detection (CD) is a fundamental task for monitoring and analysing land cover dynamics. While recent high-performance models and high-quality datasets have significantly advanced the field, a critical limitation persists. Current models typically acquire limited knowledge from single-type annotated data and cannot concurrently leverage diverse binary change detection (BCD) and semantic change detection (SCD) datasets. This constraint leads to poor generalisation and limited versatility. Th...
UAVGeo-LocalizationSuper-Resolution
Authors: Xiao Liang, Huaizhi Tang, Feiyang Zhang, Shiji Yuan, Chun Hu, Dezhi Zheng, Kang Ma
Institutions: Beijing Institute of Technology, Beihang University
Cross-view geo-localization (CVGL) aims to estimate an image’s geographic location by matching it with geo-referenced images from different viewpoints, supporting applications such as autonomous driving, UAV navigation, and visual surveillance. However, due to the high cost of image collection, current CVGL datasets often suffer from limited diversity in both drone and ground imagery, which constrains model generalization. Furthermore, existing methods primarily focus on either ground-to-satelli...
VFMSegmentationMLLM
Authors: Shuo Ni, Di Wang, He Chen, Haonan Guo, Ning Zhang, Jing Zhang
Institutions: Beijing Institute of Technology, Zhongguancun Academy, Wuhan University, Hong Kong Polytechnic University, The University of Sydney
Instruction-driven segmentation in remote sensing generates masks from guidance, offering great potential for accessible and generalizable applications. However, existing methods suffer from fragmented task formulations and limited instruction data, hindering effective understanding and generalization. To address these issues, we introduce GeoSeg-1M, the first million-scale dataset for remote sensing instruction-driven segmentation, constructed via an automatic mask filtering and instruction gen...
Generation3D Reconstruction
Authors: Meng Wang, Changqun Xia, Yuze Wang, Junyi Wang, Wantong Duan, Xinxiong Xie, Yue Qi
Institutions: Beihang University, Peng Cheng Laboratory, Shandong University
Recently, 3D Gaussian Splatting (3DGS) has revolutionized radiance field reconstruction, enabling efficient and high-fidelity novel view synthesis. However, seamless integration of both aerial and street view images to model urban scenes remains a significant challenge for 3DGS. This joint setting suffers from extreme view coverage disparity, complex multi-scale details, and imbalanced viewpoint distributions. In this work, we present Urban-GS, a novel framework built upon Gaussian Splatting for...
DetectionUAVGeo-Localization
Authors: Weijia Li, Haoen Xiang, Tianxu Wang, Shuaibing Wu, Qiming Xia, Cheng Wang, Chenglu Wen
Institutions: Xiamen University, Lanzhou University of Technology, Nanchang University
Modern autonomous vehicle perception systems are often constrained by occlusions, blind spots, and limited sensing range, hindering progress toward Level 5 autonomy. While existing cooperative perception paradigms such as Vehicle-to-Vehicle (V2V) and Vehicle-to-Infrastructure (V2I) have demonstrated their effectiveness in mitigating these challenges, they remain limited to ground-level collaboration and cannot fully address large-scale occlusions or long-range perception in complex 3D environmen...
UAV
Authors: Tao Jun Lin, Yujiao Shi, Hongdong Li
Institutions: Australian National University, ShanghaiTech University
Aerial-ground visual localization is a challenging task due to the significant differences in scene scale and viewpoint between the two views. In this work, we explore the practical benefit of jointly learning camera calibration and bird’s-eye-view (BEV) projection for estimating the full 6-degrees-of-freedom relative camera pose between uncalibrated aerial and ground views. We present Visual Geometry Alignment (VGA), a unified framework that jointly learns a global gravity-alignment prior inf...
Change Detection, Image Fusion, Super-Resolution
Authors: Linfeng Tang, Yeda Wang, Meiqi Gong, Zizhuo Li, Yuxin Deng, Xunpeng Yi, Chunyu Li, Han Xu, Hao Zhang, Jiayi Ma
Institutions: Wuhan University, Southeast University
Compared to images, videos better align with real-world acquisition scenarios and possess valuable temporal cues. However, existing multi-sensor fusion research predominantly integrates complementary context from multiple images rather than videos. This primarily stems from two factors: 1) the scarcity of large-scale multi-sensor video datasets, limiting research in video fusion, and 2) the inherent difficulty of jointly modeling spatial and temporal dependencies in a unified framework. This pap...
MLLM, UAV, Geo-Localization
Authors: Quan Zhang, Zeqiang Cai, Peiming Zhao, Jingze Wu, Cailun Wu, Hongbo Chen, Jianhuang Lai
Institutions: Sun Yat-sen University
Aerial-Ground Person Re-Identification (AGPReID) remains highly challenging due to drastic viewpoint variations between drones and fixed cameras. Existing methods typically follow a view-invariant paradigm, aligning shared features across views to achieve robustness. However, the view-invariant paradigm inherently enforces part-level alignment, which ignores view-specific cues and discriminative identity information. To address this, this work proposes ViSA (View-aware Semantic Alignment), a view-aware framework...
Geo-Localization
Authors: Juhye Park, Wooju Lee, Dasol Hong, Changki Sung, Youngwoo Seo, DongWan Kang, Hyun Myung
Institutions: Korea Advanced Institute of Science & Technology (KAIST), Electronics and Telecommunications Research Institute, Hanwha Aerospace
Accurate global localization is crucial for autonomous driving and robotics, especially in dense urban environments where GNSS is often unreliable due to occlusion and multipath effects. As an emerging alternative, cross-view pose estimation predicts the 3-DoF camera pose corresponding to a ground-view image with respect to a geo-referenced satellite image. However, existing methods struggle to bridge the significant viewpoint gap between the ground and satellite views mainly due to limited spat...
Detection, UAV, Generation
Authors: Wenhao Li, Zimeng Wu, Yu Wu, Zehua Fu, Jiaxin Chen
Institutions: Beihang University; Hangzhou Innovation Institute, Beihang University
Unmanned aerial vehicle (UAV)-based object detection is a critical but challenging task when applied in dynamically changing scenarios with limited annotated training data. Layout-to-image generation approaches have proved effective in improving detection accuracy by synthesizing labeled images with diffusion models. However, they frequently produce artifacts, especially near the layout boundaries of tiny objects, which substantially limits their performance. To address these iss...
VFM, Detection, MLLM
Authors: Shuohao Shi, Qiang Fang, Xin Xu
Institutions: National University of Defense Technology
Closed-set object detection in remote sensing imagery has made significant progress, but achieving high detection accuracy remains challenging. Vision-Language Models (VLMs), which possess rich prior knowledge, offer a promising solution to this challenge. However, most existing VLMs are designed for open-vocabulary tasks and exhibit inherent limitations when directly applied to closed-set scenarios, such as notable accuracy degradation and high deployment costs. To address these issues, we prop...
VFM, Geo-Localization, 3D Reconstruction
Authors: Chandrakanth Gudavalli, Tajuddin Manhar Mohammed, Abhay Yadav, Ananth Bhaskar, Hardik Prajapati, Cheng Peng, Rama Chellappa, Shivkumar Chandrasekaran, B.S. Manjunath
Institutions: University of California, Santa Barbara; Mayachitra, Inc.; Johns Hopkins University; School of Data Science, University of Virginia; Mathematical Institute for Data Science (MINDS) at JHU
Aligning ground-level imagery with geo-registered satellite maps is crucial for mapping, navigation, and situational awareness, yet remains challenging under large viewpoint gaps or when GPS is unreliable. We introduce Wrivinder, a zero-shot, geometry-driven framework that aggregates multiple ground photographs to reconstruct a consistent 3D scene and align it with overhead satellite imagery. Wrivinder combines SfM reconstruction, 3D Gaussian Splatting, semantic grounding, and monocular depth–ba...
Hyperspectral, Geo-Localization, Image Fusion
Authors: Miro Miranda, Deepak Pathak, Patrick Helber, Benjamin Bischke, Hiba Najjar, Francisco Mena, Cristhian Sanchez, Akshay Pai, Diego Arenas, Matias Valdenegro, Marcela Charfuelan, Marlon Nuske, Andreas Dengel
Institutions: German Research Center for Artificial Intelligence (DFKI), Vision Impulse GmbH, Universität Kaiserslautern, GFZ Helmholtz Centre for Geosciences, Rheinland-Pfälzische Technische Universität (RPTU), University of Groningen
Crop yield prediction requires substantial data to train data-driven models. However, creating yield prediction datasets is constrained by high acquisition costs, heterogeneous data quality, and data privacy regulations. Consequently, existing datasets are scarce, low in quality, or limited to regional levels or single crop types. In this work, we release YieldSAT, a large, high-quality, and multimodal dataset for high-resolution crop yield prediction. YieldSAT spans various climate zones across...
VFM, Segmentation, MLLM
Authors: Ruixun Liu, Bowen Fu, Jiayi Song, Kaiyu Li, Wanchen Li, Lanxuan Xue, Hui Qiao, Weizhan Zhang, Deyu Meng, Xiangyong Cao
Institutions: Xi'an Jiaotong University, China Telecom
Ultra-high-resolution (UHR) remote sensing (RS) images offer rich fine-grained information but also present challenges in effective processing. Existing dynamic resolution and token pruning methods are constrained by a passive perception paradigm, suffering from increased redundancy when obtaining finer visual inputs. In this work, we explore a new active perception paradigm that enables models to revisit information-rich regions. First, we present LRS-GRO, a large-scale benchmark dataset tailor...