ICLR 2024: 7 Papers on Satellite Remote Sensing Imagery
Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment
Paper explainer: http://www.studyai.com/xueshu/paper/detail/0dbc1e9737
Paper link: https://openreview.net/forum?id=w9tc699w3Z
Abstract
We introduce a method to train vision-language models for remote-sensing images without using any textual annotations.
Our key insight is to use co-located internet imagery taken on the ground as an intermediary for connecting remote-sensing images and language.
Specifically, we train an image encoder for remote sensing images to align with the image encoder of CLIP using a large amount of paired internet and satellite images.
Our unsupervised approach enables the training of a first-of-its-kind large scale VLM for remote sensing images at two different resolutions.
We show that these VLMs enable zero-shot, open-vocabulary image classification, retrieval, segmentation and visual question answering for satellite images.
On each of these tasks, our VLM trained without textual annotations outperforms existing VLMs trained with supervision, with gains of up to 20% for classification and 80% for segmentation…
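The alignment step described above can be pictured as a contrastive objective: the satellite encoder's embedding of a location should sit closer to the frozen CLIP embedding of its co-located ground image than to the embeddings of other locations. The abstract does not state the loss, so the InfoNCE-style form, the temperature value, and the function names below are assumptions made for illustration, and toy vectors stand in for real encoder outputs.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def alignment_loss(sat_embs, ground_embs, temperature=0.07):
    """InfoNCE-style loss: row i of sat_embs should match row i of ground_embs.

    sat_embs:    outputs of the trainable satellite encoder (hypothetical).
    ground_embs: outputs of the frozen CLIP image encoder on co-located
                 ground photos (toy stand-ins here).
    """
    sat = [normalize(v) for v in sat_embs]
    ground = [normalize(v) for v in ground_embs]
    total = 0.0
    for i, s in enumerate(sat):
        # Similarity of satellite image i to every ground image in the batch.
        logits = [sum(a * b for a, b in zip(s, g)) / temperature for g in ground]
        # Numerically stable log-softmax evaluated at the matching pair.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += -(logits[i] - log_denom)
    return total / len(sat)


# Toy batch of three locations with 2-D embeddings.
ground = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
shuffled = [aligned[1], aligned[2], aligned[0]]

print(alignment_loss(aligned, ground))   # low: each pair agrees
print(alignment_loss(shuffled, ground))  # higher: pairs are mismatched
```

Because only the satellite encoder would be updated in such a setup, its outputs end up in CLIP's embedding space, which is what lets CLIP's text encoder be reused for zero-shot queries without any satellite captions.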
GeoLLM: Extracting Geospatial Knowledge from Large Language Models
Paper explainer: http://www.studyai.com/xueshu/paper/detail/36e5422c48
Paper link: https://openreview.net/forum?id=TqL2xBwXP3
Abstract
The application of machine learning (ML) in a range of geospatial tasks is increasingly common but often relies on globally available covariates such as satellite imagery that can either be expensive or lack predictive power.
Here we explore the question of whether the vast amounts of knowledge found in Internet language corpora, now compressed within large language models (LLMs), can be leveraged for geospatial prediction tasks.
We first demonstrate that LLMs embed remarkable spatial information about locations, but naively querying LLMs using geographic coordinates alone is ineffective in predicting key indicators like population density.
We then present GeoLLM, a novel method that can effectively extract geospatial knowledge from LLMs with auxiliary map data from OpenStreetMap.
We demonstrate the utility of our approach across multiple tasks of central interest to the international community, including the measurement of population density and economic livelihoods.
Across these tasks, our method demonstrates a 70% improvement in performance (measured using Pearson’s $r^2$) relative to baselines that use nearest neighbors or use information directly from the prompt, and performance equal to or exceeding satellite-based benchmarks in the literature.
With GeoLLM, we observe that GPT-3.5 outperforms Llama 2 and RoBERTa by 19% and 51% respectively, suggesting that the performance of our method scales well with the size of the model and its pretraining dataset.
Our experiments reveal that LLMs are remarkably sample-efficient, rich in geospatial information, and robust across the globe.
Crucially, GeoLLM shows promise in mitigating the limitations of existing geospatial covariates and complementing them well.
Code is available on the project website: https://rohinmanvi.github.io/GeoLLM.
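Since the results above are reported as Pearson's squared correlation, a short worked example of that metric may help. The sketch below computes it between predictions and ground truth; the numbers are made up for illustration and are not taken from the paper.

```python
import math


def pearson_r2(predicted, actual):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    r = cov / math.sqrt(var_p * var_a)
    return r * r


# Hypothetical population-density scores for five locations.
actual = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.2, 1.9, 3.4, 3.8, 5.1]
print(round(pearson_r2(predicted, actual), 3))
```

Note that this quantity is unchanged by rescaling or shifting the predictions, so it measures how well predictions track the target rather than how close they are in absolute terms.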
Convolution Meets LoRA: Parameter Efficient Finetuning for Segment Anything Model
Paper explainer: http://www.studyai.com/xueshu/paper/detail/3f888c377b
Paper link: https://openreview.net/forum?id=ezscMer8L0
Abstract
The Segment-Anything Model (SAM) stands as a foundational framework for image segmentation.
While it exhibits remarkable zero-shot generalization in typical scenarios, its advantage diminishes when applied to specialized domains like medical imagery and remote sensing.
To address this limitation, this paper introduces Conv-LoRA, a simple yet effective parameter-efficient fine-tuning approach.
By integrating ultra-lightweight convolutional parameters into Low-Rank Adaptation (LoRA), Conv-LoRA can inject image-related inductive biases into the plain ViT encoder, further reinforcing SAM’s local prior assumption.
Notably, Conv-LoRA not only preserves SAM’s extensive segmentation knowledge but also revives its capacity of learning high-level image semantics, which is constrained by SAM’s foreground-background segmentation pretraining.
Comprehensive experimentation across diverse benchmarks spanning multiple domains underscores Conv-LoRA’s superiority in adapting SAM to real-world semantic segmentation tasks…
DiffusionSat: A Generative Foundation Model for Satellite Imagery
Paper explainer: http://www.studyai.com/xueshu/paper/detail/3fa371d45a
Paper link: https://openreview.net/forum?id=I5webNFDgQ
Abstract
Diffusion models have achieved state-of-the-art results on many modalities including images, speech, and video.
However, existing models are not tailored to support remote sensing data, which is widely used in important applications including environmental monitoring and crop-yield prediction.
Satellite images are significantly different from natural images (they can be multi-spectral, irregularly sampled across time), and existing diffusion models trained on images from the Web do not support them.
Furthermore, remote sensing data is inherently spatio-temporal, requiring conditional generation tasks not supported by traditional methods based on captions or images.
In this paper, we present DiffusionSat, to date the largest generative foundation model trained on a collection of publicly available large, high-resolution remote sensing datasets.
As text-based captions are sparsely available for satellite images, we incorporate the associated metadata such as geolocation as conditioning information.
Our method produces realistic samples and can be used to solve multiple generative tasks including temporal generation, multi-spectral super-resolution and in-painting.
Our method outperforms previous state-of-the-art methods for satellite image generation and is the first large-scale generative foundation model for satellite imagery.
The project website can be found here: https://samar-khanna.github.io/DiffusionSat/.
Deep Neural Network Initialization with Sparsity Inducing Activations
Paper explainer: http://www.studyai.com/xueshu/paper/detail/89ceaf0765
Paper link: https://openreview.net/forum?id=uvXK8Xk9Jk
Abstract
Inducing and leveraging sparse activations during training and inference is a promising avenue for improving the computational efficiency of deep networks, which is increasingly important as network sizes continue to grow and their application becomes more widespread.
Here we use the large width Gaussian process limit to analyze the behaviour, at random initialization, of nonlinear activations that induce sparsity in the hidden outputs.
A previously unreported form of training instability is proven for arguably two of the most natural candidates for hidden layer sparsification; those being a shifted ReLU ($\phi(x)=\max(0, x-\tau)$ for $\tau\ge 0$) and soft thresholding ($\phi(x)=0$ for $|x|\le\tau$ and $x-\text{sign}(x)\tau$ for $|x|>\tau$).
We show that this instability is overcome by clipping the nonlinear activation magnitude, at a level prescribed by the shape of the associated Gaussian process variance map.
Numerical experiments verify the theory and show that the proposed magnitude clipped sparsifying activations can be trained with training and test fractional sparsity as high as 85% while retaining close to full accuracy…
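The two sparsifying activations defined above, and the idea of clipping their magnitude, can be written out directly. The following is a minimal sketch: the shifted ReLU and soft thresholding follow the formulas in the abstract, while the symmetric clipping form and the clip level `m` are assumptions for illustration. The paper derives the appropriate level from the Gaussian process variance map, which is not reproduced here.

```python
def shifted_relu(x, tau):
    """phi(x) = max(0, x - tau), with tau >= 0."""
    return max(0.0, x - tau)


def soft_threshold(x, tau):
    """phi(x) = 0 for |x| <= tau, and x - sign(x) * tau otherwise."""
    if abs(x) <= tau:
        return 0.0
    return x - tau if x > 0 else x + tau


def clip_magnitude(y, m):
    """Cap the magnitude of an activation output at m (assumed clip level)."""
    return max(-m, min(m, y))


def clipped_shifted_relu(x, tau, m):
    return clip_magnitude(shifted_relu(x, tau), m)


def clipped_soft_threshold(x, tau, m):
    return clip_magnitude(soft_threshold(x, tau), m)


def fraction_zero(activation, inputs):
    """Fraction of inputs mapped exactly to zero (the activation sparsity)."""
    outputs = [activation(x) for x in inputs]
    return sum(1 for y in outputs if y == 0.0) / len(outputs)


inputs = [-3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 3.0]
tau, m = 1.0, 1.5  # illustrative values only
print([clipped_shifted_relu(x, tau, m) for x in inputs])
print([clipped_soft_threshold(x, tau, m) for x in inputs])
print(fraction_zero(lambda x: soft_threshold(x, tau), inputs))
```

Raising `tau` zeroes out a larger share of inputs, which is the source of the sparsity; clipping leaves that share unchanged and only bounds the surviving outputs.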
Experimental Design for Multi-Channel Imaging via Task-Driven Feature Selection
Paper explainer: http://www.studyai.com/xueshu/paper/detail/a8685fa282
Paper link: https://openreview.net/forum?id=MloaGA6WwX
Abstract
This paper presents a data-driven, task-specific paradigm for experimental design, to shorten acquisition time, reduce costs, and accelerate the deployment of imaging devices.
Current approaches in experimental design focus on model-parameter estimation and require specification of a particular model, whereas in imaging, other tasks may drive the design.
Furthermore, such approaches often lead to intractable optimization problems in real-world imaging applications.
Here we present a new paradigm for experimental design that simultaneously optimizes the design (set of image channels) and trains a machine-learning model to execute a user-specified image-analysis task.
The approach obtains data densely-sampled over the measurement space (many image channels) for a small number of acquisitions, then identifies a subset of channels of prespecified size that best supports the task.
We propose a method: TADRED for TAsk-DRiven Experimental Design in imaging, to identify the most informative channel-subset whilst simultaneously training a network to execute the task given the subset.
Experiments demonstrate the potential of TADRED in diverse imaging applications: several clinically-relevant tasks in magnetic resonance imaging; and remote sensing and physiological applications of hyperspectral imaging.
Results show substantial improvement over classical experimental design, two recent application-specific methods within the new paradigm, and state-of-the-art approaches in supervised feature selection.
We anticipate further applications of our approach.
Code is available: https://github.com/sbb-gh/experimental-design-multichannel.
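The selection step at the heart of this setup, keeping a channel subset of prespecified size, can be illustrated with a simple score-and-mask routine. This is a generic sketch and not TADRED itself: the per-channel scores are assumed to be given, whereas the paper identifies the subset while the task network is being trained, and all function names are hypothetical.

```python
def top_k_mask(scores, k):
    """Return a 0/1 mask that keeps the k highest-scoring channels."""
    if not 0 <= k <= len(scores):
        raise ValueError("k must be between 0 and the number of channels")
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(len(scores))]


def apply_mask(sample, mask):
    """Zero out the channels that were not selected."""
    return [value * m for value, m in zip(sample, mask)]


# Hypothetical informativeness scores for five image channels.
scores = [0.10, 0.80, 0.30, 0.95, 0.05]
sample = [12.0, 7.5, 3.1, 9.9, 4.2]
mask = top_k_mask(scores, k=2)
print(mask)
print(apply_mask(sample, mask))
```

In a deployed design only the selected channels would be acquired at all; masking a densely sampled measurement, as here, is how the effect of a candidate subset can be simulated beforehand.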
Geographic Location Encoding with Spherical Harmonics and Sinusoidal Representation Networks
Paper explainer: http://www.studyai.com/xueshu/paper/detail/be65b783b7
Paper link: https://openreview.net/forum?id=PudduufFLa
Abstract
Learning representations of geographical space is vital for any machine learning model that integrates geolocated data, spanning application domains such as remote sensing, ecology, or epidemiology.
Recent work embeds coordinates using sine and cosine projections based on Double Fourier Sphere (DFS) features.
These embeddings assume a rectangular data domain even on global data, which can lead to artifacts, especially at the poles.
At the same time, little attention has been paid to the exact design of the neural network architectures with which these functional embeddings are combined.
This work proposes a novel location encoder for globally distributed geographic data that combines spherical harmonic basis functions, natively defined on spherical surfaces, with sinusoidal representation networks (SirenNets) that can be interpreted as learned Double Fourier Sphere embedding.
We systematically evaluate positional embeddings and neural network architectures across various benchmarks and synthetic evaluation datasets.
In contrast to previous approaches that require the combination of both positional encoding and neural networks to learn meaningful representations, we show that both spherical harmonics and sinusoidal representation networks are competitive on their own but set state-of-the-art performances across tasks when combined.
The model code and experiments are available at https://github.com/marccoru/locationencoder…
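The pole artifact mentioned above can be seen with a small numerical check. The sketch below evaluates the real spherical harmonics of degree 0 and 1 (closed forms in terms of the location's unit vector; sign conventions vary between sources) and compares them with a naive sine/cosine encoding of latitude and longitude. The naive encoding is a simplified stand-in written for this illustration, not the full DFS embedding, and the function names are hypothetical.

```python
import math


def real_sh_degree_le_1(lat_deg, lon_deg):
    """Real spherical harmonics of degree 0 and 1 at a latitude/longitude.

    Returns [Y_0^0, Y_1^-1, Y_1^0, Y_1^1], written in terms of the unit
    vector (x, y, z) of the location.
    """
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    x = math.cos(lat) * math.cos(lon)
    y = math.cos(lat) * math.sin(lon)
    z = math.sin(lat)
    c0 = 0.5 / math.sqrt(math.pi)          # Y_0^0 is constant
    c1 = math.sqrt(3.0 / (4.0 * math.pi))  # shared degree-1 factor
    return [c0, c1 * y, c1 * z, c1 * x]


def grid_sincos(lat_deg, lon_deg):
    """Naive sine/cosine features that treat lat/lon as a rectangle."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    return [math.sin(lat), math.cos(lat), math.sin(lon), math.cos(lon)]


# The North Pole is a single point, whatever longitude is written down.
print(real_sh_degree_le_1(90.0, 0.0))
print(real_sh_degree_le_1(90.0, 120.0))
print(grid_sincos(90.0, 0.0))
print(grid_sincos(90.0, 120.0))
```

The spherical harmonic features agree at the pole up to floating-point error, while the rectangular features assign the same physical point different codes depending on the longitude supplied.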