Model-agnostic dataset uncertainty assessment

#20

by Jakob12345 - opened Jan 31


edited Jan 31


Data-driven approaches are increasingly used in real-world applications such as Earth Observation, where reliable performance is critical. In practice, model performance often depends as much on dataset quality and coverage as on the choice of architecture.

We propose a model-agnostic tool to estimate dataset-related uncertainty before actually training a model.

We leverage the S2L1C embedding space of TerraMind Tiny to provide two complementary diagnostics:

1. Distribution coverage (epistemic risk proxy)
If a dataset does not cover relevant regions of the input distribution, models trained on it are more likely to fail under distribution shift.

2. Task clarity (aleatoric ambiguity proxy)
If class definitions are ambiguous or labels are noisy, performance is bounded. This typically appears as strong class overlap in embedding space.

Overall, the goal is to help practitioners assess task feasibility and identify dataset shortcomings (e.g., missing conditions, ambiguous labels) before investing in training and evaluation.

Data is Key: TerraMind for Data Quality Assessment

While both aspects above are highly relevant for Earth observation tasks, dataset evaluation has traditionally been performed during training and evaluation of a specific downstream model.

This is mainly because it is difficult to formally describe the data distribution of a dataset or an even broader reference distribution that covers the diversity of real-world inputs.

Foundation models offer a practical alternative: they provide a structured embedding space that encodes fundamental properties of the imagery in an accessible representation.

TerraMind is trained on a large and diverse data distribution (TerraMesh [1]), which makes it a good candidate for a reference representation of Sentinel-2 imagery. The model is trained in a self-supervised manner to capture image structure in a compact latent representation.

In this work, we demonstrate how TerraMind embeddings can be used as a proxy to estimate both aleatoric (task ambiguity) and epistemic (data distribution coverage) uncertainty, before training a model.

Methodology

Aleatoric Uncertainty: How well is the problem defined?

We first evaluate the uncertainty in the task definition.
For this, we apply a simple logistic regression in the embedding space (one-vs-rest, class-wise).
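
The class-wise probe can be illustrated with a minimal, dependency-free sketch. Everything below is an assumption made for illustration: the 2-D toy "embeddings", the class names, the plain gradient-descent fit, and the use of balanced accuracy on the training data as the separability score. In practice one would more likely use an existing library implementation (for example scikit-learn) on real TerraMind embeddings and score on a held-out split.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logit(w, b, x):
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def fit_binary_logreg(X, y, lr=0.1, steps=500):
    """Fit weights and bias by full-batch gradient descent on the log loss."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(logit(w, b, xi)) - yi
            for j in range(dim):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def classwise_separability(X, labels):
    """Balanced accuracy of one binary 'class vs. rest' model per class."""
    scores = {}
    for cls in sorted(set(labels)):
        y = [1 if lab == cls else 0 for lab in labels]
        w, b = fit_binary_logreg(X, y)
        pred = [1 if logit(w, b, xi) > 0 else 0 for xi in X]
        tpr = sum(p == 1 for p, t in zip(pred, y) if t == 1) / sum(y)
        tnr = sum(p == 0 for p, t in zip(pred, y) if t == 0) / (len(y) - sum(y))
        scores[cls] = (tpr + tnr) / 2
    return scores


def toy_embeddings(seed=0, n_per_class=30):
    # Hypothetical 2-D "embeddings": 'water' is isolated, while
    # 'savanna' and 'grassland' are drawn from the same distribution.
    rng = random.Random(seed)
    centers = {"water": (6.0, 6.0), "savanna": (0.0, 0.0), "grassland": (0.0, 0.0)}
    X, labels = [], []
    for cls, (cx, cy) in centers.items():
        for _ in range(n_per_class):
            X.append([rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)])
            labels.append(cls)
    return X, labels


X, labels = toy_embeddings()
scores = classwise_separability(X, labels)
for cls, score in scores.items():
    print(f"{cls}: balanced accuracy {score:.2f}")
```

In this toy setup the isolated class scores close to 1, while the two overlapping classes stay well below it. That gap is the kind of signal read here as a proxy for aleatoric ambiguity.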

Epistemic Uncertainty: How well is the global data distribution covered?

How to derive a clustering?
To establish a global reference distribution, we use the Sentinel-2 component of the TerraMesh dataset embedded in the TerraMind latent space. We employ the hierarchical clustering method introduced by [2], which partitions the embedding space into semantically homogeneous groups. Unlike classical clustering methods, which often preserve source data imbalances, this strategy mitigates the density skew inherent in geographical sampling and is robust to long-tail distributions. This property allows our method to detect out-of-distribution data and, crucially, to contextualize it semantically. The clustering can be seen as a discretization of the continuous embedding space into a set of semantic concepts.

How to measure coverage of a dataset?
Given a new dataset embedded with TerraMind, each sample is assigned to one of these semantic clusters. The number of occupied clusters defines a metric that we call global coverage. Higher coverage indicates greater diversity, while empty clusters mark regions of the reference distribution that are out-of-distribution for the dataset. In our experiments, model performance declines on TerraMesh samples that fall into these empty clusters, even though the samples themselves are unambiguous.
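
The coverage computation can be sketched as follows. This is only an illustration under stated assumptions: the four 2-D centroids and the small dataset are made up, and samples are assigned to the nearest centroid by Euclidean distance, which stands in for whatever assignment rule the hierarchical clustering of [2] actually yields.

```python
import math


def nearest_cluster(embedding, centroids):
    """Index of the centroid with the smallest Euclidean distance."""
    return min(range(len(centroids)), key=lambda k: math.dist(embedding, centroids[k]))


def global_coverage(embeddings, centroids):
    """Return (set of occupied cluster ids, fraction of clusters occupied)."""
    occupied = {nearest_cluster(e, centroids) for e in embeddings}
    return occupied, len(occupied) / len(centroids)


# Hypothetical reference clustering with four centroids in a 2-D space.
centroids = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
# Hypothetical dataset that only covers two of the four regions.
dataset = [(0.5, -0.2), (1.0, 0.3), (9.5, 0.4), (-0.3, 0.8)]

occupied, coverage = global_coverage(dataset, centroids)
empty = sorted(set(range(len(centroids))) - occupied)
print(f"coverage = {len(occupied)}/{len(centroids)}, empty clusters: {empty}")
```

The list of empty clusters is as useful as the coverage number itself, since it points to the semantic regions where additional data would need to be collected.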

In summary, the method (1) decouples semantic importance from sampling frequency in remote sensing data, (2) audits dataset diversity and semantic gaps, and (3) provides a proxy for out-of-distribution risk by highlighting clusters with high epistemic uncertainty and guiding the curation of targeted supplementary data.

Evaluation on SEN12MS and EuroSAT

We apply the proposed aleatoric and epistemic evaluation on two widely used datasets, EuroSAT [3] and SEN12MS [4].

How well is the problem defined? (aleatoric)

We estimate task clarity via a one-vs-rest logistic regression in the embedding space.

For EuroSAT, the embeddings already suggest that the task is relatively well defined. Pasture, Permanent Crop, and Highway show lower separability, while separability is high overall. This pattern aligns with reported results.

For SEN12MS, the results look rather different. While classes such as water, forest, and urban are comparatively well separated, savanna, grassland, and wetland overlap more strongly in the latent representation. Again, the overall pattern aligns with reported results and indicates higher aleatoric uncertainty.

How well does the dataset cover the global distribution? (epistemic)

For the SEN12MS dataset we obtain a global coverage of 1335/2000.
In contrast, for the EuroSAT dataset we obtain a global coverage of 170/2000.
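
Expressed as fractions of the 2000 reference clusters, the two counts above can be compared directly. The short snippet below only restates the reported numbers; the variable names are illustrative.

```python
# Occupied clusters reported above, out of 2000 reference clusters.
coverage_counts = {"SEN12MS": 1335, "EuroSAT": 170}
n_clusters = 2000

coverage_pct = {name: 100 * k / n_clusters for name, k in coverage_counts.items()}
ratio = coverage_counts["SEN12MS"] / coverage_counts["EuroSAT"]

for name, pct in coverage_pct.items():
    print(f"{name}: {pct:.2f}% of reference clusters occupied")
print(f"SEN12MS occupies about {ratio:.1f}x as many clusters as EuroSAT")
```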

Detailed Evaluation of SEN12MS

The coverage values already reflect what is known from the dataset descriptions: SEN12MS covers a significantly larger share of the Earth's optical surface patterns.

We now use the clusters to draw conclusions about applicability in real-world scenarios. For this, we look at SEN12MS and distinguish two kinds of uncovered clusters: informative clusters, where land cover should be detectable, and non-informative clusters, which contain images that do not allow a clear classification. For this evaluation, we trained a simple ResNet-50 on the dataset.

Interpreting OOD Clusters 1: Non-informative clusters

We find samples with high cloud coverage or erroneous data, which do not provide clear information about land cover.
However, such cases can occur at inference time, and practitioners should be aware of them.

Cluster 294 contains erroneous samples with artifacts. Nevertheless, all samples are classified as urban with confidence values of 82%, 80%, and 65%.

Cluster 1278 contains samples with high cloud coverage. The trained ResNet-50 predicts the majority of them as cropland. Based on the examples, this cannot be confirmed and is unlikely.

Interpreting OOD Clusters 2: Not covered clusters

There are also clusters that are not covered by SEN12MS but contain neither heavy cloud cover nor other noise. Interestingly, this is not directly obvious in the 2D PCA in all cases.

Cluster 1578 represents snowy regions. The images are predicted mainly as Barren (which could be correct) and Grassland (which does not really fit), both with high confidence values. None of the images is predicted as snow/ice, which is not surprising, since for technical reasons there are no samples for this class in the classification version of SEN12MS.

The samples in cluster 418 share a combination of forests and rivers. Multiple images in this cluster clearly show forests but are classified as urban. Since there is no indication of urban regions in these images and since this cluster is not covered by SEN12MS, this may indicate a gap in forest–river patterns or other biases.
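
One way to act on such findings at inference time is to check whether an input falls into a cluster that the training dataset never occupied, independently of the classifier's confidence. The sketch below is hypothetical: the centroids, the set of covered clusters, the nearest-centroid assignment, and the function names are assumptions for illustration and are not taken from our implementation.

```python
import math


def nearest_cluster(embedding, centroids):
    """Index of the centroid with the smallest Euclidean distance."""
    return min(range(len(centroids)), key=lambda k: math.dist(embedding, centroids[k]))


def check_prediction(embedding, predicted_class, confidence, centroids, covered_clusters):
    """Attach a coverage flag to a model prediction.

    A high softmax confidence is not treated as evidence of reliability
    if the input lies in a cluster without any training samples.
    """
    cluster = nearest_cluster(embedding, centroids)
    return {
        "cluster": cluster,
        "predicted_class": predicted_class,
        "confidence": confidence,
        "covered": cluster in covered_clusters,
    }


# Hypothetical reference centroids and the clusters occupied by the training set.
centroids = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
covered_clusters = {0, 1}

# A confident prediction on an input from an uncovered region.
report = check_prediction((0.4, 9.7), "urban", 0.82, centroids, covered_clusters)
if not report["covered"]:
    print(f"cluster {report['cluster']} is not covered by the training data; "
          f"treat '{report['predicted_class']}' ({report['confidence']:.0%}) with caution")
```

The flag does not say that the prediction is wrong. It only marks inputs for which the training data offers no support, which is where the confident but implausible predictions above were observed.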

Conclusion

This work shows how TerraMind can be utilized to analyze datasets and corresponding tasks before actual training.

The procedure highlights the difficulty of a task and indicates the global coverage of the dataset in embedding space, both of which are connected to uncertainty in predictions. For the latter, we showed its usefulness by detecting biases and uncovered clusters in SEN12MS that seem to confuse a simple ResNet-50 baseline.

Contact

Team: Stefan Dan & Jakob Gawlikowski
Contact: firstname.lastname@dlr.de

[1] Blumenstiel, Benedikt, et al. "TerraMesh: A planetary mosaic of multimodal Earth observation data." Proceedings of the Computer Vision and Pattern Recognition Conference. 2025.
[2] Vo, Huy V., et al. "Automatic data curation for self-supervised learning: A clustering-based approach." Transactions on Machine Learning Research.
[3] Helber, Patrick, et al. "EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification." IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 12.7 (2019): 2217-2226.
[4] Schmitt, M., and Y.-L. Wu. "Remote sensing image classification with the SEN12MS dataset." ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences 2 (2021): 101-106.

Jakob12345 changed discussion title from Model-agnostic dataset related uncertainty assessment to Model-agnostic dataset uncertainty assessment Jan 31
