On the Faithfulness of Post-Hoc Concept Bottleneck Models

1 Institute of Data Science, German Aerospace Center, Jena, Germany
2 Computer Vision Group Jena, Friedrich Schiller University Jena, Germany
3 GEOMAR Helmholtz Centre for Ocean Research Kiel, Germany

ECCV 2026

Abstract

Human decision-making interprets the world through high-level concepts, such as recognizing a bird by its belly color. To bridge the gap between opaque deep learning representations and human understanding, Post-Hoc Concept Bottleneck Models (post-hoc CBMs) project latent features onto interpretable concept spaces using auxiliary datasets or vision-language models.

However, relying on target task accuracy as the primary measure of post-hoc CBM success obscures whether the learned concepts are semantically meaningful or merely predictive artifacts. For example, random concept projections can achieve competitive accuracy despite being semantically meaningless.

In this work, we analyze the learned projections directly and identify two failure cases: First, for concept projections learned from auxiliary data, covariate shifts can lead to unfaithful concept representations for the target task. In particular, we provide an upper bound on the error introduced by this shift. Second, systematic label noise in surrogate concept labels generated by vision-language models leads to unfaithful projections. After formalizing these failure modes, we introduce novel metrics that decouple concept faithfulness from predictive accuracy. Our empirical results across real-world and synthetic benchmarks confirm that these metrics identify unfaithful behaviors that standard accuracy-based evaluation fails to detect.

Two Sources of Unfaithfulness in PCBMs

Figure: Sources of unfaithfulness in PCBMs: covariate shift and systematic surrogate errors.

We analyze two primary post-hoc CBM training mechanisms and identify two sources of unfaithfulness: (1) covariate shifts in auxiliary concept sets and (2) systematic surrogate label errors.

The Illusion of Classifier Accuracy

Downstream accuracy can be a deceptive measure of faithfulness. Under the manifold hypothesis, the Johnson-Lindenstrauss lemma for smooth manifolds implies that even random, semantically meaningless projections retain enough geometric information to accurately reconstruct the original backbone activations.

As the number of random concepts increases, a downstream classifier trained on them matches or even exceeds the performance of models trained on real concepts. Thus, high accuracy does not imply a faithful, interpretable concept bottleneck.

Figure: High downstream accuracy can be achieved using random concept projections.
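The effect can be reproduced on toy data. The following is a minimal sketch under loudly stated assumptions, not the experimental setup of the paper: the "backbone activations" are synthetic points near a 5-dimensional linear subspace (a simple stand-in for a manifold), the classifier is plain least squares, and all names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption: synthetic "backbone activations" lying near a 5-dimensional
# linear subspace of a 64-dimensional feature space.
n, d, intrinsic = 2000, 64, 5
latent = rng.normal(size=(n, intrinsic))
embed = rng.normal(size=(intrinsic, d))
features = latent @ embed + 0.01 * rng.normal(size=(n, d))

# Binary task labels that depend only on the latent coordinates.
labels = (latent @ rng.normal(size=intrinsic) > 0).astype(float)


def fit_and_score(x, y, split=1000):
    """Fit a least-squares linear classifier on the first `split` rows
    and return accuracy on the remaining rows."""
    x = np.hstack([x, np.ones((len(x), 1))])  # append a bias column
    w, *_ = np.linalg.lstsq(x[:split], 2 * y[:split] - 1, rcond=None)
    pred = (x[split:] @ w > 0).astype(float)
    return float((pred == y[split:]).mean())


# A "concept bottleneck" built from random directions without any semantics.
num_random_concepts = 32
random_concepts = rng.normal(size=(d, num_random_concepts)) / np.sqrt(d)
bottleneck = features @ random_concepts

acc_backbone = fit_and_score(features, labels)
acc_random = fit_and_score(bottleneck, labels)
print(f"backbone features: {acc_backbone:.3f}")
print(f"random concepts:   {acc_random:.3f}")
```

In this toy setting the classifier on random concepts should perform on par with the one on the full features, even though none of the 32 bottleneck dimensions corresponds to a meaningful concept.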

Failure Mode 1: Covariate Shift

Figure: Covariate shift in auxiliary datasets. Stronger covariate shift (HΔH-divergence) leads to increased unfaithfulness.

Since target-domain concept labels are rarely available, standard practice trains concept projections on rich auxiliary datasets (e.g., Broden) and transfers them to the target task.

We identify covariate shift as a primary source of unfaithfulness in this setting. Even if a concept's semantic definition is identical across domains, a geometric shift in the feature space can invalidate the learned projection.

We formalize this following the domain generalization literature, linking the resulting unfaithfulness to an empirical HΔH-divergence metric. This allows practitioners to estimate the unfaithfulness introduced by domain shift without ground-truth concept labels in the target domain.
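One common way to estimate such a divergence from unlabeled features is the proxy A-distance from the domain adaptation literature: train a classifier to tell auxiliary features from target features and convert its held-out error err into 2(1 - 2 err). The sketch below uses this proxy as a stand-in; it is an assumption that it resembles the estimator used in the paper, and the function name, the logistic-regression domain classifier, and all hyperparameters are hypothetical.

```python
import numpy as np


def proxy_a_distance(source, target, seed=0, steps=500, lr=0.1):
    """Proxy A-distance 2 * (1 - 2 * err), where err is the held-out error
    of a logistic-regression classifier separating the two domains.
    Assumption: a linear domain classifier; no concept labels are needed."""
    rng = np.random.default_rng(seed)
    x = np.vstack([source, target])
    y = np.concatenate([np.zeros(len(source)), np.ones(len(target))])

    # Standardize the features and append a bias column.
    x = (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)
    x = np.hstack([x, np.ones((len(x), 1))])

    # Random 50/50 split into training and held-out samples.
    perm = rng.permutation(len(x))
    half = len(x) // 2
    train, test = perm[:half], perm[half:]

    # Logistic regression fitted with plain gradient descent.
    w = np.zeros(x.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x[train] @ w)))
        w -= lr * x[train].T @ (p - y[train]) / len(train)

    err = np.mean((x[test] @ w > 0) != (y[test] == 1))
    return float(2.0 * (1.0 - 2.0 * err))


rng = np.random.default_rng(1)
aux = rng.normal(size=(500, 16))             # auxiliary-domain features
near = rng.normal(size=(500, 16))            # target with no shift
far = rng.normal(loc=2.0, size=(500, 16))    # target with strong shift

d_near = proxy_a_distance(aux, near)
d_far = proxy_a_distance(aux, far)
print(f"no shift: {d_near:.2f}   strong shift: {d_far:.2f}")
```

Values near 0 mean the domains are hard to distinguish, while values near 2 indicate a strong shift, which is the regime in which a transferred concept projection is most at risk.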

Failure Mode 2: Systematic Surrogate Label Errors

To avoid domain shift, recent methods use Vision-Language Models (VLMs) such as CLIP or Grounding DINO to generate surrogate concept labels directly on the target data. Our theoretical analysis formulates the concept bottleneck as a Generalized Linear Model (GLM). Using this formulation, we study the gradient of the target concept objective at the optimum of the surrogate objective. We show that surrogate labels can only yield faithful projections if their errors are orthogonal to the backbone activations. In other words, random mistakes cancel out, whereas systematic errors (e.g., the VLM predicting sky together with clouds) lead to wrong concept projections.
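The orthogonality condition can be checked numerically in the simplest GLM, linear regression with an identity link. For the squared loss, the gradient of the target objective at the surrogate optimum equals the activations transposed times the surrogate error, so it vanishes exactly when the error is orthogonal to every activation dimension. The sketch below assumes this special case and noise-free synthetic concept scores; all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 8
activations = rng.normal(size=(n, d))     # backbone activations
true_projection = rng.normal(size=d)      # faithful concept projection
concept = activations @ true_projection   # ground-truth concept scores


def fit(x, y):
    """Least-squares fit, i.e. the optimum of a GLM with identity link."""
    theta, *_ = np.linalg.lstsq(x, y, rcond=None)
    return theta


def target_gradient(theta):
    """Gradient of the ground-truth squared loss, evaluated at theta."""
    return activations.T @ (activations @ theta - concept)


# Error orthogonal to all activation columns: random noise with its
# projection onto the activation column space removed.
noise = rng.normal(size=n)
orthogonal_error = noise - activations @ fit(activations, noise)

# Systematic error: correlated with activation dimension 0.
systematic_error = 0.5 * activations[:, 0]

theta_orth = fit(activations, concept + orthogonal_error)
theta_sys = fit(activations, concept + systematic_error)

print("orthogonal error, max deviation:",
      np.abs(theta_orth - true_projection).max())
print("systematic error, max deviation:",
      np.abs(theta_sys - true_projection).max())
print("target gradient norms:",
      np.linalg.norm(target_gradient(theta_orth)),
      np.linalg.norm(target_gradient(theta_sys)))
```

With orthogonal errors the surrogate optimum coincides with the faithful projection up to numerical precision, whereas the systematic error shifts the learned projection along the activation dimension it correlates with.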

Figure: Geometric intuition of orthogonality. In 2D, faithful models require surrogate errors (Δk) to be orthogonal (independent) to activations (αj).

Figure: Systematic vs. random errors. VLM surrogates (left) exhibit systematic, correlated errors, causing unfaithful projections. Random noise (right) produces very low correlation.

We evaluate popular VLMs and demonstrate that their surrogate labels frequently violate this orthogonality condition. VLMs introduce systematic label noise that is highly correlated with backbone features. To detect this, we propose measuring the absolute Pearson correlation between surrogate label errors and backbone activations, enabling practitioners to rank and evaluate VLM surrogate quality.
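A minimal sketch of such a correlation check is given below. It assumes that ground-truth concept values are available for a labeled evaluation subset and that errors are compared against each activation dimension separately; the function name, the array shapes, and the synthetic surrogates are hypothetical and not taken from the paper.

```python
import numpy as np


def abs_error_activation_correlation(surrogate, truth, activations):
    """Absolute Pearson correlation between surrogate label errors and
    backbone activations.

    surrogate, truth: arrays of shape (n, K); activations: shape (n, d).
    Returns an array of shape (K, d)."""
    errors = np.asarray(surrogate, float) - np.asarray(truth, float)
    acts = np.asarray(activations, float)
    errors = errors - errors.mean(axis=0)
    acts = acts - acts.mean(axis=0)
    cov = errors.T @ acts
    norms = np.outer(np.linalg.norm(errors, axis=0),
                     np.linalg.norm(acts, axis=0))
    # Constant columns have zero norm; report zero correlation for them.
    return np.abs(np.divide(cov, norms, out=np.zeros_like(cov),
                            where=norms > 0))


rng = np.random.default_rng(0)
n, d = 2000, 16
acts = rng.normal(size=(n, d))
truth = acts[:, :1]  # one synthetic concept score (K = 1)

# Random surrogate: additive noise that is independent of the activations.
random_surrogate = truth + rng.normal(scale=0.5, size=(n, 1))
# Systematic surrogate: the error follows another activation dimension.
systematic_surrogate = truth + 0.5 * acts[:, 3:4]

r_random = abs_error_activation_correlation(random_surrogate, truth, acts).max()
r_sys = abs_error_activation_correlation(systematic_surrogate, truth, acts).max()
print(f"random noise: {r_random:.3f}   systematic error: {r_sys:.3f}")
```

Taking the maximum (or mean) over activation dimensions yields one score per concept, which can then be averaged to rank candidate surrogates. Note that the random case here uses additive noise on continuous scores; flipping binary labels at random produces errors that depend on the true label and is therefore a different setting.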

Data Examples

To evaluate the first failure mode, we use four datasets based on Elements with increasing covariate shift. Here we visualize one example from each dataset for the concept stripes.

Quantitative Results

We directly measure concept faithfulness by training post-hoc Concept Bottleneck Models (PCBMs) on multiple auxiliary data distributions and on VLM-derived surrogate labels (LF-CBM and VLG-CBM). Rather than relying on downstream task accuracy, we evaluate geometric concept alignment and compare a standard learned classifier h against an oracle h* that uses ground-truth concept mappings.
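As an illustration of what a geometric alignment score can look like, the sketch below computes the cosine similarity between each learned concept direction and its ground-truth direction and averages over the K concepts. Treating alignment as cosine similarity is an assumption made here for illustration; the function name and the toy directions are hypothetical.

```python
import numpy as np


def concept_alignment(learned, reference):
    """Cosine similarity between corresponding rows of two (K, d) arrays
    of concept directions. Returns an array of shape (K,)."""
    learned = np.asarray(learned, float)
    reference = np.asarray(reference, float)
    dots = np.sum(learned * reference, axis=1)
    norms = (np.linalg.norm(learned, axis=1)
             * np.linalg.norm(reference, axis=1))
    return dots / norms


# Two toy concepts in a 3-dimensional feature space.
reference = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0]])
learned = np.array([[2.0, 0.0, 0.0],    # same direction, different scale
                    [0.0, 1.0, 1.0]])   # rotated by 45 degrees

per_concept = concept_alignment(learned, reference)
mean_alignment = float(per_concept.mean())
print(per_concept, mean_alignment)
```

A score of 1 means the learned direction matches the reference regardless of scale, while lower values indicate that the projection points away from the intended concept even if downstream accuracy stays high.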

Faithfulness vs. Performance. Comparison of post-hoc CBM variants with respect to the ground-truth concept labels and directions on the Elements dataset (see below). OOD data degrades faithfulness significantly more than accuracy suggests. Each metric is averaged over the K concepts. To demonstrate faithfulness under error orthogonality, we also train πθ on ground-truth concept labels with 25% random noise added.

Across all setups, high downstream accuracy consistently masks severe representation failures. Direct evaluation of πθ and the oracle classifier exposes these flaws. These results demonstrate that task accuracy alone is a deceptive proxy for concept faithfulness, reinforcing the need for direct evaluations of the concept projections in post-hoc CBMs.

BibTeX

If you find our work useful, please consider citing our paper:

@inproceedings{schmalwasser2026faithfulness,
  author    = {Schmalwasser, Laines and Blunk, Jan and Penzel, Niklas and Niebling, Julia and Denzler, Joachim},
  title     = {On the Faithfulness of Post-Hoc Concept Bottleneck Models},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026},
  arxiv     = {2606.30498},
  url       = {https://posthoc-cbm-faithfulness.github.io/}
}