
Research Group Zeynep Akata


Prof. Dr. Zeynep Akata
Principal Investigator, Interpretable and Reliable Machine Learning

Zeynep Akata is a Liesel Beckmann Distinguished Professor of Computer Science at TUM and the director of the Institute for Explainable Machine Learning at Helmholtz Munich.

Zeynep Akata's field of research is explainable machine learning. Her goal is to build transparent algorithms that make comprehensible decisions. Her approach combines methods from computer vision, machine learning and natural language processing. Her scientific vision is a self-explanatory artificial intelligence that learns from minimal feedback and interacts reliably with humans.

Team members @MCML

All of the following are members of the Interpretable and Reliable Machine Learning group:

Dr. Stephan Alaniz
Jessica Bader
Massimo Bini
Dr. Quentin Bouniot
Luca Eyring
Dr. Iuliana Georgescu
Leander Girrbach
Yiran Huang
Shyamgopal Karthik
Jae Myung Kim
Sanghwan Kim
Mateusz Pach
Karsten Roth

Publications @MCML

2024


[18]
L. Eyring, S. Karthik, K. Roth, A. Dosovitskiy and Z. Akata.
ReNO: Enhancing One-step Text-to-Image Models through Reward-based Noise Optimization.
NeurIPS 2024 - 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. To be published. Preprint available. arXiv GitHub
Abstract

Text-to-Image (T2I) models have made significant advancements in recent years, but they still struggle to accurately capture intricate details specified in complex compositional prompts. While fine-tuning T2I models with reward objectives has shown promise, it suffers from ‘reward hacking’ and may not generalize well to unseen prompt distributions. In this work, we propose Reward-based Noise Optimization (ReNO), a novel approach that enhances T2I models at inference by optimizing the initial noise based on the signal from one or multiple human preference reward models. Remarkably, solving this optimization problem with gradient ascent for 50 iterations yields impressive results on four different one-step models across two competitive benchmarks, T2I-CompBench and GenEval. Within a computational budget of 20-50 seconds, ReNO-enhanced one-step models consistently surpass the performance of all current open-source Text-to-Image models. Extensive user studies demonstrate that our model is preferred nearly twice as often compared to the popular SDXL model and is on par with the proprietary Stable Diffusion 3 with 8B parameters. Moreover, given the same computational resources, a ReNO-optimized one-step model outperforms widely-used open-source models such as SDXL and PixArt-α, highlighting the efficiency and effectiveness of ReNO in enhancing T2I model performance at inference time.
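At its core, ReNO is a short gradient-ascent loop over the initial noise. Below is a minimal sketch under stated assumptions: a differentiable one-step generator and differentiable reward models. The toy generate and reward callables are illustrative stand-ins, not the paper's implementation (see the linked GitHub for that).

    import torch

    def reno_optimize(generate, reward_fns, latent_shape, steps=50, lr=5.0):
        """Gradient ascent on the initial noise of a one-step T2I model,
        maximizing the summed scores of differentiable reward models."""
        noise = torch.randn(latent_shape, requires_grad=True)
        opt = torch.optim.SGD([noise], lr=lr)
        for _ in range(steps):
            image = generate(noise)                    # one-step generation
            loss = -sum(r(image) for r in reward_fns)  # negate to ascend rewards
            opt.zero_grad()
            loss.backward()
            opt.step()
        return noise.detach()

    # Toy stand-ins so the sketch runs end to end.
    generate = lambda z: torch.tanh(z)
    reward = lambda img: -(img - 0.5).pow(2).mean()
    z_star = reno_optimize(generate, [reward], (1, 4, 8, 8))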

MCML Authors: Luca Eyring, Shyamgopal Karthik, Karsten Roth and Prof. Dr. Zeynep Akata (all Interpretable and Reliable Machine Learning).


[17]
K. Roth, V. Udandarao, S. Dziadzio, A. Prabhu, M. Cherti, O. Vinyals, O. Hénaff, S. Albanie, M. Bethge and Z. Akata.
A Practitioner's Guide to Continual Multimodal Pretraining.
NeurIPS 2024 - 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. To be published. Preprint available. arXiv GitHub
Abstract

Multimodal foundation models serve numerous applications at the intersection of vision and language. Still, despite being pretrained on extensive data, they become outdated over time. To keep models updated, research into continual pretraining mainly explores scenarios with either (1) infrequent, indiscriminate updates on large-scale new data, or (2) frequent, sample-level updates. However, practical model deployment often operates in the gap between these two limit cases, as real-world applications often demand adaptation to specific subdomains, tasks or concepts – spread over the entire, varying life cycle of a model. In this work, we complement current perspectives on continual pretraining through a research test bed as well as provide comprehensive guidance for effective continual model updates in such scenarios. We first introduce FoMo-in-Flux, a continual multimodal pretraining benchmark with realistic compute constraints and practical deployment requirements, constructed over 63 datasets with diverse visual and semantic coverage. Using FoMo-in-Flux, we explore the complex landscape of practical continual pretraining through multiple perspectives: (1) A data-centric investigation of data mixtures and stream orderings that emulate real-world deployment situations, (2) a method-centric investigation ranging from simple fine-tuning and traditional continual learning strategies to parameter-efficient updates and model merging, (3) meta learning rate schedules and mechanistic design choices, and (4) the influence of model and compute scaling. Together, our insights provide a practitioner’s guide to continual multimodal pretraining for real-world deployment.

MCML Authors: Karsten Roth and Prof. Dr. Zeynep Akata (both Interpretable and Reliable Machine Learning).


[16]
A. Baumann, R. Li, M. Klasson, S. Mentu, S. Karthik, Z. Akata, A. Solin and M. Trapp.
Post-hoc Probabilistic Vision-Language Models.
Preprint (Dec. 2024). arXiv
Abstract

Vision-language models (VLMs), such as CLIP and SigLIP, have found remarkable success in classification, retrieval, and generative tasks. For this, VLMs deterministically map images and text descriptions to a joint latent space in which their similarity is assessed using the cosine similarity. However, a deterministic mapping of inputs fails to capture uncertainties over concepts arising from domain shifts when used in downstream tasks. In this work, we propose post-hoc uncertainty estimation in VLMs that does not require additional training. Our method leverages a Bayesian posterior approximation over the last layers in VLMs and analytically quantifies uncertainties over cosine similarities. We demonstrate its effectiveness for uncertainty quantification and support set selection in active learning. Compared to baselines, we obtain improved and well-calibrated predictive uncertainties, interpretable uncertainty estimates, and sample-efficient active learning. Our results show promise for safety-critical applications of large-scale models.
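To make the "post-hoc, no additional training" idea concrete, here is a generic last-layer Laplace sketch for a linear head under a Gaussian likelihood. This is textbook Bayesian linear-regression machinery offered as an assumption-laden illustration, not the paper's estimator for cosine-similarity uncertainties.

    import torch

    def last_layer_laplace(feats, prior_prec=1.0):
        """Posterior covariance of a linear last layer: inverse of
        (prior_prec * I + X^T X), shared across output dimensions (sketch)."""
        d = feats.shape[1]
        precision = prior_prec * torch.eye(d) + feats.T @ feats
        return torch.linalg.inv(precision)

    def predictive_variance(x, cov):
        """Predictive variance of the linear output for one input: x^T Sigma x."""
        return x @ cov @ x

    feats = torch.randn(100, 16)              # penultimate-layer features
    cov = last_layer_laplace(feats)
    var = predictive_variance(torch.randn(16), cov)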

MCML Authors: Shyamgopal Karthik and Prof. Dr. Zeynep Akata (both Interpretable and Reliable Machine Learning).


[15]
S. Dziadzio, V. Udandarao, K. Roth, A. Prabhu, Z. Akata, S. Albanie and M. Bethge.
How to Merge Your Multimodal Models Over Time?
Preprint (Dec. 2024). arXiv
Abstract

Model merging combines multiple expert models - finetuned from a base foundation model on diverse tasks and domains - into a single, more capable model. However, most existing model merging approaches assume that all experts are available simultaneously. In reality, new tasks and domains emerge progressively over time, requiring strategies to integrate the knowledge of expert models as they become available: a process we call temporal model merging. The temporal dimension introduces unique challenges not addressed in prior work, raising new questions such as: when training for a new task, should the expert model start from the merged past experts or from the original base model? Should we merge all models at each time step? Which merging techniques are best suited for temporal merging? Should different strategies be used to initialize the training and deploy the model? To answer these questions, we propose a unified framework called TIME - Temporal Integration of Model Expertise - which defines temporal model merging across three axes: (1) Initialization Phase, (2) Deployment Phase, and (3) Merging Technique. Using TIME, we study temporal model merging across model sizes, compute budgets, and learning horizons on the FoMo-in-Flux benchmark. Our comprehensive suite of experiments across TIME allows us to uncover key insights for temporal model merging, offering a better understanding of current challenges and best practices for effective temporal model merging.
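For intuition, the simplest merging technique such a framework covers is weighted parameter averaging, and the temporal setting folds each new expert into the running merge as it arrives. A minimal sketch, assuming architecturally identical models (illustrative, not the paper's code):

    import copy
    import torch

    def merge(models, coeffs):
        """Weighted parameter averaging of models sharing one architecture."""
        merged = copy.deepcopy(models[0])
        state = {k: sum(c * m.state_dict()[k].float() for m, c in zip(models, coeffs))
                 for k in merged.state_dict()}
        merged.load_state_dict(state)
        return merged

    def temporal_merge(base, experts, alpha=0.5):
        """Integrate experts one at a time as they become available."""
        current = base
        for expert in experts:
            current = merge([current, expert], [1 - alpha, alpha])
        return current

    # Toy usage with small linear models.
    merged = temporal_merge(torch.nn.Linear(4, 2),
                            [torch.nn.Linear(4, 2) for _ in range(3)])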

MCML Authors: Karsten Roth and Prof. Dr. Zeynep Akata (both Interpretable and Reliable Machine Learning).


[14]
S. Kim, R. Xiao, M.-I. Georgescu, S. Alaniz and Z. Akata.
COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training.
Preprint (Dec. 2024). arXiv
Abstract

Vision-Language Models (VLMs) trained with contrastive loss have achieved significant advancements in various vision and language tasks. However, the global nature of contrastive loss makes VLMs focus predominantly on foreground objects, neglecting other crucial information in the image, which limits their effectiveness in downstream tasks. To address these challenges, we propose COSMOS: CrOSs-MOdality Self-distillation for vision-language pre-training that integrates a novel text-cropping strategy and cross-attention module into a self-supervised learning framework. We create global and local views of images and texts (i.e., multi-modal augmentations), which are essential for self-distillation in VLMs. We further introduce a cross-attention module, enabling COSMOS to learn comprehensive cross-modal representations optimized via a cross-modality self-distillation loss. COSMOS consistently outperforms previous strong baselines on various zero-shot downstream tasks, including retrieval, classification, and semantic segmentation. Additionally, it surpasses CLIP-based models trained on larger datasets in visual perception and contextual understanding tasks.

MCML Authors: Sanghwan Kim, Dr. Iuliana Georgescu, Dr. Stephan Alaniz and Prof. Dr. Zeynep Akata (all Interpretable and Reliable Machine Learning).


[13]
R. Xiao, S. Kim, M.-I. Georgescu, Z. Akata and S. Alaniz.
FLAIR: VLM with Fine-grained Language-informed Image Representations.
Preprint (Dec. 2024). arXiv GitHub
Abstract

CLIP has shown impressive results in aligning images and texts at scale. However, its ability to capture detailed visual features remains limited because CLIP matches images and texts at a global level. To address this issue, we propose FLAIR, Fine-grained Language-informed Image Representations, an approach that utilizes long and detailed image descriptions to learn localized image embeddings. By sampling diverse sub-captions that describe fine-grained details about an image, we train our vision-language model to produce not only global embeddings but also text-specific image representations. Our model introduces text-conditioned attention pooling on top of local image tokens to produce fine-grained image representations that excel at retrieving detailed image content. We achieve state-of-the-art performance both on existing multimodal retrieval benchmarks and on our newly introduced fine-grained retrieval task, which evaluates vision-language models’ ability to retrieve partial image content. Furthermore, our experiments demonstrate the effectiveness of FLAIR, trained on 30M image-text pairs, in capturing fine-grained visual information, including zero-shot semantic segmentation, outperforming models trained on billions of pairs.
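The text-conditioned attention pooling is compact enough to sketch: the text embedding serves as the query over local image tokens, yielding a text-specific image embedding. Shapes and names below are illustrative assumptions, not the released code.

    import torch
    import torch.nn.functional as F

    def text_conditioned_pool(text_emb, image_tokens):
        """Attention-pool local image tokens with the text embedding as query.
        text_emb: (B, D), image_tokens: (B, N, D) -> pooled (B, D)."""
        scale = text_emb.shape[-1] ** -0.5
        logits = torch.einsum('bd,bnd->bn', text_emb, image_tokens) * scale
        attn = F.softmax(logits, dim=-1)
        return torch.einsum('bn,bnd->bd', attn, image_tokens)

    pooled = text_conditioned_pool(torch.randn(2, 64), torch.randn(2, 49, 64))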

MCML Authors: Sanghwan Kim, Dr. Iuliana Georgescu, Prof. Dr. Zeynep Akata and Dr. Stephan Alaniz (all Interpretable and Reliable Machine Learning).


[12]
K. Roth, Z. Akata, D. Damen, I. Balažević and O. J. Hénaff.
Context-Aware Multimodal Pretraining.
Preprint (Nov. 2024). arXiv
Abstract

Large-scale multimodal representation learning successfully optimizes for zero-shot transfer at test time. Yet the standard pretraining paradigm (contrastive learning on large amounts of image-text data) does not explicitly encourage representations to support few-shot adaptation. In this work, we propose a simple, but carefully designed extension to multimodal pretraining which enables representations to accommodate additional context. Using this objective, we show that vision-language models can be trained to exhibit significantly increased few-shot adaptation: across 21 downstream tasks, we find up to four-fold improvements in test-time sample efficiency, and average few-shot adaptation gains of over 5%, while retaining zero-shot generalization performance across model scales and training durations. In particular, equipped with simple, training-free, metric-based adaptation mechanisms, our representations easily surpass more complex and expensive optimization-based schemes, vastly simplifying generalization to new domains.

MCML Authors: Karsten Roth and Prof. Dr. Zeynep Akata (both Interpretable and Reliable Machine Learning).


[11]
L. Girrbach, Y. Huang, S. Alaniz, T. Darrell and Z. Akata.
Revealing and Reducing Gender Biases in Vision and Language Assistants (VLAs).
Preprint (Oct. 2024). arXiv
Abstract

Pre-trained large language models (LLMs) have been reliably integrated with visual input for multimodal tasks. The widespread adoption of instruction-tuned image-to-text vision-language assistants (VLAs) like LLaVA and InternVL necessitates evaluating gender biases. We study gender bias in 22 popular open-source VLAs with respect to personality traits, skills, and occupations. Our results show that VLAs replicate human biases likely present in the data, such as real-world occupational imbalances. Similarly, they tend to attribute more skills and positive personality traits to women than to men, and we see a consistent tendency to associate negative personality traits with men. To eliminate the gender bias in these models, we find that finetuning-based debiasing methods achieve the best tradeoff between debiasing and retaining performance on downstream tasks. We argue for pre-deploying gender bias assessment in VLAs and motivate further development of debiasing strategies to ensure equitable societal outcomes.

MCML Authors: Leander Girrbach, Yiran Huang, Dr. Stephan Alaniz and Prof. Dr. Zeynep Akata (all Interpretable and Reliable Machine Learning).


[10]
S. Karthik, H. Coskun, Z. Akata, S. Tulyakov, J. Ren and A. Kag.
Scalable Ranked Preference Optimization for Text-to-Image Generation.
Preprint (Oct. 2024). arXiv
Abstract

Direct Preference Optimization (DPO) has emerged as a powerful approach to align text-to-image (T2I) models with human feedback. Unfortunately, successful application of DPO to T2I models requires a huge amount of resources to collect and label large-scale datasets, e.g., millions of generated paired images annotated with human preferences. In addition, these human preference datasets can get outdated quickly as the rapid improvements of T2I models lead to higher quality images. In this work, we investigate a scalable approach for collecting large-scale and fully synthetic datasets for DPO training. Specifically, the preferences for paired images are generated using a pre-trained reward function, eliminating the need for involving humans in the annotation process, greatly improving the dataset collection efficiency. Moreover, we demonstrate that such datasets allow averaging predictions across multiple models and collecting ranked preferences as opposed to pairwise preferences. Furthermore, we introduce RankDPO to enhance DPO-based methods using the ranking feedback. Applying RankDPO on SDXL and SD3-Medium models with our synthetically generated preference dataset ‘Syn-Pic’ improves both prompt-following (on benchmarks like T2I-Compbench, GenEval, and DPG-Bench) and visual quality (through user studies). This pipeline presents a practical and scalable solution to develop better preference datasets to enhance the performance of text-to-image models.
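The dataset-collection step described above is simple to sketch: score every generated candidate for a prompt with one or more pretrained reward models and keep the full ranking instead of a single preferred pair. The helper below is an illustrative sketch, not the paper's pipeline.

    import torch

    def rank_candidates(images, reward_fns):
        """Rank candidate images for one prompt by averaged reward-model
        scores, producing a ranked preference list rather than a single pair."""
        scores = torch.stack([
            torch.tensor([float(r(img)) for img in images])
            for r in reward_fns
        ]).mean(dim=0)
        order = torch.argsort(scores, descending=True)
        return order.tolist(), scores[order]

    # Toy usage with placeholder images and a dummy reward model.
    images = [torch.randn(3, 8, 8) for _ in range(4)]
    reward = lambda img: img.mean().item()
    ranking, ranked_scores = rank_candidates(images, [reward])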

MCML Authors: Shyamgopal Karthik and Prof. Dr. Zeynep Akata (both Interpretable and Reliable Machine Learning).


[9]
T. Uscidda, L. Eyring, K. Roth, F. J. Theis, Z. Akata and M. Cuturi.
Disentangled Representation Learning with the Gromov-Monge Gap.
Preprint (Oct. 2024). arXiv
Abstract

Learning disentangled representations from unlabelled data is a fundamental challenge in machine learning. Solving it may unlock other problems, such as generalization, interpretability, or fairness. Although remarkably challenging to solve in theory, disentanglement is often achieved in practice through prior matching. Furthermore, recent works have shown that prior matching approaches can be enhanced by leveraging geometrical considerations, e.g., by learning representations that preserve geometric features of the data, such as distances or angles between points. However, matching the prior while preserving geometric features is challenging, as a mapping that fully preserves these features while aligning the data distribution with the prior does not exist in general. To address these challenges, we introduce a novel approach to disentangled representation learning based on quadratic optimal transport. We formulate the problem using Gromov-Monge maps that transport one distribution onto another with minimal distortion of predefined geometric features, preserving them as much as can be achieved. To compute such maps, we propose the Gromov-Monge-Gap (GMG), a regularizer quantifying whether a map moves a reference distribution with minimal geometry distortion. We demonstrate the effectiveness of our approach for disentanglement across four standard benchmarks, outperforming other methods leveraging geometric considerations.
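In symbols (schematic LaTeX following the abstract rather than the paper's exact notation): with reference distribution \mu and geometric costs c_X on the data space and c_Y on the representation space, the quadratic distortion of a map T and the gap it induces can be written as

    \Delta(T) = \iint \big( c_X(x, x') - c_Y(T(x), T(x')) \big)^2 \, d\mu(x) \, d\mu(x'),
    \qquad
    \mathrm{GMG}(T) = \Delta(T) - \min_{S \,:\, S_\sharp \mu = T_\sharp \mu} \Delta(S),

so that GMG(T) vanishes exactly when T moves \mu onto T_\sharp\mu with the least possible distortion of the chosen geometric features.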

MCML Authors: Luca Eyring and Karsten Roth (Interpretable and Reliable Machine Learning), Prof. Dr. Fabian Theis (Mathematical Modelling of Biological Systems) and Prof. Dr. Zeynep Akata (Interpretable and Reliable Machine Learning).


[8]
A. Christensen, N. Mojab, K. Patel, K. Ahuja, Z. Akata, O. Winther, O. Gonzalez-Franco and A. Colaco.
Geometry Fidelity for Spherical Images.
ECCV 2024 - 18th European Conference on Computer Vision. Milano, Italy, Sep 29-Oct 04, 2024. DOI
Abstract

Spherical or omni-directional images offer an immersive visual format appealing to a wide range of computer vision applications. However, geometric properties of spherical images pose a major challenge for models and metrics designed for ordinary 2D images. Here, we show that direct application of Fréchet Inception Distance (FID) is insufficient for quantifying geometric fidelity in spherical images. We introduce two quantitative metrics accounting for geometric constraints, namely Omnidirectional FID (OmniFID) and Discontinuity Score (DS). OmniFID is an extension of FID tailored to additionally capture field-of-view requirements of the spherical format by leveraging cubemap projections. DS is a kernel-based seam alignment score of continuity across borders of 2D representations of spherical images. In experiments, OmniFID and DS quantify geometry fidelity issues that are undetected by FID.
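For reference, FID is the Fréchet distance between Gaussians fitted to Inception features of the real and generated sets; OmniFID applies the same machinery after cubemap projection. A standard implementation of the distance itself (the generic formula, not the paper's code):

    import numpy as np
    from scipy import linalg

    def frechet_distance(mu1, sigma1, mu2, sigma2):
        """||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2)), the FID core."""
        covmean = linalg.sqrtm(sigma1 @ sigma2)
        if np.iscomplexobj(covmean):   # sqrtm can return tiny imaginary noise
            covmean = covmean.real
        return float(np.sum((mu1 - mu2) ** 2)
                     + np.trace(sigma1 + sigma2 - 2.0 * covmean))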

MCML Authors: Prof. Dr. Zeynep Akata (Interpretable and Reliable Machine Learning).


[7]
T. Hummel, S. Karthik, M.-I. Georgescu and Z. Akata.
EgoCVR: An Egocentric Benchmark for Fine-Grained Composed Video Retrieval.
ECCV 2024 - 18th European Conference on Computer Vision. Milano, Italy, Sep 29-Oct 04, 2024. DOI GitHub
Abstract

In Composed Video Retrieval, a video and a textual description which modifies the video content are provided as inputs to the model. The aim is to retrieve the relevant video with the modified content from a database of videos. In this challenging task, the first step is to acquire large-scale training datasets and collect high-quality benchmarks for evaluation. In this work, we introduce EgoCVR, a new evaluation benchmark for fine-grained Composed Video Retrieval using large-scale egocentric video datasets. EgoCVR consists of 2,295 queries that specifically focus on high-quality temporal video understanding. We find that existing Composed Video Retrieval frameworks do not achieve the necessary high-quality temporal video understanding for this task. To address this shortcoming, we adapt a simple training-free method, propose a generic re-ranking framework for Composed Video Retrieval, and demonstrate that this achieves strong results on EgoCVR.

MCML Authors: Shyamgopal Karthik, Dr. Iuliana Georgescu and Prof. Dr. Zeynep Akata (all Interpretable and Reliable Machine Learning).


[6]
J. M. Kim, J. Bader, S. Alaniz, C. Schmid and Z. Akata.
DataDream: Few-shot Guided Dataset Generation.
ECCV 2024 - 18th European Conference on Computer Vision. Milano, Italy, Sep 29-Oct 04, 2024. DOI GitHub
Abstract

While text-to-image diffusion models have been shown to achieve state-of-the-art results in image synthesis, they have yet to prove their effectiveness in downstream applications. Previous work has proposed to generate data for image classifier training given limited real data access. However, these methods struggle to generate in-distribution images or depict fine-grained features, thereby hindering the generalization of classification models trained on synthetic datasets. We propose DataDream, a framework for synthesizing classification datasets that more faithfully represents the real data distribution when guided by few-shot examples of the target classes. DataDream fine-tunes LoRA weights for the image generation model on the few real images before generating the training data using the adapted model. We then fine-tune LoRA weights for CLIP using the synthetic data to improve downstream image classification over previous approaches on a large variety of datasets. We demonstrate the efficacy of DataDream through extensive experiments, surpassing state-of-the-art classification accuracy with few-shot data across 7 out of 10 datasets, while being competitive on the other 3. Additionally, we provide insights into the impact of various factors, such as the number of real-shot and generated images as well as the fine-tuning compute on model performance.

MCML Authors: Jae Myung Kim, Jessica Bader, Dr. Stephan Alaniz and Prof. Dr. Zeynep Akata (all Interpretable and Reliable Machine Learning).


[5]
M. Bini, K. Roth, Z. Akata and A. Khoreva.
ETHER: Efficient Finetuning of Large-Scale Models with Hyperplane Reflections.
ICML 2024 - 41st International Conference on Machine Learning. Vienna, Austria, Jul 21-27, 2024. URL GitHub
Abstract

Parameter-efficient finetuning (PEFT) has become ubiquitous to adapt foundation models to downstream task requirements while retaining their generalization ability. However, the amount of additionally introduced parameters and compute for successful adaptation and hyperparameter searches can explode quickly, especially when deployed at scale to serve numerous individual requests. To ensure effective, parameter-efficient, and hyperparameter-robust adaptation, we propose the ETHER transformation family, which performs Efficient fineTuning via HypErplane Reflections. By design, ETHER transformations require a minimal number of parameters, are less likely to deteriorate model performance, and exhibit robustness to hyperparameter and learning rate choices. In particular, we introduce ETHER and its relaxation ETHER+, which match or outperform existing PEFT methods with significantly fewer parameters (∼10-100 times lower than LoRA or OFT) across multiple image synthesis and natural language tasks without exhaustive hyperparameter tuning. Finally, we investigate the recent emphasis on Hyperspherical Energy retention for adaptation and raise questions on its practical utility.
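The building block is a Householder-style hyperplane reflection, H = I - 2uu^T for a unit vector u, applied multiplicatively to frozen pretrained weights so that only u is trained. A minimal sketch of this block (illustrative; see the linked GitHub for the actual implementation, including the ETHER+ relaxation):

    import torch

    def ether(W, u):
        """Apply the reflection H = I - 2 u u^T (u normalized to unit length)
        to a frozen weight matrix W; u holds the only trainable parameters."""
        u = u / u.norm()
        H = torch.eye(W.shape[0]) - 2.0 * torch.outer(u, u)
        return H @ W

    W = torch.randn(8, 8)                    # frozen pretrained weight
    u = torch.randn(8, requires_grad=True)   # adapter parameters
    W_adapted = ether(W, u)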

MCML Authors: Massimo Bini, Karsten Roth and Prof. Dr. Zeynep Akata (all Interpretable and Reliable Machine Learning).


[4]
M. Dani, M. J. Prakash, Z. Akata and S. Liebe.
SemioLLM: Assessing Large Language Models for Semiological Analysis in Epilepsy Research.
Preprint (Jul. 2024). arXiv
Abstract

Large Language Models have shown promising results in their ability to encode general medical knowledge in standard medical question-answering datasets. However, their potential application in clinical practice requires evaluation in domain-specific tasks, where benchmarks are largely missing. In this study, SemioLLM, we test the ability of state-of-the-art LLMs (GPT-3.5, GPT-4, Mixtral 8x7B, and Qwen-72B-Chat) to leverage their internal knowledge and reasoning for epilepsy diagnosis. Specifically, we obtain likelihood estimates linking unstructured text descriptions of seizures to seizure-generating brain regions, using an annotated clinical database containing 1269 entries. We evaluate the LLMs’ performance, confidence, reasoning, and citation abilities in comparison to clinical evaluation. Models achieve above-chance classification performance, and prompt engineering significantly improves their outcomes, with some models achieving close-to-clinical performance and reasoning. However, our analyses also reveal significant pitfalls, with several models being overly confident while showing poor performance, as well as exhibiting citation errors and hallucinations. In summary, our work provides the first extensive benchmark comparing current SOTA LLMs in the medical domain of epilepsy and highlights their ability to leverage unstructured texts from patients’ medical history to aid diagnostic processes in health care.

MCML Authors: Prof. Dr. Zeynep Akata (Interpretable and Reliable Machine Learning).


[3]
L. Thede, K. Roth, O. J. Hénaff, M. Bethge and Z. Akata.
Reflecting on the State of Rehearsal-free Continual Learning with Pretrained Models.
Preprint (Jun. 2024). arXiv
Abstract

With the advent and recent ubiquity of foundation models, continual learning (CL) has recently shifted from continual training from scratch to the continual adaptation of pretrained models, seeing particular success on rehearsal-free CL benchmarks (RFCL). To achieve this, most proposed methods adapt and restructure parameter-efficient finetuning techniques (PEFT) to suit the continual nature of the problem. Based most often on input-conditional query-mechanisms or regularizations on top of prompt- or adapter-based PEFT, these PEFT-style RFCL (P-RFCL) approaches report peak performances; often convincingly outperforming existing CL techniques. However, on the other end, critical studies have recently highlighted competitive results by training on just the first task or via simple non-parametric baselines. Consequently, questions arise about the relationship between methodological choices in P-RFCL and their reported high benchmark scores. In this work, we tackle these questions to better understand the true drivers behind strong P-RFCL performances, their placement with respect to recent first-task adaptation studies, and their relation to preceding CL standards such as EWC or SI. In particular, we show: (1) P-RFCL techniques relying on input-conditional query mechanisms work not because of, but rather despite, these mechanisms, collapsing towards standard PEFT shortcut solutions. (2) Indeed, we show how most often, P-RFCL techniques can be matched by a simple and lightweight PEFT baseline. (3) Using this baseline, we identify the implicit bound on tunable parameters when deriving RFCL approaches from PEFT methods as a potential denominator behind P-RFCL efficacy. Finally, we (4) better disentangle continual versus first-task adaptation, and (5) motivate standard RFCL techniques such as EWC or SI in light of recent P-RFCL methods.

MCML Authors: Karsten Roth and Prof. Dr. Zeynep Akata (both Interpretable and Reliable Machine Learning).


[2]
L. Eyring, D. Klein, T. Palla, N. Kilbertus, Z. Akata and F. J. Theis.
Unbalancedness in Neural Monge Maps Improves Unpaired Domain Translation.
ICLR 2024 - 12th International Conference on Learning Representations. Vienna, Austria, May 07-11, 2024. URL
Abstract

In optimal transport (OT), a Monge map is known as a mapping that transports a source distribution to a target distribution in the most cost-efficient way. Recently, multiple neural estimators for Monge maps have been developed and applied in diverse unpaired domain translation tasks, e.g. in single-cell biology and computer vision. However, the classic OT framework enforces mass conservation, which makes it prone to outliers and limits its applicability in real-world scenarios. The latter can be particularly harmful in OT domain translation tasks, where the relative position of a sample within a distribution is explicitly taken into account. While unbalanced OT tackles this challenge in the discrete setting, its integration into neural Monge map estimators has received limited attention. We propose a theoretically grounded method to incorporate unbalancedness into any Monge map estimator. We improve existing estimators to model cell trajectories over time and to predict cellular responses to perturbations. Moreover, our approach seamlessly integrates with the OT flow matching (OT-FM) framework. While we show that OT-FM performs competitively in image translation, we further improve performance by incorporating unbalancedness (UOT-FM), which better preserves relevant features. We hence establish UOT-FM as a principled method for unpaired image translation.
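For readers unfamiliar with the term: unbalanced OT relaxes the hard marginal constraints of classic OT by penalizing deviations of the plan's marginals from the source and target measures, typically with KL divergences. In standard notation (a generic formulation, not the paper's):

    \min_{\pi \geq 0} \int c(x, y) \, d\pi(x, y)
    + \lambda_1 \, \mathrm{KL}(\pi_1 \,\|\, \mu)
    + \lambda_2 \, \mathrm{KL}(\pi_2 \,\|\, \nu),

where \pi_1 and \pi_2 are the marginals of \pi; letting \lambda_1, \lambda_2 \to \infty recovers the mass-conserving (balanced) problem.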

MCML Authors: Luca Eyring (Interpretable and Reliable Machine Learning), Prof. Dr. Niki Kilbertus (Ethics in Systems Design and Machine Learning), Prof. Dr. Zeynep Akata (Interpretable and Reliable Machine Learning) and Prof. Dr. Fabian Theis (Mathematical Modelling of Biological Systems).


[1]
A. Höhl, I. Obadic, M. Á. F. Torres, H. Najjar, D. Oliveira, Z. Akata, A. Dengel and X. Zhu.
Opening the Black-Box: A Systematic Review on Explainable AI in Remote Sensing.
Preprint (Feb. 2024). arXiv
Abstract

In recent years, black-box machine learning approaches have become a dominant modeling paradigm for knowledge extraction in Remote Sensing. Despite the potential benefits of uncovering the inner workings of these models with explainable AI, a comprehensive overview summarizing the used explainable AI methods and their objectives, findings, and challenges in Remote Sensing applications is still missing. In this paper, we address this issue by performing a systematic review to identify the key trends of how explainable AI is used in Remote Sensing and shed light on novel explainable AI approaches and emerging directions that tackle specific Remote Sensing challenges. We also reveal the common patterns of explanation interpretation, discuss the extracted scientific insights in Remote Sensing, and reflect on the approaches used for explainable AI methods evaluation. Our review provides a complete summary of the state-of-the-art in the field. Further, we give a detailed outlook on the challenges and promising research directions, representing a basis for novel methodological development and a useful starting point for new researchers in the field of explainable AI in Remote Sensing.

MCML Authors: Ivica Obadic (Data Science in Earth Observation), Prof. Dr. Zeynep Akata (Interpretable and Reliable Machine Learning) and Prof. Dr. Xiaoxiang Zhu (Data Science in Earth Observation).