B | Perception, Vision, and Natural Language Processing

Perception, Vision, and Natural Language Processing forms a dynamic research domain at the intersection of computer science and cognitive sciences. This field explores the synergies between diverse sensory inputs, visual information processing, and language understanding.

B1 | Computer Vision

In the thriving era of Computer Vision, MCML researchers tackle key challenges by innovating beyond convolutional neural networks. They focus on novel models capturing both pixel relationships and high-level interactions, explore unsupervised learning techniques, and extend analysis beyond 2D to understand the 3D world observed through cameras.

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence

Angela Dai

Prof. Dr.

3D Artificial Intelligence

Matthias Nießner

Prof. Dr.

Visual Computing

Björn Ommer

Prof. Dr.

Machine Vision & Learning

Nils Thuerey

Prof. Dr.

Physics-based Simulation

Rüdiger Westermann

Prof. Dr.

Computer Graphics & Visualization

Almut Sophia Koepke

Dr.

JRG Leader Multi-Modal Learning

Computer Vision & Artificial Intelligence

Publication in Research Area B1
[124]
Y.-J. Li, M. Gladkova, Y. Xia, R. Wang and D. Cremers.
VXP: Voxel-Cross-Pixel Large-scale Image-LiDAR Place Recognition.
3DV 2025 - 12th International Conference on 3D Vision. Singapore, Mar 25-28, 2025. To be published. Preprint available. arXiv
Abstract

Recent works on global place recognition treat the task as a retrieval problem, where an off-the-shelf global descriptor is commonly designed in image-based and LiDAR-based modalities. However, it is non-trivial to perform accurate image-LiDAR global place recognition since extracting consistent and robust global descriptors from different domains (2D images and 3D point clouds) is challenging. To address this issue, we propose a novel Voxel-Cross-Pixel (VXP) approach, which establishes voxel and pixel correspondences in a self-supervised manner and brings them into a shared feature space. Specifically, VXP is trained in a two-stage manner that first explicitly exploits local feature correspondences and then enforces similarity of global descriptors. Extensive experiments on three benchmarks (Oxford RobotCar, ViViD++ and KITTI) demonstrate that our method surpasses state-of-the-art cross-modal retrieval by a large margin.

MCML Authors
Mariia Gladkova

Computer Vision & Artificial Intelligence

Yan Xia

Dr.

Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[123]
B. Cong, N. Daheim, Y. Shen, D. Cremers, R. Yokota, M. Khan and T. Möllenhoff.
Variational Low-Rank Adaptation Using IVON.
FITML @NeurIPS 2024 - Workshop Fine-Tuning in Modern Machine Learning: Principles and Scalability at the 38th Conference on Neural Information Processing Systems (NeurIPS 2024). Vancouver, Canada, Dec 10-15, 2024. To be published. Preprint available. arXiv GitHub
Abstract

We show that variational learning can significantly improve the accuracy and calibration of Low-Rank Adaptation (LoRA) without a substantial increase in the cost. We replace AdamW by the Improved Variational Online Newton (IVON) algorithm to finetune large language models. For Llama-2 with 7 billion parameters, IVON improves the accuracy over AdamW by 2.8% and expected calibration error by 4.6%. The accuracy is also better than the other Bayesian alternatives, yet the cost is lower and the implementation is easier. Our work provides additional evidence for the effectiveness of IVON for large language models.
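The expected calibration error (ECE) quoted in this abstract is commonly computed by grouping predictions into confidence bins and averaging the gap between accuracy and confidence per bin. The following is a minimal sketch of that standard equal-width-bin definition. The function name, the bin count of 10, and the input layout are illustrative assumptions and are not taken from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sample-weighted mean of |accuracy - confidence|.

    confidences: predicted probability of the chosen class, each in [0, 1].
    correct:     booleans, True where the prediction was right.
    n_bins:      number of equal-width bins (assumed value, not from the paper).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one prediction is required")

    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, hit))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # Four predictions at 90% confidence, three of them correct.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9],
                                     [True, True, True, False]))
```

A lower value means the reported confidences track the observed accuracy more closely.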

MCML Authors
Yuesong Shen

Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[122]
L. Eyring, S. Karthik, K. Roth, A. Dosovitskiy and Z. Akata.
ReNO: Enhancing One-step Text-to-Image Models through Reward-based Noise Optimization.
NeurIPS 2024 - 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. To be published. Preprint available. arXiv GitHub
Abstract

Text-to-Image (T2I) models have made significant advancements in recent years, but they still struggle to accurately capture intricate details specified in complex compositional prompts. While fine-tuning T2I models with reward objectives has shown promise, it suffers from ‘reward hacking’ and may not generalize well to unseen prompt distributions. In this work, we propose Reward-based Noise Optimization (ReNO), a novel approach that enhances T2I models at inference by optimizing the initial noise based on the signal from one or multiple human preference reward models. Remarkably, solving this optimization problem with gradient ascent for 50 iterations yields impressive results on four different one-step models across two competitive benchmarks, T2I-CompBench and GenEval. Within a computational budget of 20-50 seconds, ReNO-enhanced one-step models consistently surpass the performance of all current open-source Text-to-Image models. Extensive user studies demonstrate that our model is preferred nearly twice as often compared to the popular SDXL model and is on par with the proprietary Stable Diffusion 3 with 8B parameters. Moreover, given the same computational resources, a ReNO-optimized one-step model outperforms widely-used open-source models such as SDXL and PixArt-α, highlighting the efficiency and effectiveness of ReNO in enhancing T2I model performance at inference time.
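The idea of optimizing the initial noise by gradient ascent on a reward can be illustrated with a toy problem. This is only a sketch under strong assumptions: `toy_generator` and `toy_reward` are hypothetical stand-ins for a one-step T2I model and a human-preference reward model, and the gradient is written out analytically, whereas a real system would backpropagate through both networks. Only the 50-iteration count comes from the abstract; the step size is an arbitrary choice.

```python
def toy_generator(noise):
    """Hypothetical one-step 'generator': maps noise to an output vector."""
    return [2.0 * z for z in noise]


def toy_reward(output, target):
    """Hypothetical reward: higher when the output is closer to the target."""
    return -sum((x - t) ** 2 for x, t in zip(output, target))


def reward_gradient_wrt_noise(noise, target):
    """Analytic gradient of toy_reward(toy_generator(noise)) w.r.t. noise.

    d/dz [-(2z - t)^2] = -4 * (2z - t)
    """
    return [-4.0 * (2.0 * z - t) for z, t in zip(noise, target)]


def optimize_noise(noise, target, steps=50, step_size=0.05):
    """Gradient ascent on the initial noise; the generator stays fixed."""
    noise = list(noise)
    for _ in range(steps):
        grad = reward_gradient_wrt_noise(noise, target)
        noise = [z + step_size * g for z, g in zip(noise, grad)]
    return noise


if __name__ == "__main__":
    initial_noise = [0.3, -1.2, 0.8, 2.0]
    target = [1.0, -1.0, 0.5, 0.0]
    tuned_noise = optimize_noise(initial_noise, target)
    print("reward before:", toy_reward(toy_generator(initial_noise), target))
    print("reward after: ", toy_reward(toy_generator(tuned_noise), target))
```

The point of the sketch is that no generator weights change: only the input noise is updated, which is what makes the approach an inference-time procedure.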

MCML Authors
Luca Eyring

Interpretable and Reliable Machine Learning

Shyamgopal Karthik

Interpretable and Reliable Machine Learning

Karsten Roth

Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning


[121]
F. Koehler, S. Niedermayr, R. Westermann and N. Thuerey.
APEBench: A Benchmark for Autoregressive Neural Emulators of PDEs.
NeurIPS 2024 - 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. To be published. Preprint available. arXiv
Abstract

We introduce the Autoregressive PDE Emulator Benchmark (APEBench), a comprehensive benchmark suite to evaluate autoregressive neural emulators for solving partial differential equations. APEBench is based on JAX and provides a seamlessly integrated differentiable simulation framework employing efficient pseudo-spectral methods, enabling 46 distinct PDEs across 1D, 2D, and 3D. Facilitating systematic analysis and comparison of learned emulators, we propose a novel taxonomy for unrolled training and introduce a unique identifier for PDE dynamics that directly relates to the stability criteria of classical numerical methods. APEBench enables the evaluation of diverse neural architectures, and unlike existing benchmarks, its tight integration of the solver enables support for differentiable physics training and neural-hybrid emulators. Moreover, APEBench emphasizes rollout metrics to understand temporal generalization, providing insights into the long-term behavior of emulating PDE dynamics. In several experiments, we highlight the similarities between neural emulators and numerical simulators.

MCML Authors
Rüdiger Westermann

Prof. Dr.

Computer Graphics & Visualization

Nils Thuerey

Prof. Dr.

Physics-based Simulation


[120]
K. Roth, V. Udandarao, S. Dziadzio, A. Prabhu, M. Cherti, O. Vinyals, O. Hénaff, S. Albanie, M. Bethge and Z. Akata.
A Practitioner's Guide to Continual Multimodal Pretraining.
NeurIPS 2024 - 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. To be published. Preprint available. arXiv GitHub
Abstract

Multimodal foundation models serve numerous applications at the intersection of vision and language. Still, despite being pretrained on extensive data, they become outdated over time. To keep models updated, research into continual pretraining mainly explores scenarios with either (1) infrequent, indiscriminate updates on large-scale new data, or (2) frequent, sample-level updates. However, practical model deployment often operates in the gap between these two limit cases, as real-world applications often demand adaptation to specific subdomains, tasks or concepts – spread over the entire, varying life cycle of a model. In this work, we complement current perspectives on continual pretraining through a research test bed as well as provide comprehensive guidance for effective continual model updates in such scenarios. We first introduce FoMo-in-Flux, a continual multimodal pretraining benchmark with realistic compute constraints and practical deployment requirements, constructed over 63 datasets with diverse visual and semantic coverage. Using FoMo-in-Flux, we explore the complex landscape of practical continual pretraining through multiple perspectives: (1) A data-centric investigation of data mixtures and stream orderings that emulate real-world deployment situations, (2) a method-centric investigation ranging from simple fine-tuning and traditional continual learning strategies to parameter-efficient updates and model merging, (3) meta learning rate schedules and mechanistic design choices, and (4) the influence of model and compute scaling. Together, our insights provide a practitioner’s guide to continual multimodal pretraining for real-world deployment.

MCML Authors
Karsten Roth

Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning


[119]
J. Wang, M. Ghahremani, Y. Li, B. Ommer and C. Wachinger.
Stable-Pose: Leveraging Transformers for Pose-Guided Text-to-Image Generation.
NeurIPS 2024 - 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. To be published. Preprint available. arXiv GitHub
Abstract

Controllable text-to-image (T2I) diffusion models have shown impressive performance in generating high-quality visual content through the incorporation of various conditions. Current methods, however, exhibit limited performance when guided by skeleton human poses, especially in complex pose conditions such as side or rear perspectives of human figures. To address this issue, we present Stable-Pose, a novel adapter model that introduces a coarse-to-fine attention masking strategy into a vision Transformer (ViT) to gain accurate pose guidance for T2I models. Stable-Pose is designed to adeptly handle pose conditions within pre-trained Stable Diffusion, providing a refined and efficient way of aligning pose representation during image synthesis. We leverage the query-key self-attention mechanism of ViTs to explore the interconnections among different anatomical parts in human pose skeletons. Masked pose images are used to smoothly refine the attention maps based on target pose-related features in a hierarchical manner, transitioning from coarse to fine levels. Additionally, our loss function is formulated to allocate increased emphasis to the pose region, thereby augmenting the model’s precision in capturing intricate pose details. We assessed the performance of Stable-Pose across five public datasets under a wide range of indoor and outdoor human pose scenarios. Stable-Pose achieved an AP score of 57.1 in the LAION-Human dataset, marking around 13% improvement over the established technique ControlNet.

MCML Authors
Morteza Ghahremani

Dr.

Artificial Intelligence in Radiology

Yitong Li

Artificial Intelligence in Radiology

Björn Ommer

Prof. Dr.

Machine Vision & Learning

Christian Wachinger

Prof. Dr.

Artificial Intelligence in Radiology


[118]
A. Baumann, R. Li, M. Klasson, S. Mentu, S. Karthik, Z. Akata, A. Solin and M. Trapp.
Post-hoc Probabilistic Vision-Language Models.
Preprint (Dec. 2024). arXiv
Abstract

Vision-language models (VLMs), such as CLIP and SigLIP, have found remarkable success in classification, retrieval, and generative tasks. For this, VLMs deterministically map images and text descriptions to a joint latent space in which their similarity is assessed using the cosine similarity. However, a deterministic mapping of inputs fails to capture uncertainties over concepts arising from domain shifts when used in downstream tasks. In this work, we propose post-hoc uncertainty estimation in VLMs that does not require additional training. Our method leverages a Bayesian posterior approximation over the last layers in VLMs and analytically quantifies uncertainties over cosine similarities. We demonstrate its effectiveness for uncertainty quantification and support set selection in active learning. Compared to baselines, we obtain improved and well-calibrated predictive uncertainties, interpretable uncertainty estimates, and sample-efficient active learning. Our results show promise for safety-critical applications of large-scale models.
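The deterministic pipeline this abstract starts from, where images and texts are embedded and compared by cosine similarity, can be sketched in a few lines. The embeddings below are made-up toy vectors and `rank_texts` is a hypothetical helper. The sketch shows only the deterministic similarity step, not the paper's Bayesian uncertainty estimate.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def rank_texts(image_embedding, text_embeddings):
    """Return text indices ordered from most to least similar to the image."""
    sims = [cosine_similarity(image_embedding, t) for t in text_embeddings]
    return sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)


if __name__ == "__main__":
    image = [1.0, 0.0]                               # toy image embedding
    texts = [[0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]    # toy text embeddings
    print(rank_texts(image, texts))
```

Because each input maps to a single point, this ranking carries no notion of how confident the model is in a given similarity score, which is the gap the abstract addresses.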

MCML Authors
Shyamgopal Karthik

Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning


[117]
S. Dziadzio, V. Udandarao, K. Roth, A. Prabhu, Z. Akata, S. Albanie and M. Bethge.
How to Merge Your Multimodal Models Over Time?
Preprint (Dec. 2024). arXiv
Abstract

Model merging combines multiple expert models - finetuned from a base foundation model on diverse tasks and domains - into a single, more capable model. However, most existing model merging approaches assume that all experts are available simultaneously. In reality, new tasks and domains emerge progressively over time, requiring strategies to integrate the knowledge of expert models as they become available: a process we call temporal model merging. The temporal dimension introduces unique challenges not addressed in prior work, raising new questions such as: when training for a new task, should the expert model start from the merged past experts or from the original base model? Should we merge all models at each time step? Which merging techniques are best suited for temporal merging? Should different strategies be used to initialize the training and deploy the model? To answer these questions, we propose a unified framework called TIME - Temporal Integration of Model Expertise - which defines temporal model merging across three axes: (1) Initialization Phase, (2) Deployment Phase, and (3) Merging Technique. Using TIME, we study temporal model merging across model sizes, compute budgets, and learning horizons on the FoMo-in-Flux benchmark. Our comprehensive suite of experiments across TIME allows us to uncover key insights for temporal model merging, offering a better understanding of current challenges and best practices for effective temporal model merging.

MCML Authors
Karsten Roth

Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning


[116]
F. Fundel, J. Schusterbauer, V. T. Hu and B. Ommer.
Distillation of Diffusion Features for Semantic Correspondence.
Preprint (Dec. 2024). arXiv
Abstract

Semantic correspondence, the task of determining relationships between different parts of images, underpins various applications including 3D reconstruction, image-to-image translation, object tracking, and visual place recognition. Recent studies have begun to explore representations learned in large generative image models for semantic correspondence, demonstrating promising results. Building on this progress, current state-of-the-art methods rely on combining multiple large models, resulting in high computational demands and reduced efficiency. In this work, we address this challenge by proposing a more computationally efficient approach. We propose a novel knowledge distillation technique to overcome the problem of reduced efficiency. We show how to use two large vision foundation models and distill the capabilities of these complementary models into one smaller model that maintains high accuracy at reduced computational cost. Furthermore, we demonstrate that by incorporating 3D data, we are able to further improve performance, without the need for human-annotated correspondences. Overall, our empirical results demonstrate that our distilled model with 3D data augmentation achieves performance superior to current state-of-the-art methods while significantly reducing computational load and enhancing practicality for real-world applications, such as semantic video correspondence. Our code and weights are publicly available on our project page.

MCML Authors
Björn Ommer

Prof. Dr.

Machine Vision & Learning


[115]
V. T. Hu and B. Ommer.
[MASK] is All You Need.
Preprint (Dec. 2024). arXiv
Abstract

In generative models, two paradigms have gained traction in various applications: next-set prediction-based Masked Generative Models and next-noise prediction-based Non-Autoregressive Models, e.g., Diffusion Models. In this work, we propose using discrete-state models to connect them and explore their scalability in the vision domain. First, we conduct a step-by-step analysis in a unified design space across the two types of models, including timestep-independence, noise schedule, temperature, guidance strength, etc., in a scalable manner. Second, we re-cast typical discriminative tasks, e.g., image segmentation, as an unmasking process from [MASK] tokens on a discrete-state model. This enables us to perform various sampling processes, including flexible conditional sampling, by training only once to model the joint distribution. All aforementioned explorations lead to our framework named Discrete Interpolants, which enables us to achieve state-of-the-art or competitive performance compared to previous discrete-state based methods on various benchmarks, such as ImageNet256, MS COCO, and the video dataset FaceForensics. In summary, by leveraging [MASK] in discrete-state models, we can bridge Masked Generative and Non-autoregressive Diffusion models, as well as generative and discriminative tasks.

MCML Authors
Björn Ommer

Prof. Dr.

Machine Vision & Learning


[114]
S. Kim, R. Xiao, M.-I. Georgescu, S. Alaniz and Z. Akata.
COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training.
Preprint (Dec. 2024). arXiv
Abstract

Vision-Language Models (VLMs) trained with contrastive loss have achieved significant advancements in various vision and language tasks. However, the global nature of contrastive loss makes VLMs focus predominantly on foreground objects, neglecting other crucial information in the image, which limits their effectiveness in downstream tasks. To address these challenges, we propose COSMOS: CrOSs-MOdality Self-distillation for vision-language pre-training that integrates a novel text-cropping strategy and cross-attention module into a self-supervised learning framework. We create global and local views of images and texts (i.e., multi-modal augmentations), which are essential for self-distillation in VLMs. We further introduce a cross-attention module, enabling COSMOS to learn comprehensive cross-modal representations optimized via a cross-modality self-distillation loss. COSMOS consistently outperforms previous strong baselines on various zero-shot downstream tasks, including retrieval, classification, and semantic segmentation. Additionally, it surpasses CLIP-based models trained on larger datasets in visual perception and contextual understanding tasks.

MCML Authors
Sanghwan Kim

Interpretable and Reliable Machine Learning

Stephan Alaniz

Dr.

Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning


[113]
W. Li, W. Chen, S. Qian, J. Chen, D. Cremers and H. Li.
DynSUP: Dynamic Gaussian Splatting from An Unposed Image Pair.
Preprint (Dec. 2024). arXiv GitHub
Abstract

The problem of symbolic regression (SR) arises in many different applications, such as identifying physical laws or deriving mathematical equations describing the behavior of financial markets from given data. Various methods exist to address the problem of SR, often based on genetic programming. However, these methods are usually complicated and involve various hyperparameters. In this paper, we present our new approach ParFam that utilizes parametric families of suitable symbolic functions to translate the discrete symbolic regression problem into a continuous one, resulting in a more straightforward setup compared to current state-of-the-art methods. In combination with a global optimizer, this approach results in a highly effective method to tackle the problem of SR. We theoretically analyze the expressivity of ParFam and demonstrate its performance with extensive numerical experiments based on the common SR benchmark suite SRBench, showing that we achieve state-of-the-art results. Moreover, we present an extension incorporating a pre-trained transformer network DL-ParFam to guide ParFam, accelerating the optimization process by up to two orders of magnitude.

MCML Authors
Weirong Chen

Computer Vision & Artificial Intelligence

Shenhan Qian

Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence

Haoang Li

Dr.

* Former member


[112]
P. Ma, L. Rietdorf, D. Kotovenko, V. T. Hu and B. Ommer.
Does VLM Classification Benefit from LLM Description Semantics?
Preprint (Dec. 2024). arXiv
Abstract

Accurately describing images with text is a foundation of explainable AI. Vision-Language Models (VLMs) like CLIP have recently addressed this by aligning images and texts in a shared embedding space, expressing semantic similarities between vision and language embeddings. VLM classification can be improved with descriptions generated by Large Language Models (LLMs). However, it is difficult to determine the contribution of actual description semantics, as the performance gain may also stem from a semantic-agnostic ensembling effect, where multiple modified text prompts act as a noisy test-time augmentation for the original one. We propose an alternative evaluation scenario to decide if a performance boost of LLM-generated descriptions is caused by such a noise augmentation effect or rather by genuine description semantics. The proposed scenario avoids noisy test-time augmentation and ensures that genuine, distinctive descriptions cause the performance boost. Furthermore, we propose a training-free method for selecting discriminative descriptions that work independently of classname-ensembling effects. Our approach identifies descriptions that effectively differentiate classes within a local CLIP label neighborhood, improving classification accuracy across seven datasets. Additionally, we provide insights into the explainability of description-based image classification with VLMs.

MCML Authors
Pingchuan Ma

Machine Vision & Learning

Björn Ommer

Prof. Dr.

Machine Vision & Learning


[111]
N. Stracke, S. A. Baumann, K. Bauer, F. Fundel and B. Ommer.
CleanDIFT: Diffusion Features without Noise.
Preprint (Dec. 2024). arXiv
Abstract

Internal features from large-scale pre-trained diffusion models have recently been established as powerful semantic descriptors for a wide range of downstream tasks. Works that use these features generally need to add noise to images before passing them through the model to obtain the semantic features, as the models do not offer the most useful features when given images with little to no noise. We show that this noise has a critical impact on the usefulness of these features that cannot be remedied by ensembling with different random noises. We address this issue by introducing a lightweight, unsupervised fine-tuning method that enables diffusion backbones to provide high-quality, noise-free semantic features. We show that these features readily outperform previous diffusion features by a wide margin in a wide variety of extraction setups and downstream tasks, offering better performance than even ensemble-based methods at a fraction of the cost.

MCML Authors
Björn Ommer

Prof. Dr.

Machine Vision & Learning


[110]
J. Wang, Z. Qin, Y. Zhang, V. Hu, B. Ommer, R. Briq and S. Kesselheim.
Scaling Image Tokenizers with Grouped Spherical Quantization.
Preprint (Dec. 2024). arXiv
Abstract

Vision tokenizers have gained a lot of traction due to their scalability and compactness; however, previous works depend on old-school GAN-based hyperparameters and biased comparisons, and lack a comprehensive analysis of the scaling behaviours. To tackle these issues, we introduce Grouped Spherical Quantization (GSQ), featuring spherical codebook initialization and lookup regularization to constrain codebook latents to a spherical surface. Our empirical analysis of image tokenizer training strategies demonstrates that GSQ-GAN achieves superior reconstruction quality over state-of-the-art methods with fewer training iterations, providing a solid foundation for scaling studies. Building on this, we systematically examine the scaling behaviours of GSQ, specifically in latent dimensionality, codebook size, and compression ratios, and their impact on model performance. Our findings reveal distinct behaviours at high and low spatial compression levels, underscoring challenges in representing high-dimensional latent spaces. We show that GSQ can restructure high-dimensional latents into compact, low-dimensional spaces, thus enabling efficient scaling with improved quality. As a result, GSQ-GAN achieves a 16x down-sampling with a reconstruction FID (rFID) of 0.50.

MCML Authors
Björn Ommer

Prof. Dr.

Machine Vision & Learning


[109]
Y. Xia, Z. Li, Y.-J. Li, L. Shi, H. Cao, J. F. Henriques and D. Cremers.
UniLoc: Towards Universal Place Recognition Using Any Single Modality.
Preprint (Dec. 2024). arXiv GitHub
Abstract

To date, most place recognition methods focus on single-modality retrieval. While they perform well in specific environments, cross-modal methods offer greater flexibility by allowing seamless switching between map and query sources. They also promise to reduce computation requirements by having a unified model, and to achieve greater sample efficiency by sharing parameters. In this work, we develop a universal solution to place recognition, UniLoc, that works with any single query modality (natural language, image, or point cloud). UniLoc leverages recent advances in large-scale contrastive learning, and learns by matching hierarchically at two levels: instance-level matching and scene-level matching. Specifically, we propose a novel Self-Attention based Pooling (SAP) module to evaluate the importance of instance descriptors when aggregated into a place-level descriptor. Experiments on the KITTI-360 dataset demonstrate the benefits of cross-modality for place recognition, achieving superior performance in cross-modal settings and competitive results in uni-modal scenarios.

MCML Authors
Yan Xia

Dr.

Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[108]
Y. Xia, Y. Lu, R. Song, O. Dhaouadi, J. F. Henriques and D. Cremers.
TrafficLoc: Localizing Traffic Surveillance Cameras in 3D Scenes.
Preprint (Dec. 2024). arXiv GitHub
Abstract

We tackle the problem of localizing the traffic surveillance cameras in cooperative perception. To overcome the lack of large-scale real-world intersection datasets, we introduce Carla Intersection, a new simulated dataset with 75 urban and rural intersections in Carla. Moreover, we introduce a novel neural network, TrafficLoc, localizing traffic cameras within a 3D reference map. TrafficLoc employs a coarse-to-fine matching pipeline. For image-point cloud feature fusion, we propose a novel Geometry-guided Attention Loss to address cross-modal viewpoint inconsistencies. During coarse matching, we propose an Inter-Intra Contrastive Learning to achieve precise alignment while preserving distinctiveness among local intra-features within image patch-point group pairs. Besides, we introduce Dense Training Alignment with a soft-argmax operator to consider additional features when regressing the final position. Extensive experiments show that our TrafficLoc improves the localization accuracy over the state-of-the-art Image-to-point cloud registration methods by a large margin (up to 86%) on Carla Intersection and generalizes well to real-world data. TrafficLoc also achieves new SOTA performance on KITTI and NuScenes datasets, demonstrating strong localization ability across both in-vehicle and traffic cameras.
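The soft-argmax operator mentioned in this abstract replaces a hard argmax with a softmax-weighted average of candidate positions, so that every candidate contributes to the regressed value. The following is a minimal 1D sketch. The function name, the `temperature` parameter, and the scalar positions are illustrative assumptions and do not reproduce the paper's Dense Training Alignment.

```python
import math


def soft_argmax(scores, positions, temperature=1.0):
    """Softmax-weighted average of positions (a differentiable argmax proxy).

    scores:      one matching score per candidate.
    positions:   one scalar coordinate per candidate.
    temperature: lower values approach a hard argmax (assumed parameter).
    """
    if len(scores) != len(positions) or not scores:
        raise ValueError("scores and positions must be non-empty and aligned")
    # Subtract the maximum score for numerical stability.
    peak = max(scores)
    weights = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(weights)
    return sum((w / total) * p for w, p in zip(weights, positions))


if __name__ == "__main__":
    # Equal scores average the two candidate positions.
    print(soft_argmax([0.0, 0.0], [0.0, 2.0]))
    # A dominant score pulls the estimate toward its position.
    print(soft_argmax([0.0, 5.0], [0.0, 2.0]))
```

Unlike a hard argmax, the result varies smoothly with the scores, which is what allows gradients to reach all candidates during training.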

MCML Authors
Yan Xia

Dr.

Computer Vision & Artificial Intelligence

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[107]
R. Xiao, S. Kim, M.-I. Georgescu, Z. Akata and S. Alaniz.
FLAIR: VLM with Fine-grained Language-informed Image Representations.
Preprint (Dec. 2024). arXiv GitHub
Abstract

CLIP has shown impressive results in aligning images and texts at scale. However, its ability to capture detailed visual features remains limited because CLIP matches images and texts at a global level. To address this issue, we propose FLAIR, Fine-grained Language-informed Image Representations, an approach that utilizes long and detailed image descriptions to learn localized image embeddings. By sampling diverse sub-captions that describe fine-grained details about an image, we train our vision-language model to produce not only global embeddings but also text-specific image representations. Our model introduces text-conditioned attention pooling on top of local image tokens to produce fine-grained image representations that excel at retrieving detailed image content. We achieve state-of-the-art performance on both existing multimodal retrieval benchmarks and our newly introduced fine-grained retrieval task, which evaluates vision-language models’ ability to retrieve partial image content. Furthermore, our experiments demonstrate the effectiveness of FLAIR trained on 30M image-text pairs in capturing fine-grained visual information, including zero-shot semantic segmentation, outperforming models trained on billions of pairs.

MCML Authors
Sanghwan Kim

Interpretable and Reliable Machine Learning

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning

Stephan Alaniz

Dr.

Interpretable and Reliable Machine Learning


[106]
H. Zeng, M. Gao and D. Cremers.
CoE: Deep Coupled Embedding for Non-Rigid Point Cloud Correspondences.
Preprint (Dec. 2024). arXiv
Abstract

The interest in matching non-rigidly deformed shapes represented as raw point clouds is rising due to the proliferation of low-cost 3D sensors. Yet, the task is challenging since point clouds are irregular and there is a lack of intrinsic shape information. We propose to tackle these challenges by learning a new shape representation – a per-point high dimensional embedding, in an embedding space where semantically similar points share similar embeddings. The learned embedding has multiple beneficial properties: it is aware of the underlying shape geometry and is robust to shape deformations and various shape artefacts, such as noise and partiality. Consequently, this embedding can be directly employed to retrieve high-quality dense correspondences through a simple nearest neighbor search in the embedding space. Extensive experiments demonstrate new state-of-the-art results and robustness in numerous challenging non-rigid shape matching benchmarks and show its great potential in other shape analysis tasks, such as segmentation.
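Retrieving dense correspondences by nearest neighbor search in an embedding space, as described in this abstract, can be sketched with a brute-force search. The per-point embeddings here are toy 2D vectors, whereas the abstract describes learned high-dimensional embeddings. The function name and the use of squared Euclidean distance are assumptions for illustration.

```python
def nearest_neighbor_correspondences(source_embeddings, target_embeddings):
    """For each source point, return the index of the closest target point.

    Both arguments are lists of equal-dimension embedding vectors, one per
    point. Distance is squared Euclidean (an assumed choice).
    """
    if not target_embeddings:
        raise ValueError("target_embeddings must not be empty")

    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    matches = []
    for source in source_embeddings:
        best = min(
            range(len(target_embeddings)),
            key=lambda j: squared_distance(source, target_embeddings[j]),
        )
        matches.append(best)
    return matches


if __name__ == "__main__":
    source = [[0.0, 0.0], [1.0, 1.0]]
    target = [[1.1, 0.9], [0.1, -0.1]]
    # Source point 0 matches target 1; source point 1 matches target 0.
    print(nearest_neighbor_correspondences(source, target))
```

The brute-force loop costs one distance evaluation per source-target pair; for large point clouds a spatial index or approximate search would typically replace it.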

MCML Authors
Link to website

Maolin Gao

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[105]
V. Ehm, N. El Amrani, Y. Xie, L. Bastian, M. Gao, W. Wang, L. Sang, D. Cao, Z. Lähner, D. Cremers and F. Bernard.
Beyond Complete Shapes: A Quantitative Evaluation of 3D Shape Matching Algorithms.
Preprint (Nov. 2024). arXiv
Abstract

Finding correspondences between 3D shapes is an important and long-standing problem in computer vision, graphics and beyond. While approaches based on machine learning dominate modern 3D shape matching, almost all existing (learning-based) methods require that at least one of the involved shapes is complete. In contrast, the most challenging and arguably most practically relevant setting of matching partially observed shapes, is currently underexplored. One important factor is that existing datasets contain only a small number of shapes (typically below 100), which are unable to serve data-hungry machine learning approaches, particularly in the unsupervised regime. In addition, the type of partiality present in existing datasets is often artificial and far from realistic. To address these limitations and to encourage research on these relevant settings, we provide a generic and flexible framework for the procedural generation of challenging partial shape matching scenarios. Our framework allows for a virtually infinite generation of partial shape matching instances from a finite set of shapes with complete geometry. Further, we manually create cross-dataset correspondences between seven existing (complete geometry) shape matching datasets, leading to a total of 2543 shapes. Based on this, we propose several challenging partial benchmark settings, for which we evaluate respective state-of-the-art methods as baselines.

MCML Authors
Link to website

Viktoria Ehm

Computer Vision & Artificial Intelligence

Link to website

Lennart Bastian

Computer Aided Medical Procedures & Augmented Reality

Link to website

Maolin Gao

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[104]
Y.-J. Li, M. Gladkova, Y. Xia and D. Cremers.
SADG: Segment Any Dynamic Gaussian Without Object Trackers.
Preprint (Nov. 2024). arXiv
Abstract

Understanding dynamic 3D scenes is fundamental for various applications, including extended reality (XR) and autonomous driving. Effectively integrating semantic information into 3D reconstruction enables holistic representation that opens opportunities for immersive and interactive applications. We introduce SADG, Segment Any Dynamic Gaussian Without Object Trackers, a novel approach that combines dynamic Gaussian Splatting representation and semantic information without reliance on object IDs. In contrast to existing works, we do not rely on supervision based on object identities to enable consistent segmentation of dynamic 3D objects. To this end, we propose to learn semantically-aware features by leveraging masks generated from the Segment Anything Model (SAM) and utilizing our novel contrastive learning objective based on hard pixel mining. The learned Gaussian features can be effectively clustered without further post-processing. This enables fast computation for further object-level editing, such as object removal, composition, and style transfer by manipulating the Gaussians in the scene. We further extend several dynamic novel-view datasets with segmentation benchmarks to enable testing of learned feature fields from unseen viewpoints. We evaluate SADG on proposed benchmarks and demonstrate the superior performance of our approach in segmenting objects within dynamic scenes along with its effectiveness for further downstream editing tasks.

MCML Authors
Link to website

Mariia Gladkova

Computer Vision & Artificial Intelligence

Link to website

Yan Xia

Dr.

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[103]
Y. Ma, Q. Khan and D. Cremers.
MA-DV2F: A Multi-Agent Navigation Framework using Dynamic Velocity Vector Field.
Preprint (Nov. 2024). arXiv GitHub
Abstract

In this paper we propose MA-DV2F: Multi-Agent Dynamic Velocity Vector Field. It is a framework for simultaneously controlling a group of vehicles in challenging environments. DV2F is generated for each vehicle independently and provides a map of reference orientation and speed that a vehicle must attain at any point on the navigation grid such that it safely reaches its target. The field is dynamically updated depending on the speed and proximity of the ego-vehicle to other agents. This dynamic adaptation of the velocity vector field allows prevention of imminent collisions. Experimental results show that MA-DV2F outperforms concurrent methods in terms of safety, computational efficiency and accuracy in reaching the target when scaling to a large number of vehicles.

MCML Authors
Link to website

Qadeer Khan

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[102]
K. Roth, Z. Akata, D. Damen, I. Balažević and O. J. Hénaff.
Context-Aware Multimodal Pretraining.
Preprint (Nov. 2024). arXiv
Abstract

Large-scale multimodal representation learning successfully optimizes for zero-shot transfer at test time. Yet the standard pretraining paradigm (contrastive learning on large amounts of image-text data) does not explicitly encourage representations to support few-shot adaptation. In this work, we propose a simple, but carefully designed extension to multimodal pretraining which enables representations to accommodate additional context. Using this objective, we show that vision-language models can be trained to exhibit significantly increased few-shot adaptation: across 21 downstream tasks, we find up to four-fold improvements in test-time sample efficiency, and average few-shot adaptation gains of over 5%, while retaining zero-shot generalization performance across model scales and training durations. In particular, equipped with simple, training-free, metric-based adaptation mechanisms, our representations easily surpass more complex and expensive optimization-based schemes, vastly simplifying generalization to new domains.

MCML Authors
Link to website

Karsten Roth

Interpretable and Reliable Machine Learning

Link to Profile Zeynep Akata

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning


[101]
O. Wysocki, Y. Tan, T. Froech, Y. Xia, M. Wysocki, L. Hoegner, D. Cremers and C. Holst.
ZAHA: Introducing the Level of Facade Generalization and the Large-Scale Point Cloud Facade Semantic Segmentation Benchmark Dataset.
Preprint (Nov. 2024). arXiv
Abstract

Facade semantic segmentation is a long-standing challenge in photogrammetry and computer vision. Although the last decades have witnessed an influx of facade segmentation methods, there is a lack of comprehensive facade classes and data covering the architectural variability. In ZAHA, we introduce Level of Facade Generalization (LoFG), novel hierarchical facade classes designed based on international urban modeling standards, ensuring compatibility with challenging real-world classes and a uniform comparison of methods. Realizing the LoFG, we present the largest semantic 3D facade segmentation dataset to date, providing 601 million annotated points at five and 15 classes of LoFG2 and LoFG3, respectively. Moreover, we analyze the performance of baseline semantic segmentation methods on our introduced LoFG classes and data, complementing it with a discussion on the unresolved challenges for facade segmentation. We firmly believe that ZAHA shall facilitate further development of 3D facade semantic segmentation methods, enabling robust segmentation indispensable in creating urban digital twins.

MCML Authors
Link to website

Yan Xia

Dr.

Computer Vision & Artificial Intelligence

Link to website

Magdalena Wysocki

Computer Aided Medical Procedures & Augmented Reality

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[100]
W. Zhang, Q. Cheng, D. Skuddis, N. Zeller, D. Cremers and N. Haala.
HI-SLAM2: Geometry-Aware Gaussian SLAM for Fast Monocular Scene Reconstruction.
Preprint (Nov. 2024). arXiv GitHub
Abstract

We present HI-SLAM2, a geometry-aware Gaussian SLAM system that achieves fast and accurate monocular scene reconstruction using only RGB input. While existing Neural SLAM or 3DGS-based SLAM methods often trade off rendering quality against geometry accuracy, our research demonstrates that both can be achieved simultaneously with RGB input alone. The key idea of our approach is to enhance the ability for geometry estimation by combining easy-to-obtain monocular priors with learning-based dense SLAM, and then using 3D Gaussian splatting as our core map representation to efficiently model the scene. Upon loop closure, our method ensures on-the-fly global consistency through efficient pose graph bundle adjustment and instant map updates by explicitly deforming the 3D Gaussian units based on anchored keyframe updates. Furthermore, we introduce a grid-based scale alignment strategy to maintain improved scale consistency in prior depths for finer depth details. Through extensive experiments on Replica, ScanNet, and ScanNet++, we demonstrate significant improvements over existing Neural SLAM methods and even surpass RGB-D-based methods in both reconstruction and rendering quality.

MCML Authors
Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[99]
J. Meier, L. Scalerandi, O. Dhaouadi, J. Kaiser, N. Araslanov and D. Cremers.
CARLA Drone: Monocular 3D Object Detection from a Different Perspective.
DAGM-GCPR 2024 - German Conference on Pattern Recognition. Munich, Germany, Oct 10-13, 2024. To be published. Preprint available. arXiv
Abstract

Existing techniques for monocular 3D detection have a serious restriction. They tend to perform well only on a limited set of benchmarks, faring well either on ego-centric car views or on traffic camera views, but rarely on both. To encourage progress, this work advocates for an extended evaluation of 3D detection frameworks across different camera perspectives. We make two key contributions. First, we introduce the CARLA Drone dataset, CDrone. Simulating drone views, it substantially expands the diversity of camera perspectives in existing benchmarks. Despite its synthetic nature, CDrone represents a real-world challenge. To show this, we confirm that previous techniques struggle to perform well both on CDrone and a real-world 3D drone dataset. Second, we develop an effective data augmentation pipeline called GroundMix. Its distinguishing element is the use of the ground for creating 3D-consistent augmentation of a training image. GroundMix significantly boosts the detection accuracy of a lightweight one-stage detector. In our expanded evaluation, we achieve the average precision on par with or substantially higher than the previous state of the art across all tested datasets.

MCML Authors
Link to website

Johannes Meier

Computer Vision & Artificial Intelligence

Link to website

Nikita Araslanov

Dr.

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[98]
A. Saroha, M. Gladkova, C. Curreli, D. Muhle, T. Yenamandra and D. Cremers.
Gaussian Splatting in Style.
DAGM-GCPR 2024 - German Conference on Pattern Recognition. Munich, Germany, Oct 10-13, 2024. To be published. Preprint available. arXiv
Abstract

3D scene stylization extends the work of neural style transfer to 3D. A vital challenge in this problem is to maintain the uniformity of the stylized appearance across multiple views. A vast majority of the previous works achieve this by training a 3D model for every stylized image and a set of multi-view images. In contrast, we propose a novel architecture trained on a collection of style images that, at test time, produces real time high-quality stylized novel views. We choose the underlying 3D scene representation for our model as 3D Gaussian splatting. We take the 3D Gaussians and process them using a multi-resolution hash grid and a tiny MLP to obtain stylized views. The MLP is conditioned on different style codes for generalization to different styles during test time. The explicit nature of 3D Gaussians gives us inherent advantages over NeRF-based methods, including geometric consistency and a fast training and rendering regime. This enables our method to be useful for various practical use cases, such as augmented or virtual reality. We demonstrate that our method achieves state-of-the-art performance with superior visual quality on various indoor and outdoor real-world data.

MCML Authors
Link to website

Abhishek Saroha

Computer Vision & Artificial Intelligence

Link to website

Mariia Gladkova

Computer Vision & Artificial Intelligence

Link to website

Cecilia Curreli

Computer Vision & Artificial Intelligence

Link to website

Dominik Muhle

Computer Vision & Artificial Intelligence

Link to website

Tarun Yenamandra

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[97]
L. Girrbach, Y. Huang, S. Alaniz, T. Darrell and Z. Akata.
Revealing and Reducing Gender Biases in Vision and Language Assistants (VLAs).
Preprint (Oct. 2024). arXiv
Abstract

Pre-trained large language models (LLMs) have been reliably integrated with visual input for multimodal tasks. The widespread adoption of instruction-tuned image-to-text vision-language assistants (VLAs) like LLaVA and InternVL necessitates evaluating gender biases. We study gender bias in 22 popular open-source VLAs with respect to personality traits, skills, and occupations. Our results show that VLAs replicate human biases likely present in the data, such as real-world occupational imbalances. Similarly, they tend to attribute more skills and positive personality traits to women than to men, and we see a consistent tendency to associate negative personality traits with men. To eliminate the gender bias in these models, we find that finetuning-based debiasing methods achieve the best tradeoff between debiasing and retaining performance on downstream tasks. We argue for pre-deploying gender bias assessment in VLAs and motivate further development of debiasing strategies to ensure equitable societal outcomes.

MCML Authors
Link to website

Leander Girrbach

Interpretable and Reliable Machine Learning

Link to website

Yiran Huang

Interpretable and Reliable Machine Learning

Link to website

Stephan Alaniz

Dr.

Interpretable and Reliable Machine Learning

Link to Profile Zeynep Akata

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning


[96]
S. Karthik, H. Coskun, Z. Akata, S. Tulyakov, J. Ren and A. Kag.
Scalable Ranked Preference Optimization for Text-to-Image Generation.
Preprint (Oct. 2024). arXiv
Abstract

Direct Preference Optimization (DPO) has emerged as a powerful approach to align text-to-image (T2I) models with human feedback. Unfortunately, successful application of DPO to T2I models requires a huge amount of resources to collect and label large-scale datasets, e.g., millions of generated paired images annotated with human preferences. In addition, these human preference datasets can get outdated quickly as the rapid improvements of T2I models lead to higher quality images. In this work, we investigate a scalable approach for collecting large-scale and fully synthetic datasets for DPO training. Specifically, the preferences for paired images are generated using a pre-trained reward function, eliminating the need for involving humans in the annotation process, greatly improving the dataset collection efficiency. Moreover, we demonstrate that such datasets allow averaging predictions across multiple models and collecting ranked preferences as opposed to pairwise preferences. Furthermore, we introduce RankDPO to enhance DPO-based methods using the ranking feedback. Applying RankDPO on SDXL and SD3-Medium models with our synthetically generated preference dataset ‘Syn-Pic’ improves both prompt-following (on benchmarks like T2I-Compbench, GenEval, and DPG-Bench) and visual quality (through user studies). This pipeline presents a practical and scalable solution to develop better preference datasets to enhance the performance of text-to-image models.

MCML Authors
Link to website

Shyamgopal Karthik

Interpretable and Reliable Machine Learning

Link to Profile Zeynep Akata

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning


[95]
T. Uscidda, L. Eyring, K. Roth, F. J. Theis, Z. Akata and M. Cuturi.
Disentangled Representation Learning with the Gromov-Monge Gap.
Preprint (Oct. 2024). arXiv
Abstract

Learning disentangled representations from unlabelled data is a fundamental challenge in machine learning. Solving it may unlock other problems, such as generalization, interpretability, or fairness. Although remarkably challenging to solve in theory, disentanglement is often achieved in practice through prior matching. Furthermore, recent works have shown that prior matching approaches can be enhanced by leveraging geometrical considerations, e.g., by learning representations that preserve geometric features of the data, such as distances or angles between points. However, matching the prior while preserving geometric features is challenging, as a mapping that fully preserves these features while aligning the data distribution with the prior does not exist in general. To address these challenges, we introduce a novel approach to disentangled representation learning based on quadratic optimal transport. We formulate the problem using Gromov-Monge maps that transport one distribution onto another with minimal distortion of predefined geometric features, preserving them as much as can be achieved. To compute such maps, we propose the Gromov-Monge-Gap (GMG), a regularizer quantifying whether a map moves a reference distribution with minimal geometry distortion. We demonstrate the effectiveness of our approach for disentanglement across four standard benchmarks, outperforming other methods leveraging geometric considerations.

MCML Authors
Link to website

Luca Eyring

Interpretable and Reliable Machine Learning

Link to website

Karsten Roth

Interpretable and Reliable Machine Learning

Link to Profile Fabian Theis

Fabian Theis

Prof. Dr.

Mathematical Modelling of Biological Systems

Link to Profile Zeynep Akata

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning


[94]
A. Christensen, N. Mojab, K. Patel, K. Ahuja, Z. Akata, O. Winther, O. Gonzalez-Franco and A. Colaco.
Geometry Fidelity for Spherical Images.
ECCV 2024 - 18th European Conference on Computer Vision. Milano, Italy, Sep 29-Oct 04, 2024. DOI
Abstract

Spherical or omni-directional images offer an immersive visual format appealing to a wide range of computer vision applications. However, geometric properties of spherical images pose a major challenge for models and metrics designed for ordinary 2D images. Here, we show that direct application of Fréchet Inception Distance (FID) is insufficient for quantifying geometric fidelity in spherical images. We introduce two quantitative metrics accounting for geometric constraints, namely Omnidirectional FID (OmniFID) and Discontinuity Score (DS). OmniFID is an extension of FID tailored to additionally capture field-of-view requirements of the spherical format by leveraging cubemap projections. DS is a kernel-based seam alignment score of continuity across borders of 2D representations of spherical images. In experiments, OmniFID and DS quantify geometry fidelity issues that are undetected by FID.

MCML Authors
Link to Profile Zeynep Akata

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning


[93]
J. S. Fischer, M. Gui, P. Ma, N. Stracke, S. A. Baumann and B. Ommer.
FMBoost: Boosting Latent Diffusion with Flow Matching.
ECCV 2024 - 18th European Conference on Computer Vision. Milano, Italy, Sep 29-Oct 04, 2024. DOI
Abstract

Visual synthesis has recently seen significant leaps in performance, largely due to breakthroughs in generative models. Diffusion models have been a key enabler, as they excel in image diversity. However, this comes at the cost of slow training and synthesis, which is only partially alleviated by latent diffusion. To this end, flow matching is an appealing approach due to its complementary characteristics of faster training and inference but less diverse synthesis. We demonstrate our FMBoost approach, which introduces flow matching between a frozen diffusion model and a convolutional decoder that enables high-resolution image synthesis at reduced computational cost and model size. A small diffusion model can then effectively provide the necessary visual diversity, while flow matching efficiently enhances resolution and detail by mapping the small to a high-dimensional latent space, producing high-resolution images. Combining the diversity of diffusion models, the efficiency of flow matching, and the effectiveness of convolutional decoders, state-of-the-art high-resolution image synthesis is achieved at 1024² pixels with minimal computational cost. Cascading FMBoost optionally boosts this further to 2048² pixels. Importantly, this approach is orthogonal to recent approximation and speed-up strategies for the underlying model, making it easily integrable into the various diffusion model frameworks.

MCML Authors
Link to website

Pingchuan Ma

Machine Vision & Learning

Link to Profile Björn Ommer

Björn Ommer

Prof. Dr.

Machine Vision & Learning


[92]
L. Härenstam-Nielsen, L. Sang, A. Saroha, N. Araslanov and D. Cremers.
DiffCD: A Symmetric Differentiable Chamfer Distance for Neural Implicit Surface Fitting.
ECCV 2024 - 18th European Conference on Computer Vision. Milano, Italy, Sep 29-Oct 04, 2024. DOI GitHub
Abstract

Neural implicit surfaces can be used to recover accurate 3D geometry from imperfect point clouds. In this work, we show that state-of-the-art techniques work by minimizing an approximation of a one-sided Chamfer distance. This shape metric is not symmetric, as it only ensures that the point cloud is near the surface but not vice versa. As a consequence, existing methods can produce inaccurate reconstructions with spurious surfaces. Although one approach against spurious surfaces has been widely used in the literature, we theoretically and experimentally show that it is equivalent to regularizing the surface area, resulting in over-smoothing. As a more appealing alternative, we propose DiffCD, a novel loss function corresponding to the symmetric Chamfer distance. In contrast to previous work, DiffCD also assures that the surface is near the point cloud, which eliminates spurious surfaces without the need for additional regularization. We experimentally show that DiffCD reliably recovers a high degree of shape detail, substantially outperforming existing work across varying surface complexity and noise levels.
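
The difference between the one-sided and the symmetric Chamfer distance can be seen in a small point-set example. This is a sketch of the classic discrete Chamfer distance between two sampled point sets, not the DiffCD loss itself, which is formulated for neural implicit surfaces; the function names and the toy points are assumptions for illustration.

```python
import numpy as np

def pairwise_dist(a, b):
    """Euclidean distances between all points of a (n, 3) and b (m, 3)."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

def one_sided_chamfer(a, b):
    """Mean distance from each point in a to its nearest point in b."""
    return pairwise_dist(a, b).min(axis=1).mean()

def symmetric_chamfer(a, b):
    """Sum of both one-sided terms, so each set must lie near the other."""
    return one_sided_chamfer(a, b) + one_sided_chamfer(b, a)

# Toy data (assumption): the "surface" samples cover the point cloud
# but also contain one spurious sample far away from it.
cloud = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
surface = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 5.0, 0.0]])

cloud_to_surface = one_sided_chamfer(cloud, surface)  # 0.0: blind to the spurious sample
both_directions = symmetric_chamfer(cloud, surface)   # 5/3: penalises the spurious sample
```

The one-sided term is zero even though the surface contains a stray sample, while the symmetric version is not, which mirrors the spurious-surface problem described in the abstract.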

MCML Authors
Link to website

Abhishek Saroha

Computer Vision & Artificial Intelligence

Link to website

Nikita Araslanov

Dr.

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[91]
V. T. Hu, S. A. Baumann, M. Gui, O. Grebenkova, P. Ma, J. Fischer and B. Ommer.
ZigMa: A DiT-style Zigzag Mamba Diffusion Model.
ECCV 2024 - 18th European Conference on Computer Vision. Milano, Italy, Sep 29-Oct 04, 2024. DOI
Abstract

The diffusion model has long been plagued by scalability and quadratic complexity issues, especially within transformer-based structures. In this study, we aim to leverage the long sequence modeling capability of a State-Space Model called Mamba to extend its applicability to visual data generation. Firstly, we identify a critical oversight in most current Mamba-based vision methods, namely the lack of consideration for spatial continuity in the scan scheme of Mamba. Secondly, building upon this insight, we introduce Zigzag Mamba, a simple, plug-and-play, minimal-parameter burden, DiT style solution, which outperforms Mamba-based baselines and demonstrates improved speed and memory utilization compared to transformer-based baselines. Moreover, this heterogeneous layerwise scan incurs zero additional memory and speed burden when we consider more scan paths. Lastly, we integrate Zigzag Mamba with the Stochastic Interpolant framework to investigate the scalability of the model on large-resolution visual datasets, such as FacesHQ, UCF101, MultiModal-CelebA-HQ, and MS COCO.
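
The notion of spatial continuity in a scan scheme can be illustrated with a row-wise zigzag ordering of image patches. This is a minimal sketch of one possible zigzag path under the assumption of a regular patch grid; the function name is hypothetical and the exact scan paths used in the paper are not described in the abstract.

```python
import numpy as np

def zigzag_order(height, width):
    """Return flat patch indices of a height x width grid in row-wise zigzag order.

    Even rows are read left to right, odd rows right to left, so every pair
    of consecutive patches in the sequence is spatially adjacent.
    """
    grid = np.arange(height * width).reshape(height, width)
    grid[1::2] = grid[1::2, ::-1].copy()  # reverse every second row
    return grid.reshape(-1)

order = zigzag_order(3, 3)  # array([0, 1, 2, 5, 4, 3, 6, 7, 8])
```

A plain raster scan would instead jump from the end of one row to the start of the next, placing two spatially distant patches next to each other in the sequence.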

MCML Authors
Link to website

Olga Grebenkova

Machine Vision & Learning

Link to website

Pingchuan Ma

Machine Vision & Learning

Link to Profile Björn Ommer

Björn Ommer

Prof. Dr.

Machine Vision & Learning


[90]
T. Hummel, S. Karthik, M.-I. Georgescu and Z. Akata.
EgoCVR: An Egocentric Benchmark for Fine-Grained Composed Video Retrieval.
ECCV 2024 - 18th European Conference on Computer Vision. Milano, Italy, Sep 29-Oct 04, 2024. DOI GitHub
Abstract

In Composed Video Retrieval, a video and a textual description which modifies the video content are provided as inputs to the model. The aim is to retrieve the relevant video with the modified content from a database of videos. In this challenging task, the first step is to acquire large-scale training datasets and collect high-quality benchmarks for evaluation. In this work, we introduce EgoCVR, a new evaluation benchmark for fine-grained Composed Video Retrieval using large-scale egocentric video datasets. EgoCVR consists of 2,295 queries that specifically focus on high-quality temporal video understanding. We find that existing Composed Video Retrieval frameworks do not achieve the necessary high-quality temporal video understanding for this task. To address this shortcoming, we adapt a simple training-free method, propose a generic re-ranking framework for Composed Video Retrieval, and demonstrate that this achieves strong results on EgoCVR.

MCML Authors
Link to website

Shyamgopal Karthik

Interpretable and Reliable Machine Learning

Link to Profile Zeynep Akata

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning


[89]
J. M. Kim, J. Bader, S. Alaniz, C. Schmid and Z. Akata.
DataDream: Few-shot Guided Dataset Generation.
ECCV 2024 - 18th European Conference on Computer Vision. Milano, Italy, Sep 29-Oct 04, 2024. DOI GitHub
Abstract

While text-to-image diffusion models have been shown to achieve state-of-the-art results in image synthesis, they have yet to prove their effectiveness in downstream applications. Previous work has proposed to generate data for image classifier training given limited real data access. However, these methods struggle to generate in-distribution images or depict fine-grained features, thereby hindering the generalization of classification models trained on synthetic datasets. We propose DataDream, a framework for synthesizing classification datasets that more faithfully represents the real data distribution when guided by few-shot examples of the target classes. DataDream fine-tunes LoRA weights for the image generation model on the few real images before generating the training data using the adapted model. We then fine-tune LoRA weights for CLIP using the synthetic data to improve downstream image classification over previous approaches on a large variety of datasets. We demonstrate the efficacy of DataDream through extensive experiments, surpassing state-of-the-art classification accuracy with few-shot data across 7 out of 10 datasets, while being competitive on the other 3. Additionally, we provide insights into the impact of various factors, such as the number of real-shot and generated images as well as the fine-tuning compute on model performance.
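
For readers unfamiliar with LoRA, the low-rank weight update it refers to can be sketched in a few lines. This shows generic LoRA on a single linear layer and is not the DataDream pipeline; the class name, rank, scaling, and initialisation values are assumptions chosen for illustration.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                                  # frozen, (out_dim, in_dim)
        self.A = rng.normal(scale=0.01, size=(rank, in_dim))  # trainable down-projection
        self.B = np.zeros((out_dim, rank))                    # trainable, zero-initialised
        self.scale = alpha / rank

    def __call__(self, x):
        # Equivalent to x @ (W + scale * B @ A).T without forming the full update.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

W = np.arange(6.0).reshape(2, 3)
layer = LoRALinear(W, rank=2)
y = layer(np.ones((4, 3)))  # equals the frozen layer's output while B is zero
```

Only `A` and `B` would be updated during fine-tuning, so the number of trainable parameters is `rank * (in_dim + out_dim)` rather than `in_dim * out_dim`.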

MCML Authors
Link to website

Jae Myung Kim

Interpretable and Reliable Machine Learning

Link to website

Jessica Bader

Interpretable and Reliable Machine Learning

Link to website

Stephan Alaniz

Dr.

Interpretable and Reliable Machine Learning

Link to Profile Zeynep Akata

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning


[88]
D. Kotovenko, O. Grebenkova, N. Sarafianos, A. Paliwal, P. Ma, O. Poursaeed, S. Mohan, Y. Fan, Y. Li, R. Ranjan and B. Ommer.
WaSt-3D: Wasserstein-2 Distance for Scene-to-Scene Stylization on 3D Gaussians.
ECCV 2024 - 18th European Conference on Computer Vision. Milano, Italy, Sep 29-Oct 04, 2024. DOI GitHub
Abstract

While style transfer techniques have been well-developed for 2D image stylization, the extension of these methods to 3D scenes remains relatively unexplored. Existing approaches demonstrate proficiency in transferring colors and textures but often struggle with replicating the geometry of the scenes. In our work, we leverage an explicit Gaussian Splatting (GS) representation and directly match the distributions of Gaussians between style and content scenes using the Earth Mover’s Distance (EMD). By employing the entropy-regularized Wasserstein-2 distance, we ensure that the transformation maintains spatial smoothness. Additionally, we decompose the scene stylization problem into smaller chunks to enhance efficiency. This paradigm shift reframes stylization from a pure generative process driven by latent space losses to an explicit matching of distributions between two Gaussian representations. Our method achieves high-resolution 3D stylization by faithfully transferring details from 3D style scenes onto the content scene. Furthermore, WaSt-3D consistently delivers results across diverse content and style scenes without necessitating any training, as it relies solely on optimization-based techniques.
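
Entropy-regularized optimal transport of the kind named in the abstract is commonly computed with Sinkhorn iterations. The sketch below applies textbook Sinkhorn updates to two small point sets with uniform weights; it is not the authors' implementation, and the function name, the regularisation strength `eps`, and the toy points are assumptions.

```python
import numpy as np

def sinkhorn_plan(x, y, eps=0.1, n_iters=200):
    """Entropy-regularised transport plan between point sets x (n, d) and y (m, d).

    Uses the squared Euclidean cost (the Wasserstein-2 ground cost) and
    uniform weights. Returns the plan and the transport cost under that plan.
    """
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)  # (n, m)
    kernel = np.exp(-cost / eps)                                # Gibbs kernel
    a = np.full(len(x), 1.0 / len(x))                           # source weights
    b = np.full(len(y), 1.0 / len(y))                           # target weights
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (kernel.T @ u)  # rescale to match the column marginals
        u = a / (kernel @ v)    # rescale to match the row marginals
    plan = u[:, None] * kernel * v[None, :]
    return plan, float((plan * cost).sum())

# Toy data (assumption): "content" and "style" positions in the unit cube.
content = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
style = np.array([[0.1, 0.0, 0.0], [0.9, 0.0, 0.0]])
plan, transport_cost = sinkhorn_plan(content, style)
```

A larger `eps` gives a smoother, more spread-out plan, while a smaller `eps` approaches the unregularised Earth Mover's Distance at the price of numerical underflow in the kernel, which practical implementations avoid by working in the log domain.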

MCML Authors
Link to website

Olga Grebenkova

Machine Vision & Learning

Link to website

Pingchuan Ma

Machine Vision & Learning

Link to Profile Björn Ommer

Björn Ommer

Prof. Dr.

Machine Vision & Learning


[87]
B. Liao, Z. Zhao, L. Chen, H. Li, D. Cremers and P. Liu.
GlobalPointer: Large-Scale Plane Adjustment with Bi-Convex Relaxation.
ECCV 2024 - 18th European Conference on Computer Vision. Milano, Italy, Sep 29-Oct 04, 2024. DOI GitHub
Abstract

Plane adjustment (PA) is crucial for many 3D applications, involving simultaneous pose estimation and plane recovery. Despite recent advancements, it remains a challenging problem in the realm of multi-view point cloud registration. Current state-of-the-art methods can achieve globally optimal convergence only with good initialization. Furthermore, their high time complexity renders them impractical for large-scale problems. To address these challenges, we first exploit a novel optimization strategy termed Bi-Convex Relaxation, which decouples the original problem into two simpler sub-problems, reformulates each sub-problem using a convex relaxation technique, and alternately solves each one until the original problem converges. Building on this strategy, we propose two algorithmic variants for solving the plane adjustment problem, namely GlobalPointer and GlobalPointer++, based on point-to-plane and plane-to-plane errors, respectively. Extensive experiments on both synthetic and real datasets demonstrate that our method can perform large-scale plane adjustment with linear time complexity, a larger convergence region, and robustness to poor initialization, while achieving accuracy similar to that of prior methods.

MCML Authors
Link to website

Haoang Li

Dr.

* Former member

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[86]
M. Mahajan, F. Hofherr and D. Cremers.
MeshFeat: Multi-Resolution Features for Neural Fields on Meshes.
ECCV 2024 - 18th European Conference on Computer Vision. Milano, Italy, Sep 29-Oct 04, 2024. DOI
Abstract

Parametric feature grid encodings have gained significant attention as an encoding approach for neural fields since they allow for much smaller MLPs, which significantly decreases the inference time of the models. In this work, we propose MeshFeat, a parametric feature encoding tailored to meshes, for which we adapt the idea of multi-resolution feature grids from Euclidean space. We start from the structure provided by the given vertex topology and use a mesh simplification algorithm to construct a multi-resolution feature representation directly on the mesh. The approach allows the usage of small MLPs for neural fields on meshes, and we show a significant speed-up compared to previous representations while maintaining comparable reconstruction quality for texture reconstruction and BRDF representation. Given its intrinsic coupling to the vertices, the method is particularly well-suited for representations on deforming meshes, making it a good fit for object animation.

MCML Authors
Link to website

Florian Hofherr

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[85]
N. Stracke, S. A. Baumann, J. M. Susskind, M. A. Bautista and B. Ommer.
CTRLorALTer: Conditional LoRAdapter for Efficient 0-Shot Control and Altering of T2I Models.
ECCV 2024 - 18th European Conference on Computer Vision. Milano, Italy, Sep 29-Oct 04, 2024. DOI GitHub
Abstract

Text-to-image generative models have become a prominent and powerful tool that excels at generating high-resolution realistic images. However, guiding the generative process of these models to take into account detailed forms of conditioning reflecting style and/or structure information remains an open problem. In this paper, we present LoRAdapter, an approach that unifies both style and structure conditioning under the same formulation using a novel conditional LoRA block that enables zero-shot control. LoRAdapter is an efficient and powerful approach to condition text-to-image diffusion models, which enables fine-grained control conditioning during generation and outperforms recent state-of-the-art approaches.

MCML Authors
Link to Profile Björn Ommer

Björn Ommer

Prof. Dr.

Machine Vision & Learning


[84]
S. Weber, J. H. Hong and D. Cremers.
Power Variable Projection for Initialization-Free Large-Scale Bundle Adjustment.
ECCV 2024 - 18th European Conference on Computer Vision. Milano, Italy, Sep 29-Oct 04, 2024. DOI
Abstract

Most Bundle Adjustment (BA) solvers like the Levenberg-Marquardt algorithm require a good initialization. In contrast, initialization-free BA remains largely uncharted territory. The under-explored Variable Projection algorithm (VarPro) exhibits a wide convergence basin even without initialization. Coupled with an object space error formulation, recent works have shown its ability to solve small-scale initialization-free bundle adjustment problems. To make such initialization-free BA approaches scalable, we introduce Power Variable Projection (PoVar), extending a recent inverse expansion method based on power series. Importantly, we link the power series expansion to Riemannian manifold optimization. This projective framework is crucial to solve large-scale bundle adjustment problems without initialization. Using the real-world BAL dataset, we experimentally demonstrate that our solver achieves state-of-the-art results in terms of speed and accuracy. To our knowledge, this work is the first to address the scalability of BA without initialization, opening new avenues for initialization-free structure-from-motion.

MCML Authors
Link to website

Simon Weber

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[83]
L. Yang, L. Hoyer, M. Weber, T. Fischer, D. Dai, L. Leal-Taixé, D. Cremers, M. Pollefeys and L. Van Gool.
MICDrop: Masking Image and Depth Features via Complementary Dropout for Domain-Adaptive Semantic Segmentation.
ECCV 2024 - 18th European Conference on Computer Vision. Milano, Italy, Sep 29-Oct 04, 2024. DOI GitHub
Abstract

Unsupervised Domain Adaptation (UDA) is the task of bridging the domain gap between a labeled source domain, e.g., synthetic data, and an unlabeled target domain. We observe that current UDA methods show inferior results on fine structures and tend to oversegment objects with ambiguous appearance. To address these shortcomings, we propose to leverage geometric information, i.e., depth predictions, as depth discontinuities often coincide with segmentation boundaries. We show that naively incorporating depth into current UDA methods does not fully exploit the potential of this complementary information. To this end, we present MICDrop, which learns a joint feature representation by masking image encoder features while inversely masking depth encoder features. With this simple yet effective complementary masking strategy, we enforce the use of both modalities when learning the joint feature representation. To aid this process, we propose a feature fusion module to improve both global as well as local information sharing while being robust to errors in the depth predictions. We show that our method can be plugged into various recent UDA methods and consistently improve results across standard UDA benchmarks, obtaining new state-of-the-art performances.
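
The complementary masking idea described above can be pictured with a small sketch. This is not the MICDrop code: the function name `complementary_mask`, the `drop_ratio` parameter, the per-location random mask, and the `(channels, height, width)` feature layout are assumptions, and the paper's feature fusion module is not reproduced here.

```python
import numpy as np

def complementary_mask(image_feat, depth_feat, drop_ratio=0.5, rng=None):
    """Mask image features and inversely mask depth features (illustrative only).

    Both inputs are assumed to have shape (channels, height, width).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Boolean (height, width) mask: True where the image feature is kept.
    keep_image = rng.random(image_feat.shape[-2:]) >= drop_ratio
    masked_image = image_feat * keep_image   # zero out dropped image locations
    masked_depth = depth_feat * ~keep_image  # keep depth exactly where image was dropped
    return masked_image, masked_depth, keep_image

rng = np.random.default_rng(0)
image_feat = np.ones((4, 6, 6))
depth_feat = np.ones((4, 6, 6))
masked_image, masked_depth, keep_image = complementary_mask(image_feat, depth_feat, 0.5, rng)
```

At every spatial location exactly one modality survives, so a downstream fusion step cannot rely on a single encoder alone.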

MCML Authors
Laura Leal-Taixé

Prof. Dr.

* Former member

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[82]
L. Sang, M. Gao, A. Saroha and D. Cremers.
Enhancing Surface Neural Implicits with Curvature-Guided Sampling and Uncertainty-Augmented Representations.
Wild3D 2024 - Workshop 3D Modeling, Reconstruction, and Generation in the Wild at the 18th European Conference on Computer Vision (ECCV 2024). Milano, Italy, Sep 29-Oct 04, 2024. URL
Abstract

Neural implicits are a widely used surface representation because they offer an adaptive resolution and support arbitrary topology changes. While previous works rely on ground truth point clouds or meshes, they often do not discuss the data acquisition and ignore the effect of input quality and sampling methods during reconstruction. In this paper, we introduce an uncertainty-augmented surface implicit representation together with a sampling technique that considers the geometric characteristics of the inputs. To this end, we introduce a strategy that efficiently computes differentiable geometric features, namely, mean curvatures, to guide the sampling phase during the training period. The uncertainty augmentation offers insights into the occupancy and reliability of the output signed distance value, thereby expanding representation capabilities into open surfaces. Finally, we demonstrate that our method improves the reconstruction of both synthetic and real-world data.

MCML Authors
Link to website

Maolin Gao

Computer Vision & Artificial Intelligence

Link to website

Abhishek Saroha

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[81]
L. Cheng, J. Hu, H. Yan, M. Gladkova, T. Huang, Y.-H. Liu, D. Cremers and H. Li.
Physically-Based Photometric Bundle Adjustment in Non-Lambertian Environments.
Preprint (Sep. 2024). arXiv
Abstract

Photometric bundle adjustment (PBA) is widely used in estimating the camera pose and 3D geometry by assuming a Lambertian world. However, the assumption of photometric consistency is often violated since non-diffuse reflection is common in real-world environments. The photometric inconsistency significantly affects the reliability of existing PBA methods. To solve this problem, we propose a novel physically-based PBA method. Specifically, we introduce physically-based weights regarding material, illumination, and light path. These weights distinguish the pixel pairs with different levels of photometric inconsistency. We also design corresponding models for material estimation based on sequential images and illumination estimation based on point clouds. In addition, we establish the first SLAM-related dataset of non-Lambertian scenes with complete ground truth of illumination and material. Extensive experiments demonstrate that our PBA method outperforms existing approaches in accuracy.

MCML Authors
Link to website

Mariia Gladkova

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence

Link to website

Haoang Li

Dr.

* Former member


[80]
Y. Ma, A. Li, Q. Khan and D. Cremers.
Enhancing the Performance of Multi-Vehicle Navigation in Unstructured Environments using Hard Sample Mining.
Preprint (Sep. 2024). arXiv GitHub
Abstract

Contemporary research in autonomous driving has demonstrated tremendous potential in emulating the traits of human driving. However, existing methods primarily cater to areas with well-built road infrastructure and appropriate traffic management systems. Therefore, in the absence of traffic signals or in unstructured environments, these self-driving algorithms are expected to fail. This paper proposes a strategy for autonomously navigating multiple vehicles in close proximity to their desired destinations without traffic rules in unstructured environments. Graph Neural Networks (GNNs) have demonstrated good utility for this task of multi-vehicle control. Among the different alternatives for training GNNs, supervised methods have proven to be the most data-efficient, although they require ground truth labels. However, these labels may not always be available, particularly in unstructured environments without traffic regulations. Therefore, a tedious optimization process may be required to determine them while ensuring that the vehicles reach their desired destination and do not collide with each other or any obstacles. To expedite the training process, it is thus essential to reduce the optimization time and select only those samples for labeling that add the most value to the training. In this paper, we propose a warm start method that first uses a pre-trained model trained on a simpler subset of data. Inference is then done on more complicated scenarios to determine the hard samples on which the model struggles most. This is measured by the difficulty vehicles encounter in reaching their desired destination without collision. Experimental results demonstrate that mining for hard samples in this manner reduces the requirement for supervised training data 10-fold.

MCML Authors
Link to website

Qadeer Khan

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[79]
M. Weber, L. Yu, Q. Yu, X. Deng, X. Shen, D. Cremers and L.-C. Chen.
MaskBit: Embedding-free Image Generation via Bit Tokens.
Preprint (Sep. 2024). arXiv
Abstract

Masked transformer models for class-conditional image generation have become a compelling alternative to diffusion models. Typically comprising two stages - an initial VQGAN model for transitioning between latent space and image space, and a subsequent Transformer model for image generation within latent space - these frameworks offer promising avenues for image synthesis. In this study, we present two primary contributions: first, an empirical and systematic examination of VQGANs, leading to a modernized VQGAN; second, a novel embedding-free generation network operating directly on bit tokens - a binary quantized representation of tokens with rich semantics. The first contribution furnishes a transparent, reproducible, and high-performing VQGAN model, enhancing accessibility and matching the performance of current state-of-the-art methods while revealing previously undisclosed details. The second contribution demonstrates that embedding-free image generation using bit tokens achieves a new state-of-the-art FID of 1.52 on the ImageNet 256x256 benchmark, with a compact generator model of a mere 305M parameters.

MCML Authors
Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[78]
C. Tomani, D. Vilar, M. Freitag, C. Cherry, S. Naskar, M. Finkelstein, X. Garcia and D. Cremers.
Quality-Aware Translation Models: Efficient Generation and Quality Estimation in a Single Model.
ACL 2024 - 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI
Abstract

Maximum-a-posteriori (MAP) decoding is the most widely used decoding strategy for neural machine translation (NMT) models. The underlying assumption is that model probability correlates well with human judgment, with better translations getting assigned a higher score by the model. However, research has shown that this assumption does not always hold, and generation quality can be improved by decoding to optimize a utility function backed by a metric or quality-estimation signal, as is done by Minimum Bayes Risk (MBR) or Quality-Aware decoding. The main disadvantage of these approaches is that they require an additional model to calculate the utility function during decoding, significantly increasing the computational cost. In this paper, we propose to make the NMT models themselves quality-aware by training them to estimate the quality of their own output. Using this approach for MBR decoding, we can drastically reduce the size of the candidate list, resulting in a speed-up of two orders of magnitude. When applying our method to MAP decoding, we obtain quality gains similar or even superior to quality reranking approaches, but with the efficiency of single-pass decoding.

MCML Authors
Link to website

Christian Tomani

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[77]
M. Bini, K. Roth, Z. Akata and A. Khoreva.
ETHER: Efficient Finetuning of Large-Scale Models with Hyperplane Reflections.
ICML 2024 - 41st International Conference on Machine Learning. Vienna, Austria, Jul 21-27, 2024. URL GitHub
Abstract

Parameter-efficient finetuning (PEFT) has become ubiquitous to adapt foundation models to downstream task requirements while retaining their generalization ability. However, the amount of additionally introduced parameters and compute for successful adaptation and hyperparameter searches can explode quickly, especially when deployed at scale to serve numerous individual requests. To ensure effective, parameter-efficient, and hyperparameter-robust adaptation, we propose the ETHER transformation family, which performs Efficient fineTuning via HypErplane Reflections. By design, ETHER transformations require a minimal number of parameters, are less likely to deteriorate model performance, and exhibit robustness to hyperparameter and learning rate choices. In particular, we introduce ETHER and its relaxation ETHER+, which match or outperform existing PEFT methods with significantly fewer parameters (∼10-100 times lower than LoRA or OFT) across multiple image synthesis and natural language tasks without exhaustive hyperparameter tuning. Finally, we investigate the recent emphasis on Hyperspherical Energy retention for adaptation and raise questions on its practical utility.
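
To make the notion of a hyperplane reflection concrete, here is a minimal sketch using a Householder matrix. It assumes the reflection multiplies a frozen weight matrix from the left; the function name `householder`, the dimensions, and the random weight are illustrative and not taken from the ETHER code base.

```python
import numpy as np

def householder(u):
    """Reflection across the hyperplane orthogonal to u: H = I - 2 * u u^T / ||u||^2."""
    u = u / np.linalg.norm(u)
    return np.eye(len(u)) - 2.0 * np.outer(u, u)

rng = np.random.default_rng(0)
d = 8
weight = rng.normal(size=(d, d))  # stands in for a frozen pretrained weight (assumption)
u = rng.normal(size=d)            # in this sketch, the only trainable values: d numbers
H = householder(u)
adapted = H @ weight              # reflected weight used in place of the original
```

The reflection is orthogonal, so it preserves the Frobenius norm of the weight, and it always sits at a fixed distance from the identity (`||H - I||_F = 2`) regardless of `u`. In this sketch the adapter needs `d` parameters rather than `d * d`.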

MCML Authors
Link to website

Massimo Bini

Interpretable and Reliable Machine Learning

Link to website

Karsten Roth

Interpretable and Reliable Machine Learning

Link to Profile Zeynep Akata

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning


[76]
Y. Shen, N. Daheim, B. Cong, P. Nickl, G. M. Marconi, C. Bazan, R. Yokota, I. Gurevych, D. Cremers, M. E. Khan and T. Möllenhoff.
Variational Learning is Effective for Large Deep Networks.
ICML 2024 - 41st International Conference on Machine Learning. Vienna, Austria, Jul 21-27, 2024. URL GitHub
Abstract

We give extensive empirical evidence against the common belief that variational learning is ineffective for large neural networks. We show that an optimizer called Improved Variational Online Newton (IVON) consistently matches or outperforms Adam for training large networks such as GPT-2 and ResNets from scratch. IVON’s computational costs are nearly identical to Adam but its predictive uncertainty is better. We show several new use cases of IVON where we improve finetuning and model merging in Large Language Models, accurately predict generalization error, and faithfully estimate sensitivity to data. We find overwhelming evidence that variational learning is effective.

MCML Authors
Yuesong Shen

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[75]
F. Bongratz, V. Golkov, L. Mautner, L. Della Libera, F. Heetmeyer, F. Czaja, J. Rodemann and D. Cremers.
How to Choose a Reinforcement-Learning Algorithm.
Preprint (Jul. 2024). arXiv GitHub
Abstract

The field of reinforcement learning offers a large variety of concepts and methods to tackle sequential decision-making problems. This variety has become so large that choosing an algorithm for a task at hand can be challenging. In this work, we streamline the process of choosing reinforcement-learning algorithms and action-distribution families. We provide a structured overview of existing methods and their properties, as well as guidelines for when to choose which methods.

MCML Authors
Link to website

Fabian Bongratz

Artificial Intelligence in Radiology

Link to website

Vladimir Golkov

Dr.

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[74]
M. Dani, M. J. Prakash, Z. Akata and S. Liebe.
SemioLLM: Assessing Large Language Models for Semiological Analysis in Epilepsy Research.
Preprint (Jul. 2024). arXiv
Abstract

Large Language Models have shown promising results in their ability to encode general medical knowledge in standard medical question-answering datasets. However, their potential application in clinical practice requires evaluation in domain-specific tasks, where benchmarks are largely missing. In this study, SemioLLM, we test the ability of state-of-the-art LLMs (GPT-3.5, GPT-4, Mixtral 8x7B, and Qwen-72chat) to leverage their internal knowledge and reasoning for epilepsy diagnosis. Specifically, we obtain likelihood estimates linking unstructured text descriptions of seizures to seizure-generating brain regions, using an annotated clinical database containing 1269 entries. We evaluate the LLMs’ performance, confidence, reasoning, and citation abilities in comparison to clinical evaluation. Models achieve above-chance classification performance, with prompt engineering significantly improving their outcome and some models achieving close-to-clinical performance and reasoning. However, our analyses also reveal significant pitfalls, with several models being overly confident while showing poor performance, as well as exhibiting citation errors and hallucinations. In summary, our work provides the first extensive benchmark comparing current SOTA LLMs in the medical domain of epilepsy and highlights their ability to leverage unstructured texts from patients’ medical history to aid diagnostic processes in health care.

MCML Authors
Link to Profile Zeynep Akata

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning


[73]
Y. Xia, R. Ding, Z. Qin, G. Zhan, K. Zhou, L. Yang, H. Dong and D. Cremers.
TARGO: Benchmarking Target-driven Object Grasping under Occlusions.
Preprint (Jul. 2024). arXiv GitHub
Abstract

Recent advances in predicting 6D grasp poses from a single depth image have led to promising performance in robotic grasping. However, previous grasping models face challenges in cluttered environments where nearby objects impact the target object’s grasp. In this paper, we first establish a new benchmark dataset for TARget-driven Grasping under Occlusions, named TARGO. We make the following contributions: 1) We are the first to study the occlusion level of grasping. 2) We set up an evaluation benchmark consisting of large-scale synthetic data and part of real-world data, and we evaluated five grasp models and found that even the current SOTA model suffers when the occlusion level increases, leaving grasping under occlusion still a challenge. 3) We also generate a large-scale training dataset via a scalable pipeline, which can be used to boost the performance of grasping under occlusion and generalized to the real world. 4) We further propose a transformer-based grasping model involving a shape completion module, termed TARGO-Net, which performs most robustly as occlusion increases.

MCML Authors
Link to website

Yan Xia

Dr.

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[72]
M. Brahimi, B. Haefner, Z. Ye, B. Goldluecke and D. Cremers.
Sparse Views, Near Light: A Practical Paradigm for Uncalibrated Point-light Photometric Stereo.
CVPR 2024 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, WA, USA, Jun 17-21, 2024. DOI
Abstract

Neural approaches have shown significant progress on camera-based reconstruction. But they require either a fairly dense sampling of the viewing sphere, or pre-training on an existing dataset, thereby limiting their generalizability. In contrast, photometric stereo (PS) approaches have shown great potential for achieving high-quality reconstruction under sparse viewpoints. Yet, they are impractical because they typically require tedious laboratory conditions, are restricted to dark rooms, and are often multi-staged, making them subject to accumulated errors. To address these shortcomings, we propose an end-to-end uncalibrated multi-view PS framework for reconstructing high-resolution shapes acquired from sparse viewpoints in a real-world environment. We relax the dark room assumption, and allow a combination of static ambient lighting and dynamic near LED lighting, thereby enabling easy data capture outside the lab. Experimental validation confirms that it outperforms existing baseline approaches in the regime of sparse viewpoints by a large margin. This brings high-accuracy 3D reconstruction from the dark room to the real world, while maintaining a reasonable data capture complexity.

MCML Authors
Link to website

Zhenzhang Ye

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[71]
V. Ehm, M. Gao, P. Roetzer, M. Eisenberger, D. Cremers and F. Bernard.
Partial-to-Partial Shape Matching with Geometric Consistency.
CVPR 2024 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, WA, USA, Jun 17-21, 2024. DOI GitHub
Abstract

Finding correspondences between 3D shapes is an important and long-standing problem in computer vision, graphics and beyond. A prominent challenge is the partial-to-partial shape matching setting, which occurs when the shapes to match are only observed incompletely (e.g. from 3D scanning). Although partial-to-partial matching is a highly relevant setting in practice, it is rarely explored. Our work bridges the gap between existing (rather artificial) 3D full shape matching and partial-to-partial real-world settings by exploiting geometric consistency as a strong constraint. We demonstrate that it is indeed possible to solve this challenging problem in a variety of settings. For the first time, we achieve geometric consistency for partial-to-partial matching, which is realized by a novel integer non-linear program formalism building on triangle product spaces, along with a new pruning algorithm based on linear integer programming. Further, we generate a new inter-class dataset for partial-to-partial shape-matching. We show that our method outperforms current SOTA methods on both an established intra-class dataset and our novel inter-class dataset.

MCML Authors
Link to website

Viktoria Ehm

Computer Vision & Artificial Intelligence

Link to website

Maolin Gao

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[70]
K. Han, D. Muhle, F. Wimbauer and D. Cremers.
Boosting Self-Supervision for Single-View Scene Completion via Knowledge Distillation.
CVPR 2024 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, WA, USA, Jun 17-21, 2024. DOI
Abstract

Inferring scene geometry from images via Structure from Motion is a long-standing and fundamental problem in computer vision. While classical approaches and, more recently, depth map predictions only focus on the visible parts of a scene, the task of scene completion aims to reason about geometry even in occluded regions. With the popularity of neural radiance fields (NeRFs), implicit representations also became popular for scene completion by predicting so-called density fields. Unlike explicit approaches, e.g., voxel-based methods, density fields also allow for accurate depth prediction and novel-view synthesis via image-based rendering. In this work, we propose to fuse the scene reconstruction from multiple images and distill this knowledge into a more accurate single-view scene reconstruction. To this end, we propose Multi-View Behind the Scenes (MVBTS) to fuse density fields from multiple posed images, trained fully self-supervised only from image data. Using knowledge distillation, we use MVBTS to train a single-view scene completion network, called KDBTS, via direct supervision. It achieves state-of-the-art performance on occupancy prediction, especially in occluded regions.

MCML Authors
Link to website

Dominik Muhle

Computer Vision & Artificial Intelligence

Link to website

Felix Wimbauer

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[69]
S. Weber, T. Dagès, M. Gao and D. Cremers.
Finsler-Laplace-Beltrami Operators with Application to Shape Analysis.
CVPR 2024 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, WA, USA, Jun 17-21, 2024. DOI
Abstract

The Laplace-Beltrami operator (LBO) emerges from studying manifolds equipped with a Riemannian metric. It is often called the Swiss army knife of geometry processing as it captures intrinsic shape information and gives rise to heat diffusion, geodesic distances, and a multitude of shape descriptors. It also plays a central role in geometric deep learning. In this work, we explore Finsler manifolds as a generalization of Riemannian manifolds. We revisit the Finsler heat equation and derive a Finsler heat kernel and a Finsler-Laplace-Beltrami Operator (FLBO): a novel theoretically justified anisotropic Laplace-Beltrami operator (ALBO). In experimental evaluations we demonstrate that the proposed FLBO is a valuable alternative to the traditional Riemannian-based LBO and ALBOs for spatial filtering and shape correspondence estimation. We hope that the proposed Finsler heat kernel and the FLBO will inspire further exploration of Finsler geometry in the computer vision community.

MCML Authors
Link to website

Simon Weber

Computer Vision & Artificial Intelligence

Link to website

Maolin Gao

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[68]
S. Weber, B. Zöngür, N. Araslanov and D. Cremers.
Flattening the Parent Bias: Hierarchical Semantic Segmentation in the Poincaré Ball.
CVPR 2024 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, WA, USA, Jun 17-21, 2024. DOI
Abstract

Hierarchy is a natural representation of semantic taxonomies, including the ones routinely used in image segmentation. Indeed, recent work on semantic segmentation reports improved accuracy from supervised training leveraging hierarchical label structures. Encouraged by these results, we revisit the fundamental assumptions behind that work. We postulate and then empirically verify that the reasons for the observed improvement in segmentation accuracy may be entirely unrelated to the use of the semantic hierarchy. To demonstrate this, we design a range of cross-domain experiments with a representative hierarchical approach. We find that on the new testing domains, a flat (non-hierarchical) segmentation network, in which the parents are inferred from the children, has superior segmentation accuracy to the hierarchical approach across the board. Complementing these findings and inspired by the intrinsic properties of hyperbolic spaces, we study a more principled approach to hierarchical segmentation using the Poincaré ball model. The hyperbolic representation largely outperforms the previous (Euclidean) hierarchical approach as well and is on par with our flat Euclidean baseline in terms of segmentation accuracy. However, it additionally exhibits surprisingly strong calibration quality of the parent nodes in the semantic hierarchy, especially on the more challenging domains. Our combined analysis suggests that the established practice of hierarchical segmentation may be limited to in-domain settings, whereas flat classifiers generalize substantially better, especially if they are modeled in the hyperbolic space.
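
For reference, the geodesic distance in the Poincaré ball model with curvature -1 is the standard textbook expression below. The symbols `u` and `v` are generic points in the open unit ball and are not notation taken from the paper.

```latex
d_{\mathbb{B}}(u, v) = \operatorname{arcosh}\!\left( 1 + \frac{2\,\lVert u - v \rVert^{2}}{\bigl(1 - \lVert u \rVert^{2}\bigr)\bigl(1 - \lVert v \rVert^{2}\bigr)} \right),
\qquad \lVert u \rVert < 1,\ \lVert v \rVert < 1 .
```

Distances grow without bound as points approach the boundary of the ball, which is why this model is often used to embed tree-like hierarchies.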

MCML Authors
Link to website

Simon Weber

Computer Vision & Artificial Intelligence

Link to website

Nikita Araslanov

Dr.

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[67]
F. Wimbauer, B. Wu, E. Schoenfeld, X. Dai, J. Hou, Z. He, A. Sanakoyeu, P. Zhang, S. Tsai, J. Kohler, C. Rupprecht, D. Cremers, P. Vajda and J. Wang.
Cache Me if You Can: Accelerating Diffusion Models through Block Caching.
CVPR 2024 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, WA, USA, Jun 17-21, 2024. DOI GitHub
Abstract

Diffusion models have recently revolutionized the field of image synthesis due to their ability to generate photorealistic images. However, one of the major drawbacks of diffusion models is that the image generation process is costly. A large image-to-image network has to be applied many times to iteratively refine an image from random noise. While many recent works propose techniques to reduce the number of required steps, they generally treat the underlying denoising network as a black box. In this work, we investigate the behavior of the layers within the network and find that 1) the layers’ output changes smoothly over time, 2) the layers show distinct patterns of change, and 3) the change from step to step is often very small. We hypothesize that many layer computations in the denoising network are redundant. Leveraging this, we introduce block caching, in which we reuse outputs from layer blocks of previous steps to speed up inference. Furthermore, we propose a technique to automatically determine caching schedules based on each block’s changes over timesteps. In our experiments, we show through FID, human evaluation and qualitative analysis that Block Caching generates images with higher visual quality at the same computational cost. We demonstrate this for different state-of-the-art models (LDM and EMU) and solvers (DDIM and DPM).
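
The caching mechanism can be illustrated with a toy sketch. It is not the paper's implementation: the class `CachedBlock`, the function `denoise`, the hand-written schedule, and the toy update rule are all hypothetical, and the automatic schedule selection described in the abstract is not reproduced.

```python
import numpy as np

class CachedBlock:
    """Wraps a block function and reuses its last output when told to skip."""

    def __init__(self, fn):
        self.fn = fn
        self.cache = None
        self.calls = 0  # number of real evaluations, for inspection

    def __call__(self, x, recompute):
        if recompute or self.cache is None:
            self.cache = self.fn(x)
            self.calls += 1
        return self.cache

def denoise(x, blocks, schedule):
    """Run a toy denoising loop; schedule[t][i] says whether block i is recomputed at step t."""
    for step_flags in schedule:
        h = x
        for block, recompute in zip(blocks, step_flags):
            h = h + block(h, recompute)  # residual connection around each block
        x = x - 0.1 * h                  # toy update, not a real solver step
    return x

blocks = [CachedBlock(np.tanh), CachedBlock(lambda h: 0.5 * h)]
schedule = [(True, True), (False, True), (False, False), (True, True)]
result = denoise(np.ones(4), blocks, schedule)
```

With this schedule the first block is evaluated twice and the second three times over four steps; the remaining evaluations are served from the cache.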

MCML Authors
Link to website

Felix Wimbauer

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[66]
Y. Xia, L. Shi, Z. Ding, J. F. Henriques and D. Cremers.
Text2Loc: 3D Point Cloud Localization from Natural Language.
CVPR 2024 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, WA, USA, Jun 17-21, 2024. DOI GitHub
Abstract

We tackle the problem of 3D point cloud localization based on a few natural linguistic descriptions and introduce a novel neural network, Text2Loc, that fully interprets the semantic relationship between points and text. Text2Loc follows a coarse-to-fine localization pipeline: text-submap global place recognition, followed by fine localization. In global place recognition, relational dynamics among each textual hint are captured in a hierarchical transformer with max-pooling (HTM), whereas a balance between positive and negative pairs is maintained using text-submap contrastive learning. Moreover, we propose a novel matching-free fine localization method to further refine the location predictions, which completely removes the need for complicated text-instance matching and is lighter, faster, and more accurate than previous methods. Extensive experiments show that Text2Loc improves the localization accuracy by up to 2× over the state-of-the-art on the KITTI360Pose dataset.
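As a rough illustration of text-submap contrastive learning, the sketch below computes a symmetric InfoNCE-style loss over a batch of paired embeddings. The function name, the temperature value, and the exact loss form are assumptions for illustration and are not taken from the paper.

```python
import numpy as np


def contrastive_loss(text_emb: np.ndarray, submap_emb: np.ndarray,
                     temperature: float = 0.1) -> float:
    """Symmetric contrastive loss; row i of each array is assumed to be a matching pair."""
    # Compare directions only: normalise every embedding to unit length.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    s = submap_emb / np.linalg.norm(submap_emb, axis=1, keepdims=True)
    logits = t @ s.T / temperature  # pairwise cosine similarities, scaled

    def diagonal_cross_entropy(scores: np.ndarray) -> float:
        scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return float(-np.mean(np.diag(log_probs)))

    # Average the text-to-submap and submap-to-text directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))
```

The loss is low when each text embedding is closest to its own submap embedding and high when the pairing is scrambled, which is the behaviour retrieval-style place recognition relies on.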

MCML Authors
Link to website

Yan Xia

Dr.

Computer Vision & Artificial Intelligence

Link to website

Zifeng Ding

Database Systems & Data Mining

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[65]
C. Reich, B. Debnath, D. Patel, T. Prangemeier, D. Cremers and S. Chakradhar.
Deep Video Codec Control for Vision Models.
CVPR 2024 - Workshop at the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, WA, USA, Jun 17-21, 2024. DOI
Abstract

Standardized lossy video coding is at the core of almost all real-world video processing pipelines. Rate control is used to enable standard codecs to adapt to different network bandwidth conditions or storage constraints. However, standard video codecs (e.g., H.264) and their rate control modules aim to minimize video distortion w.r.t. human quality assessment. We demonstrate empirically that standard-coded videos vastly deteriorate the performance of deep vision models. To overcome the deterioration of vision performance, this paper presents the first end-to-end learnable deep video codec control that considers both bandwidth constraints and downstream deep vision performance, while adhering to existing standardization. We demonstrate that our approach better preserves downstream deep vision performance than traditional standard video coding.

MCML Authors
Link to website

Christoph Reich

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[64]
C. Reich, O. Hahn, D. Cremers, S. Roth and B. Debnath.
A Perspective on Deep Vision Performance with Standard Image and Video Codecs.
CVPR 2024 - Workshop at the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, WA, USA, Jun 17-21, 2024. DOI
Abstract

Resource-constrained hardware, such as edge devices or cell phones, often relies on cloud servers to provide the required computational resources for inference in deep vision models. However, transferring image and video data from an edge or mobile device to a cloud server requires coding to deal with network constraints. The use of standardized codecs, such as JPEG or H.264, is prevalent and required to ensure interoperability. This paper aims to examine the implications of employing standardized codecs within deep vision pipelines. We find that using JPEG and H.264 coding significantly deteriorates the accuracy across a broad range of vision tasks and models. For instance, strong compression rates reduce semantic segmentation accuracy by more than 80% in mIoU. In contrast to previous findings, our analysis extends beyond image and action classification to localization and dense prediction tasks, thus providing a more comprehensive perspective.

MCML Authors
Link to website

Christoph Reich

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[63]
L. Thede, K. Roth, O. J. Hénaff, M. Bethge and Z. Akata.
Reflecting on the State of Rehearsal-free Continual Learning with Pretrained Models.
Preprint (Jun. 2024). arXiv
Abstract

With the advent and recent ubiquity of foundation models, continual learning (CL) has recently shifted from continual training from scratch to the continual adaptation of pretrained models, seeing particular success on rehearsal-free CL benchmarks (RFCL). To achieve this, most proposed methods adapt and restructure parameter-efficient finetuning techniques (PEFT) to suit the continual nature of the problem. Based most often on input-conditional query-mechanisms or regularizations on top of prompt- or adapter-based PEFT, these PEFT-style RFCL (P-RFCL) approaches report peak performances, often convincingly outperforming existing CL techniques. However, on the other end, critical studies have recently highlighted competitive results by training on just the first task or via simple non-parametric baselines. Consequently, questions arise about the relationship between methodological choices in P-RFCL and their reported high benchmark scores. In this work, we tackle these questions to better understand the true drivers behind strong P-RFCL performances, their placement w.r.t. recent first-task adaptation studies, and their relation to preceding CL standards such as EWC or SI. In particular, we show: (1) P-RFCL techniques relying on input-conditional query mechanisms work not because, but rather despite them by collapsing towards standard PEFT shortcut solutions. (2) Indeed, we show how most often, P-RFCL techniques can be matched by a simple and lightweight PEFT baseline. (3) Using this baseline, we identify the implicit bound on tunable parameters when deriving RFCL approaches from PEFT methods as a potential denominator behind P-RFCL efficacy. Finally, we (4) better disentangle continual versus first-task adaptation, and (5) motivate standard RFCL techniques such as EWC or SI in light of recent P-RFCL methods.

MCML Authors
Link to website

Karsten Roth

Interpretable and Reliable Machine Learning

Link to Profile Zeynep Akata

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning


[62]
G. Zhang, M. L. A. Fok, Y. Xia, Y. Tang, D. Cremers, P. Torr, V. Tresp and J. Gu.
Localizing Events in Videos with Multimodal Queries.
Preprint (Jun. 2024). arXiv
Abstract

Video understanding is a pivotal task in the digital era, yet the dynamic and multievent nature of videos makes them labor-intensive and computationally demanding to process. Thus, localizing a specific event given a semantic query has gained importance in both user-oriented applications like video search and academic research into video foundation models. A significant limitation in current research is that semantic queries are typically in natural language that depicts the semantics of the target event. This setting overlooks the potential for multimodal semantic queries composed of images and texts. To address this gap, we introduce a new benchmark, ICQ, for localizing events in videos with multimodal queries, along with a new evaluation dataset ICQ-Highlight. Our new benchmark aims to evaluate how well models can localize an event given a multimodal semantic query that consists of a reference image, which depicts the event, and a refinement text to adjust the images’ semantics. To systematically benchmark model performance, we include 4 styles of reference images and 5 types of refinement texts, allowing us to explore model performance across different domains. We propose 3 adaptation methods that tailor existing models to our new setting and evaluate 10 SOTA models, ranging from specialized to large-scale foundation models. We believe this benchmark is an initial step toward investigating multimodal queries in video event localization.

MCML Authors
Link to website

Gengyuan Zhang

Database Systems & Data Mining

Link to website

Yan Xia

Dr.

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence

Link to Profile Volker Tresp

Volker Tresp

Prof. Dr.

Database Systems & Data Mining


[61]
L. Eyring, D. Klein, T. Palla, N. Kilbertus, Z. Akata and F. J. Theis.
Unbalancedness in Neural Monge Maps Improves Unpaired Domain Translation.
ICLR 2024 - 12th International Conference on Learning Representations. Vienna, Austria, May 07-11, 2024. URL
Abstract

In optimal transport (OT), a Monge map is known as a mapping that transports a source distribution to a target distribution in the most cost-efficient way. Recently, multiple neural estimators for Monge maps have been developed and applied in diverse unpaired domain translation tasks, e.g. in single-cell biology and computer vision. However, the classic OT framework enforces mass conservation, which makes it prone to outliers and limits its applicability in real-world scenarios. The latter can be particularly harmful in OT domain translation tasks, where the relative position of a sample within a distribution is explicitly taken into account. While unbalanced OT tackles this challenge in the discrete setting, its integration into neural Monge map estimators has received limited attention. We propose a theoretically grounded method to incorporate unbalancedness into any Monge map estimator. We improve existing estimators to model cell trajectories over time and to predict cellular responses to perturbations. Moreover, our approach seamlessly integrates with the OT flow matching (OT-FM) framework. While we show that OT-FM performs competitively in image translation, we further improve performance by incorporating unbalancedness (UOT-FM), which better preserves relevant features. We hence establish UOT-FM as a principled method for unpaired image translation.
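To make the mass-conservation point concrete, the sketch below runs entropic unbalanced OT in the discrete setting, using the commonly used scaling iterations with KL-relaxed marginals. It only illustrates discrete unbalanced OT, not the neural Monge map estimator proposed in the paper; the function name and the parameters `eps` and `rho` are assumptions, and conventions for the entropic term vary between references.

```python
import numpy as np


def unbalanced_sinkhorn(a: np.ndarray, b: np.ndarray, cost: np.ndarray,
                        eps: float = 0.1, rho: float = 1.0,
                        num_iters: int = 500) -> np.ndarray:
    """Entropic OT plan with marginal constraints relaxed by a KL penalty of strength rho."""
    kernel = np.exp(-cost / eps)
    u = np.ones_like(a)
    v = np.ones_like(b)
    # rho -> infinity gives power -> 1, i.e. the usual balanced Sinkhorn update.
    power = rho / (rho + eps)
    for _ in range(num_iters):
        u = (a / (kernel @ v)) ** power
        v = (b / (kernel.T @ u)) ** power
    return u[:, None] * kernel * v[None, :]


# Toy 1D example: the third source point is an outlier far from both targets.
source = np.array([0.0, 0.1, 5.0])
target = np.array([0.0, 0.1])
cost = (source[:, None] - target[None, :]) ** 2
a = np.full(3, 1.0 / 3.0)
b = np.full(2, 0.5)
plan = unbalanced_sinkhorn(a, b, cost)
transported_mass = plan.sum(axis=1)  # mass actually moved from each source point
```

Because the marginals are only softly enforced, the outlier keeps almost none of its mass in the plan, whereas balanced OT would be forced to transport all of it.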

MCML Authors
Link to website

Luca Eyring

Interpretable and Reliable Machine Learning

Link to Profile Niki Kilbertus

Niki Kilbertus

Prof. Dr.

Ethics in Systems Design and Machine Learning

Link to Profile Zeynep Akata

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning

Link to Profile Fabian Theis

Fabian Theis

Prof. Dr.

Mathematical Modelling of Biological Systems


[60]
C. Koke and D. Cremers.
HoloNets: Spectral Convolutions do extend to Directed Graphs.
ICLR 2024 - 12th International Conference on Learning Representations. Vienna, Austria, May 07-11, 2024. URL
Abstract

Within the graph learning community, conventional wisdom dictates that spectral convolutional networks may only be deployed on undirected graphs: Only there could the existence of a well-defined graph Fourier transform be guaranteed, so that information may be translated between spatial- and spectral domains. Here we show this traditional reliance on the graph Fourier transform to be superfluous and – making use of certain advanced tools from complex analysis and spectral theory – extend spectral convolutions to directed graphs. We provide a frequency-response interpretation of newly developed filters, investigate the influence of the basis used to express filters and discuss the interplay with characteristic operators on which networks are based. In order to thoroughly test the developed theory, we conduct experiments in real world settings, showcasing that directed spectral convolutional networks provide new state of the art results for heterophilic node classification on many datasets and – as opposed to baselines – may be rendered stable to resolution-scale varying topological perturbations.

MCML Authors
Link to website

Christian Koke

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[59]
S. Solonets, D. Sinitsyn, L. Von Stumberg, N. Araslanov and D. Cremers.
An Analytical Solution to Gauss-Newton Loss for Direct Image Alignment.
ICLR 2024 - 12th International Conference on Learning Representations. Vienna, Austria, May 07-11, 2024. URL
Abstract

Direct image alignment is a widely used technique for relative 6DoF pose estimation between two images, but its accuracy strongly depends on pose initialization. Therefore, recent end-to-end frameworks increase the convergence basin of the learned feature descriptors with special training objectives, such as the Gauss-Newton loss. However, the training data may exhibit bias toward a specific type of motion and pose initialization, thus limiting the generalization of these methods. In this work, we derive a closed-form solution to the expected optimum of the Gauss-Newton loss. The solution is agnostic to the underlying feature representation and allows us to dynamically adjust the basin of convergence according to our assumptions about the uncertainty in the current estimates. These properties allow for effective control over the convergence in the alignment process. Despite using self-supervised feature embeddings, our solution achieves compelling accuracy w.r.t. the state-of-the-art direct image alignment methods trained end-to-end with pose supervision, and demonstrates improved robustness to pose initialization. Our analytical solution exposes some inherent limitations of end-to-end learning with the Gauss-Newton loss, and establishes an intriguing connection between direct image alignment and feature-matching approaches.

MCML Authors
Link to website

Sergei Solonets

Computer Vision & Artificial Intelligence

Link to website

Daniil Sinitsyn

Computer Vision & Artificial Intelligence

Link to website

Nikita Araslanov

Dr.

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[58]
H. N. Dang, V. Golkov, J. Endres, S. Weinmüller, F. Glang, T. Wimmer, D. Cremers, A. Dörfler, A. Maier and M. Zaiss.
Joint sequence optimization beats pure neural network approaches for super-resolution TSE.
ISMRM 2024 - International Society for Magnetic Resonance in Medicine Annual Meeting. Singapore, May 04-09, 2024. URL
Abstract

Current MRI super-resolution (SR) methods only use existing contrasts acquired from typical clinical sequences as input for the neural network (NN). In turbo spin echo sequences (TSE) the sequence parameters can have a strong influence on the actual resolution of the acquired image and have consequently a considerable impact on the performance of the NN. We propose a known-operator learning approach to perform an end-to-end optimization of MR sequence and neural network parameters for SR-TSE. This MR-physics-informed training procedure jointly optimizes the radiofrequency pulse train of a proton density- (PD-) and T2-weighted TSE and a subsequently applied convolutional neural network to predict the corresponding PDw and T2w super-resolution TSE images. The found radiofrequency pulse train designs generate an optimal signal for the NN to perform the SR task. Our method generalizes from the simulation-based optimization to in vivo measurements and the acquired physics-informed SR images show higher correlation with a time-consuming segmented high-resolution TSE sequence compared to a pure network training approach.

MCML Authors
Link to website

Vladimir Golkov

Dr.

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[57]
Z. Ye, G. Peyré, D. Cremers and P. Ablin.
Enhancing Hypergradients Estimation: A Study of Preconditioning and Reparameterization.
AISTATS 2024 - 27th International Conference on Artificial Intelligence and Statistics. Valencia, Spain, May 02-04, 2024. URL GitHub
Abstract

Bilevel optimization aims to optimize an outer objective function that depends on the solution to an inner optimization problem. It is routinely used in Machine Learning, notably for hyperparameter tuning. The conventional method to compute the so-called hypergradient of the outer problem is to use the Implicit Function Theorem (IFT). As a function of the error of the inner problem resolution, we study the error of the IFT method. We analyze two strategies to reduce this error: preconditioning the IFT formula and reparameterizing the inner problem. We give a detailed account of the impact of these two modifications on the error, highlighting the role played by higher-order derivatives of the functionals at stake. Our theoretical findings explain when super efficiency, namely reaching an error on the hypergradient that depends quadratically on the error on the inner problem, is achievable and compare the two approaches when this is impossible. Numerical evaluations on hyperparameter tuning for regression problems substantiate our theoretical findings.
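The IFT hypergradient described above can be sketched on a small hyperparameter-tuning problem. The example assumes ridge regression as the inner problem and a validation loss as the outer objective; the function names and this problem choice are illustrative and not taken from the paper. With inner objective g(λ, w) = ½‖Xw − y‖² + ½λ‖w‖², the IFT gives dw*/dλ = −(XᵀX + λI)⁻¹ w*, and the chain rule yields the hypergradient.

```python
import numpy as np


def inner_solution(X: np.ndarray, y: np.ndarray, lam: float) -> np.ndarray:
    """Exact minimiser of the ridge objective 0.5*||Xw - y||^2 + 0.5*lam*||w||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)


def outer_loss(w: np.ndarray, X_val: np.ndarray, y_val: np.ndarray) -> float:
    """Validation loss used as the outer objective."""
    residual = X_val @ w - y_val
    return float(0.5 * residual @ residual)


def hypergradient_ift(X, y, X_val, y_val, lam: float) -> float:
    """d(outer loss)/d(lam) via the Implicit Function Theorem."""
    d = X.shape[1]
    w = inner_solution(X, y, lam)
    hessian = X.T @ X + lam * np.eye(d)   # second derivative of the inner objective in w
    cross = w                             # derivative of the inner gradient w.r.t. lam
    grad_outer = X_val.T @ (X_val @ w - y_val)
    # dw/dlam = -hessian^{-1} cross, then chain rule with the outer gradient.
    return float(-grad_outer @ np.linalg.solve(hessian, cross))
```

Here the inner problem is solved exactly, so the IFT formula is exact as well; the setting studied in the abstract is the one where the inner solution is only approximate and that error propagates into the hypergradient.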

MCML Authors
Link to website

Zhenzhang Ye

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[56]
V. Ehm, P. Roetzer, M. Eisenberger, M. Gao, F. Bernard and D. Cremers.
Geometrically Consistent Partial Shape Matching.
3DV 2024 - 11th International Conference on 3D Vision. Davos, Switzerland, Mar 18-21, 2024. DOI GitHub
Abstract

Finding correspondences between 3D shapes is a crucial problem in computer vision and graphics, which is for example relevant for tasks like shape interpolation, pose transfer, or texture transfer. An often neglected but essential property of matchings is geometric consistency, which means that neighboring triangles in one shape are consistently matched to neighboring triangles in the other shape. Moreover, while in practice one often has only access to partial observations of a 3D shape (e.g. due to occlusion, or scanning artifacts), there do not exist any methods that directly address geometrically consistent partial shape matching. In this work we fill this gap by proposing to integrate state-of-the-art deep shape features into a novel integer linear programming partial shape matching formulation. Our optimization yields a globally optimal solution on low resolution shapes, which we then refine using a coarse-to-fine scheme. We show that our method can find more reliable results on partial shapes in comparison to existing geometrically consistent algorithms (for which one first has to fill missing parts with a dummy geometry). Moreover, our matchings are substantially smoother than learning-based state-of-the-art shape matching methods.

MCML Authors
Link to website

Viktoria Ehm

Computer Vision & Artificial Intelligence

Link to website

Maolin Gao

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[55]
A. Hayler, F. Wimbauer, D. Muhle, C. Rupprecht and D. Cremers.
S4C: Self-Supervised Semantic Scene Completion with Neural Fields.
3DV 2024 - 11th International Conference on 3D Vision. Davos, Switzerland, Mar 18-21, 2024. DOI
Abstract

3D semantic scene understanding is a fundamental challenge in computer vision. It enables mobile agents to autonomously plan and navigate arbitrary environments. Semantic scene completion (SSC) formalizes this challenge as jointly estimating dense geometry and semantic information from sparse observations of a scene. Current methods for SSC are generally trained on 3D ground truth based on aggregated LiDAR scans. This process relies on special sensors and annotation by hand which are costly and do not scale well. To overcome this issue, our work presents the first self-supervised approach to SSC called S4C that does not rely on 3D ground truth data. Our proposed method can reconstruct a scene from a single image and only relies on videos and pseudo segmentation ground truth generated from an off-the-shelf image segmentation network during training. Unlike existing methods, which use discrete voxel grids, we represent scenes as implicit semantic fields. This formulation allows querying any point within the camera frustum for occupancy and semantic class. Our architecture is trained through rendering-based self-supervised losses. Nonetheless, our method achieves performance close to fully supervised state-of-the-art methods. Additionally, our method demonstrates strong generalization capabilities and can synthesize accurate segmentation maps for far away viewpoints.

MCML Authors
Link to website

Felix Wimbauer

Computer Vision & Artificial Intelligence

Link to website

Dominik Muhle

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[54]
S. Klenk, M. Motzet, L. Koestler and D. Cremers.
Deep Event Visual Odometry.
3DV 2024 - 11th International Conference on 3D Vision. Davos, Switzerland, Mar 18-21, 2024. DOI
Abstract

Event cameras offer the exciting possibility of tracking the camera’s pose during high-speed motion and in adverse lighting conditions. Despite this promise, existing event-based monocular visual odometry (VO) approaches demonstrate limited performance on recent benchmarks. To address this limitation, some methods resort to additional sensors such as IMUs, stereo event cameras, or frame-based cameras. Nonetheless, these additional sensors limit the application of event cameras in real-world devices since they increase cost and complicate system requirements. Moreover, relying on a frame-based camera makes the system susceptible to motion blur and HDR. To remove the dependency on additional sensors and to push the limits of using only a single event camera, we present Deep Event VO (DEVO), the first monocular event-only system with strong performance on a large number of real-world benchmarks. DEVO sparsely tracks selected event patches over time. A key component of DEVO is a novel deep patch selection mechanism tailored to event data. We significantly decrease the state-of-the-art pose tracking error on seven real-world benchmarks by up to 97% compared to event-only methods and often surpass or are close to stereo or inertial methods.

MCML Authors
Link to website

Simon Klenk

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[53]
M. Zaiss, J. R. Rajput, H. N. Dang, V. Golkov, D. Cremers, F. Knoll and A. Maier.
GPT4MR: Exploring GPT-4 as an MR Sequence and Reconstruction Programming Assistant.
BVM 2024 - German Conference on Medical Image Computing - Bildverarbeitung für die Medizin. Erlangen, Germany, Mar 10-12, 2024. DOI
Abstract

In this study, we explore the potential of generative pre-trained transformer (GPT), as a coding assistant for MRI sequence programming using the Pulseq framework. The programming of MRI sequences is traditionally a complex and time-consuming task, and the Pulseq standard has recently simplified this process. It allows researchers to define and generate complex pulse sequences used in MRI experiments. Leveraging GPT-4’s capabilities in natural language generation, we adapted it for MRI sequence programming, creating a specialized assistant named GPT4MR. Our tests involved generating various MRI sequences, revealing that GPT-4, guided by a tailored prompt, outperformed GPT-3.5, producing fewer errors and demonstrating improved reasoning. Despite limitations in handling complex sequences, GPT4MR corrected its own errors and successfully generated code with step-by-step instructions. The study showcases GPT4MR’s ability to accelerate MRI sequence development, even for novel ideas absent in its training set. While further research and improvement are needed to address complexity limitations, a well-designed prompt enhances performance. The findings propose GPT4MR as a valuable MRI sequence programming assistant, streamlining prototyping and development. The future prospect involves integrating a PyPulseq plugin into lightweight, open-source LLMs, potentially revolutionizing MRI sequence development and prototyping.

MCML Authors
Link to website

Vladimir Golkov

Dr.

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[52]
S. A. Baumann, F. Krause, M. Neumayr, N. Stracke, V. Hu and B. Ommer.
Continuous, Subject-Specific Attribute Control in T2I Models by Identifying Semantic Directions.
Preprint (Mar. 2024). arXiv GitHub
Abstract

In recent years, advances in text-to-image (T2I) diffusion models have substantially elevated the quality of their generated images. However, achieving fine-grained control over attributes remains a challenge due to the limitations of natural language prompts (such as no continuous set of intermediate descriptions existing between "person" and "old person"). Even though many methods were introduced that augment the model or generation process to enable such control, methods that do not require a fixed reference image are limited to either enabling global fine-grained attribute expression control or coarse attribute expression control localized to specific subjects, not both simultaneously. We show that there exist directions in the commonly used token-level CLIP text embeddings that enable fine-grained subject-specific control of high-level attributes in text-to-image models. Based on this observation, we introduce one efficient optimization-free and one robust optimization-based method to identify these directions for specific attributes from contrastive text prompts. We demonstrate that these directions can be used to augment the prompt text input with fine-grained control over attributes of specific subjects in a compositional manner (control over multiple attributes of a single subject) without having to adapt the diffusion model.
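The idea of a semantic direction in token-level embeddings can be sketched with plain arrays. The example below is a toy under strong assumptions: the embeddings are random stand-ins rather than real CLIP outputs, only the subject token differs between the two prompts, and the function names are hypothetical. It mirrors the spirit of an optimization-free estimate from a contrastive prompt pair, not the paper's actual procedure.

```python
import numpy as np


def attribute_direction(neutral_tokens: np.ndarray, attribute_tokens: np.ndarray,
                        subject_index: int) -> np.ndarray:
    """Difference of the subject token's embedding between two contrastive prompts."""
    return attribute_tokens[subject_index] - neutral_tokens[subject_index]


def apply_direction(prompt_tokens: np.ndarray, direction: np.ndarray,
                    subject_index: int, scale: float) -> np.ndarray:
    """Shift only the subject token along the direction; scale sets the attribute strength."""
    edited = prompt_tokens.copy()
    edited[subject_index] = edited[subject_index] + scale * direction
    return edited


# Stand-in token embeddings (4 tokens, 8 dimensions) for a prompt such as
# "a photo of a person" and its contrastive variant with "old person".
rng = np.random.default_rng(0)
neutral = rng.normal(size=(4, 8))
attribute = neutral.copy()
attribute[2] += rng.normal(size=8)  # only the subject token differs in this toy

delta = attribute_direction(neutral, attribute, subject_index=2)
half_strength = apply_direction(neutral, delta, subject_index=2, scale=0.5)
```

Because `scale` is a continuous value, intermediate attribute strengths become expressible even though no prompt wording exists for them, and applying the shift to a single token keeps the control subject-specific.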

MCML Authors
Link to Profile Björn Ommer

Björn Ommer

Prof. Dr.

Machine Vision & Learning


[51]
A. Davtyan, S. Sameni, B. Ommer and P. Favaro.
Enabling Visual Composition and Animation in Unsupervised Video Generation.
Preprint (Mar. 2024). arXiv GitHub
Abstract

In this work we propose a novel method for unsupervised controllable video generation. Once trained on a dataset of unannotated videos, at inference our model is capable of both composing scenes of predefined object parts and animating them in a plausible and controlled way. This is achieved by conditioning video generation on a randomly selected subset of local pre-trained self-supervised features during training. We call our model CAGE for visual Composition and Animation for video GEneration. We conduct a series of experiments to demonstrate capabilities of CAGE in various settings.

MCML Authors
Link to Profile Björn Ommer

Björn Ommer

Prof. Dr.

Machine Vision & Learning


[50]
M. Gui, J. S. Fischer, U. Prestel, P. Ma, D. Kotovenko, O. Grebenkova, S. Baumann, V. T. Hu and B. Ommer.
DepthFM: Fast Monocular Depth Estimation with Flow Matching.
Preprint (Mar. 2024). arXiv
Abstract

Monocular depth estimation is crucial for numerous downstream vision tasks and applications. Current discriminative approaches to this problem are limited due to blurry artifacts, while state-of-the-art generative methods suffer from slow sampling due to their SDE nature. Rather than starting from noise, we seek a direct mapping from input image to depth map. We observe that this can be effectively framed using flow matching, since its straight trajectories through solution space offer efficiency and high quality. Our study demonstrates that a pre-trained image diffusion model can serve as an adequate prior for a flow matching depth model, allowing efficient training on only synthetic data to generalize to real images. We find that an auxiliary surface normals loss further improves the depth estimates. Due to the generative nature of our approach, our model reliably predicts the confidence of its depth estimates. On standard benchmarks of complex natural scenes, our lightweight approach exhibits state-of-the-art performance at favorable low computational cost despite only being trained on little synthetic data.
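The straight-trajectory property mentioned above can be shown with a minimal flow matching sketch. It assumes the common linear interpolation path between a source sample and a target sample; the arrays stand in for image and depth representations, and the "oracle" velocity replaces a trained network, so this is an illustration of the principle and not the DepthFM model.

```python
import numpy as np


def interpolate(x0: np.ndarray, x1: np.ndarray, t: float) -> np.ndarray:
    """Point on the straight path from x0 (t = 0) to x1 (t = 1)."""
    return (1.0 - t) * x0 + t * x1


def target_velocity(x0: np.ndarray, x1: np.ndarray) -> np.ndarray:
    """Regression target for the velocity network: constant along a straight path."""
    return x1 - x0


def euler_integrate(velocity_fn, x0: np.ndarray, num_steps: int) -> np.ndarray:
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x


# Stand-ins for an image representation (start) and its depth map (end).
image = np.array([0.2, 0.8, 0.5])
depth = np.array([1.0, 3.0, 2.0])

# With a perfect velocity field the path is straight, so very few steps suffice.
oracle = lambda x, t: target_velocity(image, depth)
predicted_depth = euler_integrate(oracle, image, num_steps=2)
```

With a perfectly straight trajectory even a single Euler step lands on the target, which is the intuition behind the fast sampling claimed for flow matching compared with curved SDE-based sampling paths.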

MCML Authors
Link to website

Pingchuan Ma

Machine Vision & Learning

Link to website

Olga Grebenkova

Machine Vision & Learning

Link to Profile Björn Ommer

Björn Ommer

Prof. Dr.

Machine Vision & Learning


[49]
A. Höhl, I. Obadic, M. Á. F. Torres, H. Najjar, D. Oliveira, Z. Akata, A. Dengel and X. Zhu.
Opening the Black-Box: A Systematic Review on Explainable AI in Remote Sensing.
Preprint (Feb. 2024). arXiv
Abstract

In recent years, black-box machine learning approaches have become a dominant modeling paradigm for knowledge extraction in Remote Sensing. Despite the potential benefits of uncovering the inner workings of these models with explainable AI, a comprehensive overview summarizing the used explainable AI methods and their objectives, findings, and challenges in Remote Sensing applications is still missing. In this paper, we address this issue by performing a systematic review to identify the key trends of how explainable AI is used in Remote Sensing and shed light on novel explainable AI approaches and emerging directions that tackle specific Remote Sensing challenges. We also reveal the common patterns of explanation interpretation, discuss the extracted scientific insights in Remote Sensing, and reflect on the approaches used for explainable AI methods evaluation. Our review provides a complete summary of the state-of-the-art in the field. Further, we give a detailed outlook on the challenges and promising research directions, representing a basis for novel methodological development and a useful starting point for new researchers in the field of explainable AI in Remote Sensing.

MCML Authors
Link to website

Ivica Obadic

Data Science in Earth Observation

Link to Profile Zeynep Akata

Zeynep Akata

Prof. Dr.

Interpretable and Reliable Machine Learning

Link to Profile Xiaoxiang Zhu

Xiaoxiang Zhu

Prof. Dr.

Data Science in Earth Observation


[48]
M. Brahimi, B. Haefner, T. Yenamandra, B. Goldluecke and D. Cremers.
SupeRVol: Super-Resolution Shape and Reflectance Estimation in Inverse Volume Rendering.
WACV 2024 - IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa, Hawaii, Jan 04-08, 2024. DOI
Abstract

We propose an end-to-end inverse rendering pipeline called SupeRVol that allows us to recover 3D shape and material parameters from a set of color images in a super-resolution manner. To this end, we represent both the bidirectional reflectance distribution function’s (BRDF) parameters and the signed distance function (SDF) by multi-layer perceptrons (MLPs). In order to obtain both the surface shape and its reflectance properties, we revert to a differentiable volume renderer with a physically based illumination model that allows us to decouple reflectance and lighting. This physical model takes into account the effect of the camera’s point spread function thereby enabling a reconstruction of shape and material in a super-resolution quality. Experimental validation confirms that SupeRVol achieves state-of-the-art performance in terms of inverse rendering quality. It generates reconstructions that are sharper than the individual input images, making this method ideally suited for 3D modeling from low-resolution imagery.

MCML Authors
Link to website

Tarun Yenamandra

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[47]
S. Klenk, D. Bonello, L. Koestler, N. Araslanov and D. Cremers.
Masked Event Modeling: Self-Supervised Pretraining for Event Cameras.
WACV 2024 - IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa, Hawaii, Jan 04-08, 2024. DOI
Abstract

Event cameras asynchronously capture brightness changes with low latency, high temporal resolution, and high dynamic range. However, annotation of event data is a costly and laborious process, which limits the use of deep learning methods for classification and other semantic tasks with the event modality. To reduce the dependency on labeled event data, we introduce Masked Event Modeling (MEM), a self-supervised framework for events. Our method pretrains a neural network on unlabeled events, which can originate from any event camera recording. Subsequently, the pretrained model is finetuned on a downstream task, leading to a consistent improvement of the task accuracy. For example, our method reaches state-of-the-art classification accuracy across three datasets, N-ImageNet, N-Cars, and N-Caltech101, increasing the top-1 accuracy of previous work by significant margins. When tested on real-world event data, MEM is even superior to supervised RGB-based pretraining. The models pretrained with MEM are also label-efficient and generalize well to the dense task of semantic image segmentation.

MCML Authors
Link to website

Simon Klenk

Computer Vision & Artificial Intelligence

Link to website

Nikita Araslanov

Dr.

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[46]
U. Sahin, H. Li, Q. Khan, D. Cremers and V. Tresp.
Enhancing Multimodal Compositional Reasoning of Visual Language Models With Generative Negative Mining.
WACV 2024 - IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa, Hawaii, Jan 04-08, 2024. DOI GitHub
Abstract

Contemporary large-scale visual language models (VLMs) exhibit strong representation capacities, making them ubiquitous for enhancing image and text understanding tasks. They are often trained in a contrastive manner on a large and diverse corpus of images and corresponding text captions scraped from the internet. Despite this, VLMs often struggle with compositional reasoning tasks which require a fine-grained understanding of the complex interactions of objects and their attributes. This failure can be attributed to two main factors: 1) Contrastive approaches have traditionally focused on mining negative examples from existing datasets. However, the mined negative examples might not be difficult for the model to discriminate from the positive. An alternative to mining would be negative sample generation. 2) But existing generative approaches primarily focus on generating hard negative texts associated with a given image. Mining in the other direction, i.e., generating negative image samples associated with a given text, has been ignored. To overcome both these limitations, we propose a framework that not only mines in both directions but also generates challenging negative samples in both modalities, i.e., images and texts. Leveraging these generative hard negative samples, we significantly enhance VLMs’ performance in tasks involving multimodal compositional reasoning.

MCML Authors
Link to website

Hang Li

Database Systems & Data Mining

Link to website

Qadeer Khan

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence

Link to Profile Volker Tresp

Volker Tresp

Prof. Dr.

Database Systems & Data Mining


[45]
T. Tewari, N. Yang, F. Bernard, C. Theobalt and D. Cremers.
FIRe: Fast Inverse Rendering Using Directional and Signed Distance Functions.
WACV 2024 - IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa, Hawaii, Jan 04-08, 2024. DOI
Abstract

Neural 3D implicit representations learn priors that are useful for diverse applications, such as single- or multiple-view 3D reconstruction. A major downside of existing approaches while rendering an image is that they require evaluating the network multiple times per camera ray so that the high computational time forms a bottleneck for downstream applications. We address this problem by introducing a novel neural scene representation that we call the directional distance function (DDF). To this end, we learn a signed distance function (SDF) along with our DDF model to represent a class of shapes. Specifically, our DDF is defined on the unit sphere and predicts the distance to the surface along any given direction. Therefore, our DDF allows rendering images with just a single network evaluation per camera ray. Based on our DDF, we present a novel fast algorithm (FIRe) to reconstruct 3D shapes given a posed depth map. We evaluate our proposed method on 3D reconstruction from single-view depth images, where we empirically show that our algorithm reconstructs 3D shapes more accurately and it is more than 15 times faster (per iteration) than competing methods.
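
The cost difference described in the abstract can be illustrated with a minimal sketch. It is not the FIRe implementation: the learned networks are replaced by closed-form functions for a single sphere, and the names `sdf`, `sphere_trace`, `ddf` and the constant `RADIUS` are assumptions made for this example. Sphere tracing queries the SDF repeatedly along a ray, while a directional distance function returns the hit distance from one query.

```python
import numpy as np

RADIUS = 1.0  # hypothetical shape: a sphere centered at the origin


def sdf(p):
    """Signed distance from point p to the sphere surface."""
    return np.linalg.norm(p) - RADIUS


def sphere_trace(origin, direction, eps=1e-6, max_steps=100):
    """March along the ray, querying the SDF once per step.

    Assumes the origin lies outside the shape and the direction is unit length.
    Returns (distance along the ray, number of SDF evaluations).
    """
    t = 0.0
    for n_evals in range(1, max_steps + 1):
        dist = sdf(origin + t * direction)
        if dist < eps:
            return t, n_evals
        t += dist  # the SDF value is a safe step size
    return None, max_steps  # ray missed the shape


def ddf(origin, direction):
    """Analytic stand-in for a learned directional distance function.

    One evaluation returns the distance to the surface along the ray,
    or None if the ray misses the sphere.
    """
    b = np.dot(origin, direction)
    c = np.dot(origin, origin) - RADIUS ** 2
    disc = b * b - c
    if disc < 0:
        return None
    return -b - np.sqrt(disc)


origin = np.array([0.5, 0.0, -3.0])
direction = np.array([0.0, 0.0, 1.0])

t_sdf, n_evals = sphere_trace(origin, direction)
t_ddf = ddf(origin, direction)
```

Both functions agree on the hit distance, but `sphere_trace` needs several SDF queries for this ray where `ddf` needs one. With neural networks in place of the closed-form functions, each query is a forward pass.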

MCML Authors
Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[44]
D. Zhu, Q. Khan and D. Cremers.
Multi-vehicle trajectory prediction and control at intersections using state and intention information.
Neurocomputing 574 (Jan. 2024). DOI GitHub
Abstract

Traditional deep learning approaches for prediction of future trajectory of multiple road agents rely on knowing information about their past trajectory. In contrast, this work utilizes information of only the current state and intended direction to predict the future trajectory of multiple vehicles at intersections. Incorporating intention information has two distinct advantages: (1) It allows us to not just predict the future trajectory but also control the multiple vehicles. (2) By manipulating the intention, the interaction among the vehicles is adapted accordingly to achieve desired behavior. Both these advantages would otherwise not be possible using only past trajectory information. Our model utilizes message passing of information between the vehicle nodes for a more holistic overview of the environment, resulting in better trajectory prediction and control of the vehicles. This work also provides a thorough investigation and discussion into the disparity between offline and online metrics for the task of multi-agent control. We particularly show why conducting only offline evaluation would not suffice, thereby necessitating online evaluation. We demonstrate the superiority of utilizing intention information rather than past trajectory in online scenarios. Lastly, we show the capability of our method in adapting to different domains through experiments conducted on two distinct simulation platforms, i.e., SUMO and CARLA.

MCML Authors
Link to website

Dekai Zhu

Computer Aided Medical Procedures & Augmented Reality

Link to website

Qadeer Khan

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[43]
M. Zaiss, H. N. Dang, V. Golkov, J. R. Rajput, D. Cremers, F. Knoll and A. Maier.
GPT4MR: Exploring GPT-4 as an MR Sequence and Reconstruction Programming Assistant.
ESMRMB 2023 - 39th Annual Meeting of the European Society for Magnetic Resonance in Medicine and Biology. Basel, Switzerland, Oct 04-07, 2023. URL
Abstract

In this study, we explore the potential of generative pre-trained transformer (GPT) as a coding assistant for MRI sequence programming using the Pulseq framework. The programming of MRI sequences is traditionally a complex and time-consuming task, and the Pulseq standard has recently simplified this process. It allows researchers to define and generate complex pulse sequences used in MRI experiments. Leveraging GPT-4’s capabilities in natural language generation, we adapted it for MRI sequence programming, creating a specialized assistant named GPT4MR. Our tests involved generating various MRI sequences, revealing that GPT-4, guided by a tailored prompt, outperformed GPT-3.5, producing fewer errors and demonstrating improved reasoning. Despite limitations in handling complex sequences, GPT4MR corrected its own errors and successfully generated code with step-by-step instructions. The study showcases GPT4MR’s ability to accelerate MRI sequence development, even for novel ideas absent in its training set. While further research and improvement are needed to address complexity limitations, a well-designed prompt enhances performance. The findings propose GPT4MR as a valuable MRI sequence programming assistant, streamlining prototyping and development. The future prospect involves integrating a PyPulseq plugin into lightweight, open-source LLMs, potentially revolutionizing MRI sequence development and prototyping.

MCML Authors
Link to website

Vladimir Golkov

Dr.

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[42]
M. B. Colomer, P. L. Dovesi, T. Panagiotakopoulos, J. F. Carvalho, L. Härenstam-Nielsen, H. Azizpour, H. Kjellström, D. Cremers and M. Poggi.
To adapt or not to adapt? Real-time adaptation for semantic segmentation.
ICCV 2023 - IEEE/CVF International Conference on Computer Vision. Paris, France, Oct 02-06, 2023. DOI
Abstract

The goal of Online Domain Adaptation for semantic segmentation is to handle unforeseeable domain changes that occur during deployment, like sudden weather events. However, the high computational costs associated with brute-force adaptation make this paradigm unfeasible for real-world applications. In this paper, we propose HAMLET, a Hardware-Aware Modular Least Expensive Training framework for real-time domain adaptation. Our approach includes a hardware-aware back-propagation orchestration agent (HAMT) and a dedicated domain-shift detector that enables active control over when and how the model is adapted (LT). Thanks to these advancements, our approach is capable of performing semantic segmentation while simultaneously adapting at more than 29 FPS on a single consumer-grade GPU. Our framework’s encouraging accuracy and speed trade-off is demonstrated on OnDA and SHIFT benchmarks through experimental results.

MCML Authors
Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[41]
M. Gao, P. Roetzer, M. Eisenberger, Z. Lähner, M. Moeller, D. Cremers and F. Bernard.
ΣIGMA: Scale-Invariant Global Sparse Shape Matching.
ICCV 2023 - IEEE/CVF International Conference on Computer Vision. Paris, France, Oct 02-06, 2023. DOI
Abstract

We propose a novel mixed-integer programming (MIP) formulation for generating precise sparse correspondences for highly non-rigid shapes. To this end, we introduce a projected Laplace-Beltrami operator (PLBO) which combines intrinsic and extrinsic geometric information to measure the deformation quality induced by predicted correspondences. We integrate the PLBO, together with an orientation-aware regulariser, into a novel MIP formulation that can be solved to global optimality for many practical problems. In contrast to previous methods, our approach is provably invariant to rigid transformations and global scaling, initialisation-free, has optimality guarantees, and scales to high resolution meshes with (empirically observed) linear time. We show state-of-the-art results for sparse non-rigid matching on several challenging 3D datasets, including data with inconsistent meshing, as well as applications in mesh-to-point-cloud matching.

MCML Authors
Link to website

Maolin Gao

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[40]
H. Li, J. Dong, B. Wen, M. Gao, T. Huang, Y.-H. Liu and D. Cremers.
DDIT: Semantic Scene Completion via Deformable Deep Implicit Templates.
ICCV 2023 - IEEE/CVF International Conference on Computer Vision. Paris, France, Oct 02-06, 2023. DOI
Abstract

Scene reconstructions are often incomplete due to occlusions and limited viewpoints. There have been efforts to use semantic information for scene completion. However, the completed shapes may be rough and imprecise since respective methods rely on 3D convolution and/or lack effective shape constraints. To overcome these limitations, we propose a semantic scene completion method based on deformable deep implicit templates (DDIT). Specifically, we complete each segmented instance in a scene by deforming a template with a latent code. Such a template is expressed by a deep implicit function in the canonical frame. It abstracts the shape prior of a category, and thus can provide constraints on the overall shape of an instance. The latent code controls the deformation of the template to guarantee fine details of an instance. For code prediction, we design a neural network that leverages both intra- and inter-instance information. We also introduce an algorithm to transform instances between the world and canonical frames based on geometric constraints and a hierarchical tree. To further improve accuracy, we jointly optimize the latent code and transformation by enforcing the zero-valued isosurface constraint. In addition, we establish a new dataset to solve different problems of existing datasets. Experiments showed that our DDIT outperforms state-of-the-art approaches.

MCML Authors
Link to website

Haoang Li

Dr.

* Former member

Link to website

Maolin Gao

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[39]
Y. Xia, M. Gladkova, R. Wang, Q. Li, U. Stilla, J. F. Henriques and D. Cremers.
CASSPR: Cross Attention Single Scan Place Recognition.
ICCV 2023 - IEEE/CVF International Conference on Computer Vision. Paris, France, Oct 02-06, 2023. DOI
Abstract

Place recognition based on point clouds (LiDAR) is an important component for autonomous robots or self-driving vehicles. Current SOTA performance is achieved on accumulated LiDAR submaps using either point-based or voxel-based structures. While voxel-based approaches nicely integrate spatial context across multiple scales, they do not exhibit the local precision of point-based methods. As a result, existing methods struggle with fine-grained matching of subtle geometric features in sparse single-shot LiDAR scans. To overcome these limitations, we propose CASSPR as a method to fuse point-based and voxel-based approaches using cross attention transformers. CASSPR leverages a sparse voxel branch for extracting and aggregating information at lower resolution and a point-wise branch for obtaining fine-grained local information. CASSPR uses queries from one branch to try to match structures in the other branch, ensuring that both extract self-contained descriptors of the point cloud (rather than one branch dominating), but using both to inform the output global descriptor of the point cloud. Extensive experiments show that CASSPR surpasses the state-of-the-art by a large margin on several datasets (Oxford RobotCar, TUM, USyd). For instance, it achieves AR@1 of 85.6% on the TUM dataset, surpassing the strongest prior model by ~15%. Our code is publicly available.
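
The idea of querying one branch with features from the other can be sketched with plain scaled dot-product cross attention. This is only an illustration under strong assumptions: the feature arrays are random placeholders, the learned query/key/value projections of a real transformer are omitted, and the mean-pool-and-concatenate step used to form `global_descriptor` is a choice made for this example rather than the CASSPR architecture.

```python
import numpy as np


def cross_attention(queries, keys_values):
    """Scaled dot-product attention where queries attend to another feature set.

    queries:      (n_q, d) features from one branch
    keys_values:  (n_k, d) features from the other branch
    Returns the attended features (n_q, d) and the attention weights (n_q, n_k).
    """
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ keys_values, weights


rng = np.random.default_rng(0)
point_feats = rng.standard_normal((6, 4))  # placeholder point-branch features
voxel_feats = rng.standard_normal((3, 4))  # placeholder voxel-branch features

# Each branch queries the other one.
point_from_voxel, w_pv = cross_attention(point_feats, voxel_feats)
voxel_from_point, w_vp = cross_attention(voxel_feats, point_feats)

# Assumed aggregation: mean-pool each attended set and concatenate.
global_descriptor = np.concatenate(
    [point_from_voxel.mean(axis=0), voxel_from_point.mean(axis=0)]
)
```

Each row of `w_pv` says how strongly one point feature attends to each voxel feature, and `w_vp` gives the reverse direction.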

MCML Authors
Link to website

Yan Xia

Dr.

Computer Vision & Artificial Intelligence

Link to website

Mariia Gladkova

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[38]
A. Farshad, Y. Yeganeh, Y. Chi, C. Shen, B. Ommer and N. Navab.
Scenegenie: Scene graph guided diffusion models for image synthesis.
ICCV 2023 - Workshop at the IEEE/CVF International Conference on Computer Vision. Paris, France, Oct 02-06, 2023. DOI
Abstract

Text-conditioned image generation has made significant progress in recent years with generative adversarial networks and, more recently, diffusion models. While diffusion models conditioned on text prompts have produced impressive and high-quality images, accurately representing complex text prompts such as the number of instances of a specific object remains challenging. To address this limitation, we propose a novel guidance approach for the sampling process in the diffusion model that leverages bounding box and segmentation map information at inference time without additional training data. Through a novel loss in the sampling process, our approach guides the model with semantic features from CLIP embeddings and enforces geometric constraints, leading to high-resolution images that accurately represent the scene. To obtain bounding box and segmentation map information, we structure the text prompt as a scene graph and enrich the nodes with CLIP embeddings. Our proposed model achieves state-of-the-art performance on two public benchmarks for image generation from scene graphs, surpassing both scene graph to image and text-based diffusion models in various metrics. Our results demonstrate the effectiveness of incorporating bounding box and segmentation map guidance in the diffusion model sampling process for more accurate text-to-image generation.

MCML Authors
Link to website

Azade Farshad

Dr.

Computer Aided Medical Procedures & Augmented Reality

Link to website

Yousef Yeganeh

Computer Aided Medical Procedures & Augmented Reality

Link to Profile Björn Ommer

Björn Ommer

Prof. Dr.

Machine Vision & Learning

Link to Profile Nassir Navab

Nassir Navab

Prof. Dr.

Computer Aided Medical Procedures & Augmented Reality


[37]
J. Pan, C. Zhou, M. Gladkova, Q. Khan and D. Cremers.
Robust Autonomous Vehicle Pursuit without Expert Steering Labels.
IEEE Robotics and Automation Letters 8.10 (Oct. 2023). DOI
Abstract

In this work, we present a learning method for both lateral and longitudinal motion control of an ego-vehicle for the task of vehicle pursuit. The car being controlled does not have a pre-defined route; rather, it reactively adapts to follow a target vehicle while maintaining a safety distance. To train our model, we do not rely on steering labels recorded from an expert driver, but effectively leverage a classical controller as an offline label generation tool. In addition, we account for the errors in the predicted control values, which can lead to a loss of tracking and catastrophic crashes of the controlled vehicle. To this end, we propose an effective data augmentation approach, which allows us to train a network that is capable of handling different views of the target vehicle. During the pursuit, the target vehicle is first localized using a Convolutional Neural Network. The network takes a single RGB image along with the cars’ velocities and estimates the target vehicle’s pose with respect to the ego-vehicle. This information is then fed to a Multi-Layer Perceptron, which regresses the control commands for the ego-vehicle, namely throttle and steering angle. We extensively validate our approach using the CARLA simulator on a wide range of terrains. Our method demonstrates real-time performance, robustness to different scenarios including unseen trajectories, and high route completion.

MCML Authors
Link to website

Mariia Gladkova

Computer Vision & Artificial Intelligence

Link to website

Qadeer Khan

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[36]
Y. Ma, Q. Khan and D. Cremers.
Multi Agent Navigation in Unconstrained Environments Using a Centralized Attention Based Graphical Neural Network Controller.
ITSC 2023 - 26th IEEE International Conference on Intelligent Transportation Systems. Bilbao, Spain, Sep 24-28, 2023. DOI GitHub
Abstract

In this work, we propose a learning-based neural model that provides both the longitudinal and lateral control commands to simultaneously navigate multiple vehicles. The goal is to ensure that each vehicle reaches a desired target state without colliding with any other vehicle or obstacle in an unconstrained environment. The model utilizes an attention-based Graphical Neural Network paradigm that takes into consideration the state of all the surrounding vehicles to make an informed decision. This allows each vehicle to smoothly reach its destination while also evading collision with the other agents. The data and corresponding labels for training such a network are obtained using an optimization-based procedure. Experimental results demonstrate that our model is powerful enough to generalize even to situations with more vehicles than in the training data. Our method also outperforms comparable graphical neural network architectures.

MCML Authors
Link to website

Qadeer Khan

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[35]
J. Schmidt, Q. Khan and D. Cremers.
LiDAR View Synthesis for Robust Vehicle Navigation Without Expert Labels.
ITSC 2023 - 26th IEEE International Conference on Intelligent Transportation Systems. Bilbao, Spain, Sep 24-28, 2023. DOI GitHub
Abstract

Deep learning models for self-driving cars require a diverse training dataset to manage critical driving scenarios on public roads safely. This includes having data from divergent trajectories, such as the oncoming traffic lane or sidewalks. Such data would be too dangerous to collect in the real world. Data augmentation approaches have been proposed to tackle this issue using RGB images. However, solutions based on LiDAR sensors are scarce. Therefore, we propose synthesizing additional LiDAR point clouds from novel viewpoints without physically driving at dangerous positions. The LiDAR view synthesis is done using mesh reconstruction and ray casting. We train a deep learning model, which takes a LiDAR scan as input and predicts the future trajectory as output. A waypoint controller is then applied to this predicted trajectory to determine the throttle and steering labels of the ego-vehicle. Our method requires expert driving labels for neither the original nor the synthesized LiDAR sequence. Instead, we infer labels from LiDAR odometry. We demonstrate the effectiveness of our approach in a comprehensive online evaluation and with a comparison to concurrent work. Our results show the importance of synthesizing additional LiDAR point clouds, particularly in terms of model robustness.

MCML Authors
Link to website

Qadeer Khan

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[34]
Y. Shan, Y. Xia, Y. Chen and D. Cremers.
SCP: Scene Completion Pre-training for 3D Object Detection.
Preprint (Sep. 2023). arXiv
Abstract

3D object detection using LiDAR point clouds is a fundamental task in the fields of computer vision, robotics, and autonomous driving. However, existing 3D detectors heavily rely on annotated datasets, which are both time-consuming and prone to errors during the process of labeling 3D bounding boxes. In this paper, we propose a Scene Completion Pre-training (SCP) method to enhance the performance of 3D object detectors with less labeled data. SCP offers three key advantages: (1) Improved initialization of the point cloud model. By completing the scene point clouds, SCP effectively captures the spatial and semantic relationships among objects within urban environments. (2) Elimination of the need for additional datasets. SCP serves as a valuable auxiliary network that does not impose any additional efforts or data requirements on the 3D detectors. (3) Reduction of the amount of labeled data for detection. With the help of SCP, the existing state-of-the-art 3D detectors can achieve comparable performance while only relying on 20% labeled data.

MCML Authors
Link to website

Yan Xia

Dr.

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[33]
C. Tomani, F. K. Waseda, Y. Shen and D. Cremers.
Beyond In-Domain Scenarios: Robust Density-Aware Calibration.
ICML 2023 - 40th International Conference on Machine Learning. Honolulu, Hawaii, Jul 23-29, 2023. URL
Abstract

Calibrating deep learning models to yield uncertainty-aware predictions is crucial as deep neural networks get increasingly deployed in safety-critical applications. While existing post-hoc calibration methods achieve impressive results on in-domain test datasets, they are limited by their inability to yield reliable uncertainty estimates in domain-shift and out-of-domain (OOD) scenarios. We aim to bridge this gap by proposing DAC, an accuracy-preserving as well as Density-Aware Calibration method based on k-nearest-neighbors (KNN). In contrast to existing post-hoc methods, we utilize hidden layers of classifiers as a source for uncertainty-related information and study their importance. We show that DAC is a generic method that can readily be combined with state-of-the-art post-hoc methods. DAC boosts the robustness of calibration performance in domain-shift and OOD, while maintaining excellent in-domain predictive uncertainty estimates. We demonstrate that DAC leads to consistently better calibration across a large number of model architectures, datasets, and metrics. Additionally, we show that DAC improves calibration substantially on recent large-scale neural networks pre-trained on vast amounts of data.

MCML Authors
Link to website

Christian Tomani

Computer Vision & Artificial Intelligence

Yuesong Shen

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[32]
M. Eisenberger, A. Toker, L. Leal-Taixé and D. Cremers.
G-MSM: Unsupervised Multi-Shape Matching with Graph-based Affinity Priors.
CVPR 2023 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada, Jun 18-23, 2023. DOI GitHub
Abstract

We present G-MSM (Graph-based Multi-Shape Matching), a novel unsupervised learning approach for non-rigid shape correspondence. Rather than treating a collection of input poses as an unordered set of samples, we explicitly model the underlying shape data manifold. To this end, we propose an adaptive multi-shape matching architecture that constructs an affinity graph on a given set of training shapes in a self-supervised manner. The key idea is to combine putative, pairwise correspondences by propagating maps along shortest paths in the underlying shape graph. During training, we enforce cycle-consistency between such optimal paths and the pairwise matches which enables our model to learn topology-aware shape priors. We explore different classes of shape graphs and recover specific settings, like template-based matching (star graph) or learnable ranking/sorting (TSP graph), as special cases in our framework. Finally, we demonstrate state-of-the-art performance on several recent shape correspondence benchmarks, including real-world 3D scan meshes with topological noise and challenging inter-class pairs.

MCML Authors
Laura Leal-Taixé

Prof. Dr.

* Former member

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[31]
L. Härenstam-Nielsen, N. Zeller and D. Cremers.
Semidefinite Relaxations for Robust Multiview Triangulation.
CVPR 2023 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada, Jun 18-23, 2023. DOI
Abstract

We propose an approach based on convex relaxations for certifiably optimal robust multiview triangulation. To this end, we extend existing relaxation approaches to non-robust multiview triangulation by incorporating a least squares cost function. We propose two formulations, one based on epipolar constraints and one based on fractional reprojection constraints. The first is lower dimensional and remains tight under moderate noise and outlier levels, while the second is higher dimensional and therefore slower but remains tight even under extreme noise and outlier levels. We demonstrate through extensive experiments that the proposed approaches allow us to compute provably optimal reconstructions even under significant noise and a large percentage of outliers.

MCML Authors
Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[30]
D. Kotovenko, P. Ma, T. Milbich and B. Ommer.
Cross-Image-Attention for Conditional Embeddings in Deep Metric Learning.
CVPR 2023 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada, Jun 18-23, 2023. DOI
Abstract

Learning compact image embeddings that yield semantic similarities between images and that generalize to unseen test classes is at the core of deep metric learning (DML). Finding a mapping from a rich, localized image feature map onto a compact embedding vector is challenging: Although similarity emerges between tuples of images, DML approaches marginalize out information in an individual image before considering another image to which similarity is to be computed. Instead, we propose during training to condition the embedding of an image on the image we want to compare it to. Rather than embedding by a simple pooling as in standard DML, we use cross-attention so that one image can identify relevant features in the other image. Consequently, the attention mechanism establishes a hierarchy of conditional embeddings that gradually incorporates information about the tuple to steer the representation of an individual image. The cross-attention layers bridge the gap between the original unconditional embedding and the final similarity and allow backpropagation to update encodings more directly than through a lossy pooling layer. At test time we use the resulting improved unconditional embeddings, thus requiring no additional parameters or computational overhead. Experiments on established DML benchmarks show that our cross-attention conditional embedding during training improves the underlying standard DML pipeline significantly so that it outperforms the state-of-the-art.

MCML Authors
Link to website

Pingchuan Ma

Machine Vision & Learning

Link to Profile Björn Ommer

Björn Ommer

Prof. Dr.

Machine Vision & Learning


[29]
D. Muhle, L. Koestler, K. M. Jatavallabhula and D. Cremers.
Learning Correspondence Uncertainty via Differentiable Nonlinear Least Squares.
CVPR 2023 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada, Jun 18-23, 2023. DOI
Abstract

We propose a differentiable nonlinear least squares framework to account for uncertainty in relative pose estimation from feature correspondences. Specifically, we introduce a symmetric version of the probabilistic normal epipolar constraint, and an approach to estimate the covariance of feature positions by differentiating through the camera pose estimation procedure. We evaluate our approach on synthetic, as well as the KITTI and EuRoC real-world datasets. On the synthetic dataset, we confirm that our learned covariances accurately approximate the true noise distribution. In real-world experiments, we find that our approach consistently outperforms state-of-the-art non-probabilistic and probabilistic approaches, regardless of the feature extraction algorithm of choice.

MCML Authors
Link to website

Dominik Muhle

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[28]
S. Weber, N. Demmel, T. Chon Chan and D. Cremers.
Power Bundle Adjustment for Large-Scale 3D Reconstruction.
CVPR 2023 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada, Jun 18-23, 2023. DOI
Abstract

We introduce Power Bundle Adjustment as an expansion type algorithm for solving large-scale bundle adjustment problems. It is based on the power series expansion of the inverse Schur complement and constitutes a new family of solvers that we call inverse expansion methods. We theoretically justify the use of power series and we prove the convergence of our approach. Using the real-world BAL dataset we show that the proposed solver challenges the state-of-the-art iterative methods and significantly accelerates the solution of the normal equation, even for reaching a very high accuracy. This easy-to-implement solver can also complement a recently presented distributed bundle adjustment framework. We demonstrate that employing the proposed Power Bundle Adjustment as a subproblem solver significantly improves speed and accuracy of the distributed optimization.
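
The power series idea can be sketched on a small dense problem. The block names `U` (camera block), `V` (landmark block) and `W` (coupling block), the random test problem, and the function `inverse_schur_power_series` are assumptions made for this illustration, not the authors' solver. With the Schur complement written as S = U - W V^{-1} W^T = U (I - M), where M = U^{-1} W V^{-1} W^T, the inverse expands as S^{-1} = sum_i M^i U^{-1} whenever the spectral radius of M is below one, which holds for a positive definite system matrix.

```python
import numpy as np


def inverse_schur_power_series(U, W, V, order):
    """Approximate inv(U - W inv(V) W^T) by a truncated power series.

    Uses S^{-1} = sum_{i=0..order} M^i U^{-1} with M = U^{-1} W V^{-1} W^T.
    In bundle adjustment U and V are typically block diagonal, so their
    inverses are cheap; dense inverses are used here only for brevity.
    """
    U_inv = np.linalg.inv(U)
    V_inv = np.linalg.inv(V)
    M = U_inv @ W @ V_inv @ W.T
    term = U_inv.copy()   # i = 0 term
    total = U_inv.copy()
    for _ in range(order):
        term = M @ term   # next term M^i U^{-1}
        total += term
    return total


# Hypothetical damped normal-equation matrix H = J^T J + I (positive definite).
rng = np.random.default_rng(0)
J = rng.standard_normal((60, 8))
H = J.T @ J + np.eye(8)
U, W, V = H[:3, :3], H[:3, 3:], H[3:, 3:]

S = U - W @ np.linalg.inv(V) @ W.T
exact = np.linalg.inv(S)
approx = inverse_schur_power_series(U, W, V, order=30)
coarse = inverse_schur_power_series(U, W, V, order=1)
```

Raising `order` trades extra matrix products for a closer match to the exact inverse. The truncation error shrinks geometrically with the spectral radius of `M`.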

MCML Authors
Link to website

Simon Weber

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[27]
F. Wimbauer, N. Yang, C. Rupprecht and D. Cremers.
Behind the Scenes: Density Fields for Single View Reconstruction.
CVPR 2023 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada, Jun 18-23, 2023. DOI
Abstract

Inferring a meaningful geometric scene representation from a single image is a fundamental problem in computer vision. Approaches based on traditional depth map prediction can only reason about areas that are visible in the image. Currently, neural radiance fields (NeRFs) can capture true 3D including color, but are too complex to be generated from a single image. As an alternative, we propose to predict an implicit density field from a single image. It maps every location in the frustum of the image to volumetric density. By directly sampling color from the available views instead of storing color in the density field, our scene representation becomes significantly less complex compared to NeRFs, and a neural network can predict it in a single forward pass. The network is trained through self-supervision from only video data. Our formulation allows volume rendering to perform both depth prediction and novel view synthesis. Through experiments, we show that our method is able to predict meaningful geometry for regions that are occluded in the input image. Additionally, we demonstrate the potential of our approach on three datasets for depth prediction and novel-view synthesis.
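
How a density field yields depth through volume rendering can be sketched with the standard discrete rendering weights. The density values below are hand-made rather than network predictions, and the function name `render_depth` and the sample spacing are assumptions for this example. For samples at distances d_i with densities sigma_i and spacing delta_i, the opacity is alpha_i = 1 - exp(-sigma_i delta_i), the transmittance is T_i = prod_{j<i} (1 - alpha_j), and the expected depth is sum_i T_i alpha_i d_i.

```python
import numpy as np


def render_depth(sigma, dists):
    """Expected ray termination depth from densities sampled along one ray.

    sigma: (n,) non-negative volumetric densities at the sample points
    dists: (n,) increasing distances of the samples from the camera
    Returns (expected depth, per-sample rendering weights).
    """
    # Spacing between samples; the last interval reuses the previous spacing.
    deltas = np.diff(dists, append=dists[-1] + (dists[-1] - dists[-2]))
    alpha = 1.0 - np.exp(-sigma * deltas)
    # Transmittance: probability that the ray reaches sample i unoccluded.
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alpha[:-1])))
    weights = trans * alpha
    return np.sum(weights * dists), weights


# Hypothetical ray: empty space up to about 2.02, dense material behind it.
dists = np.linspace(0.0, 5.0, 101)
sigma = np.where(dists >= 2.02, 50.0, 0.0)

depth, weights = render_depth(sigma, dists)
```

The rendered depth lands just behind the first dense sample. The same weights can blend colors sampled from other views, which is how one set of densities can serve both depth prediction and novel view synthesis.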

MCML Authors
Link to website

Felix Wimbauer

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[26]
V. Ehm, D. Cremers and F. Bernard.
Non-Separable Multi-Dimensional Network Flows for Visual Computing.
EG 2023 - Poster at the 44th Annual Conference of the European Association for Computer Graphics. Saarbrücken, Germany, May 08-12, 2023. DOI
Abstract

Flows in networks (or graphs) play a significant role in numerous computer vision tasks. The scalar-valued edges in these graphs often lead to a loss of information and thereby to limitations in terms of expressiveness. For example, high-dimensional data (e.g. feature descriptors) are often mapped to a single scalar value (e.g. the similarity between two feature descriptors). To overcome this limitation, we propose a novel formalism for non-separable multi-dimensional network flows. By doing so, we enable an automatic and adaptive feature selection strategy: since the flow is defined on a per-dimension basis, the maximizing flow automatically chooses the best matching feature dimensions. As a proof of concept, we apply our formalism to the multi-object tracking problem and demonstrate that our approach outperforms scalar formulations on the MOT16 benchmark in terms of robustness to noise.

MCML Authors
Link to website

Viktoria Ehm

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[25]
H. N. Dang, V. Golkov, T. Wimmer, D. Cremers, A. Maier and M. Zaiss.
Joint MR sequence optimization beats pure neural network approaches for spin-echo MRI super-resolution.
Preprint (May. 2023). arXiv
Abstract

Current MRI super-resolution (SR) methods only use existing contrasts acquired from typical clinical sequences as input for the neural network (NN). In turbo spin echo sequences (TSE) the sequence parameters can have a strong influence on the actual resolution of the acquired image and consequently have a considerable impact on the performance of the NN. We propose a known-operator learning approach to perform an end-to-end optimization of MR sequence and neural network parameters for SR-TSE. This MR-physics-informed training procedure jointly optimizes the radiofrequency pulse train of a proton density- (PD-) and T2-weighted TSE and a subsequently applied convolutional neural network to predict the corresponding PDw and T2w super-resolution TSE images. The found radiofrequency pulse train designs generate an optimal signal for the NN to perform the SR task. Our method generalizes from the simulation-based optimization to in vivo measurements and the acquired physics-informed SR images show higher correlation with a time-consuming segmented high-resolution TSE sequence compared to a pure network training approach.

MCML Authors
Link to website

Vladimir Golkov

Dr.

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[24]
T. Wimmer, V. Golkov, H. Dang, M. Zaiss, A. Maier and D. Cremers.
Scale-Equivariant Deep Learning for 3D Data.
Preprint (Apr. 2023). arXiv GitHub
Abstract

The ability of convolutional neural networks (CNNs) to recognize objects regardless of their position in the image is due to the translation-equivariance of the convolutional operation. Group-equivariant CNNs transfer this equivariance to other transformations of the input. Dealing appropriately with objects and object parts of different scale is challenging, and scale can vary for multiple reasons such as the underlying object size or the resolution of the imaging modality. In this paper, we propose a scale-equivariant convolutional network layer for three-dimensional data that guarantees scale-equivariance in 3D CNNs. Scale-equivariance lifts the burden of having to learn each possible scale separately, allowing the neural network to focus on higher-level learning goals, which leads to better results and better data-efficiency. We provide an overview of the theoretical foundations and scientific work on scale-equivariant neural networks in the two-dimensional domain. We then transfer the concepts from 2D to the three-dimensional space and create a scale-equivariant convolutional layer for 3D data. Using the proposed scale-equivariant layer, we create a scale-equivariant U-Net for medical image segmentation and compare it with a non-scale-equivariant baseline method. Our experiments demonstrate the effectiveness of the proposed method in achieving scale-equivariance for 3D medical image analysis.
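
The translation-equivariance property mentioned at the start of the abstract can be checked numerically. The sketch below is an illustration only and is unrelated to the authors' scale-equivariant layer: it assumes a one-dimensional periodic signal and a hypothetical helper `circular_correlate`, and shows that filtering a shifted signal gives the shifted filter response.

```python
import numpy as np


def circular_correlate(signal, kernel):
    """1D cross-correlation with periodic (wrap-around) boundary handling."""
    signal = np.asarray(signal, dtype=float)
    return sum(w * np.roll(signal, -k) for k, w in enumerate(kernel))


def shift(signal, steps):
    """Translate a periodic signal by an integer number of samples."""
    return np.roll(signal, steps)


def is_translation_equivariant(signal, kernel, steps):
    """Check that filtering commutes with translation for this input."""
    filtered_then_shifted = shift(circular_correlate(signal, kernel), steps)
    shifted_then_filtered = circular_correlate(shift(signal, steps), kernel)
    return bool(np.allclose(filtered_then_shifted, shifted_then_filtered))
```

No analogous identity holds automatically when the input is rescaled instead of shifted, which is the gap a scale-equivariant layer is designed to close.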

MCML Authors
Link to website

Vladimir Golkov

Dr.

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[23]
Q. Khan, I. Sülö, M. Öcal and D. Cremers.
Learning vision based autonomous lateral vehicle control without supervision.
Applied Intelligence 53 (Mar. 2023). DOI GitHub
Abstract

Supervised deep learning methods using image data as input have shown promising results in the context of vehicle control. However, these supervised methods have two main disadvantages: 1) They require a copious amount of labeled training data, which is difficult and expensive to collect. 2) Such models do not perform well when they encounter situations outside the distribution of the training set. This includes deviations from the designated driving behavior. We therefore provide a framework to mitigate these problems from merely an unlabeled sequence of images. Visual Odometry is first used to determine the vehicle trajectory. Model Predictive Control (MPC) then uses this trajectory to implicitly infer the steering labels. Meanwhile, synthesized images at deviated trajectories are included in the training distribution for enhanced robustness of the neural network model. Experimental results demonstrate that the performance of our network is on par with methods requiring additional data collection or supervision.

MCML Authors
Link to website

Qadeer Khan

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[22]
S. Klenk, L. Koestler, D. Scaramuzza and D. Cremers.
E-NeRF: Neural Radiance Fields from a Moving Event Camera.
IEEE Robotics and Automation Letters 8.3 (Mar. 2023). DOI
Abstract

Estimating neural radiance fields (NeRFs) from “ideal” images has been extensively studied in the computer vision community. Most approaches assume optimal illumination and slow camera motion. These assumptions are often violated in robotic applications, where images may contain motion blur, and the scene may not have suitable illumination. This can cause significant problems for downstream tasks such as navigation, inspection, or visualization of the scene. To alleviate these problems, we present E-NeRF, the first method which estimates a volumetric scene representation in the form of a NeRF from a fast-moving event camera. Our method can recover NeRFs during very fast motion and in high-dynamic-range conditions where frame-based approaches fail. We show that rendering high-quality frames is possible by only providing an event stream as input. Furthermore, by combining events and frames, we can estimate NeRFs of higher quality than state-of-the-art approaches under severe motion blur. We also show that combining events and frames can overcome failure cases of NeRF estimation in scenarios where only a few input views are available without requiring additional regularization.

MCML Authors
Link to website

Simon Klenk

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[21]
L. Sang, B. Häfner, X. Zuo and D. Cremers.
High-Quality RGB-D Reconstruction via Multi-View Uncalibrated Photometric Stereo and Gradient-SDF.
WACV 2023 - IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa, Hawaii, Jan 03-07, 2023. DOI
Abstract

Fine-detailed reconstructions are in high demand in many applications. However, most of the existing RGB-D reconstruction methods rely on pre-calculated accurate camera poses to recover the detailed surface geometry, where the representation of a surface needs to be adapted when optimizing different quantities. In this paper, we present a novel multi-view RGB-D based reconstruction method that tackles camera pose, lighting, albedo, and surface normal estimation via the utilization of a gradient signed distance field (gradient-SDF). The proposed method formulates the image rendering process using specific physically-based model(s) and optimizes the surface’s quantities on the actual surface using its volumetric representation, as opposed to other works which estimate surface quantities only near the actual surface. To validate our method, we investigate two physically-based image formation models for natural light and point light source applications. The experimental results on synthetic and real-world datasets demonstrate that the proposed method can recover high-quality geometry of the surface more faithfully than the state-of-the-art and further improves the accuracy of estimated camera poses.

MCML Authors
Link to website

Björn Häfner

Computer Vision & Artificial Intelligence

Link to website

Xingxing Zuo

Dr.

Machine Learning for Robotics

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[20]
A. Blattmann, R. Rombach, K. Oktay and B. Ommer.
Retrieval-Augmented Diffusion Models.
NeurIPS 2022 - 36th Conference on Neural Information Processing Systems. New Orleans, LA, USA, Nov 28-Dec 09, 2022. URL
Abstract

Novel architectures have recently improved generative image synthesis leading to excellent visual quality in various tasks. Much of this success is due to the scalability of these architectures and hence caused by a dramatic increase in model complexity and in the computational resources invested in training these models. Our work questions the underlying paradigm of compressing large training data into ever growing parametric representations. We rather present an orthogonal, semi-parametric approach. We complement comparably small diffusion or autoregressive models with a separate image database and a retrieval strategy. During training we retrieve a set of nearest neighbors from this external database for each training instance and condition the generative model on these informative samples. While the retrieval approach is providing the (local) content, the model is focusing on learning the composition of scenes based on this content. As demonstrated by our experiments, simply swapping the database for one with different contents transfers a trained model post-hoc to a novel domain. The evaluation shows competitive performance on tasks which the generative model has not been trained on, such as class-conditional synthesis, zero-shot stylization or text-to-image synthesis without requiring paired text-image data. With negligible memory and computational overhead for the external database and retrieval we can significantly reduce the parameter count of the generative model and still outperform the state-of-the-art.

MCML Authors
Link to Profile Björn Ommer

Björn Ommer

Prof. Dr.

Machine Vision & Learning


[19]
H. H.-H. Hsu, Y. Shen, C. Tomani and D. Cremers.
What Makes Graph Neural Networks Miscalibrated?
NeurIPS 2022 - 36th Conference on Neural Information Processing Systems. New Orleans, LA, USA, Nov 28-Dec 09, 2022. URL
Abstract

Given the importance of getting calibrated predictions and reliable uncertainty estimations, various post-hoc calibration methods have been developed for neural networks on standard multi-class classification tasks. However, these methods are not well suited for calibrating graph neural networks (GNNs), which present unique challenges such as accounting for the graph structure and the graph-induced correlations between the nodes. In this work, we conduct a systematic study on the calibration qualities of GNN node predictions. In particular, we identify five factors which influence the calibration of GNNs: general under-confident tendency, diversity of nodewise predictive distributions, distance to training nodes, relative confidence level, and neighborhood similarity. Furthermore, based on the insights from this study, we design a novel calibration method named Graph Attention Temperature Scaling (GATS), which is tailored for calibrating graph neural networks. GATS incorporates designs that address all the identified influential factors and produces nodewise temperature scaling using an attention-based architecture. GATS is accuracy-preserving, data-efficient, and expressive at the same time. Our experiments empirically verify the effectiveness of GATS, demonstrating that it can consistently achieve state-of-the-art calibration results on various graph datasets for different GNN backbones.

MCML Authors
Yuesong Shen

Computer Vision & Artificial Intelligence

Link to website

Christian Tomani

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[18]
Y. Shen and D. Cremers.
Deep Combinatorial Aggregation.
NeurIPS 2022 - 36th Conference on Neural Information Processing Systems. New Orleans, LA, USA, Nov 28-Dec 09, 2022. URL
Abstract

Neural networks are known to produce poor uncertainty estimations, and a variety of approaches have been proposed to remedy this issue. This includes deep ensemble, a simple and effective method that achieves state-of-the-art results for uncertainty-aware learning tasks. In this work, we explore a combinatorial generalization of deep ensemble called deep combinatorial aggregation (DCA). DCA creates multiple instances of network components and aggregates their combinations to produce diversified model proposals and predictions. DCA components can be defined at different levels of granularity. We find that coarse-grain DCAs can outperform deep ensemble for uncertainty-aware learning both in terms of predictive performance and uncertainty estimation. For fine-grain DCAs, we find that an average parameterization approach named deep combinatorial weight averaging (DCWA) can improve the baseline training. It is on par with stochastic weight averaging (SWA) but does not require any custom training schedule or adaptation of BatchNorm layers. Furthermore, we propose a consistency enforcing loss that helps the training of DCWA and modelwise DCA. We experiment on in-domain, distributional shift, and out-of-distribution image classification tasks, and empirically confirm the effectiveness of DCWA and DCA approaches.

MCML Authors
Yuesong Shen

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[17]
H. H.-H. Hsu, Y. Shen and D. Cremers.
A Graph Is More Than Its Nodes: Towards Structured Uncertainty-Aware Learning on Graphs.
NeurIPS 2022 - Workshop on New Frontiers in Graph Learning at the 36th Conference on Neural Information Processing Systems. New Orleans, LA, USA, Nov 28-Dec 09, 2022. URL
Abstract

Current graph neural networks (GNNs) that tackle node classification on graphs tend to only focus on nodewise scores and are solely evaluated by nodewise metrics. This limits uncertainty estimation on graphs since nodewise marginals do not fully characterize the joint distribution given the graph structure. In this work, we propose novel edgewise metrics, namely the edgewise expected calibration error (ECE) and the agree/disagree ECEs, which provide criteria for uncertainty estimation on graphs beyond the nodewise setting. Our experiments demonstrate that the proposed edgewise metrics can complement the nodewise results and yield additional insights. Moreover, we show that GNN models which consider the structured prediction problem on graphs tend to have better uncertainty estimations, which illustrates the benefit of going beyond the nodewise setting.

MCML Authors
Yuesong Shen

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[16]
C. Tomani, D. Cremers and F. Buettner.
Parameterized Temperature Scaling for Boosting the Expressive Power in Post-Hoc Uncertainty Calibration.
ECCV 2022 - 17th European Conference on Computer Vision. Tel Aviv, Israel, Oct 23-27, 2022. DOI GitHub
Abstract

We address the problem of uncertainty calibration and introduce a novel calibration method, Parametrized Temperature Scaling (PTS). Standard deep neural networks typically yield uncalibrated predictions, which can be transformed into calibrated confidence scores using post-hoc calibration methods. In this contribution, we demonstrate that the performance of accuracy-preserving state-of-the-art post-hoc calibrators is limited by their intrinsic expressive power. We generalize temperature scaling by computing prediction-specific temperatures, parameterized by a neural network. We show with extensive experiments that our novel accuracy-preserving approach consistently outperforms existing algorithms across a large number of model architectures, datasets and metrics.
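
The idea of prediction-specific temperatures can be sketched as follows. This is a toy illustration under stated assumptions, not the published PTS architecture: the two-layer network, its random untrained weights, the use of sorted logits as input features, and all function names are hypothetical. It only demonstrates why such a calibrator is accuracy-preserving: dividing a sample's logits by one positive temperature cannot change which class has the largest score.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max subtraction for stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def predict_temperatures(logits, w1, b1, w2, b2):
    """Tiny MLP mapping each sample's sorted logits to a positive temperature."""
    features = -np.sort(-logits, axis=1)          # logits sorted in descending order
    hidden = np.maximum(features @ w1 + b1, 0.0)  # ReLU layer
    raw = hidden @ w2 + b2
    return np.log1p(np.exp(raw)) + 1e-3           # softplus keeps temperatures > 0


def calibrate(logits, params):
    """Rescale every sample's logits by its own temperature, then normalize."""
    temperatures = predict_temperatures(logits, *params)
    return softmax(logits / temperatures)


def random_params(num_classes, hidden=8, seed=0):
    """Untrained placeholder weights; a real calibrator would fit these."""
    rng = np.random.default_rng(seed)
    return (rng.standard_normal((num_classes, hidden)) * 0.1, np.zeros(hidden),
            rng.standard_normal((hidden, 1)) * 0.1, np.zeros(1))
```

In practice the network weights would be fitted on a held-out validation set so that the rescaled confidences match observed accuracy; standard temperature scaling is the special case where the temperature is a single constant shared by all samples.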

MCML Authors
Link to website

Christian Tomani

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[15]
H. Li, Q. Khan, V. Tresp and D. Cremers.
Biologically Inspired Neural Path Finding.
BI 2022 - 15th International Conference on Brain Informatics. Padova, Italy, Jul 15-15, 2022. DOI GitHub
Abstract

The human brain can be considered to be a graphical structure comprising tens of billions of biological neurons connected by synapses. It has the remarkable ability to automatically re-route information flow through alternate paths in case some neurons are damaged. Moreover, the brain is capable of retaining information and applying it to similar but completely unseen scenarios. In this paper, we take inspiration from these attributes of the brain to develop a computational framework to find the optimal low-cost path between a source node and a destination node in a generalized graph. We show that our framework is capable of handling unseen graphs at test time. Moreover, it can find alternate optimal paths when nodes are arbitrarily added or removed during inference, while maintaining a fixed prediction time.

MCML Authors
Link to website

Hang Li

Database Systems & Data Mining

Link to website

Qadeer Khan

Computer Vision & Artificial Intelligence

Link to Profile Volker Tresp

Volker Tresp

Prof. Dr.

Database Systems & Data Mining

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[14]
D. Muhle, L. Koestler, N. Demmel, F. Bernard and D. Cremers.
The Probabilistic Normal Epipolar Constraint for Frame-To-Frame Rotation Optimization under Uncertain Feature Positions.
CVPR 2022 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, LA, USA, Jun 19-24, 2022. DOI
Abstract

The estimation of the relative pose of two camera views is a fundamental problem in computer vision. Kneip et al. proposed to solve this problem by introducing the normal epipolar constraint (NEC). However, their approach does not take into account uncertainties, so that the accuracy of the estimated relative pose is highly dependent on accurate feature positions in the target frame. In this work, we introduce the probabilistic normal epipolar constraint (PNEC) that overcomes this limitation by accounting for anisotropic and inhomogeneous uncertainties in the feature positions. To this end, we propose a novel objective function, along with an efficient optimization scheme that effectively minimizes our objective while maintaining real-time performance. In experiments on synthetic data, we demonstrate that the novel PNEC yields more accurate rotation estimates than the original NEC and several popular relative rotation estimation algorithms. Furthermore, we integrate the proposed method into a state-of-the-art monocular rotation-only odometry system and achieve consistently improved results for the real-world KITTI dataset.

MCML Authors
Link to website

Dominik Muhle

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[13]
F. Müller, Q. Khan and D. Cremers.
Lateral Ego-Vehicle Control Without Supervision Using Point Clouds.
ICPRAI 2022 - 3rd International Conference on Pattern Recognition and Artificial Intelligence. Paris, France, Jun 01-03, 2022. DOI
Abstract

Existing vision based supervised approaches to lateral vehicle control are capable of directly mapping RGB images to the appropriate steering commands. However, they are prone to suffering from inadequate robustness in real world scenarios due to a lack of failure cases in the training data. In this paper, a framework for training a more robust and scalable model for lateral vehicle control is proposed. The framework only requires an unlabeled sequence of RGB images. The trained model takes a point cloud as input and predicts the lateral offset to a subsequent frame from which the steering angle is inferred. The frame poses are in turn obtained from visual odometry. The point cloud is conceived by projecting dense depth maps into 3D. An arbitrary number of additional trajectories from this point cloud can be generated during training. This is to increase the robustness of the model. Online experiments conducted on a driving simulator show that the performance of our model is superior to that of a supervised model trained on the same initial data set and comparable to the same model but trained on data collected with noise injection.

MCML Authors
Link to website

Qadeer Khan

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[12]
C. Tomani and D. Cremers.
Challenger: Training with Attribution Maps.
Preprint (May. 2022). arXiv
Abstract

We show that utilizing attribution maps for training neural networks can improve regularization of models and thus increase performance. Regularization is key in deep learning, especially when training complex models on relatively small datasets. In order to understand inner workings of neural networks, attribution methods such as Layer-wise Relevance Propagation (LRP) have been extensively studied, particularly for interpreting the relevance of input features. We introduce Challenger, a module that leverages the explainable power of attribution maps in order to manipulate particularly relevant input patterns. This exposes and subsequently resolves regions of ambiguity towards separating classes on the ground-truth data manifold, an issue that arises particularly when training models on rather small datasets. Our Challenger module increases model performance through building more diverse filters within the network and can be applied to any input data domain. We demonstrate that our approach results in substantially better classification as well as calibration performance on datasets with only a few samples up to datasets with thousands of samples. In particular, we show that our generic domain-independent approach yields state-of-the-art results in vision, natural language processing and on time series tasks.

MCML Authors
Link to website

Christian Tomani

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[11]
C. Brunner, A. Duensing, C. Schröder, M. Mittermair, V. Golkov, M. Pollanka, D. Cremers and R. Kienberger.
Deep Learning in Attosecond Metrology.
Optics Express 30.9 (Apr. 2022). Editor’s Pick. DOI
Abstract

Time-resolved photoelectron spectroscopy provides a versatile tool for investigating electron dynamics in gaseous, liquid, and solid samples on sub-femtosecond time scales. The extraction of information from spectrograms recorded with the attosecond streak camera remains a difficult challenge. Common algorithms are highly specialized and typically computationally heavy. In this work, we apply deep neural networks to map from streaking traces to near-infrared pulses as well as electron wavepackets and extensively benchmark our results on simulated data. Additionally, we illustrate domain-shift to real-world data. We also attempt to quantify the model predictive uncertainty. Our deep neural networks display competitive retrieval quality and superior tolerance against noisy data conditions, while reducing the computational time by orders of magnitude.

MCML Authors
Link to website

Vladimir Golkov

Dr.

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[10]
M. Weber, J. Xie, M. Collins, Y. Zhu, H. Adam, B. Green, A. Geiger, D. Cremers, A. Ošep, L. Leal-Taixé, P. Voigtlaender and B. Chen.
STEP: Segmenting and Tracking Every Pixel.
NeurIPS 2021 - Track on Datasets and Benchmarks at the 35th Conference on Neural Information Processing Systems. Virtual, Dec 06-14, 2021. PDF
Abstract

The task of assigning semantic classes and track identities to every pixel in a video is called video panoptic segmentation. Our work is the first that targets this task in a real-world setting requiring dense interpretation in both spatial and temporal domains. As the ground-truth for this task is difficult and expensive to obtain, existing datasets are either constructed synthetically or only sparsely annotated within short video clips. To overcome this, we introduce a new benchmark encompassing two datasets, KITTI-STEP, and MOTChallenge-STEP. The datasets contain long video sequences, providing challenging examples and a test-bed for studying long-term pixel-precise segmentation and tracking under real-world conditions. We further propose a novel evaluation metric Segmentation and Tracking Quality (STQ) that fairly balances semantic and tracking aspects of this task and is more appropriate for evaluating sequences of arbitrary length. Finally, we provide several baselines to evaluate the status of existing methods on this new challenging dataset. We have made our datasets, metric, benchmark servers, and baselines publicly available, and hope this will inspire future research.

MCML Authors
Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence

Laura Leal-Taixé

Prof. Dr.

* Former member


[9]
Y. Wang, Y. Shen and D. Cremers.
Explicit pairwise factorized graph neural network for semi-supervised node classification.
UAI 2021 - Conference on Uncertainty in Artificial Intelligence. Virtual, Jul 27-29, 2021. URL
Abstract

Node features and structural information of a graph are both crucial for semi-supervised node classification problems. A variety of graph neural network (GNN) based approaches have been proposed to tackle these problems, which typically determine output labels through feature aggregation. This can be problematic, as it implies conditional independence of output nodes given hidden representations, despite their direct connections in the graph. To learn the direct influence among output nodes in a graph, we propose the Explicit Pairwise Factorized Graph Neural Network (EPFGNN), which models the whole graph as a partially observed Markov Random Field. It contains explicit pairwise factors to model output-output relations and uses a GNN backbone to model input-output relations. To balance model complexity and expressivity, the pairwise factors have a shared component and a separate scaling coefficient for each edge. We apply the EM algorithm to train our model, and utilize a star-shaped piecewise likelihood for the tractable surrogate objective. We conduct experiments on various datasets, which show that our model can effectively improve the performance for semi-supervised node classification on graphs.

MCML Authors
Yuesong Shen

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[8]
T. Frerix, D. Kochkov, J. Smith, D. Cremers, M. Brenner and S. Hoyer.
Variational Data Assimilation with a Learned Inverse Observation Operator.
ICML 2021 - 38th International Conference on Machine Learning. Virtual, Jul 18-24, 2021. URL
Abstract

Variational data assimilation optimizes for an initial state of a dynamical system such that its evolution fits observational data. The physical model can subsequently be evolved into the future to make predictions. This principle is a cornerstone of large scale forecasting applications such as numerical weather prediction. As such, it is implemented in current operational systems of weather forecasting agencies across the globe. However, finding a good initial state poses a difficult optimization problem in part due to the non-invertible relationship between physical states and their corresponding observations. We learn a mapping from observational data to physical states and show how it can be used to improve optimizability. We employ this mapping in two ways: to better initialize the non-convex optimization problem, and to reformulate the objective function in better behaved physics space instead of observation space. Our experimental results for the Lorenz96 model and a two-dimensional turbulent fluid flow demonstrate that this procedure significantly improves forecast quality for chaotic systems.
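
The baseline problem described above can be sketched on the Lorenz96 model. The code below is an illustrative assumption-laden sketch, not the paper's method: it implements the standard Lorenz96 tendency, a fourth-order Runge-Kutta step, a hypothetical partial observation operator that keeps every second state variable, and the squared-mismatch objective that variational data assimilation minimizes over the initial state. The learned inverse observation operator proposed in the paper is not included.

```python
import numpy as np


def lorenz96_tendency(x, forcing=8.0):
    """Time derivative dx_i/dt = (x_{i+1} - x_{i-2}) * x_{i-1} - x_i + F."""
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + forcing


def rk4_step(x, dt=0.01, forcing=8.0):
    """Advance the state by one classical Runge-Kutta step."""
    k1 = lorenz96_tendency(x, forcing)
    k2 = lorenz96_tendency(x + 0.5 * dt * k1, forcing)
    k3 = lorenz96_tendency(x + 0.5 * dt * k2, forcing)
    k4 = lorenz96_tendency(x + dt * k3, forcing)
    return x + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0


def rollout(x0, num_steps, observe=lambda x: x[::2]):
    """Evolve the state and record an (assumed) partial observation per step."""
    x, observations = np.asarray(x0, dtype=float), []
    for _ in range(num_steps):
        x = rk4_step(x)
        observations.append(observe(x))
    return np.stack(observations)


def assimilation_loss(x0_guess, observed):
    """Squared mismatch between simulated and recorded observations."""
    simulated = rollout(x0_guess, len(observed))
    return float(np.sum((simulated - observed) ** 2))
```

Minimizing `assimilation_loss` over `x0_guess` with a gradient-based optimizer is the non-convex problem the abstract refers to; because only half of the state is observed here, many initial guesses can fit short observation windows similarly well, which is why a good initialization matters.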

MCML Authors
Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[7]
M. Eisenberger, D. Novotny, G. Kerchenbaum, P. Labatut, N. Neverova, D. Cremers and A. Vedaldi.
NeuroMorph: Unsupervised Shape Interpolation and Correspondence in One Go.
CVPR 2021 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Virtual, Jun 19-25, 2021. DOI GitHub
Abstract

We present NeuroMorph, a new neural network architecture that takes as input two 3D shapes and produces in one go, i.e. in a single feed-forward pass, a smooth interpolation and point-to-point correspondences between them. The interpolation, expressed as a deformation field, changes the pose of the source shape to resemble the target, but leaves the object identity unchanged. NeuroMorph uses an elegant architecture combining graph convolutions with global feature pooling to extract local features. During training, the model is incentivized to create realistic deformations by approximating geodesics on the underlying shape space manifold. This strong geometric prior allows us to train our model end-to-end and in a fully unsupervised manner without requiring any manual correspondence annotations. NeuroMorph works well for a large variety of input shapes, including non-isometric pairs from different object categories. It obtains state-of-the-art results for both shape correspondence and interpolation tasks, matching or surpassing the performance of recent unsupervised and supervised methods on multiple benchmarks.

MCML Authors
Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[6]
M. Gao, Z. Lähner, J. Thunberg, D. Cremers and F. Bernard.
Isometric Multi-Shape Matching.
CVPR 2021 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Virtual, Jun 19-25, 2021. DOI GitHub
Abstract

Finding correspondences between shapes is a fundamental problem in computer vision and graphics, which is relevant for many applications, including 3D reconstruction, object tracking, and style transfer. The vast majority of correspondence methods aim to find a solution between pairs of shapes, even if multiple instances of the same class are available. While isometries are often studied in shape correspondence problems, they have not been considered explicitly in the multi-matching setting. This paper closes this gap by proposing a novel optimisation formulation for isometric multi-shape matching. We present a suitable optimisation algorithm for solving our formulation and provide a convergence and complexity analysis. Our algorithm obtains multi-matchings that are by construction provably cycle-consistent. We demonstrate the superior performance of our method on various datasets and set the new state-of-the-art in isometric multi-shape matching.

MCML Authors
Link to website

Maolin Gao

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[5]
C. Tomani, S. Gruber, M. E. Erdem, D. Cremers and F. Buettner.
Post-hoc Uncertainty Calibration for Domain Drift Scenarios.
CVPR 2021 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Virtual, Jun 19-25, 2021. DOI
Abstract

We address the problem of uncertainty calibration. While standard deep neural networks typically yield uncalibrated predictions, calibrated confidence scores that are representative of the true likelihood of a prediction can be achieved using post-hoc calibration methods. However, to date, the focus of these approaches has been on in-domain calibration. Our contribution is two-fold. First, we show that existing post-hoc calibration methods yield highly over-confident predictions under domain shift. Second, we introduce a simple strategy where perturbations are applied to samples in the validation set before performing the post-hoc calibration step. In extensive experiments, we demonstrate that this perturbation step results in substantially better calibration under domain shift on a wide range of architectures and modelling tasks.
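
The strategy of perturbing validation samples before the post-hoc step can be sketched as follows. This is a minimal illustration, assuming temperature scaling as the post-hoc method, Gaussian input noise as the perturbation, and a random linear map W standing in for a trained classifier; none of these specifics are taken from the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, temperature):
    """Mean negative log-likelihood of the labels after temperature scaling."""
    probs = softmax(logits / temperature)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

def fit_temperature(logits, labels, grid):
    """Pick the temperature from `grid` that minimises validation NLL."""
    losses = [nll(logits, labels, t) for t in grid]
    return float(grid[int(np.argmin(losses))])

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 3))              # hypothetical stand-in for a trained model
x_val = rng.normal(size=(500, 10))        # synthetic validation inputs
y_val = np.argmax(x_val @ W + rng.normal(scale=2.0, size=(500, 3)), axis=1)

grid = np.arange(1, 101) / 10.0           # candidate temperatures 0.1 ... 10.0

# Standard post-hoc calibration on clean validation data.
t_clean = fit_temperature(x_val @ W, y_val, grid)

# Assumed variant: perturb the validation inputs first, then calibrate.
x_pert = x_val + rng.normal(scale=1.0, size=x_val.shape)
t_pert = fit_temperature(x_pert @ W, y_val, grid)

print(t_clean, t_pert)
```

The only difference between the two calls is the data the temperature is fitted on, which is the point of the strategy: the calibration map itself stays unchanged.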

MCML Authors
Link to website

Christian Tomani

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[4]
P. Müller, V. Golkov, V. Tomassini and D. Cremers.
Rotation-Equivariant Deep Learning for Diffusion MRI (short version).
ISMRM 2021 - International Society for Magnetic Resonance in Medicine Annual Meeting. Virtual, May 15-20, 2021. Long version in arXiv. arXiv
Abstract

Convolutional networks are successful, but they have recently been outperformed by new neural networks that are equivariant under rotations and translations. These new networks work better because they do not struggle with learning each possible orientation of each image feature separately. So far, they have been proposed for 2D and 3D data. Here we generalize them to 6D diffusion MRI data, ensuring joint equivariance under 3D roto-translations in image space and the matching 3D rotations in q-space, as dictated by the image formation. Such equivariant deep learning is appropriate for diffusion MRI, because microstructural and macrostructural features such as neural fibers can appear at many different orientations, and because even non-rotation-equivariant deep learning has so far been the best method for many diffusion MRI tasks. We validate our equivariant method on multiple-sclerosis lesion segmentation. Our proposed neural networks yield better results and require fewer scans for training compared to non-rotation-equivariant deep learning. They also inherit all the advantages of deep learning over classical diffusion MRI methods. Our implementation is publicly available and can be used off the shelf without understanding the mathematical background.

MCML Authors
Link to website

Vladimir Golkov

Dr.

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[3]
G. Fabbro, V. Golkov, T. Kemp and D. Cremers.
Speech Synthesis and Control Using Differentiable DSP.
Preprint (Oct. 2020). arXiv
Abstract

Modern text-to-speech systems are able to produce natural and high-quality speech, but speech contains factors of variation (e.g. pitch, rhythm, loudness, timbre) that text alone cannot contain. In this work we move towards a speech synthesis system that can produce diverse speech renditions of a text by allowing (but not requiring) explicit control over the various factors of variation. We propose a new neural vocoder that offers control of such factors of variation. This is achieved by employing differentiable digital signal processing (DDSP) (previously used only for music rather than speech), which exposes these factors of variation. The results show that the proposed approach can produce natural speech with realistic timbre, and individual factors of variation can be freely controlled.

MCML Authors
Link to website

Vladimir Golkov

Dr.

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[2]
F. Wimbauer, N. Yang, L. von Stumberg, N. Zeller and D. Cremers.
MonoRec: Semi-Supervised Dense Reconstruction in Dynamic Environments from a Single Moving Camera.
CVPR 2020 - IEEE/CVF Conference on Computer Vision and Pattern Recognition. Virtual, Jun 14-19, 2020. DOI GitHub
Abstract

In this paper, we propose MonoRec, a semi-supervised monocular dense reconstruction architecture that predicts depth maps from a single moving camera in dynamic environments. MonoRec is based on a multi-view stereo setting which encodes the information of multiple consecutive images in a cost volume. To deal with dynamic objects in the scene, we introduce a MaskModule that predicts moving object masks by leveraging the photometric inconsistencies encoded in the cost volumes. Unlike other multi-view stereo methods, MonoRec is able to reconstruct both static and moving objects by leveraging the predicted masks. Furthermore, we present a novel multi-stage training scheme with a semi-supervised loss formulation that does not require LiDAR depth values. We carefully evaluate MonoRec on the KITTI dataset and show that it achieves state-of-the-art performance compared to both multi-view and single-view methods. With the model trained on KITTI, we furthermore demonstrate that MonoRec is able to generalize well to both the Oxford RobotCar dataset and the more challenging TUM-Mono dataset recorded by a handheld camera.

MCML Authors
Link to website

Felix Wimbauer

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


[1]
L. Della Libera, V. Golkov, Y. Zhu, A. Mielke and D. Cremers.
Deep Learning for 2D and 3D Rotatable Data: An Overview of Methods.
Preprint (Oct. 2019). arXiv
Abstract

Convolutional networks are successful due to their equivariance/invariance under translations. However, rotatable data such as images, volumes, shapes, or point clouds require processing with equivariance/invariance under rotations in cases where the rotational orientation of the coordinate system does not affect the meaning of the data (e.g. object classification). On the other hand, estimation/processing of rotations is necessary in cases where rotations are important (e.g. motion estimation). There has been recent progress in methods and theory in all these regards. Here we provide an overview of existing methods, both for 2D and 3D rotations (and translations), and identify commonalities and links between them.

MCML Authors
Link to website

Vladimir Golkov

Dr.

Computer Vision & Artificial Intelligence

Link to Profile Daniel Cremers

Daniel Cremers

Prof. Dr.

Computer Vision & Artificial Intelligence


B2 | Natural Language Processing

Natural Language Processing (NLP) focuses on understanding and generating natural language text, greatly influenced by recent advances in deep learning. Despite substantial progress, our MCML researchers address key challenges like enhancing deep language understanding through structural biases, developing common sense in models through experimental environments, and improving sample efficiency for more effective learning from large datasets.

Link to Profile Alexander Fraser

Alexander Fraser

Prof. Dr.

Data Analytics & Statistics

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning

Link to Profile Michael Hedderich

Michael Hedderich

Dr.

JRG Leader Human-Centered NLP

Artificial Intelligence and Computational Linguistics

Publication in Research Area B2
[179]
R. Litschko, O. Kraus, V. Blaschke and B. Plank.
Cross-Dialect Information Retrieval: Information Access in Low-Resource and High-Variance Languages.
COLING 2025 - The 31st International Conference on Computational Linguistics. Abu Dhabi, UAE, Jan 19-24, 2025. To be published. Preprint available. arXiv
Abstract

A large amount of local and culture-specific knowledge (e.g., people, traditions, food) can only be found in documents written in dialects. While there has been extensive research conducted on cross-lingual information retrieval (CLIR), the field of cross-dialect retrieval (CDIR) has received limited attention. Dialect retrieval poses unique challenges due to the limited availability of resources to train retrieval models and the high variability in non-standardized languages. We study these challenges on the example of German dialects and introduce the first German dialect retrieval dataset, dubbed WikiDIR, which consists of seven German dialects extracted from Wikipedia. Using WikiDIR, we demonstrate the weakness of lexical methods in dealing with high lexical variation in dialects. We further show that the commonly used zero-shot cross-lingual transfer approach with multilingual encoders does not transfer well to extremely low-resource setups, motivating the need for resource-lean and dialect-specific retrieval models. We finally demonstrate that (document) translation is an effective way to reduce the dialect gap in CDIR.

MCML Authors
Link to website

Robert Litschko

Artificial Intelligence and Computational Linguistics

Link to website

Verena Blaschke

Artificial Intelligence and Computational Linguistics

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[178]
Y. Liu, C. Ma, H. Ye and H. Schütze.
TransMI: A Framework to Create Strong Baselines from Multilingual Pretrained Language Models for Transliterated Data.
COLING 2025 - The 31st International Conference on Computational Linguistics. Abu Dhabi, UAE, Jan 19-24, 2025. To be published. Preprint available. arXiv GitHub
Abstract

Transliterating related languages that use different scripts into a common script shows effectiveness in improving crosslingual transfer in downstream tasks. However, this methodology often makes pretraining a model from scratch unavoidable, as transliteration brings about new subwords not covered in existing multilingual pretrained language models (mPLMs). This is undesirable because pretraining requires a large computation budget. A more promising way is to make full use of available mPLMs. To this end, this paper proposes a simple but effective framework: Transliterate-Merge-Initialize (TransMI), which can create a strong baseline well-suited for data that is transliterated into a common script by exploiting an mPLM and its accompanying tokenizer. TransMI has three stages: (a) transliterate the vocabulary of an mPLM into a common script; (b) merge the new vocabulary with the original vocabulary; and (c) initialize the embeddings of the new subwords. We applied TransMI to three recent strong mPLMs, and our experiments demonstrate that TransMI not only preserves their ability to handle non-transliterated data, but also enables the models to effectively process transliterated data: the results show a consistent improvement of 3% to 34%, varying across different models and tasks.
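
The three stages (transliterate, merge, initialize) can be sketched on a toy vocabulary. Everything below is an assumption made for illustration: the Greek-to-Latin character table, the function names, and the choice to initialize each new subword with the mean embedding of the subwords that transliterate to it. The released implementation may differ.

```python
import numpy as np

# Hypothetical toy transliteration table; a real pipeline would use a proper tool.
TRANSLIT = {"α": "a", "β": "b", "γ": "g"}

def transliterate(token):
    return "".join(TRANSLIT.get(ch, ch) for ch in token)

def transmi_sketch(vocab, embeddings):
    """vocab: list of unique subwords; embeddings: array of shape (len(vocab), dim)."""
    # (a) Transliterate every subword and remember which originals map to it.
    sources = {}
    for idx, token in enumerate(vocab):
        sources.setdefault(transliterate(token), []).append(idx)

    # (b) Merge: keep the original vocabulary, append only unseen subwords.
    existing = set(vocab)
    new_tokens = [t for t in sources if t not in existing]
    merged_vocab = list(vocab) + new_tokens

    # (c) Initialize each new subword from its source embeddings (assumed: mean).
    if new_tokens:
        new_rows = np.stack([embeddings[sources[t]].mean(axis=0) for t in new_tokens])
    else:
        new_rows = np.empty((0, embeddings.shape[1]))
    return merged_vocab, np.vstack([embeddings, new_rows])

vocab = ["αβ", "γα", "ab", "βγ"]
emb = np.random.default_rng(0).normal(size=(len(vocab), 4))
merged_vocab, merged_emb = transmi_sketch(vocab, emb)
print(merged_vocab)
```

Original rows are left untouched, which mirrors the stated goal of preserving the model's behaviour on non-transliterated data.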

MCML Authors
Link to website

Yihong Liu

Statistical NLP and Deep Learning

Link to website

Chunlan Ma

Statistical NLP and Deep Learning

Link to website

Haotian Ye

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[177]
Y. Liu, M. Wang, A. H. Kargaran, A. Imani, O. Xhelili, H. Ye, C. Ma, F. Yvon and H. Schütze.
How Transliterations Improve Crosslingual Alignment.
COLING 2025 - The 31st International Conference on Computational Linguistics. Abu Dhabi, UAE, Jan 19-24, 2025. To be published. Preprint available. arXiv
Abstract

Recent studies have shown that post-aligning multilingual pretrained language models (mPLMs) using alignment objectives on both original and transliterated data can improve crosslingual alignment. This improvement further leads to better crosslingual transfer performance. However, it remains unclear how and why a better crosslingual alignment is achieved, as this technique only involves transliterations, and does not use any parallel data. This paper attempts to explicitly evaluate the crosslingual alignment and identify the key elements in transliteration-based approaches that contribute to better performance. For this, we train multiple models under varying setups for two pairs of related languages: (1) Polish and Ukrainian and (2) Hindi and Urdu. To assess alignment, we define four types of similarities based on sentence representations. Our experiments show that adding transliterations alone improves the overall similarities, even for random sentence pairs. With the help of auxiliary alignment objectives, especially the contrastive objective, the model learns to distinguish matched from random pairs, leading to better alignments. However, we also show that better alignment does not always yield better downstream performance, suggesting that further research is needed to clarify the connection between alignment and performance.

MCML Authors
Link to website

Yihong Liu

Statistical NLP and Deep Learning

Link to website

Mingyang Wang

Statistical NLP and Deep Learning

Link to website

Amir Hossein Kargaran

Statistical NLP and Deep Learning

Link to website

Ayyoob Imani

Statistical NLP and Deep Learning

Link to website

Haotian Ye

Statistical NLP and Deep Learning

Link to website

Chunlan Ma

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[176]
A. Muñoz-Ortiz, V. Blaschke and B. Plank.
Evaluating Pixel Language Models on Non-Standardized Languages.
COLING 2025 - The 31st International Conference on Computational Linguistics. Abu Dhabi, UAE, Jan 19-24, 2025. To be published. Preprint available. arXiv
Abstract

We explore the potential of pixel-based models for transfer learning from standard languages to dialects. These models convert text into images that are divided into patches, enabling a continuous vocabulary representation that proves especially useful for out-of-vocabulary words common in dialectal data. Using German as a case study, we compare the performance of pixel-based models to token-based models across various syntactic and semantic tasks. Our results show that pixel-based models outperform token-based models in part-of-speech tagging, dependency parsing and intent detection for zero-shot dialect evaluation by up to 26 percentage points in some scenarios, though not in Standard German. However, pixel-based models fall short in topic classification. These findings emphasize the potential of pixel-based models for handling dialectal data, though further research should be conducted to assess their effectiveness in various linguistic contexts.

MCML Authors
Link to website

Verena Blaschke

Artificial Intelligence and Computational Linguistics

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[175]
A.-M. Lutgen, A. Plum, C. Purschke and B. Plank.
Neural Text Normalization for Luxembourgish using Real-Life Variation Data.
VarDial @COLING 2025 - 12th Workshop on NLP for Similar Languages, Varieties and Dialects at the 31st International Conference on Computational Linguistics (COLING 2025). Abu Dhabi, UAE, Jan 19-24, 2025. To be published. Preprint available. arXiv
Abstract

Orthographic variation is very common in Luxembourgish texts due to the absence of a fully-fledged standard variety. Additionally, developing NLP tools for Luxembourgish is a difficult task given the lack of annotated and parallel data, which is exacerbated by ongoing standardization. In this paper, we propose the first sequence-to-sequence normalization models using the ByT5 and mT5 architectures with training data obtained from word-level real-life variation data. We perform a fine-grained, linguistically-motivated evaluation to test byte-based, word-based and pipeline-based models for their strengths and weaknesses in text normalization. We show that our sequence model using real-life variation data is an effective approach for tailor-made normalization in Luxembourgish.

MCML Authors
Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[174]
A. H. Kargaran, F. Yvon and H. Schütze.
GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages.
NeurIPS 2024 - 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. To be published. Preprint available. arXiv
Abstract

The need for large text corpora has increased with the advent of pretrained language models and, in particular, the discovery of scaling laws for these models. Most available corpora have sufficient data only for languages with large dominant communities. However, there is no corpus available that (i) covers a wide range of minority languages; (ii) is generated by an open-source reproducible pipeline; and (iii) is rigorously cleaned from noise, making it trustworthy to use. We present GlotCC, a clean, document-level, 2TB general domain corpus derived from CommonCrawl, covering more than 1000 languages. We make GlotCC and the system used to generate it - including the pipeline, language identification model, and filters - available to the research community.

MCML Authors
Link to website

Amir Hossein Kargaran

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[173]
Y. Zhang, Y. Li, X. Wang, Q. Shen, B. Plank, B. Bischl, M. Rezaei and K. Kawaguchi.
FinerCut: Finer-grained Interpretable Layer Pruning for Large Language Models.
NeurIPS 2024 - Workshop on Machine Learning and Compression at the 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. To be published. Preprint available. arXiv
Abstract

Overparametrized transformer networks are the state-of-the-art architecture for Large Language Models (LLMs). However, such models contain billions of parameters making large compute a necessity, while raising environmental concerns. To address these issues, we propose FinerCut, a new form of fine-grained layer pruning, which in contrast to prior work at the transformer block level, considers all self-attention and feed-forward network (FFN) layers within blocks as individual pruning candidates. FinerCut prunes layers whose removal causes minimal alteration to the model’s output – contributing to a new, lean, interpretable, and task-agnostic pruning method. Tested across 9 benchmarks, our approach retains 90% performance of Llama3-8B with 25% layers removed, and 95% performance of Llama3-70B with 30% layers removed, all without fine-tuning or post-pruning reconstruction. Strikingly, we observe intriguing results with FinerCut: 42% (34 out of 80) of the self-attention layers in Llama3-70B can be removed while preserving 99% of its performance – without additional fine-tuning after removal. Moreover, FinerCut provides a tool to inspect the types and locations of pruned layers, allowing one to observe interesting pruning behaviors. For instance, we observe a preference for pruning self-attention layers, often at deeper consecutive decoder layers. We hope our insights inspire future efficient LLM architecture designs.
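
The idea of pruning the layers whose removal changes the output least can be sketched with a greedy loop over a toy residual stack. The toy model, the output-difference norm used as the score, and all names are assumptions for this example and stand in for the paper's actual layers and similarity measure. Two layers are deliberately built as no-ops so the outcome is easy to verify.

```python
import numpy as np

def forward(x, layers, active):
    """Run a toy residual stack, skipping every layer not in `active`."""
    h = x
    for i, w in enumerate(layers):
        if i in active:
            h = h + np.tanh(h @ w)
    return h

def greedy_prune(x, layers, n_remove):
    """Repeatedly drop the layer whose removal changes the output least."""
    reference = forward(x, layers, set(range(len(layers))))  # unpruned output
    active = set(range(len(layers)))
    removed = []
    for _ in range(n_remove):
        scores = {
            i: np.linalg.norm(forward(x, layers, active - {i}) - reference)
            for i in sorted(active)
        }
        best = min(scores, key=scores.get)
        active.remove(best)
        removed.append(best)
    return removed, active

rng = np.random.default_rng(0)
dim = 8
scales = [1.0, 0.0, 1.0, 0.5, 0.0, 1.0]   # layers 1 and 4 have zero weights (no-ops)
layers = [s * rng.normal(size=(dim, dim)) for s in scales]
x = rng.normal(size=(32, dim))

removed, active = greedy_prune(x, layers, n_remove=2)
print(removed, sorted(active))
```

Each candidate is scored against the unpruned model on the same inputs, so no fine-tuning or reconstruction step is involved in this sketch.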

MCML Authors
Link to website

Yawei Li

Statistical Learning & Data Science

Link to website

Xinpeng Wang

Artificial Intelligence and Computational Linguistics

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics

Link to Profile Bernd Bischl

Bernd Bischl

Prof. Dr.

Statistical Learning & Data Science

Link to website

Mina Rezaei

Dr.

Statistical Learning & Data Science


[172]
B. Chen, S. Peng, A. Korhonen and B. Plank.
A Rose by Any Other Name: LLM-Generated Explanations Are Good Proxies for Human Explanations to Collect Label Distributions on NLI.
Preprint (Dec. 2024). arXiv
Abstract

Disagreement in human labeling is ubiquitous, and can be captured in human judgment distributions (HJDs). Recent research has shown that explanations provide valuable information for understanding human label variation (HLV) and large language models (LLMs) can approximate HJD from a few human-provided label-explanation pairs. However, collecting explanations for every label is still time-consuming. This paper examines whether LLMs can be used to replace humans in generating explanations for approximating HJD. Specifically, we use LLMs as annotators to generate model explanations for a few given human labels. We test ways to obtain and combine these label-explanations with the goal to approximate human judgment distribution. We further compare the resulting human with model-generated explanations, and test automatic and human explanation selection. Our experiments show that LLM explanations are promising for NLI: to estimate HJD, generated explanations yield comparable results to human explanations when provided with human labels. Importantly, our results generalize from datasets with human explanations to i) datasets where they are not available and ii) challenging out-of-distribution test sets.

MCML Authors
Link to website

Beiduo Chen

Artificial Intelligence and Computational Linguistics

Link to website

Siyao Peng

Dr.

Artificial Intelligence and Computational Linguistics

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[171]
B. Ma, B. Yoztyurk, A.-C. Haensch, X. Wang, M. Herklotz, F. Kreuter, B. Plank and M. Assenmacher.
Algorithmic Fidelity of Large Language Models in Generating Synthetic German Public Opinions: A Case Study.
Preprint (Dec. 2024). arXiv
Abstract

In recent research, large language models (LLMs) have been increasingly used to investigate public opinions. This study investigates the algorithmic fidelity of LLMs, i.e., the ability to replicate the socio-cultural context and nuanced opinions of human participants. Using open-ended survey data from the German Longitudinal Election Studies (GLES), we prompt different LLMs to generate synthetic public opinions reflective of German subpopulations by incorporating demographic features into the persona prompts. Our results show that Llama performs better than other LLMs at representing subpopulations, particularly when there is lower opinion diversity within those groups. Our findings further reveal that the LLM performs better for supporters of left-leaning parties like The Greens and The Left compared to other parties, and matches the least with the right-party AfD. Additionally, the inclusion or exclusion of specific variables in the prompts can significantly impact the models’ predictions. These findings underscore the importance of aligning LLMs to more effectively model diverse public opinions while minimizing political biases and enhancing robustness in representativeness.

MCML Authors
Link to website

Bolei Ma

Social Data Science and AI Lab

Link to website

Xinpeng Wang

Artificial Intelligence and Computational Linguistics

Link to Profile Frauke Kreuter

Frauke Kreuter

Prof. Dr.

Social Data Science and AI Lab

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[170]
A. Testoni, B. Plank and R. Fernández.
RACQUET: Unveiling the Dangers of Overlooked Referential Ambiguity in Visual LLMs.
Preprint (Dec. 2024). arXiv
Abstract

Ambiguity resolution is key to effective communication. While humans effortlessly address ambiguity through conversational grounding strategies, the extent to which current language models can emulate these strategies remains unclear. In this work, we examine referential ambiguity in image-based question answering by introducing RACQUET, a carefully curated dataset targeting distinct aspects of ambiguity. Through a series of evaluations, we reveal significant limitations and problems of overconfidence of state-of-the-art large multimodal language models in addressing ambiguity in their responses. The overconfidence issue becomes particularly relevant for RACQUET-BIAS, a subset designed to analyze a critical yet underexplored problem: failing to address ambiguity leads to stereotypical, socially biased responses. Our results underscore the urgency of equipping models with robust strategies to deal with uncertainty without resorting to undesirable stereotypes.

MCML Authors
Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[169]
H. Ye, A. Wisiorek, A. Maronikolakis, Ö. Alaçam and H. Schütze.
A Federated Approach to Few-Shot Hate Speech Detection for Marginalized Communities.
Preprint (Dec. 2024). arXiv
Abstract

Hate speech online remains an understudied issue for marginalized communities, and has seen rising relevance, especially in the Global South, which includes developing societies with increasing internet penetration. In this paper, we aim to provide marginalized communities living in societies where the dominant language is low-resource with a privacy-preserving tool to protect themselves from hate speech on the internet by filtering offensive content in their native languages. Our contribution in this paper is twofold: 1) we release REACT (REsponsive hate speech datasets Across ConTexts), a collection of high-quality, culture-specific hate speech detection datasets comprising seven distinct target groups in eight low-resource languages, curated by experienced data collectors; 2) we propose a solution to few-shot hate speech detection utilizing federated learning (FL), a privacy-preserving and collaborative learning approach, to continuously improve a central model that exhibits robustness when tackling different target groups and languages. By keeping the training local to the users’ devices, we ensure the privacy of the users’ data while benefitting from the efficiency of federated learning. Furthermore, we personalize client models to target-specific training data and evaluate their performance. Our results indicate the effectiveness of FL across different target groups, whereas the benefits of personalization on few-shot learning are not clear.

MCML Authors
Link to website

Haotian Ye

Statistical NLP and Deep Learning

Link to website

Axel Wisiorek

Dr.

Statistical NLP and Deep Learning

Link to website

Antonis Maronikolakis

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[168]
M. Di Marco and A. Fraser.
Subword Segmentation in LLMs: Looking at Inflection and Consistency.
EMNLP 2024 - Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI
Abstract

The role of subword segmentation in relation to capturing morphological patterns in LLMs is currently not well explored. Ideally, one would train models like GPT using various segmentations and evaluate how well word meanings are captured. Since this is not computationally feasible, we group words according to their segmentation properties and compare how well a model can solve a linguistic task for these groups. We study two criteria: (i) adherence to morpheme boundaries and (ii) the segmentation consistency of the different inflected forms of a lemma. We select word forms with high and low values for these criteria and carry out experiments on GPT-4o’s ability to capture verbal inflection for 10 languages. Our results indicate that in particular the criterion of segmentation consistency can help to predict the model’s ability to recognize and generate the lemma from an inflected form, providing evidence that subword segmentation is relevant.

MCML Authors
Link to Profile Alexander Fraser

Alexander Fraser

Prof. Dr.

Data Analytics & Statistics


[167]
L. Edman, H. Schmid and A. Fraser.
CUTE: Measuring LLMs’ Understanding of Their Tokens.
EMNLP 2024 - Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI
Abstract

Large Language Models (LLMs) show remarkable performance on a wide variety of tasks. Most LLMs split text into multi-character tokens and process them as atomic units without direct access to individual characters. This raises the question: To what extent can LLMs learn orthographic information? To answer this, we propose a new benchmark, CUTE, which features a collection of tasks designed to test the orthographic knowledge of LLMs. We evaluate popular LLMs on CUTE, finding that most of them seem to know the spelling of their tokens, yet fail to use this information effectively to manipulate text, calling into question how much of this knowledge is generalizable.

MCML Authors
Link to website at LMU

Lukas Edman

Dr.

Data Analytics & Statistics

Link to Profile Alexander Fraser

Alexander Fraser

Prof. Dr.

Data Analytics & Statistics


[166]
W. Lai, V. Hangya and A. Fraser.
Style-Specific Neurons for Steering LLMs in Text Style Transfer.
EMNLP 2024 - Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI
Abstract

Text style transfer (TST) aims to modify the style of a text without altering its original meaning. Large language models (LLMs) demonstrate superior performance across multiple tasks, including TST. However, in zero-shot setups, they tend to directly copy a significant portion of the input text to the output without effectively changing its style. To enhance the stylistic variety and fluency of the text, we present sNeuron-TST, a novel approach for steering LLMs using style-specific neurons in TST. Specifically, we identify neurons associated with the source and target styles and deactivate source-style-only neurons to give target-style words a higher probability, aiming to enhance the stylistic diversity of the generated text. However, we find that this deactivation negatively impacts the fluency of the generated text, which we address by proposing an improved contrastive decoding method that accounts for rapid token probability shifts across layers caused by deactivated source-style neurons. Empirical experiments demonstrate the effectiveness of the proposed method on six benchmarks, encompassing formality, toxicity, politics, politeness, authorship, and sentiment.

MCML Authors
Link to Profile Alexander Fraser

Alexander Fraser

Prof. Dr.

Data Analytics & Statistics


[165]
Y. Liu, Y. Zhang, Q. Li, T. Liu, S. Feng, D. Wang, Y. Zhang and H. Schütze.
HiFT: A Hierarchical Full Parameter Fine-Tuning Strategy.
EMNLP 2024 - Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI
Abstract

Full-parameter fine-tuning has become the go-to choice for adapting language models (LMs) to downstream tasks due to its excellent performance. As LMs grow in size, fine-tuning the full parameters of LMs requires a prohibitively large amount of GPU memory. Existing approaches utilize zeroth-order optimizers to conserve GPU memory, which can potentially compromise the performance of LMs as non-zero order optimizers tend to converge more readily on most downstream tasks. In this paper, we propose a novel optimizer-independent end-to-end hierarchical fine-tuning strategy, HiFT, which only updates a subset of parameters at each training step. HiFT can significantly reduce the amount of gradients and optimizer state parameters residing in GPU memory at the same time, thereby reducing GPU memory usage. Our results demonstrate that: (1) HiFT achieves comparable performance to parameter-efficient fine-tuning and standard full parameter fine-tuning. (2) HiFT supports various optimizers including AdamW, AdaGrad, SGD, etc. (3) HiFT can save more than 60% GPU memory compared with standard full-parameter fine-tuning for a 7B model. (4) HiFT enables full-parameter fine-tuning of a 7B model on a single 48G A6000 with a precision of 32 using the AdamW optimizer, without using any memory saving techniques.
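
Updating only a subset of parameters at each training step can be sketched with a round-robin schedule over parameter groups. The two-layer linear model, the plain gradient-descent update, the grouping, and the function names are assumptions for this example; the paper's grouping and scheduling strategy may differ.

```python
import numpy as np

def loss_and_grads(params, x, y):
    """Least-squares loss of a two-layer linear model and its gradients."""
    w1, w2 = params["w1"], params["w2"]
    hidden = x @ w1
    err = hidden @ w2 - y
    n = x.shape[0]
    loss = 0.5 * np.sum(err ** 2) / n
    grads = {"w2": hidden.T @ err / n, "w1": x.T @ (err @ w2.T) / n}
    return loss, grads

def hierarchical_step(params, groups, step, x, y, lr=0.01):
    """Update only the parameter group scheduled for this step (round-robin)."""
    active = groups[step % len(groups)]
    # For brevity all gradients are computed here; a memory-saving version
    # would keep gradients and optimizer state only for the active group.
    loss, grads = loss_and_grads(params, x, y)
    for name in active:
        params[name] = params[name] - lr * grads[name]
    return loss, active

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 4))
y = rng.normal(size=(64, 2))
params = {"w1": 0.1 * rng.normal(size=(4, 3)), "w2": 0.1 * rng.normal(size=(3, 2))}
groups = [["w1"], ["w2"]]                      # hypothetical layer grouping

before = {k: v.copy() for k, v in params.items()}
loss0, active0 = hierarchical_step(params, groups, step=0, x=x, y=y)
print(active0, loss0)
```

Over a full cycle of steps every group is updated once, so all parameters are still trained even though each step touches only one group.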

MCML Authors
Link to website

Yongkang Liu

Statistical NLP and Deep Learning

Link to website

Tong Liu

Database Systems & Data Mining

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[164]
P. Mondorf and B. Plank.
Liar, Liar, Logical Mire: A Benchmark for Suppositional Reasoning in Large Language Models.
EMNLP 2024 - Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI
Abstract

Knights and knaves problems represent a classic genre of logical puzzles where characters either tell the truth or lie. The objective is to logically deduce each character’s identity based on their statements. The challenge arises from the truth-telling or lying behavior, which influences the logical implications of each statement. Solving these puzzles requires not only direct deductions from individual statements, but also the ability to assess the truthfulness of statements by reasoning through various hypothetical scenarios. As such, knights and knaves puzzles serve as compelling examples of suppositional reasoning. In this paper, we introduce TruthQuest, a benchmark for suppositional reasoning based on the principles of knights and knaves puzzles. Our benchmark presents problems of varying complexity, considering both the number of characters and the types of logical statements involved. Evaluations on TruthQuest show that large language models like Llama 3 and Mixtral-8x7B exhibit significant difficulties solving these tasks. A detailed error analysis of the models’ output reveals that lower-performing models exhibit a diverse range of reasoning errors, frequently failing to grasp the concept of truth and lies. In comparison, more proficient models primarily struggle with accurately inferring the logical implications of potentially false statements.
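
The suppositional reasoning described above can be sketched as a brute-force search: suppose each possible assignment of knights and knaves in turn, and keep the assignments under which every knight's statement is true and every knave's statement is false. The two-character puzzle and the `solve` helper are made up for illustration and are not taken from TruthQuest.

```python
from itertools import product

# True = knight (always tells the truth), False = knave (always lies).
# Hypothetical puzzle, not from the benchmark:
#   A says: "B is a knave."
#   B says: "A and I are of the same kind."
statements = {
    "A": lambda world: not world["B"],
    "B": lambda world: world["A"] == world["B"],
}

def solve(statements):
    """Return every assignment that is consistent with all statements."""
    names = sorted(statements)
    solutions = []
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        # A statement must be true exactly when its speaker is a knight.
        if all(statements[name](world) == world[name] for name in names):
            solutions.append(world)
    return solutions

solutions = solve(statements)  # A is a knight, B is a knave
```

The search grows as 2^n in the number of characters, which is one simple way to see why puzzle complexity depends on how many characters are involved.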

MCML Authors
Link to website

Philipp Mondorf

Artificial Intelligence and Computational Linguistics

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[163]
B. Chen, X. Wang, S. Peng, R. Litschko, A. Korhonen and B. Plank.
'Seeing the Big through the Small': Can LLMs Approximate Human Judgment Distributions on NLI from a Few Explanations?.
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI
Abstract

Human label variation (HLV) is a valuable source of information that arises when multiple human annotators provide different labels for valid reasons. In Natural Language Inference (NLI), earlier approaches to capturing HLV involve either collecting annotations from many crowd workers to represent human judgment distribution (HJD) or using expert linguists to provide detailed explanations for their chosen labels. While the former method provides denser HJD information, obtaining it is resource-intensive. In contrast, the latter offers richer textual information but is challenging to scale up to many human judges. Besides, large language models (LLMs) are increasingly used as evaluators (‘LLM judges’) but with mixed results, and few works aim to study HJDs. This study proposes to exploit LLMs to approximate HJDs using a small number of expert labels and explanations. Our experiments show that a few explanations significantly improve LLMs’ ability to approximate HJDs with and without explicit labels, thereby providing a solution to scale up annotations for HJD. However, fine-tuning smaller soft-label aware models with the LLM-generated model judgment distributions (MJDs) presents partially inconsistent results: while similar in distance, their resulting fine-tuned models and visualized distributions differ substantially. We show the importance of complementing instance-level distance measures with a global-level shape metric and visualization to more effectively evaluate MJDs against human judgment distributions.
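
To make the point about instance-level distance concrete, the sketch below compares two hypothetical model judgment distributions against one human judgment distribution using total variation distance. The choice of this distance and all numbers are assumptions for illustration; the abstract does not say which measures the study uses.

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Label order: entailment, neutral, contradiction (made-up values).
hjd = [0.5, 0.4, 0.1]    # human judgment distribution
mjd_a = [0.7, 0.2, 0.1]  # model leans further towards entailment
mjd_b = [0.3, 0.6, 0.1]  # model flips the majority label to neutral

tvd_a = total_variation(hjd, mjd_a)
tvd_b = total_variation(hjd, mjd_b)
```

Both model distributions are equally far from the human one under this measure, yet they disagree about the majority label, which is the kind of difference a distance score alone does not reveal.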

MCML Authors
Link to website

Beiduo Chen

Artificial Intelligence and Computational Linguistics

Link to website

Xinpeng Wang

Artificial Intelligence and Computational Linguistics

Link to website

Siyao Peng

Dr.

Artificial Intelligence and Computational Linguistics

Link to website

Robert Litschko

Artificial Intelligence and Computational Linguistics

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[162]
A. Köksal, T. Schick, A. Korhonen and H. Schütze.
LongForm: Effective Instruction Tuning with Reverse Instructions.
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI GitHub
Abstract

Instruction tuning enables language models to more effectively generalize and better follow user intent. However, obtaining instruction data is costly and challenging. Prior work employs methods such as expensive human annotation, crowd-sourced datasets with alignment issues, and generating noisy examples via LLMs. We introduce the LongForm-C dataset, which is created by reverse instructions. We generate instructions via LLMs for human-written corpus examples using reverse instructions. First we select a diverse set of human-written documents from corpora such as C4 and Wikipedia; then we generate instructions for these documents via LLMs. This approach provides a cheaper and cleaner instruction-tuning dataset with natural output and one suitable for long text generation. Our models outperform 10x larger language models without instruction tuning on tasks such as story/recipe generation and long-form question answering. Moreover, LongForm models outperform prior instruction-tuned models such as FLAN-T5 and Alpaca by a large margin, and improve language understanding capabilities further.

MCML Authors
Link to website

Abdullatif Köksal

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[161]
A. Modarressi, A. Köksal and H. Schütze.
Consistent Document-Level Relation Extraction via Counterfactuals.
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI
Abstract

Many datasets have been developed to train and evaluate document-level relation extraction (RE) models. Most of these are constructed using real-world data. It has been shown that RE models trained on real-world data suffer from factual biases. To evaluate and address this issue, we present CovEReD, a counterfactual data generation approach for document-level relation extraction datasets using entity replacement. We first demonstrate that models trained on factual data exhibit inconsistent behavior: while they accurately extract triples from factual data, they fail to extract the same triples after counterfactual modification. This inconsistency suggests that models trained on factual data rely on spurious signals such as specific entities and external knowledge – rather than on the input context – to extract triples. We show that by generating document-level counterfactual data with CovEReD and training models on them, consistency is maintained with minimal impact on RE performance. We release our CovEReD pipeline as well as Re-DocRED-CF, a dataset of counterfactual RE documents, to assist in evaluating and addressing inconsistency in document-level RE.
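
A minimal sketch of counterfactual generation by entity replacement: the same substitution is applied to the document text and to its gold triples, so a model that reads the context should extract the modified triple. The helper `make_counterfactual`, the example document, and the plain string replacement are assumptions for illustration; the CovEReD pipeline itself is not reproduced here.

```python
def make_counterfactual(document, triples, old_entity, new_entity):
    """Replace an entity consistently in the text and in its gold triples."""
    new_document = document.replace(old_entity, new_entity)
    new_triples = [
        tuple(new_entity if part == old_entity else part for part in triple)
        for triple in triples
    ]
    return new_document, new_triples

doc = "Marie Curie was born in Warsaw. Marie Curie later moved to Paris."
triples = [("Marie Curie", "place_of_birth", "Warsaw")]

# Swap the birthplace for an entity that contradicts world knowledge.
cf_doc, cf_triples = make_counterfactual(doc, triples, "Warsaw", "Lisbon")
```

A consistency check would then compare the triples a model extracts from `doc` with those it extracts from `cf_doc`.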

MCML Authors
Link to website

Ali Modarressi

Statistical NLP and Deep Learning

Link to website

Abdullatif Köksal

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[160]
A. Sedova, R. Litschko, D. Frassinelli, B. Roth and B. Plank.
To Know or Not To Know? Analyzing Self-Consistency of Large Language Models under Ambiguity.
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI
Abstract

One of the major aspects contributing to the striking performance of large language models (LLMs) is the vast amount of factual knowledge accumulated during pre-training. Yet, many LLMs suffer from self-inconsistency, which raises doubts about their trustworthiness and reliability. This paper focuses on entity type ambiguity, analyzing the proficiency and consistency of state-of-the-art LLMs in applying factual knowledge when prompted with ambiguous entities. To do so, we propose an evaluation protocol that disentangles knowing from applying knowledge, and test state-of-the-art LLMs on 49 ambiguous entities. Our experiments reveal that LLMs struggle with choosing the correct entity reading, achieving an average accuracy of only 85%, and as low as 75% with underspecified prompts. The results also reveal systematic discrepancies in LLM behavior, showing that while the models may possess knowledge, they struggle to apply it consistently, exhibit biases toward preferred readings, and display self-inconsistencies. This highlights the need to address entity ambiguity in the future for more trustworthy LLMs.

MCML Authors
Link to website

Robert Litschko

Artificial Intelligence and Computational Linguistics

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[159]
M. Wang, L. Lange, H. Adel, J. Strötgen and H. Schütze.
Better Call SAUL: Fluent and Consistent Language Model Editing with Generation Regularization.
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI
Abstract

To ensure large language models contain up-to-date knowledge, they need to be updated regularly. However, model editing is challenging as it might also affect knowledge that is unrelated to the new data. State-of-the-art methods identify parameters associated with specific knowledge and then modify them via direct weight updates. However, these locate-and-edit methods suffer from heavy computational overhead and lack theoretical validation. In contrast, directly fine-tuning the model on requested edits affects the model’s behavior on unrelated knowledge, and significantly damages the model’s generation fluency and consistency. To address these challenges, we propose SAUL, a streamlined model editing method that uses sentence concatenation with augmented random facts for generation regularization. Evaluations on three model editing benchmarks show that SAUL is a practical and reliable solution for model editing, outperforming state-of-the-art methods while maintaining generation quality and reducing computational overhead.

MCML Authors
Link to website

Mingyang Wang

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[158]
O. Xhelili, Y. Liu and H. Schütze.
Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment.
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI GitHub
Abstract

Multilingual pre-trained models (mPLMs) have shown impressive performance on cross-lingual transfer tasks. However, the transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language, even though the two languages may be related or share parts of their vocabularies. Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method aiming to improve the cross-lingual alignment between languages using diverse scripts. We select two areal language groups, Mediterranean-Amharic-Farsi and South+East Asian Languages, wherein the languages are mutually influenced but use different scripts. We apply our method to these language groups and conduct extensive experiments on a spectrum of downstream tasks. The results show that after PPA, models consistently outperform the original model (up to 50% for some tasks) in English-centric transfer. In addition, when we use languages other than English as sources in transfer, our method obtains even larger improvements.

MCML Authors
Link to website

Yihong Liu

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[157]
A. Yüksel, A. Köksal, L. K. Senel, A. Korhonen and H. Schütze.
TurkishMMLU: Measuring Massive Multitask Language Understanding in Turkish.
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI GitHub
Abstract

Multiple choice question answering tasks evaluate the reasoning, comprehension, and mathematical abilities of Large Language Models (LLMs). While existing benchmarks employ automatic translation for multilingual evaluation, this approach is error-prone and potentially introduces culturally biased questions, especially in social sciences. We introduce the first multitask, multiple-choice Turkish QA benchmark, TurkishMMLU, to evaluate LLMs’ understanding of the Turkish language. TurkishMMLU includes over 10,000 questions, covering 9 different subjects from Turkish high-school education curricula. These questions are written by curriculum experts, suitable for the high-school curricula in Turkey, covering subjects ranging from natural sciences and math questions to more culturally representative topics such as Turkish Literature and the history of the Turkish Republic. We evaluate over 20 LLMs, including multilingual open-source (e.g., Gemma, Llama, MT5), closed-source (GPT 4o, Claude, Gemini), and Turkish-adapted (e.g., Trendyol) models. We provide an extensive evaluation, including zero-shot and few-shot evaluation of LLMs, chain-of-thought reasoning, and question difficulty analysis along with model performance. We provide an in-depth analysis of the Turkish capabilities and limitations of current LLMs to provide insights for future LLMs for the Turkish language.

MCML Authors
Link to website

Abdullatif Köksal

Statistical NLP and Deep Learning

Link to website

Lütfi Kerem Şenel

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[156]
R. Zhao, A. Köksal, Y. Liu, L. Weissweiler, A. Korhonen and H. Schütze.
SynthEval: Hybrid Behavioral Testing of NLP Models with Synthetic Evaluation.
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. DOI GitHub
Abstract

Traditional benchmarking in NLP typically involves using static held-out test sets. However, this approach often results in an overestimation of performance and lacks the ability to offer comprehensive, interpretable, and dynamic assessments of NLP models. Recently, works like DynaBench (Kiela et al., 2021) and CheckList (Ribeiro et al., 2020) have addressed these limitations through behavioral testing of NLP models with test types generated by a multistep human-annotated pipeline. Unfortunately, manually creating a variety of test types requires much human labor, often at prohibitive cost. In this work, we propose SYNTHEVAL, a hybrid behavioral testing framework that leverages large language models (LLMs) to generate a wide range of test types for a comprehensive evaluation of NLP models. SYNTHEVAL first generates sentences via LLMs using controlled generation, and then identifies challenging examples by comparing the predictions made by LLMs with task-specific NLP models. In the last stage, human experts investigate the challenging examples, manually design templates, and identify the types of failures the task-specific models consistently exhibit. We apply SYNTHEVAL to two classification tasks, sentiment analysis and toxic language detection, and show that our framework is effective in identifying weaknesses of strong models on these tasks.

MCML Authors
Link to website

Raoyuan Zhao

Artificial Intelligence and Computational Linguistics

Link to website

Abdullatif Köksal

Statistical NLP and Deep Learning

Link to website

Yihong Liu

Statistical NLP and Deep Learning

Leonie Weissweiler

Dr.

* Former member

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[155]
B. Ma, X. Wang, T. Hu, A.-C. Haensch, M. A. Hedderich, B. Plank and F. Kreuter.
The Potential and Challenges of Evaluating Attitudes, Opinions, and Values in Large Language Models.
EMNLP 2024 - Findings of the Conference on Empirical Methods in Natural Language Processing. Miami, FL, USA, Nov 12-16, 2024. To be published. Preprint available. DOI
Abstract

Recent advances in Large Language Models (LLMs) have sparked wide interest in validating and comprehending the human-like cognitive-behavioral traits LLMs may capture and convey. These cognitive-behavioral traits typically include Attitudes, Opinions, and Values (AOVs). However, measuring AOVs embedded within LLMs remains opaque, and different evaluation methods may yield different results. This has led to a lack of clarity on how different studies are related to each other and how they can be interpreted. This paper aims to bridge this gap by providing a comprehensive overview of recent works on the evaluation of AOVs in LLMs. Moreover, we survey related approaches in different stages of the evaluation pipeline in these works. By doing so, we address the potential and challenges with respect to understanding the model, human-AI alignment, and downstream application in social sciences. Finally, we provide practical insights into evaluation methods, model enhancement, and interdisciplinary collaboration, thereby contributing to the evolving landscape of evaluating AOVs in LLMs.

MCML Authors
Link to website

Bolei Ma

Social Data Science and AI Lab

Link to website

Xinpeng Wang

Artificial Intelligence and Computational Linguistics

Link to Profile Michael Hedderich

Michael Hedderich

Dr.

JRG Leader Human-Centered NLP

Artificial Intelligence and Computational Linguistics

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics

Link to Profile Frauke Kreuter

Frauke Kreuter

Prof. Dr.

Social Data Science and AI Lab


[154]
J. Wang, L. Zuo, S. Peng and B. Plank.
MultiClimate: Multimodal Stance Detection on Climate Change Videos.
NLP4PI @EMNLP 2024 - 3rd Workshop on NLP for Positive Impact at the Conference on Empirical Methods in Natural Language Processing (EMNLP 2024). Miami, FL, USA, Nov 12-16, 2024. DOI GitHub
Abstract

Climate change (CC) has attracted increasing attention in NLP in recent years. However, detecting the stance on CC in multimodal data is understudied and remains challenging due to a lack of reliable datasets. To improve the understanding of public opinions and communication strategies, this paper presents MultiClimate, the first open-source manually-annotated stance detection dataset with 100 CC-related YouTube videos and 4,209 frame-transcript pairs. We deploy state-of-the-art vision and language models, as well as multimodal models for MultiClimate stance detection. Results show that text-only BERT significantly outperforms image-only ResNet50 and ViT. Combining both modalities achieves state-of-the-art, 0.747/0.749 in accuracy/F1. Our 100M-sized fusion models also beat CLIP and BLIP, as well as the much larger 9B-sized multimodal IDEFICS and text-only Llama3 and Gemma2, indicating that multimodal stance detection remains challenging for large language models.

MCML Authors
Link to website

Siyao Peng

Dr.

Artificial Intelligence and Computational Linguistics

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[153]
L. He, E. Nie, H. Schmid, H. Schütze, N. Mesgarani and J. Brennan.
Large Language Models as Neurolinguistic Subjects: Identifying Internal Representations for Form and Meaning.
Preprint (Nov. 2024). arXiv
Abstract

This study investigates the linguistic understanding of Large Language Models (LLMs) regarding signifier (form) and signified (meaning) by distinguishing two LLM evaluation paradigms: psycholinguistic and neurolinguistic. Traditional psycholinguistic evaluations often reflect statistical biases that may misrepresent LLMs’ true linguistic capabilities. We introduce a neurolinguistic approach, utilizing a novel method that combines minimal pair and diagnostic probing to analyze activation patterns across model layers. This method allows for a detailed examination of how LLMs represent form and meaning, and whether these representations are consistent across languages. Our contributions are three-fold: (1) We compare neurolinguistic and psycholinguistic methods, revealing distinct patterns in LLM assessment; (2) We demonstrate that LLMs exhibit higher competence in form compared to meaning, with the latter largely correlated to the former; (3) We present new conceptual minimal pair datasets for Chinese (COMPS-ZH) and German (COMPS-DE), complementing existing English datasets.

MCML Authors
Link to website

Ercong Nie

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[152]
V. Hofmann, L. Weissweiler, D. Mortensen, H. Schütze and J. Pierrehumbert.
Derivational Morphology Reveals Analogical Generalization in Large Language Models.
Preprint (Nov. 2024). arXiv
Abstract

What mechanisms underlie linguistic generalization in large language models (LLMs)? This question has attracted considerable attention, with most studies analyzing the extent to which the language skills of LLMs resemble rules. As of yet, it is not known whether linguistic generalization in LLMs could equally well be explained as the result of analogical processes, which can be formalized as similarity operations on stored exemplars. A key shortcoming of prior research is its focus on linguistic phenomena with a high degree of regularity, for which rule-based and analogical approaches make the same predictions. Here, we instead examine derivational morphology, specifically English adjective nominalization, which displays notable variability. We introduce a new method for investigating linguistic generalization in LLMs: focusing on GPT-J, we fit cognitive models that instantiate rule-based and analogical learning to the LLM training data and compare their predictions on a set of nonce adjectives with those of the LLM, allowing us to draw direct conclusions regarding underlying mechanisms. As expected, rule-based and analogical models explain the predictions of GPT-J equally well for adjectives with regular nominalization patterns. However, for adjectives with variable nominalization patterns, the analogical model provides a much better match. Furthermore, GPT-J’s behavior is sensitive to the individual word frequencies, even for regular forms, a behavior that is consistent with an analogical account of regular forms but not a rule-based one. These findings refute the hypothesis that GPT-J’s linguistic generalization on adjective nominalization involves rules, suggesting similarity operations on stored exemplars as the underlying mechanism. Overall, our study suggests that analogical processes play a bigger role in the linguistic generalization of LLMs than previously thought.

MCML Authors
Leonie Weissweiler

Dr.

* Former member

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[151]
L. Madaan, D. Esiobu, P. Stenetorp, B. Plank and D. Hupkes.
Lost in Inference: Rediscovering the Role of Natural Language Inference for Large Language Models.
Preprint (Nov. 2024). arXiv
Abstract

In the recent past, a popular way of evaluating natural language understanding (NLU) was to consider a model’s ability to perform natural language inference (NLI) tasks. In this paper, we investigate if NLI tasks, which are rarely used for LLM evaluation, can still be informative for evaluating LLMs. Focusing on five different NLI benchmarks across six models of different scales, we investigate if they are able to discriminate models of different size and quality and how their accuracies develop during training. Furthermore, we investigate the extent to which the softmax distributions of models align with human distributions in cases where statements are ambiguous or vague. Overall, our results paint a positive picture for the NLI tasks: we find that they are able to discriminate well between models at various stages of training, yet are not (all) saturated. Furthermore, we find that while the similarity of model distributions with human label distributions increases with scale, it is still much higher than the similarity between two populations of humans, making it a potentially interesting statistic to consider.

MCML Authors
Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[150]
M. Thaler, A. Köksal, A. Leidinger, A. Korhonen and H. Schütze.
How far can bias go? -- Tracing bias from pretraining data to alignment.
Preprint (Nov. 2024). arXiv
Abstract

As LLMs are increasingly integrated into user-facing applications, addressing biases that perpetuate societal inequalities is crucial. While much work has gone into measuring or mitigating biases in these models, fewer studies have investigated their origins. Therefore, this study examines the correlation between gender-occupation bias in pre-training data and its manifestation in LLMs, focusing on the Dolma dataset and the OLMo model. Using zero-shot prompting and token co-occurrence analyses, we explore how biases in training data influence model outputs. Our findings reveal that biases present in pre-training data are amplified in model outputs. The study also examines the effects of prompt types, hyperparameters, and instruction-tuning on bias expression, finding that instruction-tuning partially alleviates representational bias while still maintaining overall stereotypical gender associations, whereas hyperparameters and prompting variation have a lesser effect on bias expression. Our research traces bias throughout the LLM development pipeline and underscores the importance of mitigating bias at the pretraining stage.
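
A token co-occurrence analysis of the kind mentioned above can be sketched as counting how often gendered words appear within a fixed window around occupation words. The word lists, the window size, the whitespace tokenization, and the tiny corpus are all assumptions for illustration, not the procedure applied to Dolma.

```python
from collections import Counter

# Hypothetical word lists; a real study would use much larger lexicons.
GENDERED = {"he": "male", "him": "male", "she": "female", "her": "female"}
OCCUPATIONS = {"nurse", "engineer"}

def cooccurrence(sentences, window=5):
    """Count (occupation, gender) pairs that occur within `window` tokens."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token not in OCCUPATIONS:
                continue
            start, end = max(0, i - window), i + window + 1
            for other in tokens[start:end]:
                if other in GENDERED:
                    counts[(token, GENDERED[other])] += 1
    return counts

corpus = [
    "she works as a nurse",
    "the engineer said he was late",
    "the nurse told him to wait",
]
counts = cooccurrence(corpus)
```

Comparing such corpus counts with how often a model associates each occupation with each gender is one way to ask whether a bias in the data is mirrored or amplified in the outputs.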

MCML Authors
Link to website

Abdullatif Köksal

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[149]
Y. Liu, F. Shi, D. Wang, Y. Zhang and H. Schütze.
ChatZero: Zero-Shot Cross-Lingual Dialogue Generation via Pseudo-Target Language.
ECAI 2024 - 27th European Conference on Artificial Intelligence. Santiago de Compostela, Spain, Oct 19-24, 2024. DOI
Abstract

Although large language models (LLMs) show amazing capabilities, many of the exciting applications discovered for LLMs fall short in low-resource languages. Besides, most existing methods depend on large-scale dialogue corpora, and thus building systems for dialogue generation in a zero-shot scenario remains a considerable challenge. To address this challenge, we propose a novel end-to-end zero-shot dialogue generation model, ChatZero, based on a cross-lingual code-switching method. First, we construct code-switching language and pseudo-target language with placeholders. Then, for cross-lingual semantic transfer, we employ unsupervised contrastive learning to minimize the semantic gap between the source language, code-switching language, and pseudo-target language, which are mutually positive examples in the high-dimensional semantic space. Experiments on the multilingual DailyDialog and DSTC7-AVSD datasets demonstrate that ChatZero can achieve more than 90% of the original performance under the zero-shot case compared to supervised learning, and achieve state-of-the-art performance compared with other baselines.

MCML Authors
Link to website

Yongkang Liu

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[148]
P. Mondorf and B. Plank.
Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models--A Survey.
COLM 2024 - Conference on Language Modeling. Philadelphia, PA, USA, Oct 07-09, 2024. PDF
Abstract

Large language models (LLMs) have recently shown impressive performance on tasks involving reasoning, leading to a lively debate on whether these models possess reasoning capabilities similar to humans. However, despite these successes, the depth of LLMs’ reasoning abilities remains uncertain. This uncertainty partly stems from the predominant focus on task performance, measured through shallow accuracy metrics, rather than a thorough investigation of the models’ reasoning behavior. This paper seeks to address this gap by providing a comprehensive review of studies that go beyond task accuracy, offering deeper insights into the models’ reasoning processes. Furthermore, we survey prevalent methodologies to evaluate the reasoning behavior of LLMs, emphasizing current trends and efforts towards more nuanced reasoning analyses. Our review suggests that LLMs tend to rely on surface-level patterns and correlations in their training data, rather than on sophisticated reasoning abilities. Additionally, we identify the need for further research that delineates the key differences between human and LLM-based reasoning. Through this survey, we aim to shed light on the complex reasoning processes within LLMs.

MCML Authors
Link to website

Philipp Mondorf

Artificial Intelligence and Computational Linguistics

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[147]
X. Wang, C. Hu, B. Ma, P. Röttger and B. Plank.
Look at the Text: Instruction-Tuned Language Models are More Robust Multiple Choice Selectors than You Think.
COLM 2024 - Conference on Language Modeling. Philadelphia, PA, USA, Oct 07-09, 2024. PDF
Abstract

Multiple choice questions (MCQs) are commonly used to evaluate the capabilities of large language models (LLMs). One common way to evaluate the model response is to rank the candidate answers based on the log probability of the first token prediction. An alternative way is to examine the text output. Prior work has shown that first token probabilities lack robustness to changes in MCQ phrasing, and that first token probabilities do not match text answers for instruction-tuned models. Therefore, in this paper, we investigate the robustness of text answers. We show that the text answers are more robust to question perturbations than the first token probabilities, when the first token answers mismatch the text answers. The difference in robustness increases as the mismatch rate becomes greater. As the mismatch reaches over 50%, the text answer is more robust to option order changes than the debiased first token probabilities using state-of-the-art debiasing methods such as PriDe. Our findings provide further evidence for the benefits of text answer evaluation over first token probability evaluation.
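
The two evaluation styles contrasted above can be sketched side by side: one picks the option with the highest first-token log probability, the other reads the option letter out of the generated text. The log probabilities, the responses, and the deliberately naive regular-expression parser are made-up assumptions for illustration.

```python
import re

def first_token_choice(logprobs):
    """Pick the option whose label has the highest first-token log probability."""
    return max(logprobs, key=logprobs.get)

def text_choice(response, options="ABCD"):
    """Extract the first standalone option letter from the generated text.

    Naive on purpose: a response starting with the article "A" would be misread.
    """
    match = re.search(rf"\b([{options}])\b", response)
    return match.group(1) if match else None

# (first-token log probabilities, generated text) per question; values are invented.
examples = [
    ({"A": -0.4, "B": -1.6, "C": -2.3, "D": -3.0}, "The answer is A."),
    ({"A": -0.9, "B": -1.1, "C": -2.0, "D": -2.5}, "Sure! The correct option is B."),
]

mismatches = [first_token_choice(lp) != text_choice(text) for lp, text in examples]
mismatch_rate = sum(mismatches) / len(examples)
```

In the second example the first-token ranking favors A while the text answers B, which is the kind of mismatch whose rate the abstract relates to robustness.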

MCML Authors
Link to website

Xinpeng Wang

Artificial Intelligence and Computational Linguistics

Link to website

Bolei Ma

Social Data Science and AI Lab

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[146]
V. Blaschke, B. Kovačić, S. Peng and B. Plank.
MaiBaam Annotation Guidelines.
Preprint (Oct. 2024). arXiv
Abstract

This document provides the annotation guidelines for MaiBaam, a Bavarian corpus manually annotated with part-of-speech (POS) tags, syntactic dependencies, and German lemmas. MaiBaam belongs to the Universal Dependencies (UD) project, and our annotations elaborate on the general and German UD version 2 guidelines. In this document, we detail how to preprocess and tokenize Bavarian data, provide an overview of the POS tags and dependencies we use, explain annotation decisions that would also apply to closely related languages like German, and lastly we introduce and motivate decisions that are specific to Bavarian grammar.

MCML Authors
Link to website

Verena Blaschke

Artificial Intelligence and Computational Linguistics

Link to website

Siyao Peng

Dr.

Artificial Intelligence and Computational Linguistics

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[145]
Q. Chen, X. Wang, P. Mondorf, M. A. Hedderich and B. Plank.
Understanding When Tree of Thoughts Succeeds: Larger Models Excel in Generation, Not Discrimination.
Preprint (Oct. 2024). arXiv
Abstract

Tree of Thoughts (ToT) is a reasoning strategy for Large Language Models (LLMs) that employs a generator to suggest reasoning steps and a discriminator to decide which steps to implement. ToT demonstrates strong performance on reasoning tasks, often surpassing simple methods such as Input-Output (IO) prompting and Chain-of-Thought (CoT) reasoning. However, ToT does not consistently outperform such simpler methods across all models, leaving large knowledge gaps on the conditions under which ToT is most beneficial. In this paper, we analyze the roles of the generator and discriminator separately to better understand the conditions when ToT is beneficial. We find that the generator plays a more critical role than the discriminator in driving the success of ToT. Scaling the generator leads to notable improvements in ToT performance, even when using a smaller model as the discriminator, whereas scaling the discriminator with a fixed generator yields only marginal gains. Our results show that models across different scales exhibit comparable discrimination capabilities, yet differ significantly in their generative performance for ToT.

MCML Authors
Link to website

Xinpeng Wang

Artificial Intelligence and Computational Linguistics

Link to website

Philipp Mondorf

Artificial Intelligence and Computational Linguistics

Link to Profile Michael Hedderich

Michael Hedderich

Dr.

JRG Leader Human-Centered NLP

Artificial Intelligence and Computational Linguistics

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[144]
L. Edman, L. Bylinina, F. Ghorbanpour and A. Fraser.
Are BabyLMs Second Language Learners?
Preprint (Oct. 2024). arXiv
Abstract

This paper describes a linguistically-motivated approach to the 2024 edition of the BabyLM Challenge (Warstadt et al. 2023). Rather than pursuing a first language learning (L1) paradigm, we approach the challenge from a second language (L2) learning perspective. In L2 learning, there is a stronger focus on learning explicit linguistic information, such as grammatical notions, definitions of words or different ways of expressing a meaning. This makes L2 learning potentially more efficient and concise. We approximate this using data from Wiktionary, grammar examples either generated by an LLM or sourced from grammar books, and paraphrase data. We find that explicit information about word meaning (in our case, Wiktionary) does not boost model performance, while grammatical information can give a small improvement. The most impactful data ingredient is sentence paraphrases, with our two best models being trained on 1) a mix of paraphrase data and data from the BabyLM pretraining dataset, and 2) exclusively paraphrase data.

MCML Authors
Link to website at LMU

Lukas Edman

Dr.

Data Analytics & Statistics

Link to website

Faeze Ghorbanpour

Data Analytics & Statistics

Link to Profile Alexander Fraser

Alexander Fraser

Prof. Dr.

Data Analytics & Statistics


[143]
F. Eichin, C. Schuster, G. Groh and M. A. Hedderich.
Semantic Component Analysis: Discovering Patterns in Short Texts Beyond Topics.
Preprint (Oct. 2024). arXiv
Abstract

Topic modeling is a key method in text analysis, but existing approaches either assume one topic per document or fail to scale efficiently to large, noisy datasets of short texts. We introduce Semantic Component Analysis (SCA), a novel topic modeling technique that overcomes these limitations by discovering multiple, nuanced semantic components beyond a single topic in short texts, which we accomplish by introducing a decomposition step to the clustering-based topic modeling framework. Evaluated on multiple Twitter datasets, SCA matches the state-of-the-art method BERTopic in coherence and diversity while uncovering at least double the semantic components and maintaining a noise rate close to zero. It also stays scalable and effective across languages, including an underrepresented one.

MCML Authors
Link to website

Florian Eichin

Artificial Intelligence and Computational Linguistics

Link to Profile Michael Hedderich

Michael Hedderich

Dr.

JRG Leader Human-Centered NLP

Artificial Intelligence and Computational Linguistics


[142]
A. H. Kargaran, A. Modarressi, N. Nikeghbal, J. Diesner, F. Yvon and H. Schütze.
MEXA: Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment.
Preprint (Oct. 2024). arXiv
Abstract

English-centric large language models (LLMs) often show strong multilingual capabilities. However, the multilingual performance of these models remains unclear and is not thoroughly evaluated for many languages. Most benchmarks for multilinguality focus on classic NLP tasks, or cover a minimal number of languages. We introduce MEXA, a method for assessing the multilingual capabilities of pre-trained English-centric LLMs using parallel sentences, which are available for more languages than existing downstream tasks. MEXA leverages the fact that English-centric LLMs use English as a kind of pivot language in their intermediate layers. It computes the alignment between English and non-English languages using parallel sentences to evaluate the transfer of language understanding from English to other languages. This alignment can be used to estimate model performance in other languages. We conduct studies using various parallel datasets (FLORES-200 and Bible), models (Llama family, Gemma family, Mistral, and OLMo), and established downstream tasks (Belebele, m-MMLU, and m-ARC). We explore different methods to compute embeddings in decoder-only models. Our results show that MEXA, in its default settings, achieves a statistically significant average Pearson correlation of 0.90 with three established downstream tasks across nine models and two parallel datasets. This suggests that MEXA is a reliable method for estimating the multilingual capabilities of English-centric LLMs, providing a clearer understanding of their multilingual potential and the inner workings of LLMs.
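A minimal sketch of how an alignment score over parallel sentences could be computed. The abstract does not spell out the scoring rule, so the rule used here is an assumption: a parallel pair counts as aligned when its cosine similarity is higher than that of every non-parallel pairing in the same row and column. The function names and the toy embeddings are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def alignment_score(en_embs, xx_embs):
    """Fraction of parallel pairs whose similarity beats all non-parallel pairings.

    en_embs[i] and xx_embs[i] are assumed to embed the same sentence in English
    and in another language (assumed scoring rule, see lead-in).
    """
    n = len(en_embs)
    sim = [[cosine(e, x) for x in xx_embs] for e in en_embs]
    hits = 0
    for i in range(n):
        diag = sim[i][i]
        row_ok = all(diag > sim[i][j] for j in range(n) if j != i)
        col_ok = all(diag > sim[j][i] for j in range(n) if j != i)
        hits += int(row_ok and col_ok)
    return hits / n

# Toy embeddings: the third non-English sentence sits closer to the first English one.
en = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
xx = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.7, 0.1, 0.2]]
score = alignment_score(en, xx)
```

In this toy case two of the three pairs are aligned. Per the abstract, a per-language score of this kind is then correlated with downstream task performance.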

MCML Authors
Link to website

Amir Hossein Kargaran

Statistical NLP and Deep Learning

Link to website

Ali Modarressi

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[141]
J. Lan, D. Frassinelli and B. Plank.
Mind the Uncertainty in Human Disagreement: Evaluating Discrepancies between Model Predictions and Human Responses in VQA.
Preprint (Oct. 2024). arXiv
Abstract

Large vision-language models frequently struggle to accurately predict responses provided by multiple human annotators, particularly when those responses exhibit human uncertainty. In this study, we focus on the Visual Question Answering (VQA) task, and we comprehensively evaluate how well the state-of-the-art vision-language models correlate with the distribution of human responses. To do so, we categorize our samples based on their levels (low, medium, high) of human uncertainty in disagreement (HUD) and employ not only accuracy but also three new human-correlated metrics in VQA, to investigate the impact of HUD. To better align models with humans, we also verify the effect of common calibration and human calibration. Our results show that even BEiT3, currently the best model for this task, struggles to capture the multi-label distribution inherent in diverse human responses. Additionally, we observe that the commonly used accuracy-oriented calibration technique adversely affects BEiT3’s ability to capture HUD, further widening the gap between model predictions and human distributions. In contrast, we show the benefits of calibrating models towards human distributions for VQA, better aligning model confidence with human uncertainty. Our findings highlight that for VQA, the consistent alignment between human responses and model predictions is understudied and should become the next crucial target of future studies.

MCML Authors
Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[140]
P. Mondorf, S. Wold and B. Plank.
Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models.
Preprint (Oct. 2024). arXiv
Abstract

A fundamental question in interpretability research is to what extent neural networks, particularly language models, implement reusable functions via subnetworks that can be composed to perform more complex tasks. Recent developments in mechanistic interpretability have made progress in identifying subnetworks, often referred to as circuits, which represent the minimal computational subgraph responsible for a model’s behavior on specific tasks. However, most studies focus on identifying circuits for individual tasks without investigating how functionally similar circuits relate to each other. To address this gap, we examine the modularity of neural networks by analyzing circuits for highly compositional subtasks within a transformer-based language model. Specifically, given a probabilistic context-free grammar, we identify and compare circuits responsible for ten modular string-edit operations. Our results indicate that functionally similar circuits exhibit both notable node overlap and cross-task faithfulness. Moreover, we demonstrate that the circuits identified can be reused and combined through subnetwork set operations to represent more complex functional capabilities of the model.

MCML Authors
Link to website

Philipp Mondorf

Artificial Intelligence and Computational Linguistics

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[139]
R. Shim and B. Plank.
Dialetto, ma Quanto Dialetto? Transcribing and Evaluating Dialects on a Continuum.
Preprint (Oct. 2024). arXiv
Abstract

There is increasing interest in looking at dialects in NLP. However, most work to date still treats dialects as discrete categories. For instance, evaluative work in variation-oriented NLP for English often works with Indian English or African-American Vernacular English as homogeneous categories (Faisal et al., 2024; Ziems et al., 2023), yet even within one variety there is substantial variation. We examine within-dialect variation and show that performance critically varies within categories. We measure speech-to-text performance on Italian dialects, and empirically observe a geographical performance disparity. This disparity correlates substantially (-0.5) with linguistic similarity to the highest performing dialect variety. We cross-examine our results against dialectometry methods, and interpret the performance disparity to be due to a bias towards dialects that are more similar to the standard variety in the speech-to-text model examined. We additionally leverage geostatistical methods to predict zero-shot performance at unseen sites, and find the incorporation of geographical information to substantially improve prediction performance, indicating there to be geographical structure in the performance distribution.

MCML Authors
Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[138]
X. Wang, C. Hu, P. Röttger and B. Plank.
Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation.
Preprint (Oct. 2024). arXiv
Abstract

Training a language model to be both helpful and harmless requires careful calibration of refusal behaviours: Models should refuse to follow malicious instructions or give harmful advice (e.g. ‘how do I kill someone?’), but they should not refuse safe requests, even if they superficially resemble unsafe ones (e.g. ‘how do I kill a Python process?’). Avoiding such false refusal, as prior work has shown, is challenging even for highly capable language models. In this paper, we propose a simple and surgical method for mitigating false refusal in language models via single vector ablation. For a given model, we extract a false refusal vector and show that ablating this vector reduces false refusal rate without negatively impacting model safety and general model capabilities. We also show that our approach can be used for fine-grained calibration of model safety. Our approach is training-free and model-agnostic, making it useful for mitigating the problem of false refusal in current and future language models.
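The ablation step can be illustrated with a small sketch. It assumes that "ablating" a vector means removing the component of a hidden state along that direction. How the false refusal vector is extracted is not shown, and the `strength` parameter is a hypothetical knob standing in for the fine-grained calibration the abstract mentions.

```python
import math

def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in vector]

def ablate(hidden, direction, strength=1.0):
    """Remove (a fraction of) the component of `hidden` along `direction`.

    strength=1.0 projects the direction out completely; smaller values
    remove only part of it (hypothetical calibration parameter).
    """
    unit = normalize(direction)
    projection = sum(h * u for h, u in zip(hidden, unit))
    return [h - strength * projection * u for h, u in zip(hidden, unit)]

hidden_state = [3.0, 4.0]        # toy activation
refusal_direction = [0.0, 2.0]   # toy stand-in for an extracted vector
fully_ablated = ablate(hidden_state, refusal_direction)
half_ablated = ablate(hidden_state, refusal_direction, strength=0.5)
```

After full ablation the activation has no component left along the chosen direction, while all orthogonal components are unchanged.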

MCML Authors
Link to website

Xinpeng Wang

Artificial Intelligence and Computational Linguistics

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[137]
Y. Liu, E. Nie, S. Feng, Z. Hua, Z. Ding, D. Wang, Y. Zhang and H. Schütze.
A Unified Data Augmentation Framework for Low-Resource Multi-Domain Dialogue Generation.
ECML-PKDD 2024 - European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Vilnius, Lithuania, Sep 09-13, 2024. DOI GitHub
Abstract

Current state-of-the-art dialogue systems heavily rely on extensive training datasets. However, challenges arise in domains where domain-specific training datasets are insufficient or entirely absent. To tackle this challenge, we propose a novel data Augmentation framework for Multi-Domain Dialogue Generation, referred to as AMDG. The AMDG framework consists of a data augmentation process and a two-stage training approach: domain-agnostic training and domain adaptation training. We posit that domain corpora are a blend of domain-agnostic and domain-specific features, with certain representation patterns shared among diverse domains. Domain-agnostic training aims to enable models to learn these common expressive patterns. To construct domain-agnostic dialogue corpora, we employ a de-domaining data processing technique used to remove domain-specific features. By mitigating the effects of domain-specific features, the model trained on the de-domained corpora can effectively learn common expression patterns in different domains. Subsequently, we adapt the learned domain-agnostic features to the target domain through domain adaptation training. We conduct experiments on Chinese dialogue datasets from five different domains and show that AMDG achieves superior performance compared to both direct training on the target domain corpus and collective training on all five domain corpora. Our work underscores AMDG as a viable alternative solution for low-resource multi-domain dialogue generation.

MCML Authors
Link to website

Yongkang Liu

Statistical NLP and Deep Learning

Link to website

Ercong Nie

Statistical NLP and Deep Learning

Link to website

Zifeng Ding

Database Systems & Data Mining

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[136]
A. Köksal, M. Thaler, A. Imani, A. Üstün, A. Korhonen and H. Schütze.
MURI: High-Quality Instruction Tuning Datasets for Low-Resource Languages via Reverse Instructions.
Preprint (Sep. 2024). arXiv GitHub
Abstract

Instruction tuning enhances large language models (LLMs) by aligning them with human preferences across diverse tasks. Traditional approaches to create instruction tuning datasets face serious challenges for low-resource languages due to their dependence on data annotation. This work introduces a novel method, Multilingual Reverse Instructions (MURI), which generates high-quality instruction tuning datasets for low-resource languages without requiring human annotators or pre-existing multilingual models. Utilizing reverse instructions and a translation pipeline, MURI produces instruction-output pairs from existing human-written texts in low-resource languages. This method ensures cultural relevance and diversity by sourcing texts from different native domains and applying filters to eliminate inappropriate content. Our dataset, MURI-IT, includes more than 2 million instruction-output pairs across 200 languages. Evaluation by native speakers and fine-tuning experiments with mT5 models demonstrate the approach’s effectiveness for both NLU and open-ended generation.

MCML Authors
Link to website

Abdullatif Köksal

Statistical NLP and Deep Learning

Link to website

Ayyoob Imani

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[135]
Y. Liu, H. Ye, C. Ma, M. Wang and H. Schütze.
LangSAMP: Language-Script Aware Multilingual Pretraining.
Preprint (Sep. 2024). arXiv GitHub
Abstract

Recent multilingual pretrained language models (mPLMs) often avoid using language embeddings – learnable vectors assigned to different languages. These embeddings are discarded for two main reasons: (1) mPLMs are expected to have a single, unified parameter set across all languages, and (2) they need to function seamlessly as universal text encoders without requiring language IDs as input. However, this removal increases the burden on token embeddings to encode all language-specific information, which may hinder the model’s ability to produce more language-neutral representations. To address this challenge, we propose Language-Script Aware Multilingual Pretraining (LangSAMP), a method that incorporates both language and script embeddings to enhance representation learning while maintaining a simple architecture. Specifically, we integrate these embeddings into the output of the transformer blocks before passing the final representations to the language modeling head for prediction. We apply LangSAMP to the continual pretraining of XLM-R on a highly multilingual corpus covering more than 500 languages. The resulting model consistently outperforms the baseline. Extensive analysis further shows that language/script embeddings encode language/script-specific information, which improves the selection of source languages for crosslingual transfer.
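As a rough illustration of the integration step, the sketch below adds one language embedding and one script embedding to every token representation before it would reach the language modeling head. Plain element-wise addition, the function name, and the toy vectors are assumptions; the abstract only says that the embeddings are integrated into the output of the transformer blocks.

```python
def add_language_script_embeddings(hidden_states, language_vec, script_vec):
    """Add the same language and script vectors to each token representation.

    hidden_states: list of per-token vectors from the final transformer block.
    Element-wise addition is an assumed form of "integration".
    """
    return [
        [h + l + s for h, l, s in zip(token, language_vec, script_vec)]
        for token in hidden_states
    ]

tokens = [[1.0, 2.0], [0.0, -1.0]]   # toy transformer outputs for two tokens
language_embedding = [0.5, 0.5]      # hypothetical learned vector for one language
script_embedding = [-0.5, 1.0]       # hypothetical learned vector for one script
head_input = add_language_script_embeddings(tokens, language_embedding, script_embedding)
```

Because the embeddings enter only after the transformer blocks, the encoder itself can still be used without language IDs, which matches the motivation given in the abstract.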

MCML Authors
Link to website

Yihong Liu

Statistical NLP and Deep Learning

Link to website

Haotian Ye

Statistical NLP and Deep Learning

Link to website

Chunlan Ma

Statistical NLP and Deep Learning

Link to website

Mingyang Wang

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[134]
I. Ziegler, A. Köksal, D. Elliott and H. Schütze.
CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation.
Preprint (Sep. 2024). arXiv
Abstract

Building high-quality datasets for specialized tasks is a time-consuming and resource-intensive process that often requires specialized domain knowledge. We propose Corpus Retrieval and Augmentation for Fine-Tuning (CRAFT), a method for generating synthetic datasets, given a small number of user-written few-shots that demonstrate the task to be performed. Given the few-shot examples, we use large-scale public web-crawled corpora and similarity-based document retrieval to find other relevant human-written documents. Lastly, instruction-tuned large language models (LLMs) augment the retrieved documents into custom-formatted task samples, which then can be used for fine-tuning. We demonstrate that CRAFT can efficiently generate large-scale task-specific training datasets for four diverse tasks: biology question-answering (QA), medicine QA and commonsense QA as well as summarization. Our experiments show that CRAFT-based models outperform or achieve comparable performance to general LLMs for QA tasks, while CRAFT-based summarization models outperform models trained on human-curated data by 46 preference points.

MCML Authors
Link to website

Abdullatif Köksal

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[133]
V. Blaschke, C. Purschke, H. Schütze and B. Plank.
What Do Dialect Speakers Want? A Survey of Attitudes Towards Language Technology for German Dialects.
ACL 2024 - 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI
Abstract

Natural language processing (NLP) has largely focused on modelling standardized languages. More recently, attention has increasingly shifted to local, non-standardized languages and dialects. However, the relevant speaker populations’ needs and wishes with respect to NLP tools are largely unknown. In this paper, we focus on dialects and regional languages related to German – a group of varieties that is heterogeneous in terms of prestige and standardization. We survey speakers of these varieties (N=327) and present their opinions on hypothetical language technologies for their dialects. Although attitudes vary among subgroups of our respondents, we find that respondents are especially in favour of potential NLP tools that work with dialectal input (especially audio input) such as virtual assistants, and less so for applications that produce dialectal output such as machine translation or spellcheckers.

MCML Authors
Link to website

Verena Blaschke

Artificial Intelligence and Computational Linguistics

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[132]
A. H. Kargaran, F. Yvon and H. Schütze.
MaskLID: Code-Switching Language Identification through Iterative Masking.
ACL 2024 - 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI GitHub
Abstract

We present MaskLID, a simple, yet effective, code-switching (CS) language identification (LID) method. MaskLID does not require any training and is designed to complement current high-performance sentence-level LIDs. Sentence-level LIDs are classifiers trained on monolingual texts to provide single labels, typically using a softmax layer to turn scores into probabilities. However, in cases where a sentence is composed in both L1 and L2 languages, the LID classifier often only returns the dominant label L1. To address this limitation, MaskLID employs a strategy to mask text features associated with L1, allowing the LID to classify the text as L2 in the next round. This method uses the LID itself to identify the features that require masking and does not rely on any external resource. In this work, we explore the use of MaskLID for two open-source LIDs (GlotLID and OpenLID), which are both based on the FastText architecture.
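The iterative masking idea can be sketched with a toy classifier. The word-vote lexicon below is a hypothetical stand-in for a real sentence-level LID such as a FastText model, and the stopping rule (`max_rounds`, `min_tokens`) is an assumption rather than the method's actual configuration.

```python
from collections import Counter

# Hypothetical toy lexicon: each known word votes for one language.
TOY_LEXICON = {
    "the": "en", "weather": "en", "is": "en", "nice": "en",
    "pero": "es", "hace": "es", "frio": "es",
}

def toy_lid(tokens):
    """Return the language with the most word votes, or None if no word is known."""
    votes = Counter(TOY_LEXICON[t] for t in tokens if t in TOY_LEXICON)
    if not votes:
        return None
    return votes.most_common(1)[0][0]

def mask_lid(tokens, max_rounds=2, min_tokens=2):
    """Predict a label, mask the features that support it, then classify again."""
    labels = []
    remaining = list(tokens)
    for _ in range(max_rounds):
        if len(remaining) < min_tokens:
            break
        label = toy_lid(remaining)
        if label is None:
            break
        labels.append(label)
        # Mask every token the classifier itself associates with the predicted label.
        remaining = [t for t in remaining if TOY_LEXICON.get(t) != label]
    return labels

mixed = mask_lid("the weather is nice pero hace frio".split())
monolingual = mask_lid("the weather is nice".split())
```

On the mixed sentence the first round returns the dominant language and the second round, run on the unmasked remainder, returns the other one. The monolingual sentence yields a single label because nothing is left to classify after masking.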

MCML Authors
Link to website

Amir Hossein Kargaran

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[131]
Y. Liu, C. Ma, H. Ye and H. Schütze.
TransliCo: A Contrastive Learning Framework to Address the Script Barrier in Multilingual Pretrained Language Models.
ACL 2024 - 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI
Abstract

The world’s more than 7000 languages are written in at least 293 scripts. Due to various reasons, many closely related languages use different scripts, which poses a difficulty for multilingual pretrained language models (mPLMs) in learning crosslingual knowledge through lexical overlap. As a consequence, mPLMs are faced with a script barrier: representations from different scripts are located in different subspaces, which can result in crosslingual transfer involving languages of different scripts performing suboptimally. To address this problem, we propose TransliCo, a framework that optimizes the Transliteration Contrastive Modeling (TCM) objective to fine-tune an mPLM by contrasting sentences in its training data and their transliterations in a unified script (in our case Latin), which enhances uniformity in the representation space for different scripts. Using Glot500-m, an mPLM pretrained on over 500 languages, as our source model, we fine-tune it on a small portion (5%) of its training data, and refer to the resulting model as Furina. We show that Furina not only better aligns representations from distinct scripts but also outperforms the original Glot500-m on various zero-shot crosslingual transfer tasks. Additionally, we achieve consistent improvement in a case study on the Indic group where the languages exhibit areal features but use different scripts. We make our code and models publicly available.
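A minimal sketch of a contrastive objective over sentences and their transliterations. The abstract does not give the exact form of the TCM objective, so this uses a standard InfoNCE-style loss as an assumption: each sentence embedding should be closer to the embedding of its own transliteration than to the other transliterations in the batch. The temperature and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def contrastive_loss(originals, transliterations, temperature=0.1):
    """InfoNCE-style loss: the positive for originals[i] is transliterations[i]."""
    losses = []
    for i, anchor in enumerate(originals):
        logits = [cosine(anchor, t) / temperature for t in transliterations]
        peak = max(logits)
        # Numerically stable log-sum-exp over all candidates in the batch.
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

original_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned_translit = [[0.9, 0.1], [0.1, 0.9]]      # each transliteration near its source
misaligned_translit = [[0.1, 0.9], [0.9, 0.1]]   # transliterations swapped

loss_aligned = contrastive_loss(original_embs, aligned_translit)
loss_misaligned = contrastive_loss(original_embs, misaligned_translit)
```

The loss is small when a sentence and its transliteration already share a region of the representation space and large when they do not, so minimizing it pulls representations from different scripts together.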

MCML Authors
Link to website

Yihong Liu

Statistical NLP and Deep Learning

Link to website

Chunlan Ma

Statistical NLP and Deep Learning

Link to website

Haotian Ye

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[130]
P. Mondorf and B. Plank.
Comparing Inferential Strategies of Humans and Large Language Models in Deductive Reasoning.
ACL 2024 - 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI
Abstract

Deductive reasoning plays a pivotal role in the formulation of sound and cohesive arguments. It allows individuals to draw conclusions that logically follow, given the truth value of the information provided. Recent progress in the domain of large language models (LLMs) has showcased their capability in executing deductive reasoning tasks. Nonetheless, a significant portion of research primarily assesses the accuracy of LLMs in solving such tasks, often overlooking a deeper analysis of their reasoning behavior. In this study, we draw upon principles from cognitive psychology to examine inferential strategies employed by LLMs, through a detailed evaluation of their responses to propositional logic problems. Our findings indicate that LLMs display reasoning patterns akin to those observed in humans, including strategies like supposition following or chain construction. Moreover, our research demonstrates that the architecture and scale of the model significantly affect its preferred method of reasoning, with more advanced models tending to adopt strategies more frequently than less sophisticated ones. Importantly, we assert that a model’s accuracy, that is the correctness of its final conclusion, does not necessarily reflect the validity of its reasoning process. This distinction underscores the necessity for more nuanced evaluation procedures in the field.

MCML Authors
Link to website

Philipp Mondorf

Artificial Intelligence and Computational Linguistics

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[129]
L. Weber-Genzel, S. Peng, M.-C. De Marneffe and B. Plank.
VariErr NLI: Separating Annotation Error from Human Label Variation.
ACL 2024 - 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI
Abstract

Human label variation arises when annotators assign different labels to the same item for valid reasons, while annotation errors occur when labels are assigned for invalid reasons. These two issues are prevalent in NLP benchmarks, yet existing research has studied them in isolation. To the best of our knowledge, there exists no prior work that focuses on teasing apart error from signal, especially in cases where signal is beyond black-and-white. To fill this gap, we introduce a systematic methodology and a new dataset, VariErr (variation versus error), focusing on the NLI task in English. We propose a 2-round annotation procedure with annotators explaining each label and subsequently judging the validity of label-explanation pairs. VariErr contains 7,732 validity judgments on 1,933 explanations for 500 re-annotated MNLI items. We assess the effectiveness of various automatic error detection (AED) methods and GPTs in uncovering errors versus human label variation. We find that state-of-the-art AED methods significantly underperform GPTs and humans. While GPT-4 is the best system, it still falls short of human performance. Our methodology is applicable beyond NLI, offering fertile ground for future research on error versus plausible variation, which in turn can yield better and more trustworthy NLP systems.

MCML Authors
Link to website

Leon Weber-Genzel

Dr.

* Former member

Link to website

Siyao Peng

Dr.

Artificial Intelligence and Computational Linguistics

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[128]
S. Xu, S. T.y.s.s, O. Ichim, B. Plank and M. Grabmair.
Through the Lens of Split Vote: Exploring Disagreement, Difficulty and Calibration in Legal Case Outcome Classification.
ACL 2024 - 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI
Abstract

In legal decisions, split votes (SV) occur when judges cannot reach a unanimous decision, posing a difficulty for lawyers who must navigate diverse legal arguments and opinions. In high-stakes domains, understanding the alignment of perceived difficulty between humans and AI systems is crucial to build trust. However, existing NLP calibration methods focus on a classifier’s awareness of predictive performance, measured against the human majority class, overlooking inherent human label variation (HLV). This paper explores split votes as naturally observable human disagreement and value pluralism. We collect judges’ vote distributions from the European Court of Human Rights (ECHR), and present SV-ECHR, a case outcome classification (COC) dataset with SV information. We build a taxonomy of disagreement with SV-specific subcategories. We further assess the alignment of perceived difficulty between models and humans, as well as confidence- and human-calibration of COC models. We observe limited alignment with the judge vote distribution. To our knowledge, this is the first systematic exploration of calibration to human judgements in legal NLP. Our study underscores the necessity for further research on measuring and enhancing model calibration considering HLV in legal decision tasks.

MCML Authors
Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[127]
K. Hämmerl, J. Libovický and A. Fraser.
Understanding Cross-Lingual Alignment—A Survey.
ACL 2024 - Findings of the 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI
Abstract

Cross-lingual alignment, the meaningful similarity of representations across languages in multilingual language models, has been an active field of research in recent years. We survey the literature of techniques to improve cross-lingual alignment, providing a taxonomy of methods and summarising insights from throughout the field. We present different understandings of cross-lingual alignment and their limitations. We provide a qualitative summary of results from a number of surveyed papers. Finally, we discuss how these insights may be applied not only to encoder models, where this topic has been heavily studied, but also to encoder-decoder or even decoder-only models, and argue that an effective trade-off between language-neutral and language-specific information is key.

MCML Authors
Link to website at LMU

Katharina Hämmerl

Data Analytics & Statistics

Link to Profile Alexander Fraser

Alexander Fraser

Prof. Dr.

Data Analytics & Statistics


[126]
W. Lai, M. Mesgar and A. Fraser.
LLMs Beyond English: Scaling the Multilingual Capability of LLMs with Cross-Lingual Feedback.
ACL 2024 - Findings of the 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI
Abstract

To democratize large language models (LLMs) to most natural languages, it is imperative to make these models capable of understanding and generating texts in many languages, in particular low-resource ones. While recent multilingual LLMs demonstrate remarkable performance in such capabilities, these LLMs still support a limited number of human languages due to the lack of training data for low-resource languages. Moreover, these LLMs are not yet aligned with human preference for downstream tasks, which is crucial for the success of LLMs in English. In this paper, we introduce xLLaMA-100 and xBLOOM-100 (collectively xLLMs-100), which scale the multilingual capabilities of LLaMA and BLOOM to 100 languages. To do so, we construct two datasets: a multilingual instruction dataset including 100 languages, which represents the largest language coverage to date, and a cross-lingual human feedback dataset encompassing 30 languages. We perform multilingual instruction tuning on the constructed instruction data and further align the LLMs with human feedback using the DPO algorithm on our cross-lingual human feedback dataset. We evaluate the multilingual understanding and generating capabilities of xLLMs-100 on five multilingual benchmarks. Experimental results show that xLLMs-100 consistently outperforms its peers across the benchmarks by considerable margins, defining a new state-of-the-art multilingual LLM that supports 100 languages.

MCML Authors
Link to Profile Alexander Fraser

Alexander Fraser

Prof. Dr.

Data Analytics & Statistics


[125]
X. Wang, B. Ma, C. Hu, L. Weber-Genzel, P. Röttger, F. Kreuter, D. Hovy and B. Plank.
My Answer is C: First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models.
ACL 2024 - Findings of the 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI
Abstract

The open-ended nature of language generation makes the evaluation of autoregressive large language models (LLMs) challenging. One common evaluation approach uses multiple-choice questions to limit the response space. The model is then evaluated by ranking the candidate answers by the log probability of the first token prediction. However, first tokens may not consistently reflect the final response output, due to the model’s diverse response styles such as starting with ‘Sure’ or refusing to answer. Consequently, first-token evaluation is not indicative of model behaviour when interacting with users. But by how much? We evaluate how aligned first-token evaluation is with the text output along several dimensions, namely final option choice, refusal rate, choice distribution and robustness under prompt perturbation. Our results show that the two approaches are severely misaligned on all dimensions, reaching mismatch rates over 60%. Models heavily fine-tuned on conversational or safety data are especially impacted. Crucially, models remain misaligned even when we increasingly constrain prompts, i.e., force them to start with an option letter or example template. Our findings i) underscore the importance of inspecting the text output as well and ii) caution against relying solely on first-token evaluation.
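
As a rough illustration of the comparison described in this abstract, the sketch below computes a mismatch rate between the option chosen by first-token log probabilities and the option parsed from the generated text. The function names, the toy log probabilities, and the deliberately naive answer parser are hypothetical and are not taken from the paper.

```python
import re

OPTIONS = ["A", "B", "C", "D"]

def first_token_choice(logprobs):
    """Pick the option letter with the highest first-token log probability."""
    return max(OPTIONS, key=lambda o: logprobs.get(o, float("-inf")))

def text_choice(response):
    """Naively parse a standalone option letter from free text.

    Returns None when no letter is found (e.g. a refusal).
    """
    match = re.search(r"\b([ABCD])\b", response)
    return match.group(1) if match else None

def mismatch_rate(examples):
    """Fraction of examples where the two readings of the model disagree."""
    mismatches = sum(
        first_token_choice(lp) != text_choice(resp) for lp, resp in examples
    )
    return mismatches / len(examples)

# Toy data (made up): first-token log probabilities and the generated text.
examples = [
    ({"A": -0.2, "B": -2.0, "C": -3.0, "D": -3.5}, "A"),
    ({"A": -0.9, "B": -1.5, "C": -1.1, "D": -2.0}, "Sure! My answer is C."),
    ({"A": -0.5, "B": -1.9, "C": -2.2, "D": -2.4}, "I'm sorry, I can't help with that."),
]
print(mismatch_rate(examples))
```

A real evaluation would need a far more careful answer parser; the regular expression here would, for example, misread a response that begins with the English article "A".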

MCML Authors
Link to website

Xinpeng Wang

Artificial Intelligence and Computational Linguistics

Link to website

Bolei Ma

Social Data Science and AI Lab

Link to website

Leon Weber-Genzel

Dr.

* Former member

Link to Profile Frauke Kreuter

Frauke Kreuter

Prof. Dr.

Social Data Science and AI Lab

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[124]
S. Yuan, E. Nie, M. Färber, H. Schmid and H. Schütze.
GNNAVI: Navigating the Information Flow in Large Language Models by Graph Neural Network.
ACL 2024 - Findings of the 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI
Abstract

Large Language Models (LLMs) exhibit strong In-Context Learning (ICL) capabilities when prompts with demonstrations are applied to them. However, fine-tuning still remains crucial to further enhance their adaptability. Prompt-based fine-tuning proves to be an effective fine-tuning method in low-data scenarios, but high demands on computing resources limit its practicality. We address this issue by introducing a prompt-based parameter-efficient fine-tuning (PEFT) approach. GNNavi leverages insights into ICL’s information flow dynamics, which indicates that label words act in prompts as anchors for information propagation. GNNavi employs a Graph Neural Network (GNN) layer to precisely guide the aggregation and distribution of information flow during the processing of prompts by hardwiring the desired information flow into the GNN. Our experiments on text classification tasks with GPT-2 and Llama2 show that GNNavi surpasses standard prompt-based fine-tuning methods in few-shot settings by updating just 0.2% to 0.5% of parameters. We compare GNNavi with prevalent PEFT approaches, such as prefix tuning, LoRA and Adapter in terms of performance and efficiency. Our analysis reveals that GNNavi enhances information flow and ensures a clear aggregation process.

MCML Authors
Link to website

Ercong Nie

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[123]
A. Dimmelmeier, H. Doll, M. Schierholz, E. Kormanyos, M. Fehr, B. Ma, J. Beck, A. Fraser and F. Kreuter.
Informing climate risk analysis using textual information - A research agenda.
ClimateNLP @ACL 2024 - 1st Workshop on Natural Language Processing Meets Climate Change at the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024). Bangkok, Thailand, Aug 11-16, 2024. DOI
Abstract

We present a research agenda focused on efficiently extracting, assuring quality, and consolidating textual company sustainability information to address urgent climate change decision-making needs. Starting from the goal to create integrated FAIR (Findable, Accessible, Interoperable, Reusable) climate-related data, we identify research needs pertaining to the technical aspects of information extraction as well as to the design of the integrated sustainability datasets that we seek to compile. Regarding extraction, we leverage technological advancements, particularly in large language models (LLMs) and Retrieval-Augmented Generation (RAG) pipelines, to unlock the underutilized potential of unstructured textual information contained in corporate sustainability reports. In applying these techniques, we review key challenges, which include the retrieval and extraction of CO2 emission values from PDF documents, especially from unstructured tables and graphs therein, and the validation of automatically extracted data through comparisons with human-annotated values. We also review how existing use cases and practices in climate risk analytics relate to choices of what textual information should be extracted and how it could be linked to existing structured data.

MCML Authors
Link to website

Malte Schierholz

Dr.

Social Data Science and AI Lab

Link to website

Bolei Ma

Social Data Science and AI Lab

Link to website

Jacob Beck

Social Data Science and AI Lab

Link to Profile Alexander Fraser

Alexander Fraser

Prof. Dr.

Data Analytics & Statistics

Link to Profile Frauke Kreuter

Frauke Kreuter

Prof. Dr.

Social Data Science and AI Lab


[122]
S. Zhou, S. Peng and B. Plank.
CLIMATELI: Evaluating Entity Linking on Climate Change Data.
ClimateNLP @ACL 2024 - 1st Workshop on Natural Language Processing Meets Climate Change at the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024). Bangkok, Thailand, Aug 11-16, 2024. DOI
Abstract

Climate Change (CC) is a pressing topic of global importance, attracting increasing attention across research fields, from social sciences to Natural Language Processing (NLP). CC is also discussed in various settings and communication platforms, from academic publications to social media forums. Understanding who and what is mentioned in such data is a first critical step to gaining new insights into CC. We present CLIMATELI (CLIMATe Entity LInking), the first manually annotated CC dataset that links 3,087 entity spans to Wikipedia. Using CLIMATELI, we evaluate existing entity linking (EL) systems on the CC topic across various genres and propose automated filtering methods for CC entities. We find that the performance of EL models notably lags behind humans at both token and entity levels. Testing within the scope of retaining or excluding non-nominal and/or non-CC entities particularly impacts the models’ performances.

MCML Authors
Link to website

Shijia Zhou

Artificial Intelligence and Computational Linguistics

Link to website

Siyao Peng

Dr.

Artificial Intelligence and Computational Linguistics

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[121]
A. Yüksel, A. Köksal, L. K. Senel, A. Korhonen and H. Schütze.
TurkishMMLU: Measuring Massive Multitask Language Understanding in Turkish.
SIGTURK @ACL 2024 - 1st Workshop on Natural Language Processing for Turkic Languages at the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024). Bangkok, Thailand, Aug 11-16, 2024. Invited talk. arXiv GitHub
Abstract

Multiple choice question answering tasks evaluate the reasoning, comprehension, and mathematical abilities of Large Language Models (LLMs). While existing benchmarks employ automatic translation for multilingual evaluation, this approach is error-prone and potentially introduces culturally biased questions, especially in social sciences. We introduce the first multitask, multiple-choice Turkish QA benchmark, TurkishMMLU, to evaluate LLMs’ understanding of the Turkish language. TurkishMMLU includes over 10,000 questions, covering 9 different subjects from Turkish high-school education curricula. These questions are written by curriculum experts, suitable for the high-school curricula in Turkey, covering subjects ranging from natural sciences and math questions to more culturally representative topics such as Turkish Literature and the history of the Turkish Republic. We evaluate over 20 LLMs, including multilingual open-source (e.g., Gemma, Llama, MT5), closed-source (GPT 4o, Claude, Gemini), and Turkish-adapted (e.g., Trendyol) models. We provide an extensive evaluation, including zero-shot and few-shot evaluation of LLMs, chain-of-thought reasoning, and question difficulty analysis along with model performance. We provide an in-depth analysis of the Turkish capabilities and limitations of current LLMs to provide insights for future LLMs for the Turkish language.

MCML Authors
Link to website

Abdullatif Köksal

Statistical NLP and Deep Learning

Link to website

Lütfi Kerem Şenel

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[120]
M. Aßenmacher, A. Stephan, L. Weissweiler, E. Çano, I. Ziegler, M. Härttrich, B. Bischl, B. Roth, C. Heumann and H. Schütze.
Collaborative Development of Modular Open Source Educational Resources for Natural Language Processing.
TeachingNLP @ACL 2024 - 6th Workshop on Teaching NLP at the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024). Bangkok, Thailand, Aug 11-16, 2024. URL
Abstract

In this work, we present a collaboratively and continuously developed open-source educational resource (OSER) for teaching natural language processing at two different universities. We shed light on the principles we followed for the initial design of the course and the rationale for ongoing developments, followed by a reflection on the inter-university collaboration for designing and maintaining teaching material. When reflecting on the latter, we explicitly emphasize the considerations that need to be made when facing heterogeneous groups and when having to accommodate multiple examination regulations within one single course framework. Relying on the fundamental principles of OSER developments as defined by Bothmann et al. (2023) proved to be an important guideline during this process. The final part pertains to open-sourcing our teaching material, coping with the increasing speed of developments in the field, and integrating the course digitally, also addressing conflicting priorities and challenges we are currently facing.

MCML Authors
Link to website

Matthias Aßenmacher

Dr.

Statistical Learning & Data Science

Leonie Weissweiler

Dr.

* Former member

Link to Profile Bernd Bischl

Bernd Bischl

Prof. Dr.

Statistical Learning & Data Science

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[119]
S. Eckman, B. Plank and F. Kreuter.
Position: Insights from Survey Methodology can Improve Training Data.
ICML 2024 - 41st International Conference on Machine Learning. Vienna, Austria, Jul 21-27, 2024. URL
Abstract

Whether future AI models are fair, trustworthy, and aligned with the public’s interests rests in part on our ability to collect accurate data about what we want the models to do. However, collecting high-quality data is difficult, and few AI/ML researchers are trained in data collection methods. Recent research in data-centric AI has shown that higher quality training data leads to better performing models, making this the right moment to introduce AI/ML researchers to the field of survey methodology, the science of data collection. We summarize insights from the survey methodology literature and discuss how they can improve the quality of training and feedback data. We also suggest collaborative research ideas into how biases in data collection can be mitigated, making models more accurate and human-centric.

MCML Authors
Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics

Link to Profile Frauke Kreuter

Frauke Kreuter

Prof. Dr.

Social Data Science and AI Lab


[118]
J. Shin, M. A. Hedderich, B. J. Rey, A. Lucero and A. Oulasvirta.
Understanding Human-AI Workflows for Generating Personas.
DIS 2024 - ACM Conference on Designing Interactive Systems. Copenhagen, Denmark, Jul 01-05, 2024. DOI
Abstract

One barrier to deeper adoption of user-research methods is the amount of labor required to create high-quality representations of collected data. Trained user researchers need to analyze datasets and produce informative summaries pertaining to the original data. While Large Language Models (LLMs) could assist in generating summaries, they are known to hallucinate and produce biased responses. In this paper, we study human–AI workflows that differently delegate subtasks in user research between human experts and LLMs. Studying persona generation as our case, we found that LLMs are not good at capturing key characteristics of user data on their own. Better results are achieved when we leverage human skill in grouping user data by their key characteristics and exploit LLMs for summarizing pre-grouped data into personas. Personas generated via this collaborative approach can be more representative and empathy-evoking than ones generated by human experts or LLMs alone. We also found that LLMs could mimic generated personas and enable interaction with personas, thereby helping user researchers empathize with them. We conclude that LLMs, by facilitating the analysis of user data, may promote widespread application of qualitative methods in user research.

MCML Authors
Link to Profile Michael Hedderich

Michael Hedderich

Dr.

JRG Leader Human-Centered NLP

Artificial Intelligence and Computational Linguistics


[117]
P. Lin, A. F. T. Martins and H. Schütze.
A Recipe of Parallel Corpora Exploitation for Multilingual Large Language Models.
Preprint (Jul. 2024). arXiv
Abstract

Recent studies have highlighted the potential of exploiting parallel corpora to enhance multilingual large language models, improving performance in both bilingual tasks, e.g., machine translation, and general-purpose tasks, e.g., text classification. Building upon these findings, our comprehensive study aims to identify the most effective strategies for leveraging parallel corpora. We investigate the impact of parallel corpora quality and quantity, training objectives, and model size on the performance of multilingual large language models enhanced with parallel corpora across diverse languages and tasks. Our analysis reveals several key insights: (i) filtering noisy translations is essential for effectively exploiting parallel corpora, while language identification and short sentence filtering have little effect; (ii) even a corpus containing just 10K parallel sentences can yield results comparable to those obtained from much larger datasets; (iii) employing only the machine translation objective yields the best results among various training objectives and their combinations; (iv) larger multilingual language models benefit more from parallel corpora than smaller models due to their stronger capacity for cross-task transfer. Our study offers valuable insights into the optimal utilization of parallel corpora to enhance multilingual large language models, extending the generalizability of previous findings from limited languages and tasks to a broader range of scenarios.

MCML Authors
Link to website

Peiqin Lin

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[116]
C. Ma, Y. Liu, H. Ye and H. Schütze.
Exploring the Role of Transliteration in In-Context Learning for Low-resource Languages Written in Non-Latin Scripts.
Preprint (Jul. 2024). arXiv
Abstract

Decoder-only large language models (LLMs) excel in high-resource languages across various tasks through few-shot or even zero-shot in-context learning (ICL). However, their performance often does not transfer well to low-resource languages, especially those written in non-Latin scripts. Inspired by recent work that leverages transliteration in encoder-only models, we investigate whether transliteration is also effective in improving LLMs’ performance for low-resource languages written in non-Latin scripts. To this end, we propose three prompt templates, where the target-language text is represented in (1) its original script, (2) Latin script, or (3) both. We apply these methods to several representative LLMs of different sizes on various tasks including text classification and sequential labeling. Our findings show that the effectiveness of transliteration varies by task type and model size. For instance, all models benefit from transliterations for sequential labeling (with increases of up to 25%).

MCML Authors
Link to website

Chunlan Ma

Statistical NLP and Deep Learning

Link to website

Yihong Liu

Statistical NLP and Deep Learning

Link to website

Haotian Ye

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[115]
P. Piccirilli, A. Fraser and S. Schulte im Walde.
VOLIMET: A Parallel Corpus of Literal and Metaphorical Verb-Object Pairs for English–German and English–French.
*SEM 2024 - 13th Joint Conference on Lexical and Computational Semantics co-located with NAACL 2024. Mexico City, Mexico, Jun 20-21, 2024. DOI
Abstract

The interplay of cultural and linguistic elements that characterizes metaphorical language poses a substantial challenge for both human comprehension and machine processing. This challenge goes beyond monolingual settings and becomes particularly complex in translation, even more so in automatic translation. We present VOLIMET, a corpus of 2,916 parallel sentences containing gold standard alignments of metaphorical verb-object pairs and their literal paraphrases, e.g., tackle/address question, from English to German and French. On the one hand, the parallel nature of our corpus enables us to explore monolingual patterns for metaphorical vs. literal uses in English. On the other hand, we investigate different aspects of cross-lingual translations into German and French and the extent to which metaphoricity and literalness in the source language are transferred to the target languages. Monolingually, our findings reveal clear preferences in using metaphorical or literal uses of verb-object pairs. Cross-lingually, we observe a rich variability in translations as well as different behaviors for our two target languages.

MCML Authors
Link to Profile Alexander Fraser

Alexander Fraser

Prof. Dr.

Data Analytics & Statistics


[114]
H. Ye, Y. Liu, C. Ma and H. Schütze.
MoSECroT: Model Stitching with Static Word Embeddings for Crosslingual Zero-shot Transfer.
NAACL 2024 - 5th Workshop on Insights from Negative Results in NLP at the Annual Conference of the North American Chapter of the Association for Computational Linguistics. Mexico City, Mexico, Jun 16-21, 2024. URL
Abstract

Transformer-based pre-trained language models (PLMs) have achieved remarkable performance in various natural language processing (NLP) tasks. However, pre-training such models can take considerable resources that are almost only available to high-resource languages. On the contrary, static word embeddings are easier to train in terms of computing resources and the amount of data required. In this paper, we introduce MoSECroT (Model Stitching with Static Word Embeddings for Crosslingual Zero-shot Transfer), a novel and challenging task that is especially relevant to low-resource languages for which static word embeddings are available. To tackle the task, we present the first framework that leverages relative representations to construct a common space for the embeddings of a source language PLM and the static word embeddings of a target language. In this way, we can train the PLM on source-language training data and perform zero-shot transfer to the target language by simply swapping the embedding layer. However, through extensive experiments on two classification datasets, we show that although our proposed framework is competitive with weak baselines when addressing MoSECroT, it fails to achieve competitive results compared with some strong baselines. In this paper, we attempt to explain this negative result and provide several thoughts on possible improvement.

MCML Authors
Link to website

Haotian Ye

Statistical NLP and Deep Learning

Link to website

Yihong Liu

Statistical NLP and Deep Learning

Link to website

Chunlan Ma

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[113]
M. Wang, H. Adel, L. Lange, J. Strötgen and H. Schütze.
Rehearsal-Free Modular and Compositional Continual Learning for Language Models.
NAACL 2024 - Annual Conference of the North American Chapter of the Association for Computational Linguistics. Mexico City, Mexico, Jun 16-21, 2024. URL
Abstract

Continual learning aims at incrementally acquiring new knowledge while not forgetting existing knowledge. To overcome catastrophic forgetting, methods are either rehearsal-based, i.e., store data examples from previous tasks for data replay, or isolate parameters dedicated to each task. However, rehearsal-based methods raise privacy and memory issues, and parameter-isolation continual learning does not consider interaction between tasks, thus hindering knowledge transfer. In this work, we propose MoCL, a rehearsal-free Modular and Compositional Continual Learning framework which continually adds new modules to language models and composes them with existing modules. Experiments on various benchmarks show that MoCL outperforms the state of the art and effectively facilitates knowledge transfer.

MCML Authors
Link to website

Mingyang Wang

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[112]
Y. Liu, P. Lin, M. Wang and H. Schütze.
OFA: A Framework of Initializing Unseen Subword Embeddings for Efficient Large-scale Multilingual Continued Pretraining.
NAACL 2024 - Findings of Annual Conference of the North American Chapter of the Association for Computational Linguistics. Mexico City, Mexico, Jun 16-21, 2024. URL
Abstract

Instead of pretraining multilingual language models from scratch, a more efficient method is to adapt existing pretrained language models (PLMs) to new languages via vocabulary extension and continued pretraining. However, this method usually randomly initializes the embeddings of new subwords and introduces substantially more embedding parameters to the model, thus weakening the efficiency. To address these issues, we propose a novel framework: One For All (OFA), which wisely initializes the embeddings of unseen subwords and thus can adapt a PLM to multiple languages efficiently and effectively. OFA takes advantage of external well-aligned multilingual static word vectors and injects the alignment knowledge into the subword embeddings. In addition, OFA applies matrix factorization and replaces the cumbersome embeddings with two lower-dimensional matrices, which largely reduces the number of parameters. We show OFA accelerates the convergence of continued pretraining, which is environmentally friendly as it generates a much smaller carbon footprint. Through extensive experiments, we demonstrate OFA can achieve competitive or better performance than default continued pretraining baselines on a wide range of crosslingual downstream tasks. We make our code and models publicly available.
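
To make the parameter saving from factorizing an embedding matrix concrete, the sketch below compares a full vocabulary-by-hidden embedding table with two lower-dimensional matrices. It is only a counting exercise under assumed sizes: the vocabulary size, hidden dimension, and low dimension are hypothetical values chosen for illustration, not figures reported for OFA.

```python
def embedding_params(vocab_size, hidden_dim):
    """Parameters in a full embedding table of shape (vocab_size, hidden_dim)."""
    return vocab_size * hidden_dim

def factorized_params(vocab_size, hidden_dim, low_dim):
    """Parameters when the table is replaced by two matrices:
    (vocab_size, low_dim) and (low_dim, hidden_dim)."""
    return vocab_size * low_dim + low_dim * hidden_dim

# Hypothetical sizes, not taken from the paper.
V, D, d = 400_000, 768, 100

full = embedding_params(V, D)
small = factorized_params(V, D, d)
print(f"full: {full:,}  factorized: {small:,}  ratio: {small / full:.3f}")
```

The saving grows with the vocabulary size, because the vocabulary dimension is multiplied by the low dimension rather than by the full hidden dimension.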

MCML Authors
Link to website

Yihong Liu

Statistical NLP and Deep Learning

Link to website

Peiqin Lin

Statistical NLP and Deep Learning

Link to website

Mingyang Wang

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[111]
S. Zhou, H. Shan, B. Plank and R. Litschko.
MaiNLP at SemEval-2024 Task 1: Analyzing Source Language Selection in Cross-Lingual Textual Relatedness.
SemEval @NAACL 2024 - 18th International Workshop on Semantic Evaluation at the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024). Mexico City, Mexico, Jun 16-21, 2024. URL
Abstract

This paper presents our system developed for the SemEval-2024 Task 1: Semantic Textual Relatedness (STR), on Track C: Cross-lingual. The task aims to detect semantic relatedness of two sentences from the same languages. For the cross-lingual approach, we developed a set of linguistics-inspired models trained with several task-specific strategies. We 1) utilize language vectors for selection of donor languages; 2) investigate the multi-source approach for training; 3) use transliteration of non-Latin script to study the impact of the ‘script gap’; 4) opt for machine translation for data augmentation. We additionally compare the performance of XLM-RoBERTa and Furina with the same training strategy. Our submission achieved first place in the C8 (Kinyarwanda) test.

MCML Authors
Link to website

Shijia Zhou

Artificial Intelligence and Computational Linguistics

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics

Link to website

Robert Litschko

Artificial Intelligence and Computational Linguistics


[110]
Y. Zhang, V. Hangya and A. Fraser.
A Study of the Class Imbalance Problem in Abusive Language Detection.
WOAH @NAACL 2024 - 8th Workshop on Online Abuse and Harms at the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024). Mexico City, Mexico, Jun 16-21, 2024. DOI
Abstract

Abusive language detection has drawn increasing interest in recent years. However, a less systematically explored obstacle is label imbalance, i.e., the amount of abusive data is much lower than non-abusive data, leading to performance issues. The aim of this work is to conduct a comprehensive comparative study of popular methods for addressing the class imbalance issue. We explore 10 well-known approaches on 8 datasets with distinct characteristics: binary or multi-class, moderately or largely imbalanced, focusing on various types of abuse, etc. Additionally, we propose two novel methods specialized for abuse detection: AbusiveLexiconAug and ExternalDataAug, which enrich the training data using abusive lexicons and external abusive datasets, respectively. We conclude that: 1) our AbusiveLexiconAug approach, random oversampling, and focal loss are the most versatile methods on various datasets; 2) focal loss tends to yield peak model performance; 3) oversampling and focal loss provide promising results for binary datasets and small multi-class sets, while undersampling and weighted cross-entropy are more suitable for large multi-class sets; 4) most methods are sensitive to hyperparameters, yet our suggested choice of hyperparameters provides a good starting point.
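
For readers unfamiliar with focal loss, one of the methods compared in this study, the sketch below shows its commonly used formulation for a single example next to plain cross-entropy. This is the general textbook form, not the configuration used in the paper; the default `gamma=2.0` and `alpha=1.0` are assumptions made for illustration.

```python
import math

def cross_entropy(p_true):
    """Cross-entropy for one example, given the predicted probability of its true class."""
    return -math.log(p_true)

def focal_loss(p_true, gamma=2.0, alpha=1.0):
    """Focal loss for one example.

    The factor (1 - p_true) ** gamma shrinks the loss of examples the model
    already classifies confidently, so hard examples dominate training.
    gamma and alpha defaults are illustrative assumptions.
    """
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)

# Compare both losses for an easy, a medium, and a hard example.
for p in (0.9, 0.6, 0.1):
    print(p, round(cross_entropy(p), 4), round(focal_loss(p), 4))
```

With `gamma=0` the focal loss reduces to cross-entropy; larger values of `gamma` put relatively more weight on examples the model gets wrong, which is why it is often tried on imbalanced data.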

MCML Authors
Link to Profile Alexander Fraser

Alexander Fraser

Prof. Dr.

Data Analytics & Statistics


[109]
A. Bavaresco, R. Bernardi, L. Bertolazzi, D. Elliott, R. Fernández, A. Gatt, E. Ghaleb, M. Giulianelli, M. Hanna, A. Koller, A. F. T. Martins, P. Mondorf, V. Neplenbroek, S. Pezzelle, B. Plank, D. Schlangen, A. Suglia, A. K. Surikuchi, E. Takmaz and A. Testoni.
LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks.
Preprint (Jun. 2024). arXiv
Abstract

There is an increasing trend towards evaluating NLP models with LLM-generated judgments instead of human judgments. In the absence of a comparison against human data, this raises concerns about the validity of these evaluations; in case they are conducted with proprietary models, this also raises concerns over reproducibility. We provide JUDGE-BENCH, a collection of 20 NLP datasets with human annotations, and comprehensively evaluate 11 current LLMs, covering both open-weight and proprietary models, for their ability to replicate the annotations. Our evaluations show that each LLM exhibits a large variance across datasets in its correlation to human judgments. We conclude that LLMs are not yet ready to systematically replace human judges in NLP.

MCML Authors
Link to website

Philipp Mondorf

Artificial Intelligence and Computational Linguistics

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[108]
L. Hirlimann, S. Zhang, H. Schütze and P. Wicke.
Robustness Testing of Multi-Modal Models in Varied Home Environments for Assistive Robots.
Preprint (Jun. 2024). arXiv
Abstract

The development of assistive robotic agents to support household tasks is advancing, yet the underlying models often operate in virtual settings that do not reflect real-world complexity. For assistive care robots to be effective in diverse environments, their models must be robust and integrate multiple modalities. Consider a caretaker needing assistance in a dimly lit room or navigating around a newly installed glass door. Models relying solely on visual input might fail in low light, while those using depth information could avoid the door. This demonstrates the necessity for models that can process various sensory inputs. Our ongoing study evaluates state-of-the-art robotic models in the AI2Thor virtual environment. We introduce disturbances, such as dimmed lighting and mirrored walls, to assess their impact on modalities like movement or vision, and object recognition. Our goal is to gather input from the Geriatronics community to understand and model the challenges faced by practitioners.

MCML Authors
Link to website

Shengqiang Zhang

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning

Link to website

Philipp Wicke

Dr.

Statistical NLP and Deep Learning


[107]
P. Lin, A. F. T. Martins and H. Schütze.
XAMPLER: Learning to Retrieve Cross-Lingual In-Context Examples.
Preprint (Jun. 2024). arXiv GitHub
Abstract

Recent studies indicate that leveraging off-the-shelf or fine-tuned retrievers, capable of retrieving relevant in-context examples tailored to the input query, enhances few-shot in-context learning of English. However, adapting these methods to other languages, especially low-resource ones, poses challenges due to the scarcity of cross-lingual retrievers and annotated data. Thus, we introduce XAMPLER: Cross-Lingual Example Retrieval, a method tailored to tackle the challenge of cross-lingual in-context learning using only annotated English data. XAMPLER first trains a retriever based on Glot500, a multilingual small language model, using positive and negative English examples constructed from the predictions of a multilingual large language model, i.e., MaLA500. Leveraging the cross-lingual capacity of the retriever, it can directly retrieve English examples as few-shot examples for in-context learning of target languages. Experiments on the multilingual text classification benchmark SIB200 with 176 languages show that XAMPLER substantially improves the in-context learning performance across languages.

MCML Authors
Link to website

Peiqin Lin

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[106]
C. Ma, A. ImaniGooghari, H. Ye, R. Pei, E. Asgari and H. Schütze.
Taxi1500: A Multilingual Dataset for Text Classification in 1500 Languages.
Preprint (Jun. 2024). arXiv
Abstract

While natural language processing tools have been developed extensively for some of the world’s languages, a significant portion of the world’s over 7000 languages are still neglected. One reason for this is that evaluation datasets do not yet cover a wide range of languages, including low-resource and endangered ones. We aim to address this issue by creating a text classification dataset encompassing a large number of languages, many of which currently have little to no annotated data available. We leverage parallel translations of the Bible to construct such a dataset by first developing applicable topics and employing a crowdsourcing tool to collect annotated data. By annotating the English side of the data and projecting the labels onto other languages through aligned verses, we generate text classification datasets for more than 1500 languages. We extensively benchmark several existing multilingual language models using our dataset. To facilitate the advancement of research in this area, we will release our dataset and code.
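
The label projection step described above (annotate the English side, then carry each label over to the aligned verse in another language) could be sketched roughly as follows. The verse IDs, label names, and placeholder texts are hypothetical, and the function is an illustrative assumption rather than the authors' released code.

```python
from typing import Dict, Tuple


def project_labels(
    english_labels: Dict[str, str],
    target_verses: Dict[str, str],
) -> Dict[str, Tuple[str, str]]:
    """Copy each English verse label onto the target-language verse with the same ID."""
    projected = {}
    for verse_id, text in target_verses.items():
        label = english_labels.get(verse_id)
        if label is not None:  # skip verses without an annotated English counterpart
            projected[verse_id] = (text, label)
    return projected


# Hypothetical verse IDs, labels, and placeholder texts, for illustration only.
english_labels = {"v001": "topic_a", "v002": "topic_b"}
target_verses = {
    "v001": "<target-language text 1>",
    "v003": "<target-language text 3>",
}

projected = project_labels(english_labels, target_verses)
print(projected)
```

Only verses that exist in both the annotated English set and the target translation receive a label, so the size of the projected dataset depends on how much of the text is available in each language.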

MCML Authors
Link to website

Chunlan Ma

Statistical NLP and Deep Learning

Link to website

Haotian Ye

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[105]
E. Nie, B. Shao, Z. Ding, M. Wang, H. Schmid and H. Schütze.
BMIKE-53: Investigating Cross-Lingual Knowledge Editing with In-Context Learning.
Preprint (Jun. 2024). arXiv GitHub
Abstract

Large language models (LLMs) possess extensive parametric knowledge, but this knowledge is difficult to update with new information because retraining is very expensive and infeasible for closed-source models. Knowledge editing (KE) has emerged as a viable solution for updating the knowledge of LLMs without compromising their overall performance. On-the-fly KE methods, inspired by in-context learning (ICL), have shown great promise and allow LLMs to be treated as black boxes. In the past, KE was primarily employed in English contexts, whereas the potential for cross-lingual KE in current English-centric LLMs has not been fully explored. To foster more research in this direction, we introduce the BMIKE-53 benchmark for evaluating cross-lingual KE on 53 diverse languages across three KE task types. We also propose a gradient-free KE method called Multilingual In-context Knowledge Editing (MIKE) and evaluate it on BMIKE-53. Our evaluation focuses on cross-lingual knowledge transfer in terms of reliability, generality, locality, and portability, offering valuable insights and a framework for future research in cross-lingual KE.

MCML Authors
Link to website

Ercong Nie

Statistical NLP and Deep Learning

Link to website

Zifeng Ding

Database Systems & Data Mining

Link to website

Mingyang Wang

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[104]
M. Wang, H. Adel, L. Lange, J. Strötgen and H. Schütze.
Learn it or Leave it: Module Composition and Pruning for Continual Learning.
Preprint (Jun. 2024). arXiv
Abstract

In real-world environments, continual learning is essential for machine learning models, as they need to acquire new knowledge incrementally without forgetting what they have already learned. While pretrained language models have shown impressive capabilities on various static tasks, applying them to continual learning poses significant challenges, including avoiding catastrophic forgetting, facilitating knowledge transfer, and maintaining parameter efficiency. In this paper, we introduce MoCL-P, a novel lightweight continual learning method that addresses these challenges simultaneously. Unlike traditional approaches that continuously expand parameters for newly arriving tasks, MoCL-P integrates task representation-guided module composition with adaptive pruning, effectively balancing knowledge integration and computational overhead. Our evaluation across three continual learning benchmarks with up to 176 tasks shows that MoCL-P achieves state-of-the-art performance and improves parameter efficiency by up to three times, demonstrating its potential for practical applications where resource requirements are constrained.

MCML Authors
Link to website

Mingyang Wang

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[103]
V. Blaschke, B. Kovačić, S. Peng, H. Schütze and B. Plank.
MaiBaam: A Multi-Dialectal Bavarian Universal Dependency Treebank.
LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation. Torino, Italy, May 20-25, 2024. URL
Abstract

Despite the success of the Universal Dependencies (UD) project exemplified by its impressive language breadth, there is still a lack in ‘within-language breadth’: most treebanks focus on standard languages. Even for German, the language with the most annotations in UD, so far no treebank exists for one of its language varieties spoken by over 10M people: Bavarian. To contribute to closing this gap, we present the first multi-dialect Bavarian treebank (MaiBaam) manually annotated with part-of-speech and syntactic dependency information in UD, covering multiple text genres (wiki, fiction, grammar examples, social, non-fiction). We highlight the morphosyntactic differences between the closely-related Bavarian and German and showcase the rich variability of speakers’ orthographies. Our corpus includes 15k tokens, covering dialects from all Bavarian-speaking areas spanning three countries. We provide baseline parsing and POS tagging results, which are lower than results obtained on German and vary substantially between different graph-based parsers. To support further research on Bavarian syntax, we make our dataset, language-specific guidelines and code publicly available.

MCML Authors
Link to website

Verena Blaschke

Artificial Intelligence and Computational Linguistics

Link to website

Siyao Peng

Dr.

Artificial Intelligence and Computational Linguistics

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[102]
V. Hangya and A. Fraser.
How to Solve Few-Shot Abusive Content Detection Using the Data We Actually Have.
LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation. Torino, Italy, May 20-25, 2024. URL
Abstract

Due to the broad range of social media platforms, the requirements of abusive language detection systems are varied and ever-changing. A large set of annotated corpora with different properties and label sets has already been created, for tasks such as hate or misogyny detection, but the form and targets of abusive speech are constantly evolving. Since the annotation of new corpora is expensive, in this work we leverage datasets we already have, covering a wide range of tasks related to abusive language detection. Our goal is to build models cheaply for a new target label set and/or language, using only a few training examples of the target domain. We propose a two-step approach: first we train our model in a multitask fashion. We then carry out few-shot adaptation to the target requirements. Our experiments show that, using already existing datasets and only a few shots of the target task, the performance of models improves both monolingually and across languages. Our analysis also shows that our models acquire a general understanding of abusive language, since they improve the prediction of labels which are present only in the target dataset and can benefit from knowledge about labels which are not directly used for the target task.

MCML Authors
Link to Profile Alexander Fraser

Alexander Fraser

Prof. Dr.

Data Analytics & Statistics


[101]
A. H. Kargaran, F. Yvon and H. Schütze.
GlotScript: A Resource and Tool for Low Resource Writing System Identification.
LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation. Torino, Italy, May 20-25, 2024. URL GitHub
Abstract

We present GlotScript, an open resource and tool for low resource writing system identification. GlotScript-R is a resource that provides the attested writing systems for more than 7,000 languages. It is compiled by aggregating information from existing writing system resources. GlotScript-T is a writing system identification tool that covers all 161 Unicode 15.0 scripts. For an input text, it returns its script distribution where scripts are identified by ISO 15924 codes. We also present two use cases for GlotScript. First, we demonstrate that GlotScript can help cleaning multilingual corpora such as mC4 and OSCAR. Second, we analyze the tokenization of a number of language models such as GPT-4 using GlotScript and provide insights on the coverage of low resource scripts and languages by each language model. We hope that GlotScript will become a useful resource for work on low resource languages in the NLP community.
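
The idea of returning a script distribution for an input text can be illustrated with a rough standard-library sketch. This is not the GlotScript API: it guesses the script from the first word of each character's Unicode name, and the small mapping to ISO 15924 codes is an assumption made for illustration that covers only four scripts.

```python
import unicodedata
from collections import Counter

# Assumed, deliberately tiny mapping from Unicode name prefixes to ISO 15924 codes.
NAME_PREFIX_TO_ISO15924 = {
    "LATIN": "Latn",
    "CYRILLIC": "Cyrl",
    "GREEK": "Grek",
    "ARABIC": "Arab",
}


def script_distribution(text: str) -> dict:
    """Return the share of alphabetic characters per (approximate) script code."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # ignore spaces, digits, and punctuation
        name = unicodedata.name(ch, "")
        prefix = name.split(" ", 1)[0] if name else ""
        # "Zzzz" is used here as a catch-all for anything the toy mapping misses.
        counts[NAME_PREFIX_TO_ISO15924.get(prefix, "Zzzz")] += 1
    total = sum(counts.values())
    return {script: n / total for script, n in counts.items()} if total else {}


print(script_distribution("abc где"))
```

A distribution of this kind is what makes corpus cleaning possible: a line labeled as one language but dominated by an unexpected script can be flagged for review.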

MCML Authors
Link to website

Amir Hossein Kargaran

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[100]
A. Köksal, S. Severini and H. Schütze.
SilverAlign: MT-Based Silver Data Algorithm for Evaluating Word Alignment.
LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation. Torino, Italy, May 20-25, 2024. URL
Abstract

Word alignments are essential for a variety of NLP tasks. Therefore, choosing the best approaches for their creation is crucial. However, the scarce availability of gold evaluation data makes the choice difficult. We propose SilverAlign, a new method to automatically create silver data for the evaluation of word aligners by exploiting machine translation and minimal pairs. We show that performance on our silver data correlates well with gold benchmarks for 9 language pairs, making our approach a valid resource for evaluation of different domains and languages when gold data are not available. This addresses the important scenario of missing gold data alignments for low-resource languages.

MCML Authors
Link to website

Abdullatif Köksal

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[99]
M. Marco and A. Fraser.
Analyzing the Understanding of Morphologically Complex Words in Large Language Models.
LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation. Torino, Italy, May 20-25, 2024. URL
Abstract

We empirically study the ability of a Large Language Model (gpt-3.5-turbo-instruct) to understand morphologically complex words. In our experiments, we looked at a variety of tasks to analyse German compounds with regard to compositional word formation and derivation, such as identifying the head noun of existing and novel compounds, identifying the shared verb stem between two words, or recognizing words constructed with inappropriately used derivation morphemes as invalid. Our results show that the language model is generally capable of solving most tasks, except for the task of identifying ill-formed word forms. While the model demonstrated a good overall understanding of complex words and their word-internal structure, the results also suggest that there is no formal knowledge of derivational rules, but rather an interpretation of the observed word parts to derive the meaning of a word.

MCML Authors
Link to Profile Alexander Fraser

Alexander Fraser

Prof. Dr.

Data Analytics & Statistics


[98]
D. R. Mortensen, V. Izrailevitch, Y. Xiao, H. Schütze and L. Weissweiler.
Verbing Weirds Language (Models): Evaluation of English Zero-Derivation in Five LLMs.
LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation. Torino, Italy, May 20-25, 2024. URL
Abstract

Lexical-syntactic flexibility, in the form of conversion (or zero-derivation), is a hallmark of English morphology. In conversion, a word with one part of speech is placed in a non-prototypical context, where it is coerced to behave as if it had a different part of speech. However, while this process affects a large part of the English lexicon, little work has been done to establish the degree to which language models capture this type of generalization. This paper reports the first study on the behavior of large language models with reference to conversion. We design a task for testing lexical-syntactic flexibility, that is, the degree to which models can generalize over words in a construction with a non-prototypical part of speech. This task is situated within a natural language inference paradigm. We test the abilities of five language models: two proprietary models (GPT-3.5 and GPT-4) and three open-source models (Mistral 7B, Falcon 40B, and Llama 2 70B). We find that GPT-4 performs best on the task, followed by GPT-3.5, but that the open-source language models are also able to perform it and that the 7-billion-parameter Mistral displays as little difference between its baseline performance on the natural language inference task and the non-prototypical syntactic category task as the massive GPT-4.

MCML Authors
Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning

Leonie Weissweiler

Dr.

* Former member


[97]
C. Müller and B. Plank.
IndirectQA: Understanding Indirect Answers to Implicit Polar Questions in French and Spanish.
LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation. Torino, Italy, May 20-25, 2024. URL
Abstract

Polar questions are common in dialogue and expect exactly one of two answers (yes/no). It is however not uncommon for speakers to bypass these expected choices and answer, for example, ‘Islands are generally by the sea’ to the question: ‘An island? By the sea?’. While such answers are natural in spoken dialogues, conversational systems still struggle to interpret them. Seminal work on interpreting indirect answers has been done in recent years, but only for English and with strict question formulations. In this work, we present a new corpus for French and Spanish, IndirectQA, where we mine subtitle data for indirect answers to study the labeling task with six different labels, while broadening polar questions to also include implicit polar questions (statements that trigger a yes/no answer but are not necessarily formulated as a question). We opted for subtitles since they are a readily available source of conversation in various languages, but they also come with peculiarities and challenges, which we discuss. Overall, we provide the first results on French and Spanish. They show that the task is challenging: the baseline accuracy scores drop from 61.43 on English to 44.06 for French and Spanish.

MCML Authors
Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[96]
S. Peng, Z. Sun, H. Shan, M. Kolm, V. Blaschke, E. Artemova and B. Plank.
Sebastian, Basti, Wastl?! Recognizing Named Entities in Bavarian Dialectal Data.
LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation. Torino, Italy, May 20-25, 2024. URL
Abstract

Named Entity Recognition (NER) is a fundamental task to extract key information from texts, but annotated resources are scarce for dialects. This paper introduces the first dialectal NER dataset for German, BarNER, with 161K tokens annotated on Bavarian Wikipedia articles (bar-wiki) and tweets (bar-tweet), using a schema adapted from German CoNLL 2006 and GermEval. The Bavarian dialect differs from standard German in lexical distribution, syntactic construction, and entity information. We conduct in-domain, cross-domain, sequential, and joint experiments on two Bavarian and three German corpora and present the first comprehensive NER results on Bavarian. Incorporating knowledge from the larger German NER (sub-)datasets notably improves on bar-wiki and moderately on bar-tweet. Inversely, training first on Bavarian contributes slightly to the seminal German CoNLL 2006 corpus. Moreover, with gold dialect labels on Bavarian tweets, we assess multi-task learning between five NER and two Bavarian-German dialect identification tasks and achieve NER SOTA on bar-wiki. We substantiate the necessity of our low-resource BarNER corpus and the importance of diversity in dialects, genres, and topics in enhancing model performance.

MCML Authors
Link to website

Siyao Peng

Dr.

Artificial Intelligence and Computational Linguistics

Link to website

Verena Blaschke

Artificial Intelligence and Computational Linguistics

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[95]
L. Weissweiler, N. Böbel, K. Herrera, W. Scivetti, A. Lorenzi, N. Melnik, A. Bhatia, H. Schütze, L. Levin, A. Zeldes, J. Nivre, W. Croft and N. Schneider.
UCxn: Typologically Informed Annotation of Constructions Atop Universal Dependencies.
LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation. Torino, Italy, May 20-25, 2024. URL
Abstract

The Universal Dependencies (UD) project has created an invaluable collection of treebanks with contributions in over 140 languages. However, the UD annotations do not tell the full story. Grammatical constructions that convey meaning through a particular combination of several morphosyntactic elements – for example, interrogative sentences with special markers and/or word orders – are not labeled holistically. We argue for (i) augmenting UD annotations with a ‘UCxn’ annotation layer for such meaning-bearing grammatical constructions, and (ii) approaching this in a typologically informed way so that morphosyntactic strategies can be compared across languages. As a case study, we consider five construction families in ten languages, identifying instances of each construction in UD treebanks through the use of morphosyntactic patterns. In addition to findings regarding these particular constructions, our study yields important insights on methodology for describing and identifying constructions in language-general and language-particular ways, and lays the foundation for future constructional enrichment of UD treebanks.

MCML Authors
Leonie Weissweiler

Dr.

* Former member

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[94]
M. Winkler, V. Juozapaityte, R. van der Goot and B. Plank.
Slot and Intent Detection Resources for Bavarian and Lithuanian: Assessing Translations vs Natural Queries to Digital Assistants.
LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation. Torino, Italy, May 20-25, 2024. URL
Abstract

Digital assistants perform well in high-resource languages like English, where tasks like slot and intent detection (SID) are well-supported. Many recent SID datasets start including multiple language varieties. However, it is unclear how realistic these translated datasets are. Therefore, we extend one such dataset, namely xSID-0.4, to include two underrepresented languages: Bavarian, a German dialect, and Lithuanian, a Baltic language. Both language variants have limited speaker populations and are often not included in multilingual projects. In addition to translations we provide “natural” queries to digital assistants generated by native speakers. We further include utterances from another dataset for Bavarian to build the richest SID dataset available today for a low-resource dialect without standard orthography. We then set out to evaluate models trained on English in a zero-shot scenario on our target language variants. Our evaluation reveals that translated data can produce overly optimistic scores. However, the error patterns in translated and natural datasets are highly similar. Cross-dataset experiments demonstrate that data collection methods influence performance, with scores lower than those achieved with single-dataset translations. This work contributes to enhancing SID datasets for underrepresented languages, yielding NaLiBaSID, a new evaluation dataset for Bavarian and Lithuanian.

MCML Authors
Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[93]
S. Zhou, L. Weissweiler, T. He, H. Schütze, D. R. Mortensen and L. Levin.
Constructions Are So Difficult That Even Large Language Models Get Them Right for the Wrong Reasons.
LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation. Torino, Italy, May 20-25, 2024. URL
Abstract

In this paper, we make a contribution that can be understood from two perspectives: from an NLP perspective, we introduce a small challenge dataset for NLI with large lexical overlap, which minimises the possibility of models discerning entailment solely based on token distinctions, and show that GPT-4 and Llama 2 fail it with strong bias. We then create further challenging sub-tasks in an effort to explain this failure. From a Computational Linguistics perspective, we identify a group of constructions with three classes of adjectives which cannot be distinguished by surface features. This enables us to probe for LLMs’ understanding of these constructions in various ways, and we find that they fail in a variety of ways to distinguish between them, suggesting that they don’t adequately represent their meaning or capture the lexical properties of phrasal heads.

MCML Authors
Link to website

Shijia Zhou

Artificial Intelligence and Computational Linguistics

Leonie Weissweiler

Dr.

* Former member

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[92]
P. Lin, S. Ji, J. Tiedemann, A. F. T. Martins and H. Schütze.
MaLA-500: Massive Language Adaptation of Large Language Models.
Preprint (Apr. 2024). arXiv GitHub
Abstract

Large language models (LLMs) have advanced the state of the art in natural language processing. However, their predominant design for English or a limited set of languages creates a substantial gap in their effectiveness for low-resource languages. To bridge this gap, we introduce MaLA-500, a novel large language model designed to cover an extensive range of 534 languages. To train MaLA-500, we employ vocabulary extension and continued pretraining on LLaMA 2 with Glot500-c. Our intrinsic evaluation demonstrates that MaLA-500 is better at predicting the given texts of low-resource languages than existing multilingual LLMs. Moreover, the extrinsic evaluation of in-context learning shows that MaLA-500 outperforms previous LLMs on SIB200 and Taxi1500 by a significant margin, i.e., 11.68% and 4.82% macro-average accuracy across languages.
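
For readers unfamiliar with the metric, macro-average accuracy across languages gives every language equal weight regardless of its test-set size. A minimal sketch with made-up counts (the language names and numbers below are hypothetical and unrelated to the reported results):

```python
def macro_average_accuracy(correct: dict, total: dict) -> float:
    """Mean of per-language accuracies: every language counts equally."""
    per_language = [correct[lang] / total[lang] for lang in total]
    return sum(per_language) / len(per_language)


def micro_average_accuracy(correct: dict, total: dict) -> float:
    """Accuracy over the pooled examples: large languages dominate."""
    return sum(correct.values()) / sum(total.values())


# Hypothetical counts: one large, well-handled language and one small, hard one.
correct = {"lang_a": 90, "lang_b": 5}
total = {"lang_a": 100, "lang_b": 10}

macro = macro_average_accuracy(correct, total)  # (0.9 + 0.5) / 2
micro = micro_average_accuracy(correct, total)  # 95 / 110
print(f"macro={macro:.3f} micro={micro:.3f}")
```

Macro-averaging is the natural choice when the aim is coverage of many low-resource languages, since a pooled score would mostly reflect the languages with the most test data.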

MCML Authors
Link to website

Peiqin Lin

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[91]
A. Modarressi, A. Köksal, A. Imani, M. Fayyaz and H. Schütze.
MemLLM: Finetuning LLMs to Use An Explicit Read-Write Memory.
Preprint (Apr. 2024). arXiv
Abstract

While current large language models (LLMs) demonstrate some capabilities in knowledge-intensive tasks, they are limited by relying on their parameters as an implicit storage mechanism. As a result, they struggle with infrequent knowledge and temporal degradation. In addition, the uninterpretable nature of parametric memorization makes it challenging to understand and prevent hallucination. Parametric memory pools and model editing are only partial solutions. Retrieval Augmented Generation (RAG) – though non-parametric – has its own limitations: it lacks structure, complicates interpretability and makes it hard to effectively manage stored knowledge. In this paper, we introduce MemLLM, a novel method of enhancing LLMs by integrating a structured and explicit read-and-write memory module. MemLLM tackles the aforementioned challenges by enabling dynamic interaction with the memory and improving the LLM’s capabilities in using stored knowledge. Our experiments indicate that MemLLM enhances the LLM’s performance and interpretability, in language modeling in general and knowledge-intensive tasks in particular. We see MemLLM as an important step towards making LLMs more grounded and factual through memory augmentation.

MCML Authors
Link to website

Ali Modarressi

Statistical NLP and Deep Learning

Link to website

Abdullatif Köksal

Statistical NLP and Deep Learning

Link to website

Ayyoob Imani

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[90]
A. Maronikolakis, A. Köksal and H. Schütze.
Sociocultural knowledge is needed for selection of shots in hate speech detection tasks.
LT-EDI 2024 - 4th Workshop on Language Technology for Equality, Diversity, Inclusion. St. Julian’s, Malta, Mar 21, 2024. URL
Abstract

We introduce HATELEXICON, a lexicon of slurs and targets of hate speech for Brazil, Germany, India and Kenya, to aid model development and interpretability. First, we demonstrate how HATELEXICON can be used to interpret model predictions, showing that models developed to classify extreme speech rely heavily on target group names. Further, we propose a culturally-informed method to aid shot selection for training in low-resource settings. In few-shot learning, shot selection is of paramount importance to model performance and we need to ensure we make the most of available data. We work with HASOC German and Hindi data for training and the Multilingual HateCheck (MHC) benchmark for evaluation. We show that selecting shots based on our lexicon leads to models performing better than models trained on shots sampled randomly. Thus, when given only a few training examples, using HATELEXICON to select shots containing more sociocultural information leads to better few-shot performance. With these two use-cases we show how our HATELEXICON can be used for more effective hate speech detection.

MCML Authors
Link to website

Antonis Maronikolakis

Statistical NLP and Deep Learning

Link to website

Abdullatif Köksal

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[89]
C. Gruber, K. Hechinger, M. Aßenmacher, G. Kauermann and B. Plank.
More Labels or Cases? Assessing Label Variation in Natural Language Inference.
UnImplicit 2024 - 3rd Workshop on Understanding Implicit and Underspecified Language. Malta, Mar 21, 2024. URL
Abstract

In this work, we analyze the uncertainty that is inherently present in the labels used for supervised machine learning in natural language inference (NLI). In cases where multiple annotations per instance are available, neither the majority vote nor the frequency of individual class votes is a trustworthy representation of the labeling uncertainty. We propose modeling the votes via a Bayesian mixture model to recover the data-generating process, i.e., the “true” latent classes, and thus gain insight into the class variations. This will enable a better understanding of the confusion happening during the annotation process. We also assess the stability of the proposed estimation procedure by systematically varying the numbers of i) instances and ii) labels. Thereby, we observe that few instances with many labels can predict the latent class borders reasonably well, while the estimation fails for many instances with only a few labels. This leads us to conclude that multiple labels are a crucial building block for properly analyzing label uncertainty.

MCML Authors
Link to website

Matthias Aßenmacher

Dr.

Statistical Learning & Data Science

Link to Profile Göran Kauermann

Göran Kauermann

Prof. Dr.

Applied Statistics in Social Sciences, Economics and Business

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[88]
S. Peng, Z. Sun, S. Loftus and B. Plank.
Different Tastes of Entities: Investigating Human Label Variation in Named Entity Annotations.
UnImplicit 2024 - 3rd Workshop on Understanding Implicit and Underspecified Language. Malta, Mar 21, 2024. URL
Abstract

Named Entity Recognition (NER) is a key information extraction task with a long-standing tradition. While recent studies address and aim to correct annotation errors via re-labeling efforts, little is known about the sources of human label variation, such as text ambiguity, annotation error, or guideline divergence. This is especially the case for high-quality datasets and beyond English CoNLL03. This paper studies disagreements in expert-annotated named entity datasets for three languages: English, Danish, and Bavarian. We show that text ambiguity and artificial guideline changes are dominant factors for diverse annotations among high-quality revisions. We survey student annotations on a subset of difficult entities and substantiate the feasibility and necessity of manifold annotations for understanding named entity ambiguities from a distributional perspective.

MCML Authors
Link to website

Siyao Peng

Dr.

Artificial Intelligence and Computational Linguistics

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[87]
E. Artemova, V. Blaschke and B. Plank.
Exploring the Robustness of Task-oriented Dialogue Systems for Colloquial German Varieties.
EACL 2024 - 18th Conference of the European Chapter of the Association for Computational Linguistics. St. Julian’s, Malta, Mar 17-22, 2024. URL
Abstract

Mainstream cross-lingual task-oriented dialogue (ToD) systems leverage the transfer learning paradigm by training a joint model for intent recognition and slot-filling in English and applying it, zero-shot, to other languages. We address a gap in prior research, which often overlooked the transfer to lower-resource colloquial varieties due to limited test data. Inspired by prior work on English varieties, we craft and manually evaluate perturbation rules that transform German sentences into colloquial forms and use them to synthesize test sets in four ToD datasets. Our perturbation rules cover 18 distinct language phenomena, enabling us to explore the impact of each perturbation on slot and intent performance. Using these new datasets, we conduct an experimental evaluation across six different transformers. Here, we demonstrate that when applied to colloquial varieties, ToD systems maintain their intent recognition performance, losing 6% (4.62 percentage points) in accuracy on average. However, they exhibit a significant drop in slot detection, with a decrease of 31% (21 percentage points) in slot F1 score. Our findings are further supported by a transfer experiment from Standard American English to synthetic Urban African American Vernacular English.
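
The abstract reports each drop both as a relative percentage and in percentage points. How the two relate can be sketched as below; the before and after scores are hypothetical and are not taken from the paper.

```python
def drop_in_points(before: float, after: float) -> float:
    """Absolute drop, in percentage points, between two scores given in percent."""
    return before - after


def relative_drop(before: float, after: float) -> float:
    """Drop expressed as a percentage of the original score."""
    return (before - after) / before * 100


# Hypothetical scores in percent, chosen only to make the arithmetic easy to follow.
before, after = 80.0, 60.0

points = drop_in_points(before, after)   # 20 percentage points
relative = relative_drop(before, after)  # 25% of the original score
print(f"{points:.1f} percentage points, {relative:.1f}% relative")
```

The same absolute drop therefore corresponds to a larger relative drop when the starting score is lower.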

MCML Authors
Link to website

Verena Blaschke

Artificial Intelligence and Computational Linguistics

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[86]
J. Baan, R. Fernández, B. Plank and W. Aziz.
Interpreting Predictive Probabilities: Model Confidence or Human Label Variation?
EACL 2024 - 18th Conference of the European Chapter of the Association for Computational Linguistics. St. Julians, Malta, Mar 17-22, 2024. URL
Abstract

With the rise of increasingly powerful and user-facing NLP systems, there is growing interest in assessing whether they have a good representation of uncertainty by evaluating the quality of their predictive distribution over outcomes. We identify two main perspectives that drive starkly different evaluation protocols. The first treats predictive probability as an indication of model confidence; the second as an indication of human label variation. We discuss their merits and limitations, and take the position that both are crucial for trustworthy and fair NLP systems, but that exploiting a single predictive distribution is limiting. We recommend tools and highlight exciting directions towards models with disentangled representations of uncertainty about predictions and uncertainty about human labels.

MCML Authors
Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[85]
B. Ma, E. Nie, S. Yuan, H. Schmid, M. Färber, F. Kreuter and H. Schütze.
ToPro: Token-Level Prompt Decomposition for Cross-Lingual Sequence Labeling Tasks.
EACL 2024 - 18th Conference of the European Chapter of the Association for Computational Linguistics. St. Julians, Malta, Mar 17-22, 2024. URL
Abstract

Prompt-based methods have been successfully applied to multilingual pretrained language models for zero-shot cross-lingual understanding. However, most previous studies primarily focused on sentence-level classification tasks, and only a few considered token-level labeling tasks such as Named Entity Recognition (NER) and Part-of-Speech (POS) tagging. In this paper, we propose Token-Level Prompt Decomposition (ToPro), which facilitates the prompt-based method for token-level sequence labeling tasks. The ToPro method decomposes an input sentence into single tokens and applies one prompt template to each token. Our experiments on multilingual NER and POS tagging datasets demonstrate that ToPro-based fine-tuning outperforms Vanilla fine-tuning and Prompt-Tuning in zero-shot cross-lingual transfer, especially for languages that are typologically different from the source language English. Our method also attains state-of-the-art performance when employed with the mT5 model. Besides, our exploratory study in multilingual large language models shows that ToPro performs much better than the current in-context learning method. Overall, the performance improvements show that ToPro could potentially serve as a novel and simple benchmarking method for sequence labeling tasks.
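
The decomposition step described above (one prompt per token, all sharing one template) can be sketched as follows. This is an illustration only: the template wording and the function name `decompose` are assumptions, not the paper's actual prompt or code.

```python
from typing import List

# Hypothetical template; the wording used in the paper may differ.
TEMPLATE = 'Sentence: {sentence} The tag of the word "{token}" is:'


def decompose(tokens: List[str], template: str = TEMPLATE) -> List[str]:
    """Return one prompt per token, each carrying the full sentence as context."""
    sentence = " ".join(tokens)
    return [template.format(sentence=sentence, token=tok) for tok in tokens]


prompts = decompose(["Plank", "works", "in", "Munich"])
for prompt in prompts:
    print(prompt)
```

Each prompt would then be passed to the language model separately, so a sentence of n tokens yields n label predictions.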

MCML Authors
Link to website

Bolei Ma

Social Data Science and AI Lab

Link to website

Ercong Nie

Statistical NLP and Deep Learning

Link to Profile Frauke Kreuter

Frauke Kreuter

Prof. Dr.

Social Data Science and AI Lab

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[84]
L. K. Senel, B. Ebing, K. Baghirova, H. Schütze and G. Glavaš.
Kardeş-NLU: Transfer to Low-Resource Languages with Big Brother’s Help – A Benchmark and Evaluation for Turkic Languages.
EACL 2024 - 18th Conference of the European Chapter of the Association for Computational Linguistics. St. Julians, Malta, Mar 17-22, 2024. URL
Abstract

Cross-lingual transfer (XLT) driven by massively multilingual language models (mmLMs) has been shown largely ineffective for low-resource (LR) target languages with little (or no) representation in mmLM’s pretraining, especially if they are linguistically distant from the high-resource (HR) source language. Much of the recent focus in XLT research has been dedicated to LR language families, i.e., families without any HR languages (e.g., families of African languages or indigenous languages of the Americas). In this work, in contrast, we investigate a configuration that is arguably of practical relevance for more of the world’s languages: XLT to LR languages that do have a close HR relative. To explore the extent to which a HR language can facilitate transfer to its LR relatives, we (1) introduce Kardeş-NLU, an evaluation benchmark with language understanding datasets in five LR Turkic languages: Azerbaijani, Kazakh, Kyrgyz, Uzbek, and Uyghur; and (2) investigate (a) intermediate training and (b) fine-tuning strategies that leverage Turkish in XLT to these target languages. Our experimental results show that integrating Turkish in intermediate training and integrating it in downstream fine-tuning both yield substantial improvements in XLT to LR Turkic languages. Finally, we benchmark cutting-edge instruction-tuned large language models on Kardeş-NLU, showing that their performance is highly task- and language-dependent.

MCML Authors
Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[83]
M. Zhang, R. van der Goot, M.-Y. Kan and B. Plank.
NNOSE: Nearest Neighbor Occupational Skill Extraction.
EACL 2024 - 18th Conference of the European Chapter of the Association for Computational Linguistics. St. Julians, Malta, Mar 17-22, 2024. URL
Abstract

The labor market is changing rapidly, prompting increased interest in the automatic extraction of occupational skills from text. With the advent of English benchmark job description datasets, there is a need for systems that handle their diversity well. We tackle the complexity of occupational skill dataset tasks: combining and leveraging multiple datasets for skill extraction, identifying rarely observed skills within a dataset, and overcoming the scarcity of skills across datasets. In particular, we investigate the retrieval-augmentation of language models, employing an external datastore for retrieving similar skills in a dataset-unifying manner. Our proposed method, Nearest Neighbor Occupational Skill Extraction (NNOSE), effectively leverages multiple datasets by retrieving neighboring skills from other datasets in the datastore. This improves skill extraction without additional fine-tuning. Crucially, we observe a performance gain in predicting infrequent patterns, with substantial gains of up to 30% span-F1 in cross-dataset settings.
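
The retrieval idea can be illustrated with a toy nearest-neighbor lookup over a datastore that pools entries from several datasets. This is a simplified sketch, not the NNOSE implementation: the embeddings, the tag names, and the plain majority vote are all assumptions, and the paper's way of combining retrieved neighbors with the language model's own prediction may differ.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Entry = Tuple[Sequence[float], str]  # (token embedding, skill tag)


def knn_tag(query: Sequence[float], datastore: List[Entry], k: int = 3) -> str:
    """Majority tag among the k datastore entries closest to the query."""
    ranked = sorted(datastore, key=lambda entry: math.dist(query, entry[0]))
    votes = Counter(tag for _, tag in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy datastore; in practice the embeddings would come from an encoder.
datastore = [
    ([0.90, 0.10], "B-SKILL"),
    ([0.80, 0.20], "B-SKILL"),
    ([0.85, 0.15], "B-SKILL"),
    ([0.10, 0.90], "O"),
    ([0.20, 0.80], "O"),
]
print(knn_tag([0.88, 0.12], datastore))
```

Because the lookup only reads from the datastore, new datasets can be added without retraining the underlying model, which matches the abstract's point about avoiding additional fine-tuning.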

MCML Authors
Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[82]
P. Lin, C. Hu, Z. Zhang, A. Martins and H. Schütze.
mPLM-Sim: Better Cross-Lingual Similarity and Transfer in Multilingual Pretrained Language Models.
EACL 2024 - Findings of the 18th Conference of the European Chapter of the Association for Computational Linguistics. St. Julians, Malta, Mar 17-22, 2024. URL
Abstract

Recent multilingual pretrained language models (mPLMs) have been shown to encode strong language-specific signals, which are not explicitly provided during pretraining. It remains an open question whether it is feasible to employ mPLMs to measure language similarity, and subsequently use the similarity results to select source languages for boosting cross-lingual transfer. To investigate this, we propose mPLM-Sim, a language similarity measure that induces the similarities across languages from mPLMs using multi-parallel corpora. Our study shows that mPLM-Sim exhibits moderately high correlations with linguistic similarity measures, such as lexicostatistics, genealogical language family, and geographical sprachbund. We also conduct a case study on languages with low correlation and observe that mPLM-Sim yields more accurate similarity results. Additionally, we find that similarity results vary across different mPLMs and different layers within an mPLM. We further investigate whether mPLM-Sim is effective for zero-shot cross-lingual transfer by conducting experiments on both low-level syntactic tasks and high-level semantic tasks. The experimental results demonstrate that mPLM-Sim is capable of selecting better source languages than linguistic measures, resulting in a 1%-2% improvement in zero-shot cross-lingual transfer performance.

MCML Authors
Link to website

Peiqin Lin

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[81]
M. Zhang, R. van der Goot and B. Plank.
Entity Linking in the Job Market Domain.
EACL 2024 - Findings of the 18th Conference of the European Chapter of the Association for Computational Linguistics. St. Julians, Malta, Mar 17-22, 2024. URL
Abstract

In Natural Language Processing, entity linking (EL) has centered around Wikipedia, yet remains underexplored for the job market domain. Disambiguating skill mentions can help us get insight into the current labor market demands. In this work, we are the first to explore EL in this domain, specifically targeting the linkage of occupational skills to the ESCO taxonomy (le Vrang et al., 2014). Previous efforts linked coarse-grained (full) sentences to a corresponding ESCO skill. In this work, we link more fine-grained span-level mentions of skills. We tune two high-performing neural EL models, a bi-encoder (BLINK; Wu et al., 2020) and an autoregressive model (GENRE; Cao et al., 2021), on a synthetically generated mention–skill pair dataset and evaluate them on a human-annotated skill-linking benchmark. Our findings reveal that both models are capable of linking implicit mentions of skills to their correct taxonomy counterparts. Empirically, BLINK outperforms GENRE in strict evaluation, but GENRE performs better in loose evaluation (accuracy@k).

MCML Authors
Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[80]
A. Sorensen, S. Peng, B. Plank and R. van der Goot.
EEVEE: An Easy Annotation Tool for Natural Language Processing.
LAW @EACL 2024 - 18th Linguistic Annotation Workshop at the 18th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2024). St. Julians, Malta, Mar 17-22, 2024. URL
Abstract

Annotation tools are the starting point for creating Natural Language Processing (NLP) datasets. There is a wide variety of tools available; setting up these tools is however a hindrance. We propose EEVEE, an annotation tool focused on simplicity, efficiency, and ease of use. It can run directly in the browser (no setup required) and uses tab-separated files (as opposed to character offsets or task-specific formats) for annotation. It allows for annotation of multiple tasks on a single dataset and supports four task-types: sequence labeling, span labeling, text classification and seq2seq.

MCML Authors
Link to website

Siyao Peng

Dr.

Artificial Intelligence and Computational Linguistics

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[79]
L. Weber-Genzel, R. Litschko, E. Artemova and B. Plank.
Donkii: Can Annotation Error Detection Methods Find Errors in Instruction-Tuning Datasets?
LAW @EACL 2024 - 18th Linguistic Annotation Workshop at the 18th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2024). St. Julians, Malta, Mar 17-22, 2024. URL
Abstract

Instruction tuning has become an integral part of training pipelines for Large Language Models (LLMs) and has been shown to yield strong performance gains. In an orthogonal line of research, Annotation Error Detection (AED) has emerged as a tool for detecting quality problems in gold standard labels. So far, however, the application of AED methods has been limited to classification tasks. It is an open question how well AED methods generalize to language generation settings, which are becoming more widespread via LLMs. In this paper, we present a first and novel benchmark for AED on instruction tuning data: DONKII. It comprises three instruction-tuning datasets enriched with error annotations by experts and semi-automatic methods. We also provide a novel taxonomy of error types for instruction-tuning data. We find that all three datasets contain clear errors, which sometimes propagate directly into instruction-tuned LLMs. We propose four AED baselines for the generative setting and evaluate them extensively on the newly introduced dataset. Our results show that the choice of the right AED method and model size is indeed crucial and derive practical recommendations for how to use AED methods to clean instruction-tuning data.

MCML Authors
Link to website

Leon Weber-Genzel

Dr.

* Former member

Link to website

Robert Litschko

Artificial Intelligence and Computational Linguistics

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[78]
L. Weissweiler, A. Köksal and H. Schütze.
Hybrid Human-LLM Corpus Construction and LLM Evaluation for Rare Linguistic Phenomena.
Preprint (Mar. 2024). arXiv
Abstract

Argument Structure Constructions (ASCs) are one of the most well-studied construction groups, providing a unique opportunity to demonstrate the usefulness of Construction Grammar (CxG). For example, the caused-motion construction (CMC, "She sneezed the foam off her cappuccino") demonstrates that constructions must carry meaning, otherwise the fact that "sneeze" in this context causes movement cannot be explained. We form the hypothesis that this remains challenging even for state-of-the-art Large Language Models (LLMs), for which we devise a test based on substituting the verb with a prototypical motion verb. To be able to perform this test at statistically significant scale, in the absence of adequate CxG corpora, we develop a novel pipeline of NLP-assisted collection of linguistically annotated text. We show how dependency parsing and GPT-3.5 can be used to significantly reduce annotation cost and thus enable the annotation of rare phenomena at scale. We then evaluate GPT, Gemini, Llama2 and Mistral models for their understanding of the CMC using the newly collected corpus. We find that all models struggle with understanding the motion component that the CMC adds to a sentence.

MCML Authors
Leonie Weissweiler

Dr.

* Former member

Link to website

Abdullatif Köksal

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[77]
F. Friedrich, K. Hämmerl, P. Schramowski, M. Brack, J. Libovicky, K. Kersting and A. Fraser.
Multilingual Text-to-Image Generation Magnifies Gender Stereotypes and Prompt Engineering May Not Help You.
Preprint (Feb. 2024). arXiv
Abstract

Text-to-image generation models have recently achieved astonishing results in image quality, flexibility, and text alignment, and are consequently employed in a fast-growing number of applications. Through improvements in multilingual abilities, a larger community now has access to this technology. However, our results show that multilingual models suffer from significant gender biases just as monolingual models do. Furthermore, the natural expectation that multilingual models will provide similar results across languages does not hold up. Instead, there are important differences between languages. We propose a novel benchmark, MAGBIG, intended to foster research on gender bias in multilingual models. We use MAGBIG to investigate the effect of multilingualism on gender bias in T2I models. To this end, we construct multilingual prompts requesting portraits of people with a certain occupation or trait. Our results show that not only do models exhibit strong gender biases but they also behave differently across languages. Furthermore, we investigate prompt engineering strategies, such as indirect, neutral formulations, to mitigate these biases. Unfortunately, these approaches have limited success and result in worse text-to-image alignment. Consequently, we call for more research into diverse representations across languages in image generators, as well as into steerability to address biased model behavior.

MCML Authors
Link to website at LMU

Katharina Hämmerl

Data Analytics & Statistics

Link to Profile Alexander Fraser

Alexander Fraser

Prof. Dr.

Data Analytics & Statistics


[76]
E. Nie, S. Yuan, B. Ma, H. Schmid, M. Färber, F. Kreuter and H. Schütze.
Decomposed Prompting: Unveiling Multilingual Linguistic Structure Knowledge in English-Centric Large Language Models.
Preprint (Feb. 2024). arXiv
Abstract

Despite the predominance of English in their training data, English-centric Large Language Models (LLMs) like GPT-3 and LLaMA display a remarkable ability to perform multilingual tasks, raising questions about the depth and nature of their cross-lingual capabilities. This paper introduces the decomposed prompting approach to probe the linguistic structure understanding of these LLMs in sequence labeling tasks. Diverging from the single text-to-text prompt, our method generates for each token of the input sentence an individual prompt which asks for its linguistic label. We assess our method on the Universal Dependencies part-of-speech tagging dataset for 38 languages, utilizing both English-centric and multilingual LLMs. Our findings show that decomposed prompting surpasses the iterative prompting baseline in efficacy and efficiency under zero- and few-shot settings. Further analysis reveals the influence of evaluation methods and the use of instructions in prompts. Our multilingual investigation shows that English-centric language models perform better on average than multilingual models. Our study offers insights into the multilingual transferability of English-centric LLMs, contributing to the understanding of their multilingual linguistic knowledge.

MCML Authors
Link to website

Ercong Nie

Statistical NLP and Deep Learning

Link to website

Bolei Ma

Social Data Science and AI Lab

Link to Profile Frauke Kreuter

Frauke Kreuter

Prof. Dr.

Social Data Science and AI Lab

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[75]
S. Zhang, P. Wicke, L. K. Senel, L. Figueredo, A. Naceri, S. Haddadin, B. Plank and H. Schütze.
LoHoRavens: A Long-Horizon Language-Conditioned Benchmark for Robotic Tabletop Manipulation.
NeurIPS 2023 - 6th Robot Learning Workshop: Pretraining, Fine-Tuning, and Generalization with Large Scale Models at the 37th Conference on Neural Information Processing Systems. New Orleans, LA, USA, Dec 10-16, 2023. URL
Abstract

The convergence of embodied agents and large language models (LLMs) has brought significant advancements to embodied instruction following. Particularly, the strong reasoning capabilities of LLMs make it possible for robots to perform long-horizon tasks without expensive annotated demonstrations. However, public benchmarks for testing the long-horizon reasoning capabilities of language-conditioned robots in various scenarios are still missing. To fill this gap, this work focuses on the tabletop manipulation task and releases a simulation benchmark, LoHoRavens, which covers various long-horizon reasoning aspects spanning color, size, space, arithmetics and reference. Furthermore, there is a key modality bridging problem for long-horizon manipulation tasks with LLMs: how to incorporate the observation feedback during robot execution for the LLM’s closed-loop planning, which is however less studied by prior work. We investigate two methods of bridging the modality gap: caption generation and learnable interface for incorporating explicit and implicit observation feedback to the LLM, respectively. These methods serve as the two baselines for our proposed benchmark. Experiments show that both methods struggle to solve most tasks, indicating long-horizon manipulation tasks are still challenging for current popular models. We expect the proposed public benchmark and baselines can help the community develop better models for long-horizon tabletop manipulation tasks.

MCML Authors
Link to website

Shengqiang Zhang

Statistical NLP and Deep Learning

Link to website

Philipp Wicke

Dr.

Statistical NLP and Deep Learning

Link to website

Lütfi Kerem Şenel

Statistical NLP and Deep Learning

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[74]
M. Di Marco, K. Hämmerl and A. Fraser.
A Study on Accessing Linguistic Information in Pre-Trained Language Models by Using Prompts.
EMNLP 2023 - Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023. DOI
Abstract

We study whether linguistic information in pre-trained multilingual language models can be accessed by human language: So far, there is no easy method to directly obtain linguistic information and gain insights into the linguistic principles encoded in such models. We use the technique of prompting and formulate linguistic tasks to test the LM’s access to explicit grammatical principles and study how effective this method is at providing access to linguistic features. Our experiments on German, Icelandic and Spanish show that some linguistic properties can in fact be accessed through prompting, whereas others are harder to capture.

MCML Authors
Link to website at LMU

Katharina Hämmerl

Data Analytics & Statistics

Link to Profile Alexander Fraser

Alexander Fraser

Prof. Dr.

Data Analytics & Statistics


[73]
M. Giulianelli, J. Baan, W. Aziz, R. Fernández and B. Plank.
What Comes Next? Evaluating Uncertainty in Neural Text Generators Against Human Production Variability.
EMNLP 2023 - Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023. DOI
Abstract

In Natural Language Generation (NLG) tasks, for any input, multiple communicative goals are plausible, and any goal can be put into words, or produced, in multiple ways. We characterise the extent to which human production varies lexically, syntactically, and semantically across four NLG tasks, connecting human production variability to aleatoric or data uncertainty. We then inspect the space of output strings shaped by a generation system’s predicted probability distribution and decoding algorithm to probe its uncertainty. For each test input, we measure the generator’s calibration to human production variability. Following this instance-level approach, we analyse NLG models and decoding strategies, demonstrating that probing a generator with multiple samples and, when possible, multiple references, provides the level of detail necessary to gain understanding of a model’s representation of uncertainty.

MCML Authors
Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[72]
N. Kassner, O. Tafjord, A. Sabharwal, K. Richardson, H. Schütze and P. Clark.
Language Models with Rationality.
EMNLP 2023 - Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023. DOI
Abstract

While large language models (LLMs) are proficient at question-answering (QA), it is not always clear how (or even if) an answer follows from their latent ‘beliefs’. This lack of interpretability is a growing impediment to widespread use of LLMs. To address this, our goals are to make model beliefs and their inferential relationships explicit, and to resolve inconsistencies that may exist, so that answers are supported by interpretable chains of reasoning drawn from a consistent network of beliefs. Our approach, which we call REFLEX, is to add a rational, self-reflecting layer on top of the LLM. First, given a question, we construct a belief graph using a backward-chaining process to materialize relevant model beliefs (including beliefs about answer candidates) and their inferential relationships. Second, we identify and minimize contradictions in that graph using a formal constraint reasoner. We find that REFLEX significantly improves consistency (by 8%-11% absolute) without harming overall answer accuracy, resulting in answers supported by faithful chains of reasoning drawn from a more consistent belief system. This suggests a new style of system architecture in which an LLM extended with a rational layer can provide an interpretable window into system beliefs, add a systematic reasoning capability, and repair latent inconsistencies present in the LLM.

MCML Authors
Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[71]
R. Litschko, M. Müller-Eberstein, R. van der Goot, L. Weber-Genzel and B. Plank.
Establishing Trustworthiness: Rethinking Tasks and Model Evaluation.
EMNLP 2023 - Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023. DOI
Abstract

Language understanding is a multi-faceted cognitive capability, which the Natural Language Processing (NLP) community has striven to model computationally for decades. Traditionally, facets of linguistic intelligence have been compartmentalized into tasks with specialized model architectures and corresponding evaluation protocols. With the advent of large language models (LLMs) the community has witnessed a dramatic shift towards general purpose, task-agnostic approaches powered by generative models. As a consequence, the traditional compartmentalized notion of language tasks is breaking down, followed by an increasing challenge for evaluation and analysis. At the same time, LLMs are being deployed in more real-world scenarios, including previously unforeseen zero-shot setups, increasing the need for trustworthy and reliable systems. Therefore, we argue that it is time to rethink what constitutes tasks and model evaluation in NLP, and pursue a more holistic view on language, placing trustworthiness at the center. Towards this goal, we review existing compartmentalized approaches for understanding the origins of a model’s functional capacity, and provide recommendations for more multi-faceted evaluation protocols.

MCML Authors
Link to website

Robert Litschko

Artificial Intelligence and Computational Linguistics

Link to website

Leon Weber-Genzel

Dr.

* Former member

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[70]
M. Wang, H. Adel, L. Lange, J. Strötgen and H. Schütze.
GradSim: Gradient-Based Language Grouping for Effective Multilingual Training.
EMNLP 2023 - Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023. DOI
Abstract

Most languages of the world pose low-resource challenges to natural language processing models. With multilingual training, knowledge can be shared among languages. However, not all languages positively influence each other and it is an open research question how to select the most suitable set of languages for multilingual training and avoid negative interference among languages whose characteristics or data distributions are not compatible. In this paper, we propose GradSim, a language grouping method based on gradient similarity. Our experiments on three diverse multilingual benchmark datasets show that it leads to the largest performance gains compared to other similarity measures and it is better correlated with cross-lingual model performance. As a result, we set the new state of the art on AfriSenti, a benchmark dataset for sentiment analysis on low-resource African languages. In our extensive analysis, we further reveal that besides linguistic features, the topics of the datasets play an important role for language grouping and that lower layers of transformer models encode language-specific features while higher layers capture task-specific information.
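
As a rough illustration of grouping by gradient similarity, the sketch below compares per-language gradient vectors with cosine similarity and groups them greedily. It is not the GradSim algorithm itself: the greedy seed-based grouping, the threshold, the language names, and the toy gradient values are all assumptions made for this example.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def group_by_gradient(grads: Dict[str, Sequence[float]], threshold: float) -> List[List[str]]:
    """Greedy grouping: a language joins the first group whose seed language
    has gradient cosine similarity >= threshold, else it starts a new group."""
    groups: List[List[str]] = []
    for lang, grad in grads.items():
        for group in groups:
            if cosine(grad, grads[group[0]]) >= threshold:
                group.append(lang)
                break
        else:
            groups.append([lang])
    return groups


# Toy gradient vectors; real ones would be flattened model gradients.
grads = {
    "lang_a": [1.0, 0.2, 0.0],
    "lang_b": [0.9, 0.3, 0.1],
    "lang_c": [-0.5, 1.0, 0.4],
}
print(group_by_gradient(grads, threshold=0.8))
```

Languages whose gradients point in similar directions are expected to benefit from joint training, while dissimilar gradients hint at the negative interference the abstract mentions.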

MCML Authors
Link to website

Mingyang Wang

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[69]
X. Wang and B. Plank.
ACTOR: Active Learning with Annotator-specific Classification Heads to Embrace Human Label Variation.
EMNLP 2023 - Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023. DOI
Abstract

Label aggregation such as majority voting is commonly used to resolve annotator disagreement in dataset creation. However, this may disregard minority values and opinions. Recent studies indicate that learning from individual annotations outperforms learning from aggregated labels, though they require a considerable amount of annotation. Active learning, as an annotation cost-saving strategy, has not been fully explored in the context of learning from disagreement. We show that in the active learning setting, a multi-head model performs significantly better than a single-head model in terms of uncertainty estimation. By designing and evaluating acquisition functions with annotator-specific heads on two datasets, we show that group-level entropy works generally well on both datasets. Importantly, it achieves performance in terms of both prediction and uncertainty estimation comparable to full-scale training from disagreement, while saving 70% of the annotation budget.
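
One way to picture an acquisition function over annotator-specific heads is sketched below. It assumes that group-level entropy means the entropy of the heads' averaged class distribution; the paper's exact definition may differ, and the function names and toy probabilities are hypothetical.

```python
import math
from typing import List, Sequence


def group_entropy(head_probs: Sequence[Sequence[float]]) -> float:
    """Entropy of the mean class distribution over annotator-specific heads."""
    n_heads = len(head_probs)
    n_classes = len(head_probs[0])
    mean = [sum(h[c] for h in head_probs) / n_heads for c in range(n_classes)]
    return -sum(p * math.log(p) for p in mean if p > 0)


def select_for_annotation(pool: List[Sequence[Sequence[float]]], budget: int) -> List[int]:
    """Indices of the `budget` pool items with the highest group-level entropy."""
    order = sorted(range(len(pool)), key=lambda i: group_entropy(pool[i]), reverse=True)
    return order[:budget]


# Each pool item holds one predicted distribution per annotator head.
pool = [
    [[0.90, 0.10], [0.95, 0.05]],  # heads agree and are confident
    [[0.90, 0.10], [0.10, 0.90]],  # heads disagree
    [[0.60, 0.40], [0.70, 0.30]],  # heads agree but are unsure
]
print(select_for_annotation(pool, budget=1))
```

In this toy pool the item on which the heads disagree is selected first, so annotation effort goes to examples where annotators are most likely to differ.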

MCML Authors
Link to website

Xinpeng Wang

Artificial Intelligence and Computational Linguistics

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[68]
L. Weissweiler, V. Hofmann, A. Kantharuban, A. Cai, R. Dutt, A. Hengle, A. Kabra, A. Kulkarni, A. Vijayakumar, H. Yu, H. Schütze, K. Oflazer and D. Mortensen.
Counting the Bugs in ChatGPT's Wugs: A Multilingual Investigation into the Morphological Capabilities of a Large Language Model.
EMNLP 2023 - Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023. DOI
Abstract

Large language models (LLMs) have recently reached an impressive level of linguistic capability, prompting comparisons with human language skills. However, there have been relatively few systematic inquiries into the linguistic capabilities of the latest generation of LLMs, and those studies that do exist (i) ignore the remarkable ability of humans to generalize, (ii) focus only on English, and (iii) investigate syntax or semantics and overlook other capabilities that lie at the heart of human language, like morphology. Here, we close these gaps by conducting the first rigorous analysis of the morphological capabilities of ChatGPT in four typologically varied languages (specifically, English, German, Tamil, and Turkish). We apply a version of Berko’s (1958) wug test to ChatGPT, using novel, uncontaminated datasets for the four examined languages. We find that ChatGPT massively underperforms purpose-built systems, particularly in English. Overall, our results—through the lens of morphology—cast a new light on the linguistic capabilities of ChatGPT, suggesting that claims of human-like language skills are premature and misleading.

MCML Authors
Leonie Weissweiler

Dr.

* Former member

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[67]
S. Xu, S. T.y.s.s, O. Ichim, I. Risini, B. Plank and M. Grabmair.
From Dissonance to Insights: Dissecting Disagreements in Rationale Construction for Case Outcome Classification.
EMNLP 2023 - Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023. DOI
Abstract

In legal NLP, Case Outcome Classification (COC) must not only be accurate but also trustworthy and explainable. Existing work in explainable COC has been limited to annotations by a single expert. However, it is well-known that lawyers may disagree in their assessment of case facts. We hence collect a novel dataset RaVE: Rationale Variation in ECHR, which is obtained from two experts in the domain of international human rights law, for whom we observe weak agreement. We study their disagreements and build a two-level task-independent taxonomy, supplemented with COC-specific subcategories. To our knowledge, this is the first work in legal NLP that focuses on human label variation. We quantitatively assess different taxonomy categories and find that disagreements mainly stem from underspecification of the legal context, which poses challenges given the typically limited granularity and noise in COC metadata. We further assess the explainability of state-of-the-art COC models on RaVE and observe limited agreement between models and experts. Overall, our case study reveals hitherto underappreciated complexities in creating benchmark datasets in legal NLP that revolve around identifying aspects of a case’s facts supposedly relevant for its outcome.

MCML Authors
Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[66]
A. H. Kargaran, A. Imani, F. Yvon and H. Schütze.
GlotLID: Language Identification for Low-Resource Languages.
EMNLP 2023 - Findings of the Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023. DOI GitHub
Abstract

Several recent papers have published good solutions for language identification (LID) for about 300 high-resource and medium-resource languages. However, there is no LID available that (i) covers a wide range of low-resource languages, (ii) is rigorously evaluated and reliable, and (iii) is efficient and easy to use. Here, we publish GlotLID-M, an LID model that satisfies the desiderata of wide coverage, reliability and efficiency. It identifies 1665 languages, a large increase in coverage compared to prior work. In our experiments, GlotLID-M outperforms four baselines (CLD3, FT176, OpenLID and NLLB) when balancing F1 and false positive rate (FPR). We analyze the unique challenges that low-resource LID poses: incorrect corpus metadata, leakage from high-resource languages, difficulty separating closely related languages, handling of macrolanguage vs varieties and, in general, noisy data. We hope that integrating GlotLID-M into dataset creation pipelines will improve quality and enhance accessibility of NLP technology for low-resource languages and cultures.

MCML Authors
Link to website

Amir Hossein Kargaran

Statistical NLP and Deep Learning

Link to website

Ayyoob Imani

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[65]
A. Köksal, T. Schick and H. Schütze.
MEAL: Stable and Active Learning for Few-Shot Prompting.
EMNLP 2023 - Findings of the Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023. DOI GitHub
Abstract

Few-shot classification has made great strides due to foundation models that, through priming and prompting, are highly effective few-shot learners. However, this approach has high variance both across different sets of few shots (data selection) and across different finetuning runs (run variability). This is problematic not only because it impedes the fair comparison of different approaches, but especially because it makes few-shot learning too unreliable for many real-world applications. To alleviate these issues, we make two contributions for more stable and effective few-shot learning: First, we propose novel ensembling methods and show that they substantially reduce run variability. Second, we introduce a new active learning (AL) criterion for data selection and present the first AL-based approach specifically tailored towards prompt-based learning. In our experiments, we show that our combined method, MEAL (Multiprompt finetuning and prediction Ensembling with Active Learning), improves overall performance of prompt-based finetuning by 2.3 points on five diverse tasks.
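The prediction-ensembling idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes each finetuning run (or prompt) yields a class-probability vector for the same input and simply averages them; the function name and the numbers are hypothetical, and the paper's actual ensembling methods may differ.

```python
def ensemble_predict(run_probs):
    """Average class probabilities over several runs and return (label, mean_probs)."""
    n_runs = len(run_probs)
    n_classes = len(run_probs[0])
    # Mean probability per class across all runs
    mean = [sum(run[c] for run in run_probs) / n_runs for c in range(n_classes)]
    # Predicted label is the class with the highest averaged probability
    return mean.index(max(mean)), mean


# Hypothetical outputs of three runs for one input (two classes)
run_probs = [
    [0.60, 0.40],
    [0.45, 0.55],  # this run alone would predict class 1
    [0.70, 0.30],
]
label, mean = ensemble_predict(run_probs)
print(label, mean)
```

Averaging smooths out the disagreement of a single run, which is the intuition behind reducing run variability.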

MCML Authors
Link to website

Abdullatif Köksal

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[64]
A. Köksal, O. Yalcin, A. Akbiyik, M. T. Kilavuz, A. Korhonen and H. Schütze.
Language-Agnostic Bias Detection in Language Models with Bias Probing.
EMNLP 2023 - Findings of the Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023. DOI GitHub
Abstract

Pretrained language models (PLMs) are key components in NLP, but they contain strong social biases. Quantifying these biases is challenging because current methods focusing on fill-the-mask objectives are sensitive to slight changes in input. To address this, we propose a bias probing technique called LABDet, for evaluating social bias in PLMs with a robust and language-agnostic method. For nationality as a case study, we show that LABDet “surfaces” nationality bias by training a classifier on top of a frozen PLM on non-nationality sentiment detection. We find consistent patterns of nationality bias across monolingual PLMs in six languages that align with historical and political context. We also show for English BERT that bias surfaced by LABDet correlates well with bias in the pretraining data; thus, our work is one of the few studies that directly links pretraining data to PLM behavior. Finally, we verify LABDet’s reliability and applicability to different templates and languages through an extensive set of robustness checks.

MCML Authors
Link to website

Abdullatif Köksal

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[63]
W. Lai, A. Chronopoulou and A. Fraser.
Mitigating Data Imbalance and Representation Degeneration in Multilingual Machine Translation.
EMNLP 2023 - Findings of the Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023. DOI
Abstract

Despite advances in multilingual neural machine translation (MNMT), we argue that there are still two major challenges in this area: data imbalance and representation degeneration. The data imbalance problem refers to the imbalance in the amount of parallel corpora for all language pairs, especially for long-tail languages (i.e., very low-resource languages). The representation degeneration problem refers to the problem of encoded tokens tending to appear only in a small subspace of the full space available to the MNMT model. To solve these two issues, we propose Bi-ACL, a framework which only requires target-side monolingual data and a bilingual dictionary to improve the performance of the MNMT model. We define two modules, named bidirectional autoencoder and bidirectional contrastive learning, which we combine with an online constrained beam search and a curriculum learning sampling strategy. Extensive experiments show that our proposed method is more effective than strong baselines both in long-tail languages and in high-resource languages. We also demonstrate that our approach is capable of transferring knowledge between domains and languages in zero-shot scenarios.

MCML Authors
Link to website

Alexandra Chronopoulou

Dr.

* Former member

Link to Profile Alexander Fraser

Alexander Fraser

Prof. Dr.

Data Analytics & Statistics


[62]
Y. Liu, H. Ye, L. Weissweiler, R. Pei and H. Schütze.
Crosslingual Transfer Learning for Low-Resource Languages Based on Multilingual Colexification Graphs.
EMNLP 2023 - Findings of the Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023. DOI
Abstract

In comparative linguistics, colexification refers to the phenomenon of a lexical form conveying two or more distinct meanings. Existing work on colexification patterns relies on annotated word lists, limiting scalability and usefulness in NLP. In contrast, we identify colexification patterns of more than 2,000 concepts across 1,335 languages directly from an unannotated parallel corpus. We then propose simple and effective methods to build multilingual graphs from the colexification patterns: ColexNet and ColexNet+. ColexNet’s nodes are concepts and its edges are colexifications. In ColexNet+, concept nodes are additionally linked through intermediate nodes, each representing an ngram in one of 1,334 languages. We use the ColexNet+ graph to train ColexNet+ embeddings, high-quality multilingual embeddings that are well-suited for transfer learning. In our experiments, we first show that ColexNet achieves high recall on CLICS, a dataset of crosslingual colexifications. We then evaluate the ColexNet+ embeddings on roundtrip translation, sentence retrieval and sentence classification and show that our embeddings surpass several transfer learning baselines. This demonstrates the benefits of using colexification as a source of information in multilingual NLP.
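As a rough illustration of a graph whose nodes are concepts and whose edges are colexifications, the sketch below builds weighted edges from a toy lexicon. It assumes the form-to-concept mappings have already been extracted (the paper derives them from a parallel corpus, which is not shown here); the language names, forms, and the function name are hypothetical placeholders.

```python
from collections import Counter
from itertools import combinations


def build_colex_graph(lexicons):
    """Return edge weights: number of languages' forms that colexify each concept pair."""
    edges = Counter()
    for form_to_concepts in lexicons.values():
        for concepts in form_to_concepts.values():
            # A form expressing two or more concepts colexifies every pair of them
            for a, b in combinations(sorted(concepts), 2):
                edges[(a, b)] += 1
    return edges


# Hypothetical toy data: language -> lexical form -> set of concepts it expresses
lexicons = {
    "lang_1": {"form_a": {"belly", "womb"}, "form_b": {"bird"}},
    "lang_2": {"form_c": {"belly", "womb"}, "form_d": {"tree", "wood"}},
    "lang_3": {"form_e": {"belly"}, "form_f": {"womb"}},
}
graph = build_colex_graph(lexicons)
print(dict(graph))
```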

MCML Authors
Link to website

Yihong Liu

Statistical NLP and Deep Learning

Link to website

Haotian Ye

Statistical NLP and Deep Learning

Leonie Weissweiler

Dr.

* Former member

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[61]
M. Müller-Eberstein, R. van der Goot, B. Plank and I. Titov.
Subspace Chronicles: How Linguistic Information Emerges, Shifts and Interacts during Language Model Training.
EMNLP 2023 - Findings of the Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023. DOI
Abstract

Representational spaces learned via language modeling are fundamental to Natural Language Processing (NLP); however, there has been limited understanding regarding how and when during training various types of linguistic information emerge and interact. Leveraging a novel information theoretic probing suite, which enables direct comparisons of not just task performance, but their representational subspaces, we analyze nine tasks covering syntax, semantics and reasoning, across 2M pre-training steps and five seeds. We identify critical learning phases across tasks and time, during which subspaces emerge, share information, and later disentangle to specialize. Across these phases, syntactic knowledge is acquired rapidly after 0.5% of full training. Continued performance improvements primarily stem from the acquisition of open-domain knowledge, while semantics and reasoning tasks benefit from later boosts to long-range contextualization and higher specialization. Measuring cross-task similarity further reveals that linguistically related tasks share information throughout training, and do so more during the critical phase of learning than before or after. Our findings have implications for model interpretability, multi-task learning, and learning from limited data.

MCML Authors
Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[60]
E. Nie, H. Schmid and H. Schütze.
Unleashing the Multilingual Encoder Potential: Boosting Zero-Shot Performance via Probability Calibration.
EMNLP 2023 - Findings of the Conference on Empirical Methods in Natural Language Processing. Singapore, Dec 06-10, 2023. DOI
Abstract

Pretrained multilingual encoder models can directly perform zero-shot multilingual tasks or linguistic probing by reformulating the input examples into cloze-style prompts. This is accomplished by predicting the probabilities of the label words at the masked token position, without requiring any updates to the model parameters. However, the performance of this method is limited by the model’s bias toward predicting label words which frequently occurred during the pretraining. These words typically receive high probabilities. To address this issue, we combine the models with calibration techniques which modify the probabilities of label words predicted by the models. We first validate the effectiveness of a proposed simple calibration method together with other existing techniques on monolingual encoders in both zero- and few-shot scenarios. We subsequently employ these calibration techniques on multilingual encoders, resulting in substantial performance improvements across a wide range of tasks.
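To make the idea of calibrating label-word probabilities concrete, here is a minimal sketch of one simple form of calibration: dividing each label word's predicted probability by a prior estimate and renormalizing. This is an illustrative assumption, not necessarily the method proposed in the paper; the label words, the probabilities, and the way the priors are obtained are all hypothetical.

```python
def calibrate(label_probs, prior_probs):
    """Divide each label word's probability by its prior, then renormalize to sum to 1."""
    scores = {w: label_probs[w] / prior_probs[w] for w in label_probs}
    total = sum(scores.values())
    return {w: s / total for w, s in scores.items()}


# Hypothetical masked-token probabilities of two label words for one input
label_probs = {"good": 0.30, "terrible": 0.20}
# Hypothetical priors, e.g. estimated from prompts without any input text;
# "good" is assumed to be the more frequent word and hence favoured by the model
prior_probs = {"good": 0.60, "terrible": 0.10}

calibrated = calibrate(label_probs, prior_probs)
prediction = max(calibrated, key=calibrated.get)
print(calibrated, prediction)
```

In this toy case the uncalibrated prediction would be "good", while the calibrated scores favour "terrible" because the input raised its probability well above its prior.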

MCML Authors
Link to website

Ercong Nie

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[59]
V. Hangya, S. Severini, R. Ralev, A. Fraser and H. Schütze.
Multilingual Word Embeddings for Low-Resource Languages using Anchors and a Chain of Related Languages.
MRL @EMNLP 2023 - 3rd Workshop on Multi-lingual Representation Learning at the Conference on Empirical Methods in Natural Language Processing (EMNLP 2023). Singapore, Dec 06-10, 2023. DOI
Abstract

Very low-resource languages, having only a few million tokens worth of data, are not well-supported by multilingual NLP approaches due to poor quality cross-lingual word representations. Recent work showed that good crosslingual performance can be achieved if a source language is related to the low-resource target language. However, not all language pairs are related. In this paper, we propose to build multilingual word embeddings (MWEs) via a novel language chain-based approach, that incorporates intermediate related languages to bridge the gap between the distant source and target. We build MWEs one language at a time by starting from the resource rich source and sequentially adding each language in the chain till we reach the target. We extend a semi-joint bilingual approach to multiple languages in order to eliminate the main weakness of previous works, i.e., independently trained monolingual embeddings, by anchoring the target language around the multilingual space. We evaluate our method on bilingual lexicon induction for 4 language families, involving 4 very low-resource (≤ 5M tokens) and 4 moderately low-resource (≤ 50M) target languages, showing improved performance in both categories. Additionally, our analysis reveals the importance of good quality embeddings for intermediate languages as well as the importance of leveraging anchor points from all languages in the multilingual space.

MCML Authors
Link to Profile Alexander Fraser

Alexander Fraser

Prof. Dr.

Data Analytics & Statistics

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[58]
W. Lai, V. Hangya and A. Fraser.
Extending Multilingual Machine Translation through Imitation Learning.
Preprint (Nov. 2023). arXiv
Abstract

Despite the growing variety of languages supported by existing multilingual neural machine translation (MNMT) models, most of the world’s languages are still being left behind. We aim to extend large-scale MNMT models to a new language, allowing for translation between the newly added and all of the already supported languages in a challenging scenario: using only a parallel corpus between the new language and English. Previous approaches, such as continued training on parallel data including the new language, suffer from catastrophic forgetting (i.e., performance on other languages is reduced). Our novel approach Imit-MNMT treats the task as an imitation learning process, which mimics the behavior of an expert, a technique widely used in the computer vision area, but not well explored in NLP. More specifically, we construct a pseudo multi-parallel corpus of the new and the original languages by pivoting through English, and imitate the output distribution of the original MNMT model. Extensive experiments show that our approach significantly improves the translation performance between the new and the original languages, without severe catastrophic forgetting. We also demonstrate that our approach is capable of solving copy and off-target problems, which are two common issues in current large-scale MNMT models.

MCML Authors
Link to Profile Alexander Fraser

Alexander Fraser

Prof. Dr.

Data Analytics & Statistics


[57]
L. Weissweiler, V. Hofmann, A. Köksal and H. Schütze.
Explaining pretrained language models' understanding of linguistic structures using construction grammar.
Frontiers in Artificial Intelligence 6 (Oct. 2023). DOI
Abstract

Construction Grammar (CxG) is a paradigm from cognitive linguistics emphasizing the connection between syntax and semantics. Rather than rules that operate on lexical items, it posits constructions as the central building blocks of language, i.e., linguistic units of different granularity that combine syntax and semantics. As a first step toward assessing the compatibility of CxG with the syntactic and semantic knowledge demonstrated by state-of-the-art pretrained language models (PLMs), we present an investigation of their capability to classify and understand one of the most commonly studied constructions, the English comparative correlative (CC). We conduct experiments examining the classification accuracy of a syntactic probe on the one hand and the models’ behavior in a semantic application task on the other, with BERT, RoBERTa, and DeBERTa as the example PLMs. Our results show that all three investigated PLMs, as well as OPT, are able to recognize the structure of the CC but fail to use its meaning. While human-like performance of PLMs on many NLP tasks has been alleged, this indicates that PLMs still suffer from substantial shortcomings in central domains of linguistic knowledge.

MCML Authors
Leonie Weissweiler

Dr.

* Former member

Link to website

Abdullatif Köksal

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[56]
B. Ma, E. Nie, H. Schmid and H. Schütze.
Is Prompt-Based Finetuning Always Better than Vanilla Finetuning? Insights from Cross-Lingual Language Understanding.
KONVENS 2023 - 19th Conference on Natural Language Processing. Ingolstadt, Germany, Sep 18-22, 2023. URL
Abstract

Multilingual pretrained language models (MPLMs) have demonstrated substantial performance improvements in zero-shot cross-lingual transfer across various natural language understanding tasks by finetuning MPLMs on task-specific labelled data of a source language (e.g. English) and evaluating on a wide range of target languages. Recent studies show that prompt-based finetuning surpasses regular finetuning in few-shot scenarios. However, the exploration of prompt-based learning in multilingual tasks remains limited. In this study, we propose the PROFIT pipeline to investigate the cross-lingual capabilities of Prompt-based Finetuning. We conduct comprehensive experiments on diverse cross-lingual language understanding tasks (sentiment classification, paraphrase identification, and natural language inference) and empirically analyze the variation trends of prompt-based finetuning performance in cross-lingual transfer across different few-shot and full-data settings. Our results reveal the effectiveness and versatility of prompt-based finetuning in cross-lingual language understanding. Our findings indicate that prompt-based finetuning outperforms vanilla finetuning in full-data scenarios and exhibits greater advantages in few-shot scenarios, with different performance patterns dependent on task types. Additionally, we analyze underlying factors such as language similarity and pretraining data size that impact the cross-lingual performance of prompt-based finetuning. Overall, our work provides valuable insights into the cross-lingual prowess of prompt-based finetuning.

MCML Authors
Link to website

Bolei Ma

Social Data Science and AI Lab

Link to website

Ercong Nie

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[55]
A. Maronikolakis, P. O’Grady, H. Schütze and M. Lyra.
Improving Few-Shot Learning with Multilingual Transfer and Monte Carlo Training Set Selection.
LSD 2023 - CLASP Conference on Learning with Small Data. Gothenburg, Sweden, Sep 11-12, 2023. URL
Abstract

In industry settings, machine learning is an attractive tool to automatize processes. Unfortunately, annotated and high-quality data is expensive to source. This problem is exacerbated in settings spanning multiple markets and languages. Thus, developing solutions for multilingual tasks with little available data is challenging. Few-shot learning is a compelling approach when building solutions in multilingual and low-resource settings, since the method not only requires just a few training examples to achieve high performance, but is also a technique agnostic to language. Even though the technique can be applied to multilingual settings, optimizing performance is an open question. In our work we show that leveraging higher-resource, task-specific language data can boost overall performance and we propose a method to select training examples per their average performance in a Monte Carlo simulation, resulting in a training set more conducive to learning. We demonstrate the effectiveness of our methods in fashion text reviews moderation, classifying reviews as related or unrelated to the given product. We show that our methodology boosts performance in multilingual (English, French, German) settings, increasing F1 score and significantly decreasing false positives.
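The selection of training examples by their average performance in a Monte Carlo simulation can be sketched as follows. This is a simplified illustration under stated assumptions: `score_fn` stands in for training and evaluating a model on a sampled subset (here replaced by a toy function), and all names and parameters are hypothetical rather than taken from the paper.

```python
import random
from collections import defaultdict


def monte_carlo_select(pool, score_fn, subset_size, n_trials, k, seed=0):
    """Pick the k examples whose sampled subsets achieved the best average score."""
    rng = random.Random(seed)
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _ in range(n_trials):
        subset = rng.sample(pool, subset_size)  # random candidate training set
        score = score_fn(subset)                # stand-in for train + evaluate
        for example in subset:
            totals[example] += score
            counts[example] += 1
    averages = {ex: totals[ex] / counts[ex] for ex in counts}
    return sorted(averages, key=averages.get, reverse=True)[:k]


# Toy pool: "clean" examples help, "noisy" ones do not (assumption for illustration)
pool = ["clean_1", "clean_2", "clean_3", "noisy_1", "noisy_2", "noisy_3"]


def toy_score(subset):
    # Hypothetical proxy for validation performance: share of clean examples
    return sum(ex.startswith("clean") for ex in subset) / len(subset)


selected = monte_carlo_select(pool, toy_score, subset_size=3, n_trials=200, k=3)
print(selected)
```

In practice the expensive part is `score_fn`, so the number of trials trades off selection quality against compute.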

MCML Authors
Link to website

Antonis Maronikolakis

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[54]
E. Nie, H. Schmid and H. Schütze.
Cross-Lingual Constituency Parsing for Middle High German: A Delexicalized Approach.
ALP @RANLP 2023 - 1st Workshop on Ancient Language Processing co-located with the Conference on Recent Advances in Natural Language Processing (RANLP 2023). Varna, Bulgaria, Sep 08, 2023. URL
Abstract

Constituency parsing plays a fundamental role in advancing natural language processing (NLP) tasks. However, training an automatic syntactic analysis system for ancient languages solely relying on annotated parse data is a formidable task due to the inherent challenges in building treebanks for such languages. It demands extensive linguistic expertise, leading to a scarcity of available resources. To overcome this hurdle, cross-lingual transfer techniques which require minimal or even no annotated data for low-resource target languages offer a promising solution. In this study, we focus on building a constituency parser for Middle High German (MHG) under realistic conditions, where no annotated MHG treebank is available for training. In our approach, we leverage the linguistic continuity and structural similarity between MHG and Modern German (MG), along with the abundance of MG treebank resources. Specifically, by employing the delexicalization method, we train a constituency parser on MG parse datasets and perform cross-lingual transfer to MHG parsing. Our delexicalized constituency parser demonstrates remarkable performance on the MHG test set, achieving an F1-score of 67.3%. It outperforms the best zero-shot cross-lingual baseline by a margin of 28.6 percentage points. The encouraging results underscore the practicality and potential for automatic syntactic analysis in other ancient languages that face similar challenges as MHG.
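Delexicalization, in its basic form, replaces word forms with language-independent labels such as part-of-speech tags, so that a parser trained on one language sees the same kind of input in another. The sketch below illustrates this on a toy sentence pair; the tags are hand-assigned for illustration, and the tag set and preprocessing used in the paper are not stated in the abstract.

```python
def delexicalize(tagged_sentence):
    """Replace each word form with its POS tag, keeping the original order."""
    return [pos for _word, pos in tagged_sentence]


# Illustrative sentence pair with hand-assigned tags (an assumption, not corpus data)
mhg = [("ich", "PRON"), ("saz", "VERB"), ("ûf", "ADP"), ("eime", "DET"), ("steine", "NOUN")]
mg = [("ich", "PRON"), ("saß", "VERB"), ("auf", "ADP"), ("einem", "DET"), ("Stein", "NOUN")]

mhg_input = delexicalize(mhg)
mg_input = delexicalize(mg)
print(mhg_input)
print(mhg_input == mg_input)
```

Because the two tag sequences coincide, a parser trained on delexicalized Modern German input could in principle be applied to the Middle High German sentence without ever seeing its vocabulary.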

MCML Authors
Link to website

Ercong Nie

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[53]
V. Hangya and A. Fraser.
LMU at HaSpeeDe3: Multi-Dataset Training for Cross-Domain Hate Speech Detection.
EVALITA 2023 - Final Workshop of the 8th evaluation campaign. Parma, Italy, Sep 07-08, 2023. PDF
Abstract

We describe LMU Munich’s hate speech detection system for participating in the cross-domain track of the HaSpeeDe3 shared task at EVALITA 2023. The task focuses on the politics and religion domains, having no in-domain training data for the latter. Our submission combines multiple training sets from various domains in a multitask prompt-training system. We experimented with both Italian and English source datasets as well as monolingual Italian and multilingual pre-trained language models. We found that the Italian out-of-domain datasets are the most influential on the performance in the test domains and that combining both monolingual and multilingual language models using an ensemble gives the best results. Our system ranked second in both domains.

MCML Authors
Link to Profile Alexander Fraser

Alexander Fraser

Prof. Dr.

Data Analytics & Statistics


[52]
A. Imani, P. Lin, A. H. Kargaran, S. Severini, M. J. Sabet, N. Kassner, C. Ma, H. Schmid, A. Martins, F. Yvon and H. Schütze.
Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages.
ACL 2023 - 61st Annual Meeting of the Association for Computational Linguistics. Toronto, Canada, Jul 09-14, 2023. DOI GitHub
Abstract

The NLP community has mainly focused on scaling Large Language Models (LLMs) vertically, i.e., making them better for about 100 languages. We instead scale LLMs horizontally: we create, through continued pretraining, Glot500-m, an LLM that covers 511 predominantly low-resource languages. An important part of this effort is to collect and clean Glot500-c, a corpus that covers these 511 languages and allows us to train Glot500-m. We evaluate Glot500-m on five diverse tasks across these languages. We observe large improvements for both high-resource and low-resource languages compared to an XLM-R baseline. Our analysis shows that no single factor explains the quality of multilingual LLM representations. Rather, a combination of factors determines quality including corpus size, script, ‘help’ from related languages and the total capacity of the model. Our work addresses an important goal of NLP research: we should not limit NLP to a small fraction of the world’s languages and instead strive to support as many languages as possible to bring the benefits of NLP technology to all languages and cultures.

MCML Authors
Link to website

Ayyoob Imani

Statistical NLP and Deep Learning

Link to website

Peiqin Lin

Statistical NLP and Deep Learning

Link to website

Amir Hossein Kargaran

Statistical NLP and Deep Learning

Link to website

Masoud Jalili Sabet

Dr.

* Former member

Link to website

Chunlan Ma

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[51]
Y. Liu, H. Ye, L. Weissweiler, P. Wicke, R. Pei, R. Zangenfeind and H. Schütze.
A Crosslingual Investigation of Conceptualization in 1335 Languages.
ACL 2023 - 61st Annual Meeting of the Association for Computational Linguistics. Toronto, Canada, Jul 09-14, 2023. DOI
Abstract

Languages differ in how they divide up the world into concepts and words; e.g., in contrast to English, Swahili has a single concept for ‘belly’ and ‘womb’. We investigate these differences in conceptualization across 1,335 languages by aligning concepts in a parallel corpus. To this end, we propose Conceptualizer, a method that creates a bipartite directed alignment graph between source language concepts and sets of target language strings. In a detailed linguistic analysis across all languages for one concept (‘bird’) and an evaluation on gold standard data for 32 Swadesh concepts, we show that Conceptualizer has good alignment accuracy. We demonstrate the potential of research on conceptualization in NLP with two experiments. (1) We define crosslingual stability of a concept as the degree to which it has 1-1 correspondences across languages, and show that concreteness predicts stability. (2) We represent each language by its conceptualization pattern for 83 concepts, and define a similarity measure on these representations. The resulting measure for the conceptual similarity between two languages is complementary to standard genealogical, typological, and surface similarity measures. For four out of six language families, we can assign languages to their correct family based on conceptual similarity with accuracies between 54% and 87%.

MCML Authors
Link to website

Yihong Liu

Statistical NLP and Deep Learning

Link to website

Haotian Ye

Statistical NLP and Deep Learning

Leonie Weissweiler

Dr.

* Former member

Link to website

Philipp Wicke

Dr.

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[50]
Y. Liu, S. Feng, D. Wang, Y. Zhang and H. Schütze.
PVGRU: Generating Diverse and Relevant Dialogue Responses via Pseudo-Variational Mechanism.
ACL 2023 - 61st Annual Meeting of the Association for Computational Linguistics. Toronto, Canada, Jul 09-14, 2023. DOI
Abstract

We investigate response generation for multi-turn dialogue in generative chatbots. Existing generative models based on RNNs (Recurrent Neural Networks) usually employ the last hidden state to summarize the history, which makes models unable to capture the subtle variability observed in different dialogues or to distinguish between dialogues that are similar in composition. In this paper, we propose Pseudo-Variational Gated Recurrent Unit (PVGRU). The key novelty of PVGRU is a recurrent summarizing variable that aggregates the accumulated distribution variations of subsequences. We train PVGRU without relying on posterior knowledge, thus avoiding the training-inference inconsistency problem. PVGRU can perceive subtle semantic variability through summarizing variables that are optimized by two objectives we employ for training: distribution consistency and reconstruction. In addition, we build a Pseudo-Variational Hierarchical Dialogue (PVHD) model based on PVGRU. Experimental results demonstrate that PVGRU can broadly improve the diversity and relevance of responses on two benchmark datasets.

MCML Authors
Link to website

Yongkang Liu

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[49]
K. Hämmerl, B. Deiseroth, P. Schramowski, J. Libovický, C. Rothkopf, A. Fraser and K. Kersting.
Speaking Multiple Languages Affects the Moral Bias of Language Models.
ACL 2023 - Findings of the 61st Annual Meeting of the Association for Computational Linguistics. Toronto, Canada, Jul 09-14, 2023. DOI
Abstract

Pre-trained multilingual language models (PMLMs) are commonly used when dealing with data from multiple languages and cross-lingual transfer. However, PMLMs are trained on varying amounts of data for each language. In practice this means their performance is often much better on English than many other languages. We explore to what extent this also applies to moral norms. Do the models capture moral norms from English and impose them on other languages? Do the models exhibit random and thus potentially harmful beliefs in certain languages? Both these issues could negatively impact cross-lingual transfer and potentially lead to harmful outcomes. In this paper, we (1) apply the MORALDIRECTION framework to multilingual models, comparing results in German, Czech, Arabic, Chinese, and English, (2) analyse model behaviour on filtered parallel subtitles corpora, and (3) apply the models to a Moral Foundations Questionnaire, comparing with human responses from different countries. Our experiments demonstrate that, indeed, PMLMs encode differing moral biases, but these do not necessarily correspond to cultural differences or commonalities in human opinions. We release our code and models.

MCML Authors
Link to website at LMU

Katharina Hämmerl

Data Analytics & Statistics

Link to Profile Alexander Fraser

Alexander Fraser

Prof. Dr.

Data Analytics & Statistics


[48]
K. Hämmerl, A. Fastowski, J. Libovický and A. Fraser.
Exploring Anisotropy and Outliers in Multilingual Language Models for Cross-Lingual Semantic Sentence Similarity.
ACL 2023 - Findings of the 61st Annual Meeting of the Association for Computational Linguistics. Toronto, Canada, Jul 09-14, 2023. DOI
Abstract

Previous work has shown that the representations output by contextual language models are more anisotropic than static type embeddings, and typically display outlier dimensions. This seems to be true for both monolingual and multilingual models, although much less work has been done on the multilingual context. Why these outliers occur and how they affect the representations is still an active area of research. We investigate outlier dimensions and their relationship to anisotropy in multiple pre-trained multilingual language models. We focus on cross-lingual semantic similarity tasks, as these are natural tasks for evaluating multilingual representations. Specifically, we examine sentence representations. Sentence transformers which are fine-tuned on parallel resources (that are not always available) perform better on this task, and we show that their representations are more isotropic. However, we aim to improve multilingual representations in general. We investigate how much of the performance difference can be made up by only transforming the embedding space without fine-tuning, and visualise the resulting spaces. We test different operations: Removing individual outlier dimensions, cluster-based isotropy enhancement, and ZCA whitening. We publish our code for reproducibility.

MCML Authors
Link to website at LMU

Katharina Hämmerl

Data Analytics & Statistics

Link to Profile Alexander Fraser

Alexander Fraser

Prof. Dr.

Data Analytics & Statistics


[47]
Z. Han, R. Liao, J. Gu, Y. Zhang, Z. Ding, Y. Gu, H. Köppl, H. Schütze and V. Tresp.
ECOLA: Enhancing Temporal Knowledge Embeddings with Contextualized Language Representations.
ACL 2023 - Findings of the 61st Annual Meeting of the Association for Computational Linguistics. Toronto, Canada, Jul 09-14, 2023. DOI
Abstract

Since conventional knowledge embedding models cannot take full advantage of the abundant textual information, there have been extensive research efforts in enhancing knowledge embedding using texts. However, existing enhancement approaches cannot apply to temporal knowledge graphs (tKGs), which contain time-dependent event knowledge with complex temporal dynamics. Specifically, existing enhancement approaches often assume knowledge embedding is time-independent. In contrast, the entity embedding in tKG models usually evolves, which poses the challenge of aligning temporally relevant texts with entities. To this end, we propose to study enhancing temporal knowledge embedding with textual data in this paper. As an approach to this task, we propose Enhanced Temporal Knowledge Embeddings with Contextualized Language Representations (ECOLA), which takes the temporal aspect into account and injects textual information into temporal knowledge embedding. To evaluate ECOLA, we introduce three new datasets for training and evaluating ECOLA. Extensive experiments show that ECOLA significantly enhances temporal KG embedding models with up to 287% relative improvements regarding Hits@1 on the link prediction task.
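
For readers unfamiliar with the metric, Hits@1 and a relative improvement over a baseline can be computed as in the following minimal sketch. The function names and the rank lists are illustrative assumptions and do not come from the ECOLA code or datasets.

```python
from typing import List


def hits_at_k(ranks: List[int], k: int = 1) -> float:
    """Share of test queries whose correct entity is ranked within the top k."""
    if not ranks:
        raise ValueError("need at least one rank")
    return sum(1 for rank in ranks if rank <= k) / len(ranks)


def relative_improvement(baseline: float, enhanced: float) -> float:
    """Improvement of `enhanced` over `baseline`, in percent of the baseline."""
    return (enhanced - baseline) / baseline * 100.0


# Made-up ranks of the correct entity for four link prediction queries.
baseline_ranks = [1, 3, 2, 5]
enhanced_ranks = [1, 1, 2, 1]

baseline_hits = hits_at_k(baseline_ranks, k=1)
enhanced_hits = hits_at_k(enhanced_ranks, k=1)
gain = relative_improvement(baseline_hits, enhanced_hits)
```

Because the gain is measured relative to the baseline, a low baseline Hits@1 can yield a relative improvement well above 100%.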

MCML Authors
Link to website

Ruotong Liao

Database Systems & Data Mining

Link to website

Yao Zhang

Database Systems & Data Mining

Link to website

Zifeng Ding

Database Systems & Data Mining

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning

Link to Profile Volker Tresp

Volker Tresp

Prof. Dr.

Database Systems & Data Mining


[46]
E. Nie, S. Liang, H. Schmid and H. Schütze.
Cross-Lingual Retrieval Augmented Prompt for Low-Resource Languages.
ACL 2023 - Findings of the 61st Annual Meeting of the Association for Computational Linguistics. Toronto, Canada, Jul 09-14, 2023. DOI
Abstract

Multilingual Pretrained Language Models (MPLMs) perform strongly in cross-lingual transfer. We propose Prompts Augmented by Retrieval Crosslingually (PARC) to improve zero-shot performance on low-resource languages (LRLs) by augmenting the context with prompts consisting of semantically similar sentences retrieved from a high-resource language (HRL). PARC improves zero-shot performance on three downstream tasks (sentiment classification, topic categorization, natural language inference) with multilingual parallel test sets across 10 LRLs covering 6 language families in unlabeled (+5.1%) and labeled settings (+16.3%). PARC also outperforms finetuning by 3.7%. We find a significant positive correlation between cross-lingual transfer performance on one side, and the similarity between high- and low-resource languages as well as the amount of low-resource pretraining data on the other side. A robustness analysis suggests that PARC has the potential to achieve even stronger performance with more powerful MPLMs.

MCML Authors
Link to website

Ercong Nie

Statistical NLP and Deep Learning

Link to website

Sheng Liang

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[45]
L. Weber and B. Plank.
ActiveAED: A Human in the Loop Improves Annotation Error Detection.
ACL 2023 - Findings of the 61st Annual Meeting of the Association for Computational Linguistics. Toronto, Canada, Jul 09-14, 2023. DOI
Abstract

Manually annotated datasets are crucial for training and evaluating Natural Language Processing models. However, recent work has discovered that even widely-used benchmark datasets contain a substantial number of erroneous annotations. This problem has been addressed with Annotation Error Detection (AED) models, which can flag such errors for human re-annotation. However, even though many of these AED methods assume a final curation step in which a human annotator decides whether the annotation is erroneous, they have been developed as static models without any human-in-the-loop component. In this work, we propose ActiveAED, an AED method that can detect errors more accurately by repeatedly querying a human for error corrections in its prediction loop. We evaluate ActiveAED on eight datasets spanning five different tasks and find that it leads to improvements over the state of the art on seven of them, with gains of up to six percentage points in average precision.

MCML Authors
Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[44]
Y. Liu, A. Chronopoulou, H. Schütze and A. Fraser.
On the Copying Problem of Unsupervised NMT: A Training Schedule with a Language Discriminator Loss.
IWSLT 2023 - 20th International Conference on Spoken Language Translation. Toronto, Canada, Jul 09-14, 2023. DOI
Abstract

Although unsupervised neural machine translation (UNMT) has achieved success in many language pairs, the copying problem, i.e., directly copying some parts of the input sentence as the translation, is common among distant language pairs, especially when low-resource languages are involved. We find this issue is closely related to an unexpected copying behavior during online back-translation (BT). In this work, we propose a simple but effective training schedule that incorporates a language discriminator loss. The loss imposes constraints on the intermediate translation so that the translation is in the desired language. By conducting extensive experiments on different language pairs, including similar and distant, high and low-resource languages, we find that our method alleviates the copying problem, thus improving the translation performance on low-resource languages.

MCML Authors
Link to website

Yihong Liu

Statistical NLP and Deep Learning

Link to website

Alexandra Chronopoulou

Dr.

* Former member

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning

Link to Profile Alexander Fraser

Alexander Fraser

Prof. Dr.

Data Analytics & Statistics


[43]
P. Wicke, L. K. Senel, S. Zhang, L. Figueredo, A. Naceri, S. Haddadin and H. Schütze.
Towards Language-Based Modulation of Assistive Robots through Multimodal Models.
Geriatronics Summit 2023 - 2nd Geriatronics Summit. Garmisch-Partenkirchen, Germany, Jul 02-03, 2023. arXiv
Abstract

In the field of Geriatronics, enabling effective and transparent communication between humans and robots is crucial for enhancing the acceptance and performance of assistive robots. Our early-stage research project investigates the potential of language-based modulation as a means to improve human-robot interaction. We propose to explore real-time modulation during task execution, leveraging language cues, visual references, and multimodal inputs. By developing transparent and interpretable methods, we aim to enable robots to adapt and respond to language commands, enhancing their usability and flexibility. Through the exchange of insights and knowledge at the workshop, we seek to gather valuable feedback to advance our research and contribute to the development of interactive robotic systems for Geriatronics and beyond.

MCML Authors
Link to website

Philipp Wicke

Dr.

Statistical NLP and Deep Learning

Link to website

Shengqiang Zhang

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[42]
J. Baan, N. Daheim, E. Ilia, D. Ulmer, H.-S. Li, R. Fernández, B. Plank, R. Sennrich, C. Zerva and W. Aziz.
Uncertainty in Natural Language Generation: From Theory to Applications.
Preprint (Jul. 2023). arXiv
Abstract

Recent advances of powerful Language Models have allowed Natural Language Generation (NLG) to emerge as an important technology that can not only perform traditional tasks like summarisation or translation, but also serve as a natural language interface to a variety of applications. As such, it is crucial that NLG systems are trustworthy and reliable, for example by indicating when they are likely to be wrong; and supporting multiple views, backgrounds and writing styles – reflecting diverse human sub-populations. In this paper, we argue that a principled treatment of uncertainty can assist in creating systems and evaluation protocols better aligned with these goals. We first present the fundamental theory, frameworks and vocabulary required to represent uncertainty. We then characterise the main sources of uncertainty in NLG from a linguistic perspective, and propose a two-dimensional taxonomy that is more informative and faithful than the popular aleatoric/epistemic dichotomy. Finally, we move from theory to applications and highlight exciting research directions that exploit uncertainty to power decoding, controllable generation, self-assessment, selective answering, active learning and more.

MCML Authors
Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[41]
V. Steinborn, A. Maronikolakis and H. Schütze.
Politeness Stereotypes and Attack Vectors: Gender Stereotypes in Japanese and Korean Language Models.
Preprint (Jun. 2023). arXiv
Abstract

In efforts to keep up with the rapid progress and use of large language models, gender bias research is becoming more prevalent in NLP. Non-English bias research, however, is still in its infancy, with most work focusing on English. In our work, we study how grammatical gender bias relating to politeness levels manifests in Japanese and Korean language models. Linguistic studies in these languages have identified a connection between gender bias and politeness levels; however, it is not yet known whether language models reproduce these biases. We analyze relative prediction probabilities of the male and female grammatical genders using templates and find that informal polite speech is most indicative of the female grammatical gender, while rude and formal speech is most indicative of the male grammatical gender. Further, we find politeness levels to be an attack vector for allocational gender bias in cyberbullying detection models. Cyberbullies can evade detection through simple techniques abusing politeness levels. We introduce an attack dataset to (i) identify representational gender bias across politeness levels, (ii) demonstrate how gender biases can be abused to bypass cyberbullying detection models and (iii) show that allocational biases can be mitigated via training on our proposed dataset. Through our findings we highlight the importance of bias research moving beyond its current English-centrism.

MCML Authors
Link to website

Victor Steinborn

Statistical NLP and Deep Learning

Link to website

Antonis Maronikolakis

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[40]
V. Blaschke, H. Schütze and B. Plank.
A Survey of Corpora for Germanic Low-Resource Languages and Dialects.
NoDaLiDa 2023 - 24th Nordic Conference on Computational Linguistics. Tórshavn, Faroe Islands, May 22-24, 2023. URL
Abstract

Despite much progress in recent years, the vast majority of work in natural language processing (NLP) is on standard languages with many speakers. In this work, we instead focus on low-resource languages and in particular non-standardized low-resource languages. Even within branches of major language families, often considered well-researched, little is known about the extent and type of available resources and what the major NLP challenges are for these language varieties. The first step to address this situation is a systematic survey of available corpora (most importantly, annotated corpora, which are particularly valuable for NLP research). Focusing on Germanic low-resource language varieties, we provide such a survey in this paper. Except for geolocation (origin of speaker or document), we find that manually annotated linguistic resources are sparse and, if they exist, mostly cover morphosyntax. Despite this lack of resources, we observe that interest in this area is increasing: there is active development and a growing research community. To facilitate research, we make our overview of over 80 corpora publicly available.

MCML Authors
Link to website

Verena Blaschke

Artificial Intelligence and Computational Linguistics

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[39]
X. Wang, L. Weissweiler, H. Schütze and B. Plank.
How to Distill your BERT: An Empirical Study on the Impact of Weight Initialisation and Distillation Objectives.
EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics. Dubrovnik, Croatia, May 02-06, 2023. DOI
Abstract

Recently, various intermediate layer distillation (ILD) objectives have been shown to improve compression of BERT models via Knowledge Distillation (KD). However, a comprehensive evaluation of the objectives in both task-specific and task-agnostic settings is lacking. To the best of our knowledge, this is the first work comprehensively evaluating distillation objectives in both settings. We show that attention transfer gives the best performance overall. We also study the impact of layer choice when initializing the student from the teacher layers, finding a significant impact on the performance in task-specific distillation. For vanilla KD and hidden states transfer, initialisation with lower layers of the teacher gives a considerable improvement over higher layers, especially on the task of QNLI (up to an absolute percentage change of 17.8 in accuracy). Attention transfer behaves consistently under different initialisation settings. We release our code as an efficient transformer-based model distillation framework for further studies.

MCML Authors
Link to website

Xinpeng Wang

Artificial Intelligence and Computational Linguistics

Leonie Weissweiler

Dr.

* Former member

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[38]
A. Chronopoulou, M. Peters, A. Fraser and J. Dodge.
AdapterSoup: Weight Averaging to Improve Generalization of Pretrained Language Models.
EACL 2023 - Findings of the 17th Conference of the European Chapter of the Association for Computational Linguistics. Dubrovnik, Croatia, May 02-06, 2023. DOI
Abstract

Pretrained language models (PLMs) are trained on massive corpora, but often need to specialize to specific domains. A parameter-efficient adaptation method suggests training an adapter for each domain on the task of language modeling. This leads to good in-domain scores but can be impractical for domain- or resource-restricted settings. A solution is to use a related-domain adapter for the novel domain at test time. In this paper, we introduce AdapterSoup, an approach that performs weight-space averaging of adapters trained on different domains. Our approach is embarrassingly parallel: first, we train a set of domain-specific adapters; then, for each novel domain, we determine which adapters should be averaged at test time. We present extensive experiments showing that AdapterSoup consistently improves performance to new domains without extra training. We also explore weight averaging of adapters trained on the same domain with different hyper-parameters, and show that it preserves the performance of a PLM on new domains while obtaining strong in-domain results. We explore various approaches for choosing which adapters to combine, such as text clustering and semantic similarity. We find that using clustering leads to the most competitive results on novel domains.
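
The weight-space averaging mentioned in the abstract can be illustrated with a minimal sketch. It assumes each adapter is a mapping from parameter names to equally shaped, flattened weight lists and that the adapters are averaged uniformly. The data layout, the names `average_adapters`, `down_proj` and `up_proj`, and the toy weights are illustrative assumptions, not the authors' implementation.

```python
from typing import Dict, List

# Hypothetical layout: parameter name -> flattened weights.
Adapter = Dict[str, List[float]]


def average_adapters(adapters: List[Adapter]) -> Adapter:
    """Uniformly average same-shaped adapters, parameter by parameter."""
    if not adapters:
        raise ValueError("need at least one adapter")
    soup: Adapter = {}
    for name in adapters[0]:
        # Group the i-th weight of every adapter together, then average.
        columns = zip(*(adapter[name] for adapter in adapters))
        soup[name] = [sum(column) / len(adapters) for column in columns]
    return soup


# Toy adapters with made-up weights, for illustration only.
news_adapter = {"down_proj": [0.5, 1.0], "up_proj": [2.0, -1.0]}
reviews_adapter = {"down_proj": [1.5, 0.0], "up_proj": [0.0, 1.0]}

soup = average_adapters([news_adapter, reviews_adapter])
```

Choosing which adapters enter the average for a novel domain (for example via text clustering or semantic similarity, as the abstract describes) is a separate step that this sketch leaves out.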

MCML Authors
Link to website

Alexandra Chronopoulou

Dr.

* Former member

Link to Profile Alexander Fraser

Alexander Fraser

Prof. Dr.

Data Analytics & Statistics


[37]
A. Chronopoulou, D. Stojanovski and A. Fraser.
Language-Family Adapters for Low-Resource Multilingual Neural Machine Translation.
LoResMT @EACL 2023 - 6th Workshop on Technologies for Machine Translation of Low-Resource Languages at the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2023). Dubrovnik, Croatia, May 02-06, 2023. DOI
Abstract

Large multilingual models trained with self-supervision achieve state-of-the-art results in a wide range of natural language processing tasks. Self-supervised pretrained models are often fine-tuned on parallel data from one or multiple language pairs for machine translation. Multilingual fine-tuning improves performance on low-resource languages but requires modifying the entire model and can be prohibitively expensive. Training a new adapter on each language pair or training a single adapter on all language pairs without updating the pretrained model has been proposed as a parameter-efficient alternative. However, the former does not permit any sharing between languages, while the latter shares parameters for all languages and is susceptible to negative interference. In this paper, we propose training language-family adapters on top of mBART-50 to facilitate cross-lingual transfer. Our approach outperforms related baselines, yielding higher translation scores on average when translating from English to 17 different low-resource languages. We also show that language-family adapters provide an effective method to translate to languages unseen during pretraining.

MCML Authors
Link to website

Alexandra Chronopoulou

Dr.

* Former member

Link to Profile Alexander Fraser

Alexander Fraser

Prof. Dr.

Data Analytics & Statistics


[36]
V. Blaschke, H. Schütze and B. Plank.
Does Manipulating Tokenization Aid Cross-Lingual Transfer? A Study on POS Tagging for Non-Standardized Languages.
VarDial @EACL 2023 - 10th Workshop on NLP for Similar Languages, Varieties and Dialects at the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2023). Dubrovnik, Croatia, May 02-06, 2023. DOI
Abstract

One of the challenges with finetuning pretrained language models (PLMs) is that their tokenizer is optimized for the language(s) it was pretrained on, but brittle when it comes to previously unseen variations in the data. This can for instance be observed when finetuning PLMs on one language and evaluating them on data in a closely related language variety with no standardized orthography. Despite the high linguistic similarity, tokenization no longer corresponds to meaningful representations of the target data, leading to low performance in, e.g., part-of-speech tagging. In this work, we finetune PLMs on seven languages from three different families and analyze their zero-shot performance on closely related, non-standardized varieties. We consider different measures for the divergence in the tokenization of the source and target data, and the way they can be adjusted by manipulating the tokenization during the finetuning step. Overall, we find that the similarity between the percentage of words that get split into subwords in the source and target data (the split word ratio difference) is the strongest predictor for model performance on target data.
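
The split word ratio difference can be sketched as follows, assuming it is taken as the absolute difference between the fractions of words that a tokenizer splits into more than one subword in the source and in the target data. The toy tokenizer, its vocabulary, and the word lists are hypothetical stand-ins, and the paper's exact definition may differ in detail.

```python
from typing import Callable, List

Tokenizer = Callable[[str], List[str]]


def split_word_ratio(words: List[str], tokenize: Tokenizer) -> float:
    """Fraction of words that the tokenizer splits into more than one subword."""
    if not words:
        raise ValueError("need at least one word")
    split = sum(1 for word in words if len(tokenize(word)) > 1)
    return split / len(words)


def split_word_ratio_difference(
    source: List[str], target: List[str], tokenize: Tokenizer
) -> float:
    """Absolute difference between source and target split word ratios."""
    return abs(split_word_ratio(source, tokenize) - split_word_ratio(target, tokenize))


# Hypothetical toy tokenizer: known words stay whole,
# unknown words are cut into pieces of three characters.
VOCAB = {"the", "house", "is", "big"}


def toy_tokenize(word: str) -> List[str]:
    if word in VOCAB:
        return [word]
    return [word[i:i + 3] for i in range(0, len(word), 3)]


source_words = ["the", "house", "is", "big"]     # standard variety
target_words = ["the", "hoose", "is", "muckle"]  # made-up non-standard spelling

difference = split_word_ratio_difference(source_words, target_words, toy_tokenize)
```

In this toy setup no source word is split while half of the target words are, so the difference is large, which is the kind of tokenization mismatch the abstract associates with lower target performance.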

MCML Authors
Link to website

Verena Blaschke

Artificial Intelligence and Computational Linguistics

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning

Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[35]
Y. Liu, S. Feng, D. Wang, Y. Zhang and H. Schütze.
Evaluate What You Can't Evaluate: Unassessable Quality for Generated Response.
Preprint (May 2023). arXiv
Abstract

LLMs (large language models) such as ChatGPT have shown remarkable language understanding and generation capabilities. Although reference-free evaluators based on LLMs show better human alignment than traditional reference-based evaluators, there are many challenges in using reference-free evaluators based on LLMs. Reference-free evaluators are more suitable for open-ended examples with semantically different responses. But not all examples are open-ended. For closed-ended examples with a unique correct semantic response, reference-free evaluators will still rate a response as high quality even when it is inconsistent with the facts and the semantics of the reference. In order to comprehensively evaluate the reliability of evaluators based on LLMs, we construct two adversarial meta-evaluation dialogue generation datasets KdConv-ADV and DSTC7-ADV based on KdConv and DSTC7-AVSD, respectively. Compared to previous meta-evaluation benchmarks, KdConv-ADV and DSTC7-ADV are much more challenging since they require evaluators to be able to reasonably evaluate closed-ended examples with the help of external knowledge or even their own knowledge. Empirical results show that the ability of LLMs to identify unreasonable responses is insufficient. There are risks in using reference-free evaluators based on LLMs to evaluate the quality of dialogue responses.

MCML Authors
Link to website

Yongkang Liu

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[34]
A. Modarressi, A. Imani, M. Fayyaz and H. Schütze.
RET-LLM: Towards a General Read-Write Memory for Large Language Models.
Preprint (May 2023). arXiv
Abstract

Large language models (LLMs) have significantly advanced the field of natural language processing (NLP) through their extensive parameters and comprehensive data utilization. However, existing LLMs lack a dedicated memory unit, limiting their ability to explicitly store and retrieve knowledge for various tasks. In this paper, we propose RET-LLM, a novel framework that equips LLMs with a general write-read memory unit, allowing them to extract, store, and recall knowledge from the text as needed for task performance. Inspired by Davidsonian semantics theory, we extract and save knowledge in the form of triplets. The memory unit is designed to be scalable, aggregatable, updatable, and interpretable. Through qualitative evaluations, we demonstrate the superiority of our proposed framework over baseline approaches in question answering tasks. Moreover, our framework exhibits robust performance in handling temporal-based question answering tasks, showcasing its ability to effectively manage time-dependent information.

MCML Authors
Link to website

Ali Modarressi

Statistical NLP and Deep Learning

Link to website

Ayyoob Imani

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[33]
H. Ye, Y. Liu and H. Schütze.
A study of conceptual language similarity: comparison and evaluation.
Preprint (May 2023). arXiv
Abstract

An interesting line of research in natural language processing (NLP) aims to incorporate linguistic typology to bridge linguistic diversity and assist the research of low-resource languages. While most works construct linguistic similarity measures based on lexical or typological features, such as word order and verbal inflection, recent work has introduced a novel approach to defining language similarity based on how they represent basic concepts, which is complementary to existing similarity measures. In this work, we study the conceptual similarity in detail and evaluate it extensively on a binary classification task.

MCML Authors
Link to website

Haotian Ye

Statistical NLP and Deep Learning

Link to website

Yihong Liu

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[32]
L. He, N. Otani, D. R. Mortensen, L. Levin and H. Schütze.
Construction Grammar Provides Unique Insight into Neural Language Models.
GURT 2023 - Georgetown University Round Table on Linguistics. Washington D.C., USA, Mar 09-12, 2023. URL
Abstract

Construction Grammar (CxG) has recently been used as the basis for probing studies that have investigated the performance of large pre-trained language models (PLMs) with respect to the structure and meaning of constructions. In this position paper, we make suggestions for the continuation and augmentation of this line of research. We look at probing methodology that was not designed with CxG in mind, as well as probing methodology that was designed for specific constructions. We analyse selected previous work in detail, and provide our view of the most important challenges and research questions that this promising new field faces.

MCML Authors
Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[31]
J. Li, M. Zhao, Y. Xie, A. Maronikolakis, P. Pu and H. Schütze.
This joke is [MASK]: Recognizing Humor and Offense with Prompting.
TL4NLP @NeurIPS 2022 - 1st Transfer Learning for Natural Language Processing Workshop at the 36th Conference on Neural Information Processing Systems (NeurIPS 2022). New Orleans, LA, USA, Nov 28-Dec 09, 2022. URL
Abstract

Humor is a magnetic component in everyday human interactions and communications. Computationally modeling humor enables NLP systems to entertain and engage with users. We investigate the effectiveness of prompting, a new transfer learning paradigm for NLP, for humor recognition. We show that prompting performs similarly to finetuning when numerous annotations are available, but gives stellar performance in low-resource humor recognition. The relationship between humor and offense is also inspected by applying influence functions to prompting; we show that models could rely on offense to determine humor during transfer.

MCML Authors
Link to website

Antonis Maronikolakis

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[30]
J. Baan, W. Aziz, B. Plank and R. Fernandez.
Stop Measuring Calibration When Humans Disagree.
EMNLP 2022 - Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates, Nov 07-11, 2022. DOI
Abstract

Calibration is a popular framework to evaluate whether a classifier knows when it does not know - i.e., its predictive probabilities are a good indication of how likely a prediction is to be correct. Correctness is commonly estimated against the human majority class. Recently, calibration to human majority has been measured on tasks where humans inherently disagree about which class applies. We show that measuring calibration to human majority given inherent disagreements is theoretically problematic, demonstrate this empirically on the ChaosNLI dataset, and derive several instance-level measures of calibration that capture key statistical properties of human judgements - including class frequency, ranking and entropy.
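
One of the statistical properties named above, the entropy of the human judgement distribution, can be compared against a model's predictive distribution at the level of a single instance. The sketch below is only an illustration of that idea under stated assumptions: the annotator votes and the model probabilities are made up, and the helper names are hypothetical rather than the measures defined in the paper.

```python
import math
from collections import Counter
from typing import List


def label_distribution(votes: List[str], classes: List[str]) -> List[float]:
    """Relative frequency of each class among the annotator votes for one instance."""
    counts = Counter(votes)
    return [counts[label] / len(votes) for label in classes]


def entropy_bits(distribution: List[float]) -> float:
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    return -sum(p * math.log2(p) for p in distribution if p > 0)


classes = ["entailment", "neutral", "contradiction"]

# Made-up annotator votes for a single instance with inherent disagreement.
votes = ["entailment", "neutral", "entailment", "neutral"]
human = label_distribution(votes, classes)

# Hypothetical over-confident model prediction for the same instance.
model = [1.0, 0.0, 0.0]

entropy_gap = entropy_bits(human) - entropy_bits(model)
```

Here the model's top class matches one of the two tied human majority labels, yet its distribution carries none of the human uncertainty, which is the kind of mismatch that accuracy against a majority label cannot reveal.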

MCML Authors
Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[29]
E. Bassignana, M. Müller-Eberstein, M. Zhang and B. Plank.
Evidence > Intuition: Transferability Estimation for Encoder Selection.
EMNLP 2022 - Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates, Nov 07-11, 2022. DOI
Abstract

With the increase in availability of large pre-trained language models (LMs) in Natural Language Processing (NLP), it becomes critical to assess their fit for a specific target task a priori, as fine-tuning the entire space of available LMs is computationally prohibitive and unsustainable. However, encoder transferability estimation has received little to no attention in NLP. In this paper, we propose to generate quantitative evidence to predict which LM, out of a pool of models, will perform best on a target task without having to fine-tune all candidates. We provide a comprehensive study on LM ranking for 10 NLP tasks spanning the two fundamental problem types of classification and structured prediction. We adopt the state-of-the-art Logarithm of Maximum Evidence (LogME) measure from Computer Vision (CV) and find that it positively correlates with final LM performance in 94% of the setups. In the first study of its kind, we further compare transferability measures with the de facto standard of human practitioner ranking, finding that evidence from quantitative metrics is more robust than pure intuition and can help identify unexpected LM candidates.

MCML Authors
Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[28]
V. Hangya, H. S. Saadi and A. Fraser.
Improving Low-Resource Languages in Pre-Trained Multilingual Language Models.
EMNLP 2022 - Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates, Nov 07-11, 2022. DOI
Abstract

Pre-trained multilingual language models are the foundation of many NLP approaches, including cross-lingual transfer solutions. However, languages with small available monolingual corpora are often not well-supported by these models, leading to poor performance. We propose an unsupervised approach to improve the cross-lingual representations of low-resource languages by bootstrapping word translation pairs from monolingual corpora and using them to improve language alignment in pre-trained language models. We perform experiments on nine languages, using contextual word retrieval and zero-shot named entity recognition to measure both intrinsic cross-lingual word representation quality and downstream task performance, showing improvements on both tasks. Our results show that it is possible to improve pre-trained multilingual language models by relying only on non-parallel resources.

MCML Authors
Link to Profile Alexander Fraser

Alexander Fraser

Prof. Dr.

Data Analytics & Statistics


[27]
A. Imani, S. Severini, M. J. Sabet, F. Yvon and H. Schütze.
Graph-Based Multilingual Label Propagation for Low-Resource Part-of-Speech Tagging.
EMNLP 2022 - Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates, Nov 07-11, 2022. DOI
Abstract

Part-of-Speech (POS) tagging is an important component of the NLP pipeline, but many low-resource languages lack labeled data for training. An established method for training a POS tagger in such a scenario is to create a labeled training set by transferring from high-resource languages. In this paper, we propose a novel method for transferring labels from multiple high-resource source to low-resource target languages. We formalize POS tag projection as graph-based label propagation. Given translations of a sentence in multiple languages, we create a graph with words as nodes and alignment links as edges by aligning words for all language pairs. We then propagate node labels from source to target using a Graph Neural Network augmented with transformer layers. We show that our propagation creates training sets that allow us to train POS taggers for a diverse set of languages. When combined with enhanced contextualized embeddings, our method achieves a new state-of-the-art for unsupervised POS tagging of low-resource languages.

MCML Authors
Link to website

Ayyoob Imani

Statistical NLP and Deep Learning

Link to website

Masoud Jalili Sabet

Dr.

* Former member

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[26]
M. Müller-Eberstein, R. van der Goot and B. Plank.
Spectral Probing.
EMNLP 2022 - Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates, Nov 07-11, 2022. DOI
Abstract

Linguistic information is encoded at varying timescales (subwords, phrases, etc.) and communicative levels, such as syntax and semantics. Contextualized embeddings have analogously been found to capture these phenomena at distinctive layers and frequencies. Leveraging these findings, we develop a fully learnable frequency filter to identify spectral profiles for any given task. It enables vastly more granular analyses than prior handcrafted filters, and improves on efficiency. After demonstrating the informativeness of spectral probing over manual filters in a monolingual setting, we investigate its multilingual characteristics across seven diverse NLP tasks in six languages. Our analyses identify distinctive spectral profiles which quantify cross-task similarity in a linguistically intuitive manner, while remaining consistent across languages—highlighting their potential as robust, lightweight task descriptors.

MCML Authors
Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[25]
B. Plank.
The 'Problem' of Human Label Variation: On Ground Truth in Data, Modeling and Evaluation.
EMNLP 2022 - Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates, Nov 07-11, 2022. DOI
Abstract

Human variation in labeling is often considered noise. Annotation projects for machine learning (ML) aim at minimizing human label variation, with the assumption to maximize data quality and in turn optimize and maximize machine learning metrics. However, this conventional practice assumes that there exists a ground truth, and neglects that there exists genuine human variation in labeling due to disagreement, subjectivity in annotation or multiple plausible answers. In this position paper, we argue that this big open problem of human label variation persists and critically needs more attention to move our field forward. This is because human label variation impacts all stages of the ML pipeline: data, modeling and evaluation. However, few works consider all of these dimensions jointly; and existing research is fragmented. We reconcile different previously proposed notions of human label variation, provide a repository of publicly-available datasets with un-aggregated labels, depict approaches proposed so far, identify gaps and suggest ways forward. As datasets are becoming increasingly available, we hope that this synthesized view on the ‘problem’ will lead to an open discussion on possible strategies to devise fundamentally new directions.

MCML Authors
Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[24]
L. Weissweiler, V. Hofmann, A. Köksal and H. Schütze.
The better your Syntax, the better your Semantics? Probing Pretrained Language Models for the English Comparative Correlative.
EMNLP 2022 - Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates, Nov 07-11, 2022. DOI
Abstract

Construction Grammar (CxG) is a paradigm from cognitive linguistics emphasising the connection between syntax and semantics. Rather than rules that operate on lexical items, it posits constructions as the central building blocks of language, i.e., linguistic units of different granularity that combine syntax and semantics. As a first step towards assessing the compatibility of CxG with the syntactic and semantic knowledge demonstrated by state-of-the-art pretrained language models (PLMs), we present an investigation of their capability to classify and understand one of the most commonly studied constructions, the English comparative correlative (CC). We conduct experiments examining the classification accuracy of a syntactic probe on the one hand and the models’ behaviour in a semantic application task on the other, with BERT, RoBERTa, and DeBERTa as the example PLMs. Our results show that all three investigated PLMs are able to recognise the structure of the CC but fail to use its meaning. While human-like performance of PLMs on many NLP tasks has been alleged, this indicates that PLMs still suffer from substantial shortcomings in central domains of linguistic knowledge.

MCML Authors
Leonie Weissweiler

Dr.

* Former member

Link to website

Abdullatif Köksal

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[23]
E. Bassignana and B. Plank.
CrossRE: A Cross-Domain Dataset for Relation Extraction.
EMNLP 2022 - Findings of the Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates, Nov 07-11, 2022. DOI
Abstract

Relation Extraction (RE) has attracted increasing attention, but current RE evaluation is limited to in-domain evaluation setups. Little is known about how well an RE system fares in challenging, but realistic out-of-distribution evaluation setups. To address this gap, we propose CrossRE, a new, freely-available cross-domain benchmark for RE, which comprises six distinct text domains and includes multi-label annotations. An additional innovation is that we release meta-data collected during annotation, to include explanations and flags of difficult instances. We provide an empirical evaluation with a state-of-the-art model for relation classification. As the meta-data enables us to shed new light on the state-of-the-art model, we provide a comprehensive analysis on the impact of difficult cases and find correlations between model and human annotations. Overall, our empirical investigation highlights the difficulty of cross-domain RE. We release our dataset, to spur more research in this direction.

MCML Authors
Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[22]
W. Lai, A. Chronopoulou and A. Fraser.
m4 Adapter: Multilingual Multi-Domain Adaptation for Machine Translation with a Meta-Adapter.
EMNLP 2022 - Findings of the Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates, Nov 07-11, 2022. DOI
Abstract

Multilingual neural machine translation (MNMT) models yield state-of-the-art performance when evaluated on data from a domain and language pair seen at training time. However, when an MNMT model is used to translate under domain shift or to a new language pair, performance drops dramatically. We consider a very challenging scenario: adapting the MNMT model both to a new domain and to a new language pair at the same time. In this paper, we propose m4Adapter (Multilingual Multi-Domain Adaptation for Machine Translation with a Meta-Adapter), which combines domain and language knowledge using meta-learning with adapters. We present results showing that our approach is a parameter-efficient solution which effectively adapts a model to both a new language pair and a new domain, while outperforming other adapter methods. An ablation study also shows that our approach more effectively transfers domain knowledge across different languages and language information across different domains.

MCML Authors
Link to website

Alexandra Chronopoulou

Dr.

* Former member

Link to Profile Alexander Fraser

Alexander Fraser

Prof. Dr.

Data Analytics & Statistics


[21]
D. Ulmer, E. Bassignana, M. Müller-Eberstein, D. Varab, M. Zhang, R. van der Goot, C. Hardmeier and B. Plank.
Experimental Standards for Deep Learning in Natural Language Processing Research.
EMNLP 2022 - Findings of the Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates, Nov 07-11, 2022. DOI
Abstract

The field of Deep Learning (DL) has undergone explosive growth during the last decade, with a substantial impact on Natural Language Processing (NLP) as well. Yet, compared to more established disciplines, a lack of common experimental standards remains an open challenge to the field at large. Starting from fundamental scientific principles, we distill ongoing discussions on experimental standards in NLP into a single, widely-applicable methodology. Following these best practices is crucial to strengthen experimental evidence, improve reproducibility and enable scientific progress. These standards are further collected in a public repository to help them transparently adapt to future needs.

MCML Authors
Link to Profile Barbara Plank

Barbara Plank

Prof. Dr.

Artificial Intelligence and Computational Linguistics


[20]
H. S. Saadi, V. Hangya, T. Eder and A. Fraser.
Comparative Analysis of Cross-lingual Contextualized Word Embeddings.
MRL @EMNLP 2022 - 2nd Workshop on Multi-lingual Representation Learning at the Conference on Empirical Methods in Natural Language Processing (EMNLP 2022). Abu Dhabi, United Arab Emirates, Nov 07-11, 2022. DOI
Abstract

Contextualized word embeddings have emerged as the most important tool for performing NLP tasks in a large variety of languages. In order to improve the cross-lingual representation and transfer learning quality, contextualized embedding alignment techniques, such as mapping and model fine-tuning, are employed. Existing techniques, however, are time-, data- and computational resource-intensive. In this paper we analyze these techniques by utilizing three tasks: bilingual lexicon induction (BLI), word retrieval and cross-lingual natural language inference (XNLI) for a high-resource (German-English) and a low-resource (Bengali-English) language pair. In contrast to previous works which focus only on a few popular models, we compare five multilingual and seven monolingual language models and investigate the effect of various aspects on their performance, such as vocabulary size, number of languages used for training and number of parameters. Additionally, we propose a parameter-, data- and runtime-efficient technique which can be trained with 10% of the data, less than 10% of the time and has less than 5% of the trainable parameters compared to model fine-tuning. We show that our proposed method is competitive with resource-heavy models, even outperforming them in some cases, even though it relies on fewer resources.

MCML Authors
Link to Profile Alexander Fraser

Alexander Fraser

Prof. Dr.

Data Analytics & Statistics


[19]
A. Maronikolakis, P. Baader and H. Schütze.
Analyzing Hate Speech Data along Racial, Gender and Intersectional Axes.
GeBNLP 2022 - 4th Workshop on Gender Bias in Natural Language Processing. Seattle, WA, USA, Jul 15, 2022. DOI
Abstract

To tackle the rising phenomenon of hate speech, efforts have been made towards data curation and analysis. When it comes to analysis of bias, previous work has focused predominantly on race. In our work, we further investigate bias in hate speech datasets along racial, gender and intersectional axes. We identify strong bias against African American English (AAE), masculine and AAE+Masculine tweets, which are annotated as disproportionately more hateful and offensive than from other demographics. We provide evidence that BERT-based models propagate this bias and show that balancing the training data for these protected attributes can lead to fairer models with regards to gender, but not race.

MCML Authors
Link to website

Antonis Maronikolakis

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[18]
S. Yuan, A. Maronikolakis and H. Schütze.
Separating Hate Speech and Offensive Language Classes via Adversarial Debiasing.
WOAH 2022 - 6th Workshop on Online Abuse and Harms. Seattle, WA, USA, Jul 14, 2022. DOI
Abstract

Research to tackle hate speech plaguing online media has made strides in providing solutions, analyzing bias and curating data. A challenging problem is ambiguity between hate speech and offensive language, causing low performance both overall and specifically for the hate speech class. It can be argued that misclassifying actual hate speech content as merely offensive can lead to further harm against targeted groups. In our work, we mitigate this potentially harmful phenomenon by proposing an adversarial debiasing method to separate the two classes. We show that our method works for English, Arabic, German and Hindi, plus in a multilingual setting, improving performance over baselines.

MCML Authors
Link to website

Antonis Maronikolakis

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[17]
S. Severini, V. Hangya, M. J. Sabet, A. Fraser and H. Schütze.
Don't Forget Cheap Training Signals Before Building Unsupervised Bilingual Word Embeddings.
BUCC @LREC 2022 - 15th Workshop on Building and Using Comparable Corpora at the 13th International Conference on Language Resources and Evaluation (LREC 2022). Marseille, France, Jun 21-23, 2022. URL
Abstract

Bilingual Word Embeddings (BWEs) are one of the cornerstones of cross-lingual transfer of NLP models. They can be built using only monolingual corpora without supervision, leading to numerous works focusing on unsupervised BWEs. However, most of the current approaches to build unsupervised BWEs do not compare their results with methods based on easy-to-access cross-lingual signals. In this paper, we argue that such signals should always be considered when developing unsupervised BWE methods. The two approaches we find most effective are: 1) using identical words as seed lexicons (which unsupervised approaches incorrectly assume are not available for orthographically distinct language pairs) and 2) combining such lexicons with pairs extracted by matching romanized versions of words with an edit distance threshold. We experiment on thirteen non-Latin languages (and English) and show that such cheap signals work well and that they outperform using more complex unsupervised methods on distant language pairs such as Chinese, Japanese, Kannada, Tamil, and Thai. In addition, they are even competitive with the use of high-quality lexicons in supervised approaches. Our results show that these training signals should not be neglected when building BWEs, even for distant languages.
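
The two cheap signals named in this abstract (identical words, and romanized words within an edit distance threshold) can be sketched as follows. This is only an illustration under stated assumptions: the vocabularies, the romanized forms, the function names and the threshold of 2 are hypothetical, and the paper's actual romanization tool and threshold are not given here.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def seed_lexicon(src_vocab, tgt_vocab, max_distance=1):
    """Collect seed pairs from two vocabularies.

    Each vocabulary maps a word to its (assumed, precomputed) romanized form.
    Signal 1: words that are identical in both vocabularies.
    Signal 2: words whose romanized forms are within max_distance edits.
    """
    pairs = []
    for src_word, src_rom in src_vocab.items():
        if src_word in tgt_vocab:
            pairs.append((src_word, src_word))
            continue
        for tgt_word, tgt_rom in tgt_vocab.items():
            if edit_distance(src_rom, tgt_rom) <= max_distance:
                pairs.append((src_word, tgt_word))
    return pairs


# Hypothetical toy vocabularies; the romanized forms are illustrative only.
src_vocab = {"athens": "athens", "2022": "2022"}
tgt_vocab = {"Αθήνα": "athina", "2022": "2022"}
seeds = seed_lexicon(src_vocab, tgt_vocab, max_distance=2)
print(seeds)
```

The nested loop is quadratic in vocabulary size, which is acceptable for a sketch; a realistic vocabulary would need candidate pruning before the distance check.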

MCML Authors
Link to website

Masoud Jalili Sabet

Dr.

* Former member

Link to Profile Alexander Fraser

Alexander Fraser

Prof. Dr.

Data Analytics & Statistics

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[16]
S. Severini, A. Imani, P. Dufter and H. Schütze.
Towards a Broad Coverage Named Entity Resource: A Data-Efficient Approach for Many Diverse Languages.
LREC 2022 - 13th International Conference on Language Resources and Evaluation. Marseille, France, Jun 21-23, 2022. URL
Abstract

Parallel corpora are ideal for extracting a multilingual named entity (MNE) resource, i.e., a dataset of names translated into multiple languages. Prior work on extracting MNE datasets from parallel corpora required resources such as large monolingual corpora or word aligners that are unavailable or perform poorly for underresourced languages. We present CLC-BN, a new method for creating an MNE resource, and apply it to the Parallel Bible Corpus, a corpus of more than 1000 languages. CLC-BN learns a neural transliteration model from parallel-corpus statistics, without requiring any other bilingual resources, word aligners, or seed data. Experimental results show that CLC-BN clearly outperforms prior work. We release an MNE resource for 1340 languages and demonstrate its effectiveness in two downstream tasks: knowledge graph augmentation and bilingual lexicon induction.

MCML Authors
Link to website

Ayyoob Imani

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[15]
V. Steinborn, P. Dufter, H. Jabbar and H. Schütze.
An Information-Theoretic Approach and Dataset for Probing Gender Stereotypes in Multilingual Masked Language Models.
NAACL 2022 - Findings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics. Seattle, WA, USA, Jun 10-15, 2022. DOI
Abstract

Bias research in NLP is a rapidly growing and developing field. Similar to CrowS-Pairs (Nangia et al., 2020), we assess gender bias in masked-language models (MLMs) by studying pairs of sentences with gender swapped person references. Most bias research focuses on and often is specific to English. Using a novel methodology for creating sentence pairs that is applicable across languages, we create, based on CrowS-Pairs, a multilingual dataset for English, Finnish, German, Indonesian and Thai. Additionally, we propose SJSD, a new bias measure based on Jensen–Shannon divergence, which we argue retains more information from the model output probabilities than other previously proposed bias measures for MLMs. Using multilingual MLMs, we find that SJSD diagnoses the same systematic biased behavior for non-English that previous studies have found for monolingual English pre-trained MLMs. SJSD outperforms the CrowS-Pairs measure, which struggles to find such biases for smaller non-English datasets.
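
For readers unfamiliar with the quantity SJSD builds on, the following minimal sketch computes the standard Jensen–Shannon divergence between two discrete distributions. It is not the SJSD measure itself, whose exact definition is not given in this abstract, and the two example distributions are hypothetical stand-ins for a model's output probabilities on a gender-swapped sentence pair.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their mixture m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical output distributions over a three-token vocabulary.
p_original = [0.7, 0.2, 0.1]
p_swapped = [0.4, 0.4, 0.2]
print(js_divergence(p_original, p_swapped))
```

Unlike KL divergence, this quantity is symmetric and bounded by ln 2 (in nats), so scores for different sentence pairs are directly comparable.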

MCML Authors
Link to website

Victor Steinborn

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[14]
M. Zhao, F. Mi, Y. Wang, M. Li, X. Jiang, Q. Liu and H. Schütze.
LMTurk: Few-Shot Learners as Crowdsourcing Workers in a Language-Model-as-a-Service Framework.
NAACL 2022 - Findings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics. Seattle, WA, USA, Jun 10-15, 2022. DOI
Abstract

Vast efforts have been devoted to creating high-performance few-shot learners, i.e., large-scale pretrained language models (PLMs) that perform well with little downstream task training data. Training PLMs has incurred significant cost, but utilizing the few-shot learners is still challenging due to their enormous size. This work focuses on a crucial question: How to make effective use of these few-shot learners? We propose LMTurk, a novel approach that treats few-shot learners as crowdsourcing workers. The rationale is that crowdsourcing workers are in fact few-shot learners: They are shown a few illustrative examples to learn about a task and then start annotating. LMTurk employs few-shot learners built upon PLMs as workers. We show that the resulting annotations can be utilized to train models that solve the task well and are small enough to be deployable in practical scenarios. Active learning is integrated into LMTurk to reduce the number of queries made to PLMs, minimizing the computational cost of running PLM inference passes. Altogether, LMTurk is an important step towards making effective use of current PLMs.

MCML Authors
Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[13]
L. Weissweiler, V. Hofmann, M. J. Sabet and H. Schütze.
CaMEL: Case Marker Extraction without Labels.
ACL 2022 - 60th Annual Meeting of the Association for Computational Linguistics. Dublin, Ireland, May 22-27, 2022. DOI
Abstract

We introduce CaMEL (Case Marker Extraction without Labels), a novel and challenging task in computational morphology that is especially relevant for low-resource languages. We propose a first model for CaMEL that uses a massively multilingual corpus to extract case markers in 83 languages based only on a noun phrase chunker and an alignment system. To evaluate CaMEL, we automatically construct a silver standard from UniMorph. The case markers extracted by our model can be used to detect and visualise similarities and differences between the case systems of different languages as well as to annotate fine-grained deep cases in languages in which they are not overtly marked.

MCML Authors
Leonie Weissweiler

Dr.

* Former member

Link to website

Masoud Jalili Sabet

Dr.

* Former member

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[12]
S. Sharifzadeh, S. M. Baharlou, M. Schmitt, H. Schütze and V. Tresp.
Improving Scene Graph Classification by Exploiting Knowledge from Texts.
AAAI 2022 - 36th Conference on Artificial Intelligence. Virtual, Feb 22-Mar 01, 2022. DOI
Abstract

Training scene graph classification models requires a large amount of annotated image data. Meanwhile, scene graphs represent relational knowledge that can be modeled with symbolic data from texts or knowledge graphs. While image annotation demands extensive labor, collecting textual descriptions of natural scenes requires less effort. In this work, we investigate whether textual scene descriptions can substitute for annotated image data. To this end, we employ a scene graph classification framework that is trained not only from annotated images but also from symbolic data. In our architecture, the symbolic entities are first mapped to their correspondent image-grounded representations and then fed into the relational reasoning pipeline. Even though a structured form of knowledge, such as the form in knowledge graphs, is not always available, we can generate it from unstructured texts using a transformer-based language model. We show that by fine-tuning the classification pipeline with the extracted knowledge from texts, we can achieve ~8x more accurate results in scene graph classification, ~3x in object classification, and ~1.5x in predicate classification, compared to the supervised baselines with only 1% of the annotated images.

MCML Authors
Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning

Link to Profile Volker Tresp

Volker Tresp

Prof. Dr.

Database Systems & Data Mining


[11]
Y. Elazar, N. Kassner, S. Ravfogel, A. Ravichander, E. Hovy, H. Schütze and Y. Goldberg.
Measuring and Improving Consistency in Pretrained Language Models.
Transactions of the Association for Computational Linguistics 9 (Dec. 2021). DOI
Abstract

Consistency of a model, that is, the invariance of its behavior under meaning-preserving alternations in its input, is a highly desirable property in natural language processing. In this paper we study the question: Are Pretrained Language Models (PLMs) consistent with respect to factual knowledge? To this end, we create ParaRel, a high-quality resource of cloze-style query English paraphrases. It contains a total of 328 paraphrases for 38 relations. Using ParaRel, we show that the consistency of all PLMs we experiment with is poor, though with high variance between relations. Our analysis of the representational spaces of PLMs suggests that they have a poor structure and are currently not suitable for representing knowledge robustly. Finally, we propose a method for improving model consistency and experimentally demonstrate its effectiveness.

MCML Authors
Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[10]
A. Imani, M. J. Sabet, L. K. Senel, P. Philipp, F. Yvon and H. Schütze.
Graph Algorithms for Multiparallel Word Alignment.
EMNLP 2021 - Conference on Empirical Methods in Natural Language Processing. Punta Cana, Dominican Republic, Nov 07-11, 2021. DOI
Abstract

With the advent of end-to-end deep learning approaches in machine translation, interest in word alignments initially decreased; however, they have again become a focus of research more recently. Alignments are useful for typological research, transferring formatting like markup to translated texts, and can be used in the decoding of machine translation systems. At the same time, massively multilingual processing is becoming an important NLP scenario, and pretrained language and machine translation models that are truly multilingual are proposed. However, most alignment algorithms rely on bitexts only and do not leverage the fact that many parallel corpora are multiparallel. In this work, we exploit the multiparallelity of corpora by representing an initial set of bilingual alignments as a graph and then predicting additional edges in the graph. We present two graph algorithms for edge prediction: one inspired by recommender systems and one based on network link prediction. Our experimental results show absolute improvements in F1 of up to 28% over the baseline bilingual word aligner in different datasets.
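
The idea of treating bilingual alignments as a graph and predicting missing edges can be sketched with a generic common-neighbors heuristic. This is an assumption-laden illustration rather than either of the two algorithms in the paper: nodes are hypothetical `(language, token_index)` pairs for one multiparallel sentence, and the `min_common` threshold is arbitrary.

```python
from collections import defaultdict
from itertools import combinations


def predict_edges(edges, min_common=2):
    """Propose new cross-language alignment edges between nodes that share
    at least min_common neighbors in the existing alignment graph."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    existing = {frozenset(edge) for edge in edges}
    predicted = []
    for u, v in combinations(sorted(neighbors), 2):
        if frozenset((u, v)) in existing:
            continue  # already aligned by the bilingual aligner
        if u[0] == v[0]:
            continue  # never align two tokens of the same language
        if len(neighbors[u] & neighbors[v]) >= min_common:
            predicted.append((u, v))
    return predicted


# Hypothetical initial alignments for token 0 of one sentence in four languages.
edges = [
    (("en", 0), ("de", 0)),
    (("en", 0), ("fr", 0)),
    (("es", 0), ("de", 0)),
    (("es", 0), ("fr", 0)),
]
new_edges = predict_edges(edges)
print(new_edges)
```

In this toy case the English and Spanish tokens are both aligned to the German and French tokens, so the missing English-Spanish link is proposed, and likewise the German-French link.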

MCML Authors
Link to website

Ayyoob Imani

Statistical NLP and Deep Learning

Link to website

Masoud Jalili Sabet

Dr.

* Former member

Link to website

Lütfi Kerem Şenel

Statistical NLP and Deep Learning

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[9]
N. Kassner, O. Tafjord, H. Schütze and P. Clark.
BeliefBank: Adding Memory to a Pre-Trained Language Model for a Systematic Notion of Belief.
EMNLP 2021 - Conference on Empirical Methods in Natural Language Processing. Punta Cana, Dominican Republic, Nov 07-11, 2021. DOI
Abstract

Although pretrained language models (PTLMs) contain significant amounts of world knowledge, they can still produce inconsistent answers to questions when probed, even after specialized training. As a result, it can be hard to identify what the model actually “believes” about the world, making it susceptible to inconsistent behavior and simple errors. Our goal is to reduce these problems. Our approach is to embed a PTLM in a broader system that also includes an evolving, symbolic memory of beliefs – a BeliefBank – that records but then may modify the raw PTLM answers. We describe two mechanisms to improve belief consistency in the overall system. First, a reasoning component – a weighted MaxSAT solver – revises beliefs that significantly clash with others. Second, a feedback component issues future queries to the PTLM using known beliefs as context. We show that, in a controlled experimental setting, these two mechanisms result in more consistent beliefs in the overall system, improving both the accuracy and consistency of its answers over time. This is significant as it is a first step towards PTLM-based architectures with a systematic notion of belief, enabling them to construct a more coherent picture of the world, and improve over time without model retraining.

MCML Authors
Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[8]
A. Imani, M. J. Sabet, P. Dufter, M. Cysouw and H. Schütze.
ParCourE: A Parallel Corpus Explorer for a Massively Multilingual Corpus.
ACL-IJCNLP 2021 - Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing. Bangkok, Thailand, Aug 01-06, 2021. DOI
Abstract

With more than 7000 languages worldwide, multilingual natural language processing (NLP) is essential both from an academic and commercial perspective. Researching typological properties of languages is fundamental for progress in multilingual NLP. Examples include assessing language similarity for effective transfer learning, injecting inductive biases into machine learning models or creating resources such as dictionaries and inflection tables. We provide ParCourE, an online tool that allows users to browse a word-aligned parallel corpus, covering 1334 languages. We give evidence that this is useful for typological research. ParCourE can be set up for any parallel corpus and can thus be used for typological research on other corpora as well as for exploring their quality and properties.

MCML Authors
Link to website

Ayyoob Imani

Statistical NLP and Deep Learning

Link to website

Masoud Jalili Sabet

Dr.

* Former member

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[7]
P. Dufter, N. Kassner and H. Schütze.
Static Embeddings as Efficient Knowledge Bases?
NAACL 2021 - Annual Conference of the North American Chapter of the Association for Computational Linguistics. Virtual, Jun 06-11, 2021. DOI
Abstract

Recent research investigates factual knowledge stored in large pretrained language models (PLMs). Instead of structural knowledge base (KB) queries, masked sentences such as ‘Paris is the capital of [MASK]’ are used as probes. The good performance on this analysis task has been interpreted as PLMs becoming potential repositories of factual knowledge. In experiments across ten linguistically diverse languages, we study knowledge contained in static embeddings. We show that, when restricting the output space to a candidate set, simple nearest neighbor matching using static embeddings performs better than PLMs. E.g., static embeddings perform 1.6% points better than BERT while just using 0.3% of energy for training. One important factor in their good comparative performance is that static embeddings are standardly learned for a large vocabulary. In contrast, BERT exploits its more sophisticated, but expensive ability to compose meaningful representations from a much smaller subword vocabulary.
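
Nearest neighbor matching over a restricted candidate set, as mentioned in this abstract, can be sketched in a few lines. The sketch assumes that the query is represented by the static embedding of the subject entity and that candidates are ranked by cosine similarity; both choices, the function names and the two-dimensional toy vectors are illustrative assumptions, not details taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_candidate(query, candidates, embeddings):
    """Return the candidate whose embedding is closest to the query's."""
    return max(candidates, key=lambda w: cosine(embeddings[query], embeddings[w]))


# Hypothetical 2-dimensional static embeddings.
embeddings = {
    "Paris": [0.9, 0.1],
    "France": [0.8, 0.2],
    "Japan": [0.1, 0.9],
}
answer = nearest_candidate("Paris", ["France", "Japan"], embeddings)
print(answer)
```

Restricting the search to a candidate set is what makes the comparison with a PLM's masked-token prediction meaningful, since both systems then choose from the same output space.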

MCML Authors
Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[6]
N. Kassner, P. Dufter and H. Schütze.
Multilingual LAMA: Investigating Knowledge in Multilingual Pretrained Language Models.
EACL 2021 - 16th Conference of the European Chapter of the Association for Computational Linguistics. Virtual, Apr 19-23, 2021. DOI
Abstract

Recently, it has been found that monolingual English language models can be used as knowledge bases. Instead of structural knowledge base queries, masked sentences such as “Paris is the capital of [MASK]” are used as probes. We translate the established benchmarks TREx and GoogleRE into 53 languages. Working with mBERT, we investigate three questions. (i) Can mBERT be used as a multilingual knowledge base? Most prior work only considers English. Extending research to multiple languages is important for diversity and accessibility. (ii) Is mBERT’s performance as knowledge base language-independent or does it vary from language to language? (iii) A multilingual model is trained on more text, e.g., mBERT is trained on 104 Wikipedias. Can mBERT leverage this for better performance? We find that using mBERT as a knowledge base yields varying performance across languages and pooling predictions across languages improves performance. Conversely, mBERT exhibits a language bias; e.g., when queried in Italian, it tends to predict Italy as the country of origin.

MCML Authors
Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[5]
E. Asgari, M. J. Sabet, P. Dufter, C. Ringlstetter and H. Schütze.
Subword Sampling for Low Resource Word Alignment.
Preprint (Dec. 2020). arXiv
Abstract

Annotation projection is an important area in NLP that can greatly contribute to creating language resources for low-resource languages. Word alignment plays a key role in this setting. However, most of the existing word alignment methods are designed for a high-resource setting in machine translation where millions of parallel sentences are available. This amount reduces to a few thousand sentences when dealing with low-resource languages, where the established IBM models fail. In this paper, we propose subword sampling-based alignment of text units. This method’s hypothesis is that the aggregation of different granularities of text for certain language pairs can help word-level alignment. For certain languages for which gold-standard alignments exist, we propose an iterative Bayesian optimization framework to optimize selecting possible subwords from the space of possible subword representations of the source and target sentences. We show that the subword sampling method consistently outperforms word-level alignment on six language pairs: English-German, English-French, English-Romanian, English-Persian, English-Hindi, and English-Inuktitut. In addition, we show that the hyperparameters learned for certain language pairs can be applied to other languages without supervision and consistently improve the alignment results. We observe that using 5K parallel sentences together with our proposed subword sampling approach, we obtain similar F1 scores to the use of hundreds of thousands of parallel sentences in existing word-level fast-align/eflomal alignment methods.

MCML Authors
Link to website

Masoud Jalili Sabet

Dr.

* Former member

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[4]
N. Kassner, B. Krojer and H. Schütze.
Are Pretrained Language Models Symbolic Reasoners over Knowledge?
CoNLL 2020 - 24th Conference on Computational Natural Language Learning. Virtual, Nov 19-20, 2020. DOI
Abstract

How can pretrained language models (PLMs) learn factual knowledge from the training set? We investigate the two most important mechanisms: reasoning and memorization. Prior work has attempted to quantify the number of facts PLMs learn, but we present, using synthetic data, the first study that investigates the causal relation between facts present in training and facts learned by the PLM. For reasoning, we show that PLMs seem to learn to apply some symbolic reasoning rules correctly but struggle with others, including two-hop reasoning. Further analysis suggests that even the application of learned reasoning rules is flawed. For memorization, we identify schema conformity (facts systematically supported by other facts) and frequency as key factors for its success.

MCML Authors
Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[3]
N. Kassner and H. Schütze.
BERT-kNN: Adding a kNN Search Component to Pretrained Language Models for Better QA.
EMNLP 2020 - Findings of the Conference on Empirical Methods in Natural Language Processing. Virtual, Nov 16-20, 2020. DOI
Abstract

Khandelwal et al. (2020) use a k-nearest-neighbor (kNN) component to improve language model performance. We show that this idea is beneficial for open-domain question answering (QA). To improve the recall of facts encountered during training, we combine BERT (Devlin et al., 2019) with a traditional information retrieval step (IR) and a kNN search over a large datastore of an embedded text collection. Our contributions are as follows: i) BERT-kNN outperforms BERT on cloze-style QA by large margins without any further training. ii) We show that BERT often identifies the correct response category (e.g., US city), but only kNN recovers the factually correct answer (e.g., “Miami”). iii) Compared to BERT, BERT-kNN excels for rare facts. iv) BERT-kNN can easily handle facts not covered by BERT’s training set, e.g., recent events.
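
The idea of combining a model's prediction with a nearest-neighbour lookup can be illustrated with a minimal sketch. It is not the BERT-kNN implementation: the two-dimensional vectors, the toy datastore, the exponential distance weighting, and the interpolation weight `lam` are all assumptions chosen for illustration, loosely in the spirit of the kNN interpolation of Khandelwal et al. (2020).

```python
import math
from collections import defaultdict


def knn_distribution(query, datastore, k=2):
    """Turn the k nearest datastore entries into a distribution over answers.

    `datastore` is a list of (context_vector, answer_token) pairs (toy data).
    """
    nearest = sorted((math.dist(query, vec), token) for vec, token in datastore)[:k]
    weights = defaultdict(float)
    for distance, token in nearest:
        weights[token] += math.exp(-distance)  # closer neighbours count more
    total = sum(weights.values())
    return {token: w / total for token, w in weights.items()}


def interpolate(p_model, p_knn, lam=0.5):
    """Mix the model distribution with the kNN distribution (lam is assumed)."""
    tokens = set(p_model) | set(p_knn)
    return {
        t: lam * p_knn.get(t, 0.0) + (1 - lam) * p_model.get(t, 0.0)
        for t in tokens
    }


# Hypothetical embedded text collection and model output for one cloze query.
datastore = [((0.9, 0.1), "Miami"), ((0.8, 0.2), "Miami"), ((0.1, 0.9), "Boston")]
p_model = {"Chicago": 0.5, "Miami": 0.3, "Boston": 0.2}  # right category, wrong city

p_knn = knn_distribution((0.85, 0.15), datastore, k=2)
combined = interpolate(p_model, p_knn, lam=0.5)
best = max(combined, key=combined.get)
```

In this toy setting the model alone prefers another city of the right category, while the neighbours retrieved from the datastore shift the combined distribution towards the stored answer.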

MCML Authors
Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[2]
N. Kassner and H. Schütze.
Negated and Misprimed Probes for Pretrained Language Models: Birds Can Talk, But Cannot Fly.
ACL 2020 - 58th Annual Meeting of the Association for Computational Linguistics. Virtual, Jul 05-10, 2020. DOI
Abstract

Building on Petroni et al. 2019, we propose two new probing tasks analyzing factual knowledge stored in Pretrained Language Models (PLMs). (1) Negation. We find that PLMs do not distinguish between negated (“Birds cannot [MASK]”) and non-negated (“Birds can [MASK]”) cloze questions. (2) Mispriming. Inspired by priming methods in human psychology, we add “misprimes” to cloze questions (“Talk? Birds can [MASK]”). We find that PLMs are easily distracted by misprimes. These results suggest that PLMs still have a long way to go to adequately learn human-like factual knowledge.

MCML Authors
Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


[1]
A. Beyer, G. Kauermann and H. Schütze.
Embedding Space Correlation as a Measure of Domain Similarity.
LREC 2020 - 12th International Conference on Language Resources and Evaluation. Marseille, France, May 13-15, 2020. URL
Abstract

Prior work has determined domain similarity using text-based features of a corpus. However, when using pre-trained word embeddings, the underlying text corpus might not be accessible anymore. Therefore, we propose the CCA measure, a new measure of domain similarity based directly on the dimension-wise correlations between corresponding embedding spaces. Our results suggest that an inherent notion of domain can be captured this way, as we are able to reproduce our findings for different domain comparisons for English, German, Spanish and Czech as well as in cross-lingual comparisons. We further find a threshold at which the CCA measure indicates that two corpora come from the same domain in a monolingual setting by applying permutation tests. By evaluating the usability of the CCA measure in a domain adaptation application, we also show that it can be used to determine which corpora are more similar to each other in a cross-domain sentiment detection task.

MCML Authors
Link to Profile Göran Kauermann

Göran Kauermann

Prof. Dr.

Applied Statistics in Social Sciences, Economics and Business

Link to Profile Hinrich Schütze

Hinrich Schütze

Prof. Dr.

Statistical NLP and Deep Learning


B3 | Multimodal Perception

The ability of an intelligent, mobile actor to understand egomotion as well as its surroundings is a fundamental prerequisite for the choice of actions to take. However, vast challenges remain to achieve the necessary levels of safety, which are deeply rooted in research that MCML aims to carry out: multisensor egomotion estimation and environment mapping, scene representations suitable for interaction in an open-ended environment, understanding and forecasting motion and events, and the role of uncertainty in ML blocks as modular elements.

Link to Profile Matthias Althoff

Matthias Althoff

Prof. Dr.

Cyber Physical Systems

Link to Profile Stefan Leutenegger

Stefan Leutenegger

Prof. Dr.

Machine Learning for Robotics

Link to Profile Angela P. Schöllig

Angela P. Schöllig

Prof. Dr.

Safety, Performance and Reliability of Learning Systems

Link to Profile Björn Schuller

Björn Schuller

Prof. Dr.

Health Informatics

Publication in Research Area B3
[30]
R. Stolz, H. Krasowski, J. Thumm, M. Eichelbeck, P. Gassert and M. Althoff.
Excluding the Irrelevant: Focusing Reinforcement Learning through Continuous Action Masking.
NeurIPS 2024 - 38th Conference on Neural Information Processing Systems. Vancouver, Canada, Dec 10-15, 2024. To be published. Preprint available. arXiv
Abstract

Continuous action spaces in reinforcement learning (RL) are commonly defined as multidimensional intervals. While intervals usually reflect the action boundaries for tasks well, they can be challenging for learning because the typically large global action space leads to frequent exploration of irrelevant actions. Yet, little task knowledge can be sufficient to identify significantly smaller state-specific sets of relevant actions. Focusing learning on these relevant actions can significantly improve training efficiency and effectiveness. In this paper, we propose to focus learning on the set of relevant actions and introduce three continuous action masking methods for exactly mapping the action space to the state-dependent set of relevant actions. Thus, our methods ensure that only relevant actions are executed, enhancing the predictability of the RL agent and enabling its use in safety-critical applications. We further derive the implications of the proposed methods on the policy gradient. Using proximal policy optimization (PPO), we evaluate our methods on four control tasks, where the relevant action set is computed based on the system dynamics and a relevant state set. Our experiments show that the three action masking methods achieve higher final rewards and converge faster than the baseline without action masking.
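
To make the notion of mapping a global action interval onto a state-dependent relevant set concrete, here is a minimal one-dimensional sketch. It is not one of the three masking methods from the paper: the affine rescaling, the function names, and the toy task knowledge (keep a position within a limit) are assumptions for illustration only.

```python
def relevant_interval(position, limit=1.0, max_step=0.5):
    """Hypothetical task knowledge: steps that keep |position + step| <= limit."""
    low = max(-max_step, -limit - position)
    high = min(max_step, limit - position)
    return low, high


def mask_action(action, global_bounds, relevant_bounds):
    """Affinely map an action from the global interval onto the relevant one."""
    g_low, g_high = global_bounds
    r_low, r_high = relevant_bounds
    fraction = (action - g_low) / (g_high - g_low)  # position within global range
    return r_low + fraction * (r_high - r_low)


global_bounds = (-0.5, 0.5)
position = 0.8  # close to the upper limit, so large positive steps are irrelevant
bounds = relevant_interval(position)
masked = [mask_action(a, global_bounds, bounds) for a in (-0.5, 0.0, 0.5)]
```

Every action the policy can emit lands inside the relevant interval, so the agent never executes a step that leaves the allowed region in this toy example.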

MCML Authors
Link to website

Hanna Krasowski

Dr.

Cyber Physical Systems

Link to website

Michael Eichelbeck

Cyber Physical Systems

Link to website

Philipp Gassert

Cyber Physical Systems

Link to Profile Matthias Althoff

Matthias Althoff

Prof. Dr.

Cyber Physical Systems


[29]
K. R. Scherer, F. Burkhardt, U. D. Reichel, F. Eyben and B. W. Schuller.
Using voice analysis as an early indicator of risk for depression in young adults.
Preprint (Nov. 2024). arXiv
Abstract

Increasingly frequent publications in the literature report voice quality differences between depressed patients and controls. Here, we examine the possibility of using voice analysis as an early warning signal for the development of emotion disturbances in young adults. As part of a major interdisciplinary European research project in four countries (ECoWeB), examining the effects of web-based prevention programs to reduce the risk for depression in young adults, we analyzed a large number of acoustic voice characteristics in vocal reports of emotions experienced by the participants on a specific day. We were able to identify a number of significant differences in acoustic cues, particularly with respect to the energy distribution in the voice spectrum, encouraging further research efforts to develop promising non-obtrusive risk indicators in the normal speaking voice. This is particularly important in the case of young adults who are less likely to exhibit standard risk factors for depression such as negative life experiences.

MCML Authors
Link to Profile Björn Schuller

Björn Schuller

Prof. Dr.

Health Informatics


[28]
S. Rampp, M. Milling, A. Triantafyllopoulos and B. W. Schuller.
Does the Definition of Difficulty Matter? Scoring Functions and their Role for Curriculum Learning.
Preprint (Nov. 2024). arXiv
Abstract

Curriculum learning (CL) describes a machine learning training strategy in which samples are gradually introduced into the training process based on their difficulty. Despite a partially contradictory body of evidence in the literature, CL finds popularity in deep learning research due to its promise of leveraging human-inspired curricula to achieve higher model performance. Yet, the subjectivity and biases that follow any necessary definition of difficulty, especially for those found in orderings derived from models or training statistics, have rarely been investigated. To shed more light on the underlying unanswered questions, we conduct an extensive study on the robustness and similarity of the most common scoring functions for sample difficulty estimation, as well as their potential benefits in CL, using the popular benchmark dataset CIFAR-10 and the acoustic scene classification task from the DCASE2020 challenge as representatives of computer vision and computer audition, respectively. We report a strong dependence of scoring functions on the training setting, including randomness, which can partly be mitigated through ensemble scoring. While we do not find a general advantage of CL over uniform sampling, we observe that the ordering in which data is presented for CL-based training plays an important role in model performance. Furthermore, we find that the robustness of scoring functions across random seeds positively correlates with CL performance. Finally, we uncover that models trained with different CL strategies complement each other by boosting predictive power through late fusion, likely due to differences in the learnt concepts. Alongside our findings, we release the aucurriculum toolkit, implementing sample difficulty and CL-based training in a modular fashion.
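
The interplay of a difficulty scoring function, ensemble scoring, and gradual sample introduction can be sketched as follows. This does not use the aucurriculum toolkit or its API: the function names, the averaging over seeds, and the linear pacing schedule are assumptions made for illustration.

```python
def ensemble_scores(score_lists):
    """Average per-sample difficulty scores from several runs (e.g. seeds)."""
    return [sum(column) / len(column) for column in zip(*score_lists)]


def curriculum_subsets(samples, scores, num_stages=3):
    """Yield growing training subsets, easiest samples first."""
    ranked = sorted(zip(scores, samples), key=lambda pair: pair[0])
    ordered = [sample for _, sample in ranked]
    for stage in range(1, num_stages + 1):
        # Linear pacing (assumed): stage i exposes the easiest i/num_stages share.
        cutoff = max(1, round(len(ordered) * stage / num_stages))
        yield ordered[:cutoff]


samples = ["a", "b", "c", "d", "e", "f"]
# Hypothetical difficulty scores from two training runs with different seeds.
per_seed = [
    [0.9, 0.1, 0.5, 0.3, 0.8, 0.2],
    [0.7, 0.2, 0.6, 0.4, 1.0, 0.3],
]
scores = ensemble_scores(per_seed)
stages = list(curriculum_subsets(samples, scores))
```

Swapping the scoring function or the seeds changes the ordering, which is exactly the dependence the abstract investigates.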

MCML Authors
Link to website

Andreas Triantafyllopoulos

Health Informatics

Link to Profile Björn Schuller

Björn Schuller

Prof. Dr.

Health Informatics


[27]
S. Rampp, A. Triantafyllopoulos, M. Milling and B. W. Schuller.
autrainer: A Modular and Extensible Deep Learning Toolkit for Computer Audition Tasks.
Preprint (Nov. 2024). arXiv
Abstract

This work introduces the key operating principles for autrainer, our new deep learning training framework for computer audition tasks. autrainer is a PyTorch-based toolkit that allows for rapid, reproducible, and easily extensible training on a variety of different computer audition tasks. Concretely, autrainer offers low-code training and supports a wide range of neural networks as well as preprocessing routines. In this work, we present an overview of its inner workings and key capabilities.

MCML Authors
Link to website

Andreas Triantafyllopoulos

Health Informatics

Link to Profile Björn Schuller

Björn Schuller

Prof. Dr.

Health Informatics


[26]
Q. Sun, Y. Li, E. Alturki, S. M. K. Murthy and B. W. Schuller.
Towards Friendly AI: A Comprehensive Review and New Perspectives on Human-AI Alignment.
Preprint (Nov. 2024). arXiv
Abstract

As Artificial Intelligence (AI) continues to advance rapidly, Friendly AI (FAI) has been proposed to advocate for more equitable and fair development of AI. Despite its importance, there is a lack of comprehensive reviews examining FAI from an ethical perspective, as well as limited discussion on its potential applications and future directions. This paper addresses these gaps by providing a thorough review of FAI, focusing on theoretical perspectives both for and against its development, and presenting a formal definition in a clear and accessible format. Key applications are discussed from the perspectives of eXplainable AI (XAI), privacy, fairness and affective computing (AC). Additionally, the paper identifies challenges in current technological advancements and explores future research avenues. The findings emphasise the significance of developing FAI and advocate for its continued advancement to ensure ethical and beneficial AI development.

MCML Authors
Link to Profile Björn Schuller

Björn Schuller

Prof. Dr.

Health Informatics


[25]
M. M. Amin, R. Mao, E. Cambria and B. W. Schuller.
A Wide Evaluation of ChatGPT on Affective Computing Tasks.
IEEE Transactions on Affective Computing 15.4 (Oct. 2024). DOI
Abstract

With the rise of foundation models, a new artificial intelligence paradigm has emerged, by simply using general purpose foundation models with prompting to solve problems instead of training a separate machine learning model for each problem. Such models have been shown to have emergent properties of solving problems that they were not initially trained on. The studies for the effectiveness of such models are still quite limited. In this work, we widely study the capabilities of the ChatGPT models, namely GPT-4 and GPT-3.5, on 13 affective computing problems, namely aspect extraction, aspect polarity classification, opinion extraction, sentiment analysis, sentiment intensity ranking, emotions intensity ranking, suicide tendency detection, toxicity detection, well-being assessment, engagement measurement, personality assessment, sarcasm detection, and subjectivity detection. We introduce a framework to evaluate the ChatGPT models on regression-based problems, such as intensity ranking problems, by modelling them as pairwise ranking classification. We compare ChatGPT against more traditional NLP methods, such as end-to-end recurrent neural networks and transformers. The results demonstrate the emergent abilities of the ChatGPT models on a wide range of affective computing problems, where GPT-3.5 and especially GPT-4 have shown strong performance on many problems, particularly the ones related to sentiment, emotions, or toxicity. The ChatGPT models fell short for problems with implicit signals, such as engagement measurement and subjectivity detection.
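
Recasting a regression-style task such as intensity ranking as pairwise ranking classification can be sketched as below. This is not the evaluation framework from the paper: the pair construction, the tie handling, and the `exclamation_heuristic` that stands in for a prompted model are hypothetical.

```python
from itertools import combinations


def make_pairs(texts, intensities):
    """Build (text_a, text_b, label) triples; label 0 means text_a is more intense."""
    pairs = []
    for i, j in combinations(range(len(texts)), 2):
        if intensities[i] == intensities[j]:
            continue  # ties carry no ranking information
        label = 0 if intensities[i] > intensities[j] else 1
        pairs.append((texts[i], texts[j], label))
    return pairs


def pairwise_accuracy(pairs, choose):
    """Share of pairs for which `choose` picks the more intense text."""
    correct = sum(choose(a, b) == label for a, b, label in pairs)
    return correct / len(pairs)


def exclamation_heuristic(text_a, text_b):
    """Stand-in for a model answering 'which text is more intense?'."""
    return 0 if text_a.count("!") >= text_b.count("!") else 1


texts = ["fine.", "great!", "amazing!!!", "good!!"]
intensities = [0.1, 0.5, 0.9, 0.4]  # hypothetical gold intensity scores
pairs = make_pairs(texts, intensities)
accuracy = pairwise_accuracy(pairs, exclamation_heuristic)
```

The model only has to answer a binary question per pair, and the resulting accuracy can be compared across systems without requiring calibrated numeric outputs.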

MCML Authors
Link to Profile Björn Schuller

Björn Schuller

Prof. Dr.

Health Informatics


[24]
P. Gassert and M. Althoff.
Stepping Out of the Shadows: Reinforcement Learning in Shadow Mode.
Preprint (Oct. 2024). arXiv
Abstract

Reinforcement learning (RL) is not yet competitive for many cyber-physical systems, such as robotics, process automation, and power systems, as training on a system with physical components cannot be accelerated, and simulation models do not exist or suffer from a large simulation-to-reality gap. During the long training time, expensive equipment cannot be used and might even be damaged due to inappropriate actions of the reinforcement learning agent. Our novel approach addresses exactly this problem: We train the reinforcement agent in a so-called shadow mode with the assistance of an existing conventional controller, which does not have to be trained and instantaneously performs reasonably well. In shadow mode, the agent relies on the controller to provide action samples and guidance towards favourable states to learn the task, while simultaneously estimating for which states the learned agent will receive a higher reward than the conventional controller. The RL agent will then control the system for these states and all other regions remain under the control of the existing controller. Over time, the RL agent will take over for an increasing number of states, while leaving control to the baseline where it cannot surpass the baseline's performance. Thus, we keep regret during training low and improve the performance compared to only using conventional controllers or reinforcement learning. We present and evaluate two mechanisms for deciding whether to use the RL agent or the conventional controller. The usefulness of our approach is demonstrated for a reach-avoid task, for which we are able to effectively train an agent, where standard approaches fail.
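
The state-wise hand-over between agent and controller can be sketched with a simple value comparison. The abstract does not specify the paper's two decision mechanisms, so the rule below, the `margin` parameter, and the toy value estimates are assumptions, not the authors' method.

```python
def select_controller(state, agent_value, controller_value, margin=0.0):
    """Return which policy should act in `state` (hypothetical decision rule)."""
    if agent_value(state) > controller_value(state) + margin:
        return "agent"
    return "controller"


# Toy stand-ins for estimated returns on a one-dimensional state.
def controller_value(state):
    return -abs(state)


def agent_value(state):
    # Pretend the agent has so far only mastered states near the origin.
    return -0.5 * abs(state) if abs(state) < 1.0 else -3.0 * abs(state)


choices = {
    s: select_controller(s, agent_value, controller_value) for s in (0.5, 2.0)
}
```

As the agent's estimated return improves in more states, the same rule hands over a growing share of the state space while the conventional controller keeps the rest.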

MCML Authors
Link to website

Philipp Gassert

Cyber Physical Systems

Link to Profile Matthias Althoff

Matthias Althoff

Prof. Dr.

Cyber Physical Systems


[23]
Q. Sun, A. Akman, X. Jing, M. Milling and B. W. Schuller.
Audio-based Kinship Verification Using Age Domain Conversion.
Preprint (Oct. 2024). arXiv
Abstract

Audio-based kinship verification (AKV) is important in many domains, such as home security monitoring, forensic identification, and social network analysis. A key challenge in the task arises from differences in age across samples from different individuals, which can be interpreted as a domain bias in a cross-domain verification task. To address this issue, we design the notion of an ‘age-standardised domain’ wherein we utilise the optimised CycleGAN-VC3 network to perform age-audio conversion to generate the in-domain audio. The generated audio dataset is employed to extract a range of features, which are then fed into a metric learning architecture to verify kinship. Experiments are conducted on the KAN_AV audio dataset, which contains age and kinship labels. The results demonstrate that the method markedly enhances the accuracy of kinship verification, while also offering novel insights for future kinship verification research.

MCML Authors
Link to Profile Björn Schuller

Björn Schuller

Prof. Dr.

Health Informatics


[22]
M. Milling, S. Liu, A. Triantafyllopoulos, I. Aslan and B. W. Schuller.
Audio Enhancement for Computer Audition—An Iterative Training Paradigm Using Sample Importance.
IEEE Internet of Things Journal 39 (Sep. 2024). DOI
Abstract

Neural network models for audio tasks, such as automatic speech recognition (ASR) and acoustic scene classification (ASC), are susceptible to noise contamination for real-life applications. To improve audio quality, an enhancement module, which can be developed independently, is explicitly used at the front-end of the target audio applications. In this paper, we present an end-to-end learning solution to jointly optimise the models for audio enhancement (AE) and the subsequent applications. To guide the optimisation of the AE module towards a target application, and especially to overcome difficult samples, we make use of the sample-wise performance measure as an indication of sample importance. In experiments, we consider four representative applications to evaluate our training paradigm, i.e., ASR, speech command recognition (SCR), speech emotion recognition (SER), and ASC. These applications are associated with speech and nonspeech tasks concerning semantic and non-semantic features, transient and global information, and the experimental results indicate that our proposed approach can considerably boost the noise robustness of the models, especially at low signal-to-noise ratios, for a wide range of computer audition tasks in everyday-life noisy environments.

MCML Authors
Link to website

Andreas Triantafyllopoulos

Health Informatics

Link to Profile Björn Schuller

Björn Schuller

Prof. Dr.

Health Informatics


[21]
X. Jing, K. Zhou, A. Triantafyllopoulos and B. W. Schuller.
Enhancing Emotional Text-to-Speech Controllability with Natural Language Guidance through Contrastive Learning and Diffusion Models.
Preprint (Sep. 2024). arXiv
Abstract

While current emotional text-to-speech (TTS) systems can generate highly intelligible emotional speech, achieving fine control over emotion rendering of the output speech still remains a significant challenge. In this paper, we introduce ParaEVITS, a novel emotional TTS framework that leverages the compositionality of natural language to enhance control over emotional rendering. By incorporating a text-audio encoder inspired by ParaCLAP, a contrastive language-audio pretraining (CLAP) model for computational paralinguistics, the diffusion model is trained to generate emotional embeddings based on textual emotional style descriptions. Our framework first trains on reference audio using the audio encoder, then fine-tunes a diffusion model to process textual inputs from ParaCLAP’s text encoder. During inference, speech attributes such as pitch, jitter, and loudness are manipulated using only textual conditioning. Our experiments demonstrate that ParaEVITS effectively controls emotion rendering without compromising speech quality. Speech demos are publicly available.

MCML Authors
Link to website

Andreas Triantafyllopoulos

Health Informatics

Link to Profile Björn Schuller

Björn Schuller

Prof. Dr.

Health Informatics


[20]
D. Ostermeier, J. Külz and M. Althoff.
Automatic Geometric Decomposition for Analytical Inverse Kinematics.
Preprint (Sep. 2024). arXiv
Abstract

Calculating the inverse kinematics (IK) is fundamental for motion planning in robotics. Compared to numerical or learning-based approaches, analytical IK provides higher efficiency and accuracy. However, existing analytical approaches require manual intervention, are ill-conditioned, or rely on time-consuming symbolic manipulation. In this paper, we propose a fast and stable method that enables automatic online derivation and computation of analytical inverse kinematics. Our approach is based on remodeling the kinematic chain of a manipulator to automatically decompose its IK into pre-solved geometric subproblems. We exploit intersecting and parallel joint axes to assign a given manipulator to a certain kinematic class and the corresponding subproblem decomposition. In numerical experiments, we demonstrate that our decomposition is orders of magnitude faster in deriving the IK than existing tools that employ symbolic manipulation. Following this one-time derivation, our method matches and even surpasses baselines, such as IKFast, in terms of speed and accuracy during the online computation of explicit IK solutions. Finally, we provide a C++ toolbox with Python wrappers that, for the first time, enables plug-and-play analytical IK within less than a millisecond.

MCML Authors
Link to website

Jonathan Külz

Cyber Physical Systems

Link to Profile Matthias Althoff

Matthias Althoff

Prof. Dr.

Cyber Physical Systems


[19]
S. Papatheodorou, S. Boche, S. Laina and S. Leutenegger.
Efficient Submap-based Autonomous MAV Exploration using Visual-Inertial SLAM Configurable for LiDARs or Depth Cameras.
Preprint (Sep. 2024). arXiv
Abstract

Autonomous exploration of unknown space is an essential component for the deployment of mobile robots in the real world. Safe navigation is crucial for all robotics applications and requires accurate and consistent maps of the robot’s surroundings. To achieve full autonomy and allow deployment in a wide variety of environments, the robot must rely on on-board state estimation which is prone to drift over time. We propose a Micro Aerial Vehicle (MAV) exploration framework based on local submaps to allow retaining global consistency by applying loop-closure corrections to the relative submap poses. To enable large-scale exploration we efficiently compute global, environment-wide frontiers from the local submap frontiers and use a sampling-based next-best-view exploration planner. Our method seamlessly supports using either a LiDAR sensor or a depth camera, making it suitable for different kinds of MAV platforms. We perform comparative evaluations in simulation against a state-of-the-art submap-based exploration framework to showcase the efficiency and reconstruction quality of our approach. Finally, we demonstrate the applicability of our method to real-world MAVs, one equipped with a LiDAR and the other with a depth camera.

MCML Authors
Link to Profile Stefan Leutenegger

Stefan Leutenegger

Prof. Dr.

Machine Learning for Robotics


[18]
A. Triantafyllopoulos, A. Gebhard, M. Milling, S. Rampp and B. W. Schuller.
An Automatic Analysis of Ultrasound Vocalisations for the Prediction of Interaction Context in Captive Egyptian Fruit Bats.
EUSIPCO 2024 - 32nd European Signal Processing Conference. Lyon, France, Aug 26-30, 2024. DOI
Abstract

Prior work in computational bioacoustics has mostly focused on the detection of animal presence in a particular habitat. However, animal sounds contain much richer information than mere presence; among others, they encapsulate the interactions of those animals with other members of their species. Studying these interactions is almost impossible in a naturalistic setting, as the ground truth is often lacking. The use of animals in captivity instead offers a viable alternative pathway. However, most prior works follow a traditional, statistics-based approach to analysing interactions. In the present work, we go beyond this standard framework by attempting to predict the underlying context in interactions between captive Rousettus aegyptiacus using deep neural networks. We reach an unweighted average recall of over 30% - more than thrice the chance level - and show error patterns that differ from our statistical analysis. This work thus represents an important step towards the automatic analysis of states in animals from sound.
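
Unweighted average recall (UAR) is the mean of the per-class recalls, so its chance level for C classes is 1/C regardless of class imbalance. A minimal sketch of the metric, with made-up labels rather than data from the study:

```python
from collections import Counter


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls; every class counts equally."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[label] / totals[label] for label in totals) / len(totals)


# Imbalanced toy labels: always predicting the majority class looks good on
# plain accuracy but only reaches chance level on UAR.
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
uar = unweighted_average_recall(y_true, y_pred)
```

Here plain accuracy is 0.8, whereas UAR equals the two-class chance level of 0.5, which is why UAR is preferred for imbalanced classification tasks.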

MCML Authors
Link to website

Andreas Triantafyllopoulos

Health Informatics

Link to website

Alexander Gebhard

Health Informatics

Link to Profile Björn Schuller

Björn Schuller

Prof. Dr.

Health Informatics


[17]
L. Christ, S. Amiriparian, M. Milling, I. Aslan and B. W. Schuller.
Modeling Emotional Trajectories in Written Stories Utilizing Transformers and Weakly-Supervised Learning.
ACL 2024 - Findings of the 62nd Annual Meeting of the Association for Computational Linguistics. Bangkok, Thailand, Aug 11-16, 2024. DOI
Abstract

Telling stories is an integral part of human communication which can evoke emotions and influence the affective states of the audience. Automatically modeling emotional trajectories in stories has thus attracted considerable scholarly interest. However, as most existing works have been limited to unsupervised dictionary-based approaches, there is no benchmark for this task. We address this gap by introducing continuous valence and arousal labels for an existing dataset of children’s stories originally annotated with discrete emotion categories. We collect additional annotations for this data and map the categorical labels to the continuous valence and arousal space. For predicting the thus obtained emotionality signals, we fine-tune a DeBERTa model and improve upon this baseline via a weakly supervised learning approach. The best configuration achieves a Concordance Correlation Coefficient (CCC) of .8221 for valence and .7125 for arousal on the test set, demonstrating the efficacy of our proposed approach. A detailed analysis shows the extent to which the results vary depending on factors such as the author, the individual story, or the section within the story. In addition, we uncover the weaknesses of our approach by investigating examples that prove to be difficult to predict.
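
The Concordance Correlation Coefficient used as the evaluation metric combines correlation with agreement in mean and scale: CCC = 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²). A minimal sketch using population statistics and made-up values, not data from the paper:

```python
from statistics import fmean


def concordance_cc(x, y):
    """Concordance Correlation Coefficient with population (biased) statistics."""
    mean_x, mean_y = fmean(x), fmean(y)
    var_x = fmean([(a - mean_x) ** 2 for a in x])
    var_y = fmean([(b - mean_y) ** 2 for b in y])
    cov = fmean([(a - mean_x) * (b - mean_y) for a, b in zip(x, y)])
    return 2 * cov / (var_x + var_y + (mean_x - mean_y) ** 2)


gold = [1.0, 2.0, 3.0]
shifted = [2.0, 3.0, 4.0]  # perfectly correlated, but offset by a constant

ccc_identical = concordance_cc(gold, gold)
ccc_shifted = concordance_cc(gold, shifted)
```

Unlike Pearson correlation, which would be 1 for the shifted predictions, CCC penalises the constant offset and drops to 4/7 in this example.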

MCML Authors
Link to website

Shahin Amiriparian

Dr.

Health Informatics

Link to Profile Björn Schuller

Björn Schuller

Prof. Dr.

Health Informatics


[16]
T. Rajapakshe, R. Rana, S. Khalifa, B. Sisman, B. W. Schuller and C. Busso.
emoDARTS: Joint Optimization of CNN and Sequential Neural Network Architectures for Superior Speech Emotion Recognition.
IEEE Access 12 (Aug. 2024). DOI
Abstract

Speech Emotion Recognition (SER) is crucial for enabling computers to understand the emotions conveyed in human communication. With recent advancements in Deep Learning (DL), the performance of SER models has significantly improved. However, designing an optimal DL architecture requires specialised knowledge and experimental assessments. Fortunately, Neural Architecture Search (NAS) provides a potential solution for automatically determining the best DL model. The Differentiable Architecture Search (DARTS) is a particularly efficient method for discovering optimal models. This study presents emoDARTS, a DARTS-optimised joint CNN and Sequential Neural Network (SeqNN: LSTM, RNN) architecture that enhances SER performance. The literature supports the selection of CNN and LSTM coupling to improve performance. While DARTS has previously been used to choose CNN and LSTM operations independently, our technique adds a novel mechanism for selecting CNN and SeqNN operations in conjunction using DARTS. Unlike earlier work, we do not impose limits on the layer order of the CNN. Instead, we let DARTS choose the best layer order inside the DARTS cell. We demonstrate that emoDARTS outperforms conventionally designed CNN-LSTM models and surpasses the best-reported SER results achieved through DARTS on CNN-LSTM by evaluating our approach on the IEMOCAP, MSP-IMPROV, and MSP-Podcast datasets.

MCML Authors
Link to Profile Björn Schuller

Björn Schuller

Prof. Dr.

Health Informatics


[15]
W. Qiu, Y. Feng, Y. Li, Y. Chang, K. Qian, B. Hu, Y. Yamamoto and B. W. Schuller.
Fed-MStacking: Heterogeneous Federated Learning With Stacking Misaligned Labels for Abnormal Heart Sound Detection.
IEEE Journal of Biomedical and Health Informatics 28.9 (Jul. 2024). DOI
Abstract

Ubiquitous sensing has been widely applied in smart healthcare, providing an opportunity for intelligent heart sound auscultation. However, smart devices contain sensitive information, raising user privacy concerns. To this end, federated learning (FL) has been adopted as an effective solution, enabling decentralised learning without data sharing, thus preserving data privacy in the Internet of Health Things (IoHT). Nevertheless, traditional FL requires the same architectural models to be trained across local clients and global servers, leading to a lack of model heterogeneity and client personalisation. For medical institutions with private data clients, this study proposes Fed-MStacking, a heterogeneous FL framework that incorporates a stacking ensemble learning strategy to support clients in building their own models. The secondary objective of this study is to address scenarios involving local clients with data characterised by inconsistent labelling. Specifically, the local client contains only one case type, and the data cannot be shared within or outside the institution. To train a global multi-class classifier, we aggregate missing class information from all clients at each institution and build meta-data, which then participates in FL training via a meta-learner. We apply the proposed framework to a multi-institutional heart sound database. The experiments utilise random forests (RFs), feedforward neural networks (FNNs), and convolutional neural networks (CNNs) as base classifiers. The results show that the heterogeneous stacking of local models performs better compared to homogeneous stacking.

MCML Authors
Link to Profile Björn Schuller

Björn Schuller

Prof. Dr.

Health Informatics


[14]
M. Gerczuk, S. Amiriparian, J. Lutz, W. Strube, I. Papazova, A. Hasan and B. W. Schuller.
Exploring Gender-Specific Speech Patterns in Automatic Suicide Risk Assessment.
Preprint (Jul. 2024). arXiv
Abstract

In emergency medicine, timely intervention for patients at risk of suicide is often hindered by delayed access to specialised psychiatric care. To bridge this gap, we introduce a speech-based approach for automatic suicide risk assessment. Our study involves a novel dataset comprising speech recordings of 20 patients who read neutral texts. We extract four speech representations encompassing interpretable and deep features. Further, we explore the impact of gender-based modelling and phrase-level normalisation. By applying gender-exclusive modelling, features extracted from an emotion fine-tuned wav2vec2.0 model can be utilised to discriminate high- from low-suicide risk with a balanced accuracy of 81%. Finally, our analysis reveals a discrepancy in the relationship of speech characteristics and suicide risk between female and male subjects. For men in our dataset, suicide risk increases together with agitation while voice characteristics of female subjects point the other way.

MCML Authors
Maurice Gerczuk

Health Informatics

Shahin Amiriparian

Dr.

Health Informatics

Björn Schuller

Prof. Dr.

Health Informatics


[13]
S. Amiriparian, F. Packań, M. Gerczuk and B. W. Schuller.
ExHuBERT: Enhancing HuBERT Through Block Extension and Fine-Tuning on 37 Emotion Datasets.
Preprint (Jun. 2024). arXiv
Abstract

Foundation models have shown great promise in speech emotion recognition (SER) by leveraging their pre-trained representations to capture emotion patterns in speech signals. To further enhance SER performance across various languages and domains, we propose a novel twofold approach. First, we gather EmoSet++, a comprehensive multi-lingual, multi-cultural speech emotion corpus with 37 datasets, 150,907 samples, and a total duration of 119.5 hours. Second, we introduce ExHuBERT, an enhanced version of HuBERT achieved by backbone extension and fine-tuning on EmoSet++. We duplicate each encoder layer and its weights, then freeze the first duplicate, integrating an extra zero-initialized linear layer and skip connections to preserve functionality and ensure its adaptability for subsequent fine-tuning. Our evaluation on unseen datasets shows the efficacy of ExHuBERT, setting a new benchmark for various SER tasks.

MCML Authors
Shahin Amiriparian

Dr.

Health Informatics

Maurice Gerczuk

Health Informatics

Björn Schuller

Prof. Dr.

Health Informatics


[12]
L. Christ, S. Amiriparian, F. Hawighorst, A.-K. Schill, A. Boutalikakis, L. Graf-Vlachy, A. König and B. W. Schuller.
This Paper Had the Smartest Reviewers -- Flattery Detection Utilising an Audio-Textual Transformer-Based Approach.
Preprint (Jun. 2024). arXiv
Abstract

Flattery is an important aspect of human communication that facilitates social bonding, shapes perceptions, and influences behavior through strategic compliments and praise, leveraging the power of speech to build rapport effectively. Its automatic detection can thus enhance the naturalness of human-AI interactions. To meet this need, we present a novel audio-textual dataset comprising 20 hours of speech and train machine learning models for automatic flattery detection. In particular, we employ pretrained AST, Wav2Vec2, and Whisper models for the speech modality, and Whisper TTS models combined with a RoBERTa text classifier for the textual modality. Subsequently, we build a multimodal classifier by combining text and audio representations. Evaluation on unseen test data demonstrates promising results, with Unweighted Average Recall scores reaching 82.46% in audio-only experiments, 85.97% in text-only experiments, and 87.16% using a multimodal approach.

MCML Authors
Shahin Amiriparian

Dr.

Health Informatics

Björn Schuller

Prof. Dr.

Health Informatics


[11]
W. Qiu, C. Quan, L. Zhu, Y. Yu, Z. Wang, Y. Ma, M. Sun, Y. Chang, K. Qian, B. Hu, Y. Yamamoto and B. W. Schuller.
Heart Sound Abnormality Detection From Multi-Institutional Collaboration: Introducing a Federated Learning Framework.
IEEE Transactions on Biomedical Engineering 71.10 (May 2024). DOI
Abstract

Objective: Early diagnosis of cardiovascular diseases is a crucial task in medical practice. With the application of computer audition in the healthcare field, artificial intelligence (AI) has been applied to clinical non-invasive intelligent auscultation of heart sounds to provide rapid and effective pre-screening. However, AI models generally require large amounts of data which may cause privacy issues. Unfortunately, it is difficult to collect large amounts of healthcare data from a single centre. Methods: In this study, we propose federated learning (FL) optimisation strategies for the practical application in multi-centre institutional heart sound databases. The horizontal FL is mainly employed to tackle the privacy problem by aligning the feature spaces of FL participating institutions without information leakage. In addition, techniques based on deep learning have poor interpretability due to their “black-box” property, which limits the feasibility of AI in real medical data. To this end, vertical FL is utilised to address the issues of model interpretability and data scarcity. Conclusion: Experimental results demonstrate that the proposed FL framework can achieve good performance for heart sound abnormality detection while taking personal privacy protection into account. Moreover, using the federated feature space is beneficial to balance the interpretability of the vertical FL and the privacy of the data. Significance: This work realises the potential of FL from research to clinical practice, and is expected to have extensive application in the federated smart medical system.

MCML Authors
Björn Schuller

Prof. Dr.

Health Informatics


[10]
H. Krasowski and M. Althoff.
Provable Traffic Rule Compliance in Safe Reinforcement Learning on the Open Sea.
IEEE Transactions on Intelligent Vehicles Early Access (May 2024). DOI
Abstract

For safe operation, autonomous vehicles have to obey traffic rules that are set forth in legal documents formulated in natural language. Temporal logic is a suitable concept to formalize such traffic rules. Still, temporal logic rules often result in constraints that are hard to solve using optimization-based motion planners. Reinforcement learning (RL) is a promising method to find motion plans for autonomous vehicles. However, vanilla RL algorithms are based on random exploration and do not automatically comply with traffic rules. Our approach accomplishes guaranteed rule-compliance by integrating temporal logic specifications into RL. Specifically, we consider the application of vessels on the open sea, which must adhere to the Convention on the International Regulations for Preventing Collisions at Sea (COLREGS). To efficiently synthesize rule-compliant actions, we combine predicates based on set-based prediction with a statechart representing our formalized rules and their priorities. Action masking then restricts the RL agent to this set of verified rule-compliant actions. In numerical evaluations on critical maritime traffic situations, our agent always complies with the formalized legal rules and never collides while achieving a high goal-reaching rate during training and deployment. In contrast, vanilla and traffic rule-informed RL agents frequently violate traffic rules and collide even after training.
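The action-masking idea in this abstract can be pictured with a minimal sketch. It assumes a discrete action set; `q_values` and `is_compliant` are hypothetical names, the latter standing in for the verified rule-compliant action set that the paper derives from its statechart and set-based prediction. This is an illustration of generic action masking, not the authors' implementation.

```python
import math

def masked_softmax(q_values, is_compliant):
    """Return action probabilities with zero mass on masked (non-compliant) actions."""
    if not any(is_compliant):
        raise ValueError("at least one action must be rule-compliant")
    # Subtract the largest allowed value for numerical stability.
    top = max(q for q, ok in zip(q_values, is_compliant) if ok)
    weights = [math.exp(q - top) if ok else 0.0
               for q, ok in zip(q_values, is_compliant)]
    total = sum(weights)
    return [w / total for w in weights]

def greedy_masked_action(q_values, is_compliant):
    """Pick the highest-valued action among the compliant ones only."""
    allowed = [i for i, ok in enumerate(is_compliant) if ok]
    if not allowed:
        raise ValueError("at least one action must be rule-compliant")
    return max(allowed, key=lambda i: q_values[i])

# Hypothetical example: action 1 has the highest value but is masked out.
probs = masked_softmax([2.0, 5.0, 1.0], [True, False, True])
choice = greedy_masked_action([2.0, 5.0, 1.0], [True, False, True])
```

Because the mask is applied before an action is selected, a non-compliant action can never be sampled, regardless of how the agent currently values it.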

MCML Authors
Hanna Krasowski

Dr.

Cyber Physical Systems

Matthias Althoff

Prof. Dr.

Cyber Physical Systems


[9]
S. Amiriparian, M. Gerczuk, J. Lutz, W. Strube, I. Papazova, A. Hasan, A. Kathan and B. W. Schuller.
Non-Invasive Suicide Risk Prediction Through Speech Analysis.
Preprint (Apr. 2024). arXiv
Abstract

The delayed access to specialized psychiatric assessments and care for patients at risk of suicidal tendencies in emergency departments creates a notable gap in timely intervention, hindering the provision of adequate mental health support during critical situations. To address this, we present a non-invasive, speech-based approach for automatic suicide risk assessment. For our study, we collected a novel speech recording dataset from 20 patients. We extract three sets of features, including wav2vec, interpretable speech and acoustic features, and deep learning-based spectral representations. We proceed by conducting a binary classification to assess suicide risk in a leave-one-subject-out fashion. Our most effective speech model achieves a balanced accuracy of 66.2%. Moreover, we show that integrating our speech model with a series of patients’ metadata, such as the history of suicide attempts or access to firearms, improves the overall result. The metadata integration yields a balanced accuracy of 94.4%, marking an absolute improvement of 28.2%, demonstrating the efficacy of our proposed approaches for automatic suicide risk assessment in emergency medicine.
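The leave-one-subject-out protocol mentioned in this abstract can be sketched as follows. The function name and the example subject identifiers are hypothetical; the sketch only shows how the folds are formed, not the features or classifier used in the study.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for each subject.

    Every recording of the held-out subject goes into the test fold, so no
    speaker appears in both the training and the test data of a fold.
    """
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test

# Hypothetical example: five recordings from three subjects.
subjects = ["p1", "p1", "p2", "p3", "p3"]
folds = list(leave_one_subject_out(subjects))
```

With 20 patients, as in the dataset described above, this scheme would produce 20 folds, each tested on a single unseen patient.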

MCML Authors
Shahin Amiriparian

Dr.

Health Informatics

Maurice Gerczuk

Health Informatics

Alexander Kathan

Health Informatics

Björn Schuller

Prof. Dr.

Health Informatics


[8]
A. Triantafyllopoulos and B. W. Schuller.
Expressivity and Speech Synthesis.
Preprint (Apr. 2024). arXiv
Abstract

Imbuing machines with the ability to talk has been a longtime pursuit of artificial intelligence (AI) research. From the very beginning, the community has not only aimed to synthesise high-fidelity speech that accurately conveys the semantic meaning of an utterance, but also to colour it with inflections that cover the same range of affective expressions that humans are capable of. After many years of research, it appears that we are on the cusp of achieving this when it comes to single, isolated utterances. This unveils an abundance of potential avenues to explore when it comes to combining these single utterances with the aim of synthesising more complex, longer-term behaviours. In the present chapter, we outline the methodological advances that brought us so far and sketch out the ongoing efforts to reach that coveted next level of artificial expressivity. We also discuss the societal implications coupled with rapidly advancing expressive speech synthesis (ESS) technology and highlight ways to mitigate those risks and ensure the alignment of ESS capabilities with ethical norms.

MCML Authors
Andreas Triantafyllopoulos

Health Informatics

Björn Schuller

Prof. Dr.

Health Informatics


[7]
M. M. Amin and B. W. Schuller.
On Prompt Sensitivity of ChatGPT in Affective Computing.
Preprint (Mar. 2024). arXiv
Abstract

Recent studies have demonstrated the emerging capabilities of foundation models like ChatGPT in several fields, including affective computing. However, accessing these emerging capabilities is facilitated through prompt engineering. Despite the existence of some prompting techniques, the field is still rapidly evolving and many prompting ideas still require investigation. In this work, we introduce a method to evaluate and investigate the sensitivity of the performance of foundation models based on different prompts or generation parameters. We perform our evaluation on ChatGPT within the scope of affective computing on three major problems, namely sentiment analysis, toxicity detection, and sarcasm detection. First, we carry out a sensitivity analysis on pivotal parameters in auto-regressive text generation, specifically the temperature parameter T and the top-p parameter in Nucleus sampling, dictating how conservative or creative the model should be during generation. Furthermore, we explore the efficacy of several prompting ideas, where we explore how giving different incentives or structures affects the performance. Our evaluation takes into consideration performance measures on the affective computing tasks, and the effectiveness of the model to follow the stated instructions, hence generating easy-to-parse responses to be smoothly used in downstream applications.
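The two generation parameters named in this abstract can be illustrated with a small sketch of temperature scaling and nucleus (top-p) filtering over a toy next-token distribution. The function names and the example values are hypothetical; this shows the generic sampling technique, not ChatGPT's internals or the paper's evaluation code.

```python
import math
import random

def temperature_softmax(logits, temperature):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus_filter(probs, top_p):
    """Keep the smallest set of most probable tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass  # renormalise over the kept tokens
    return filtered

def sample_token(probs, rng):
    """Draw one token index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs)[0]

# Hypothetical example: four candidate tokens.
filtered = nucleus_filter([0.5, 0.3, 0.15, 0.05], top_p=0.7)
token = sample_token(filtered, random.Random(0))
```

A low temperature or a small top-p concentrates probability on the most likely tokens (more conservative output), while higher values leave more of the tail available for sampling.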

MCML Authors
Björn Schuller

Prof. Dr.

Health Informatics


[6]
A. Mallol-Ragolta and B. W. Schuller.
Coupling Sentiment and Arousal Analysis Towards an Affective Dialogue Manager.
IEEE Access 12 (Feb. 2024). DOI
Abstract

We present the technologies and host components developed to power a speech-based dialogue manager with affective capabilities. The overall goal is that the system adapts its response to the sentiment and arousal level of the user inferred by analysing the linguistic and paralinguistic information embedded in his or her interaction. A linguistic-based, dedicated sentiment analysis component determines the body of the system response. A paralinguistic-based, dedicated arousal recognition component adjusts the energy level to convey in the affective system response. The sentiment analysis model is trained using the CMU-MOSEI dataset and implements a hierarchical contextual attention fusion network, which scores an Unweighted Average Recall (UAR) of 79.04% on the test set when tackling the task as a binary classification problem. The arousal recognition model is trained using the MSP-Podcast corpus. This model extracts the Mel-spectrogram representations of the speech signals, which are exploited with a Convolutional Neural Network (CNN) trained from scratch, and scores a UAR of 61.11% on the test set when tackling the task as a three-class classification problem. Furthermore, we highlight two sample dialogues implemented at the system back-end to detail how the sentiment and arousal inferences are coupled to determine the affective system response. These are also showcased in a proof of concept demonstrator. We publicly release the trained models to provide the research community with off-the-shelf sentiment analysis and arousal recognition tools.
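Unweighted Average Recall, the metric reported in this abstract, is the mean of the per-class recalls, so every class counts equally regardless of how many samples it has. A minimal sketch with hypothetical labels:

```python
from collections import defaultdict

def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recall over the classes that occur in y_true."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    hits = defaultdict(int)
    counts = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        counts[truth] += 1
        if truth == pred:
            hits[truth] += 1
    recalls = [hits[c] / counts[c] for c in counts]
    return sum(recalls) / len(recalls)

# Hypothetical imbalanced example: always predicting the majority class.
y_true = [0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0]
uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this example plain accuracy is 0.8 while UAR is 0.5, which is chance level for two classes; this is why UAR is preferred on imbalanced corpora.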

MCML Authors
Adria Mallol-Ragolta

Health Informatics

Björn Schuller

Prof. Dr.

Health Informatics


[5]
J. Xie, Y. Shi, D. Ni, M. Milling, S. Liu, J. Zhang, K. Qian and B. W. Schuller.
Automatic Bird Sound Source Separation Based on Passive Acoustic Devices in Wild Environment.
IEEE Internet of Things Journal 11.9 (Jan. 2024). DOI
Abstract

The Internet of Things (IoT)-based passive acoustic monitoring (PAM) has shown great potential in large-scale remote bird monitoring. However, field recordings often contain overlapping signals, making precise bird information extraction challenging. To solve this challenge, first, the interchannel spatial feature is chosen as complementary information to the spectral feature to obtain additional spatial correlations between the sources. Then, an end-to-end model named BACPPNet is built based on Deeplabv3plus and enhanced with the polarized self-attention mechanism to estimate the spectral magnitude mask (SMM) for separating bird vocalizations. Finally, the separated bird vocalizations are recovered from SMMs and the spectrogram of mixed audio using the inverse short-time Fourier transform (ISTFT). We evaluate our proposed method utilizing the generated mixed data set. Experiments have shown that our method can separate bird vocalizations from mixed audio with root mean square error (RMSE), source-to-distortion ratio (SDR), source-to-interference ratio (SIR), source-to-artifact ratio (SAR), and short-time objective intelligibility (STOI) values of 2.82, 10.00 dB, 29.90 dB, 11.08 dB, and 0.66, respectively, which are better than existing methods. Furthermore, the average classification accuracy of the separated bird vocalizations drops the least. This indicates that our method outperforms other compared separation methods in bird sound separation and preserves the fidelity of the separated sound sources, which might help us better understand wild bird sound recordings.
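The masking step described in this abstract can be sketched on a toy spectrogram. The sketch assumes the common convention that the estimated mask scales the magnitude of each time-frequency bin of the mixture while the mixture phase is reused; whether the paper handles phase this way is an assumption here, and the inverse transform back to a waveform is omitted. The function name and values are hypothetical.

```python
def apply_magnitude_mask(mixture_stft, mask):
    """Scale each complex time-frequency bin of the mixture by a real-valued mask.

    Multiplying a complex bin by a non-negative real number changes its
    magnitude and leaves its phase untouched.
    """
    if len(mixture_stft) != len(mask):
        raise ValueError("mask and spectrogram must have the same shape")
    separated = []
    for spec_row, mask_row in zip(mixture_stft, mask):
        if len(spec_row) != len(mask_row):
            raise ValueError("mask and spectrogram must have the same shape")
        separated.append([m * x for x, m in zip(spec_row, mask_row)])
    return separated

# Hypothetical example: one frequency bin, two time frames.
mixture = [[3 + 4j, 1 + 0j]]
mask = [[0.5, 0.0]]
source = apply_magnitude_mask(mixture, mask)
```

One mask per source is applied to the same mixture spectrogram; the masked spectrogram of each source would then be passed to an ISTFT to obtain the separated audio.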

MCML Authors
Björn Schuller

Prof. Dr.

Health Informatics


[4]
Y. Xin, X. Zuo, D. Lu and S. Leutenegger.
SimpleMapping: Real-time visual-inertial dense mapping with deep multi-view stereo.
ISMAR 2023 - IEEE/ACM International Symposium on Mixed and Augmented Reality. Sydney, Australia, Oct 16-20, 2023. DOI
Abstract

We present a real-time visual-inertial dense mapping method capable of performing incremental 3D mesh reconstruction with high quality using only sequential monocular images and inertial measurement unit (IMU) readings. 6-DoF camera poses are estimated by a robust feature-based visual-inertial odometry (VIO), which also generates noisy sparse 3D map points as a by-product. We propose a sparse point aided multi-view stereo neural network (SPA-MVSNet) that can effectively leverage the informative but noisy sparse points from the VIO system. The sparse depth from VIO is firstly completed by a single-view depth completion network. This dense depth map, although naturally limited in accuracy, is then used as a prior to guide our MVS network in the cost volume generation and regularization for accurate dense depth prediction. Predicted depth maps of keyframe images by the MVS network are incrementally fused into a global map using TSDF-Fusion. We extensively evaluate both the proposed SPA-MVSNet and the entire dense mapping system on several public datasets as well as our own dataset, demonstrating the system’s impressive generalization capabilities and its ability to deliver high-quality 3D reconstruction online. Our proposed dense mapping system achieves a 39.7% improvement in F-score over existing systems when evaluated on the challenging scenarios of the EuRoC dataset.
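The incremental fusion of depth maps into a global map can be illustrated with the standard per-voxel weighted running average used in volumetric TSDF fusion. This is a simplified sketch of that generic update for a single voxel; the function names, the unit observation weight, and the optional weight cap are assumptions and do not describe the paper's implementation.

```python
def truncate_sdf(signed_distance, truncation):
    """Normalise a signed distance by the truncation band and clamp to [-1, 1]."""
    if truncation <= 0:
        raise ValueError("truncation must be positive")
    return max(-1.0, min(1.0, signed_distance / truncation))

def fuse_voxel(tsdf, weight, new_tsdf, new_weight=1.0, max_weight=None):
    """Merge a new observation into a voxel with a weighted running average."""
    total = weight + new_weight
    if total <= 0:
        return tsdf, weight  # nothing observed yet, keep the voxel unchanged
    fused = (weight * tsdf + new_weight * new_tsdf) / total
    if max_weight is not None:
        total = min(total, max_weight)  # cap so old data can still be revised
    return fused, total

# Hypothetical example: two keyframes observe the same, initially empty, voxel.
value, w = fuse_voxel(0.0, 0.0, truncate_sdf(0.04, 0.1))
value, w = fuse_voxel(value, w, truncate_sdf(0.02, 0.1))
```

Averaging over keyframes suppresses noise in individual depth predictions, and the surface can later be extracted where the fused value crosses zero.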

MCML Authors
Xingxing Zuo

Dr.

Machine Learning for Robotics

Stefan Leutenegger

Prof. Dr.

Machine Learning for Robotics


[3]
J. Külz, M. Mayer and M. Althoff.
Timor Python: A Toolbox for Industrial Modular Robotics.
IROS 2023 - IEEE/RSJ International Conference on Intelligent Robots and Systems. Detroit, MI, USA, Oct 01-05, 2023. DOI
Abstract

Modular Reconfigurable Robots (MRRs) represent an exciting path forward for industrial robotics, opening up new possibilities for robot design. Compared to monolithic manipulators, they promise greater flexibility, improved maintainability, and cost-efficiency. However, there is no tool or standardized way to model and simulate assemblies of modules in the same way it has been done for robotic manipulators for decades. We introduce the Toolbox for Industrial Modular Robotics (Timor), a Python toolbox to bridge this gap and integrate modular robotics into existing simulation and optimization pipelines. Our open-source library offers model generation and task-based configuration optimization for MRRs. It can easily be integrated with existing simulation tools - not least by offering URDF export of arbitrary modular robot assemblies. Moreover, our experimental study demonstrates the effectiveness of Timor as a tool for designing modular robots optimized for specific use cases.

MCML Authors
Jonathan Külz

Cyber Physical Systems

Matthias Althoff

Prof. Dr.

Cyber Physical Systems


[2]
X. Zuo, N. Yang, N. Merrill, B. Xu and S. Leutenegger.
Incremental Dense Reconstruction from Monocular Video with Guided Sparse Feature Volume Fusion.
IEEE Robotics and Automation Letters 8.6 (Jun. 2023). DOI
Abstract

Incrementally recovering 3D dense structures from monocular videos is of paramount importance since it enables various robotics and AR applications. Feature volumes have recently been shown to enable efficient and accurate incremental dense reconstruction without the need to first estimate depth, but they are not able to achieve as high a resolution as depth-based methods due to the large memory consumption of high-resolution feature volumes. This letter proposes a real-time feature volume-based dense reconstruction method that predicts TSDF (Truncated Signed Distance Function) values from a novel sparsified deep feature volume, which is able to achieve higher resolutions than previous feature volume-based methods, and is favorable in outdoor large-scale scenarios where the majority of voxels are empty. An uncertainty-aware multi-view stereo (MVS) network is leveraged to infer initial voxel locations of the physical surface in a sparse feature volume. Then for refining the recovered 3D geometry, deep features are attentively aggregated from multi-view images at potential surface locations, and temporally fused. Besides achieving higher resolutions than before, our method is shown to produce more complete reconstructions with finer detail in many cases. Extensive evaluations on both public and self-collected datasets demonstrate a very competitive real-time reconstruction result for our method compared to state-of-the-art reconstruction methods in both indoor and outdoor settings.

MCML Authors
Xingxing Zuo

Dr.

Machine Learning for Robotics

Stefan Leutenegger

Prof. Dr.

Machine Learning for Robotics


[1]
T. Ladner and M. Althoff.
Automatic Abstraction Refinement in Neural Network Verification Using Sensitivity Analysis.
HSCC 2023 - 26th ACM International Conference on Hybrid Systems: Computation and Control. San Antonio, TX, USA, May 09-12, 2023. DOI
Abstract

The formal verification of neural networks is essential for their application in safety-critical environments. However, the set-based verification of neural networks using linear approximations often obtains overly conservative results, while nonlinear approximations quickly become computationally infeasible in deep neural networks. We address this issue for the first time by automatically balancing between precision and computation time without splitting the propagated set. Our work introduces a novel automatic abstraction refinement approach using sensitivity analysis to iteratively reduce the abstraction error at the neuron level until either the specifications are met or a maximum number of iterations is reached. Our evaluation shows that we can tightly over-approximate the output sets of deep neural networks and that our approach is up to a thousand times faster than a naive approach. We further demonstrate the applicability of our approach in closed-loop settings.
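To make the notion of set-based verification concrete, the sketch below propagates an axis-aligned box through one affine layer followed by a ReLU. Plain interval arithmetic is used here as a deliberately simpler stand-in; it is not the set representation or the refinement procedure of the paper, and all names and numbers are hypothetical.

```python
def affine_bounds(lower, upper, weights, bias):
    """Bound W x + b for every x with lower <= x <= upper (element-wise)."""
    out_lower, out_upper = [], []
    for row, b in zip(weights, bias):
        lo = hi = b
        for w, l, u in zip(row, lower, upper):
            if w >= 0:
                lo += w * l
                hi += w * u
            else:
                # A negative weight swaps which input bound is extreme.
                lo += w * u
                hi += w * l
        out_lower.append(lo)
        out_upper.append(hi)
    return out_lower, out_upper

def relu_bounds(lower, upper):
    """ReLU is monotone, so it can be applied to both bounds directly."""
    return [max(0.0, l) for l in lower], [max(0.0, u) for u in upper]

# Hypothetical example: a 2-input, 2-neuron layer over the box [-1, 1] x [0, 2].
pre_lo, pre_hi = affine_bounds([-1.0, 0.0], [1.0, 2.0],
                               [[1.0, -1.0], [2.0, 0.0]], [0.0, -1.0])
post_lo, post_hi = relu_bounds(pre_lo, pre_hi)
```

The resulting box is guaranteed to contain every reachable output, but it ignores dependencies between neurons and grows looser with depth. This over-conservatism is the kind of abstraction error that refinement approaches such as the one described above aim to reduce.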

MCML Authors
Tobias Ladner

Cyber Physical Systems

Matthias Althoff

Prof. Dr.

Cyber Physical Systems


