Beyond 'Noisy' Text: How (And Why) to Process Dialect Data

MCML Authors

Verena Blaschke

→ Group Barbara Plank
AI and Computational Linguistics

Abstract

Processing data from non-standard dialects links two lines of research: creating NLP tools that are robust to 'noisy' inputs, and extending the coverage of NLP tools to underserved language communities. In this talk, I will describe ways in which processing dialect data differs from processing standard-language data, and discuss some of the current challenges in dialect NLP research. For instance, I will talk about strategies to mitigate the effect of infelicitous subword tokenization caused by ad-hoc pronunciation spellings. Additionally, I will argue that we should consider not only how to tackle dialectal variation in NLP, but also why. To this end, I will highlight the perspectives of some dialect speaker communities on which language technologies should (or should not) be able to process dialectal input or produce dialectal output.

W-NUT @NAACL 2025

10th Workshop on Noisy and User-generated Text at the Annual Conference of the North American Chapter of the Association for Computational Linguistics. Albuquerque, NM, USA, Apr 29-May 04, 2025. Keynote Talk.