Studying the impact of data noise in historical collections

Last updated on Nov 2, 2023

Historical collections have undergone large-scale digitisation campaigns, making a wealth of data available to researchers. Most of these texts have been extracted by automated means, using Handwritten Text Recognition or Optical Character Recognition (HTR/OCR). Such extraction is notoriously error-prone, and its impact on downstream uses of the texts, such as information extraction and text analysis, remains largely unknown. This project aims to systematically assess the impact of this noise and to establish empirical guidelines that answer two practical questions: when is HTR/OCR quality good enough, and how can it be improved?

Machine Learning

Publications

Assessing the Impact of OCR Quality on Downstream NLP Tasks