Studying the impact of data noise in historical collections

Historical collections have undergone large-scale digitisation campaigns, making a wealth of data available to researchers. Most of these texts have been extracted by automated means, via Handwritten Text Recognition or Optical Character Recognition (HTR/OCR). Such extraction is notoriously error-prone, and the effect of these errors on downstream uses of the texts, such as information extraction and text analysis, remains largely unknown. This project aims to systematically assess the impact of this noise and to establish empirical guidelines that answer two practical questions: when is HTR/OCR quality good enough, and how can it be improved?