Publications

Unsilencing colonial archives via automated entity recognition

Colonial archives are at the center of increased interest from a variety of perspectives, as they contain traces of historically marginalized people. Unfortunately, like most archives, they remain difficult to access due to significant, persistent barriers. We focus here on one of them: the biases found in historical finding aids, such as indexes of person names, which remain in use to this day. In colonial archives, indexes can perpetuate silences by omitting mentions of historically marginalized persons. In order to overcome such limitations and pluralize the scope of existing finding aids, we propose using automated entity recognition. To this end, we contribute a fit-for-purpose annotation typology and apply it to the colonial archive of the Dutch East India Company (VOC). We release a corpus of nearly 70,000 annotations as a shared task, for which we provide baselines using state-of-the-art neural network models. Our work intends to stimulate further contributions towards broadening access to (colonial) archives, integrating automation as a possible means to this end.
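
As a rough illustration of what automated entity recognition over archival text involves (not the paper's own typology or baseline models, which target historical Dutch material), here is a minimal sketch using the off-the-shelf HuggingFace `transformers` NER pipeline:

```python
# Minimal, generic NER sketch with the HuggingFace `transformers` pipeline.
# This is NOT the paper's annotation typology or baseline; a model trained
# on historical Dutch would be needed for actual VOC records.
from transformers import pipeline

# Loads a default English NER model; the sentence is an invented example.
ner = pipeline("ner", aggregation_strategy="simple")

text = "Jan Pietersz sailed from Batavia to Amsterdam in 1641."
for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(float(entity["score"]), 3))
```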

A map of Digital Humanities research across bibliographic data sources

This study presents the results of an experiment we performed to measure the coverage of Digital Humanities (DH) publications in mainstream open and proprietary bibliographic data sources, and highlights the relations between DH and other disciplines. We created a list of DH journals based on manual curation and bibliometric data, and used it to identify DH publications in the bibliographic data sources under consideration. We used the ERIH-PLUS list of journals to identify Social Sciences and Humanities (SSH) publications. We analysed the citation links of DH publications to understand their relationship with SSH and non-SSH fields. Crossref emerges as the database containing the highest number of DH publications. Citations from and to DH publications show strong connections between DH and research in Computer Science, Linguistics, Psychology, and Pedagogical & Educational Research. Computer Science accounts for a large share of citations to and from DH research, which suggests a reciprocal interest between the two disciplines. This is the first bibliometric study of DH research involving several bibliographic data sources, including open and proprietary databases. The list of DH journals we created might be only partially representative of broader DH research. In addition, some DH publications may have been excluded from the study, since we did not consider books or publications in the proceedings of DH conferences and workshops. Finally, our time coverage (2000–2018) may have excluded additional DH publications.
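
A sketch of how publications can be retrieved per journal from Crossref's public REST API, assuming an illustrative ISSN and date filter (this is not the study's actual journal list or pipeline):

```python
# Fetch a few works for one journal from the Crossref REST API.
# The ISSN below is an illustrative placeholder, not taken from the study.
import requests

ISSN = "2055-7671"  # assumed example ISSN for a DH journal
url = f"https://api.crossref.org/journals/{ISSN}/works"
params = {
    "filter": "from-pub-date:2000-01-01,until-pub-date:2018-12-31",
    "rows": 5,
}
resp = requests.get(url, params=params, timeout=30)
resp.raise_for_status()
for item in resp.json()["message"]["items"]:
    title = (item.get("title") or ["(no title)"])[0]
    print(item.get("DOI"), "-", title)
```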

Archives and AI: An Overview of Current Debates and Future Perspectives

The digital transformation is turning archives, both old and new, into data. As a consequence, automation in the form of artificial intelligence techniques is increasingly applied both to scale traditional recordkeeping activities, and to experiment with novel ways to capture, organise, and access records. We survey recent developments at the intersection of Artificial Intelligence and archival thinking and practice. Our overview of this growing body of literature is organised through the lenses of the Records Continuum model. We find four broad themes in the literature on archives and artificial intelligence: theoretical and professional considerations, the automation of recordkeeping processes, organising and accessing archives, and novel forms of digital archives. We conclude by underlining emerging trends and directions for future work, which include the application of recordkeeping principles to the very data and processes that power modern artificial intelligence and a more structural — yet critically aware — integration of artificial intelligence into archival systems and practice.

Crypto art: A decentralized view

This is a decentralized position paper on crypto art, which includes viewpoints from different actors of the system: artists, collectors, galleries, art scholars, and data scientists. The writing process went as follows: a general definition of the topic was put forward by two of the authors (Franceschet and Colavizza) and used as a reference when asking a diverse set of authors to contribute their viewpoints asynchronously and independently. No guidelines were offered before the first draft, other than a minimum word count to justify a separate section/contribution. Afterwards, all authors read and commented on each other’s work, and minimal editing was done. Every author was asked to suggest open questions and future perspectives on the topic of crypto art from their vantage point, while keeping full control of their own sections at all times. While this process does not necessarily guarantee the uniformity expected from, say, a research article, it allows multiple voices to emerge and provide a contribution on a common topic. The closing section attempts to pull all these threads together into a perspective on the future of crypto art.

On the Value of Wikipedia as a Gateway to the Web

By linking to external websites, Wikipedia can act as a gateway to the Web. To date, however, little is known about the amount of traffic generated by Wikipedia’s external links. We fill this gap in a detailed analysis of usage logs gathered from Wikipedia users’ client devices. Our analysis proceeds in three steps: First, we quantify the level of engagement with external links, finding that, in one month, English Wikipedia generated 43M clicks to external websites, in roughly even parts via links in infoboxes, cited references, and article bodies. Official links listed in infoboxes have by far the highest click-through rate (CTR), 2.47% on average. In particular, official links associated with articles about businesses, educational institutions, and websites have the highest CTR, whereas official links associated with articles about geographical content, television, and music have the lowest CTR. Second, we investigate patterns of engagement with external links, finding that Wikipedia frequently serves as a stepping stone between search engines and third-party websites, effectively fulfilling information needs that search engines do not meet. Third, we quantify the hypothetical economic value of the clicks received by external websites from English Wikipedia, by estimating that the respective website owners would need to pay a total of $7-13 million per month to obtain the same volume of traffic via sponsored search. Overall, these findings shed light on Wikipedia’s role not only as an important source of information, but also as a high-traffic gateway to the broader Web ecosystem.
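
The back-of-the-envelope arithmetic behind the traffic-value range quoted above, assuming a hypothetical cost-per-click (CPC) range; the paper's own estimates rely on actual sponsored-search price data:

```python
# Illustrative arithmetic only: the click count is from the paper,
# but the CPC range below is an assumption, not the paper's data.
monthly_clicks = 43_000_000  # clicks from English Wikipedia to external sites

# CTR = clicks / impressions; e.g. the reported 2.47% average CTR on
# official infobox links means roughly 40 impressions per click.

cpc_low, cpc_high = 0.16, 0.30  # assumed USD per sponsored-search click
low = monthly_clicks * cpc_low / 1e6
high = monthly_clicks * cpc_high / 1e6
print(f"Hypothetical value: ${low:.1f}M to ${high:.1f}M per month")
```

With these assumed CPC values the estimate lands in the $7-13 million per month range reported above.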

A Digital Reconstruction of a Large Plague Outbreak During 1630-1631

The plague, an infectious disease caused by the bacterium Yersinia pestis, is widely considered to be responsible for the most devastating and deadly pandemics in human history. Starting with the infamous Black Death, plague outbreaks are estimated to have killed around 100 million people over multiple centuries, with local mortality rates as high as 60%. However, detailed pictures of the disease dynamics of these outbreaks centuries ago remain scarce, mainly due to the lack of high-quality historical data in digital form. Here, we present an analysis of the 1630–1631 plague outbreak in the city of Venice, using newly collected daily death records. We identify the presence of a two-peak pattern, for which we present two possible explanations based on computational models of disease dynamics. Systematically digitized historical records like the ones presented here promise to enrich our understanding of historical phenomena of enduring importance. This work contributes to the recently renewed interdisciplinary foray into the epidemiological and societal impact of pre-modern epidemics.
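
For readers unfamiliar with compartmental models, a generic SIR sketch of the kind used to probe outbreak dynamics; this is not the paper's calibrated model, and with the assumed parameters below it produces a single epidemic peak rather than two:

```python
# Generic SIR model integrated with simple Euler steps. All parameters
# are illustrative assumptions, not values estimated in the paper.
beta, gamma = 0.35, 0.10    # assumed transmission / recovery rates (per day)
S, I, R = 140_000, 10, 0    # rough Venice-scale population (assumption)
N = S + I + R
dt, days = 0.5, 300

for step in range(int(days / dt)):
    new_inf = beta * S * I / N * dt   # new infections this step
    new_rec = gamma * I * dt          # new recoveries/removals this step
    S, I, R = S - new_inf, I + new_inf - new_rec, R + new_rec
    if step % int(30 / dt) == 0:      # report roughly once a month
        print(f"day {step * dt:5.0f}: infected ~ {I:,.0f}")
```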

The Citation Advantage of Linking Publications to Research Data

Efforts to make research results open and reproducible are increasingly reflected by journal policies encouraging or mandating authors to provide data availability statements (DAS). As a consequence, there has been a strong uptake of data availability statements in recent literature. Nevertheless, it is still unclear what proportion of these statements actually contain well-formed links to data, for example via a URL or permanent identifier, and whether there is added value in providing such links. We consider 531,889 journal articles published by PLOS and BMC, develop an automatic system for labelling their DAS according to four categories based on their content and the type of data availability they display, and finally analyze the citation advantage of different statement categories via regression. We find that, following mandated publisher policies, data availability statements have become very common: in 2018, 93.7% of 21,793 PLOS articles and 88.2% of 31,956 BMC articles had one. However, statements containing a link to data in a repository, rather than noting that data are available on request or included as supporting information files, remain a fraction of the total: in 2017 and 2018, 20.8% of PLOS publications and 12.2% of BMC publications provided a DAS with a link to data in a repository. Using a citation prediction model, we also find that articles with a DAS linking to data in a repository are associated with up to 25.36% (±1.07%) higher citation impact on average. We discuss the potential implications of these results for researchers who make the effort of sharing their data in repositories, and for journal publishers. All our data and code are made available in order to reproduce and extend our results.
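
A toy version of such a citation regression on synthetic data, using a plain Poisson GLM from `statsmodels`; the paper's actual model, covariates, and data differ:

```python
# Toy citation regression on SYNTHETIC data: does a DAS linking to a
# repository associate with higher citations, controlling for article age?
# Not the paper's model; the effect size below is planted by construction.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1_000
links_data = rng.integers(0, 2, n)   # 1 = DAS links to data in a repository
age_years = rng.integers(1, 6, n)    # years since publication

# Simulate citation counts with a planted ~25% uplift for linking to data.
lam = np.exp(1.0 + 0.22 * links_data + 0.30 * age_years)
citations = rng.poisson(lam)

X = sm.add_constant(np.column_stack([links_data, age_years]))
fit = sm.GLM(citations, X, family=sm.families.Poisson()).fit()

# exp(coefficient) - 1 approximates the relative citation difference.
print(f"Estimated citation uplift for linking: {np.exp(fit.params[1]) - 1:.1%}")
```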

Assessing the Impact of OCR Quality on Downstream NLP Tasks

A growing volume of heritage data is being digitized and made available as text via optical character recognition (OCR). Scholars and libraries are increasingly using OCR-generated text for retrieval and analysis. However, the process of creating text through OCR introduces varying degrees of error, and the impact of these errors on natural language processing (NLP) tasks has only been partially studied. We perform a series of extrinsic assessment tasks — sentence segmentation, named entity recognition, dependency parsing, information retrieval, topic modelling and neural language model fine-tuning — using popular, out-of-the-box tools in order to quantify the impact of OCR quality on these tasks. We find that OCR errors have a consistent negative impact across our downstream tasks, with some tasks harmed more severely than others. Based on these results, we offer preliminary guidelines for working with text produced through OCR.
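
A tiny example of what an extrinsic assessment looks like: running the same out-of-the-box NER model on clean text and on a simulated OCR corruption of it, then comparing the entities found (illustrative only; the paper evaluates several tasks against real OCR output):

```python
# Compare spaCy NER output on clean vs. simulated-OCR text.
# Requires: python -m spacy download en_core_web_sm
# The sentences and error pattern are invented for illustration.
import spacy

nlp = spacy.load("en_core_web_sm")

clean = "Charles Dickens visited Manchester in October 1843."
ocr = "Charle5 Dick3ns visitcd Manchestcr in 0ctober 1843."  # simulated errors

for label, text in [("clean", clean), ("ocr", ocr)]:
    ents = [(e.text, e.label_) for e in nlp(text).ents]
    print(label, "->", ents)
```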
