The Dutch East India Company (VOC) and the General Missives

In this use case, researchers from VU University, the Huygens Institute and the Dutch Language Institute developed tools for analyzing 17th- and 18th-century Dutch texts.

Principal investigator

Lodewijk Petram

The project’s primary research data is the corpus of the General Missives, a series of reports detailing the activities of the Dutch East India Company and the events that occurred in the Asian regions where the VOC was active between 1610 and 1795.

First, researchers at the Dutch Language Institute applied optical character recognition (OCR) to digitize the texts. Using TEI (Text Encoding Initiative), a widely used international standard for representing text in digital form, they enriched the OCR output with metadata, such as author and date, at the level of each individual missive. They also added structuring elements to distinguish between transcriptions, the editors' summarizing notes, and footnotes. VU University then converted this data to NAF (NLP Annotation Format). In addition, Dirk Roorda cleaned the data, converted it to Text-Fabric, and converted it back to XML. To facilitate manual and automatic entity annotation, the NAF files were converted to the more lightweight CoNLL and XMI formats.
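The separation of metadata, transcription and editorial notes described above can be illustrated with a small sketch. It is not the project's conversion code: the TEI fragment, its element layout, the author, the date and the helper names `read_missive` and `to_conll` are all assumptions made for illustration, and the whitespace tokenization is deliberately naive.

```python
import xml.etree.ElementTree as ET

TEI_URI = "http://www.tei-c.org/ns/1.0"
NS = {"tei": TEI_URI}
NOTE_TAG = "{" + TEI_URI + "}note"

# Hypothetical TEI fragment; the real corpus files may be structured differently.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Example missive</title>
        <author>Example Author</author>
      </titleStmt>
      <publicationStmt><date when="1700-01-01"/></publicationStmt>
      <sourceDesc><p>Illustration only</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>Het schip is te Batavia aangekomen.<note>Editorial summary.</note></p>
    </body>
  </text>
</TEI>"""


def paragraph_text(p):
    """Return the text of a paragraph, leaving out the content of <note> children."""
    parts = [p.text or ""]
    for child in p:
        if child.tag != NOTE_TAG:
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")
    return " ".join("".join(parts).split())


def read_missive(xml_text):
    """Split one missive into metadata, transcription text and editorial notes."""
    root = ET.fromstring(xml_text)
    author = root.find(".//tei:titleStmt/tei:author", NS)
    date = root.find(".//tei:publicationStmt/tei:date", NS)
    body = root.find(".//tei:body", NS)
    return {
        "author": author.text if author is not None else None,
        "date": date.get("when") if date is not None else None,
        "text": " ".join(paragraph_text(p) for p in body.findall("tei:p", NS)),
        "notes": ["".join(n.itertext()).strip() for n in body.iter(NOTE_TAG)],
    }


def to_conll(text):
    """Write one token per line with a placeholder 'O' label (naive whitespace split)."""
    return "\n".join(token + "\tO" for token in text.split())


if __name__ == "__main__":
    missive = read_missive(SAMPLE)
    print(missive["author"], missive["date"])
    print(to_conll(missive["text"]))
```

A one-token-per-line layout like this is what makes CoNLL-style files convenient for annotation: each token has a single label column that a human annotator or a model can fill in.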

One of the resources developed in this use case is a tool for Named Entity Recognition (NER). NER is a technology that automatically recognizes named entities like persons, locations or ships in a set of texts, and categorizes them. In the companion publication to this tool (Arnoult et al., 2021), Sophie Arnoult, Lodewijk Petram and Piek Vossen compared language-specific and multilingual pretrained language models for NER and introduced a new NER model for Dutch based on the General Missives. These NER annotations are delivered in a Text-Fabric data module and they are also integrated in the (cleaned) XML files.
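To make the idea of recognizing and categorizing entities concrete, the sketch below groups token-level labels into entity spans. It assumes a BIO labeling scheme and the label names `PER` and `LOC`; both are assumptions for illustration, and the example sentence is invented rather than taken from the corpus.

```python
def decode_bio(rows):
    """Group (token, tag) pairs with BIO tags into (entity text, entity type) spans."""
    entities = []
    tokens, etype = [], None

    def flush():
        # Store the span collected so far, if any.
        if tokens:
            entities.append((" ".join(tokens), etype))

    for token, tag in rows:
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != etype
        )
        if starts_new:
            flush()
            tokens, etype = [token], tag[2:]
        elif tag.startswith("I-"):
            tokens.append(token)
        else:  # an "O" tag closes any open span
            flush()
            tokens, etype = [], None
    flush()
    return entities


if __name__ == "__main__":
    # Invented example; the tags are hypothetical labels, not the project's tag set.
    rows = [
        ("Jan", "B-PER"),
        ("Pietersz", "I-PER"),
        ("vertrok", "O"),
        ("naar", "O"),
        ("Batavia", "B-LOC"),
    ]
    print(decode_bio(rows))
```

An `I-` tag that does not continue the current entity type is treated as the start of a new span, which keeps the decoder tolerant of the occasional inconsistent label in model output.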

To wrap up this use case, we aim to develop a manual or tutorial so that our NER tool can be used by a broader community of scholars. Furthermore, we intend to make the annotated corpus available for research and the NER model available for reuse.

Project info

Partners

Researchers

Lodewijk Petram

Senior Research Data Manager, Huygens Instituut

Jesse de Does

Computational Linguist, Instituut voor de Nederlandse Taal

Katrien Depuydt

Senior Researcher, Instituut voor de Nederlandse Taal

Sophie Arnoult

PhD Candidate, Universiteit van Amsterdam

Piek Vossen

Professor of Computational Lexicology, Vrije Universiteit Amsterdam

Dirk Roorda

Researcher, KNAW Humanities Cluster

Joris van Zundert

WP6 Co-lead, CLARIAH NL

Julia Neugarten

Staff member until October 2022, KNAW Humanities Cluster

Publications

Research paper: Arnoult et al., 2021. Batavia asked for advice. Pretrained language models for Named Entity Recognition in historical texts.

The cleaned data is available in Text-Fabric and XML-TEI in the CLARIAH/wp6-missieven GitHub repository. There are also a few tutorials there, which can be viewed on NBViewer.

The dataset is available in the data section of the CLTL voc-missives GitHub repository.

One of the best-performing models we experimented with in the paper is available on the Hugging Face Hub for use with the Transformers library: https://huggingface.co/CLTL/gm-ner-xlmrbase.
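A minimal sketch of how such a model might be loaded is given below. It assumes that the `transformers` package and a backend such as PyTorch are installed, that the model can be downloaded from the address above, and that it works with the standard token-classification pipeline. The helper names `load_tagger` and `tag` are hypothetical, and the example sentence is invented; the labels the model returns are whatever its own configuration defines.

```python
MODEL_ID = "CLTL/gm-ner-xlmrbase"  # model identifier taken from the URL above


def load_tagger(model_id=MODEL_ID):
    """Build a token-classification pipeline (downloads the model on first use)."""
    # Imported lazily so that this module can be loaded without transformers.
    from transformers import pipeline

    return pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",  # merge word pieces into whole entities
    )


def tag(text, tagger):
    """Return (entity text, entity label, confidence score) triples for a text."""
    return [
        (entity["word"], entity["entity_group"], float(entity["score"]))
        for entity in tagger(text)
    ]


if __name__ == "__main__":
    tagger = load_tagger()
    # Invented sentence for illustration only.
    for word, label, score in tag("Het schip is te Batavia aangekomen.", tagger):
        print(word, label, round(score, 3))
```

Passing the tagger in as an argument keeps `tag` independent of the model, so the same function can be reused with another model or with a stub during testing.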

More projects

WP6: Text

Work package 6 makes data and tools available to researchers in literary studies, history, philosophy, and religious studies, and ...