Abstract

Dutch corpus of person name variants

 

This project aims to develop a gold standard for person name variants, based mainly on the LINKS corpus of 19th- and 20th-century person names from the vital register (63 million tokens). Of the 564,000 surnames and 189,000 first names, 25% have already been standardized on the basis of variants associated with the same individual. This core set still requires expert review, which will be assisted by the CLARIAH tool TICCL. The review will also serve as the (statistical) learning phase of TICCL, enabling it to handle previously unseen variants. In parallel, a data structure will be established to deal with ambiguities and to accommodate different levels of standardization. In a second phase, the remaining 75% of the LINKS corpus will be standardized.
The corpus will be delivered both in RDF format for Linked Open Data and as a lexical service. Its usage will be tested within the CLARIAH Anansi environment.
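
The idea of standardizing on the basis of variants associated with the same individual can be illustrated with a minimal sketch. It is not the project's actual procedure: the input format (pairs of a person identifier and an observed name form), the function name `cluster_variants`, and the rule of taking the most frequent form as the standard are all assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def cluster_variants(observations):
    """Map each observed name form to a standard form.

    `observations` is an iterable of (person_id, name_form) pairs.
    This input format is hypothetical, not the LINKS data model.
    Forms observed for the same person are merged into one cluster
    (union-find); the most frequent form in a cluster is taken as
    its standard form, with ties broken alphabetically.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    counts = Counter()
    first_form = {}  # first form seen for each person
    for person_id, form in observations:
        counts[form] += 1
        find(form)  # register the form even if it is never merged
        if person_id in first_form:
            union(first_form[person_id], form)
        else:
            first_form[person_id] = form

    clusters = defaultdict(set)
    for form in counts:
        clusters[find(form)].add(form)

    standard = {}
    for members in clusters.values():
        head = min(members, key=lambda f: (-counts[f], f))
        for form in members:
            standard[form] = head
    return standard


# Invented example data, for illustration only.
observations = [
    (1, "Jansen"), (1, "Janssen"),
    (2, "Janssen"), (2, "Jansen"),
    (3, "Jansen"),
    (4, "Visser"), (4, "Visscher"),
    (5, "Visser"),
]
standard = cluster_variants(observations)
```

Because merging is transitive, a single mislinked record can join two unrelated names in this sketch, which is one reason expert review and an explicit treatment of ambiguities remain necessary.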