Text-Induced Corpus Correction and Lexical Assessment Tool
Martin Reynaert (Tilburg University)
The Text-Induced Corpus Clean-up tool TICCL, an integral part of the CLARIN infrastructure, is globally unique in using corpus-derived word form statistics to attempt fully automatic post-correction of texts digitized by means of Optical Character Recognition.
The NWO 'Groot' project Nederlab has delivered a uniformly processed and linguistically enriched diachronic corpus of Dutch containing an estimated 5-6 billion word tokens. We aim to extend TICCL's correction capabilities with classification facilities based on specific data collected from the full Nederlab corpus: word statistics, document and time references and linguistic annotations, i.e. Part-of-Speech and Named-Entity labels. These data will complement a solid, renewed basis composed of the available validated lexicons and name lists for Dutch.
In the process, TICCL will be transformed from a post-correction tool into TICCLAT, a lexical assessment tool capable of delivering not only correction candidates but also, for example, more accurately dated diachronic Dutch word forms and more securely classified person and place names. To achieve this at scale, the TICCLAT project relies on a successful extension of TICCL's anagram hashing towards text-induced morphological classification. TICCLAT's capabilities will also be evaluated against human performance by an expert psycholinguist.
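For readers unfamiliar with anagram hashing, the following minimal Python sketch illustrates the general idea: every word receives a numeric key that depends only on which characters it contains, so all anagrams share a key, and a single-character substitution becomes a simple addition and subtraction on keys. The exponent of 5, the use of Unicode code points, and all function names here are assumptions made for illustration; this is not TICCL's actual implementation.

```python
from collections import defaultdict

POWER = 5  # assumed exponent; chosen here only for illustration


def anagram_key(word: str) -> int:
    """Sum of code points raised to POWER; character order does not matter."""
    return sum(ord(ch) ** POWER for ch in word)


def build_index(words):
    """Map each anagram key to the set of word forms that share it."""
    index = defaultdict(set)
    for word in words:
        index[anagram_key(word)].add(word)
    return index


def substitution_candidates(word, index, alphabet):
    """Look up lexicon words whose key differs by one character substitution.

    Only key arithmetic is used, so every anagram of a matching character
    bag is returned as well.
    """
    key = anagram_key(word)
    found = set()
    for old in set(word):
        for new in alphabet:
            if new == old:
                continue
            target = key - ord(old) ** POWER + ord(new) ** POWER
            found |= index.get(target, set())
    return found


# Hypothetical toy lexicon; 'huys' stands in for a historical spelling.
lexicon_index = build_index(["huis", "huid", "muis"])
alphabet = "abcdefghijklmnopqrstuvwxyz"
candidates = substitution_candidates("huys", lexicon_index, alphabet)
```

In this toy setting the lookup for "huys" reaches "huis" by replacing "y" with "i", without comparing the two strings character by character. Candidates found this way would still need ranking and filtering, for example by corpus frequency.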
The collected data will be exportable as RDF triples for storage in a data repository and for broad reuse. The project will contribute substantially to a more comprehensive overview of the lexicon of Dutch since its earliest days, and of the person and place names that share its history. Its partners are the Dutch experts in Lexicology, Person Names and Toponyms.
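As a rough illustration of what an export as RDF triples could look like, the sketch below serializes one word form record as N-Triples lines. The namespace, the property names and the record values are hypothetical placeholders, not the project's actual vocabulary or data.

```python
EX = "http://example.org/ticclat/"  # hypothetical namespace, for illustration only
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


def iri(value: str) -> str:
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(value, datatype=None) -> str:
    """Quote a literal, escaping backslashes, quotes and line breaks."""
    escaped = (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    text = f'"{escaped}"'
    return f"{text}^^<{datatype}>" if datatype else text


def triple(subject: str, predicate: str, obj: str) -> str:
    """Format one subject-predicate-object statement as an N-Triples line."""
    return f"{subject} {predicate} {obj} ."


def wordform_triples(wordform_id: str, form: str, frequency: int):
    """Describe one word form with two hypothetical properties."""
    subject = iri(EX + "wordform/" + wordform_id)
    return [
        triple(subject, iri(EX + "form"), literal(form)),
        triple(subject, iri(EX + "frequency"), literal(frequency, XSD_INTEGER)),
    ]


# Placeholder record; the identifier and frequency are made up.
lines = wordform_triples("w1", "huys", 42)
```

A line-based format such as N-Triples is used here only because it can be written and inspected without additional libraries; the project may well choose a different serialization.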