Accelerating Scientific Discovery in the Arts and Humanities

Granted Projects

eScience Center and CLARIAH grant four projects in the Arts and Humanities

CLARIAH logo logo escience zwart

The eScience Center and CLARIAH are pleased to announce the initiation of four new projects in the Arts and Humanities. The four projects will pursue new scientific domain challenges and enhance and accelerate the process of scientific discovery within the Arts and Humanities using computer science, data science, and eScience technologies.

Scheduled to start in the second half of 2017, the projects are collaborations with research teams from multiple Dutch academic groups. The granted projects will use, adapt, and integrate existing methods and tools, as made available through the CLARIAH and eScience Center software infrastructures. Newly developed tools will be made available through the eScience Technology Platform of the Netherlands eScience Center and the CLARIAH Infrastructure for potential use in other studies.

The granted projects are:

Bridging the gap: Digital Humanities and the Arabic-Islamic corpus

Who Prof. dr. Christian Lange 
Where Utrecht University
What

Despite some pioneering efforts in recent times, the longue durée analysis of conceptual history in the Islamic world remains a largely unexplored field of research. Researchers of Islamic intellectual history still tend to study a certain canon of texts, made available by previous Western researchers of the Islamic world largely based on considerations of the relevance of these texts for Western theories, concepts and ideas. Indigenous conceptual developments and innovations are therefore insufficiently understood, particularly as concerns the transition from premodern to modern thought in Islam.

This project seeks to harness state-of-the art Digital Humanities approaches and technologies to make pioneering forays into the vast corpus of digitised Arabic texts that has become available in the last decade. This is done along the lines of four case studies, each of which examines a separate genre of Arabic and Islamic literary history (jurisprudence, inter-faith literature, early modern and modern journalism, and Arabic poetry).

This project seeks to develop a web-based application that will (a) enable easy access to existing Arabic corpora on GitHub and other online repositories and offer the opportunity for researchers to upload their own corpus (b) offer a set of tools for Arabic text mining and computational analysis, and (c) provide opportunities to link search results to the datasets in Islamic and Middle Eastern Studies of Brill Publishers, Europe’s leading publisher in this area.

The project will be inserted into two ongoing ERC projects on Islamic intellectual history housed at the Department of Philosophy and Religious Studies at Utrecht University, and collaborate closely with international initiatives in the field of Arabic Digital Humanities.

TICCLAT: Text-Induced Corpus Correction and Lexical Assessment Tool

Who Dr. Martin Reynaert
Where Tilburg University
What

The Text-Induced Corpus Clean-up tool TICCL, integral part of the CLARIN infrastructure, is globally unique in utilizing the corpus-derived word form statistics to attempt to fully-automatically post-correct texts digitized by means of Optical Character Recognition.

The NWO 'Groot' project Nederlab will deliver by the end of 2017 a uniformly processed and linguistically enriched diachronic corpus of Dutch containing an estimated 5-6 billion word tokens. We aim to extend TICCL's correction capabilities with classification facilities based on specific data collected from the full Nederlab corpus: word statistics, document and time references and linguistic annotations, i.e. Part-of-Speech and Named-Entity labels. These data will complement a solid, renewed basis composed of the available validated lexicons and name lists for Dutch.

In this, TICCL as a post-correction tool will be transformed into TICCLAT, a lexical assessment tool capable of delivering not only correction candidates, but also e.g. more accurately dated diachronic Dutch word forms, more securely classified person and place names. To achieve this on scale, the TICCLAT project will seek a successful merger of TICCL's anagram hashing with bit-vectorization techniques. TICCLAT's capabilities will also be evaluated in comparison to human performance by an expert psycholinguist.

The data collected will be exportable for storage in a data repository, as RDF triples, for broad reuse. The project will greatly contribute to a more comprehensive overview of the lexicon of Dutch since its earliest days and of the person and place names that share its history. Its partners are the Dutch experts in Lexicology, Person Names and Toponyms.

News Genres: Advancing Media History by Transparent Automatic Genre Classification (NEWSGAC)

Who Prof.dr. Marcel J. Broersma 
Where University of Groningen
What

This project studies how genres in newspapers and television news can be detected automatically using machine learning in a transparent manner. This will enable us to capture the often hypothesized but, due to the highly time consuming nature of manual content analysis, largely understudied shift from opinion-based to fact-centred reporting. Moreover, we will open the black box of machine learning by comparing, predicting and visualizing the effects of applying various algorithms on heterogeneous data with varying quality and genre features that shift over time. This will enable scholars to do large-scale analyses of historic texts and other media types as well as critically evaluate the methodological effects of various machine learning approaches.

This project brings together expertise of journalism history scholars (RUG), specialists in data modelling, integration and analysis (CWI), digital collection experts (KB & NISV) and e-science engineers (eScience Center). It will first use a big manually annotated dataset (VIDI-project PI) to develop a transparent and reproducible approach to train an automatic classifier. Building upon this, the project will generate three outcomes:

  1. A study that revises our current understanding of the interrelated development of genre conventions in print and television journalism based upon large-scale automated content analysis via machine learning;
  2. Metrics and guidelines for evaluating the bias and error of the different preprocessing and machine learning approaches and of-the-shelf software packages;
  3. A dashboard that integrates, compares and visualises different algorithms and underlying machine learning approaches which can be integrated in the CLARIAH media suite.

EviDENce: Ego Documents Events modelliNg. How individuals recall mass violence

Who Dr. Susan Hogervorst
Where Open Universiteit Nederland
What

Much of our historical knowledge is based on oral or written accounts of eyewitnesses, particularly in cases of war and mass violence, when regular ways of documentation and record keeping are often absent. Although oral history and the study of ego documents both value these individual perspectives on history and its meaning, these research fields tend to operate separately. However, the digital revolution has shaken up the balance between spoken and written text. The paradigm emerging in the application of search technology to digitised oral history is characterised by a post-documentary sensibility: away from text and sensitive to other dimensions of human expression than language. Nonetheless, ‘mining’ of oral history accounts remains valuable in humanities research, especially considering the re-use of digital interview collections throughout the humanities.

EviDENce explores new ways of analysing and contextualising historical sources by applying event modelling and semantic web technologies. Our project suggests a systematic and integral content analysis of ‘ego-sources’ by applying state-of-the-art entity and event modelling methods and tools, in order to explore the nature and value of ego-sources and to disclose existing collections. We focus on representations of mass-violence in two case studies to generate and explore different kinds of events: 1) a synchronic analysis of WW2 events, centered around the oral history collection ‘Getuigenverhalen’ [1] and using the WW2 thesaurus [2], and 2) a diachronic analysis of ego-documents (1573-2012) from Nederlab [3]. In both cases, we use content-related contextual sources from Nederlab [4].

About the ADAH Call

The four projects result from the recent ADAH call (Accelerating Scientific Discovery in the Arts and Humanities). The purpose of the 2016 ADAH call is to enable researchers working in the Arts and Humanities to address compute-intensive and/or data-driven problems within their research and to contribute to a generic and sustainable research software infrastructure.

About the Netherlands eScience Center

The eScience Center is the national hub for the development and application of domain overarching software and methods for the scientific community. The eScience Center develops crucial bridges between increasingly complex modern e-infrastructures and the growing demands and ambitions of scientists from across all scientific disciplines. 

About CLARIAH

CLARIAH is a national project that is designing, constructing and exploiting the Dutch parts of the European CLARIN and DARIAH infrastructures. CLARIAH covers the humanities as a whole but has three core discipline areas: linguistics, media studies, and socio-economic history.

Contact information

Prof. dr. Jan Odijk, Program Director CLARIAH
+31 (0)30 253 5745

Dr. Frank Seinstra, Director eScience Program, Netherlands eScience Center
+31 (0)20 4604770