CLARIAH facilitates!

In the proposal for CLARIAH-PLUS (p. 8) it is argued that: “The CLARIAH infrastructure will increase our empirical base, options for analysing […] data, and the efficiency of research by orders of magnitude (data-intensive science).”
Ok, but is it true?

Background

Geert Wilders, leader of the Dutch populist party PVV, known for standing up for ordinary people, recently published a tweet (see below) in which he used the virtually unknown word 'difficulteren' (roughly: 'to be difficult'). This is remarkable, because his party is known for straightforward language that even 'ordinary' people can understand.

The reason for the blogs: the use of the unusual word 'difficulteren' by populist party leader Geert Wilders. In English, the tweet reads: The President of Parliament, Arib, seemed okay yesterday when I spoke to her about awarding Muhammad cartoon prizes in the Dutch Parliament during “party day”. Now she is going to difficulteren (be difficult). Suddenly everything must be done via committee, praesidium, etc.

Linguist Marc van Oostendorp, professor of Dutch Language and Literature at Radboud University in Nijmegen and a passionate blogger, wrote a nice blog post about this tweet and formulated a conjecture about the use of this word. Marten van der Meulen, PhD student and writer, responded by running corpus searches on data made accessible in the CLARIAH infrastructure, in order to test Marc’s conjecture. Marten tried to find out when the unknown word ‘difficulteren’ was first used, how often it has been used in recent years, and in which contexts it mainly occurred.

The research

'increase our empirical base'

Marten searched six corpora: Staten Generaal Digitaal, Corpus Gesproken Nederlands, Corpus Hedendaags Nederlands, Brieven als Buit Corpus, SoNaR, and the corpora of Nederlab (where the word mainly occurs in Early Dutch Books Online). A prominent feature of CLARIAH is that it allows every humanities scholar to search these resources: you don't have to be a corpus linguist, you don't have to be able to code, and you don't have to download corpora or software. CLARIAH offers web applications with user-friendly interfaces that make searching these corpora easy. See below for links.

'increase options for analysing … data'

These resources make it possible to search by lemma rather than by word form, which makes both the search and the analysis of the results a lot easier and yields more relevant hits. Moreover, many of the sources contain metadata such as genre, time and place, so that it can also be quickly determined where, when and in which genres the word occurs more or less frequently.
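To make the difference between searching by word form and searching by lemma concrete, here is a minimal sketch in Python. It does not use any CLARIAH tool or API: the `Token` class, the toy corpus and its years and genres are all hypothetical and hand-made for illustration, assuming only that each token in an annotated corpus carries a lemma and some document metadata.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    word: str   # surface form as it appears in the text
    lemma: str  # dictionary form assigned by the corpus annotation
    year: int   # metadata of the document the token comes from
    genre: str  # metadata of the document the token comes from


# Hypothetical, hand-made tokens; NOT taken from any real corpus.
TOY_CORPUS = [
    Token("difficulteren", "difficulteren", 1687, "newspaper"),
    Token("gedifficulteerd", "difficulteren", 1702, "letter"),
    Token("difficulteerde", "difficulteren", 1702, "newspaper"),
    Token("moeilijk", "moeilijk", 1998, "parliament"),
    Token("difficulteert", "difficulteren", 2005, "parliament"),
]


def search_by_word(corpus, word):
    """Return only tokens whose surface form matches exactly."""
    return [t for t in corpus if t.word == word]


def search_by_lemma(corpus, lemma):
    """Return all tokens annotated with the given lemma, whatever their form."""
    return [t for t in corpus if t.lemma == lemma]


def count_by(hits, field):
    """Count hits per value of a metadata field such as 'genre' or 'year'."""
    return Counter(getattr(t, field) for t in hits)


word_hits = search_by_word(TOY_CORPUS, "difficulteren")
lemma_hits = search_by_lemma(TOY_CORPUS, "difficulteren")

print(len(word_hits), len(lemma_hits))
print(count_by(lemma_hits, "genre"))
```

In this toy setting the lemma search finds every inflected form in one go, and the metadata attached to each hit can be tallied directly, which is the kind of overview the web applications offer without any coding.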

'increase the efficiency of research'

Marten did this research within a single day, something that was not possible before CLARIAH, except perhaps for a select group of corpus linguists.

Difficulteren: Oprechte Haerlemsche courant (08-11-1687). Found in the archives of the Library of the Netherlands by searching for ‘difficulteren’ in the search app of the Nederlab project.

Of course, you can also search the internet, via Google or Twitter. This complements the search in specific corpora, especially since the empirical base is then even larger. But then one has to look up every word form of the verb separately, and analysing the results requires more (manual) work, especially because there are hardly any relevant metadata. Marten also searched with Google, but he could not yet analyse those results within that one day. He also searched the Corpus of the Web (COW) for Dutch, which is smaller than the whole internet but still quite large (7 billion words); it yielded fewer hits, so these could be analysed further.
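What looking up every word form separately means in practice can be sketched as follows. This is only an illustration of searching unannotated text, not what Marten actually did: the list of forms assumes a regular conjugation of 'difficulteren' and may be incomplete, and the sample sentence is made up.

```python
import re

# Illustrative list of word forms, assuming a regular conjugation;
# it may be incomplete, and historical spellings are not covered.
WORD_FORMS = [
    "difficulteren",
    "difficulteer",
    "difficulteert",
    "difficulteerde",
    "difficulteerden",
    "gedifficulteerd",
]

# One alternation over all forms; \b keeps matches on whole words only.
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(form) for form in WORD_FORMS) + r")\b",
    re.IGNORECASE,
)


def find_forms(text):
    """Return every listed word form found in plain, unannotated text."""
    return [match.group(0).lower() for match in PATTERN.finditer(text)]


# Made-up sample text, for illustration only.
SAMPLE = (
    "Nu gaat ze difficulteren. Gisteren difficulteerde niemand; "
    "er is genoeg gedifficulteerd."
)

print(find_forms(SAMPLE))
```

Any form missing from the list is silently missed, and the hits come without genre, date or place, which is why the analysis of such results takes more manual work.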

The search query in question concerns a one-word lemma, and that is a relatively simple task. But the CLARIAH infrastructure also allows much more complex searches, with combinations of words, word pairs with a grammatical dependency relationship, and complete grammatical constructions.

Conclusion

My conclusion is therefore that CLARIAH facilitates research, and that it already substantiates the claim quoted above.

Do you want to know more, or take a course to make the best use of these tools? Please feel free to contact CLARIAH via: .

Jan Odijk

 

Links

Corpus Hedendaags Nederlands http://corpushedendaagsnederlands.inl.nl/
OpenSoNaR http://opensonar.inl.nl/
Nederlab http://www.nederlab.nl/
PaQu http://www.let.rug.nl/alfa/paqu/info.html
(searching for word pairs with a grammatical dependency relationship)

GrETEL http://gretel.ccl.kuleuven.be/gretel3/
(searching for grammatical constructions)

General

https://portal.clarin.nl/clariah-tools-fs
(overview of tools and services, still under development)