Data in CLARIAH
The increasing availability of massive quantities of digital data is one of the main reasons why an infrastructure project such as CLARIAH-CORE is needed. The sheer volume of the data makes it impossible to study them in the traditional way. Researchers have to use software to help them find potentially relevant parts and ignore irrelevant ones, or to carry out analyses of the data. At the same time, using software to search and analyse massive amounts of digital data creates new opportunities for breakthroughs in humanities research: the research can be based on more data than was ever possible before, and it can make use of automatic analysis software that is more reliable than humans in certain search and analysis tasks (though in other tasks humans still outperform software).
Data come in many types. The major types are natural language texts, audio-visual data and structured data (databases). All three types are represented in CLARIAH. Though all types occur in all of CLARIAH’s core disciplines, each core discipline has its own dominant data type:
- Linguistics: natural language texts
- Social and economic history: structured (often quantitative) data
- Media Studies: audio-visual data
In addition, a discipline-independent work package deals with data that are useful or needed for all humanities disciplines.
Below are descriptions of how data are used in CLARIAH.
The full PDF document can be found here.
In the linguistics work package (WP3), natural language texts play an important role. For some research questions the texts are sufficient as such, but in most cases they must be enriched with linguistic annotations, such as part-of-speech tags for word occurrences, full syntactic structures for sentences (treebanks), and many other types of linguistic annotation. Searching in these linguistically enriched data requires special applications. Both the software for enriching textual corpora (Frog, Alpino, Namescape, etc.) and the applications for searching in the enriched corpora (OpenSONAR, PaQu, GrETEL, MIMORE, and others) were available before the start of CLARIAH-CORE or are being developed in independent projects (Nederlab), but many are being extended and improved in CLARIAH. The data are large and distributed over multiple centres, so it must be possible to search across them: search applications that support such distributed, or even federated, search will be developed in CLARIAH. The major centres for natural language texts are the Meertens Institute, the Huygens Institute, the Institute for the Dutch Language, and DANS.
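To make the idea of searching annotated, distributed text more concrete, the following is a minimal Python sketch. It does not use Frog, Alpino, or any CLARIAH search application. The `Token` class, the two centre names, the sample sentences, the tag labels, and the functions `search_corpus` and `federated_search` are all hypothetical and exist only for illustration. Each "centre" is simply an in-memory list of sentences whose words carry a part-of-speech tag. The same query is run against each centre and the hits are merged with a label showing where they came from.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Token:
    """One word occurrence with a part-of-speech annotation."""
    word: str
    pos: str  # illustrative tag labels, not the tag set of any specific tool


Sentence = List[Token]

# Hypothetical, hand-annotated sample data standing in for corpora
# held at two different centres.
CORPORA: Dict[str, List[Sentence]] = {
    "centre_a": [
        [Token("De", "DET"), Token("kat", "NOUN"), Token("slaapt", "VERB")],
    ],
    "centre_b": [
        [Token("Zij", "PRON"), Token("leest", "VERB"),
         Token("een", "DET"), Token("boek", "NOUN")],
    ],
}


def search_corpus(sentences: List[Sentence], pos: str) -> List[str]:
    """Return all words in one corpus whose tag equals `pos`."""
    return [tok.word for sent in sentences for tok in sent if tok.pos == pos]


def federated_search(corpora: Dict[str, List[Sentence]],
                     pos: str) -> List[Tuple[str, str]]:
    """Run the same query on every corpus and merge the hits,
    labelling each hit with the centre it came from."""
    hits: List[Tuple[str, str]] = []
    for centre, sentences in corpora.items():
        for word in search_corpus(sentences, pos):
            hits.append((centre, word))
    return hits


if __name__ == "__main__":
    for centre, word in federated_search(CORPORA, "NOUN"):
        print(centre, word)
```

In a real setting each centre would expose its own search service and the merging step would have to deal with differing tag sets, formats, and response times. The sketch only shows the basic pattern of one query, several sources, and one merged result list.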