The increasing availability of massive quantities of digital data is one of the main reasons why an infrastructure project such as CLARIAH CORE is needed. Such amounts of data are impossible to investigate in the traditional way: researchers need software to find potentially relevant parts, to set irrelevant ones aside, and to analyse the data. At the same time, using software to search in and analyse massive amounts of digital data creates new opportunities for breakthroughs in humanities research, because research can now draw on more data than was ever possible before, and because automatic analysis software is, for certain search and analysis tasks, more reliable than humans are or can ever be (although in other tasks humans still outperform software).
Data come in many types. The major types are natural language texts, audio-visual data and structured data (databases). All three types are represented in CLARIAH. Though all types occur in all of CLARIAH’s core disciplines, each core discipline has its own dominant data type:
- Linguistics: natural language texts (see: CLARIAH/CLARIN tools and services)
- Social-economic history: structured (often quantitative) data (see: Data Legend)
- Media Studies and oral history: audio-visual data (see: CLARIAH Media Suite)
In addition, a discipline-independent work package deals with data that are useful or needed for all humanities disciplines.
The use of data in each of these disciplines is described below.
Linguistics
In the linguistics work package (WP3), natural language texts play a central role. For some research questions the texts suffice as they are, but in most cases they must be enriched with linguistic annotations, such as part-of-speech tags for word occurrences, full syntactic structures for sentence occurrences (treebanks), and many other types of linguistic annotation. Searching in these linguistically enriched data requires special applications. Both the software to enrich the textual corpora (Frog, Alpino, Namescape, etc.)[1] and the applications for searching in the enriched corpora (OpenSONAR, PaQu, GrETEL, MIMORE, and others) were available before the start of CLARIAH-CORE or are being developed in independent projects (e.g. Nederlab), but many of them are extended and improved in CLARIAH. Because the data are large and distributed over multiple centres, CLARIAH will also develop search applications that support distributed or even federated search. The major centres for natural language texts are the Meertens Institute, the Huygens Institute, the Institute for the Dutch Language, and DANS.
[1] See http://portal.clarin.nl/ for more examples
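To give an impression of what searching in an enriched corpus involves, here is a minimal sketch in Python. It assumes a simple tab-separated annotation format (one token per line with word, lemma and part-of-speech tag, and blank lines between sentences); the file name, the column layout and the tag "WW" (the CGN tag for verbs, as used by tools such as Frog) are illustrative assumptions, not the actual formats or interfaces of the CLARIAH applications mentioned above.

```python
# Minimal sketch of searching a POS-annotated corpus.
# Assumed input: tab-separated lines "word<TAB>lemma<TAB>POS",
# with blank lines separating sentences (illustrative format only).

def read_sentences(path):
    """Yield sentences as lists of (word, lemma, pos) tuples."""
    sentence = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                      # blank line ends a sentence
                if sentence:
                    yield sentence
                    sentence = []
                continue
            word, lemma, pos = line.split("\t")[:3]
            sentence.append((word, lemma, pos))
    if sentence:
        yield sentence

def sentences_with_lemma(path, lemma, pos_prefix):
    """Return sentences containing the lemma with a POS tag starting with pos_prefix."""
    return [s for s in read_sentences(path)
            if any(l == lemma and p.startswith(pos_prefix) for _, l, p in s)]

# Example: sentences in which the lemma "lopen" occurs as a verb ("WW" in the CGN tagset).
hits = sentences_with_lemma("corpus_annotated.tsv", "lopen", "WW")
print(len(hits), "matching sentences")
```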
Social-Economic History
In the social-economic history work package (WP4), structured databases play the dominant role: information on social-economic history is encoded in databases. The relevant information concerns several levels: the micro level (individuals and families), the meso level (organisations, trade unions, guilds, etc.) and the macro level (national and supranational data). The problem is that each database has its own structure and uses its own vocabulary, so there is neither syntactic nor semantic interoperability. WP4 addresses this problem through the Linked Data (LD) paradigm. In this approach, all information is encoded as triples consisting of a predicate and two arguments (usually called the 'subject' and the 'object'). This resolves the syntactic interoperability problem, since all databases then have the same structure: one big table of triples. The triples can be encoded in different ways; RDF is the most widely used encoding and is also the one used in WP4. Semantic interoperability is addressed by harmonising the vocabularies used and by ensuring that the elements of each triple (the predicate, the subject and the object) are associated with clearly defined concepts. Turning to the LD paradigm also makes it possible to link to external data sources published as Linked Data, a collection that is already huge and continues to grow.
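As an illustration of this triple encoding, the sketch below uses the Python library rdflib to express a few facts about a fictitious person as (subject, predicate, object) triples. The ex: vocabulary and the person data are invented for the example; they are not WP4's actual schemas.

```python
# Minimal sketch of encoding database records as RDF triples with rdflib.
# The ex: vocabulary and the person data are invented for illustration.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, FOAF, XSD

EX = Namespace("http://example.org/vocab/")

g = Graph()
person = URIRef("http://example.org/person/12345")

# Every statement is one (subject, predicate, object) triple.
g.add((person, RDF.type, FOAF.Person))
g.add((person, FOAF.name, Literal("Jan Jansen")))
g.add((person, EX.birthYear, Literal(1872, datatype=XSD.integer)))
g.add((person, EX.occupation, Literal("dock worker")))

# Serialising as Turtle shows that the whole 'database' is just a set of triples.
print(g.serialize(format="turtle"))
```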
Once all data are encoded as triples, relations can be sought across different databases, possibly from different levels. This makes it possible to test hypotheses about correlations that could not be investigated before, and data mining the combined LD databases may reveal new correlations.
Searching and analysing Linked Data requires a dedicated query language. Such a language exists (SPARQL), and CLARIAH will experiment with it and with its suitability for formulating queries in the social-economic history domain.
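The following sketch shows what such a SPARQL query can look like, again with invented data and an invented ex: vocabulary, evaluated in memory with rdflib; real WP4 data would live in a dedicated triple store rather than in an in-memory graph.

```python
# Minimal sketch of a SPARQL query over a small in-memory RDF graph (rdflib).
# The data and the ex: vocabulary are illustrative only.
from rdflib import Graph

data = """
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ex:   <http://example.org/vocab/> .

<http://example.org/person/1> a foaf:Person ;
    foaf:name "Jan Jansen" ;
    ex:birthYear 1872 .

<http://example.org/person/2> a foaf:Person ;
    foaf:name "Aaltje de Vries" ;
    ex:birthYear 1901 .
"""

g = Graph()
g.parse(data=data, format="turtle")

# Select everyone born before 1900.
query = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX ex:   <http://example.org/vocab/>

SELECT ?name ?year WHERE {
    ?person a foaf:Person ;
            foaf:name ?name ;
            ex:birthYear ?year .
    FILTER(?year < 1900)
}
"""
for row in g.query(query):
    print(row.name, row.year)
```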
Since triples are very fine-grained, a huge number of them is needed to encode all the information. This, in turn, imposes special requirements on storage and on systems that enable efficient search in such large sets of triples. Research on these matters is also carried out in CLARIAH. The major data centre for WP4 is the International Institute of Social History.