[This post is based on Maartje Kruijt's Media Studies Bachelor thesis: "Supporting exploratory search with features, visualizations, and interface design: a theoretical framework".]

In today's network society there is a growing need to share, integrate and search the collections of various libraries, archives and museums. Tools need to be developed for the researchers who interpret these interconnected media collections. In the exploratory phase of research, the media researcher has no clear focus yet and is uncertain what to look for in an integrated collection. Data visualization technology can be used to support the search strategies and tactics involved in doing exploratory research.

The DIVE tool is an event-based linked media browser that allows researchers to explore interconnected events, media objects, people, places and concepts (see screenshot). Maartje Kruijt's research project investigated to what extent, and in what way, the construction of narratives can be made possible in DIVE, such that it contributes to the interpretation process of researchers. Such narratives can either be generated automatically on the basis of existing event-event relationships, or be constructed manually by researchers.
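A minimal sketch of the automatic variant, where a narrative is derived by following event-event links. The event names and the "follows" relation below are invented for illustration; they are not DIVE's actual data model or API:

```python
# Hypothetical event-event links: each event points to the event that
# follows it. In DIVE these relationships live in a linked data graph.
follows = {
    "treaty_signed": "war_ends",
    "war_ends": "reconstruction_starts",
}

def build_narrative(start, follows):
    """Walk the chain of event-event links, collecting events in order."""
    narrative = [start]
    while narrative[-1] in follows:
        narrative.append(follows[narrative[-1]])
    return narrative

print(build_narrative("treaty_signed", follows))
# ['treaty_signed', 'war_ends', 'reconstruction_starts']
```

A manually constructed narrative would simply be such an ordered list curated by the researcher rather than derived from the graph.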

The research proposes an extension of the DIVE tool in which selections made during the exploratory phase can be presented in narrative form. This allows researchers to publish narratives, but also to share them or reuse other people's narratives. The interactive presentation of a narrative is complementary to its presentation in a text, and it can serve as a starting point for further exploration by other researchers who use the DIVE browser.

Within DIVE and CLARIAH, we are currently extending the user interface based on the recommendations made in this thesis. You can read more about it in Maartje Kruijt's thesis (in Dutch). The user stories that describe the needs of media researchers are described in English and can be found in Appendix I.


Linked Data, RDF and Semantic Web are popular buzzwords in tech-land, but they may not be familiar to everyone within CLARIAH. On 12 September, CLARIAH therefore organized a workshop at the Vrije Universiteit Amsterdam to discuss the use of Linked Data as a technology for connecting data across the different CLARIAH work packages (WP3 linguistics, WP4 structured data and WP5 multimedia).


The goal of the workshop was twofold. First of all, to give an overview from the 'tech' side of these concepts and show how they are currently employed in the different work packages. At the same time we wanted to hear from Arts and Humanities researchers how these technologies would best suit their research and how CLARIAH can support them in familiarising themselves with Semantic Web tools and data.

The workshop

Monday afternoon, at 13:00 sharp, around 40 people showed up for the workshop at the Boelelaan in Amsterdam. The workshop included plenary presentations that laid the groundwork for discussions in smaller groups centred around the different types of data from the different WPs (raw collective notes can be found on this piratepad).


  • Rinke Hoekstra presented an introduction to Linked Data: what it is, how it compares to other technologies and what its potential is for CLARIAH. [Slides]
    In the discussion that followed, participants raised concerns about how well Linked Data can deal with data provenance and data quality.

  • After this, humanities researchers from each of the work packages discussed experiences, opportunities, and challenges around Linked Data. Our "Linked Data Champions" of the day were:

    • WP3: Piek Vossen (Vrije Universiteit Amsterdam) [Slides]

    • WP4: Richard Zijdeman (International Institute of Social History)

    • WP5: Kaspar Beelen and Liliana Melgar (University of Amsterdam) [Slides]


Marieke van Erp, Rinke Hoekstra and Victor de Boer then discussed how Linked Data is currently being produced in the different work packages and showed an example of how these datasets could be integrated (see image) [Slides]. If you want to try this out yourself, here are some example SPARQL queries to play with.
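The kind of cross-collection querying those SPARQL examples enable can be illustrated with a minimal pure-Python triple store. All terms and triples below are invented examples, not the actual CLARIAH datasets, and real queries would of course use RDF and SPARQL rather than tuples:

```python
# Linked data as subject-predicate-object triples; once records from the
# different work packages share identifiers, one pattern query can span
# all of them. The "person:", "media:" etc. prefixes are made up here.
triples = [
    ("person:vondel", "occupation", "hisco:playwright"),
    ("person:vondel", "mentionedIn", "media:broadcast42"),
    ("media:broadcast42", "hasSubject", "concept:poetry"),
]

def match(pattern, triples):
    """Return all triples matching an (s, p, o) pattern; None is a wildcard."""
    return [t for t in triples
            if all(q is None or q == v for q, v in zip(pattern, t))]

# Everything the combined datasets know about one person:
print(match(("person:vondel", None, None), triples))  # two triples
```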

Break out sessions

Finally, in the break-out sessions, the implications and challenges for the individual work packages were discussed further.

  • For WP3, the discussion focused on formats. Many natural language annotation formats are in use, some with a long history, and these formats are often closely tied to particular text analysis software. One reason it may not be useful for WP3 to convert all tools and data to RDF is that performance cannot be guaranteed, and for certain text analysis tasks it has already been shown to suffer when working in RDF. However, converting certain annotations, i.e. the end results of processing, to RDF could be useful here. We further talked about different types of use cases for WP3 that include LOD.

  • The WP4 break-out session consisted of about a dozen researchers, representing all work packages. The discussion focused on expectations of the tools and data that were demonstrated throughout the day. Various participants were interested in applying QBer, the tool that allows one to turn CSV files into Linked Data. The really exciting bit is that this interest was shared by people outside WP4, i.e. people usually working with text or audio-visual sources. This signals not just an interest in interdisciplinary research, but also an interest in research based on various data types. A second issue discussed was the need for vocabularies ((hierarchical) lists of standard terms). For various research fields such vocabularies do not yet exist. While some vocabularies can be derived relatively easily from existing standards that experts use, this will prove more difficult for a large range of variables. The final issue discussed was the quality of datasets. Should tools be able to handle 'messy' data? The audience agreed that data cleaning is the responsibility of the researcher, but that tools should be accompanied by guidelines on the expected format of the data file.

  • The WP5 break-out session addressed issues around data privacy and copyright, as well as how memory institutions and individual researchers can be persuaded to make their data available as LOD (see image).
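The core idea behind a CSV-to-Linked-Data conversion such as QBer's, mentioned in the WP4 session, can be sketched as follows. The column names, URI prefix, and output shape are invented for illustration; QBer's actual output is richer (it maps values to standard vocabularies such as HISCO):

```python
import csv
import io

# Toy input: each row describes one person. In QBer the user would also
# link columns and values to existing vocabularies.
raw = """id,occupation,birthplace
p1,farmer,Utrecht
p2,weaver,Leiden
"""

def csv_to_triples(text, base="ex:"):
    """Turn each non-id cell into a (row subject, column predicate, value) triple."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = base + row["id"]
        for column, value in row.items():
            if column != "id":
                triples.append((subject, base + column, value))
    return triples

for t in csv_to_triples(raw):
    print(t)
```

Serializing such triples as RDF is then a mechanical step, which is what makes tabular data interoperable with the other work packages' linked data.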



The day ended with some final considerations and some well-deserved drinks.


A summary and a reflection after the workshop at the Digital Humanities conference in Krakow (July 12-15, 2016)

By Liliana Melgar Estrada

The second version of the workshop "Audiovisual Data And Digital Scholarship: Towards Multimodal Literacy" (AVinDH workshop) took place during the Digital Humanities conference in Krakow, which finished on July 16.
Digital Humanities is the annual international conference of the Alliance of Digital Humanities Organizations (ADHO). In its 28th edition, the Jagiellonian University and the Pedagogical University warmly welcomed 902 people from all over the world.

The AVinDH workshop had a total of 55 participants, a keynote, 8 papers, and 10 lightning talks discussing the use of audio-visual media in the context of digital humanities scholarship.


The AVinDH workshop is a follow-up to the first edition held at the 2014 DH conference in Lausanne, which laid the basis for creating the Special Interest Group AVinDH at the next DH conference in Sydney in July 2015 (SIG-AVinDH). This group was initiated by researchers from the Erasmus Studio, based at the Erasmus University in Rotterdam, and from the Netherlands Institute for Sound and Vision. The aim of the interest group is to create "a venue for exchanging knowledge, expertise, methods and tools by scholars who make use of audiovisual data types that can convey a certain level of narrativity: spoken audio, video and/or (moving) images" (see website).

The workshop

The session opened with an introduction by Stef Scagliola, a historian specialized in opening up audiovisual archives for multidisciplinary research, with an emphasis on oral history collections, and one of the founders of the special interest group. Scagliola introduced the main questions motivating the creation of the SIG-AVinDH and the workshop. A central issue is how audio-visual (AV) sources differ from textual sources, and how the current ways of indexing and accessing AV materials, mainly via textual representations, have implications for research practices. Scagliola also summarized the scholarly process and presented the status of current information system support for each part of that process, highlighting the limited support for the "analysis" part.


The workshop continued with a keynote by Claire Clivaz, head of Digital Enhanced Learning at the Swiss Institute of Bioinformatics in Lausanne and a specialist in New Testament manuscripts and textual criticism. Drawing on her experience in text-based scholarship and her knowledge of current digital technologies, her presentation, entitled "Images, Sound, Writing in Western: a long hatred-love story?", discussed issues related to the validity and acceptance of AV sources in fields that are traditionally based on texts.

Based on several examples from biblical, literary, and art studies, Clivaz explained how scholarship, and our relationship to culture, is being transformed by "the emergence of a multimodal digital culture" in which text, images and sounds are intertwined. She also concluded that the well-known principles of persuasion in rhetoric - logos, pathos and ethos - will become more dominant due to the transition from textual to multimodal communication. She invited the audience to consider how they could apply multimodal approaches to scholarly publications.

Clivaz’ keynote was followed by three paper sessions:

  1. Models for training digital humanists in accessing and analyzing audiovisual collections
  2. Analysis and discovery models for audiovisual materials
  3. Copyright and Sustainability

1. First session

In the first session, chaired by Clara Henderson (Indiana University), two presentations described the use of AV materials and tools in training students. The presentation by Michaël Bourgatte (Catholic University of Paris), "When video annotation supports audiovisual education," described his experience as a teacher using the open source video annotation software developed with the IRI (a research and innovation lab based in the Centre Pompidou): Lignes de Temps ("Timelines" in French). Bourgatte used this tool in the classroom to introduce children in the Paris suburbs, high-school students, and master's students to the basics of film analysis and media literacy, enabling them to critically judge the films and media they watch. Next, an educational project with bachelor students in media studies was presented by Jasmijn van Gorp & Rosita Kieweik (Utrecht University).

In their presentation, "What's Not in the Archive: Teaching Television History in the 'Digital Humanities' Era", they explained different strategies to engage the students of the "Television History Online" course with archival materials. The goal was to let students build an understanding of the implications of using institutional collections and access tools, as well as online video platforms such as YouTube, by reflecting critically on their selection processes and on how canons are built. Students were challenged to make informed decisions and to play an active role in explaining them when their selections were influenced or impeded by access problems associated with copyright.

2. Second session

In the second paper session, chaired by Martijn Kleppe (National Library of the Netherlands), four papers described current projects attempting to facilitate access to AV collections by different means. The presentation by Taylor Arnold and Lauren Tilton (Yale University) showed the use of computational and statistical methods for studying a large photographic corpus, the FSA-OWI Photographic Archive, a collection of over 170,000 photographs taken by the United States Government between 1935 and 1945. Tilton presented a demo of "Photogrammar," a web-based platform for organizing, searching, and visualizing the large FSA-OWI photographic collection, as well as their current data experiments and tools.

Next, Indrek Ibrus' (Tallinn University) presentation, "Metadata as a 'cultural modeling system': A new rationale to study audiovisual heritage metadata systems", described a four-year research project that studies the evolution of AV heritage metadata in Estonia, and its uses and effects on cultural memory formation. This project takes a similarly critical approach to the archival practices and systems that shape audiovisual heritage as the experience described by van Gorp and Kieweik. The next two presentations focused on the processes and models of scholarly annotation of time-based media.

Melgar and Koolen, on behalf of the other authors, introduced "A conceptual model for the annotation of audiovisual heritage in a media studies context,” which is part of the current work in the context of CLARIAH-media studies in the creation of a user space, where scholars can access AV collections, and manually or semi-automatically annotate and enrich them. The presentation included both a conceptual model of the annotation phenomena (understood in a broader sense), and a process model of scholarly annotation in the framework of research stages in media studies.

To conclude the session, Professor Mark Williams (Dartmouth College) presented "The Media Ecology Project: Developing New Tools for Semantic Annotation of Moving Images", one of the most important ongoing endeavors in supporting scholarly work in film and media studies through collaboration between archives and the scholarly community, and between scholars, who can collaboratively perform close reading of their sources using the different platforms integrated in the Media Ecology Project (MEP). These platforms include Mediathread, a classroom platform developed at Columbia University; Scalar, a digital publishing platform developed at the University of Southern California; onomy.org, a new online tool developed for MEP that facilitates the creation of controlled vocabularies which can be assigned to online media files; and the Semantic Annotation Tool (SAT), a tool currently in development at MEP.

3. Third session

The third paper session, on copyright and sustainability, chaired by Johan Oomen, included a presentation by Simone Schroff (Institute for Information Law, University of Amsterdam), "Licensing audio-visual archives from a copyright perspective: between assumptions and empirical evidence", which described in detail the factors that archives have to take into account when they intend to open their archives for online research or educational use. The presenter clearly introduced the basics of the intrinsically complicated landscape of copyright and industry practices, and pointed to interesting, less difficult directions, based on her empirical study of the contractual copyright arrangements of several public service broadcasters in the Netherlands between 1951 and 2010.

Next, Inna Kizhner (Siberian Federal University Krasnoyarsk & University College London), on behalf of the other authors, presented "Licensing Images from Russian Museums for an Academic Project within Russian Legislation", an empirical study of the actual willingness of museums and academic projects in Russia to collaborate in online curated environments, showing the complications of dealing with legislation and museum policies in practice.

Lightning talks

The workshop included a lively session of "lightning talks", in which participants could briefly, and enthusiastically, present an idea or ongoing project to the audience. The pitch presentations included current projects that support annotation for scholarly and educational purposes in different domains: EVIA (for ethnographic research), Scalar (for digital publishing), and Memorekall (for web videos in education). Projects related to saving sounds (the British Library Save Our Sounds project), music (Restoring Early Musical Voices of India), YouTube videos (reconstructing abandoned personal YouTube collections), and performing arts in Japan (the Japanese Performing Arts Resource Center project) also had a 5-minute slot in the workshop.

There was also an enthusiastic invitation to use games with a purpose for annotating videos (already explored in previous projects), a current scholarly project to study "the expressive body" within the context of the Media Ecology Project, and a report of ongoing work within CLARIAH on visualizing missing data in collections.


The workshop concluded with a summary presentation by Stef Scagliola, who revisited the initial questions. Scagliola concluded that the disciplines most concerned with AV media and multimodality are growing, creating an increasing need for scholars to incorporate other skills and critical perspectives into the production of scholarly knowledge.

The second edition of the AVinDH workshop confirmed its importance and its good reception by the scholarly community. Future editions will also be an occasion for bridging the gap between current progress in content-based video retrieval (as described, for instance, in Huurnink et al., 2012) and scholarly practices, which need to rely on access to and annotation of AV (and time-based) media.

Likewise, this venue offers the opportunity to create links with other communities who are investigating how crowdsourcing and nichesourcing of time-based sources (as shown in the work by Gligorov et al., 2011; Oomen et al., 2014; Melgar Estrada et al., 2016) could be used to increase access to audiovisual archives. Simultaneously, other groups are developing tools for "close reading" of AV sources in scholarly domains (KWALON, organizer of the forthcoming conference on qualitative data analysis software), which seem quite isolated from the previous developments and could find a space here to be discussed.

One challenging task for the workshop and the interest group will be to strengthen the links with other venues where the disciplines that are, by definition, focused on the analysis of AV media (e.g., film/cinema/television studies or art history) are reflecting on the impact of the digital turn on their practices. The workshop presents an opportunity to discuss the issues common to these traditionally AV-oriented disciplines, as well as the methodological implications for disciplines that have not traditionally been attached to the audio-visual message. Sharing these perspectives can bring new insights to scholarly work in the context of multimodal research (and education), and help share best practices related to the challenges of analyzing and using audiovisual data in the context of digital humanities scholarship.

Workshop’s website
Collaborative minutes



Gligorov, R., Hildebrand, M., van Ossenbruggen, J., Schreiber, G., & Aroyo, L. (2011). On the role of user-generated metadata in audio visual collections (pp. 145–152). Presented at the K-CAP ’11, New York, NY, USA: ACM. http://doi.org/10.1145/1999676.1999702

Huurnink, B., Snoek, C. G. M., de Rijke, M., & Smeulders, A. W. M. (2012). Content-Based Analysis Improves Audiovisual Archive Retrieval. IEEE Transactions on Multimedia, 14(4), 1166–1178. http://doi.org/10.1109/TMM.2012.2193561

KWALON. Reflecting on the future of QDA Software: Chances and Challenges for Humanities, Social Sciences and beyond. http://www.kwalon.nl/kwalon-conference-2016

Melgar Estrada, L., Hildebrand, M., de Boer, V., & van Ossenbruggen, J. (2016). Time-based tags for fiction movies: comparing experts to novices using a video labeling game. Journal of the Association for Information Science and Technology. http://doi.org/10.1002/asi.23656

Oomen, J., Gligorov, R., & Hildebrand, M. (2014). Waisda?: making videos findable through crowdsourced annotations. In M. Ridge (Ed.), Crowdsourcing our Cultural Heritage (pp. 161–184). Ashgate Publishing, Ltd.



From 23 until 28 May, the biennial Language Resources and Evaluation Conference (LREC) took place in Portorož, Slovenia. LREC is a large conference in our field, covering all aspects of language technology. About 1200 people attended (all quite happy that the WiFi worked!) and nearly 750 papers were presented (4 parallel oral sessions and 5 poster sessions throughout the conference). So there was plenty for everyone, and naturally this post can only reflect the papers that caught my attention and what I think might be of interest to you.

First of all: CLARIAH and CLARIN ERIC were well represented:




Besides a fair amount of attention to sign language (sessions P15 and O30) and less-resourced languages (session P42), there was also attention for historical language use, such as POS-tagging for Historical Dutch by Dieuwke Hupkes and Rens Bod. What I found really nifty is that they use word alignments between contemporary Dutch (for which we have lots of language tools) and historical Dutch to assign the correct POS-tag. 
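The projection idea can be sketched as follows. The words, tags, and alignment below are toy examples, not the authors' actual method or data:

```python
# Toy sketch of POS projection via word alignment: tag the contemporary
# Dutch words with an existing tagger, then copy each tag to the aligned
# historical word. Here the tagger output is simply a hard-coded dict.
modern = ["ik", "zag", "hem"]
historical = ["ick", "sach", "hem"]
modern_tags = {"ik": "PRON", "zag": "VERB", "hem": "PRON"}

# alignment[i] = index of the modern word aligned to historical word i
alignment = [0, 1, 2]

historical_tags = {historical[i]: modern_tags[modern[j]]
                   for i, j in enumerate(alignment)}
print(historical_tags)
# {'ick': 'PRON', 'sach': 'VERB', 'hem': 'PRON'}
```

In practice the alignment itself has to be learned from parallel or comparable text, which is where most of the actual work lies.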

There was also a poster presentation by Maria Sukhareva and Christian Chiarcos on "Combining Ontologies and Neural Networks for Analyzing Historical Language Varieties. A Case Study in Middle Low German". Again, projections are used (I guess I never had to worry about that, working on contemporary text) and I like that it combines machine learning with background information from an ontology to improve performance.


There were lots of interesting resources and frameworks for publishing linguistic resources presented. One place where we can learn from (and tag onto the work of) our colleagues from the Semantic Web is the Linguistic Linked Open Data Cloud, where linguistic resources can be stored in a uniform format, which enables easier (though not yet entirely painless) reuse.

Corpus building is a time-consuming task, so I also really liked "The Royal Society Corpus: From Uncharted Data to Corpus" poster. Whilst the Royal Society dataset interests me anyway, they adopted an approach to building the corpus based on agile software development. Whilst this may not be suitable for every corpus building effort, it may be worthwhile to take notice of it and see where we can make our own approaches more flexible, to publish data faster and use feedback loops to improve it.

Then there were also several datasets covering non-English languages, such as the Royal Library's 1 Million Captioned Dutch Newspaper Images by Desmond Elliott and Martijn Kleppe; An Open Corpus for Named Entity Recognition in Historic Newspapers by Clemens Neudecker, containing Dutch, French and German newspaper text including historical spellings; and Publishing the Trove Newspaper Corpus by Steve Cassidy, on the corpus derived from the National Library of Australia's digital archive of newspaper text.

Here, I should also mention the second keynote, by Ryan McDonald from Google, on "The Language Resource Spectrum: A perspective from Google". In his talk he presented some experiments done at Google on different NLP tasks to figure out whether to put more effort (=money) into annotated data or into fancier language models. Whilst some of the results were not that surprising, I think it's an interesting question to ask, and one we don't always ask ourselves as researchers because we are used to using method X or Y (at least in my limited experience).


Unfortunately, the poster didn't make it to Slovenia, but the paper on Complementarity, F-score, and NLP Evaluation by Leon Derczynski raises some interesting issues about how we compare systems: when two systems reach the same F-score, for example, it doesn't mean they perform the same on all aspects of the problem.
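The point can be made concrete with two hypothetical systems that reach an identical F-score while making entirely disjoint errors (the entities and scores below are invented for illustration):

```python
def f1(predicted, gold):
    """F-score over sets of predicted vs. gold items (e.g. entity mentions)."""
    if not predicted:
        return 0.0
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = {"e1", "e2", "e3", "e4"}
system_a = {"e1", "e2"}  # finds only the first half of the entities
system_b = {"e3", "e4"}  # finds only the other half

# Identical F-scores, yet the two systems agree on nothing:
print(f1(system_a, gold), f1(system_b, gold), system_a & system_b)
```

A single aggregate score hides exactly this kind of complementarity, which is the paper's point about comparing (and combining) systems.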

<shameless plug>I also got to present our paper on Evaluating Entity Linking: An Analysis of Current Benchmark Datasets and a Roadmap for Doing a Better Job, in which we looked at the characteristics of different entity linking benchmark datasets and found that there is still a fair bit of work to do before we are testing all the different dimensions of the problem.</shameless plug>


Concluding remarks:

All in all, LREC was yet again a great, varied, three-day whirlwind of what's hot and happening in language technology in Europe (and a little bit beyond). After having gotten some sleep and caught up on the papers I didn't get to see, I'm looking forward to LREC 2018!

Marieke van Erp 


On Friday July 5, 2013, I visited the workshop Research Infrastructures towards 2020, organized by the EuroRisNet+ project, in Lisbon, Portugal. I also gave a presentation there on CLARIAH and, as requested by the organizers, on the organizational challenges it has experienced and still faces in the context of the National Roadmap for Large Scale Infrastructures.

Interest in this workshop was very high: so high that first a new venue had to be sought to accommodate as many participants as possible, and second, when this venue too was full, it was decided to live-stream the event over the internet (see here for the recording). And all of this while temperatures in Lisbon rose to close to 40 degrees Celsius and a beach would have been a much more attractive option than a workshop on research infrastructures in a hot and busy city!

The main reason for the large interest can probably be traced to the fact that, for the first time, a call was launched for the Portuguese National Roadmap for Large Scale Infrastructures. But that cannot be the only reason, because there were also many attendees from other countries. It is clear that research infrastructures are "hot", and I believe that many expect to obtain funding for their work via infrastructure funds. The EuroRisNet+ project made an inventory of research infrastructure projects which contains almost 300 entries, and the MERIL database contains over 300!

The launch of the Portuguese National Roadmap was a bit of a disappointment, since the procedure was not very clearly defined, there were no clearly defined criteria for evaluation, and no concrete budget figures were mentioned (these are expected in about three weeks). Portugal will use European FP7 structural funds for this, which implies that the procedure must be finished by the end of this year. The Portuguese CLARIN people (e.g. Antonio Branco) are ready to submit a proposal, and I met some others who will submit a proposal related to DARIAH, so let's wish them success with their applications!

From the perspective of the Netherlands, two presentations given there are of special importance. The presentation by Philippe Froissard (Deputy Head of Unit, Research Infrastructures, European Commission) sketched the plans for research infrastructures in Horizon 2020, including concrete budget figures. And second, the presentation by Cas Maessen (NWO) sketched the history of the Netherlands National Roadmap, as well as considerations and ongoing discussions about the future of this roadmap. These, and the other presentations from this event, are online and can be found here.


In my presentation, one of the challenges I mentioned had to do with IPR: how can we get easy and legal access to contemporary textual and audio-visual resources that are copyright-protected? Of course, I did not have a solution for this, nor did I expect one from the audience. However, I was pleasantly surprised to find a message in my mailbox early in the morning with a link to a speech by Neelie Kroes held at LT-Innovate one week earlier, in which, talking about text and data mining, she states that she is "determined to reform the copyright system to capture the opportunities of the digital age, if necessary including legislative reform". This is not a solution yet, but at least the problem is being addressed at the highest levels of the European Commission!

Temperatures rose even higher in the weekend after the workshop, so the only rational thing to do was to spend these days on the beach and in the (actually quite cold) Atlantic Ocean.


Jan Odijk